Textacy

Latest version: v0.13.0

Page 4 of 6

0.3.4

New and Changed:

- Improved and expanded calculation of basic counts and readability statistics
in `text_stats` module.
  - Added a `TextStats()` class for more convenient, granular access to
    individual values. See usage docs for more info. When calculating, say, just
    one readability statistic, performance with this class should be slightly better;
    if calculating _all_ statistics, performance is worse owing to unavoidable,
    added overhead in Python for variable lookups. The legacy function
    `text_stats.readability_stats()` still exists and behaves as before, but a
    deprecation warning is displayed.
  - Added functions for calculating Wiener Sachtextformel (PR 77), LIX, and GULPease
    readability statistics.
  - Added number of long words and number of monosyllabic words to basic counts.
- Clarified the need for having spacy models installed for most use cases of textacy,
in addition to just the spacy package.
  - README updated with comments on this, including links to more extensive spacy
    documentation. (Issues 66 and 68)
  - Added a function, `compat.get_config()`, that includes information about which
    (if any) spacy models are installed.
  - Recent changes to spacy, including a warning message, will also make model
    problems more apparent.
- Added an `ngrams` parameter to `keyterms.sgrank()`, allowing for more flexibility
in specifying valid keyterm candidates for the algorithm. (PR 75)
- Dropped dependency on `fuzzywuzzy` package, replacing usage of
`fuzz.token_sort_ratio()` with a textacy equivalent in order to avoid license
incompatibilities. As a bonus, the new code seems to perform faster! (Issue 62)
  - Note: Outputs are now floats in [0.0, 1.0], consistent with other similarity
    functions, whereas before outputs were ints in [0, 100]. This has implications
    for `match_threshold` values passed to `similarity.jaccard()`; a warning
    is displayed and the conversion is performed automatically, for now.
- A MANIFEST.in file was added to include docs, tests, and distribution files in the source distribution. This is just good practice. (PR 65)
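
The change of scale noted above for the `fuzz.token_sort_ratio()` replacement can be illustrated with a small standard-library sketch. This is not textacy's implementation: `token_sort_similarity()` and `rescale_threshold()` are hypothetical names, and `difflib.SequenceMatcher` is swapped in as the underlying string comparison purely for illustration.

```python
import warnings
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Order-insensitive string similarity in [0.0, 1.0] (illustrative only)."""
    # Lowercase, split on whitespace, and sort so that word order is ignored.
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, a_sorted, b_sorted).ratio()


def rescale_threshold(threshold: float) -> float:
    """Map a legacy 0-100 threshold onto the 0.0-1.0 scale (hypothetical helper)."""
    if threshold > 1.0:
        warnings.warn(
            "thresholds are now expected in [0.0, 1.0]; dividing by 100",
            UserWarning,
        )
        return threshold / 100.0
    return float(threshold)
```

Code that previously compared scores against a value such as `80` would, under this sketch, compare against `rescale_threshold(80)` instead.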

Fixed:

- Known acronym-definition pairs are now properly handled in
`extract.acronyms_and_definitions()` (Issue 61)
- WikiReader no longer crashes on null page element content while parsing (PR 64)
- Fixed a rare but perfectly legal edge case exception in `keyterms.sgrank()`,
and added a window width sanity check. (Issue 72)
- Fixed assignment of 2-letter language codes to `Doc` and `Corpus` objects
when the lang parameter is specified as a full spacy model name.
- Replaced several leftover print statements with proper logging functions.

Contributors:

Big thanks to oroszgy, rolando, covuworie, and RolandColored for the pull requests!

0.3.3

New and Changed:

- Added a consistent `normalize` param to functions and methods that require
token/span text normalization. Typically, it takes one of the following values:
'lemma' to lemmatize tokens, 'lower' to lowercase tokens, False-y to *not* normalize
tokens, or a function that converts a spacy token or span into a string, in
whatever way the user prefers (e.g. `spacy_utils.normalized_str()`).
  - Functions modified to use this param:
    `Doc.to_bag_of_terms()`,
    `Doc.to_bag_of_words()`,
    `Doc.to_terms_list()`,
    `Doc.to_semantic_network()`,
    `Corpus.word_freqs()`,
    `Corpus.word_doc_freqs()`,
    `keyterms.sgrank()`,
    `keyterms.textrank()`,
    `keyterms.singlerank()`,
    `keyterms.key_terms_from_semantic_network()`,
    `network.terms_to_semantic_network()`,
    `network.sents_to_semantic_network()`
- Tweaked `keyterms.sgrank()` for higher quality results and improved internal
performance.
- When getting both n-grams and named entities with `Doc.to_terms_list()`, filtering
out numeric spans for only one is automatically extended to the other. This prevents
unexpected behavior, such as passing `filter_nums=True` but getting numeric named
entities back in the terms list.
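
The four kinds of value accepted by the `normalize` param described above ('lemma', 'lower', a False-y value, or a callable) can be sketched as a small dispatcher. This is only an assumption about how such a parameter might be handled, not textacy's code: `normalize_token()` and `FakeToken` are hypothetical, and `FakeToken` stands in for a spaCy token by exposing only the `text`, `lemma_`, and `lower_` attributes.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class FakeToken:
    """Stand-in for a spaCy token; carries only the attributes used below."""
    text: str
    lemma_: str
    lower_: str


def normalize_token(
    token, normalize: Union[str, bool, None, Callable] = "lemma"
) -> str:
    """Return the string form of ``token`` according to ``normalize``."""
    if not normalize:
        # False-y value: leave the token text untouched.
        return token.text
    if callable(normalize):
        # User-supplied function that converts a token into a string.
        return normalize(token)
    if normalize == "lemma":
        return token.lemma_
    if normalize == "lower":
        return token.lower_
    raise ValueError(f"unsupported normalize value: {normalize!r}")
```

Checking for a False-y value first keeps `None`, `False`, and `''` equivalent, which matches the "False-y to *not* normalize" wording.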

Fixed:

- `keyterms.sgrank()` no longer crashes if a term is missing from `idfs` mapping.
(jeremybmerrill, issue 53)
- Proper nouns are no longer excluded from consideration as keyterms in `keyterms.sgrank()`
and `keyterms.textrank()`. (jeremybmerrill, issue 53)
- Empty strings are now excluded from consideration as keyterms — a bug inherited
from spaCy. (mlehl88, issue 58)

0.3.2

New and Changed:

- Preliminary inclusion of custom spaCy pipelines
  - updated `load_spacy()` to include explicit `path` and `create_pipeline` kwargs,
    and removed the already-deprecated `load_spacy_pipeline()` function to avoid
    confusion around spaCy languages and pipelines
  - added `spacy_pipelines` module to hold implementations of custom spaCy pipelines,
    including a basic one that merges entities into single tokens
  - note: necessarily bumped minimum spaCy version to 1.1.0+
  - see the announcement here: https://explosion.ai/blog/spacy-deep-learning-keras
- To reduce code bloat, made the `matplotlib` dependency optional and dropped
the `gensim` dependency
  - to install `matplotlib` at the same time as textacy, do `$ pip install textacy[viz]`
  - bonus: `backports.csv` is now only installed for Py2 users
  - thanks to mbatchkarov for the request
- Improved performance of `textacy.corpora.WikiReader().texts()`; results should
stream faster and have cleaner plaintext content than when they were produced
by `gensim`. This *should* also fix a bug reported in Issue 51 by baisk
- Added a `Corpus.vectors` property that returns a matrix of shape
(number of documents, vector dim) containing the average word2vec-style vector
representation of constituent tokens for all `Doc`s
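
The shape of the matrix described for `Corpus.vectors` above can be illustrated without any dependencies. The sketch below is not textacy's implementation: `mean_vector()` and `corpus_vectors()` are hypothetical helpers, each document is represented simply as a list of per-token vectors, and returning a zero vector for an empty document is an assumption made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(token_vectors: Sequence[Vector], dim: int) -> List[float]:
    """Average the token vectors of one document, component by component."""
    if not token_vectors:
        # Assumption: an empty document maps to the zero vector.
        return [0.0] * dim
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def corpus_vectors(docs: Sequence[Sequence[Vector]], dim: int) -> List[List[float]]:
    """Stack one averaged vector per document: (number of documents, dim)."""
    return [mean_vector(doc, dim) for doc in docs]
```

Each row corresponds to one document, so the row count always equals the number of documents regardless of how many tokens each contains.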

0.3.1

Changed:

- Updated spaCy dependency to the latest v1.0.1; set a floor on other dependencies'
versions to make sure everyone's running reasonably up-to-date code

Fixed:

- Fixed incorrect kwarg in `sgrank`'s call to `extract.ngrams()` (patcollis34, issue 44)
- Fixed import for `cachetools`' `hashkey`, which changed in v2.0 (gramonov, issue 45)

0.3.0

New and Changed:

- Refactored and streamlined `TextDoc`; changed name to `Doc`
  - simplified init params: `lang` can now be a language code string or an equivalent
    `spacy.Language` object, and `content` is either a string or `spacy.Doc`;
    param values and their interactions are better checked for errors and inconsistencies
  - renamed and improved methods transforming the Doc; for example, `.as_bag_of_terms()`
    is now `.to_bag_of_terms()`, and terms can be returned as integer ids (default)
    or as strings with absolute, relative, or binary frequencies as weights
  - added performant `.to_bag_of_words()` method, at the cost of less customizability
    of what gets included in the bag (no stopwords or punctuation); words can be
    returned as integer ids (default) or as strings with absolute, relative, or
    binary frequencies as weights
  - removed methods wrapping `extract` functions, in favor of simply calling that
    function on the Doc (see below for updates to `extract` functions to make
    this more convenient); for example, `TextDoc.words()` is now `extract.words(Doc)`
  - removed `.term_counts()` method, which was redundant with `Doc.to_bag_of_terms()`
  - renamed `.term_count()` => `.count()`, and checking + caching results is now
    smarter and faster
- Refactored and streamlined `TextCorpus`; changed name to `Corpus`
  - added init params: can now initialize a `Corpus` with a stream of texts,
    spacy or textacy Docs, and optional metadatas, analogous to `Doc`; accordingly,
    removed `.from_texts()` class method
  - refactored, streamlined, *bug-fixed*, and made consistent the process of
    adding, getting, and removing documents from `Corpus`
    - getting/removing by index is now equivalent to the built-in `list` API:
      `Corpus[:5]` gets the first 5 `Doc`s, and `del Corpus[:5]` removes the
      first 5, automatically keeping track of corpus statistics for total
      number of docs, sents, and tokens
    - getting/removing by boolean function is now done via the `.get()` and `.remove()`
      methods, the latter of which now also correctly tracks corpus stats
    - adding documents is split across the `.add_text()`, `.add_texts()`, and
      `.add_doc()` methods for performance and clarity reasons
  - added `.word_freqs()` and `.word_doc_freqs()` methods for getting a mapping
    of word (int id or string) to global weight (absolute, relative, binary, or
    inverse frequency); akin to a vectorized representation (see: `textacy.vsm`)
    but in non-vectorized form, which can be useful
  - removed `.as_doc_term_matrix()` method, which was just wrapping another function;
    so, instead of `corpus.as_doc_term_matrix((doc.as_terms_list() for doc in corpus))`,
    do `textacy.vsm.doc_term_matrix((doc.to_terms_list(as_strings=True) for doc in corpus))`
- Updated several `extract` functions
  - almost all now accept either a `textacy.Doc` or `spacy.Doc` as input
  - renamed and improved parameters for filtering for or against certain POS or NE
    types; for example, `good_pos_tags` is now `include_pos`, and will accept
    either a single POS tag as a string or a set of POS tags to filter for; same
    goes for `exclude_pos`, and analogously `include_types`, and `exclude_types`
- Updated corpora classes for consistency and added flexibility
  - enforced a consistent API: `.texts()` for a stream of plain text documents
    and `.records()` for a stream of dicts containing both text and metadata
  - added filtering options for `RedditReader`, e.g. by date or subreddit,
    consistent with other corpora (similar tweaks to `WikiReader` may come later,
    but it's slightly more complicated...)
  - added a nicer `repr` for `RedditReader` and `WikiReader` corpora, consistent
    with other corpora
- Moved `vsm.py` and `network.py` into the top-level of `textacy` and thus
removed the `representations` subpackage
  - renamed `vsm.build_doc_term_matrix()` => `vsm.doc_term_matrix()`, because
    the "build" part of it is obvious
- Renamed `distance.py` => `similarity.py`; all returned values are now similarity
metrics in the interval [0, 1], where higher values indicate higher similarity
- Renamed `regexes_etc.py` => `constants.py`, without additional changes
- Renamed `fileio.utils.split_content_and_metadata()` => `fileio.utils.split_record_fields()`,
without further changes (except for tweaks to the docstring)
- Added functions to read and write delimited file formats: `fileio.read_csv()`
and `fileio.write_csv()`, where the delimiter can be any valid one-char string;
gzip/bzip/lzma compression is handled automatically when available
- Added better and more consistent docstrings and usage examples throughout
the code base
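
The list-like behaviour described for `Corpus` in this release (`Corpus[:5]`, `del Corpus[:5]`, `.get()`, `.remove()`, with statistics kept in sync) can be sketched with a toy container. `MiniCorpus` is hypothetical and is not textacy's class: documents here are plain lists of token strings, and only `n_docs` and `n_tokens` are tracked.

```python
class MiniCorpus:
    """Toy container whose counters stay correct as documents come and go."""

    def __init__(self, docs=()):
        self.docs = []
        self.n_docs = 0
        self.n_tokens = 0
        for doc in docs:
            self.add_doc(doc)

    def add_doc(self, doc):
        self.docs.append(doc)
        self.n_docs += 1
        self.n_tokens += len(doc)

    def __len__(self):
        return self.n_docs

    def __getitem__(self, idx):
        # Works for both integers and slices, like a built-in list.
        return self.docs[idx]

    def __delitem__(self, idx):
        # Collect what is about to be removed so the counters can be decremented.
        removed = self.docs[idx] if isinstance(idx, slice) else [self.docs[idx]]
        del self.docs[idx]
        self.n_docs -= len(removed)
        self.n_tokens -= sum(len(doc) for doc in removed)

    def get(self, match_func):
        """Yield documents for which ``match_func(doc)`` is true."""
        return (doc for doc in self.docs if match_func(doc))

    def remove(self, match_func):
        """Delete matching documents, reusing ``__delitem__`` to update stats."""
        for i in reversed(range(len(self.docs))):
            if match_func(self.docs[i]):
                del self[i]
```

Routing `.remove()` through `__delitem__` means there is a single place where the counters are decremented, which is one way to avoid the kind of stats drift the release notes mention fixing.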

0.2.8

New:

- Added two new corpora!
  - the CapitolWords corpus: a collection of 11k speeches (~7M tokens) given by
    the main protagonists of the 2016 U.S. Presidential election who had
    previously served in the U.S. Congress (including Hillary Clinton, Bernie Sanders,
    Barack Obama, Ted Cruz, and John Kasich) from January 1996 through June 2016
  - the SupremeCourt corpus: a collection of 8.4k court cases (~71M tokens)
    decided by the U.S. Supreme Court from 1946 through 2016, with metadata on
    subject matter categories, ideology, and voting patterns
  - **DEPRECATED:** the Bernie and Hillary corpus, which is a small subset of
    CapitolWords that can be easily recreated by filtering CapitolWords by
    `speaker_name={'Bernie Sanders', 'Hillary Clinton'}`

Changed:

- Refactored and improved `fileio` subpackage
  - moved shared (read/write) functions into separate `fileio.utils` module
  - almost all read/write functions now use `fileio.utils.open_sesame()`,
    enabling seamless fileio for uncompressed or gzip, bz2, and lzma compressed
    files; relative/user-home-based paths; and missing intermediate directories.
    NOTE: certain file mode / compression pairs simply don't work (this is Python's
    fault), so users may run into exceptions; in Python 3, you'll almost always
    want to use text mode ('wt' or 'rt'), but in Python 2, users can't read or
    write compressed files in text mode, only binary mode ('wb' or 'rb')
  - added options for writing json files (matching stdlib's `json.dump()`) that
    can help save space
  - `fileio.utils.get_filenames()` now matches for/against a regex pattern rather
    than just a contained substring; using the old params will now raise a
    deprecation warning
  - **BREAKING:** `fileio.utils.split_content_and_metadata()` now has `itemwise=False`
    by default, rather than `itemwise=True`, which means that splitting
    multi-document streams of content and metadata into parallel iterators is
    now the default action
  - added `compression` param to `TextCorpus.save()` and `.load()` to optionally
    write metadata json file in compressed form
  - moved `fileio.write_conll()` functionality to `export.doc_to_conll()`, which
    converts a spaCy doc into a CoNLL-U formatted string; writing that string to
    disk would require a separate call to `fileio.write_file()`
- Cleaned up deprecated/bad Py2/3 `compat` imports, and added better functionality
for Py2/3 strings
  - `compat.unicode_type` is now used for text data, `compat.bytes_type` for binary
    data, and `compat.string_types` for when either will do
  - also added `compat.unicode_to_bytes()` and `compat.bytes_to_unicode()` functions,
    for converting between string types
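
The behaviour attributed to `fileio.utils.open_sesame()` above (compression handled transparently, user-home paths expanded, missing intermediate directories created) can be approximated with the standard library. The sketch below is Python 3 only and is not textacy's implementation: `open_any()` is a hypothetical name, and picking the opener from the file extension is an assumption made for the example.

```python
import bz2
import gzip
import lzma
import os

# Assumption: the compression format is inferred from the file extension.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_any(path, mode="rt", encoding="utf-8"):
    """Open a plain or compressed file, creating parent directories on write."""
    path = os.path.abspath(os.path.expanduser(path))
    if any(flag in mode for flag in "wax"):
        # Create missing intermediate directories before writing.
        os.makedirs(os.path.dirname(path), exist_ok=True)
    opener = _OPENERS.get(os.path.splitext(path)[1], open)
    if "b" in mode:
        return opener(path, mode)
    if "t" not in mode:
        # The compressed openers default to binary, so ask for text explicitly.
        mode += "t"
    return opener(path, mode, encoding=encoding)
```

Under these assumptions, `open_any("~/data/notes.txt.gz", "wt")` would create `~/data` if needed and write gzip-compressed text, while the same call with a `.txt` path would fall back to the built-in `open()`.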

Fixed:

- Fixed document(s) removal from `TextCorpus` objects, including correct decrementing
of `.n_docs`, `.n_sents`, and `.n_tokens` attributes (michelleful 29)
- Fixed OSError being incorrectly raised in `fileio.open_sesame()` on missing files
- `lang` parameter in `TextDoc` and `TextCorpus` can now be unicode *or* bytes;
the earlier behavior was bug-like
