Textacy



0.6.1

New:

- **Add a new spacier sub-package for spaCy-oriented functionality** (168, 187)
- Thus far, this includes a `components` module with two custom spaCy
pipeline components: one to compute text stats on parsed documents, and
another to merge named entities into single tokens in an efficient manner
(see the sketch below this list). More to come!
- Similar functionality in the top-level `spacy_pipelines` module has been
deprecated; it will be removed in v0.7.0.
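
A minimal sketch of plugging one of the new components into a spaCy 2.x pipeline; the `TextStatsComponent` name and its constructor are assumptions here, so check `spacier.components` for what actually ships:

```python
import spacy
from textacy.spacier import components

nlp = spacy.load("en_core_web_sm")
# hypothetical component name/args -- see spacier.components for the real ones
nlp.add_pipe(components.TextStatsComponent(), last=True)

doc = nlp("Custom components now slot straight into a spaCy pipeline.")
```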

Changed:

- Update the readme, usage, and API reference docs to be clearer and (I hope)
more useful. (186)
- Removing punctuation from a text via the `preprocessing` module now replaces
punctuation marks with a single space rather than an empty string. This gives
better behavior in many situations; for example, "won't" => "won t" rather than
"wont", the latter of which is a valid word with a different meaning. (See the
snippet after this list.)
- Categories are now correctly extracted from non-English language Wikipedia
datasets, starting with French and German and extendable to others. (175)
- Log progress when adding documents to a corpus. At the debug level, every
doc's addition is logged; at the info level, only one message per batch
of documents is logged. (183)
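
A tiny illustration of the punctuation change; the module path and `remove_punct()` name are assumptions (they have shifted between releases):

```python
from textacy.preprocess import remove_punct

# punctuation now becomes a single space instead of vanishing outright
print(remove_punct("won't"))   # -> "won t"  (previously "wont")
```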

Fixed:

- Fix two breaking typos in `extract.direct_quotations()`. (issue 177)
- Prevent crashes when adding non-parsed documents to a `Corpus`. (180)
- Fix bugs in `keyterms.most_discriminating_terms()` that used `vsm`
functionality as it was *before* the changes in v0.6.0. (189)
- Fix a breaking typo in `vsm.matrix_utils.apply_idf_weighting()`, and rename
the problematic kwarg for consistency with related functions. (190)

Contributors:

Big thanks to sammous, dixiekong (nice name!), and SandyRogers for the pull
requests, and many more for pointing out various bugs and the rougher edges /
unsupported use cases of this package.

0.6.0

Changed:

- **Rename, refactor, and extend I/O functionality** (PR 151)
- Related read/write functions were moved from `read.py` and `write.py` into
format-specific modules, and similar functions were consolidated into one
with the addition of an arg. For example, `write.write_json()` and
`write.write_json_lines()` => `json.write_json(lines=True|False)`.
- Useful functionality was added to a few readers/writers (see the sketch after
this bullet). For example, `write_json()` now automatically handles python
dates/datetimes, writing them to disk as ISO-formatted strings rather than
raising a TypeError ("datetime is not JSON serializable", ugh). CSVs can now be
written to / read from disk when each row is a dict rather than a list.
Reading/writing HTTP streams now allows for basic authentication.
- Several things were renamed to improve clarity and consistency from a user's
perspective, most notably the subpackage name: `fileio` => `io`. Others:
`read_file()` and `write_file()` => `read_text()` and `write_text()`;
`split_record_fields()` => `split_records()`, although I kept an alias
to the old function for folks; `auto_make_dirs` boolean kwarg => `make_dirs`.
- `io.open_sesame()` now handles zip files (provided they contain only 1 file)
as it already does for gzip, bz2, and lzma files. On a related note, Python 2
users can now open lzma (`.xz`) files if they've installed `backports.lzma`.
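
A short sketch of the consolidated JSON writer; the `lines` kwarg and the automatic date handling come straight from the notes above, while the argument order is an assumption:

```python
import datetime
from textacy.io import json as tio_json

records = [
    {"title": "A post", "published": datetime.date(2018, 1, 15)},
    {"title": "Another post", "published": datetime.date(2018, 2, 3)},
]
# lines=True writes one JSON object per line (the old write_json_lines() behavior);
# dates/datetimes are serialized as ISO-formatted strings instead of raising TypeError
tio_json.write_json(records, "posts.jsonl", lines=True)
```
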
- **Improve, refactor, and extend vector space model functionality** (PRs 156 and 167)
- BM25 term weighting and document-length normalization were implemented, and
users can now flexibly add and customize individual components of an
overall weighting scheme (local scaling + global scaling + doc-wise normalization).
For API sanity, several additions and changes to the `Vectorizer` init
params were required --- sorry bout it!
- Given all the new weighting possibilities, a `Vectorizer.weighting` attribute
was added for curious users, to give a mathematical representation of how
values in a doc-term matrix are being calculated. Here's a simple and a
not-so-simple case:

```python
>>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
'tf * log((n_docs + 1) / (df + 1)) + 1'
>>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='smooth', apply_dl=True).weighting
'(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths))) * log((n_docs - df + 0.5) / (df + 0.5))'
```


- Terms are now sorted alphabetically after fitting, so you'll have a consistent
and interpretable ordering in your vocabulary and doc-term-matrix.
- A `GroupVectorizer` class was added, as a child of `Vectorizer` and
an extension of typical document-term matrix vectorization, in which each
row vector corresponds to the weighted terms co-occurring in a single document.
This allows for customized grouping, such as by a shared author or publication year,
that may span multiple documents, without forcing users to merge/concatenate
those documents themselves. (A rough sketch follows this bullet.)
- Lastly, the `vsm.py` module was refactored into a `vsm` subpackage with
two modules. Imports should stay the same, but the code structure is now
more amenable to future additions.
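
A rough sketch of grouping documents by author with the new class; how the grouping labels are passed (and the init params shown) are assumptions borrowed from the `Vectorizer` examples above:

```python
from textacy.vsm import GroupVectorizer

tokenized_docs = [
    ["data", "pipelines", "everywhere"],
    ["reproducible", "data", "science"],
    ["gardening", "in", "small", "spaces"],
]
# one grouping label per document; docs sharing a label are aggregated into one row
authors = ["alice", "alice", "bob"]

vectorizer = GroupVectorizer(apply_idf=True, idf_type="smooth")
grp_term_matrix = vectorizer.fit_transform(tokenized_docs, authors)
print(grp_term_matrix.shape)   # (n_groups, n_terms) -- here, 2 groups
```
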
- **Miscellaneous additions and improvements**
- Flesch Reading Ease in the `textstats` module is now multi-lingual! Language-specific
formulations for German, Spanish, French, Italian, Dutch, and Russian
were added, in addition to (the default) English. (PR 158, prompted by Issue 155)
- Runtime performance, as well as docs and error messages, of functions for
generating semantic networks from lists of terms or sentences were improved. (PR 163)
- Labels on named entities from which determiners have been dropped are now
preserved. There's still a minor gotcha, but it's explained in the docs.
- The size of `textacy`'s data cache can now be set via an environment
variable, `TEXTACY_MAX_CACHE_SIZE`, in case the default 2GB cache doesn't
meet your needs. (A small example follows this list.)
- Docstrings were improved in many ways, large and small, throughout the code.
May they guide you even more effectively than before!
- The package version is now set from a single source. This isn't for you so
much as me, but it does prevent confusing version mismatches between the code,
pypi, and docs.
- All tests have been converted from `unittest` to `pytest` style. They
run faster, they're more informative in failure, and they're easier to extend.
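
For example, to raise the cache cap before textacy loads anything (whether the value is interpreted as bytes is an assumption; check the cache docs):

```python
import os

# set the cap before importing textacy / loading any data
os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(4 * 1024 ** 3)   # ~4GB instead of the 2GB default

import textacy
```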

Fixed:

- Fixed an issue where existing metadata associated with a spacy Doc was being
overwritten with an empty dict when using it to initialize a textacy Doc.
Users can still overwrite existing metadata, but only if they pass in new data.
- Added a missing import to the README's usage example. (149)
- The intersphinx mapping to `numpy` got fixed (and items for `scipy` and
`matplotlib` were added, too). Taking advantage of that, a bunch of broken
object links scattered throughout the docs got fixed.
- Fixed broken formatting of old entries in the changelog, for your reading pleasure.

0.5.0

Changed:

- **Bumped version requirement for spaCy from < 2.0 to >= 2.0** --- textacy no longer
works with spaCy 1.x! It's worth the upgrade, though. v2.0's new features and
API enabled (or required) a few changes on textacy's end:
- `textacy.load_spacy()` takes the same inputs as the new `spacy.load()`,
i.e. a package `name` string and an optional list of pipes to `disable`
- textacy's `Doc` metadata and language string are now stored in `user_data`
directly on the spaCy `Doc` object; although the API from a user's perspective
is unchanged, this made the next change possible
- `Doc` and `Corpus` classes are now de/serialized via pickle into a single
file --- no more side-car JSON files for metadata! Accordingly, the `.save()`
and `.load()` methods on both classes have a simpler API: they take
a single string specifying the file on disk where data is stored. (See the
sketch below.)
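
A small sketch of the updated loading and single-file serialization; the model name, the `lang` value, and whether `load()` is a class method are assumptions:

```python
import textacy

# load_spacy() mirrors spacy.load(): a package name plus an optional list of pipes to disable
nlp = textacy.load_spacy("en_core_web_sm", disable=("parser",))

doc = textacy.Doc("The year was 2017.", metadata={"source": "example"}, lang="en")
doc.save("doc.pkl")               # one pickle file holds the doc, its metadata, and its language
doc = textacy.Doc.load("doc.pkl")
```
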
- **Cleaned up docs, imports, and tests throughout the entire code base.**
- docstrings and https://textacy.readthedocs.io 's API reference are easier to
read, with better cross-referencing and far fewer broken web links
- namespaces are less cluttered, and textacy's source code is easier to follow
- `import textacy` takes less than half the time from before
- the full test suite also runs about twice as fast, and most tests are now
more robust to changes in the performance of spaCy's models
- consistent adherence to conventions eases users' cognitive load :)
- **The module responsible for caching loaded data in memory was cleaned up and
improved**, as well as renamed: from `data.py` to `cache.py`, which is more
descriptive of its purpose. Otherwise, you shouldn't notice much of a difference
besides *things working correctly*.
- All loaded data (e.g. spacy language pipelines) is now cached together in a
single LRU cache whose max size is set to 2GB, and the size of each element
in the cache is now accurately computed. (tl;dr: `sys.getsizeof` does not
work on non-built-in objects like, say, a `spacy.tokens.Doc`.)
- Loading and downloading of the DepecheMood resource is now less hacky and
weird, and much closer to how users already deal with textacy's various
`Dataset` s. In fact, it can be downloaded in exactly the same way as the
datasets via textacy's new CLI: `$ python -m textacy download depechemood`.
P.S. A brief guide for using the CLI got added to the README.
- **Several function/method arguments marked for deprecation have been removed.**
If you've been ignoring the warnings that print out when you use `lemmatize=True`
instead of `normalize='lemma'` (etc.), now is the time to update your calls!
- Of particular note: The `readability_stats()` function has been removed;
use `TextStats(doc).readability_stats` instead. (See the example below.)
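
For instance (the surrounding `to_terms_list()` arguments are only for illustration):

```python
import textacy

doc = textacy.Doc("The quick brown foxes jumped over the lazy dogs.", lang="en")

# old, now removed:  doc.to_terms_list(ngrams=1, lemmatize=True)
terms = list(doc.to_terms_list(ngrams=1, normalize="lemma"))

# readability_stats() the function is gone; use the class-based equivalent
stats = textacy.TextStats(doc).readability_stats
```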

Fixed:

- In certain situations, the text of a spaCy span was being returned without
whitespace between tokens; that has been avoided in textacy, and the source bug
in spaCy got fixed (by yours truly! https://github.com/explosion/spaCy/pull/1621).
- When adding already-parsed `Doc`s to a `Corpus`, including `metadata`
now correctly overwrites any existing metadata on those docs.
- Fixed a couple related issues involving the assignment of a 2-letter language
string to the `.lang` attribute of `Doc` and `Corpus` objects.
- textacy's CLI wasn't correctly handling certain dataset kwargs in all cases;
now, all kwargs get to their intended destinations.

0.4.2

New:

- Added a CLI for downloading `textacy`-related data, inspired by the `spaCy`
equivalent. It's *temporarily* undocumented, but to see available commands and
options, just pass the usual flag: `$ python -m textacy --help`. Expect more
functionality (and docs!) to be added soonish. (144)
- Note: The existing `Dataset.download()` methods work as before, and in fact,
they are being called under the hood from the command line.

Changed:

- Made usage of `networkx` v2.0-compatible, and therefore dropped the <2.0
version requirement on that dependency. Upgrade as you please! (131)
- Improved the regex for identifying phone numbers so that it's easier to view
and interpret its matches. (128)

Fixed:

- Fixed caching of counts on `textacy.Doc` so that counts are instance-specific,
rather than shared by all instances of the class. Oops.
- Fixed currency symbols regex, so as not to replace all instances of the letter "z"
when a custom string is passed into `replace_currency_symbols()`. (137)
- Fixed README usage example, which skipped downloading of dataset data. Btw,
see above for another way! (124)
- Fixed typo in the API reference, which included the SupremeCourt dataset twice
and omitted the RedditComments dataset. (129)
- Fixed typo in `RedditComments.download()` that prevented it from downloading
any data. (143)

Contributors:

Many thanks to asifm, harryhoch, and mdlynch37 for submitting PRs!

0.4.1

Changed:

- Added key classes to the top-level `textacy` imports, for convenience:
- `textacy.text_stats.TextStats` => `textacy.TextStats`
- `textacy.vsm.Vectorizer` => `textacy.Vectorizer`
- `textacy.tm.TopicModel` => `textacy.TopicModel`
- Added tests for `textacy.Doc` and updated the README's usage example

Fixed:

- Added explicit encoding when opening Wikipedia database files in text mode to
resolve an issue when doing so without encoding on Windows (PR 118)
- Fixed `keyterms.most_discriminating_terms` to use the `vsm.Vectorizer` class
rather than the `vsm.doc_term_matrix` function that it replaced (PR 120)
- Fixed mishandling of a couple optional args in `Doc.to_terms_list`

Contributors:

Thanks to minketeer and Gregory-Howard for the fixes!

0.4.0

New and Changed:

- Refactored and expanded built-in `corpora`, now called `datasets` (PR 112)
- The various classes in the old `corpora` subpackage had a similar but
frustratingly not-identical API. Also, some fetched the corresponding dataset
automatically, while others required users to do it themselves. Ugh.
- These classes have been ported over to a new `datasets` subpackage; they
now have a consistent API, consistent features, and consistent documentation.
They also have some new functionality, including pain-free downloading of
the data and saving it to disk in a stream (so as not to use all your RAM).
- Also, there's a new dataset: A collection of 2.7k Creative Commons texts
from the Oxford Text Archive, which rounds out the included datasets with
English-language, 16th-20th century _literary_ works. (h/t JonathanReeve)
- A `Vectorizer` class to convert tokenized texts into variously weighted
document-term matrices (Issue 69, PR 113)
- This class uses the familiar `scikit-learn` API (which is also consistent
with the `textacy.tm.TopicModel` class) to convert one or more documents
in the form of "term lists" into weighted vectors. An initial set of documents
is used to build up the matrix vocabulary (via `.fit()`), which can then
be applied to new documents (via `.transform()`).
- It's similar in concept and usage to sklearn's `CountVectorizer` or
`TfidfVectorizer`, but doesn't bundle tokenization into the task as they do.
This means users have more flexibility in deciding which terms to vectorize.
This class outright replaces the `textacy.vsm.doc_term_matrix()` function.
(A rough sketch of the workflow follows this bullet.)
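
Roughly, the fit/transform workflow looks like this (default weighting; any init params for this version would be assumptions, so none are shown):

```python
from textacy.vsm import Vectorizer

train_term_lists = [["cats", "chase", "mice"], ["dogs", "chase", "cats"]]
new_term_lists = [["mice", "eat", "cheese"]]

vectorizer = Vectorizer()                      # default term weighting
vectorizer.fit(train_term_lists)               # builds the matrix vocabulary
doc_term_matrix = vectorizer.transform(train_term_lists)
new_matrix = vectorizer.transform(new_term_lists)   # applies that vocabulary to new docs
```
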
- Customizable automatic language detection for `Doc` s
- Although `cld2-cffi` is fast and accurate, its installation is problematic
for some users. Since other language detection libraries are available
(e.g. [`langdetect`](https://github.com/Mimino666/langdetect) and
[`langid`](https://github.com/saffsd/langid.py)), it makes sense to let
users choose, as needed or desired.
- First, `cld2-cffi` is now an optional dependency, i.e. is not installed
by default. To install it, do `pip install textacy[lang]` or (for it and
all other optional deps) do `pip install textacy[all]`. (PR 86)
- Second, the `lang` param used to instantiate `Doc` objects may now
be a callable that accepts a unicode string and returns a standard 2-letter
language code. This could be a function that uses `langdetect` under the
hood, or a function that always returns "de" -- it's up to users. Note that
the default value is now `textacy.text_utils.detect_language()`, which
uses `cld2-cffi`, so the default behavior is unchanged. (See the example
after this bullet.)
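
For example, with `langdetect` standing in for the optional `cld2-cffi` dependency:

```python
from langdetect import detect   # detect() returns a language code, e.g. "de"

import textacy

doc = textacy.Doc("Ich bin ein Berliner.", lang=detect)
# or pin the language outright with any callable:
doc_de = textacy.Doc("Ich bin ein Berliner.", lang=lambda text: "de")
```
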
- Customizable punctuation removal in the `preprocessing` module (Issue 91)
- Users can now specify which punctuation marks they wish to remove, rather
than always removing _all_ marks (see the snippet after this bullet).
- In the case that all marks are removed, however, performance is now 5-10x
faster by using Python's built-in `str.translate()` method instead of
a regular expression.
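
For instance, to strip only exclamation and question marks (the `marks` kwarg name and module path are assumptions):

```python
from textacy.preprocess import remove_punct

# only the listed marks are removed; everything else is left alone
print(remove_punct("Hey! Isn't this great?", marks="!?"))   # -> "Hey Isn't this great"
```
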
- `textacy`, installable via `conda` (PR 100)
- The package has been added to Conda-Forge ([here](https://github.com/conda-forge/textacy-feedstock)),
and installation instructions have been added to the docs. Hurray!
- `textacy`, now with helpful badges
- Builds are now automatically tested via Travis CI, and there's a badge in
the docs showing whether the build passed or not. The days of my ignoring
broken tests in `master` are (probably) over...
- There are also badges showing the latest releases on GitHub, pypi, and
conda-forge (see above).

Fixed:

- Fixed the check for overlap between named entities and unigrams in the
`Doc.to_terms_list()` method (PR 111)
- `Corpus.add_texts()` uses CPU_COUNT - 1 threads by default, rather than
always assuming that 4 cores are available (Issue 89)
- Added a missing coding declaration to a test file, without which tests failed
for Python 2 (PR 99)
- `readability_stats()` now catches an exception raised on empty documents and
logs a message, rather than barfing with an unhelpful `ZeroDivisionError`.
(Issue 88)
- Added a check for empty terms list in `terms_to_semantic_network` (Issue 105)
- Added and standardized module-specific loggers throughout the code base; not
a bug per se, but certainly some much-needed housecleaning
- Added a note to the docs about expectations for bytes vs. unicode text (PR 103)

Contributors:

Thanks to henridwyer, rolando, pavlin99th, and kyocum for their contributions!
:raised_hands:
