
Latest version: v0.13.0





- **Add a new spacier sub-package for spaCy-oriented functionality** (168, 187)
- Thus far, this includes a `components` module with two custom spaCy
pipeline components: one to compute text stats on parsed documents, and
another to merge named entities into single tokens in an efficient manner.
More to come!
- Similar functionality in the top-level `spacy_pipelines` module has been
deprecated; it will be removed in v0.7.0.


- Update the readme, usage, and API reference docs to be clearer and (I hope)
more useful. (186)
- Removing punctuation from a text via the `preprocessing` module now replaces
punctuation marks with a single space rather than an empty string. This gives
better behavior in many situations; for example, "won't" => "won t" rather than
"wont", the latter of which is a valid word with a different meaning.
- Categories are now correctly extracted from non-English language Wikipedia
datasets, starting with French and German and extendable to others. (175)
- Log progress when adding documents to a corpus. At the debug level, every
doc's addition is logged; at the info level, only one message per batch
of documents is logged. (183)
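
The punctuation change described above can be illustrated with a minimal standard-library sketch. This is not textacy's actual implementation; the helper name `remove_punct` and its `repl` parameter are made up for this example, and only ASCII punctuation is handled.

```python
import string


def remove_punct(text, repl=" "):
    """Replace each ASCII punctuation mark in ``text`` with ``repl``.

    Hypothetical helper for illustration; not part of textacy's API.
    """
    # Map every punctuation code point to the replacement string.
    table = {ord(ch): repl for ch in string.punctuation}
    return text.translate(table)


new_style = remove_punct("won't")           # punctuation -> single space
old_style = remove_punct("won't", repl="")  # punctuation -> empty string
print(repr(new_style), repr(old_style))
```

With a space as the replacement, the two halves of the contraction stay separate tokens instead of fusing into a different word.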


- Fix two breaking typos in `extract.direct_quotations()`. (issue 177)
- Prevent crashes when adding non-parsed documents to a `Corpus`. (180)
- Fix bugs in `keyterms.most_discriminating_terms()` that used `vsm`
functionality as it was *before* the changes in v0.6.0. (189)
- Fix a breaking typo in `vsm.matrix_utils.apply_idf_weighting()`, and rename
the problematic kwarg for consistency with related functions. (190)


Big thanks to sammous, dixiekong (nice name!), and SandyRogers for the pull
requests, and many more for pointing out various bugs and the rougher edges /
unsupported use cases of this package.



- **Rename, refactor, and extend I/O functionality** (PR 151)
- Related read/write functions were moved from `read.py` and `write.py` into
format-specific modules, and similar functions were consolidated into one
with the addition of an arg. For example, `write.write_json()` and
`write.write_json_lines()` => `json.write_json(lines=True|False)`.
- Useful functionality was added to a few readers/writers. For example,
`write_json()` now automatically handles python dates/datetimes, writing
them to disk as ISO-formatted strings rather than raising a TypeError
("datetime is not JSON serializable", ugh). CSVs can now be written to /
read from disk when each row is a dict rather than a list. Reading/writing
HTTP streams now allows for basic authentication.
- Several things were renamed to improve clarity and consistency from a user's
perspective, most notably the subpackage name: `fileio` => `io`. Others:
`read_file()` and `write_file()` => `read_text()` and `write_text()`;
`split_record_fields()` => `split_records()`, although I kept an alias
to the old function for folks; `auto_make_dirs` boolean kwarg => `make_dirs`.
- `io.open_sesame()` now handles zip files (provided they contain only 1 file)
as it already does for gzip, bz2, and lzma files. On a related note, Python 2
users can now open lzma (`.xz`) files if they've installed `backports.lzma`.
- **Improve, refactor, and extend vector space model functionality** (PRs 156 and 167)
- BM25 term weighting and document-length normalization were implemented, and
users can now flexibly add and customize individual components of an
overall weighting scheme (local scaling + global scaling + doc-wise normalization).
For API sanity, several additions and changes to the `Vectorizer` init
params were required --- sorry bout it!
- Given all the new weighting possibilities, a `Vectorizer.weighting` attribute
was added for curious users, to give a mathematical representation of how
values in a doc-term matrix are being calculated. Here's a simple and a
not-so-simple case:

>>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
'tf * log((n_docs + 1) / (df + 1)) + 1'
>>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='smooth', apply_dl=True).weighting
'(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths))) * log((n_docs - df + 0.5) / (df + 0.5))'

- Terms are now sorted alphabetically after fitting, so you'll have a consistent
and interpretable ordering in your vocabulary and doc-term-matrix.
- A `GroupVectorizer` class was added as a child of `Vectorizer`. It extends
typical document-term matrix vectorization, in which each row vector
corresponds to the weighted terms co-occurring in a single document, so that
rows can instead correspond to customized groups, such as a shared author or
publication year, that may span multiple documents. This spares users from
merging / concatenating those documents themselves.
- Lastly, the `vsm.py` module was refactored into a `vsm` subpackage with
two modules. Imports should stay the same, but the code structure is now
more amenable to future additions.
- **Miscellaneous additions and improvements**
- Flesch Reading Ease in the `textstats` module is now multi-lingual! Language-
specific formulations for German, Spanish, French, Italian, Dutch, and Russian
were added, in addition to (the default) English. (PR 158, prompted by Issue 155)
- Runtime performance, as well as docs and error messages, of functions for
generating semantic networks from lists of terms or sentences were improved. (PR 163)
- Labels on named entities from which determiners have been dropped are now
preserved. There's still a minor gotcha, but it's explained in the docs.
- The size of `textacy`'s data cache can now be set via an environment
variable, `TEXTACY_MAX_CACHE_SIZE`, in case the default 2GB cache doesn't
meet your needs.
- Docstrings were improved in many ways, large and small, throughout the code.
May they guide you even more effectively than before!
- The package version is now set from a single source. This isn't for you so
much as me, but it does prevent confusing version mismatches b/w code, pypi,
and docs.
- All tests have been converted from `unittest` to `pytest` style. They
run faster, they're more informative in failure, and they're easier to extend.
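
As a rough illustration of the I/O changes noted above (one writer with a `lines` switch, and dates written as ISO-formatted strings), here is a standard-library sketch. It is not textacy's code: `dumps_records` and `_default` are hypothetical names, and the sample record is invented.

```python
import datetime
import json


def _default(obj):
    """Fallback for objects that ``json`` can't serialize natively."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def dumps_records(records, lines=False):
    """Serialize records as one JSON array, or as one JSON object per line.

    Hypothetical helper for illustration; not part of textacy's API.
    """
    if lines:
        return "\n".join(json.dumps(rec, default=_default) for rec in records)
    return json.dumps(records, default=_default)


# Invented sample data.
records = [{"title": "a", "date": datetime.date(2018, 1, 2)}]
as_lines = dumps_records(records, lines=True)
as_array = dumps_records(records)
```

Passing a `default` callable to `json.dumps` is what avoids the `TypeError` on dates; anything else that can't be serialized still raises.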

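To make the BM25 weighting string quoted above more concrete, the sketch below evaluates it for a single term in a single document. Several things here are assumptions: the function name is hypothetical, the values `k=1.2` and `b=0.75` are commonly used choices rather than confirmed textacy defaults, and the term-frequency part is assumed to be multiplied by the log (idf) part, since the quoted string appears to be missing a closing parenthesis.

```python
import math


def bm25_weight(tf, df, n_docs, length, avg_length, k=1.2, b=0.75):
    """BM25-style weight for one term in one document.

    Hypothetical helper; k and b are assumed values, not textacy's.
    """
    # Term-frequency part, dampened by k and normalized by document length.
    tf_part = (tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg_length)))
    # Inverse document frequency part.
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part


# Same term counts, but one document is shorter than average and one longer.
short_doc = bm25_weight(tf=3, df=10, n_docs=1000, length=50, avg_length=100)
long_doc = bm25_weight(tf=3, df=10, n_docs=1000, length=200, avg_length=100)
```

The same term frequency earns a higher weight in the shorter document, which is the effect of document-length normalization.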

- Fixed an issue where existing metadata associated with a spaCy `Doc` was being
overwritten with an empty dict when using it to initialize a textacy `Doc`.
Users can still overwrite existing metadata, but only if they pass in new data.
- Added a missing import to the README's usage example. (149)
- The intersphinx mapping to `numpy` got fixed (and items for `scipy` and
`matplotlib` were added, too). Taking advantage of that, a bunch of broken
object links scattered throughout the docs got fixed.
- Fixed broken formatting of old entries in the changelog, for your reading pleasure.



- **Bumped version requirement for spaCy from < 2.0 to >= 2.0** --- textacy no longer
works with spaCy 1.x! It's worth the upgrade, though. v2.0's new features and
API enabled (or required) a few changes on textacy's end:
- `textacy.load_spacy()` takes the same inputs as the new `spacy.load()`,
i.e. a package `name` string and an optional list of pipes to `disable`
- textacy's `Doc` metadata and language string are now stored in `user_data`
directly on the spaCy `Doc` object; although the API from a user's perspective
is unchanged, this made the next change possible
- `Doc` and `Corpus` classes are now de/serialized via pickle into a single
file --- no more side-car JSON files for metadata! Accordingly, the `.save()`
and `.load()` methods on both classes have a simpler API: they take
a single string specifying the file on disk where data is stored.
- **Cleaned up docs, imports, and tests throughout the entire code base.**
- docstrings and https://textacy.readthedocs.io 's API reference are easier to
read, with better cross-referencing and far fewer broken web links
- namespaces are less cluttered, and textacy's source code is easier to follow
- `import textacy` takes less than half the time from before
- the full test suite also runs about twice as fast, and most tests are now
more robust to changes in the performance of spaCy's models
- consistent adherence to conventions eases users' cognitive load :)
- **The module responsible for caching loaded data in memory was cleaned up and
improved**, as well as renamed: from `data.py` to `cache.py`, which is more
descriptive of its purpose. Otherwise, you shouldn't notice much of a difference
besides *things working correctly*.
- All loaded data (e.g. spaCy language pipelines) is now cached together in a
single LRU cache whose max size is set to 2GB, and the size of each element
in the cache is now accurately computed. (tl;dr: `sys.getsizeof` does not
work on non-built-in objects like, say, a `spacy.tokens.Doc`.)
- Loading and downloading of the DepecheMood resource is now less hacky and
weird, and much closer to how users already deal with textacy's various
`Dataset`s. In fact, it can be downloaded in exactly the same way as the
datasets via textacy's new CLI: `$ python -m textacy download depechemood`.
P.S. A brief guide for using the CLI got added to the README.
- **Several function/method arguments marked for deprecation have been removed.**
If you've been ignoring the warnings that print out when you use `lemmatize=True`
instead of `normalize='lemma'` (etc.), now is the time to update your calls!
- Of particular note: The `readability_stats()` function has been removed;
use `TextStats(doc).readability_stats` instead.
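
The idea of a single LRU cache bounded by total size rather than item count, as described above, can be sketched as follows. This is an assumption-laden illustration and not textacy's `cache.py`: the class name is hypothetical, and sizes are passed in explicitly by the caller because, as noted, `sys.getsizeof` is unreliable for non-built-in objects.

```python
from collections import OrderedDict


class SizedLRUCache:
    """LRU cache bounded by the total size of its values (hypothetical)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.current_size = 0
        self._items = OrderedDict()  # key -> (value, size)

    def __contains__(self, key):
        return key in self._items

    def get(self, key):
        value, _size = self._items[key]
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, size):
        if key in self._items:
            # Replacing an entry: forget the old size first.
            self.current_size -= self._items.pop(key)[1]
        self._items[key] = (value, size)
        self.current_size += size
        # Evict least recently used entries until the total fits again;
        # a single oversized entry is kept rather than evicted.
        while self.current_size > self.max_size and len(self._items) > 1:
            _key, (_value, old_size) = self._items.popitem(last=False)
            self.current_size -= old_size


cache = SizedLRUCache(max_size=100)
cache.put("en", "english pipeline", size=60)
cache.put("de", "german pipeline", size=30)
cache.get("en")                              # "en" becomes most recently used
cache.put("fr", "french pipeline", size=40)  # over budget, so "de" is evicted
```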


- In certain situations, the text of a spaCy span was being returned without
whitespace between tokens; that has been avoided in textacy, and the source bug
in spaCy got fixed (by yours truly! https://github.com/explosion/spaCy/pull/1621).
- When adding already-parsed `Doc`s to a `Corpus`, including `metadata`
now correctly overwrites any existing metadata on those docs.
- Fixed a couple related issues involving the assignment of a 2-letter language
string to the `.lang` attribute of `Doc` and `Corpus` objects.
- textacy's CLI wasn't correctly handling certain dataset kwargs in all cases;
now, all kwargs get to their intended destinations.



- Added a CLI for downloading `textacy`-related data, inspired by the `spaCy`
equivalent. It's *temporarily* undocumented, but to see available commands and
options, just pass the usual flag: `$ python -m textacy --help`. Expect more
functionality (and docs!) to be added soonish. (144)
- Note: The existing `Dataset.download()` methods work as before, and in fact,
they are being called under the hood from the command line.


- Made usage of `networkx` v2.0-compatible, and therefore dropped the <2.0
version requirement on that dependency. Upgrade as you please! (131)
- Improved the regex for identifying phone numbers so that it's easier to view
and interpret its matches. (128)


- Fixed caching of counts on `textacy.Doc` to be instance-specific, rather than
shared by all instances of the class. Oops.
- Fixed currency symbols regex, so as not to replace all instances of the letter "z"
when a custom string is passed into `replace_currency_symbols()`. (137)
- Fixed README usage example, which skipped downloading of dataset data. Btw,
see above for another way! (124)
- Fixed typo in the API reference, which included the SupremeCourt dataset twice
and omitted the RedditComments dataset. (129)
- Fixed typo in `RedditComments.download()` that prevented it from downloading
any data. (143)
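
For readers unfamiliar with the instance-specific vs. class-shared caching issue mentioned above, here is one common way that kind of bug arises in Python. The source does not say how the original bug was actually caused, so treat this as a generic sketch; both class names are hypothetical.

```python
class SharedCounts:
    counts = {}  # class attribute: one dict shared by every instance

    def count(self, key, value):
        self.counts[key] = value


class InstanceCounts:
    def __init__(self):
        self.counts = {}  # instance attribute: a fresh dict per instance

    def count(self, key, value):
        self.counts[key] = value


a, b = SharedCounts(), SharedCounts()
a.count("n_words", 10)  # also visible through b, since the dict is shared

c, d = InstanceCounts(), InstanceCounts()
c.count("n_words", 10)  # d keeps its own, still-empty dict
```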


Many thanks to asifm, harryhoch, and mdlynch37 for submitting PRs!



- Added key classes to the top-level `textacy` imports, for convenience:
- `textacy.text_stats.TextStats` => `textacy.TextStats`
- `textacy.vsm.Vectorizer` => `textacy.Vectorizer`
- `textacy.tm.TopicModel` => `textacy.TopicModel`
- Added tests for `textacy.Doc` and updated the README's usage example


- Added explicit encoding when opening Wikipedia database files in text mode to
resolve an issue when doing so without encoding on Windows (PR 118)
- Fixed `keyterms.most_discriminating_terms` to use the `vsm.Vectorizer` class
rather than the `vsm.doc_term_matrix` function that it replaced (PR 120)
- Fixed mishandling of a couple optional args in `Doc.to_terms_list`


Thanks to minketeer and Gregory-Howard for the fixes!


New and Changed:

- Refactored and expanded built-in `corpora`, now called `datasets` (PR 112)
- The various classes in the old `corpora` subpackage had a similar but
frustratingly not-identical API. Also, some fetched the corresponding dataset
automatically, while others required users to do it themselves. Ugh.
- These classes have been ported over to a new `datasets` subpackage; they
now have a consistent API, consistent features, and consistent documentation.
They also have some new functionality, including pain-free downloading of
the data and saving it to disk in a stream (so as not to use all your RAM).
- Also, there's a new dataset: A collection of 2.7k Creative Commons texts
from the Oxford Text Archive, which rounds out the included datasets with
English-language, 16th-20th century _literary_ works. (h/t JonathanReeve)
- A `Vectorizer` class to convert tokenized texts into variously weighted
document-term matrices (Issue 69, PR 113)
- This class uses the familiar `scikit-learn` API (which is also consistent
with the `textacy.tm.TopicModel` class) to convert one or more documents
in the form of "term lists" into weighted vectors. An initial set of documents
is used to build up the matrix vocabulary (via `.fit()`), which can then
be applied to new documents (via `.transform()`).
- It's similar in concept and usage to sklearn's `CountVectorizer` or
`TfidfVectorizer`, but doesn't bundle in the tokenization task as they do.
This means users have more flexibility in deciding which terms to vectorize.
This class outright replaces the `textacy.vsm.doc_term_matrix()` function.
- Customizable automatic language detection for `Doc`s
- Although `cld2-cffi` is fast and accurate, its installation is problematic
for some users. Since other language detection libraries are available
(e.g. [`langdetect`](https://github.com/Mimino666/langdetect) and
[`langid`](https://github.com/saffsd/langid.py)), it makes sense to let
users choose, as needed or desired.
- First, `cld2-cffi` is now an optional dependency, i.e. is not installed
by default. To install it, do `pip install textacy[lang]` or (for it and
all other optional deps) do `pip install textacy[all]`. (PR 86)
- Second, the `lang` param used to instantiate `Doc` objects may now
be a callable that accepts a unicode string and returns a standard 2-letter
language code. This could be a function that uses `langdetect` under the
hood, or a function that always returns "de" -- it's up to users. Note that
the default value is now `textacy.text_utils.detect_language()`, which
uses `cld2-cffi`, so the default behavior is unchanged.
- Customizable punctuation removal in the `preprocessing` module (Issue 91)
- Users can now specify which punctuation marks they wish to remove, rather
than always removing _all_ marks.
- In the case that all marks are removed, however, performance is now 5-10x
faster by using Python's built-in `str.translate()` method instead of
a regular expression.
- `textacy`, installable via `conda` (PR 100)
- The package has been added to Conda-Forge ([here](https://github.com/conda-forge/textacy-feedstock)),
and installation instructions have been added to the docs. Hurray!
- `textacy`, now with helpful badges
- Builds are now automatically tested via Travis CI, and there's a badge in
the docs showing whether the build passed or not. The days of my ignoring
broken tests in `master` are (probably) over...
- There are also badges showing the latest releases on GitHub, pypi, and
conda-forge (see above).
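
The "string or callable" behavior of the `lang` param described above can be sketched in plain Python. This does not use textacy itself: `resolve_lang`, `always_german`, and `naive_detector` are hypothetical names, and the detector is a toy heuristic rather than a stand-in for `cld2-cffi`, `langdetect`, or `langid`.

```python
def resolve_lang(text, lang):
    """Return a 2-letter language code from a string or a callable.

    Hypothetical helper for illustration; not part of textacy's API.
    """
    if callable(lang):
        return lang(text)
    return lang


def always_german(text):
    """A 'detector' that ignores the text and always answers 'de'."""
    return "de"


def naive_detector(text):
    """Toy heuristic for illustration only; real detectors are far more robust."""
    return "de" if " und " in f" {text.lower()} " else "en"


fixed = resolve_lang("Hallo Welt", "de")
constant = resolve_lang("anything at all", always_german)
detected = resolve_lang("Katz und Maus", naive_detector)
```

Accepting a callable keeps the detection library out of the core dependency list while leaving the default behavior unchanged.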


- Fixed the check for overlap between named entities and unigrams in the
`Doc.to_terms_list()` method (PR 111)
- `Corpus.add_texts()` uses CPU_COUNT - 1 threads by default, rather than
always assuming that 4 cores are available (Issue 89)
- Added a missing coding declaration to a test file, without which tests failed
for Python 2 (PR 99)
- `readability_stats()` now catches an exception raised on empty documents and
logs a message, rather than barfing with an unhelpful `ZeroDivisionError`.
(Issue 88)
- Added a check for empty terms list in `terms_to_semantic_network` (Issue 105)
- Added and standardized module-specific loggers throughout the code base; not
a bug per se, but certainly some much-needed housecleaning.
- Added a note to the docs about expectations for bytes vs. unicode text (PR 103)


Thanks to henridwyer, rolando, pavlin99th, and kyocum for their contributions!
