New and Changed:
- **Removed textacy.Doc, and split its functionality into two parts**
- **New:** Added `textacy.make_spacy_doc()` as a convenient and flexible entry point
for making spaCy `Doc`s from text or (text, metadata) pairs, with optional
spaCy language pipeline specification. It's similar to `textacy.Doc.__init__`,
except that text and metadata are passed in together as a 2-tuple.
- **New:** Added a variety of custom doc property and method extensions to
the global `spacy.tokens.Doc` class, accessible via its `Doc._` "underscore"
property. These are similar to the properties/methods on `textacy.Doc`;
they just require an interstitial underscore. For example,
`textacy.Doc.to_bag_of_words()` => `spacy.tokens.Doc._.to_bag_of_words()`.
- **New:** Added functions for setting, getting, and removing these extensions.
Note that they are set automatically when textacy is imported.
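A minimal sketch of the new entry point and the underscore extensions together. The sample text/metadata, the `en_core_web_sm` pipeline name, and the `doc._.meta` attribute are illustrative assumptions rather than exhaustive API documentation; `to_bag_of_words()` is the extension named above.

```python
import textacy

text = "Since the so-called statistical revolution in the late 1980s..."
metadata = {"title": "Natural-language processing", "source": "Wikipedia"}

# make a Doc from text alone, or from a (text, metadata) pair; lang may be
# a pipeline name or an already-loaded spaCy Language (name here is assumed)
doc = textacy.make_spacy_doc((text, metadata), lang="en_core_web_sm")

# textacy's custom extensions live under the "underscore" namespace
print(doc._.meta)               # the metadata dict passed in above (assumed name)
print(doc._.to_bag_of_words())  # formerly textacy.Doc.to_bag_of_words()
```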
- **Simplified and improved performance of textacy.Corpus**
- Documents are now added through a simpler API, either in `Corpus.__init__`
or `Corpus.add()`; they may be one item or a stream of items, each a text,
a (text, metadata) pair, or an existing spaCy `Doc`. When adding many documents, the spaCy
language processing pipeline is used in a faster and more efficient way.
- Saving / loading corpus data to disk is now more efficient and robust.
- Note: `Corpus` is now a collection of spaCy `Doc`s rather than `textacy.Doc`s.
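A rough sketch of the new `Corpus` workflow, under the assumption that the constructor takes a language (name or loaded pipeline) plus optional data, and that `save()` / `load()` take a file path; the example records, the pipeline name, and the `corpus.bin.gz` path are made up.

```python
import textacy

records = [
    ("I love this movie. It was great!", {"label": "pos"}),
    ("What a waste of two hours.", {"label": "neg"}),
]

# build a corpus from a stream of texts / records / spaCy Docs...
corpus = textacy.Corpus("en_core_web_sm", records)
# ...or add more items to an existing corpus, one at a time or many at once
corpus.add("An okay film, nothing more.")

# round-trip the corpus to disk (hypothetical file path)
corpus.save("corpus.bin.gz")
corpus = textacy.Corpus.load("en_core_web_sm", "corpus.bin.gz")
```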
- **Simplified, standardized, and added Dataset functionality**
- **New:** Added an `IMDB` dataset, built on the classic 2011 dataset
commonly used to train sentiment analysis models.
- **New:** Added a base `Wikimedia` dataset, from which a reworked
`Wikipedia` dataset and a separate `Wikinews` dataset inherit.
The underlying data source has changed, from XML db dumps of raw wiki markup
to JSON db dumps of (relatively) clean text and metadata; now, the code is
simpler, faster, and totally language-agnostic.
- `Dataset.records()` now streams (text, metadata) pairs rather than a dict
containing both text and metadata, so users don't need to know field names
and split them into separate streams before creating `Doc` or `Corpus`
objects from the data.
- Filtering and limiting the number of texts/records produced is now clearer
and more consistent between the `.texts()` and `.records()` methods on
a given `Dataset`, and faster as well.
- Downloading datasets now always shows progress bars and saves to the same
file names. When appropriate, downloaded archive files' contents are
automatically extracted for easy inspection.
- Common functionality (such as validating filter values) is now standardized
and consolidated in the `datasets.utils` module.
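For example, the new `IMDB` dataset follows the same pattern as the others. The sketch below assumes the usual `download()` / `texts()` / `records()` methods with a `limit` keyword; dataset-specific filters (e.g. by label or rating) are omitted.

```python
import textacy.datasets

ds = textacy.datasets.IMDB()
ds.download()  # shows a progress bar and extracts the archive automatically

# stream texts only...
for text in ds.texts(limit=2):
    print(text[:75])

# ...or (text, metadata) pairs, ready to feed into make_spacy_doc() or Corpus
for text, meta in ds.records(limit=2):
    print(meta, text[:75])
```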
- **Quality of life improvements**
- Reduced load time for `import textacy` from ~2-3 seconds to ~1 second
by lazy-loading expensive variables, deferring a couple of heavy imports, and
dropping a couple of dependencies. Specifically:
- `ftfy` was dropped, and a `NotImplementedError` is now raised
in textacy's wrapper function, `textacy.preprocess.fix_bad_unicode()`.
Users with bad unicode should now directly call `ftfy.fix_text()`.
- `ijson` was dropped, and the behavior of `textacy.read_json()`
is now simpler and consistent with other functions for line-delimited data.
- `mwparserfromhell` was dropped, since the reworked `Wikipedia` dataset
no longer requires complicated and slow parsing of wiki markup.
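In place of the two dropped wrappers above, usage now looks roughly like this; the `data.jsonl` path is hypothetical, and the `lines` keyword on `read_json()` is written from memory, so check it against the docs.

```python
import ftfy
import textacy.io

# bad unicode now goes straight through ftfy
print(ftfy.fix_text("âœ” No problems"))  # => "✔ No problems"

# JSON and line-delimited JSON now share one simple reader
for record in textacy.io.read_json("data.jsonl", lines=True):
    print(record)
```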
- Renamed certain functions and variables for clarity, and for consistency with
existing conventions:
- `textacy.load_spacy()` => `textacy.load_spacy_lang()`
- `textacy.extract.named_entities()` => `textacy.extract.entities()`
- `textacy.data_dir` => `textacy.DEFAULT_DATA_DIR`
- `filename` => `filepath` and `dirname` => `dirpath` when specifying
full paths to files/dirs on disk, and `textacy.io.utils.get_filenames()`
=> `textacy.io.utils.get_filepaths()` accordingly
- compiled regular expressions now consistently start with `RE_`
- `SpacyDoc` => `Doc`, `SpacySpan` => `Span`, `SpacyToken` => `Token`,
`SpacyLang` => `Language` as variables and in docs
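The renames are drop-in replacements; here's a short sketch using the new names. The model name and example sentence are arbitrary, and keyword details like `disable` are from memory.

```python
import textacy
import textacy.extract
import textacy.io.utils

# formerly textacy.load_spacy()
nlp = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))

doc = textacy.make_spacy_doc("Mark Zuckerberg founded Facebook in 2004.", lang=nlp)

# formerly textacy.extract.named_entities()
entities = list(textacy.extract.entities(doc))

# formerly textacy.data_dir and textacy.io.utils.get_filenames()
filepaths = list(textacy.io.utils.get_filepaths(textacy.DEFAULT_DATA_DIR))
```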
- Removed deprecated functionality
- top-level `spacy_utils.py` and `spacy_pipelines.py` are gone;
use equivalent functionality in the `spacier` subpackage instead
- `math_utils.py` is gone; it was long neglected, and never actually used
- Replaced `textacy.compat.bytes_to_unicode()` and `textacy.compat.unicode_to_bytes()`
with `textacy.compat.to_unicode()` and `textacy.compat.to_bytes()`, which
are safer and accept either binary or text strings as input.
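A quick sketch of the new pair, assuming UTF-8 as the default encoding:

```python
from textacy import compat

# both functions accept bytes or str and coerce to the requested type
assert compat.to_unicode(b"caf\xc3\xa9") == "café"
assert compat.to_bytes("café") == b"caf\xc3\xa9"
```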
- Moved and renamed language detection functionality,
`textacy.text_utils.detect_language()` => `textacy.lang_utils.detect_lang()`.
The idea is to add more/better lang-related functionality here in the future.
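For example (the exact return value depends on the underlying language-identification model):

```python
from textacy import lang_utils

print(lang_utils.detect_lang("Ceci n'est pas une pipe."))  # expected: "fr"
```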
- Updated and cleaned up documentation throughout the code base.
- Added and refactored _many_ tests, for both new and old functionality,
significantly increasing test coverage while reducing overall run-time.
Also, added a proper coverage report to CI builds. This should help prevent
future errors and inspire better test-writing.
- Bumped the minimum required spaCy version: `v2.0.0` => `v2.0.12`,
for access to their full set of custom extension functionality.
Fixed:
- The progress bar during an HTTP download now always closes, preventing weird
nesting issues if another bar is subsequently displayed.
- Filtering datasets by multiple values performed either a logical AND or OR
over the values, which was confusing; now, a logical OR is always performed.
- The existence of files/directories on disk is now checked _properly_ via
`os.path.isfile()` or `os.path.isdir()`, rather than `os.path.exists()`.
- Fixed a variety of formatting errors raised by sphinx when generating HTML docs.