Textacy

Latest version: v0.13.0

0.13.0

- upgraded built-in language identification model (PR 375)
- replaced v2 thinc/cld3 model with v3 floret/fasttext model, which has much faster predictions and comparable but more consistent performance
- modernized and improved Python packaging for faster, simpler installation and testing (PR 368 and 369)
- all package metadata and configuration moved into a single `pyproject.toml` file
- code formatting and linting updated to use `ruff` plus newer versions of `mypy` and `black`, and their use in GitHub Actions CI has been consolidated
- bumped supported Python versions range from 3.8–3.10 to 3.9–3.11 (PR 369)
- added full CI testing matrix for PY 3.9/3.10/3.11 x Linux/macOS/Windows, and removed extraneous AppVeyor integration
- updated and improved type hints throughout, reducing number of `mypy` complaints by ~80% (PR 372)

Fixed

- fixed ReDoS bugs in regex patterns (PR 371)
- fixed breaking API issues with newer networkx/scikit-learn versions (PR 367)
- improved dev workflow documentation and code to better incorporate language data (PR 363)
- updated caching code with a fix from upstream pysize library, which was preventing Russian-language spaCy model from loading properly (PR 358)

Contributors

Big thanks to jonwiggins, Hironsan, and kevinbackhouse for the fixes!

0.12.0

- Refactored and extended text statistics functionality (PR 350)
- Added functions for computing measures of lexical diversity, such as the classic Type-Token-Ratio and modern Hypergeometric Distribution Diversity
- Added functions for counting token-level attributes, including morphological features and parts-of-speech, in a convenient form
- Refactored all text stats functions to accept a `Doc` as their first positional arg, suitable for use as custom doc extensions (see below)
- Deprecated the `TextStats` class, since other methods for accessing the underlying functionality were made more accessible and convenient, and there's no longer a need for a third method.
- Standardized functionality for getting/setting/removing doc extensions (PR 352)
- Now, custom extensions are accessed by name, and users have more control over the process:

```python
>>> import textacy
>>> from textacy import extract, text_stats
>>> textacy.set_doc_extensions("extract")
>>> textacy.set_doc_extensions("text_stats.readability")
>>> textacy.remove_doc_extensions("extract.matches")
>>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
118.17500000000001
```


- Moved top-level extensions into `spacier.core` and `extract.bags`
- Standardized `extract` and `text_stats` subpackage extensions to use the new setup, and made them more customizable
- Improved package code, tests, and docs
- Fixed outdated code and comments in the "Quickstart" guide, then renamed it "Walkthrough" since it wasn't actually quick; added a new and, yes, quick "Quickstart" guide to fill the gap (PR 353)
- Added a `pytest` conftest file to improve maintainability and consistency of unit test suite (PR 353)
- Improved quality and consistency of type annotations, everywhere (PR 349)
- **Note:** Bumped Python version support from 3.7–3.9 to 3.8–3.10 in order to take advantage of new typing features in PY3.8 and formally support the current major version (PR 348)
- Modernized and streamlined package builds and configuration (PR 347)
- Removed deprecated `setup.py` and switched from `setuptools` to `build` for builds
- Consolidated tool configuration in `pyproject.toml`
- Extended and tidied up dev-oriented `Makefile`
- Addressed some CI/CD issues
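
For readers unfamiliar with the lexical diversity measures mentioned above, the classic type-token ratio is the number of unique word types divided by the total number of tokens. The following is a minimal sketch of that formula in plain Python. The function name `type_token_ratio` and the lowercasing step are assumptions made for illustration; this is not textacy's implementation or API.

```python
from typing import Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Hypothetical helper: unique word types divided by total tokens."""
    # Lowercasing is an assumption here, so "The" and "the" count as one type.
    lowered = [tok.lower() for tok in tokens]
    if not lowered:
        # Avoid dividing by zero on empty input.
        return 0.0
    return len(set(lowered)) / len(lowered)


words = "The cat sat on the mat".split()
ttr = type_token_ratio(words)  # 5 unique types over 6 tokens
print(round(ttr, 3))
```

A plain ratio like this tends to fall as texts get longer, which is one reason length-robust measures such as Hypergeometric Distribution Diversity exist.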

Fixed

- Added missing import, args in `TextStats` docs (PR 331, Issue 334)
- Fixed normalization in YAKE keyword extraction (PR 332)
- Fixed text encoding issue when loading `ConceptNet` data on Windows systems (Issue 345)

Contributors

Thanks to austinjp, scarroll32, MirkoLenz for their help!

0.11.0

- **Refactored, standardized, and extended several areas of functionality**
- text preprocessing (`textacy.preprocessing`)
- Added functions for normalizing bullet points in lists (`normalize.bullet_points()`), removing HTML tags (`remove.html_tags()`), and removing bracketed contents such as in-line citations (`remove.brackets()`).
- Added `make_pipeline()` function for combining multiple preprocessors applied sequentially to input text into a single callable.
- Renamed functions for flexibility and clarity of use; in most cases, this entails replacing an underscore with a period, e.g. `preprocessing.normalize_whitespace()` => `preprocessing.normalize.whitespace()`.
- Renamed and standardized some funcs' args; for example, all "replace" functions had their (optional) second argument renamed from `replace_with` => `repl`, and `remove.punctuation(text, marks=".?!")` => `remove.punctuation(text, only=[".", "?", "!"])`.
- structured information extraction (`textacy.extract`)
- Consolidated and restructured functionality previously spread across the `extract.py` and `text_utils.py` modules and `ke` subpackage. For the latter two, imports have changed:
- `from textacy import ke; ke.textrank()` => `from textacy import extract; extract.keyterms.textrank()`
- `from textacy import text_utils; text_utils.keywords_in_context()` => `from textacy import extract; extract.keywords_in_context()`
- Added new extraction functions:
- `extract.regex_matches()`: For matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens.
- `extract.acronyms()`: For extracting acronym-like tokens, without looking around for related definitions.
- `extract.terms()`: For flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication.
- Improved the generality and quality of extracted "triples" such as Subject-Verb-Objects, and changed the structure of returned objects accordingly. Previously, only contiguous spans were permitted for each element, but this was overly restrictive: A sentence like "I did not really like the movie." would produce an SVO of `("I", "like", "movie")` which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces `(["I"], ["did", "not", "like"], ["movie"])`. For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g. `svo.subject` == `svo[0]`).
- Changed `extract.keywords_in_context()` to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to accept `Doc` or `str` objects as input.
- Removed deprecated `extract.pos_regex_matches()` function, which is superseded by the more powerful `extract.token_matches()`.
- string and sequence similarity metrics (`textacy.similarity`)
- Refactored top-level `similarity.py` module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics.
- Added several similarity metrics:
- edit-based Jaro (`similarity.jaro()`)
- token-based Cosine (`similarity.cosine()`), Bag (`similarity.bag()`), and Tversky (`similarity.tversky()`)
- sequence-based Matching Subsequences Ratio (`similarity.matching_subsequences_ratio()`)
- hybrid Monge-Elkan (`similarity.monge_elkan()`)
- Removed a couple similarity metrics: Word Movers Distance relied on a troublesome external dependency, and Word2Vec+Cosine is available in spaCy via `Doc.similarity`.
- network- and vector-based document representations (`textacy.representations`)
- Consolidated and reworked networks functionality in `representations.network` module
- Added `build_cooccurrence_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred.
- Added `build_similarity_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity.
- Removed obsolete `network.py` module and duplicative `extract.keyterms.graph_base.py` module.
- Refined vectorizer initialization, and moved from `vsm.vectorizers` to `representations.vectorizers` module.
- For both `Vectorizer` and `GroupVectorizer`, applying global inverse document frequency weights is now handled by a single arg: `idf_type: Optional[str]`, rather than a combination of `apply_idf: bool, idf_type: str`; similarly, applying document-length weight normalizations is handled by `dl_type: Optional[str]` instead of `apply_dl: bool, dl_type: str`
- Added `representations.sparse_vec` module for higher-level access to document vectorization via `build_doc_term_matrix()` and `build_grp_term_matrix()` functions, for cases when a single fit+transform is all you need.
- automatic language identification (`textacy.lang_id`)
- Moved functionality from `lang_utils.py` module into a subpackage, and added the primary user interface (`identify_lang()` and `identify_topn_langs()`) as package-level imports.
- Implemented and trained a more accurate `thinc`-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler `sklearn`-based pipeline.
- **Updated interface with spaCy for v3, and better leveraged the new functionality**
- Restricted `textacy.load_spacy_lang()` to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, `textacy` can no longer play fast and loose with automatic language identification => pipeline loading...
- Extended `textacy.make_spacy_doc()` to accept a `chunk_size` arg that splits input text into chunks, processes each individually, then joins them into a single `Doc`; supersedes `spacier.utils.make_doc_from_text_chunks()`, which is now deprecated.
- Moved core `Doc` extensions into a top-level `extensions.py` module, and improved/streamlined the collection
- Refactored and improved performance of `Doc._.to_bag_of_words()` and `Doc._.to_bag_of_terms()`, leveraging related functionality in `extract.words()` and `extract.terms()`
- Removed redundant/awkward extensions:
- `Doc._.lang` => use `Doc.lang_`
- `Doc._.tokens` => use `iter(Doc)`
- `Doc._.n_tokens` => `len(Doc)`
- `Doc._.to_terms_list()` => `extract.terms(doc)` or `Doc._.extract_terms()`
- `Doc._.to_tagged_text()` => NA, this was an old holdover that's not used in practice anymore
- `Doc._.to_semantic_network()` => NA, use a function in `textacy.representations.networks`
- Added `Doc` extensions for `textacy.extract` functions (see above for details), with most functions having direct analogues; for example, to extract acronyms, use either `textacy.extract.acronyms(doc)` or `doc._.extract_acronyms()`. Keyterm extraction functions share a single extension: `textacy.extract.keyterms.textrank(doc)` <> `doc._.extract_keyterms(method="textrank")`
- Leveraged spaCy's new `DocBin` for efficiently saving/loading `Doc`s in binary format, with corresponding arg changes in `io.write_spacy_docs()` and `Corpus.save()`+`.load()`
- **Improved package documentation, tests, dependencies, and type annotations**
- Added two beginner-oriented tutorials to documentation, showing how to use various aspects of the package in the context of specific tasks.
- Reorganized API reference docs to put like functionality together and more consistently provide summary tables up top
- Updated dependencies list and package versions
- Removed: `pyemd` and `srsly`
- Un-capped max versions: `numpy` and `scikit-learn`
- Bumped min versions: `cytoolz`, `jellyfish`, `matplotlib`, `pyphen`, and `spacy` (v3.0+ only!)
- Bumped min Python version from 3.6 => 3.7, and added PY3.9 support
- Removed `textacy.export` module, which had functions for exporting spaCy docs into other external formats; this was a soft dependency on `gensim` and CONLL-U that wasn't enforced or guaranteed, so better to remove.
- Added `types.py` module for shared types, and used them everywhere. Also added/fixed type annotations throughout the code base.
- Improved, added, and parametrized literally hundreds of tests.
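
The `make_pipeline()` idea described under text preprocessing above, combining several preprocessors that are applied in order into one callable, can be sketched in plain Python. This is an illustrative sketch only, not textacy's implementation: the two preprocessors below are simplified, hypothetical stand-ins for `remove.html_tags()` and `normalize.whitespace()`.

```python
import re
from functools import reduce
from typing import Callable

Preprocessor = Callable[[str], str]


def remove_html_tags(text: str) -> str:
    """Simplified stand-in: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def normalize_whitespace(text: str) -> str:
    """Simplified stand-in: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def make_pipeline(*funcs: Preprocessor) -> Preprocessor:
    """Return one callable that applies each function in the order given."""

    def pipeline(text: str) -> str:
        # Feed the output of each preprocessor into the next one.
        return reduce(lambda acc, func: func(acc), funcs, text)

    return pipeline


preproc = make_pipeline(remove_html_tags, normalize_whitespace)
cleaned = preproc("<p>Hello,   <b>world</b>!</p>\n")
print(cleaned)
```

Order matters in a pipeline like this: removing tags first can leave stray spaces behind, so whitespace normalization is placed last.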

Contributors

Many thanks to timgates42, datanizing, 8W9aG, 0x2b3bfa0, and gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.

0.10.1

New and Changed:

- **Expanded text statistics and refactored into a sub-package (PR 307)**
- Refactored `text_stats` module into a sub-package with the same name and top-level API, but restructured under the hood for better consistency
- Improved performance, API, and documentation on the main `TextStats` class, and improved documentation on many of the individual stats functions
- Added new readability tests for texts in Arabic (Automated Arabic Readability Index), Spanish (µ-legibility and perspicuity index), and Turkish (a lang-specific formulation of Flesch Reading Ease)
- _Breaking change:_ Removed `TextStats.basic_counts` and `TextStats.readability_stats` attributes, since typically only one or a couple are needed for a given use case; also, some of the readability tests are language-specific, which meant bad results could get mixed in with good ones
- **Improved and standardized some code quality and performance (PR 305, 306)**
- Standardized error messages via top-level `errors.py` module
- Replaced `str.format()` with f-strings (almost) everywhere, for performance and readability
- Fixed a whole mess of linting errors, significantly improving code quality and consistency
- **Improved package configuration and maintenance (PRs 298, 305, 306)**
- Added automated GitHub workflows for building and testing the package, linting and formatting, publishing new releases to PyPI, and building documentation (and ripped out Travis CI)
- Added a makefile with common commands for dev work, plus instructions
- Adopted the new `pyproject.toml` package configuration standard; updated and streamlined `setup.py` and `setup.cfg` accordingly; and removed `requirements.txt`
- Moved all source code into a `/src` directory, for technical reasons
- Added `mypy`-specific config file to reduce output noisiness when type-checking
- **Improved and moved package documentation (PR 309)**
- Moved the docs site back to ReadTheDocs (https://textacy.readthedocs.io)! Pardon the years-long detour into GitHub Pages...
- Enabled markdown-based documentation using `recommonmark` instead of `m2r`, and migrated all "narrative" docs from `.rst` to equivalent `.md` files
- Added auto-generated summary tables to many sections of the API Reference, to help users get an overview of functionality and better find what they're looking for; also added auto-generated section heading references
- Tidied up and further standardized docstrings throughout the code
- **Kept up with the Python ecosystem**
- Trained a v1.1 language identifier model using `scikit-learn==0.23.0`, and bumped the upper bound on that dependency's version accordingly
- Updated and parametrized many tests using modern `pytest` functionality (PR 306)
- Got `textacy` versions 0.9.1 and 0.10.0 up on `conda-forge` (Issue 294)
- Added spectral seriation as a term-ordering technique when making a "Termite" visualization by taking advantage of `pandas.DataFrame` functionality, and otherwise tidied up the default for nice-looking plots (PR 295)

Fixed:

- Fixed an incorrect and misleading reference in the quickstart docs (Issue 300, PR 302)
- Fixed a bug in the `delete_words()` augmentation transform (Issue 308)

Contributors:

Special thanks to tbsexton, marius-mather, and rmax for their contributions! 💐

0.10.0

New:

- Added a logo to textacy's documentation and social preview 📃
- Added type hints throughout the code base, for more expressive type indicators in docstrings and for static type checkers used by developers to code more effectively (PR 289)
- Added a preprocessing function to normalize sequences of repeating characters (Issue 275)
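
To illustrate what normalizing sequences of repeating characters can look like, here is a minimal regex-based sketch. The function name `normalize_repeating_chars` and its `chars` and `maxn` parameters are assumptions made for this example; check the textacy API reference for the actual signature.

```python
import re


def normalize_repeating_chars(text: str, chars: str, maxn: int = 1) -> str:
    """Hypothetical helper: cap consecutive repeats of `chars` at `maxn`."""
    # Match maxn + 1 or more back-to-back occurrences of `chars`.
    pattern = rf"(?:{re.escape(chars)}){{{maxn + 1},}}"
    # A function replacement avoids backslash interpretation in `chars`.
    return re.sub(pattern, lambda _match: chars * maxn, text)


shout = normalize_repeating_chars("Wow!!!!", chars="!", maxn=1)
laugh = normalize_repeating_chars("hahahaha", chars="ha", maxn=2)
print(shout, laugh)
```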

Changed:

- Improved core `Corpus` functionality using recent additions to spaCy (PR 285)
- Re-implemented `Corpus.save()` and `Corpus.load()` using spaCy's new `DocBin` class, which resolved a few bugs/issues (Issue 254)
- Added `n_process` arg to `Corpus.add()` to set the number of parallel processes used when adding many items to a corpus, following spaCy's updates to `nlp.pipe()` (Issue 277)
- Bumped minimum spaCy version from 2.0.12 => 2.2.0, accordingly
- Added handling for zero-width whitespaces into `normalize_whitespace()` function (Issue 278)
- Improved a couple rough spots in package administration:
- Moved package setup information into a declarative configuration file, in an attempt to keep up with evolving best practices for Python packaging
- Simplified the configuration and interoperability of Sphinx + GitHub Pages for generating package documentation
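
Zero-width whitespace needs explicit handling because Python's built-in whitespace handling (`str.split()`, regex `\s`) does not treat characters such as U+200B as whitespace. The sketch below shows one way to do this, assuming zero-width characters are simply dropped and all other whitespace runs are collapsed to a single space. It is a simplified, hypothetical stand-in, not textacy's `normalize_whitespace()` implementation.

```python
import re

# Zero-width space, word joiner, and zero-width no-break space (BOM).
ZERO_WIDTH_RE = re.compile("[\u200b\u2060\ufeff]+")


def normalize_whitespace(text: str) -> str:
    """Simplified stand-in: drop zero-width chars, then collapse whitespace."""
    # str.split() alone would leave these invisible characters in place.
    text = ZERO_WIDTH_RE.sub("", text)
    # Unlike a fuller implementation, this also flattens line breaks.
    return " ".join(text.split())


raw = "Hello\u200b   world\ufeff\n"
clean = normalize_whitespace(raw)
print(repr(clean))
```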

Fixed:

- Fixed typo in ConceptNet docstring (Issue 280)
- Trained and distributed a `LangIdentifier` model using `scikit-learn==0.22`, to prevent ambiguous errors when trying to load a file that didn't exist (Issues 291, 292)

0.9.1

Changed:

- Tweaked `TopicModel` class to work with newer versions of `scikit-learn`, and updated version requirements accordingly from `>=0.18.0,<0.21.0` to `>=0.19`

Fixed:

- Fixed residual bugs in the script for training language identification pipelines, then trained and released one using `scikit-learn==0.19` to prevent errors for users on that version
