Stanza

Latest version: v1.10.1

Page 4 of 5

1.2.2

Not secure
Overview

A regression in NER results occurred in 1.2.1 when fixing a bug in VI models based around spaces.

Bugfixes

- **Fix Sentiment not loading correctly on Windows because of pickling issue** (https://github.com/stanfordnlp/stanza/pull/742) (thanks to BramVanroy)

- **Fix NER bulk process not filling out data structures as expected** (https://github.com/stanfordnlp/stanza/issues/721) (https://github.com/stanfordnlp/stanza/pull/722)

- **Fix NER space issue causing a performance regression** (https://github.com/stanfordnlp/stanza/issues/739) (https://github.com/stanfordnlp/stanza/pull/732)

Interface improvements

- **Add an NER run script** (https://github.com/stanfordnlp/stanza/pull/738)

1.2.1

Not secure
Overview

All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations for tokenization issues and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.

Model improvements

- **Add Bulgarian, Finnish, Hungarian, Vietnamese NER models**
  - The Bulgarian model is trained on BSNLP 2019 data.
  - The Finnish model is trained on the Turku NER data.
  - The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
  - The Vietnamese model is trained on the VLSP 2018 data.
- Furthermore, the script for preparing the lang-uk NER data has been integrated (https://github.com/stanfordnlp/stanza/commit/c1f0bee1074997d9376adaec45dc00f813d00b38)

- **Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset** (https://github.com/stanfordnlp/stanza/pull/718/commits/d9e8301addc93450dc880b06cb665ad10d869242)

- **Add copy mechanism in the seq2seq model**. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (https://github.com/stanfordnlp/stanza/pull/692 https://github.com/stanfordnlp/stanza/issues/684)

- **Fix Spanish POS and depparse mishandling sentences where the leading `¿` is missing** (https://github.com/stanfordnlp/stanza/pull/699 https://github.com/stanfordnlp/stanza/issues/698)

- **Fix tokenization breaking when a newline splits a Chinese token** (https://github.com/stanfordnlp/stanza/pull/632 https://github.com/stanfordnlp/stanza/issues/531)

- **Fix tokenization of parentheses in Chinese** (https://github.com/stanfordnlp/stanza/commit/452d842ed596bb7807e604eeb2295fd4742b7e89)

- **Fix various issues with characters not present in UD training data** such as ellipsis characters or Unicode apostrophes
(https://github.com/stanfordnlp/stanza/pull/719/commits/db0555253f0a68c76cf50209387dd2ff37794197 https://github.com/stanfordnlp/stanza/pull/719/commits/f01a1420755e3e0d9f4d7c2895e0261e581f7413 https://github.com/stanfordnlp/stanza/pull/719/commits/85898c50f14daed75b96eed9cd6e9d6f86e2d197)

- **Fix a variety of issues with Vietnamese tokenization** - removed a language-specific model improvement which gained roughly 1% F1 but caused numerous hard-to-track issues (https://github.com/stanfordnlp/stanza/pull/719/commits/3ccb132e03ce28a9061ec17d2c0ae84cc2000548)

- **Fix spaces in the Vietnamese words not being found in the embedding used for POS and depparse** (https://github.com/stanfordnlp/stanza/pull/719/commits/197212269bc33b66759855a5addb99d1f465e4f4)

- **Include UD_English-GUMReddit in the GUM models** (https://github.com/stanfordnlp/stanza/pull/719/commits/9e6367cb9bdd635d579fd8d389cb4d5fa121c413)

- **Add Pronouns & PUD to the mixed English models** (various data improvements made this more appealing) (https://github.com/stanfordnlp/stanza/pull/719/commits/f74bef7b2ed171bf9c027ae4dfd3a10272040a46)

Interface enhancements

- **Add ability to pass a Document to the pipeline in pretokenized mode** (https://github.com/stanfordnlp/stanza/commit/f88cd8c2f84aedeaec34a11b4bc27573657a66e2 https://github.com/stanfordnlp/stanza/issues/696)

- **Track comments when reading and writing conll files** (https://github.com/stanfordnlp/stanza/pull/676 originally from danielhers in https://github.com/stanfordnlp/stanza/pull/155)

- **Add a proxy parameter for downloads to pass through to the requests module** (https://github.com/stanfordnlp/stanza/pull/638)

- **Add `sent_idx` to tokens** (https://github.com/stanfordnlp/stanza/commit/ee6135c538e24ff37d08b86f34668ccb223c49e1)
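
The comment tracking mentioned above can be pictured with a small stand-alone reader and writer. This is only a sketch, not Stanza's implementation: `read_conllu` and `write_conllu` are hypothetical helpers, and the only assumptions are the CoNLL-U conventions that comment lines start with `#`, fields are tab-separated, and sentences are separated by blank lines.

```python
from typing import List, Tuple

# One sentence = (comment lines, token rows); a hypothetical structure.
Sentence = Tuple[List[str], List[List[str]]]


def read_conllu(text: str) -> List[Sentence]:
    """Parse CoNLL-U text, keeping '#' comment lines with each sentence."""
    sentences: List[Sentence] = []
    comments: List[str] = []
    rows: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if comments or rows:
                sentences.append((comments, rows))
                comments, rows = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            rows.append(line.split("\t"))
    if comments or rows:
        sentences.append((comments, rows))
    return sentences


def write_conllu(sentences: List[Sentence]) -> str:
    """Serialize sentences back to CoNLL-U, writing comments first."""
    blocks = []
    for comments, rows in sentences:
        lines = comments + ["\t".join(row) for row in rows]
        blocks.append("\n".join(lines) + "\n\n")
    return "".join(blocks)


sample = (
    "# sent_id = 1\n"
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "\n"
)
parsed = read_conllu(sample)
round_tripped = write_conllu(parsed)
```

Because the comments travel with their sentence, writing the parsed data back out reproduces the `# sent_id` and `# text` lines instead of silently dropping them.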

Bugfixes

- **Fix Windows encoding issues when reading conll documents** from yanirmr (b40379eaf229e7ffc7580def57ee1fad46080261 https://github.com/stanfordnlp/stanza/pull/695)

- **Fix tokenization breaking when second batch is exactly eval_length** (https://github.com/stanfordnlp/stanza/commit/726368644d7b1019825f915fabcfe1e4528e068e https://github.com/stanfordnlp/stanza/issues/634 https://github.com/stanfordnlp/stanza/issues/631)

Efficiency improvements

- **Bulk process for tokenization** - greatly speeds up the use case of many small docs (https://github.com/stanfordnlp/stanza/pull/719/commits/5d2d39ec822c65cb5f60d547357ad8b821683e3c)

- **Optimize MWT usage in pipeline & fix MWT bulk_process** (https://github.com/stanfordnlp/stanza/pull/642 https://github.com/stanfordnlp/stanza/pull/643 https://github.com/stanfordnlp/stanza/pull/644)

CoreNLP integration

- **Add a UD Enhancer tool which interfaces with CoreNLP's generic enhancer** (https://github.com/stanfordnlp/stanza/pull/675)

- **Add an interface to CoreNLP tokensregex using stanza tokenization** (https://github.com/stanfordnlp/stanza/pull/659)

1.2.0

Overview

All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.

New features and enhancements

- **Models trained on combined datasets in English and Italian** The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, Twittiro, PosTWITA, and a custom dataset including MWT tokens.

- **NER Transfer Learning** Allows users to fine-tune all or part of the parameters of trained NER models on a new dataset for transfer learning (351, thanks to gawy for the contribution)

- **Multi-document support** The Stanza `Pipeline` now supports multi-`Document` input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza `Document` objects into the `Pipeline`. (https://github.com/stanfordnlp/stanza/issues/70 https://github.com/stanfordnlp/stanza/pull/577)

- **Added API links from token to sentence** It's easier to access Stanza data objects from related ones. To access the sentence object of a token or a word, simply use `token.sent` or `word.sent`. (https://github.com/stanfordnlp/stanza/issues/533 https://github.com/stanfordnlp/stanza/pull/554)

- **New external tokenizer for Thai with PyThaiNLP** Try it out with, for example, `stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None)`. (https://github.com/stanfordnlp/stanza/pull/567)

- **Faster tokenization** We have improved how the data pipeline works internally to reduce redundant data wrangling, and significantly sped up the tokenization of long texts. If you have a really long line of text, you could experience up to 10x speedup or more without changing anything. (522)

- **Added a method for getting all the supported languages from the resources file** Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try `stanza.resources.common.list_available_languages()`. (https://github.com/stanfordnlp/stanza/issues/511 https://github.com/stanfordnlp/stanza/commit/fa52f8562f20ab56807b35ba204d6f9ca60b47ab)

- **Load mwt automagically if a model needs it** Multi-word token expansion is one of the most common things to miss from your `Pipeline` instantiation, and remembering to include it is a pain -- until now. (https://github.com/stanfordnlp/stanza/pull/516 https://github.com/stanfordnlp/stanza/issues/515 and many others)

- **Vietnamese sentiment model based on VSFC** This is now part of the default language package for Vietnamese that you get from `stanza.download("vi")`. Enjoy!

- **More informative errors for missing models** Stanza now throws more helpful exceptions with informative exception messages when you are missing models (https://github.com/stanfordnlp/stanza/pull/437 https://github.com/stanfordnlp/stanza/issues/430 ... https://github.com/stanfordnlp/stanza/issues/324 https://github.com/stanfordnlp/stanza/pull/438 ... https://github.com/stanfordnlp/stanza/issues/529 https://github.com/stanfordnlp/stanza/commit/953966539c955951d01e3d6b4561fab02a1f546c ... https://github.com/stanfordnlp/stanza/issues/575 https://github.com/stanfordnlp/stanza/pull/578)

Bugfixes

- **Fixed NER documentation for German** to correctly point to the GermEval 2014 model for download. (https://github.com/stanfordnlp/stanza/commit/4ee9f12be5911bb600d2f162b1684cb4686c391e https://github.com/stanfordnlp/stanza/issues/559)

- **External tokenization library integration respects `no_ssplit`** so you can enjoy using them without messing up your preferred sentence segmentation just like Stanza tokenizers. (https://github.com/stanfordnlp/stanza/issues/523 https://github.com/stanfordnlp/stanza/pull/556)

- **Telugu lemmatizer and tokenizer improvements** Telugu models now use an identity lemmatizer by default, and the tokenizer has been retrained to separate sentence-final punctuation (https://github.com/stanfordnlp/stanza/issues/524 https://github.com/stanfordnlp/stanza/commit/ba0aec30e6e691155bc0226e4cdbb829cb3489df)

- **Spanish model would not tokenize foo,bar** Now fixed (https://github.com/stanfordnlp/stanza/issues/528 https://github.com/stanfordnlp/stanza/commit/123d5029303a04185c5574b76fbed27cb992cadd)

- **Arabic model would not tokenize `asdf .`** Now fixed (https://github.com/stanfordnlp/stanza/issues/545 https://github.com/stanfordnlp/stanza/commit/03b7ceacf73870b2a15b46479677f4914ea48745)

- **Various tokenization models would split URLs and/or emails** Now URLs and emails are robustly handled with regexes. (https://github.com/stanfordnlp/stanza/issues/539 https://github.com/stanfordnlp/stanza/pull/588)

- **Various parser and pos models would deterministically label "punct" for the final word** Resolved via data augmentation (https://github.com/stanfordnlp/stanza/issues/471 https://github.com/stanfordnlp/stanza/issues/488 https://github.com/stanfordnlp/stanza/pull/491)

- **Norwegian tokenizers retrained to separate final punct** The fix is an upstream data fix (https://github.com/stanfordnlp/stanza/issues/305 https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal/pull/5)

- **Bugfix for conll eval** Fix the error in data conversion from a Python `Document` object to CoNLL format. (https://github.com/stanfordnlp/stanza/pull/484 https://github.com/stanfordnlp/stanza/issues/483, thanks m0re4u)

- **Less randomness in sentiment results** Fixes prediction fluctuation in sentiment prediction. (https://github.com/stanfordnlp/stanza/issues/458 https://github.com/stanfordnlp/stanza/commit/274474c3b0e4155ab6e221146ac347ca433f81a6)

- **Bugfix which should make it easier to use in jupyter / colab** This fixes the issue where jupyter notebooks (and by extension colab) don't like it when you use sys.stderr as the stderr of popen (https://github.com/stanfordnlp/stanza/pull/434 https://github.com/stanfordnlp/stanza/issues/431)

- **Misc fixes for training, concurrency, and edge cases in basic Pipeline usage**
  - **Fix for mwt training** (https://github.com/stanfordnlp/stanza/pull/446)
  - **Fix for race condition in seq2seq models** (https://github.com/stanfordnlp/stanza/pull/463 https://github.com/stanfordnlp/stanza/issues/462)
  - **Fix for race condition in CRF** (https://github.com/stanfordnlp/stanza/pull/566 https://github.com/stanfordnlp/stanza/issues/561)
  - **Fix for empty text in pipeline** (https://github.com/stanfordnlp/stanza/pull/475 https://github.com/stanfordnlp/stanza/issues/474)
  - **Fix for resources not freed when downloading** (https://github.com/stanfordnlp/stanza/issues/502 https://github.com/stanfordnlp/stanza/pull/503)
  - **Fix for Vietnamese pipeline not working** (https://github.com/stanfordnlp/stanza/issues/531 https://github.com/stanfordnlp/stanza/pull/535)
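
The URL and email fix in the list above relies on regexes. As a rough illustration of the idea, the sketch below protects matches from a naive splitter. The patterns, `simple_tokenize`, and `tokenize_protecting_urls` are all hypothetical; the expressions Stanza actually uses may differ.

```python
import re
from typing import List

# Illustrative patterns only; not the expressions used inside Stanza.
URL_PATTERN = r"(?:https?://|www\.)[^\s]+[^\s.,;:!?)]"
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
PROTECTED_RE = re.compile("|".join([URL_PATTERN, EMAIL_PATTERN]))


def simple_tokenize(text: str) -> List[str]:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_protecting_urls(text: str) -> List[str]:
    """Tokenize text while keeping each URL or email as one token."""
    tokens: List[str] = []
    pos = 0
    for match in PROTECTED_RE.finditer(text):
        # Tokenize the plain text before the protected span as usual.
        tokens.extend(simple_tokenize(text[pos:match.start()]))
        tokens.append(match.group(0))
        pos = match.end()
    tokens.extend(simple_tokenize(text[pos:]))
    return tokens


text = "Mail bob@example.com or see https://stanfordnlp.github.io/stanza/."
tokens = tokenize_protecting_urls(text)
```

The naive splitter alone would break the address at `@` and the URL at every `.` and `/`; matching the protected spans first keeps them whole while the sentence-final period is still split off.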

BREAKING CHANGES

- **Renamed `stanza.models.tokenize` -> `stanza.models.tokenization`** This stops the `tokenize` directory from shadowing a built-in library. (https://github.com/stanfordnlp/stanza/pull/452)

1.1.1

Not secure
Overview

This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the `CoreNLPClient` functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.

New Features and Enhancements

- **New Sentiment Analysis Models for English, German, Chinese**: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.

- **New Biomedical and Clinical English Model Packages**: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipeline and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit [Stanza's biomedical models page](https://stanfordnlp.github.io/stanza/biomed.html).

- **Support for Adding User Customized Processors via Python Decorators**: Stanza now supports adding customized processors or processor variants (i.e., alternatives to existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via the `register_processor` or `register_processor_variant` decorators. See the Stanza website for more information and examples (see [custom Processors](https://stanfordnlp.github.io/stanza/pipeline.html#building-your-own-processors-and-using-them-in-the-neural-pipeline) and [Processor variants](https://stanfordnlp.github.io/stanza/pipeline.html#processor-variants)). (PR https://github.com/stanfordnlp/stanza/pull/322)

- **Support for Editable Properties For Data Objects**: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., `Document`, `Sentence`, `Token`, etc). Aside from the annotation they already support, additional annotation can be easily attached through `data_object.add_property()`. See [our documentation](https://stanfordnlp.github.io/stanza/data_objects.html#adding-new-properties-to-stanza-data-objects) for more information and examples. (PR https://github.com/stanfordnlp/stanza/pull/323)

- **Support for Automated CoreNLP Installation and CoreNLP Model Download**: CoreNLP can now be easily downloaded in Stanza with `stanza.install_corenlp(dir='path/to/corenlp/installation')`; CoreNLP models can now be downloaded with `stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation')`. For more details please see the Stanza website. (PR https://github.com/stanfordnlp/stanza/pull/363)

- **Japanese Pipeline Supports SudachiPy as External Tokenizer**: You can now use the [SudachiPy library](https://github.com/WorksApplications/SudachiPy) as the tokenizer in a Stanza Japanese pipeline. Turn this on when building a pipeline with `nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'})`. Note that this will require a separate installation of the SudachiPy library via pip. (PR https://github.com/stanfordnlp/stanza/pull/365)

- **New Alternative Server for Stable Download of Resource Files**: Users in certain areas of the world who do not have stable access to GitHub servers can now download models from an alternative Stanford server by specifying a new `resources_url` argument. For example, `stanza.download(lang='en', resources_url='stanford')` will now download the resource file and English pipeline from Stanford servers. (Issue https://github.com/stanfordnlp/stanza/issues/331, PR https://github.com/stanfordnlp/stanza/pull/356)

- **`CoreNLPClient` Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server**: The `CoreNLPClient` now supports new `Enum` values with better semantics for its `start_server` argument, giving finer-grained control over how the server is launched. This includes a new option called `StartServer.TRY_START` that launches the CoreNLP Server if one isn't running already, but doesn't fail if one has already been launched. This option makes it easier for `CoreNLPClient` to be used in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend `StartServer.FORCE_START` and `StartServer.DONT_START` for better readability. (PR https://github.com/stanfordnlp/stanza/pull/302)

- **New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages**: Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages for which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue https://github.com/stanfordnlp/stanza/issues/399, PR https://github.com/stanfordnlp/stanza/pull/392)

- **New Tokenizer for Thai Language**: The available UD data for Thai is quite small. The authors of [pythainlp](https://github.com/PyThaiNLP/pythainlp) helped provide us with two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue https://github.com/stanfordnlp/stanza/issues/148)

- **Support for Serialization of Document Objects**: Now you can serialize and deserialize the entire document by running `serialized_string = doc.to_serialized()` and `doc = Document.from_serialized(serialized_string)`. The serialized string can be decoded into Python objects by running `objs = pickle.loads(serialized_string)`. (Issue https://github.com/stanfordnlp/stanza/issues/361, PR https://github.com/stanfordnlp/stanza/pull/366)

- **Improved Tokenization Speed**: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: https://github.com/stanfordnlp/stanza/commit/546ed13563c3530b414d64b5a815c0919ab0513a, https://github.com/stanfordnlp/stanza/commit/8e2076c6a0bc8890a54d9ed6931817b1536ae33c, https://github.com/stanfordnlp/stanza/commit/7f5be823a587c6d1bec63d47cd22818c838901e7, etc.)

- **User provided Ukrainian NER model**: We now have a [model](https://github.com/gawy/stanza-lang-uk/releases/tag/v0.9) built from the [lang-uk NER dataset](https://github.com/lang-uk/ner-uk), provided by a user for redistribution.
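
The `start_server` semantics described in the list above can be sketched with a stand-alone enum. This is an illustration under stated assumptions, not the `CoreNLPClient` code: the numeric values, `normalize_start_server`, and `should_launch` are hypothetical, and only the three member names come from the release notes.

```python
import enum
from typing import Union


class StartServer(enum.Enum):
    """Stand-in for the enum named above; the numeric values are illustrative."""
    DONT_START = 0
    FORCE_START = 1
    TRY_START = 2


def normalize_start_server(value: Union[bool, StartServer]) -> StartServer:
    """Map legacy booleans onto the enum and pass enum members through."""
    if isinstance(value, bool):
        return StartServer.FORCE_START if value else StartServer.DONT_START
    return StartServer(value)


def should_launch(mode: StartServer, server_running: bool) -> bool:
    """Return True when a launch should be attempted under the given mode."""
    if mode is StartServer.DONT_START:
        return False
    if mode is StartServer.FORCE_START:
        # A launch is always attempted, even if that attempt may then fail.
        return True
    # TRY_START: launch only when nothing is running yet.
    return not server_running
```

Under this reading, several worker processes can all pass `TRY_START`: the first one launches the server and the rest simply reuse it, which is what makes the option convenient for multiprocessing.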

Breaking Interface Changes

- **Token.id is Tuple and Word.id is Integer**: The `id` attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the `id` for a word will now return an integer to represent the word index. Previously both attributes were encoded as strings and required manual conversion for downstream processing. This change makes these attributes more convenient to handle. (Issue: https://github.com/stanfordnlp/stanza/issues/211, PR: https://github.com/stanfordnlp/stanza/pull/357)

- **Changed Default Pipeline Packages for Several Languages for Improved Robustness**: Languages that have changed default packages include: Polish (default is now `PDB` model, from previous `LFG`, https://github.com/stanfordnlp/stanza/issues/220), Korean (default is now `GSD`, from previous `Kaist`, https://github.com/stanfordnlp/stanza/issues/276), Lithuanian (default is now `ALKSNIS`, from previous `HSE`, https://github.com/stanfordnlp/stanza/issues/415).

- **CoreNLP 4.1.0 is required**: `CoreNLPClient` requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.

- **Properties Cache removed from CoreNLP client**: The `properties_cache` has been removed from `CoreNLPClient`, and the `annotate()` method of `CoreNLPClient` no longer has a `properties_key` argument. Python dictionaries with custom request properties should be directly supplied to `annotate()` via the `properties` argument.
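
For code that stored the old string IDs, the `Token.id` and `Word.id` change above amounts to a small conversion. The helpers below are a hypothetical sketch, assuming the old strings followed the CoNLL-U convention of `"3"` for a single word and `"1-2"` for a multi-word token range.

```python
from typing import Tuple


def token_id_from_string(raw: str) -> Tuple[int, ...]:
    """Convert an old-style string ID such as '3' or '1-2' into a tuple of ints."""
    return tuple(int(part) for part in raw.split("-"))


def word_id_from_string(raw: str) -> int:
    """Convert an old-style word ID string into an int."""
    return int(raw)


def token_id_to_string(token_id: Tuple[int, ...]) -> str:
    """Format a tuple ID back into the CoNLL-U style string."""
    return "-".join(str(index) for index in token_id)
```

With integer IDs, comparisons such as `word_id > 1` work directly, whereas the string form needed an explicit `int()` call first.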

Bugfixes and Other Improvements

- **Fixed Logging Behavior**: This is mainly for fixing the issue that Stanza will override the global logging setting in Python and influence downstream logging behaviors. (Issue https://github.com/stanfordnlp/stanza/issues/278, PR https://github.com/stanfordnlp/stanza/pull/290)

- **Compatibility Fix for PyTorch v1.6.0**: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues https://github.com/stanfordnlp/stanza/issues/412 https://github.com/stanfordnlp/stanza/issues/417, PR https://github.com/stanfordnlp/stanza/pull/406)

- **Improved Batching for Long Sentences in Dependency Parser**: This is mainly for fixing an issue where long sentences will cause an out of GPU memory issue in the dependency parser. (Issue https://github.com/stanfordnlp/stanza/issues/387)

- **Improved neural tokenizer robustness to whitespaces**: the neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters (PR https://github.com/stanfordnlp/stanza/pull/380)

- **Resolved properties issue when switching languages with requests to CoreNLP server**: An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.

1.0.1

Not secure
Overview

This is a maintenance release of Stanza. It features new support for `jieba` as a Chinese tokenizer, a faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and many more!

Enhancements

- **Supporting `jieba` library as Chinese tokenizer**. The Stanza (simplified and traditional) Chinese pipelines now support using the `jieba` Chinese word segmentation library as the tokenizer. Turn on this feature in a pipeline with `nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'})`, or by specifying the argument `tokenize_with_jieba=True`.

- **Setting resource directory with environment variable**. You can now override the default model location `$HOME/stanza_resources` by setting an environmental variable `STANZA_RESOURCES_DIR` (https://github.com/stanfordnlp/stanza/issues/227). The new directory will then be used to store and look up model files. Thanks to dhpollack for implementing this feature.

- **Faster lemmatizer implementation.** The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (https://github.com/stanfordnlp/stanza/issues/249). Thanks to mahdiman for identifying the original issue.

- **Improved compatibility with CoreNLP 4.0.0**. The client is now fully compatible with the latest [v4.0.0 release of the CoreNLP package](https://stanfordnlp.github.io/CoreNLP/).
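
The environment-variable override described in the list above follows a common lookup pattern. The sketch below shows that pattern only; `resolve_resources_dir` is a hypothetical helper rather than a Stanza function, and the fallback path mirrors the `$HOME/stanza_resources` default quoted in the notes.

```python
import os
from typing import Mapping, Optional


def resolve_resources_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model directory, preferring STANZA_RESOURCES_DIR when set."""
    # Accept an explicit mapping so the lookup can be tried without
    # touching the real process environment.
    env = os.environ if env is None else env
    default_dir = os.path.join(os.path.expanduser("~"), "stanza_resources")
    return env.get("STANZA_RESOURCES_DIR", default_dir)


custom = resolve_resources_dir({"STANZA_RESOURCES_DIR": "/data/stanza"})
fallback = resolve_resources_dir({})
```

In practice the variable would be set in the shell or service configuration before the Python process starts, so that downloads and lookups agree on the same directory.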

Bugfixes

- **Correct character offsets in NER outputs from pre-tokenized text**. We fixed an issue where the NER outputs from pre-tokenized text may be off-by-one (https://github.com/stanfordnlp/stanza/issues/229). Thanks to RyanElliott10 for reporting the issue.

- **Correct Vietnamese tokenization on sentences beginning with punctuation**. We fixed an issue where the Vietnamese tokenizer may throw an `AssertionError` on sentences that begin with punctuation (https://github.com/stanfordnlp/stanza/issues/217). Thanks to aryamccarthy for reporting this issue.

- **Correct pytorch version requirement**. Stanza now requires `pytorch>=1.3.0` to avoid a runtime error raised by pytorch (https://github.com/stanfordnlp/stanza/issues/231). Thanks to Vodkazy for reporting this.

Known Model Issues & Solutions

- **Default Korean Kaist tokenizer failing on punctuation.** The default Korean Kaist model is reported to have issues with separating punctuation during tokenization (https://github.com/stanfordnlp/stanza/issues/276). Switching to the Korean `GSD` model may solve this issue.

- **Default Polish LFG POS tagger incorrectly labeling last word in sentence as `PUNCT`**. The default Polish model trained on the `LFG` treebank may incorrectly tag the last word in a sentence as `PUNCT` (https://github.com/stanfordnlp/stanza/issues/220). This issue may be solved by switching to the Polish `PDB` model.

1.0.0

Not secure
Overview
This is the first major release of Stanza (previously known as [StanfordNLP](https://github.com/stanfordnlp/stanfordnlp/)), a software package to process many human languages. The main features of this release are
* **Multi-lingual named entity recognition support**. Stanza supports named entity recognition in 8 languages (and 12 datasets): Arabic, Chinese, Dutch, English, French, German, Russian, and Spanish. The most comprehensive NER model in each language is now part of the default model download for that language, along with other models trained on the largest dataset available.
* **Accurate neural network models**. Stanza features highly accurate data-driven neural network models for a wide collection of natural language processing tasks, including tokenization, sentence segmentation, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition.
* **State-of-the-art pretrained models freely available**. Stanza features a few hundred pretrained models for 60+ languages, all freely available and easily downloadable from native Python code. Most of these models achieve state-of-the-art (or competitive) performance on these tasks.
* **Expanded language support**. Stanza now supports more than 60 human languages, representing a wide range of language families.
* **Easy-to-use native Python interface**. We've improved the usability of the interface to maximize transparency. Now intermediate processing results are more easily viewed and accessed as native Python objects.
* **Anaconda support**. Stanza now officially supports installation from Anaconda. You can install Stanza through Stanford NLP Group's Anaconda channel `conda install -c stanfordnlp stanza`.
* **Improved documentation**. We have improved [our documentation](https://stanfordnlp.github.io/stanza/) to include comprehensive coverage of the basic and advanced functionalities supported by Stanza.
* **Improved CoreNLP support in Python**. We have improved the robustness and efficiency of the `CoreNLPClient` to access the Java CoreNLP software from Python code. It is also forward compatible with the next major release of CoreNLP.

Enhancements and Bugfixes

This release also contains many enhancements and bugfixes:
* [Enhancement] Improved lemmatization support with proper conditioning on POS tags (143). Thanks to nljubesi for the report!
* [Enhancement] Get the text corresponding to sentences in the document. Access it through `sentence.text`. (80)
* [Enhancement] Improved logging. Stanza now uses Python's `logging` for all procedural logging, which can be controlled globally either through `logging_level` or a `verbose` shortcut. See [this page](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) for more information. (81)
* [Enhancement] Allow the user to use the Stanza tokenizer with their own sentence split, which might be useful for applications like machine translation. Simply set `tokenize_no_ssplit` to `True` at pipeline instantiation. (108)
* [Enhancement] Support running the dependency parser only given tokenized, sentence segmented, and POS/morphological feature tagged data. Simply set `depparse_pretagged` to `True` at pipeline instantiation. (141) Thanks mrapacz for the contribution!
* [Enhancement] Added spaCy as an option for tokenizing (and sentence segmenting) English text for efficiency. See this [documentation page](https://stanfordnlp.github.io/stanza/tokenize.html#use-spacy-for-fast-tokenization-and-sentence-segmentation) for a quick example.
* [Enhancement] Add character offsets to tokens, sentences, and spans.
* [Bugfix] Correctly decide whether to load pretrained embedding files given training flags. (120)
* [Bugfix] Google proto buffers reporting errors for long input when using the `CoreNLPClient`. (154)
* [Bugfix] Remove deprecation warnings from newer versions of PyTorch. (162)
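
Since logging now goes through Python's `logging` module, library output can also be tuned with the standard logger hierarchy. The sketch below assumes the library's loggers live under the name `stanza`, which is not stated in these notes; `my_app` is a hypothetical application logger.

```python
import logging

# Assumption: the library logs under the "stanza" logger name.
stanza_logger = logging.getLogger("stanza")
stanza_logger.setLevel(logging.WARNING)  # hide INFO and DEBUG messages

# A hypothetical application logger keeps its own, more verbose level.
app_logger = logging.getLogger("my_app")
app_logger.setLevel(logging.DEBUG)

# Child loggers inherit the effective level from their parent.
child_level = logging.getLogger("stanza.pipeline").getEffectiveLevel()
```

This is independent of the `logging_level` and `verbose` options mentioned in the list, which are set at pipeline instantiation.
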

Breaking Changes

Note that if your code was developed on a previous version of the package, there are potentially many breaking changes in this release. The most notable changes are in the `Document` objects, which contain all the annotations for the raw text or document fed into the Stanza pipeline. The underlying implementations of `Document` and all related data objects no longer use the CoNLL-U format as their internal representation, which gives more flexibility and efficiency when accessing their attributes, although they remain compatible with CoNLL-U to maintain ease of conversion between the two. Moreover, many properties have been renamed for clarity and sometimes aliased for ease of access. Please see our documentation page about these [data objects](https://stanfordnlp.github.io/stanza/data_objects.html) for more information.

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.