Overview
This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the `CoreNLPClient` functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.
New Features and Enhancements
- **New Sentiment Analysis Models for English, German, Chinese**: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.
- **New Biomedical and Clinical English Model Packages**: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic analysis pipeline and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit [Stanza's biomedical models page](https://stanfordnlp.github.io/stanza/biomed.html).
- **Support for Adding User Customized Processors via Python Decorators**: Stanza now supports adding customized processors or processor variants (i.e., alternatives to existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via the `register_processor` or `register_processor_variant` decorators. See the Stanza website for more information and examples ([custom processors](https://stanfordnlp.github.io/stanza/pipeline.html#building-your-own-processors-and-using-them-in-the-neural-pipeline) and [processor variants](https://stanfordnlp.github.io/stanza/pipeline.html#processor-variants)). (PR https://github.com/stanfordnlp/stanza/pull/322)
- **Support for Editable Properties For Data Objects**: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., `Document`, `Sentence`, `Token`). Aside from the annotations they already support, additional annotations can be attached through `data_object.add_property()`. See [our documentation](https://stanfordnlp.github.io/stanza/data_objects.html#adding-new-properties-to-stanza-data-objects) for more information and examples. (PR https://github.com/stanfordnlp/stanza/pull/323)
- **Support for Automated CoreNLP Installation and CoreNLP Model Download**: CoreNLP can now be easily downloaded in Stanza with `stanza.install_corenlp(dir='path/to/corenlp/installation')`; CoreNLP models can now be downloaded with `stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation')`. For more details please see the Stanza website. (PR https://github.com/stanfordnlp/stanza/pull/363)
- **Japanese Pipeline Supports SudachiPy as External Tokenizer**: You can now use the [SudachiPy library](https://github.com/WorksApplications/SudachiPy) as the tokenizer in a Stanza Japanese pipeline. Enable it when building a pipeline with `nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'})`. Note that this requires a separate installation of the SudachiPy library via pip. (PR https://github.com/stanfordnlp/stanza/pull/365)
- **New Alternative Server for Stable Download of Resource Files**: Users in areas of the world without stable access to GitHub servers can now download models from an alternative Stanford server by specifying the new `resources_url` argument. For example, `stanza.download(lang='en', resources_url='stanford')` will download the resource file and English pipeline from Stanford servers. (Issue https://github.com/stanfordnlp/stanza/issues/331, PR https://github.com/stanfordnlp/stanza/pull/356)
- **`CoreNLPClient` Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server**: The `CoreNLPClient` now supports new `Enum` values with clearer semantics for its `start_server` argument, giving finer-grained control over how the server is launched. This includes a new option, `StartServer.TRY_START`, that launches the CoreNLP server if one isn't already running but doesn't fail if one has already been launched. This option makes it easier to use `CoreNLPClient` in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend `StartServer.FORCE_START` and `StartServer.DONT_START` for better readability. (PR https://github.com/stanfordnlp/stanza/pull/302)
- **New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages**: Stanford CoreNLP has a module that allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages for which CoreNLP already supported dependency trees. This release extends it to dependency graphs for any language. (Issue https://github.com/stanfordnlp/stanza/issues/399, PR https://github.com/stanfordnlp/stanza/pull/392)
- **New Tokenizer for Thai Language**: The available UD data for Thai is quite small. The authors of [pythainlp](https://github.com/PyThaiNLP/pythainlp) helped provide us with two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue https://github.com/stanfordnlp/stanza/issues/148)
- **Support for Serialization of Document Objects**: Now you can serialize and deserialize the entire document by running `serialized_string = doc.to_serialized()` and `doc = Document.from_serialized(serialized_string)`. The serialized string can be decoded into Python objects by running `objs = pickle.loads(serialized_string)`. (Issue https://github.com/stanfordnlp/stanza/issues/361, PR https://github.com/stanfordnlp/stanza/pull/366)
- **Improved Tokenization Speed**: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: https://github.com/stanfordnlp/stanza/commit/546ed13563c3530b414d64b5a815c0919ab0513a, https://github.com/stanfordnlp/stanza/commit/8e2076c6a0bc8890a54d9ed6931817b1536ae33c, https://github.com/stanfordnlp/stanza/commit/7f5be823a587c6d1bec63d47cd22818c838901e7, etc.)
- **User provided Ukrainian NER model**: We now have a [model](https://github.com/gawy/stanza-lang-uk/releases/tag/v0.9) built from the [lang-uk NER dataset](https://github.com/lang-uk/ner-uk), provided by a user for redistribution.
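To make the three-way sentiment labels described above concrete, here is a minimal sketch that maps sentence-level scores to label names. It assumes that each annotated sentence exposes an integer `sentiment` attribute where 0, 1 and 2 stand for negative, neutral and positive, and that the processor is named `sentiment`. The names `SENTIMENT_LABELS`, `label_sentences` and `build_sentiment_pipeline` are hypothetical helpers for illustration, not part of Stanza.

```python
# Assumed mapping from the integer score on each sentence to a label name.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def label_sentences(doc):
    """Return (sentence text, sentiment label) pairs for an annotated document.

    `doc` only needs a `sentences` list whose items expose `text` and an
    integer `sentiment` attribute.
    """
    return [(s.text, SENTIMENT_LABELS[s.sentiment]) for s in doc.sentences]


def build_sentiment_pipeline(lang="en"):
    """Build a pipeline that runs tokenization and sentiment analysis.

    Stanza is imported lazily so the helper above can be used without it.
    The models must have been downloaded first, e.g. with stanza.download(lang).
    """
    import stanza

    return stanza.Pipeline(lang=lang, processors="tokenize,sentiment")
```

With Stanza and the English models installed, calling `label_sentences(build_sentiment_pipeline()("Some text."))` should yield one pair per sentence.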
Breaking Interface Changes
- **Token.id is Tuple and Word.id is Integer**: The `id` attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the `id` for a word will now return an integer to represent the word index. Previously, both attributes were encoded as strings and required manual conversion for downstream processing. This change makes these attributes more convenient to handle. (Issue: https://github.com/stanfordnlp/stanza/issues/211, PR: https://github.com/stanfordnlp/stanza/pull/357)
- **Changed Default Pipeline Packages for Several Languages for Improved Robustness**: Languages that have changed default packages include: Polish (default is now `PDB` model, from previous `LFG`, https://github.com/stanfordnlp/stanza/issues/220), Korean (default is now `GSD`, from previous `Kaist`, https://github.com/stanfordnlp/stanza/issues/276), Lithuanian (default is now `ALKSNIS`, from previous `HSE`, https://github.com/stanfordnlp/stanza/issues/415).
- **CoreNLP 4.1.0 is required**: `CoreNLPClient` requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.
- **Properties Cache removed from CoreNLP client**: The `properties_cache` has been removed from `CoreNLPClient`, and its `annotate()` method no longer has a `properties_key` argument. Python dictionaries with custom request properties should be supplied directly to `annotate()` via the `properties` argument.
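Code written against earlier releases may still store token ids as strings. The following is a minimal sketch of converting between the two representations. It assumes the old strings followed the CoNLL-U convention (`"3"` for a single-word token, `"3-4"` for a multi-word token); the helper names `legacy_id_to_tuple` and `tuple_to_legacy_id` are hypothetical and not part of Stanza.

```python
def legacy_id_to_tuple(legacy_id):
    """Convert an old-style string id ("3" or "3-4") to a tuple of ints."""
    return tuple(int(part) for part in legacy_id.split("-"))


def tuple_to_legacy_id(token_id):
    """Convert a tuple id such as (3,) or (3, 4) back to "3" or "3-4"."""
    return "-".join(str(index) for index in token_id)
```

Word ids need no such helper: under the new interface they are plain integers, so a stored string word id converts with `int()`.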
Bugfixes and Other Improvements
- **Fixed Logging Behavior**: This fixes an issue where Stanza overrode Python's global logging settings and affected downstream logging behavior. (Issue https://github.com/stanfordnlp/stanza/issues/278, PR https://github.com/stanfordnlp/stanza/pull/290)
- **Compatibility Fix for PyTorch v1.6.0**: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues https://github.com/stanfordnlp/stanza/issues/412 https://github.com/stanfordnlp/stanza/issues/417, PR https://github.com/stanfordnlp/stanza/pull/406)
- **Improved Batching for Long Sentences in Dependency Parser**: This mainly fixes an issue where long sentences caused the dependency parser to run out of GPU memory. (Issue https://github.com/stanfordnlp/stanza/issues/387)
- **Improved neural tokenizer robustness to whitespaces**: The neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters. (PR https://github.com/stanfordnlp/stanza/pull/380)
- **Resolved properties issue when switching languages with requests to CoreNLP server**: An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.
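With the logging fix listed above, Stanza's verbosity can be adjusted without touching an application's own logging setup. The sketch below uses only the Python standard library and assumes that Stanza emits its messages through a logger named `stanza`; the helper name `set_stanza_log_level` is hypothetical.

```python
import logging


def set_stanza_log_level(level=logging.WARNING):
    """Set the level of the logger assumed to be used by Stanza.

    Only the "stanza" logger is changed; the root logger and any other
    application loggers keep their existing configuration.
    """
    stanza_logger = logging.getLogger("stanza")
    stanza_logger.setLevel(level)
    return stanza_logger
```

Scoping the change to a named logger keeps the rest of the application's log output as it was configured.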