V1.6.1 patches a bug in the Arabic POS tagger.
We also mark Python 3.11 as supported in the `setup.py` classifiers. **This will be the last release that supports Python 3.6.**
## Multiple model levels
The `package` parameter for building the `Pipeline` now has three default settings:
- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- `default-fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- `default-accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome
Furthermore, a package dictionary is now provided for each UD dataset, encompassing the default versions of the models for that dataset, although we do not further break those down into `-fast` and `-accurate` versions per UD dataset.
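As a sketch, the three levels can be requested via the `package` parameter when constructing a `Pipeline`; the `build_pipeline` helper below is hypothetical, and building the pipeline downloads models on first use:

```python
def build_pipeline(lang="en", level="default"):
    """Hypothetical helper: build a stanza.Pipeline at one of the default levels.

    "default"          - charlm for POS, depparse, and NER, but not lemma (as before)
    "default-fast"     - POS and depparse without charlm, substantially faster on CPU
    "default-accurate" - charlm for lemma too; transformers where available
    """
    valid = ("default", "default-fast", "default-accurate")
    if level not in valid:
        raise ValueError(f"unknown level {level!r}, expected one of {valid}")
    import stanza  # imported lazily; models are downloaded on first use
    return stanza.Pipeline(lang, package=level)

# usage sketch:
# nlp = build_pipeline("en", "default-fast")
```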
PR: https://github.com/stanfordnlp/stanza/pull/1287
Addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
## Multiple output heads for one NER model
The NER models now can learn multiple output layers at once.
PR: https://github.com/stanfordnlp/stanza/pull/1289
Theoretically this could be used to save a bit of encoder time while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

| Model | OntoNotes | WorldWide |
|---|---|---|
| original | 88.71 | 69.29 |
| simplify-separate | 88.24 | 75.75 |
| simplify-connected | 88.32 | 75.47 |
We also produced combined models with no charlm (`nocharlm`) and with Electra as the input encoding. The new English NER packages are `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.
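A minimal sketch of requesting one of these packages for just the NER processor via a per-processor `package` dict; the `ner_pipeline` helper name is our own, and the pipeline downloads models on first use:

```python
# The three combined English NER packages from this release
NER_PACKAGES = (
    "ontonotes-combined_nocharlm",
    "ontonotes-combined_charlm",
    "ontonotes-combined_electra-large",
)

def ner_pipeline(package="ontonotes-combined_charlm"):
    """Hypothetical helper: build an English NER pipeline with a combined model."""
    if package not in NER_PACKAGES:
        raise ValueError(f"unknown NER package: {package}")
    import stanza  # imported lazily; models are downloaded on first use
    return stanza.Pipeline("en", processors="tokenize,ner",
                           package={"ner": package})
```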
Future plans include using multiple NER datasets for other models as well.
## Other features
- Postprocessing of proposed tokenization is now possible with dependency injection on the `Pipeline` (thanks @Jemoka). When creating a `Pipeline`, you can provide a `callable` via the `tokenize_postprocessor` parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline`: https://github.com/stanfordnlp/stanza/pull/1290
- Finetuning of transformers in the NER models: we have not yet found settings which improve performance, though: https://github.com/stanfordnlp/stanza/commit/45ef5445f44222df862ed48c1b3743dc09f3d3fd
- SE and SME should both represent Northern Sami, an unusual case where UD did not use the standard two-letter language code https://github.com/stanfordnlp/stanza/issues/1279 https://github.com/stanfordnlp/stanza/commit/88cd0df5da94664cb04453536212812dc97339bb
- charlm for PT (improves accuracy on non-transformer models): https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a
- Built models with transformers for a few additional languages: MR, AR, PT, JA https://github.com/stanfordnlp/stanza/commit/45b387531c67bafa9bc41ee4d37ba0948daa9742 https://github.com/stanfordnlp/stanza/commit/0f3761ee63c57f66630a8e94ba6276900c190a74 https://github.com/stanfordnlp/stanza/commit/c55472acbd32aa0e55d923612589d6c45dc569cc https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a
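As a sketch of the `tokenize_postprocessor` hook described above, here is a hypothetical postprocessor that merges adjacent tokens. This assumes the callable receives and returns a nested list of token strings per sentence; see the PR for the exact structure Stanza passes (it may carry character offsets alongside each token):

```python
def merge_gonna(sentences):
    """Hypothetical tokenize_postprocessor: merge "gon" + "na" into one token.

    Assumes each sentence is a list of token strings; adapt to the actual
    structure passed by the Pipeline if it includes character offsets.
    """
    out = []
    for sent in sentences:
        merged = []
        for tok in sent:
            if merged and merged[-1].lower() == "gon" and tok.lower() == "na":
                merged[-1] = merged[-1] + tok  # fold "na" into the previous token
            else:
                merged.append(tok)
        out.append(merged)
    return out

# usage sketch:
# nlp = stanza.Pipeline("en", tokenize_postprocessor=merge_gonna)
```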
## Bugfixes
- V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: https://github.com/stanfordnlp/stanza/commit/b56f442d4d179c07411a44a342c224408eb6a6a9
- The SceneGraph CoreNLP connection is now checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 https://github.com/stanfordnlp/stanza/commit/c71bf3fdac8b782a61454c090763e8885d0e3824
- `run_ete.py` was not correctly processing the charlm, meaning the end-to-end test would not actually run: https://github.com/stanfordnlp/stanza/commit/16f29f3dcf160f0d10a47fec501ab717adf0d4d7
- Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 https://github.com/stanfordnlp/stanza/commit/82a02151da17630eb515792a508a967ef70a6cef