Stanza

Latest version: v1.10.1


1.8.1

V1.8.1 is a patch release on top of v1.8.0; the full 1.8.0 release notes appear below.

Additional 1.8.1 Bugfixes

- Older POS models were not loaded correctly; the loading code needs to use `.get()` https://github.com/stanfordnlp/stanza/commit/13ee3d5cbc2c9174c3e0c67ca75b580e4fe683b1 https://github.com/stanfordnlp/stanza/issues/1357
- Debug logging for the Constituency retag pipeline to better support someone working on Icelandic https://github.com/stanfordnlp/stanza/commit/6e2520f24d63fa8af4136f10137e57b195fda20a https://github.com/stanfordnlp/stanza/issues/1356
- `device` arg in `MultilingualPipeline` would crash if `device` was passed for an individual `Pipeline`: https://github.com/stanfordnlp/stanza/commit/44058a0ec296c6da5997bfaf8911a26d425d2cec
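
The `.get()` fix above is the usual pattern for loading older checkpoints that predate newer config keys; a generic sketch with hypothetical names, not Stanza's actual loading code:

```python
def merge_saved_args(saved_args, defaults):
    """Merge a (possibly old) checkpoint's args with current defaults.

    Older checkpoints may lack keys added in newer versions, so each key is
    read with dict.get(), which falls back to the default instead of raising
    KeyError the way saved_args[key] would.
    """
    return {key: saved_args.get(key, default) for key, default in defaults.items()}
```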

1.8.0

Integrating PEFT into several different annotators

We integrate [PEFT](https://github.com/huggingface/peft) into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the `default_accurate` model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the `default_accurate` package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

- POS trained with a split optimizer for the transformer and non-transformer components; unfortunately, we did not find settings which consistently improved results https://github.com/stanfordnlp/stanza/pull/1320
- Sentiment trained with PEFT on the transformer: noticeably improves results for each model. SST scores go from 68 F1 with the charlm, to 70 F1 with a transformer, to 74-75 F1 with a fully finetuned or PEFT-finetuned transformer. https://github.com/stanfordnlp/stanza/pull/1335
- NER also trained with PEFT: unfortunately, no consistent improvements to scores https://github.com/stanfordnlp/stanza/pull/1336
- depparse includes PEFT: no consistent improvements yet https://github.com/stanfordnlp/stanza/pull/1337 https://github.com/stanfordnlp/stanza/pull/1344
- Dynamic oracle for the top-down constituency parsing scheme: noticeable improvement in the scores for the top-down parser https://github.com/stanfordnlp/stanza/pull/1341
- Constituency parser uses PEFT: this produces significant improvements, close to the full benefit of finetuning the entire transformer. Example improvement: 87.01 to 88.11 F1 on the ID_ICON dataset. https://github.com/stanfordnlp/stanza/pull/1347
- Scripts to build a silver dataset for the constituency parser, filtering sentences by model agreement among the sub-models of the ensembles used. Preliminary results indicate an improvement in the benefit of the silver trees, with more work needed to find the optimal parameters for building the silver dataset. https://github.com/stanfordnlp/stanza/pull/1348
- Lemmatizer ignores `goeswith` words when training: this eliminates words which are a single word labeled with a single lemma but split into two tokens in the UD training data; a typical example would be split email addresses in the EWT training set. https://github.com/stanfordnlp/stanza/pull/1346 https://github.com/stanfordnlp/stanza/issues/1345

Features

- Include SpacesAfter annotations on words in the CoNLL output of documents: https://github.com/stanfordnlp/stanza/issues/1315 https://github.com/stanfordnlp/stanza/pull/1322
- Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. https://github.com/stanfordnlp/stanza/pull/1331 https://github.com/stanfordnlp/stanza/issues/1330
- wandb support for coref https://github.com/stanfordnlp/stanza/pull/1338
- Coref annotator breaks length ties using POS if available https://github.com/stanfordnlp/stanza/issues/1326 https://github.com/stanfordnlp/stanza/commit/c4c3de5803f27843a5050e10ccae71b3fd9c45e9
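
The `SpacesAfter` / `SpaceAfter=No` annotations in the MISC column make the raw text recoverable from CoNLL output. A minimal reader, our own sketch rather than Stanza's serializer:

```python
def detokenize(rows):
    """Rebuild raw text from (form, misc) pairs.

    misc is a CoNLL-U MISC string such as "SpaceAfter=No" or "SpacesAfter=\\n";
    "_" means the default single space follows the token.
    """
    out = []
    for form, misc in rows:
        out.append(form)
        # parse "key=value" features, ignoring bare fields like "_"
        feats = dict(f.split("=", 1) for f in misc.split("|") if "=" in f)
        if feats.get("SpaceAfter") == "No":
            continue
        spaces = feats.get("SpacesAfter", " ")
        # un-escape the whitespace escapes CoNLL-U files use
        out.append(spaces.replace("\\n", "\n").replace("\\t", "\t").replace("\\s", " "))
    return "".join(out)
```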

Bugfixes

- Using a proxy with `download_resources_json` was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you ider-zh
- Fix deprecation warnings for escape sequences: https://github.com/stanfordnlp/stanza/pull/1321 https://github.com/stanfordnlp/stanza/issues/1293 Thank you sterliakov
- Coref training rounding error https://github.com/stanfordnlp/stanza/pull/1342
- Top-down constituency models were broken for datasets which did not use ROOT as the top-level bracket; in practice this only affected DA_Arboretum https://github.com/stanfordnlp/stanza/pull/1354
- V1 of chopping longer texts into shorter pieces for the transformers, to get around their length limits. It is not yet clear whether this produces reasonable results for words after the token limit. https://github.com/stanfordnlp/stanza/pull/1350 https://github.com/stanfordnlp/stanza/issues/1294
- Coref prediction had an off-by-one error for short sentences which falsely threw an exception at sentence breaks: https://github.com/stanfordnlp/stanza/issues/1333 https://github.com/stanfordnlp/stanza/issues/1339 https://github.com/stanfordnlp/stanza/commit/f1fbaaad983e58dc3fcf318200d685663fb90737
- Clarify error when a language is only partially handled: https://github.com/stanfordnlp/stanza/commit/da01644b4ba5ba477c36e5d2736012b81bcd00d4 https://github.com/stanfordnlp/stanza/issues/1310
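
The text-chopping workaround above amounts to windowed chunking of the token sequence; a generic sketch of the idea (not Stanza's implementation):

```python
def chunk_ids(ids, max_len, overlap=0):
    """Split a token-id sequence into windows of at most max_len items.

    A nonzero overlap repeats the last `overlap` ids of each window at the
    start of the next, giving the model some left context past the limit.
    """
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")
    step = max_len - overlap
    return [ids[i:i + max_len] for i in range(0, len(ids), step)]
```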

1.7.0

Neural coref processor added!

The new processor is based on Conjunction-Aware Word-Level Coreference Resolution.

1.6.1

V1.6.1 patches a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the `setup.py` classifiers. **This will be the last release that supports Python 3.6.**

Bugfixes

- V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: https://github.com/stanfordnlp/stanza/commit/b56f442d4d179c07411a44a342c224408eb6a6a9


1.6.0

Multiple model levels

The `package` parameter for building the `Pipeline` now has three default settings:

- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- `default-fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- `default-accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into `-fast` and `-accurate` versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287, addressing https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
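
The three package levels are selected through the standard `Pipeline` constructor's `package` parameter. A minimal sketch; the validation helper and its names are our own, not part of Stanza:

```python
# The three package levels documented above.
PACKAGE_LEVELS = ("default", "default-fast", "default-accurate")

def build_pipeline(lang="en", level="default"):
    """Build a Stanza Pipeline at the requested speed/accuracy trade-off."""
    if level not in PACKAGE_LEVELS:
        raise ValueError(f"unknown package level: {level!r}")
    # imported lazily so the validation above runs even without models installed
    import stanza
    return stanza.Pipeline(lang, package=level)
```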

Multiple output heads for one NER model

The NER models now can learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:


| Model | OntoNotes F1 | WorldWide F1 |
|---|---|---|
| original | 88.71 | 69.29 |
| simplify-separate | 88.24 | 75.75 |
| simplify-connected | 88.32 | 75.47 |

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.

Future plans include using multiple NER datasets for other models as well.

Other features

- Postprocessing of the proposed tokenization is possible with dependency injection on the `Pipeline` (thank you Jemoka). When creating a `Pipeline`, you can now provide a callable via the `tokenize_postprocessor` parameter; it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline` https://github.com/stanfordnlp/stanza/pull/1290

- Finetuning for transformers in the NER models: we have not yet found helpful settings, though https://github.com/stanfordnlp/stanza/commit/45ef5445f44222df862ed48c1b3743dc09f3d3fd

- SE and SME should both represent Northern Sami, an unusual case where UD did not use the standard two-letter language code https://github.com/stanfordnlp/stanza/issues/1279 https://github.com/stanfordnlp/stanza/commit/88cd0df5da94664cb04453536212812dc97339bb

- charlm for PT (improves accuracy on non-transformer models): https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a

- build models with transformers for a few additional languages: MR, AR, PT, JA https://github.com/stanfordnlp/stanza/commit/45b387531c67bafa9bc41ee4d37ba0948daa9742 https://github.com/stanfordnlp/stanza/commit/0f3761ee63c57f66630a8e94ba6276900c190a74 https://github.com/stanfordnlp/stanza/commit/c55472acbd32aa0e55d923612589d6c45dc569cc https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a

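
The `tokenize_postprocessor` hook above receives the pipeline's candidate tokenization and returns an adjusted one. As a rough sketch, assuming the payload is a list of sentences, each a list of token strings (the real interface may carry more structure, such as character offsets):

```python
def merge_hyphenated(sentences):
    """Example postprocessor: re-join tokens split around a hyphen.

    Input and output: a list of sentences, each a list of token strings.
    """
    out = []
    for sent in sentences:
        merged, i = [], 0
        while i < len(sent):
            # token, "-", token  ->  one hyphenated token
            if i + 2 < len(sent) and sent[i + 1] == "-":
                merged.append(sent[i] + "-" + sent[i + 2])
                i += 3
            else:
                merged.append(sent[i])
                i += 1
        out.append(merged)
    return out
```

A callable like this would be passed as `tokenize_postprocessor=merge_hyphenated` when constructing the `Pipeline`.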

Bugfixes

- Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 https://github.com/stanfordnlp/stanza/commit/c71bf3fdac8b782a61454c090763e8885d0e3824

- `run_ete.py` was not correctly processing the charlm, meaning the whole thing wouldn't actually run https://github.com/stanfordnlp/stanza/commit/16f29f3dcf160f0d10a47fec501ab717adf0d4d7

- Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 https://github.com/stanfordnlp/stanza/commit/82a02151da17630eb515792a508a967ef70a6cef

1.5.1

Features

depparse can use a transformer as an embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a

Lemmatizer can remember (word, POS) pairs it has seen before, enabled with a flag https://github.com/stanfordnlp/stanza/issues/1263 https://github.com/stanfordnlp/stanza/commit/a87ffd0a4f43262457cf7eecf5555a621c6dc24e
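
Remembering previously seen (word, POS) pairs is essentially a lookup cache in front of the model; an illustrative sketch with hypothetical names, not Stanza's internals:

```python
class CachingLemmatizer:
    """Cache lemmas by (word, pos), falling back to a predict function."""

    def __init__(self, predict):
        self.predict = predict   # fallback lemmatizer, e.g. a seq2seq model
        self.seen = {}           # (word, pos) -> lemma

    def lemmatize(self, word, pos):
        key = (word, pos)
        if key not in self.seen:
            # only invoke the (slow) model on pairs we have never seen
            self.seen[key] = self.predict(word, pos)
        return self.seen[key]
```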

Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398

SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285

Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1

Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9

Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0

Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74

Bugfixes

Forgot to include the lemmatizer in CoreNLP 4.5.3; it is now included in 4.5.4 https://github.com/stanfordnlp/stanza/commit/4dda14bd585893044708c70e30c1c3efec509863 https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013

prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages https://github.com/stanfordnlp/stanza/commit/78ff85ce7eed596ad195a3f26474065717ad63b3

Fix an empty `bulk_process` throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278

Unroll the recursion in the Tarjan portion of the Chu-Liu-Edmonds algorithm; this should eliminate stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb
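
Unrolling recursion into an explicit stack follows a standard pattern; a minimal sketch of the technique on a plain depth-first search (not the Stanza code itself):

```python
def iterative_dfs(graph, start):
    """Depth-first traversal with an explicit stack instead of recursion.

    graph: dict mapping node -> list of neighbor nodes.
    Avoids RecursionError on deep graphs, which is the point of the unrolling.
    """
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # push neighbors in reverse so traversal matches the recursive order
        stack.extend(reversed(graph.get(node, [])))
    return order
```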

Minor updates

Put NER and POS scores on one line to make it easier to grep for: https://github.com/stanfordnlp/stanza/commit/da2ae33e8ef9e48842685dfed88896b646dba8c4 https://github.com/stanfordnlp/stanza/commit/8c4cb04d38c1101318755270f3aa75c54236e3fe

Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others

Pipeline uses `torch.no_grad()` for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3

Generalize save names, which eventually allows for putting `transformer`, `charlm` or `nocharlm` in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models

Add the model's flags to the `--help` for the `run` scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600

Remove the dependency on `six` https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you BLKSerene)

New Models

VLSP constituency https://github.com/stanfordnlp/stanza/commit/500435d3ec1b484b0f1152a613716565022257f2

VLSP constituency -> tagging https://github.com/stanfordnlp/stanza/commit/cb0f22d7be25af0b3b2790e3ce1b9dbc277c13a7

CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31

Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f

Added an Indonesian charlm

Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218

All languages with pretrained charlms now have an option to use that charlm for dependency parsing

French combined models out of `GSD`, `ParisStories`, `Rhapsodie`, and `Sequoia` https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b

UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058
