stanza Changelog

arXiv:2310.06165

original implementation: https://github.com/KarelDO/wl-coref/tree/master

Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref

If you use Stanza's coref module in your work, please be sure to cite both of the above papers.

Special thanks to [vdobrovolskii](https://github.com/vdobrovolskii), who graciously agreed to allow for integration of his work into Stanza, to KarelDO for his support of his training enhancement, and to Jemoka for the LoRA PEFT integration, which makes the finetuning of the transformer based coref annotator much less expensive.

Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.

Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower cost non-transformer models.

https://github.com/stanfordnlp/stanza/pull/1309

Interface change: English MWT

English now has an MWT model by default. Text such as `won't` is now marked as a single **token**, split into two **words**, `will` and `not`. Previously it was expected to be tokenized into two pieces, but the `Sentence` object containing that text would not have a single `Token` object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce **one** object for MWT such as `won't`, `cannot`, `Stanza's`, etc.
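The difference between iterating over tokens and over words can be sketched without loading any models. The `MiniToken` and `MiniWord` classes below are hypothetical stand-ins, not Stanza's own `Token` and `Word` classes; they only illustrate that an MWT such as `won't` is one token that owns two words.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniWord:
    # Hypothetical stand-in for a syntactic word
    text: str


@dataclass
class MiniToken:
    # Hypothetical stand-in for a surface token; an MWT holds several words
    text: str
    words: List[MiniWord] = field(default_factory=list)


def flatten_words(tokens: List[MiniToken]) -> List[MiniWord]:
    """Collect the words of every token, in order."""
    return [word for token in tokens for word in token.words]


tokens = [
    MiniToken("I", [MiniWord("I")]),
    MiniToken("won't", [MiniWord("will"), MiniWord("not")]),
    MiniToken("go", [MiniWord("go")]),
]
words = flatten_words(tokens)

token_texts = [t.text for t in tokens]  # one entry for the MWT
word_texts = [w.text for w in words]    # two entries for the MWT
```

Code that counts or aligns on tokens sees three items here, while code that works on words sees four.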

Pipeline creation will not change, as MWT is automatically (but not silently) added at `Pipeline` creation time if the language and package includes MWT.

https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1

Other updates

- NetworkX representation of enhanced dependencies. Allows for easier usage of Semgrex on enhanced dependencies - searching over enhanced dependencies requires CoreNLP >= 4.5.6 https://github.com/stanfordnlp/stanza/pull/1295 https://github.com/stanfordnlp/stanza/pull/1298
- Sentence ending punct tags improved for English to avoid labeling non-punct as punct (and POS is switched to using a DataLoader) https://github.com/stanfordnlp/stanza/issues/1000 https://github.com/stanfordnlp/stanza/pull/1303
- Optional rewriting of MWT after the MWT processing step, which gives the user more control over fixing common errors. We still encourage posting issues on GitHub so we can fix them for everyone! https://github.com/stanfordnlp/stanza/pull/1302
- Remove deprecated output methods such as `conll_as_string` and `doc2conll_text`. Use `"{:C}".format(doc)` instead https://github.com/stanfordnlp/stanza/commit/e01650f9c56382495082a9a24fa0310414c46651
- Mixed OntoNotes and WW NER model for English is now the default. Future versions may include CoNLL 2003 and CoNLL++ data as well.
- Sentences now have a `doc_id` field if the document they are created from has a `doc_id`. https://github.com/stanfordnlp/stanza/pull/1314/commits/8e2201f42cb99a5a3d8358ce59501c1d88f2585e
- Optional processors added in cases where the user may not want the model we have run by default. For example, conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model) https://github.com/stanfordnlp/stanza/pull/1314/commits/3d90d2b8a82048c5cea549b654e52544ed241833
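The `"{:C}".format(doc)` idiom mentioned above relies on Python's `__format__` hook, which receives the text after the colon as a format spec. The sketch below shows that mechanism on a hypothetical `MiniDoc` class; it is not Stanza's `Document`, and the two-column output is a simplification rather than real CoNLL-U.

```python
class MiniDoc:
    """Hypothetical document type; not Stanza's Document class."""

    def __init__(self, sentences):
        # Each sentence is a list of word strings
        self.sentences = sentences

    def __format__(self, spec):
        if spec == "C":
            # Simplified CoNLL-like output: index and word per line,
            # with a blank line between sentences
            blocks = []
            for sentence in self.sentences:
                lines = [f"{i}\t{w}" for i, w in enumerate(sentence, start=1)]
                blocks.append("\n".join(lines))
            return "\n\n".join(blocks)
        if spec == "":
            # Default formatting: plain space-separated text
            return " ".join(" ".join(s) for s in self.sentences)
        raise ValueError(f"Unknown format spec: {spec!r}")


doc = MiniDoc([["Hello", "world"], ["Bye"]])
conll_text = "{:C}".format(doc)
plain_text = "{}".format(doc)
```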

Updated requirements

- Support dropped for Python 3.6 and 3.7. The `peft` module used for finetuning the transformer used in the coref processor does not support those versions.
- Added `peft` as an optional dependency to transformer based installations
- Added `networkx` as a dependency for reading enhanced dependencies. Added `toml` as a dependency for reading the coref config.

1.10.0

In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.

Other notable changes:

- Include a contextual lemmatizer in English for `'s` -> `be` or `have` in the `default_accurate` package. Also built is a HI model. Others potentially to follow. https://github.com/stanfordnlp/stanza/pull/1422
- Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold https://github.com/stanfordnlp/stanza/commit/ad1f938276ef81ac9a602d7f1f21f50fd67e5d24
- PyTorch compatibility: set `weights_only=True` when loading models. https://github.com/stanfordnlp/stanza/pull/1430 https://github.com/stanfordnlp/stanza/issues/1429
- augment MWT tokenization to accommodate unexpected `'` characters, including `"` used in `"s` - https://github.com/stanfordnlp/stanza/pull/1437 https://github.com/stanfordnlp/stanza/issues/1436
- when training the lemmatizer, take advantage of `CorrectForm` annotations in the UD treebanks https://github.com/stanfordnlp/stanza/commit/dbdf429aff4175fec33856501e6899e96b390e86
- add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: https://github.com/stanfordnlp/stanza/commit/99f7038634101ea7b92140696c8383a333af1cbc
- add VLSP 2023 constituency dataset: https://github.com/stanfordnlp/stanza/commit/1159d0db8ea1d20c6cf9fb37f8fa8676e0f60f49

Bugfixes:

- `raise_for_status` earlier when failing to download something, so that the proper error gets displayed. Thank you pattersam https://github.com/stanfordnlp/stanza/pull/1432
- Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: https://github.com/stanfordnlp/stanza/commit/53081c28ba3128fc89ad36919762a54f6cb88f77
- reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: https://github.com/stanfordnlp/stanza/commit/1a36efb53135e53dd40ad550bc3a659c81b15980 https://github.com/stanfordnlp/stanza/issues/1436
- similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: https://github.com/stanfordnlp/stanza/commit/215c69e53bf9f11e174b82bb064767749f7dd403
- missing text for a Document does not cause the NER model to crash: https://github.com/stanfordnlp/stanza/commit/07326289ce0efef1ba17a0632c011652f884363c https://github.com/stanfordnlp/stanza/issues/1428
- tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: https://github.com/stanfordnlp/stanza/commit/f59ccd86b9d146737dd5c0325ac31e4da814ddfa https://github.com/stanfordnlp/stanza/issues/1423

1.9.2

multilingual coref!

- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref https://github.com/stanfordnlp/stanza/pull/1406

new features

- streamlit visualizer for semgrex/ssurgeon https://github.com/stanfordnlp/stanza/pull/1396
- updates to the constituency parser ensemble https://github.com/stanfordnlp/stanza/pull/1387
- accuracy improvements to the IN_ORDER oracle https://github.com/stanfordnlp/stanza/pull/1391
- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE https://github.com/stanfordnlp/stanza/issues/1417 https://github.com/stanfordnlp/stanza/pull/1419
- `download_method=None` now turns off HF downloads as well, for use in instances with no access to internet https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

new models

- Spanish combined models https://github.com/stanfordnlp/stanza/issues/1395
- Add IACLT knesset to the HE combined models
- NER based on IACLT
- XCL (Classical Armenian) models with word vectors from Caval

bugfixes

- update tqdm usage to remove some duplicate code: https://github.com/stanfordnlp/stanza/issues/1413 https://github.com/stanfordnlp/stanza/commit/3de69cac904cf023eba4463380b63bc3039be7fd
- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: https://github.com/stanfordnlp/stanza/issues/1410
- Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue https://github.com/stanfordnlp/stanza/commit/56350a0eebf4e2a7b3c54151f83b34db881553fc
- actually include the visualization: https://github.com/stanfordnlp/stanza/issues/1421 thank you bollwyvl
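The punctuation-dropping idea in the tokenizer bugfix above can be sketched as a small augmentation step. This is an illustration only: the function name, the `drop_prob` parameter, and the choice to operate on a single string rather than a training batch are all assumptions, not the tokenizer's actual training code.

```python
import random
import string


def maybe_drop_final_punct(text, rng, drop_prob=0.1):
    """With probability drop_prob, strip one trailing punctuation character."""
    if text and text[-1] in string.punctuation and rng.random() < drop_prob:
        return text[:-1].rstrip()
    return text


rng = random.Random(0)
# drop_prob=1.0 always removes the final punctuation; 0.0 never does
always = maybe_drop_final_punct("Hola mundo.", rng, drop_prob=1.0)
never = maybe_drop_final_punct("Hola mundo.", rng, drop_prob=0.0)
untouched = maybe_drop_final_punct("Hola mundo", rng, drop_prob=1.0)
```

Seeing some examples without final punctuation gives the model evidence that the last character is not always a separate token.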

1.9.1

multilingual coref!

- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref https://github.com/stanfordnlp/stanza/pull/1406

new features

- streamlit visualizer for semgrex/ssurgeon https://github.com/stanfordnlp/stanza/pull/1396
- updates to the constituency parser ensemble https://github.com/stanfordnlp/stanza/pull/1387
- accuracy improvements to the IN_ORDER oracle https://github.com/stanfordnlp/stanza/pull/1391
- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE https://github.com/stanfordnlp/stanza/issues/1417 https://github.com/stanfordnlp/stanza/pull/1419
- `download_method=None` now turns off HF downloads as well, for use in instances with no access to internet https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

new models

- Spanish combined models https://github.com/stanfordnlp/stanza/issues/1395
- Add IACLT knesset to the HE combined models
- NER based on IACLT
- XCL (Classical Armenian) models with word vectors from Caval

bugfixes

- update tqdm usage to remove some duplicate code: https://github.com/stanfordnlp/stanza/issues/1413 https://github.com/stanfordnlp/stanza/commit/3de69cac904cf023eba4463380b63bc3039be7fd
- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: https://github.com/stanfordnlp/stanza/issues/1410
- Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue https://github.com/stanfordnlp/stanza/commit/56350a0eebf4e2a7b3c54151f83b34db881553fc
- actually include the visualization: https://github.com/stanfordnlp/stanza/issues/1421 thank you bollwyvl

1.9.0

multilingual coref!

- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref https://github.com/stanfordnlp/stanza/pull/1406

new features

- streamlit visualizer for semgrex/ssurgeon https://github.com/stanfordnlp/stanza/pull/1396
- updates to the constituency parser ensemble https://github.com/stanfordnlp/stanza/pull/1387
- accuracy improvements to the IN_ORDER oracle https://github.com/stanfordnlp/stanza/pull/1391
- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE https://github.com/stanfordnlp/stanza/issues/1417 https://github.com/stanfordnlp/stanza/pull/1419
- `download_method=None` now turns off HF downloads as well, for use in instances with no access to internet https://github.com/stanfordnlp/stanza/pull/1408 https://github.com/stanfordnlp/stanza/issues/1399

new models

- Spanish combined models https://github.com/stanfordnlp/stanza/issues/1395
- Add IACLT knesset to the HE combined models
- NER based on IACLT
- XCL (Classical Armenian) models with word vectors from Caval

bugfixes

- update tqdm usage to remove some duplicate code: https://github.com/stanfordnlp/stanza/issues/1413 https://github.com/stanfordnlp/stanza/commit/3de69cac904cf023eba4463380b63bc3039be7fd
- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: https://github.com/stanfordnlp/stanza/issues/1410
- Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue https://github.com/stanfordnlp/stanza/commit/56350a0eebf4e2a7b3c54151f83b34db881553fc

1.8.2

Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.

Old English

- Add Old English (ANG) annotation! Thank you to dmetola https://github.com/stanfordnlp/stanza/issues/1365

MWT improvements

- Fix words ending with `-nna` split into MWT https://github.com/stanfordnlp/handparsed-treebank/commit/2c48d4093daddc790bf89d7b35c47ee4d7d272d1 https://github.com/stanfordnlp/stanza/issues/1366

- Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) https://github.com/stanfordnlp/stanza/issues/1371 https://github.com/stanfordnlp/stanza/pull/1378

- Mark `start_char` and `end_char` on an MWT if it is composed of exactly its subwords https://github.com/stanfordnlp/stanza/commit/23840891c37d54a5cf491ea58b0702987dd4a6d7 https://github.com/stanfordnlp/stanza/issues/1361
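The two checks described above, that the pieces add up to the whole and that character offsets can be assigned when they do, can be sketched as follows. The helper names are hypothetical and this is not the MWT processor's actual code; it assumes the pieces are plain strings that tile the token with no gaps.

```python
def pieces_match_token(token_text, pieces):
    """True if the pieces concatenate back to the surface token."""
    return "".join(pieces) == token_text


def piece_offsets(start_char, pieces):
    """Character spans for each piece, assuming the pieces exactly tile the token."""
    offsets = []
    cursor = start_char
    for piece in pieces:
        offsets.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return offsets


good_split = pieces_match_token("cannot", ["can", "not"])
bad_split = pieces_match_token("cannot", ["can", "nott"])
# Offsets are only meaningful when the split passes the check above
spans = piece_offsets(10, ["can", "not"]) if good_split else []
```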

Peft memory management

- Previous versions were loading multiple copies of the transformer in order to use adapters. To save memory, we can use Peft's capacity to attach multiple adapters to the same transformer instead as long as they have different names. This allows for loading just one copy of the entire transformer when using a Pipeline with several finetuned models. https://github.com/huggingface/peft/issues/1523 https://github.com/stanfordnlp/stanza/pull/1381 https://github.com/stanfordnlp/stanza/pull/1384

Other bugfixes and minor upgrades

- Fix crash when trying to load previously unknown language https://github.com/stanfordnlp/stanza/issues/1360 https://github.com/stanfordnlp/stanza/commit/381736f8fb9b60a929002cc750bd0df3d7dad03a

- Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: https://github.com/stanfordnlp/stanza/commit/d180ae02b278dd09dff53bc910e7aa43656e944d https://github.com/stanfordnlp/stanza/issues/1367

- Try to avoid OOM in the POS in the Pipeline by reducing its max batch length https://github.com/stanfordnlp/stanza/commit/42718135e2ab4b145bbb5861d55bb9424ca3549f

- Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to Jemoka) https://github.com/stanfordnlp/stanza/commit/597d48f1ead89fa9a0cca86cf9f0b530ed249792
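The `isatty` guard mentioned in the bugfixes above can be sketched like this. The helper name and the `BareStream` class are hypothetical; the point is only that a monkeypatched replacement for `sys.stderr` may lack an `isatty` method, so the attribute should be checked before it is called.

```python
import io
import sys


def stream_is_tty(stream):
    """True only if the stream has a callable isatty and reports a terminal."""
    isatty = getattr(stream, "isatty", None)
    return callable(isatty) and bool(isatty())


class BareStream:
    # Hypothetical monkeypatched replacement that lacks isatty
    def write(self, text):
        return len(text)


# Decide whether progress-bar style output is appropriate
use_progress_bar = stream_is_tty(sys.stderr)
bare_result = stream_is_tty(BareStream())
buffer_result = stream_is_tty(io.StringIO())
```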

Other upgrades

- Add \* to the list of functional tags to drop in the constituency parser, helping Icelandic annotation https://github.com/stanfordnlp/stanza/commit/57bfa8bbd8d3d42d4ee29d4a406640b126ce0f46 https://github.com/stanfordnlp/stanza/issues/1356#issuecomment-1981216912

- Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: https://github.com/stanfordnlp/stanza/commit/4048caed1b89030082d23b8f71d23bae6c9c54f1 https://github.com/stanfordnlp/stanza/commit/15b136bb30dda272d318a61a5f602e7fc81e7a31

- Add a constituency model for German https://github.com/stanfordnlp/stanza/commit/7a4f48c738f0db8923aa5da88d0a9743eaee4c6a https://github.com/stanfordnlp/stanza/commit/86ddaab31c73a7d0a389d0557f3696c29d441657 https://github.com/stanfordnlp/stanza/issues/1368
