original implementation: https://github.com/KarelDO/wl-coref/tree/master
Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref
If you use Stanza's coref module in your work, please be sure to cite both of the above papers.
Special thanks to [vdobrovolskii](https://github.com/vdobrovolskii), who graciously agreed to allow the integration of his work into Stanza, to KarelDO for his support of his training enhancement, and to Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.
Currently one model is provided: a transformer-based English model trained on OntoNotes. It is based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.
https://github.com/stanfordnlp/stanza/pull/1309
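A minimal sketch of how the coref annotator might be called follows. The processor string `"tokenize,coref"` and the attribute names `doc.coref`, `representative_text`, and `mentions` are assumptions based on Stanza's coref documentation rather than on these notes, and `summarize_chains` and `run_coref` are hypothetical helper names.

```python
def summarize_chains(doc):
    """Return (representative text, mention count) for each coref chain.

    Assumes `doc.coref` is a list of chain objects exposing
    `representative_text` and `mentions`.
    """
    return [(chain.representative_text, len(chain.mentions)) for chain in doc.coref]


def run_coref(text):
    """Build an English pipeline with coref and summarize the chains in `text`."""
    # Imported lazily so summarize_chains can be reused without Stanza installed.
    import stanza

    # Assumption: the coref model has already been downloaded.
    pipe = stanza.Pipeline("en", processors="tokenize,coref")
    return summarize_chains(pipe(text))
```

Calling `run_coref("...")` on a short passage would return one tuple per chain. Because the only available model is transformer-based, the first run may be slow.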
Interface change: English MWT
English now has an MWT model by default. Text such as `won't` is now marked as a single **token**, split into two **words**, `will` and `not`. Previously it was expected to be tokenized into two pieces, but the `Sentence` object containing that text would not have a single `Token` object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.
Code that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce **one** object for an MWT such as `won't`, `cannot`, `Stanza's`, etc.
Pipeline creation will not change, as MWT is automatically (but not silently) added at `Pipeline` creation time if the language and package include MWT.
https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1
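The difference between tokens and words can be sketched as below. The helper names `token_word_pairs` and `run_mwt` are hypothetical. The code relies only on the `tokens`, `words`, and `text` attributes described in the data objects documentation linked above, and lists `mwt` explicitly in the processor string for clarity, even though it would be added automatically.

```python
def token_word_pairs(sentence):
    """Pair each token's surface text with the texts of the words it expands to."""
    return [
        (token.text, [word.text for word in token.words])
        for token in sentence.tokens
    ]


def run_mwt(text):
    """Tokenize `text` with the English pipeline and return token/word pairs per sentence."""
    # Imported lazily so token_word_pairs can be reused without Stanza installed.
    import stanza

    # Assumption: the English tokenize and mwt models have already been downloaded.
    pipe = stanza.Pipeline("en", processors="tokenize,mwt")
    doc = pipe(text)
    return [token_word_pairs(sentence) for sentence in doc.sentences]
```

With the behavior described above, a token such as `won't` should appear once in the result, paired with two words, while an ordinary token is paired with a single word.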
Other updates
- NetworkX representation of enhanced dependencies, which allows for easier usage of Semgrex on enhanced dependencies. Searching over enhanced dependencies requires CoreNLP >= 4.5.6. https://github.com/stanfordnlp/stanza/pull/1295 https://github.com/stanfordnlp/stanza/pull/1298
- Sentence-ending punctuation tags improved for English to avoid labeling non-punctuation as punctuation (POS is also switched to using a DataLoader) https://github.com/stanfordnlp/stanza/issues/1000 https://github.com/stanfordnlp/stanza/pull/1303
- Optional rewriting of MWTs after the MWT processing step, which gives the user more control over fixing common errors. We still encourage posting issues on GitHub so we can fix them for everyone! https://github.com/stanfordnlp/stanza/pull/1302
- Remove deprecated output methods such as `conll_as_string` and `doc2conll_text`. Use `"{:C}".format(doc)` instead https://github.com/stanfordnlp/stanza/commit/e01650f9c56382495082a9a24fa0310414c46651
- Mixed OntoNotes and WW NER model for English is now the default. Future versions may include CoNLL 2003 and CoNLL++ data as well.
- Sentences now have a `doc_id` field if the document they are created from has a `doc_id`. https://github.com/stanfordnlp/stanza/pull/1314/commits/8e2201f42cb99a5a3d8358ce59501c1d88f2585e
- Optional processors added for cases where the user may not want a model to run by default. For example, conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model) https://github.com/stanfordnlp/stanza/pull/1314/commits/3d90d2b8a82048c5cea549b654e52544ed241833
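For code that called the removed output methods, a small migration shim along these lines may help. The name `to_conllu` is hypothetical. It only wraps the `"{:C}".format(doc)` call mentioned in the list above, and it assumes the `C` format spec yields the CoNLL text those methods used to return.

```python
def to_conllu(doc):
    """Render a Stanza Document as CoNLL text via the "C" format spec."""
    return "{:C}".format(doc)


def write_conllu(doc, path):
    """Write the CoNLL rendering of `doc` to `path` as UTF-8."""
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(to_conllu(doc))
```

Existing calls to `doc2conll_text` or `conll_as_string` could then be pointed at `to_conllu(doc)` with no other changes, provided they only used the returned string.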
Updated requirements
- Support dropped for Python 3.6 and 3.7. The `peft` module used for finetuning the transformer used in the coref processor does not support those versions.
- Added `peft` as an optional dependency to transformer-based installations
- Added `networkx` as a dependency for reading enhanced dependencies. Added `toml` as a dependency for reading the coref config.
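A quick way to check an environment against these requirements is sketched below. The helper names are hypothetical, and the minimum version of 3.8 is an assumption inferred from 3.6 and 3.7 being dropped.

```python
import importlib.util
import sys

# Assumption: with 3.6 and 3.7 dropped, 3.8 is the oldest supported version.
MIN_PYTHON = (3, 8)
# Modules named in the requirements above; peft is optional.
NEW_MODULES = ("networkx", "toml", "peft")


def python_supported(version_info=None):
    """Return True if the given (or running) Python version meets MIN_PYTHON."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def missing_modules(names=NEW_MODULES):
    """Return the names of top-level modules that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running `python_supported()` and `missing_modules()` before upgrading shows whether the interpreter is new enough and which of the newly required modules still need installing.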