Ssurgeon interface
The headline feature of this release is the initial version of Ssurgeon, a rule-based tool for editing dependency graphs. Together with the existing Semgrex integration with CoreNLP, Ssurgeon makes it possible to rewrite dependencies, such as those in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/
Beyond Ssurgeon, this release includes two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments whose results ranged from "ineffective" to "small improvements". The experiments are available for people to try out.
CoreNLP integration:
- Ssurgeon interface! New interface allows for editing of dependency graphs using Semgrex patterns and Ssurgeon rules. https://github.com/stanfordnlp/stanza/pull/1205 https://aclanthology.org/2023.tlt-1.7/
- English Morphology class (deterministic English lemmatizer) https://github.com/stanfordnlp/stanza/commit/6aed177731e883ce92057be7e78abdce3141a862
- English constituency -> dependency converter https://github.com/stanfordnlp/stanza/commit/0987794c9e960b32ed75d5804dd5c586466ae061
Bugfixes:
- Bugfix for older versions of torch: https://github.com/stanfordnlp/stanza/commit/376d7ea76248131a96d23e236ab165e7d5a544bb
- Bugfix for training (integration with new scoring script) https://github.com/stanfordnlp/stanza/issues/1167 https://github.com/stanfordnlp/stanza/commit/9c39636c438cbeb00ab7a7e8d9caa0bcd31ccc44
- Demo was showing constituency parser along with dependency parsing, even with conparse off: https://github.com/stanfordnlp/stanza/commit/cbc13b0219281f2c27e89ccf2914e13f8aa2bb1b
- Replace absurdly long characters with UNK (thank you khughitt) https://github.com/stanfordnlp/stanza/issues/1137 https://github.com/stanfordnlp/stanza/pull/1140
- Package all relevant pretrains into default.zip - otherwise pretrains used by NER models which are not the default pretrain were being missed. https://github.com/stanfordnlp/stanza/commit/435685f875766e0b9b2b9b1d4792db1c452f9722
- stanza-train NER training bugfix (wrong pretrain): https://github.com/stanfordnlp/stanza/commit/2757cb40edf7a4bf9f62e31eec4b3632ac5ebcb9
- Pass the device around everywhere instead of calling `cuda()`. This should fix models occasionally being split over multiple devices. It would also allow for use of MPS, but the current torch implementation for MPS is buggy https://github.com/stanfordnlp/stanza/issues/1209 https://github.com/stanfordnlp/stanza/pull/1159
- Fix error in preparing tokenizer datasets (thanks dvzubarev): https://github.com/stanfordnlp/stanza/pull/1161
- Fix unnecessary slowness in preparing tokenizer datasets (again, thanks dvzubarev): https://github.com/stanfordnlp/stanza/pull/1162
- Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks dvzubarev): https://github.com/stanfordnlp/stanza/pull/1170
- When using the tregex interface to CoreNLP, add parse if it isn't already there (again, depparse was being confused with parse): https://github.com/stanfordnlp/stanza/commit/b118473604d50d678c2857c0f39f59ba0cd9c2a3
- Update use of emoji to match latest releases: https://github.com/stanfordnlp/stanza/issues/1195 https://github.com/stanfordnlp/stanza/commit/ea345a88f8916c2ab2cd2e6260caa7831dfe2f23
Features:
- Mechanism for resplitting tokens into MWT https://github.com/stanfordnlp/stanza/issues/95 https://github.com/stanfordnlp/stanza/commit/8fac17f625173b2c2bf1cecf611deecb37399322
- CLI for tokenizing text into one paragraph per line, whitespace separated (useful for GloVe, for example) https://github.com/stanfordnlp/stanza/commit/cfd44d17f806703b7ed6719993501366a52afbb1
- `detach().cpu()` speeds things up significantly in some cases https://github.com/stanfordnlp/stanza/commit/ccfbc56b3b312fdde1350104a0d0d5645c9c80cc
- Potentially use a constituency model as a classifier - WIP research project https://github.com/stanfordnlp/stanza/pull/1190
- Add an output format `"{:C}"` for document objects which prints out documents as CoNLL: https://github.com/stanfordnlp/stanza/pull/1169
- If a constituency tree is available, include it when outputting CoNLL format for documents: https://github.com/stanfordnlp/stanza/pull/1171
- Same with sentiment: https://github.com/stanfordnlp/stanza/commit/abb581945a70fec335dbfadd71bf8c457fa908eb
- Additional language code coverage (thank you juanro49) https://github.com/stanfordnlp/stanza/commit/5802b10882026c4694a4d966e4200c48c5469b1b https://github.com/stanfordnlp/stanza/commit/f06bf86b566772ea6551c663835ddb9a6f5584ff https://github.com/stanfordnlp/stanza/commit/32f83fa2f2333f42925323c4ac9da059dffdf1dc https://github.com/stanfordnlp/stanza/commit/34505758c9d8de4ca70bfbe5418448ad54af088f
- Allow loading a pipeline for new languages (useful when developing a new suite of models) https://github.com/stanfordnlp/stanza/commit/e7fcd262a6c5f3f71b339fe989bcaa177fb378f1
- Script to count the work done by annotators on an AWS SageMaker private workforce: https://github.com/stanfordnlp/stanza/pull/1186
- Streaming interface which batch processes items in the stream: https://github.com/stanfordnlp/stanza/commit/2c9fe3dad434b271fa23c20a9cf8ccaf63991f16 https://github.com/stanfordnlp/stanza/issues/550
- Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for each language at once: https://github.com/stanfordnlp/stanza/commit/70fd2fdc94575dec79c4994ea2dc66a719768ab0 https://github.com/stanfordnlp/stanza/issues/1199
- Transformer at bottom layer of POS - currently only available in English as the `en_combined_bert` model, others to come https://github.com/stanfordnlp/stanza/pull/1132
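
The defaultdict option for MultilingualPipeline mentioned above relies on standard Python behavior: a `defaultdict` returns a fallback value for any key that has no explicit entry. The following is a minimal sketch of that idea using only the standard library. The processor lists and the helper `config_for` are illustrative assumptions and are not taken from Stanza itself.

```python
from collections import defaultdict

# Fallback configuration used for any language without an explicit entry.
# The processor lists here are illustrative assumptions, not recommendations.
lang_configs = defaultdict(lambda: {"processors": "tokenize,pos"})

# Optional per-language override.
lang_configs["en"] = {"processors": "tokenize,pos,lemma,depparse"}


def config_for(lang):
    """Hypothetical helper: return the configuration that applies to a language code."""
    return lang_configs[lang]


if __name__ == "__main__":
    print(config_for("en"))  # explicit entry
    print(config_for("fr"))  # falls back to the default factory
```

With Stanza installed, a dictionary like this would presumably be handed to `MultilingualPipeline` when it is constructed. The linked commit and issue have the exact argument name and semantics.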
New models:
- Armenian NER model using an NER labeling of armtdp (thanks to ShakeHakobyan): https://github.com/myavrum/ArmTDP-NER https://github.com/stanfordnlp/stanza/issues/1206 https://github.com/stanfordnlp/stanza/pull/1212
- Sindhi tokenization from ISRA https://github.com/stanfordnlp/stanza/pull/1117
- Sindhi NER from SiNER: https://github.com/stanfordnlp/stanza/commit/2a8ded4b0c327761b047caf433128f13b1ad14bf
- Erzya from UD 2.11 https://github.com/stanfordnlp/stanza/commit/0344ac34b5df602a49da25d58655a24a0ffcd208
Conparser experiments:
- Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 https://github.com/stanfordnlp/stanza/commit/110031e29259b34be6f958fd6d67d4774d6b084a
- TREE_LSTM constituent composition method (didn't beat MAX) https://github.com/stanfordnlp/stanza/commit/2f722c828fa1364131b670da5b925082e9aa336a
- Learned weighting between BERT layers (this did help a little) https://github.com/stanfordnlp/stanza/commit/2d0c69ee449501155225efc2afb53b4ba6eeefe7
- Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models. Helps smaller treebanks such as IT and VI, but has no effect on EN https://github.com/stanfordnlp/stanza/pull/1148
- New in_order_compound transition scheme: no improvement https://github.com/stanfordnlp/stanza/commit/f560b08902cf9f9e20656697c367500389115057
- Multistage training with madgrad or adamw: definite improvement. madgrad is included as an optional dependency https://github.com/stanfordnlp/stanza/commit/2706c4b100285e50f3d9a69e51ca5955e15ba41d https://github.com/stanfordnlp/stanza/commit/f500936b5ca4ba2305a028241996e5d198afd94b
- Report the scores of tags when retagging (does not affect the conparser training) https://github.com/stanfordnlp/stanza/commit/766341942962e5a5a0aa0cda3dd170ac098ac6f9
- FocalLoss on the transitions using an optional dependency: didn't help https://arxiv.org/abs/1708.02002 https://github.com/stanfordnlp/stanza/commit/90a8337083f0dc057ea2a9ee794595a6b292850f
- LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax https://github.com/stanfordnlp/stanza/commit/5edd7242073720aff94f07904009ce0cad47b7ff
- Maxout layer: didn't help https://arxiv.org/abs/1302.4389 https://github.com/stanfordnlp/stanza/commit/c708ce7736ffb021f9a0065f2bedaa8b73de52ba
- Reverse parsing: not expected to help, but potentially useful when building silver treebanks. May also be useful as a two-step parser in the future. https://github.com/stanfordnlp/stanza/commit/4954845ba4b16240e6acf8d45d83161a0dec8d33
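
The silver trees experiment above hinges on a voting step: several trained models each parse the same sentence, and only trees with enough agreement are kept as new training data. The sketch below illustrates one way such a vote could work. It assumes exact string match between bracketed trees and a simple majority threshold. The function `vote_on_trees`, its `min_agreement` parameter, and the stub parsers are hypothetical and do not reflect the actual criterion used in the linked PR.

```python
from collections import Counter


def vote_on_trees(sentences, parsers, min_agreement=0.5):
    """Keep (sentence, tree) pairs where enough parsers agree on one tree.

    `parsers` is a list of callables, each mapping a sentence to a bracketed
    tree string. A tree is kept only if strictly more than `min_agreement`
    of the parsers produced exactly that tree.
    """
    silver = []
    for sentence in sentences:
        predictions = [parser(sentence) for parser in parsers]
        tree, votes = Counter(predictions).most_common(1)[0]
        if votes / len(parsers) > min_agreement:
            silver.append((sentence, tree))
    return silver


if __name__ == "__main__":
    # Stub "models" standing in for trained parsers (assumption for the demo).
    def parser_a(sentence):
        return "(ROOT (X " + sentence + "))"

    def parser_b(sentence):
        return "(ROOT (Y " + sentence + "))"

    print(vote_on_trees(["hello"], [parser_a, parser_a, parser_b]))
```

The kept pairs would then serve as additional training data for a new round of models. A stricter threshold trades quantity of silver trees for confidence in each one.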