Isanlp-rst

Latest version: v3.0.1a5

3.0

Key Features and Improvements

- **Multiple End-to-End Models for Russian and English:**
  - Russian (`rstreebank`): As always, this version includes a model trained on RuRSTreebank, providing robust parsing capabilities for Russian texts.
  - Bilingual Model (`gumrrg`): A new bilingual model trained on a mix of GUM and RRG, offering enhanced parsing performance across multiple genres.
  - English Model (`rstdt`): An English model trained on the RST-DT benchmark.
- **Modified DMRST Architecture:** We have implemented modifications to the DMRST architecture, improving both segmentation and tree construction.

2.1

The algorithm is the same as in v2.0. This release adds improved data, fast and accurate morphosyntactic analysis, and minor bug fixes.

Data
RuRSTreebank updated (attached jul22 version):
- All files have been fixed and are now readable. No more non-constituency subtrees, dangling EDUs, or XML-breaking symbols!
- Fixed all paragraph boundaries. " " always means the beginning of a new line. Illustration markers (IMG, IMG-TXT, [код]) are now combined into separate EDUs which can be easily filtered.
- Fixed punctuation and word parts mistakenly placed in the next EDU in segmentation annotations.
- Improved sentence integrity, especially in Blogs:
| | News1| News2 | Blogs |
|---|---|---|---|
|Before | 93.55% | 81.79% | 74.17% |
|Now | **94.79%** | **82.37%** | **81.63%** |
- The consistency of formal structures in the corpus has been improved. Titles, subtitles, lists, illustrations, and conclusions are now annotated similarly throughout the corpus. Identical relation-focused structures are now annotated in the same way. For example, Attribution satellites are now strictly within continuous citation boundaries.
- The consistency of relation annotation has been improved. The number of obvious relation assignment errors has been significantly reduced, based on the annotator instructions and statistics from the rest of the data.

Morphosyntactic analysis
The parser now uses the recent [ru_core_news_lg](https://spacy.io/models/ru#ru_core_news_lg) model from spaCy, because it is fast and accurate.

Minor bug fixes
- Fixed a bug in determining some top-level DU boundaries when extracting features for the feature-rich classifiers, so the parser now works with long texts that are not pretokenized.
- DUs are now also produced without tokenization.

Evaluation
Evaluation of end-to-end parsing on RuRSTreebank, macro averaged over test documents of different genres (attached jul version):
| Level | S | N | R | F |
|---|---|---|---|---|
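
Macro averaging over documents gives every test document equal weight regardless of its length. The sketch below only illustrates that idea. The function names, the representation of a tree as a set of `(start, end)` spans, and the plain span F1 used as the per-document score are all assumptions made for illustration; this is not the evaluation code behind the scores reported here.

```python
from statistics import mean
from typing import List, Set, Tuple

Span = Tuple[int, int]  # hypothetical representation: (first EDU, last EDU)


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """F1 between the gold and predicted span sets of one document."""
    if not gold and not pred:
        return 1.0
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(documents: List[Tuple[Set[Span], Set[Span]]]) -> float:
    """Score each document separately, then average with equal weights."""
    return mean(span_f1(gold, pred) for gold, pred in documents)


# Toy data (not from RuRSTreebank): two documents as (gold, predicted) pairs.
docs = [
    ({(0, 1), (2, 3), (0, 3)}, {(0, 1), (1, 3), (0, 3)}),  # 2 of 3 spans match
    ({(0, 1)}, {(0, 1)}),                                   # perfect match
]
print(round(macro_f1(docs), 4))
```

With this scheme a short document with a perfect parse counts as much as a long document with many errors, which is the main difference from pooling all spans before scoring.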

2.0

Paragraph-level trees are constructed with a top-down algorithm. The default structure and label classifiers are both ensembles of a feature-rich sklearn classifier and a neural allennlp classifier using contextual embeddings and granularity features.

Described and applied in [Discourse-aware Text Classification for Argument Mining](https://github.com/tchewik/discourse-aware-classification).
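
As a rough illustration of top-down construction, the sketch below recursively splits a sequence of EDUs at the best-scoring boundary. It is a minimal sketch under stated assumptions: `Node`, `build_top_down`, `to_brackets`, and the toy `strengths` scores are hypothetical names, and the scoring callback stands in for the structure classifier ensemble, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    start: int                      # index of the first EDU in the span
    end: int                        # index of the last EDU (inclusive)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_top_down(start: int, end: int,
                   split_score: Callable[[int, int, int], float]) -> Node:
    """Recursively split [start, end] at the highest-scoring boundary."""
    if start == end:
        return Node(start, end)     # a single EDU is a leaf
    # A split at k yields the children [start, k] and [k + 1, end].
    best = max(range(start, end), key=lambda k: split_score(start, k, end))
    return Node(start, end,
                build_top_down(start, best, split_score),
                build_top_down(best + 1, end, split_score))


def to_brackets(node: Node):
    """Render the tree as nested tuples of EDU indices."""
    if node.left is None or node.right is None:
        return node.start
    return (to_brackets(node.left), to_brackets(node.right))


# Toy scores (an assumption): strengths[k] is the score of splitting
# between EDU k and EDU k + 1.
strengths = [0.2, 0.9, 0.4]
tree = build_top_down(0, 3, lambda s, k, e: strengths[k])
print(to_brackets(tree))
```

The root decision is made first and over the whole span, so an early wrong split cannot be undone by later steps. That is the usual trade-off of a top-down strategy.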

- Optimized feature-rich classifiers.
- 5x speedup (~70 s per document in RuRSTreebank).
- EDUs sharing the "same-unit" relation are now joined into a single EDU.
- RuRSTreebank updated (attached feb22 version):
  - All documents are now in .rs3 format only.
  - All documents contain "" paragraph boundary markers.
  - Some files have been fixed and are now readable.

End-to-end parsing evaluation on RuRSTreebank (attached feb22 version):
| Level | S | N | R | F |
|---|---|---|---|---|

1.0.1

Trees are constructed with a greedy bottom-up algorithm. The default structure and label classifiers are both ensembles of a feature-rich sklearn classifier and a neural allennlp classifier using contextual embeddings and granularity features.

Trained and evaluated on the first version of the RuRSTreebank corpus (see ``src/maintenance/corpus/``).

Described in https://link.springer.com/chapter/10.1007/978-3-030-72610-2_8
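
Greedy bottom-up construction can be sketched as repeatedly merging the best-scoring pair of adjacent units until one tree remains. The code below is an illustration under stated assumptions, not the released implementation: `Node`, `build_bottom_up`, `to_brackets`, and the toy `link` scores are hypothetical, and the `pair_score` callback stands in for the structure classifier ensemble.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    start: int                      # index of the first EDU in the span
    end: int                        # index of the last EDU (inclusive)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_bottom_up(n_edus: int,
                    pair_score: Callable[[Node, Node], float]) -> Node:
    """Greedily merge the best-scoring adjacent pair until one node is left."""
    nodes = [Node(i, i) for i in range(n_edus)]
    while len(nodes) > 1:
        # Score every adjacent pair and pick the most confident one.
        i = max(range(len(nodes) - 1),
                key=lambda j: pair_score(nodes[j], nodes[j + 1]))
        merged = Node(nodes[i].start, nodes[i + 1].end, nodes[i], nodes[i + 1])
        nodes[i:i + 2] = [merged]   # replace the pair with the merged node
    return nodes[0]


def to_brackets(node: Node):
    """Render the tree as nested tuples of EDU indices."""
    if node.left is None or node.right is None:
        return node.start
    return (to_brackets(node.left), to_brackets(node.right))


# Toy scores (an assumption): link[k] is how strongly the unit ending at
# EDU k attaches to the unit that follows it.
link = [0.5, 0.9, 0.1]
tree = build_bottom_up(4, lambda left, right: link[left.end])
print(to_brackets(tree))
```

Each merge is final, so the procedure needs one scoring pass over the remaining adjacent pairs per step and never revisits earlier decisions.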
