The algorithm is unchanged from v2.0; this release improves the data, switches to fast and accurate morphosyntactic analysis, and fixes minor bugs.
Data
RuRSTreebank updated (attached jul22 version):
- All files have been fixed and are now readable. No more non-constituency subtrees, dangling EDUs, or XML-breaking symbols!
- Fixed all paragraph boundaries. " " always means the beginning of a new line. Illustration markers (IMG, IMG-TXT, [код]) are now combined into separate EDUs which can be easily filtered.
- Fixed punctuation and word parts mistakenly placed in the next EDU in segmentation annotations.
- Improved sentence integrity, especially in Blogs:
| | News1| News2 | Blogs |
|---|---|---|---|
|Before | 93.55% | 81.79% | 74.17% |
|Now | **94.79%** | **82.37%** | **81.63%** |
- The consistency of formal structures in the corpus has been improved. Titles, subtitles, lists, illustrations, and conclusions are now annotated similarly throughout the corpus. Identical relation-focused structures are now annotated in the same way. For example, Attribution satellites are now strictly within continuous citation boundaries.
- The consistency of relation annotation has been improved. The number of obvious relation assignment errors has been significantly reduced, based on the annotator instructions and statistics from the rest of the data.
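
Since illustration markers now sit in their own EDUs, they can be dropped before training or evaluation. The sketch below is only an illustration, not code from this repository: it assumes EDUs are available as plain strings and that a marker EDU consists solely of the markers named above (`IMG`, `IMG-TXT`, `[код]`) separated by whitespace. The helper names are hypothetical.

```python
import re
from typing import List

# Marker names come from the release notes; their exact surface form
# inside the corpus files is an assumption of this sketch.
ILLUSTRATION_MARKERS = ("IMG-TXT", "IMG", "[код]")

# Matches a string made up only of one or more markers and whitespace.
_MARKER_ONLY = re.compile(
    r"^(?:\s*(?:" + "|".join(re.escape(m) for m in ILLUSTRATION_MARKERS) + r"))+\s*$"
)


def is_illustration_edu(edu_text: str) -> bool:
    """Return True if the EDU contains nothing but illustration markers."""
    return bool(_MARKER_ONLY.match(edu_text))


def drop_illustration_edus(edus: List[str]) -> List[str]:
    """Keep only EDUs that carry actual text."""
    return [edu for edu in edus if not is_illustration_edu(edu)]
```

An EDU that merely mentions a marker inside running text is kept, because the pattern must match the whole string.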
Morphosyntactic analysis
Now uses the recent [ru_core_news_lg](https://spacy.io/models/ru#ru_core_news_lg) model from spaCy, which is fast and accurate.
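
For reference, a minimal sketch of reading morphosyntactic annotations from this model with the standard spaCy API. It is not the parser's own feature extraction code; `load_russian_pipeline` and `analyze` are hypothetical helper names, and the model is assumed to be installed (for example via `python -m spacy download ru_core_news_lg`).

```python
from typing import Any, Dict, List


def load_russian_pipeline(model_name: str = "ru_core_news_lg"):
    """Load the spaCy pipeline; the import is deferred so that
    `analyze` can be used with any spaCy-compatible callable."""
    import spacy

    return spacy.load(model_name)


def analyze(nlp, text: str) -> List[Dict[str, Any]]:
    """Run the pipeline and collect per-token morphosyntactic attributes."""
    doc = nlp(text)
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,       # lemma
            "pos": token.pos_,           # coarse part-of-speech tag
            "morph": str(token.morph),   # morphological features as a string
            "dep": token.dep_,           # dependency relation to the head
            "head": token.head.i,        # index of the syntactic head token
        }
        for token in doc
    ]
```

Typical usage would be `analyze(load_russian_pipeline(), "Мама мыла раму.")`, which returns one dictionary per token.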
Minor bug fixes
- Fixed a bug in determining some top-level DU boundaries when extracting features for feature-rich classifiers, so the parser now works correctly with long texts that are not pretokenized.
- The parser now also produces DUs without tokenization.
Evaluation
Evaluation of end-to-end parsing on RuRSTreebank, macro-averaged over test documents of different genres (attached jul version):
| Level | S | N | R | F |
|---|---|---|---|---|