Estnltk

Latest version: v1.7.4

3.5

* Text class has been redesigned.
Text annotations are now decomposed into Span-s, SpanList-s and Layer-s;
* A common class for text annotators -- Tagger class -- has been introduced;
* Word segmentation has been redesigned.
It is now a three-step process, which includes basic tokenization (layer 'tokens'), creation of compound tokens (layer 'compound\_tokens'), and creation of words (layer 'words') based on 'tokens' and 'compound\_tokens'.
Token compounding rules that are aware of text units containing punctuation (such as abbreviations, emoticons, web addresses) have been implemented;
* The segmentation order has been changed: word segmentation now comes before the sentence segmentation, and the paragraph segmentation comes after the sentence segmentation;
* Sentence segmentation has been redesigned.
Sentence segmenter is now aware of the compound tokens (fixing compound tokens can improve sentence segmentation results), and special post-correction steps are applied to improve quality of sentence segmentation;
* Morphological analysis interface has been redesigned.
Morphological analyses are no longer attached to the layer 'words' (although they can be easily accessed through the words, if needed), but are contained in a separate layer named 'morph_analysis'.
* Morphological analysis process can now be more easily decomposed into analysis and disambiguation (using special taggers VabamorfAnalyzer and VabamorfDisambiguator).
Also, a tagger responsible for post-corrections of morphological analysis (PostMorphAnalysisTagger) has been introduced, and post-corrections for improving quality of part of speech, and quality of analysis of numbers and pronouns have been implemented;
* Rules for converting morphological analysis categories from Vabamorf's format to GT (giellatekno) format have been ported from the previous version of EstNLTK.
Note, however, that the porting is not complete: full functionality requires 'clauses' annotation (which is currently not available);
* ...
* Other components of EstNLTK (such as the temporal expression tagger, and the named entity recognizer) are yet to be ported to the new version in the future;
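
The three-step word segmentation described above (tokens, then compound tokens, then words) can be illustrated with a small self-contained sketch. This is not EstNLTK's implementation: the function names, the regular expression and the tiny `ABBREVIATIONS` list are all assumptions made for the example, and the real compounding rules are far richer.

```python
import re

# Hypothetical compounding rule: adjacent tokens that spell a known abbreviation.
ABBREVIATIONS = {"e.g.", "i.e.", "jne."}

def tokenize(text):
    """Step 1: basic tokens as (start, end) character spans."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

def compound_tokens(text, tokens):
    """Step 2: spans that join several adjacent tokens into one unit."""
    compounds = []
    i = 0
    while i < len(tokens):
        # Try the longest run first (at most 4 tokens in this toy example).
        for j in range(min(len(tokens), i + 4), i + 1, -1):
            start, end = tokens[i][0], tokens[j - 1][1]
            if text[start:end] in ABBREVIATIONS:
                compounds.append((start, end))
                i = j - 1  # skip the tokens consumed by the compound
                break
        i += 1
    return compounds

def words(tokens, compounds):
    """Step 3: words are the compounds plus every token not covered by one."""
    result = list(compounds)
    for start, end in tokens:
        covered = any(c_start <= start and end <= c_end for c_start, c_end in compounds)
        if not covered:
            result.append((start, end))
    return sorted(result)

text = "Use tools, e.g. taggers."
tokens = tokenize(text)
compounds = compound_tokens(text, tokens)
word_texts = [text[s:e] for s, e in words(tokens, compounds)]
print(word_texts)
```

Keeping the compound spans in a separate step means the sentence segmenter can later consult them, for example to avoid treating the period inside an abbreviation as a sentence boundary.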

Added
-----
* SyntaxIgnoreTagger, which can be used for detecting parts of text that should be ignored by the syntactic analyser.
Note: it is yet to be integrated with the pre-processing module of syntactic analysis;

1.7.4

Changed

* Changed licensing: EstNLTK is now dual licensed: choose either [GNU General Public License v2.0](http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html) or [Apache 2.0 License](http://www.apache.org/licenses/LICENSE-2.0).
* Updated `resource_utils`: removed rudimentary version constraints checking; version constraints are now checked with the [packaging](https://packaging.pypa.io/) utility which is a new dependency of EstNLTK;
* Refactored `estnltk.vabamorf.morf` module:
    * Removed legacy `deconvert` function;
    * Removed legacy shortcut functions `analyze` and `disambiguate` (use `VabamorfInstance.analyze` and `VabamorfInstance.disambiguate` instead);
    * Changed `syllabify_word`: it now uses stem-based morph analysis for compound word splitting (removed rudimentary and error-prone compound word splitting heuristic);
* Changed `BertTokens2WordsRewriter`: it can now also create a sublayer of the words layer, mapping each word to corresponding Bert embeddings, and added a constructor parameter `ambiguous` to control whether the output layer is ambiguous or not. In case of unambiguous output (the default setting), the decorator function is expected to assign a single annotation (dictionary) to each span of the output layer; otherwise, the decorator must return a list of annotations.
* Updated `EstBERTNERTagger`: added `device` argument for [switching between cpu and gpu in transformers pipeline](https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html#transformers.Pipeline);
* Removed `PropBankPreannotator`'s lexicon from the package and made it available as a downloadable resource;
* Updated `PropBankPreannotator`: added flag `discard_overlapped_frames` for removing entirely overlapped frames. Note: this is a heuristic and not entirely harmless. Removing overlapped frames can reduce redundant frames by roughly 9 percentage points, but at the cost of decreasing correct frame detection accuracy by 0.3 percentage points (based on measurements on the EDT-UD corpus). For an evaluation of `PropBankPreannotator`, see the notebook in [this repository](https://github.com/estnltk/estnltk-model-data/tree/main/propbank_sem_roles).
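
The decorator contract described for `BertTokens2WordsRewriter` above (a single dictionary per span for unambiguous output, a list of dictionaries otherwise) can be sketched in plain Python. The helper `apply_decorator` and its span representation are hypothetical and exist only for this illustration; they are not part of EstNLTK.

```python
def apply_decorator(spans, decorator, ambiguous=False):
    """Collect annotations per span, enforcing the return type each mode expects."""
    layer = []
    for span in spans:
        result = decorator(span)
        if ambiguous:
            # Ambiguous layer: every span carries a list of annotation dicts.
            if not isinstance(result, list) or not all(isinstance(a, dict) for a in result):
                raise TypeError("ambiguous output: decorator must return a list of dicts")
        elif not isinstance(result, dict):
            # Unambiguous layer: every span carries exactly one annotation dict.
            raise TypeError("unambiguous output: decorator must return a single dict")
        layer.append((span, result))
    return layer

spans = [(0, 3), (4, 9)]
unambiguous = apply_decorator(spans, lambda s: {"length": s[1] - s[0]})
ambiguous = apply_decorator(
    spans, lambda s: [{"length": s[1] - s[0]}, {"length": None}], ambiguous=True
)
print(unambiguous)
print(ambiguous)
```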

Added

* Added stem-based morphological analysis (and disambiguation) option to Vabamorf instance. Modified `VabamorfTagger` & its sub-components: added stem-based morph_analysis option (flag `stem`). See [this tutorial](https://github.com/estnltk/estnltk/blob/0886be3ecbc5548827e0a70e4cef0f1aa313798b/tutorials/nlp_pipeline/B_morphology/02_morphological_analysis_stem_based.ipynb) for details;
    * Be aware that there is _no lemma_ in the output of stem-based morphological analysis. This also hinders the usability of the stem-based analysis by other tools in EstNLTK, because most of the tools require lemma-based morphological analysis.
* Added `CompoundWordTagger` for tagging linguistic compound word/subword boundaries on words. Details in the [tutorial](https://github.com/estnltk/estnltk/blob/0886be3ecbc5548827e0a70e4cef0f1aa313798b/tutorials/nlp_pipeline/B_morphology/compound_word_detection.ipynb);
* Added [packaging](https://packaging.pypa.io/) dependency to `estnltk_core` & `estnltk` (this is required for checking versions of Python packages used by EstNLTK).
* Added `GrammarCorrectorWebTagger` which tags grammatical error correction suggestions via TartuNLP's web service, see details in [this tutorial](https://github.com/estnltk/estnltk/blob/179ea6865383a3a349195907fb85b1da01c96eba/tutorials/taggers/web_taggers/web_taggers.ipynb);
* Added `BertMorphTagger` which tags part-of-speech and morphological form features in Estonian texts using Vabamorf's tagset, leveraging a fine-tuned Bert model. The tagger can also be used as a disambiguator to resolve ambiguities of an existing morphological analysis layer that uses Vabamorf's tagset. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/752ec7276b891c5d754433286058f26f9c973e74/tutorials/nlp_pipeline/B_morphology/08_bert_based_morph_tagger.ipynb). Note that `BertMorphTagger` depends on [sentencepiece](https://pypi.org/project/sentencepiece/), which won't be installed automatically with `estnltk-neural`, but needs to be installed manually;
* Added `GliLemTagger` which enhances Vabamorf's lemmatizer with an external disambiguation module based on GliNER and can either improve the lemmatization accuracy or provide an alternative lemmatization. The tagger can also be used as a disambiguator to resolve lemma ambiguities of an existing morphological analysis layer. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/b074fefa6597df7d6981b7920ca19a3621afbc0d/tutorials/nlp_pipeline/B_morphology/08_glilem_lemmatizer_and_disambiguator.ipynb). Note that `GliLemTagger` depends on [gliner](https://pypi.org/project/gliner/), which won't be installed automatically with `estnltk-neural`, but needs to be installed manually;
* Added a tutorial about using [TimeLocTagger](https://github.com/estnltk/estnltk/blob/179ea6865383a3a349195907fb85b1da01c96eba/tutorials/nlp_pipeline/X_miscellaneous/03_time_and_location_adverbials.ipynb) -- an experimental tagger which can be used to classify oblique phrases into time and location adverbials;
* Added a tutorial about [PropBankPreannotator](https://github.com/estnltk/estnltk/blob/179ea6865383a3a349195907fb85b1da01c96eba/tutorials/nlp_pipeline/X_miscellaneous/04_propbank_semantic_roles_preannotation.ipynb), which tags Estonian PropBank semantic roles based on a (manually-crafted) lexicon;
* Added `converters.label_studio` `PhraseTaggingTask` & `PhraseClassificationTask` for importing/exporting [Labelstudio](https://labelstud.io/) annotations. For usage details, see [this tutorial](https://github.com/estnltk/estnltk/blob/179ea6865383a3a349195907fb85b1da01c96eba/tutorials/converters/labelstudio_exporter_importer.ipynb).

Fixed

* Bugfix in `conll_to_texts_list`;
* Fixed issue [122](https://github.com/estnltk/estnltk/issues/122): `SentenceTokenizer`'s base tokenizer now uses resources from `"punkt_tab"` (so estnltk is now compatible with `nltk > 3.8.1`);
* Bugfix in `StanzaSyntaxTagger`: do not crash if the input is accidentally missing `'xpos'` (affects only `input_type='sentences'`);
* Fix in `resource_utils._normalize_resource_size`: resource sizes smaller than 1M will be normalized properly now;

1.7.3

Changed

* Updated `BaseText`:
    * `BaseText.sorted_layers` now covers both span layers and relation layers. However, for backwards compatibility, it returns only sorted span layers by default. Set flag `relation_layers=True` to include relation layers;
    * removed `BaseText.sorted_relation_layers`;
    * updated `Text` import/export functions to take account of dependencies between span and relation layers;
* Updated `Layer` and `RelationLayer`:
    * `None` values will appear translucent by default in the HTML representation;
* Updated `json_to_layer`: added possibility to load a single layer instead of a list of layers. Examples in [tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/converters/json_exporter_importer.ipynb);
* Removed deprecated module `estnltk.resolve_layer_dag` (use `estnltk.default_resolver` instead);
* Updated `UserDictTagger`: made `add_word` & `add_words_from_csv_file` non-public methods, which should no longer be used directly. Instead, `UserDictTagger`'s constructor should be used for adding all words;
* Added [deprecation warning](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/B_morphology/syllabification.ipynb) about `Vabamorf`'s syllabifier. Instead of using `Vabamorf`'s syllabifier, please use [the finite state transducer based syllabifier](https://gitlab.com/tilluteenused/docker_elg_syllabifier) which provides a complete syllabification functionality of Estonian;
* Updated `parse_enc`:
    * adapted to processing ENC 2023 .vert files;
    * added `add_document_index` option to `parse_enc` (saves document's original locations in the vert file) for processing ENC 2023;
    * added `focus_block` parameter to enable and control data parallelization;
    * added `extended_morph_form` parameter to enable importing of additional morphological form information in CG categories. This is possible only in ENC 2021 and ENC 2023 versions of the corpora which have both regular `morph_analysis` and `morph_extended` annotations available;
    * details in [the tutorial](https://github.com/estnltk/estnltk/blob/592a1689b9e07822d821d5b7bfa18f5981ec8f6d/tutorials/corpus_processing/importing_text_objects_from_corpora.ipynb);
* Updated `PhraseTagger`, `RegexTagger`, `SpanTagger`, `SubstringTagger` with `get_decorator_inputs` method, which can be used for debugging while creating rules, see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/taggers/rule_taggers/B_decorator_development.ipynb) for examples;
* Renamed `NerWebTagger`'s parameter `ner_output_layer` to `output_layer`;
* Renamed `SyntaxDependencyRetagger`'s parameter `conll_syntax_layer` to `syntax_layer`;
* Renamed `HfstClMorphAnalyser`'s method `lookup` to `analyze_token` and added corresponding implementation;
* Refactored & simplified `PhraseExtractor` interface;
* Removed `legacy.dict_taggers`;
* Updated `estnltk/setup.py`: customised build_py to build swig extensions before python modules;
* Updated `StanzaSyntaxEnsembleTagger`'s `majority_voting` algorithm: added Chu–Liu/Edmonds' post-processing to assure a valid tree structure;
* Updated `CoreferenceTagger`: added parameter `xgb_tree_method`, which allows setting the `tree_method` training parameter in `xgboost` (since `xgboost` version 2.0, the default `tree_method` has been changed, so this parameter makes it possible to roll back to the previous `tree_method` and restore the old behaviour of the model);
* Refactored `BertTagger` & `RobertaTagger`:
    * create unambiguous layers by default, avoid unnecessary nested lists in the output. Examples in [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/E_embeddings/bert_embeddings_tagger.ipynb);
* Optimized `PgCollection` initialization: do not count all rows during the initialization. This should make initialization relatively fast even for large tables;
* Removed `PostgresStorage` from short import path `estnltk.storage`. Please use long import path: `from estnltk.storage.postgres import PostgresStorage`;
* Refactored `PgCollection`: moved `create_collection_table` to `CollectionStructureBase` as creating collection table now depends on the collection structure version; stand-alone function `create_collection_table` is deprecated;
* Renamed `CollectionStructureBase.create_table` -> `create_layer_info_table`;

Added

* Updated `RelationLayer`:
    * added possibility to define enveloping `RelationLayer`;
    * added `display_order`, which can be used for specifying the order of `span_names` & `attributes` in the HTML representation of the layer;
    * added `display()` method to `RelationLayer` for showing relation (named span) annotations in text;
    * added `relations_v1` serialization module for serializing updated version of the layer;
    * For examples about the updated relation layer, please see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/system/relation_layer.ipynb);
* Added `DisplayNamedSpans` & `NamedSpanVisualiser` classes that support `RelationLayer` visualisation;
* Updated `Relation`: added HTML representation;
* Updated `NamedSpan`: added `resolve_attribute` method, which gives access to foreign attributes. Examples in [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/system/relation_layer.ipynb);
* Updated `TokensTagger`: added quotation marks postfixes (flag `apply_quotes_postfixes`);
* Added `LocalTokenSplitter` that splits tokens into smaller tokens based on regular expression patterns and user-defined functions for determining the split point; For details, see [the tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/A_text_segmentation/01_tokens.ipynb);
* Updated `VabamorfAnalyzer`: added function `analyze_token` for analysing a single word; Usage details in [the tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/B_morphology/01_morphological_analysis.ipynb);
* Added `TimeLocTagger` which tags time/location OBL phrases based on UD syntax layer;
* Added `PropBankPreannotator` which tags Estonian PropBank semantic roles based on a (manually-crafted) lexicon; Note that this is a preliminary version of the tagger;
* Added `RegexElement`, `StringList` and `ChoiceGroup` classes that wrap around the [regex library](https://pypi.org/project/regex/) and allow regular expressions to be documented and tested systematically. For usage details, please see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/taggers/rule_taggers/A_regex_development.ipynb);
* Added a [tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/taggers/rule_taggers/B_decorator_development.ipynb) about rule_tagger's decorator development (by swenlaur);
* Added `enc_layer_to_conll` for converting ENC morphosyntactic layer to CONLLU string. See details in the [tutorial](https://github.com/estnltk/estnltk/blob/38a50d30c938aed8c811b9ef0b296d0e4f01fcc0/tutorials/converters/conll_exporter.ipynb);
* Added `syntax_phrases_v0` serialization module (used by `PhraseExtractor`);
* Updated `StanzaSyntaxEnsembleTagger`: added calculation of predictions' entropy (optional); For usage details, see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
* Updated `PgCollection`'s create layer functions: if layer creation fails, document id-s will be logged;
* Updated `PgCollection` default version to V4:
    * Introduced new collection structure `postgres/structure/v40/collectionstructure.py`;
    * Added possibility to hide documents in pg collection. Note: a row-level security policy must be defined for the hiding to take effect;
    * Enable storing relation layers;
    * Added `pg_collection.is_relation_layer` method;
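
The idea behind `LocalTokenSplitter` in the list above (a regular expression selects the tokens to split and a user-defined function picks the split point) can be sketched as follows. This is a toy illustration under stated assumptions, not the tagger's actual interface: `split_tokens`, the `(pattern, split_point)` rule format and the example rule are all hypothetical.

```python
import re

def split_tokens(text, tokens, rules):
    """Split each (start, end) token at the offset chosen by the first usable rule."""
    result = []
    for start, end in tokens:
        token_text = text[start:end]
        for pattern, split_point in rules:
            match = pattern.fullmatch(token_text)
            if match:
                cut = start + split_point(match)
                if start < cut < end:  # only accept a split strictly inside the token
                    result.extend([(start, cut), (cut, end)])
                    break
        else:
            # No rule produced a valid split: keep the token unchanged.
            result.append((start, end))
    return result

# Hypothetical rule: separate a number from a unit glued to it, e.g. "5kg".
rules = [(re.compile(r"(\d+)([a-z]+)"), lambda m: m.end(1))]

text = "Osta 5kg jahu"
tokens = [m.span() for m in re.finditer(r"\S+", text)]
new_tokens = split_tokens(text, tokens, rules)
token_texts = [text[s:e] for s, e in new_tokens]
print(token_texts)
```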

Fixed

* Fixed `BaseLayer` `__getitem__` & `get` methods: preserve `serialisation_module` while extracting a sub-layer;
* Fixed `UserDictTagger`: `morph_layer`'s spans are now properly changed so that old values would not reappear after retagging;
* Fixed DeprecationWarning in NER `model_storage_util`;
* Fixed `syllabify_word`: do not crash on empty input;
* Fixed `RegexTagger` & `SpanTagger`: drop annotations if global decorator insists;
* Fixed `TimeLocDecorator`:
    * proper reading of lemmas list from file;
    * avoid circular wordnet import;
* Fixed matplotlib's imports (use internal imports to avoid occasional importing errors in Windows conda packages);
* Fixed `Wordnet.__del__` error (before closing database, assure it is open);
* Fixed `CoreferenceTagger`.`expand_mentions_to_named_entities`: remove any duplicate relations;
* Fixed a preprocessing bug in `StanzaSyntaxTagger`/`StanzaSyntaxEnsembleTagger` that was introduced by the stanza version update to 1.7.0+;
* Fixed `EstBERTNERTagger`: added proper input tokenization that does not fail on rare unicode symbols;
* Fixed `BertTagger` & `RobertaTagger`:
    * added proper tokenization processing that does not fail on rare unicode symbols;
    * do not use deprecated `tokenizer.encode_plus`;

1.7.2

Changed

* Renamed `PgCollection.meta` -> `meta_columns`;
* Deprecated `PgCollection.create()`. Use `PostgresStorage.add_collection` method to create new collections;
* Deprecated `PgCollection.delete()`, `PostgresStorage.delete(collection)` and `PostgresStorage.__delitem__(collection)`. Use `PostgresStorage.delete_collection` method to remove collections;
* Deprecated `PgCollection.select_fragment_raw()` (no longer relevant) and `continue_creating_layer` (use `create_layer(..., mode="append")` instead);
* Deprecated `PgCollection.has_fragment()`, `get_fragment_names()`, `get_fragment_tables()`. Use `collection.has_layer(name, layer_type="fragmented")` and `collection.get_layer_names_by_type(layer_type="fragmented")` instead;
* Merged `PgCollection.create_fragment` into `PgCollection.create_fragmented_layer`;
* Merged `PgCollection._create_layer_table` into `PgCollection.add_layer`;
* `StorageCollections.load()` removed legacy auto-insert behaviour;
* Refactored `PostgresStorage`: upon connecting to database, a new schema is now automatically created if the flag `create_schema_if_missing` has been set and the user has enough privileges. No need to manually call `create_schema` anymore;
* Refactored `StorageCollections` & `PostgresStorage`: relocated `storage_collections` table insertion and deletion logic to `PostgresStorage`;
* Refactored `PgCollection.add_layer`: added `layer_type` parameter, deprecated `fragmented_layer` parameter and added `'multi'` to layer types;
* Replaced function `conll_to_str` with `converters.conll.layer_to_conll`. For usage, see [this tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/C_syntax/01_syntax_preprocessing.ipynb) ("CONLL exporter");
* Refactored `DateTagger`, `AddressPartTagger`, `SyntaxIgnoreTagger`, `CompoundTokenTagger`: use new RegexTagger instead of the legacy one;
* Refactored `AdjectivePhrasePartTagger`: use `rule_taggers` instead of legacy `dict_taggers`;
* Updated `StanzaSyntax(Ensemble)Tagger`: random picking of ambiguous analyses is no longer deterministic: you'll get a different result on each run if the input is morphologically ambiguous. However, if needed, you can use seed values to ensure repeatability. For details, see [stanza parser tutorials](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
* Updated `StanzaSyntaxEnsembleTagger`:
    * if user attempts to process sentences longer than 1000 words with GPU / CUDA, a guarding exception will be thrown. Pass parameter `gpu_max_words_in_sentence=None` to the tagger to disable the exception;
    * added `aggregation_algorithm` parameter, which defaults to `'las_coherence'` (the same algorithm that has been used in the previous versions);
    * added a new aggregation algorithm: `'majority_voting'`. With the majority voting, the input will be processed token-wise and head & deprel that gets most votes from models will be picked for each token. Note, however, that this method can produce invalid tree structures, as there is no mechanism to ensure that majority-voting-picked tokens will make up a valid tree.
* Renamed `SoftmaxEmbTagSumWebTagger` to `NeuralMorphDisambWebTagger` and made the following updates:
    * `NeuralMorphDisambWebTagger` is now a `BatchProcessingWebTagger`;
    * `NeuralMorphDisambWebTagger` is now also a `Retagger` and can be used to disambiguate ambiguous `morph_analysis` layer. In the same vein, `NeuralMorphTagger` was also made a `Retagger` and can be used for disambiguation. For details on the usage, see [the neural morph tutorial](https://github.com/estnltk/estnltk/blob/b67ca34ef0702bb7d7fbe1b55639327dfda55830/tutorials/nlp_pipeline/B_morphology/08_neural_morph_tagger_py37.ipynb);
* `estnltk_neural` package requirements: removed explicit `tensorflow` requirement.
    * Note, however, that `tensorflow <= 1.15.5` (along with Python `3.7`) is still required if you want to use `NeuralMorphTagger`;
* `Wordnet`:
    * default database is no longer distributed with the package, wordnet now downloads the database automatically via `estnltk_resources`;
    * alternatively, a local database can now be imported via parameter `local_dir`;
    * updated wordnet database version to **2.6.0**;
* `HfstClMorphAnalyser`:
    * the model is no longer distributed with the package, the analyser now downloads the model automatically via `estnltk_resources`;
* Refactored `BatchProcessingWebTagger`:
    * renamed parameter `batch_layer_max_size` -> `batch_max_size`;
    * the tagger now has 2 working modes: a) batch splitting guided by text size limit, b) batch splitting guided by layer size limit (the old behaviour);
* Updated `vabamorf`'s function `syllabify_word`:
    * made compound word splitting heuristic more tolerant to mismatches, and as a result, we can now more properly syllabify words whose root tokens do not match exactly with the surface form. Examples: `kolmekümne` (`kol-me-küm-ne`), `paarisada` (`paa-ri-sa-da`), `ühesainsas` (`ü-hes-ain-sas`). However, if you need to use the old syllabification behaviour, pass parameter `tolerance=0` to the function, e.g. `syllabify_word('ühesainsas', tolerance=0)`.
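
The `'majority_voting'` aggregation noted above for `StanzaSyntaxEnsembleTagger`, and the reason it can yield an invalid tree, can be shown with a small stand-alone sketch. It is not the tagger's code: the data layout (one list of `(head, deprel)` pairs per model, token ids starting at 1, head 0 for the root) and both function names are assumptions, and the dependency labels are arbitrary.

```python
from collections import Counter

def majority_vote(predictions):
    """Pick, for each token, the (head, deprel) pair predicted by most models."""
    voted = []
    for token_votes in zip(*predictions):
        (head, deprel), _count = Counter(token_votes).most_common(1)[0]
        voted.append((head, deprel))
    return voted

def is_valid_tree(voted):
    """A valid tree has exactly one root (head 0) and no cycles."""
    heads = [head for head, _ in voted]
    if heads.count(0) != 1:
        return False
    for token_id in range(1, len(heads) + 1):
        seen = set()
        current = token_id
        while current != 0:
            if current in seen:
                return False  # walked into a cycle before reaching the root
            seen.add(current)
            current = heads[current - 1]
    return True

# Three models, three tokens; each model on its own predicts a valid tree.
model_a = [(2, "nsubj"), (3, "obj"), (0, "root")]
model_b = [(0, "root"), (3, "obj"), (1, "obl")]
model_c = [(2, "nsubj"), (0, "root"), (1, "obl")]

voted = majority_vote([model_a, model_b, model_c])
print(voted, is_valid_tree(voted))
```

Each token's winning vote is supported by two of the three models, yet the combined result points 1 to 2, 2 to 3 and 3 to 1, so it has no root at all.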

Added

* `MultiLayerTagger` -- interface for taggers that create multiple layers at once;
* `NerWebTagger` that tags NER layers via [tartuNLP NER webservice](https://ner.tartunlp.ai/api) (uses EstBERTNER v1 model). See [this tutorial](https://github.com/estnltk/estnltk/blob/b970ea98532921a4e06022fff2cd3755fc181edf/tutorials/nlp_pipeline/D_information_extraction/02_named_entities.ipynb) for details;
* `EstBERTNERTagger` that tags NER layers using huggingface EstBERTNER models. See [this tutorial](https://github.com/estnltk/estnltk/blob/b970ea98532921a4e06022fff2cd3755fc181edf/tutorials/nlp_pipeline/D_information_extraction/02_named_entities.ipynb) for details;
* `RelationLayer` -- new type of layer for storing information about relations between entities mentioned in text, such as coreference relations between names and pronouns, or semantic roles/argument structures of verbs. However, `RelationLayer` has not yet been completely integrated with EstNLTK's tools, and there are the following limitations:
    * you cannot access attributes of foreign layers (such as `lemmas` from `morph_analysis`) via spans of a relation layer;
    * `estnltk_core.layer_operations` do not support `RelationLayer`;
    * `estnltk.storage.postgres` does not support `RelationLayer`;
    * `estnltk.visualisation` does not handle `RelationLayer`;

For usage examples, see the [RelationLayer's tutorial](https://github.com/estnltk/estnltk/blob/b8ad0932a852daedb1e3eddeb02c944dd1f292ee/tutorials/system/relation_layer.ipynb).

* `RelationTagger` -- interface for taggers creating `RelationLayer`-s. Instructions on how to create a `RelationTagger` can be found in [this tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/taggers/base_tagger.ipynb);
* `WebRelationTagger` & `BatchProcessingWebRelationTagger`, which allow creating web-based `RelationTagger`-s;
* `CoreferenceTagger` which detects pronominal coreference relations. The tool is based on [Estonian Coreference System v1.0.0](https://github.com/SoimulPatriei/EstonianCoreferenceSystem) and currently relies on stanza 'et' models for pre-processing the input text. In the future, the tool will also become available via a web service (by `CoreferenceV1WebTagger`). For details, see the [coreference tutorial](https://github.com/estnltk/estnltk/blob/28814a3fa9ff869cd4cfc88308f6ce7e29157889/tutorials/nlp_pipeline/D_information_extraction/04_pronominal_coreference.ipynb);
* Updated `VabamorfDisambiguator`, `VabamorfTagger` & `VabamorfCorpusTagger`: added possibility to preserve phonetic mark-up (even with disambiguation);
* `UDMorphConverter` -- tagger that converts Vabamorf's morphology categories to Universal Dependencies morphological categories. Note that the conversion can introduce additional ambiguities as there is no disambiguation included, and roughly 3% to 9% of words do not obtain correct UD labels with this conversion. More details in [tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/B_morphology/06_morph_analysis_with_ud_categories.ipynb);
* `RobertaTagger` for tagging `EMBEDDIA/est-roberta` embeddings. The interface is analogous to that of `BertTagger`. [Tutorial](https://github.com/estnltk/estnltk/blob/e223a7e6245d29a6b1838335bfa3872a0aa92840/tutorials/nlp_pipeline/E_embeddings/bert_embeddings_tagger.ipynb).
* `BertTokens2WordsRewriter` -- tagger that rewrites BERT tokens layer to a layer enveloping EstNLTK's words layer. Can be useful for mapping Bert's output to EstNLTK's tokenization (currently used by `EstBERTNERTagger`).
* `PhraseExtractor` -- tagger for removing phrases and specific dependency relations based on UD-syntax;
* `ConsistencyDecorator` -- decorator for PhraseExtractor. Calculates syntax conservation scores after removing phrase from text;
* `StanzaSyntaxTaggerWithIgnore` -- entity ignore tagger. Retags text with StanzaSyntaxTagger and excludes phrases found by PhraseExtractor;
* `estnltk.resource_utils.delete_all_resources`. Apply it before uninstalling EstNLTK to remove all resources;
* `clauses_and_syntax_consistency` module, which allows to 1) detect potential errors in clauses layer using information from the syntax layer, 2) fix clause errors with the help of syntactic information. [Tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/F_annotation_consistency/clauses_and_syntax_consistency.ipynb);
* `PostgresStorage` methods:
    * `add_collection`
    * `refresh`
    * `delete_collection`
* `PgCollection` methods:
    * `refresh`
    * `get_layer_names_by_type`
* `PgCollectionMeta` (provides views to `PgCollection`'s metadata, and allows to query metadata) and `PgCollectionMetaSelection` (read-only iterable selection over `PgCollection`'s metadata values);
* Parameter `remove_empty_nodes` to `conll_to_text` importer -- if switched on (default), then empty / null nodes (ellipsis in the enhanced representation) will be discarded (left out from textual content and also from annotations) while importing from conllu files;
* Added a simplified example about how to get whitespace tokenization for words to tutorial [`restoring_pretokenized_text.ipynb`](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/corpus_processing/restoring_pretokenized_text.ipynb);
* `pg_operations.drop_all`;
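
The effect of the `remove_empty_nodes` option mentioned above can be illustrated without EstNLTK. In the CoNLL-U format, empty nodes of the enhanced representation have decimal ids such as `5.1`, so a reader can discard them by inspecting the first column. The function below is a hypothetical stand-in written for this example and does not reproduce the `conll_to_text` importer.

```python
def drop_empty_nodes(conllu_sentence):
    """Return the token lines of one sentence, skipping comments and empty nodes."""
    kept = []
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or sentence-level comment
        token_id = line.split("\t", 1)[0]
        if "." in token_id:
            continue  # empty node, e.g. id "5.1" (ellipsis in the enhanced graph)
        kept.append(line)
    return kept

# "Sue likes coffee and Bill tea", with an elided second "likes" as node 5.1.
rows = [("1", "Sue"), ("2", "likes"), ("3", "coffee"), ("4", "and"),
        ("5", "Bill"), ("5.1", "likes"), ("6", "tea")]
sample = "\n".join("\t".join([tid, form] + ["_"] * 8) for tid, form in rows)

forms = [line.split("\t")[1] for line in drop_empty_nodes(sample)]
print(forms)
```

Dropping the node removes it from the textual content as well as from the annotations, which matches the behaviour the entry describes for the default setting.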

Fixed

* `extract_(discontinuous_)sections`: should now also work on non-ambiguous layer that has a parent;
* `BaseText.topological_sort`: should now also work on layers with malformed/unknown dependencies;
* `CompoundTokenTagger`: 2nd level compounding rules now also work on detached layers;
* `TimexTagger`'s rules: disabled extraction of too long year values (which could break _Joda-Time_ integer limits);
* Bug that caused collection metadata to disappear when using `PgCollection.insert` (related to `PgCollection.column_names` not returning automatically correct metadata column names on a loaded collection; newly introduced `PgCollectionMeta` solved that problem);
* `StanzaSyntax(Ensemble)Tagger`: should now also work on detached layers;
* Fixed `BaseLayer.diff`: now also takes account of a difference in `secondary_attributes`;
* Fixed `downloader._download_and_install_hf_resource`: disabled default behaviour and `use_symlinks` option, because it fails under Windows;
* Fixed `download`: made it more flexible on parsing (idiosyncratic) 'Content-Type' values;
* Fixed `BertTagger` tokenization: `BertTagger` can now better handle misalignments between bert tokens and word spans caused by emojis, letters with diacritics, and the invisible token `\xad`;

1.7.1

Changed

* Structure and organization of [EstNLTK's tutorials](https://github.com/estnltk/estnltk/tree/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials), including:
    * Relocated introductory tutorials into the folder ['basics'](https://github.com/estnltk/estnltk/tree/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/basics);
    * Relocated 'estner_training' tutorials to 'nlp_pipeline/D_information_extraction';
    * Updated syntax tutorials and split into parser-wise sub tutorials:
        * [maltparser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_maltparser.ipynb);
        * [stanza's parser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
        * [udpipe's parser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_udpipe.ipynb);
        * [vislcg3 tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_vislcg3.ipynb);
* Updated `parse_enc` -- it can now be used for parsing [ENC 2021](https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2021-vert/f176ccc0d05511eca6e4fa163e9d454794df2849e11048bb9fa104f1fec2d03f/). See [the tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/corpus_processing/importing_text_objects_from_corpora.ipynb) for details;
* The function `parse_enc_file_iterator` now attempts to _automatically fix malformed paragraph annotations_. As a result, more words and sentences can be imported from corpora, but as a side effect, paragraph annotations are created artificially, even for documents that originally have none. The setting can be turned off if needed;
* Updated `get_resource_paths` function: added EstNLTK version checking. A resource description can now contain version specifiers, which declare the estnltk or estnltk_neural version required for using the resource. Using version constraints is optional, but if they are used and the constraints are not satisfied, `get_resource_paths` will neither download the resource nor return its path;
* Relocated `estnltk.transformers` (`MorphAnalysisWebPipeline`) into `estnltk.web_taggers`;
* Refactoring: moved functions `_get_word_texts` & `_get_word_text` to `estnltk.common`;
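
The version check described for `get_resource_paths` above can be pictured with a small stand-alone sketch. It is not EstNLTK's implementation: the resource description format, the constraint syntax and the helper names `parse_version`, `satisfies` and `resolve_resource` are all assumptions made for illustration, and the comparison only handles plain dotted numeric versions with the same number of components.

```python
import operator
import re

# Hypothetical comparison operators for constraints such as ">=1.7.1".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(version):
    """Turn '1.7.1' into (1, 7, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, constraint):
    """Check one constraint like '>=1.7.1' against an installed version."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([\d.]+)\s*", constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    op, required = match.groups()
    return _OPS[op](parse_version(installed), parse_version(required))


def resolve_resource(description, installed_versions):
    """Return the resource path only if every declared constraint holds.

    Constraints are optional: a description without 'requires' always
    resolves. If a constraint fails, no path is returned (None).
    """
    for package, constraint in description.get("requires", {}).items():
        installed = installed_versions.get(package)
        if installed is None or not satisfies(installed, constraint):
            return None
    return description["path"]
```

With a hypothetical description such as `{"path": "/tmp/model", "requires": {"estnltk": ">=1.7.1"}}`, the sketch returns the path when the installed estnltk is 1.7.2 and `None` when it is 1.7.0.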

Added

* `ResourceView` class, which lists EstNLTK's resources as a table, and shows their download status. See the [resources tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/basics/estnltk_resources.ipynb) for details.
* `SyntaxIgnoreCutter` class, which cuts the input Text object into a smaller Text by leaving out all spans from the syntax_ignore layer (produced by `SyntaxIgnoreTagger`). The resulting Text can then be analysed syntactically while skipping parts of the text that may be difficult to analyse. For details, see the [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/02_syntax_preprocessing_with_ignoretagger.ipynb);
* Function `add_syntax_layer_from_cut_text`, which can be used to carry over the syntactic analysis layer from the cut text (created by `SyntaxIgnoreCutter`) to the original text. See the [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/02_syntax_preprocessing_with_ignoretagger.ipynb) for details;
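
The cut-and-carry-over idea behind `SyntaxIgnoreCutter` and `add_syntax_layer_from_cut_text` can be illustrated on plain strings. The sketch below does not use EstNLTK's API: `cut_text` and `carry_over` are hypothetical helpers, spans are plain `(start, end)` tuples, and the example sentence and label are made up. It only shows how an index map lets annotations made on the shortened text be moved back to the original offsets.

```python
def cut_text(text, ignore_spans):
    """Remove the (start, end) spans from text.

    Returns the shortened text and a list that maps every character index
    of the shortened text back to its index in the original text.
    """
    kept_chars = []
    index_map = []
    position = 0
    for start, end in sorted(ignore_spans):
        # Keep everything between the previous ignored span and this one.
        for i in range(position, start):
            kept_chars.append(text[i])
            index_map.append(i)
        position = max(position, end)
    # Keep the tail after the last ignored span.
    for i in range(position, len(text)):
        kept_chars.append(text[i])
        index_map.append(i)
    return "".join(kept_chars), index_map


def carry_over(spans_in_cut_text, index_map):
    """Translate (start, end, label) spans from the cut text to the original."""
    return [
        (index_map[start], index_map[end - 1] + 1, label)
        for start, end, label in spans_in_cut_text
    ]


original = "Ta tuli (nagu ikka) koju."
ignored = [(8, 20)]  # the parenthesised part plus the following space
cut, index_map = cut_text(original, ignored)

# An annotation on "koju" in the cut text, moved back to the original.
moved = carry_over([(8, 12, "example_label")], index_map)
```

Here `cut` is the sentence without the parenthesised insertion, and `moved` points at the same word in `original`. A span that straddles a cut point would be mapped to a range that includes the removed part, which a real implementation would need to handle explicitly.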

Fixed

* Syntax preprocessing [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/01_syntax_preprocessing.ipynb) to describe the current state of preprocessing;

1.7.0

Changed

* EstNLTK's tools that require large resources (e.g. syntactic parsers and neural analysers) can now download resources automatically upon initialization. This stops the program flow with an interactive prompt asking for the user's permission to download the resource. However, you can pre-download the resource to avoid the interruption; see this [tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/estnltk_resources.ipynb) for details.

* Structure and organization of [EstNLTK's tutorials](https://github.com/estnltk/estnltk/tree/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials). However, the work on updating tutorials is still not complete.

* `PgCollection`: now uses `CollectionStructure.v30` by default.

* Disambiguator (a system tagger): it's now a Retagger, but can work either as a retagger or a tagger, depending on the inputs. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/taggers/system/disambiguator.ipynb).

Added

* `downloader` & `resources_utils` for downloading additional resources and handling paths of downloaded resources. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/estnltk_resources.ipynb)

* Collocation net -- makes it possible to find different connections between words based on the collocations each word occurs in. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/collocation_net/tutorial.ipynb).

* `PgCollection`: added `CollectionStructure.v30`, which allows creating sparse layer tables. Sparse layer tables do not store empty layers, which can save storage space and allow faster queries over tables & collection. The [main db tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/storage/storing_text_objects_in_postgres.ipynb) exemplifies the creation and usage of sparse layers.

* `PgCollection.create_layer` & `PgCollection.add_layer` now take the parameter `sparse=True`, which turns the layer into a sparse layer;

* `PgCollection.select` now has a boolean parameter `keep_all_texts`: turning the parameter off yields only texts with non-empty sparse layers;

* `PgSubCollection` now has methods `create_layer` and `create_layer_block`, which can be used to create a sparse layer from a specific subcollection;
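
The sparse-layer behaviour described in the items above can be sketched without a database. The code below is only a conceptual model, not the `PgCollection` implementation: plain dictionaries stand in for layer tables, spans are stored as JSON strings, and `store_layers` and `select` are hypothetical helpers. Only the `keep_all_texts` flag name is taken from the changelog.

```python
import json


def store_layers(texts_with_layers):
    """Build a dense and a sparse 'layer table' for comparison.

    texts_with_layers maps text_id -> list of spans for one layer.
    The dense table keeps a row for every text; the sparse table keeps
    rows only for texts whose layer has at least one span.
    """
    dense = {
        text_id: json.dumps(spans)
        for text_id, spans in texts_with_layers.items()
    }
    sparse = {
        text_id: json.dumps(spans)
        for text_id, spans in texts_with_layers.items()
        if spans
    }
    return dense, sparse


def select(text_ids, sparse_table, keep_all_texts=True):
    """Yield (text_id, spans) pairs from a sparse table.

    A missing row is read back as an empty layer. With
    keep_all_texts=False, texts without a stored row are skipped.
    """
    for text_id in text_ids:
        if text_id in sparse_table:
            yield text_id, json.loads(sparse_table[text_id])
        elif keep_all_texts:
            yield text_id, []


layers = {1: [[0, 5]], 2: [], 3: [[3, 9]], 4: []}
dense, sparse = store_layers(layers)
non_empty = list(select(layers, sparse, keep_all_texts=False))
```

In this toy data the sparse table holds two rows instead of four, and `non_empty` contains only texts 1 and 3, which mirrors why sparse tables can save space and speed up queries when most layers are empty.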

Fixed

* `BaseText.__repr__` method using wrong variable name;

* `NeuralMorphTagger`'s configuration reading and handling: model locations can now be freely customized;

* `TimexTagger`'s rules on detecting dates with roman numeral months & dates with slashes.
