Estnltk

Latest version: v1.7.3


* Text class has been redesigned.
Text annotations are now decomposed into Span-s, SpanList-s and Layer-s;
* A common class for text annotators -- Tagger class -- has been introduced;
* Word segmentation has been redesigned.
It is now a three-step process, which includes basic tokenization (layer 'tokens'), creation of compound tokens (layer 'compound_tokens'), and creation of words (layer 'words') based on 'tokens' and 'compound_tokens' (see the sketch after this list).
Token compounding rules that are aware of text units containing punctuation (such as abbreviations, emoticons, web addresses) have been implemented;
* The segmentation order has been changed: word segmentation now comes before the sentence segmentation, and the paragraph segmentation comes after the sentence segmentation;
* Sentence segmentation has been redesigned.
Sentence segmenter is now aware of the compound tokens (fixing compound tokens can improve sentence segmentation results), and special post-correction steps are applied to improve quality of sentence segmentation;
* Morphological analysis interface has been redesigned.
Morphological analyses are no longer attached to the layer 'words' (although they can be easily accessed through the words, if needed), but are contained in a separate layer named 'morph_analysis'.
* The morphological analysis process can now be more easily decomposed into analysis and disambiguation (using special taggers VabamorfAnalyzer and VabamorfDisambiguator).
Also, a tagger responsible for post-corrections of morphological analysis (PostMorphAnalysisTagger) has been introduced, and post-corrections for improving the quality of part-of-speech tags and of the analysis of numbers and pronouns have been implemented;
* Rules for converting morphological analysis categories from Vabamorf's format to GT (giellatekno) format have been ported from the previous version of EstNLTK.
Note, however, that the porting is not complete: full functionality requires 'clauses' annotation (which is currently not available);
* ...
* Other components of EstNLTK (such as the temporal expression tagger, and the named entity recognizer) are yet to be ported to the new version in the future;
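As a brief illustration of the redesigned segmentation and morphology pipeline, the sketch below tags the layers named above on a small example text. It assumes the standard `Text.tag_layer` entry point and the `Layer.text` property; the example sentence and printed output are only illustrative.

```python
from estnltk import Text

text = Text('Vt. lisainfot aadressil https://www.err.ee!')
# Tagging 'morph_analysis' also creates its prerequisite layers:
# 'tokens', 'compound_tokens', 'words' and 'sentences'.
text.tag_layer(['morph_analysis'])

print(text['tokens'].text)           # basic tokens
print(text['compound_tokens'].text)  # e.g. the abbreviation and the web address kept together
print(text['words'].text)            # words built from 'tokens' and 'compound_tokens'
print(text['morph_analysis'])        # analyses live in a separate layer, not in 'words'
```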

Added
* SyntaxIgnoreTagger, which can be used for detecting parts of text that should be ignored by the syntactic analyser.
Note: it is yet to be integrated with the pre-processing module of syntactic analysis;

1.7.3

Changed

* Updated `BaseText`:
* `BaseText.sorted_layers` covers both span layers and relation layers now. However, for backwards compatibility, it returns only sorted span layers by default; set the flag `relation_layers=True` to include relation layers (see the sketch after this list);
* removed `BaseText.sorted_relation_layers`;
* updated `Text` import/export functions to take into account the dependencies between span and relation layers;
* Updated `Layer` and `RelationLayer`:
* `None` values will appear translucent by default in the HTML representation;
* Updated `json_to_layer`: added possibility to load a single layer instead of a list of layers. Examples in [tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/converters/json_exporter_importer.ipynb);
* Removed deprecated module `estnltk.resolve_layer_dag` (use `estnltk.default_resolver` instead);
* Updated `UserDictTagger`: made `add_word` & `add_words_from_csv_file` non-public methods, which should no longer be used directly. Instead, `UserDictTagger`'s constructor should be used for adding all words;
* Added [deprecation warning](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/B_morphology/syllabification.ipynb) about `Vabamorf`'s syllabifier. Instead of using `Vabamorf`'s syllabifier, please use [the finite state transducer based syllabifier](https://gitlab.com/tilluteenused/docker_elg_syllabifier) which provides a complete syllabification functionality of Estonian;
* Updated `parse_enc`:
* adapted to processing ENC 2023 .vert files;
* added `add_document_index` option to `parse_enc` (saves documents' original locations in the .vert file) for processing ENC 2023;
* added `focus_block` parameter to enable and control data parallelization;
* added `extended_morph_form` parameter to enable importing of additional morphological form information in CG categories. This is possible only in ENC 2021 and ENC 2023 versions of the corpora which have both regular `morph_analysis` and `morph_extended` annotations available.
* details in [the tutorial](https://github.com/estnltk/estnltk/blob/592a1689b9e07822d821d5b7bfa18f5981ec8f6d/tutorials/corpus_processing/importing_text_objects_from_corpora.ipynb);
* Updated `PhraseTagger`, `RegexTagger`, `SpanTagger`, `SubstringTagger` with `get_decorator_inputs` method, which can be used for debugging while creating rules, see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/taggers/rule_taggers/B_decorator_development.ipynb) for examples;
* Renamed `NerWebTagger`'s parameter `ner_output_layer` to `output_layer`;
* Renamed `SyntaxDependencyRetagger`'s parameter `conll_syntax_layer` to `syntax_layer`;
* Renamed `HfstClMorphAnalyser`'s method `lookup` to `analyze_token` and added corresponding implementation;
* Refactored & simplified `PhraseExtractor` interface;
* Removed `legacy.dict_taggers`;
* Updated `estnltk/setup.py`: customised build_py to build swig extensions before python modules;
* Updated `StanzaSyntaxEnsembleTagger`'s `majority_voting` algorithm: added Chu–Liu/Edmonds' post-processing to assure a valid tree structure;
* Updated `CoreferenceTagger`: added parameter `xgb_tree_method`, which allows setting the `tree_method` training parameter in `xgboost` (since `xgboost` version 2.0, the default `tree_method` has changed, so this parameter allows rolling back to the previous `tree_method` to restore the old behaviour of the model);
* Refactored `BertTagger` & `RobertaTagger`:
* create unambiguous layers by default, avoiding unnecessary nested lists in the output; Examples in [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/E_embeddings/bert_embeddings_tagger.ipynb);
* Optimized `PgCollection` initialization: do not count all rows during the initialization. This should make initialization relatively fast even for large tables;
* Removed `PostgresStorage` from short import path `estnltk.storage`. Please use long import path: `from estnltk.storage.postgres import PostgresStorage`;
* Refactored `PgCollection`: moved `create_collection_table` to `CollectionStructureBase` as creating collection table now depends on the collection structure version; stand-alone function `create_collection_table` is deprecated;
* Renamed `CollectionStructureBase.create_table` -> `create_layer_info_table`;
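A minimal sketch of the updated `sorted_layers` behaviour described above; the `relation_layers` flag is the one introduced in this release, and the example uses only span layers:

```python
from estnltk import Text

text = Text('Tere, maailm!').tag_layer(['morph_analysis'])

# Default: only span layers, sorted by their dependencies (backwards compatible)
print([layer.name for layer in text.sorted_layers()])

# New flag: also include relation layers in the result
print([layer.name for layer in text.sorted_layers(relation_layers=True)])
```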


Added

* Updated `RelationLayer`:
* added possibility to define enveloping `RelationLayer`;
* added `display_order`, which can be used for specifying the order of `span_names` & `attributes` in the HTML representation of the layer;
* added `display()` method to `RelationLayer` for showing relation (named span) annotations in text;
* added `relations_v1` serialization module for serializing updated version of the layer;
* For examples about the updated relation layer, please see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/system/relation_layer.ipynb);
* Added `DisplayNamedSpans` & `NamedSpanVisualiser` classes that support `RelationLayer` visualisation;
* Updated `Relation`: added HTML representation;
* Updated `NamedSpan`: added `resolve_attribute` method, which allows access to foreign attributes; Examples in [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/system/relation_layer.ipynb);
* Updated `TokensTagger`: added post-corrections for quotation marks (flag `apply_quotes_postfixes`);
* Added `LocalTokenSplitter` that splits tokens into smaller tokens based on regular expression patterns and user-defined functions for determining the split point; For details, see [the tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/A_text_segmentation/01_tokens.ipynb);
* Updated `VabamorfAnalyzer`: added function `analyze_token` for analysing a single word (see the sketch after this list); usage details in [the tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/B_morphology/01_morphological_analysis.ipynb);
* Added `TimeLocTagger` which tags time/location OBL phrases based on UD syntax layer;
* Added `PropBankPreannotator` which tags Estonian PropBank semantic roles based on a (manually-crafted) lexicon; Note that this is a preliminary version of the tagger;
* Added `RegexElement`, `StringList` and `ChoiceGroup` classes that wrap around [regex library](https://pypi.org/project/regex/) and allow to systematically document and test regular expressions. For usage details, please see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/taggers/rule_taggers/A_regex_development.ipynb);
* Added a [tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/taggers/rule_taggers/B_decorator_development.ipynb) about rule_tagger's decorator development (by swenlaur);
* Added `enc_layer_to_conll` for converting ENC morphosyntactic layer to CONLLU string. See details in the [tutorial](https://github.com/estnltk/estnltk/blob/38a50d30c938aed8c811b9ef0b296d0e4f01fcc0/tutorials/converters/conll_exporter.ipynb);
* Added `syntax_phrases_v0` serialization module (used by `PhraseExtractor`);
* Updated `StanzaSyntaxEnsembleTagger`: added calculation of predictions' entropy (optional); For usage details, see [this tutorial](https://github.com/estnltk/estnltk/blob/caa53bc4cda8198a93fc07a8a3146a23287dab5f/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
* Updated `PgCollection`'s create layer functions: if layer creation fails, document id-s will be logged;
* Updated `PgCollection` default version to V4:
* Introduced new collection structure `postgres/structure/v40/collectionstructure.py`;
* Added possibility to hide documents in pg collection. Note: a row-level security policy must be defined for the hiding to take effect;
* Enabled storing relation layers;
* Added `pg_collection.is_relation_layer` method;
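The `analyze_token` addition could be used roughly as follows. This is a sketch: the exact call signature and the shape of the returned analyses are assumptions, so refer to the linked morphology tutorial for authoritative usage.

```python
from estnltk.taggers import VabamorfAnalyzer

analyzer = VabamorfAnalyzer()

# Assumed call pattern: analyse one word form without constructing a Text object
analyses = analyzer.analyze_token('karudele')
for analysis in analyses:
    print(analysis)  # expected to carry lemma, part of speech, form, etc.
```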

Fixed

* Fixed `BaseLayer` `__getitem__` & `get` methods: preserve `serialisation_module` while extracting a sub-layer;
* Fixed `UserDictTagger`: `morph_layer`'s spans are now properly changed so that old values would not reappear after retagging;
* Fixed DeprecationWarning in NER `model_storage_util`;
* Fixed `syllabify_word`: do not crash on empty input;
* Fixed `RegexTagger` & `SpanTagger`: drop annotations if global decorator insists;
* Fixed `TimeLocDecorator`:
* proper reading of lemmas list from file;
* avoid circular wordnet import;
* Fixed matplotlib's imports (use internal imports to avoid occasional importing errors in Windows conda packages);
* Fixed `Wordnet.__del__` error (before closing database, assure it is open);
* Fixed `CoreferenceTagger.expand_mentions_to_named_entities`: remove any duplicate relations;
* Fixed a preprocessing bug in `StanzaSyntaxTagger`/`StanzaSyntaxEnsembleTagger` introduced by the stanza version update to 1.7.0+;
* Fixed `EstBERTNERTagger`: added proper input tokenization that does not fail on rare unicode symbols;
* Fixed `BertTagger` & `RobertaTagger`:
* added proper tokenization processing that does not fail on rare unicode symbols;
* do not use deprecated `tokenizer.encode_plus`;

1.7.2

Changed

* Renamed `PgCollection.meta` -> `meta_columns`;
* Deprecated `PgCollection.create()`. Use `PostgresStorage.add_collection` method to create new collections;
* Deprecated `PgCollection.delete()`, `PostgresStorage.delete(collection)` and `PostgresStorage.__delitem__(collection)`. Use `PostgresStorage.delete_collection` method to remove collections;
* Deprecated `PgCollection.select_fragment_raw()` (no longer relevant) and `continue_creating_layer` (use `create_layer(..., mode="append")` instead);
* Deprecated `PgCollection.has_fragment()`, `get_fragment_names()`, `get_fragment_tables()`. Use `collection.has_layer(name, layer_type="fragmented")` and `collection.get_layer_names_by_type(layer_type="fragmented")` instead;
* Merged `PgCollection.create_fragment` into `PgCollection.create_fragmented_layer`;
* Merged `PgCollection._create_layer_table` into `PgCollection.add_layer`;
* Removed legacy auto-insert behaviour from `StorageCollections.load()`;
* Refactored `PostgresStorage`: upon connecting to database, a new schema is now automatically created if the flag `create_schema_if_missing` has been set and the user has enough privileges. No need to manually call `create_schema` anymore;
* Refactored `StorageCollections` & `PostgresStorage`: relocated `storage_collections` table insertion and deletion logic to `PostgresStorage`;
* Refactored `PgCollection.add_layer`: added `layer_type` parameter, deprecated `fragmented_layer` parameter and added `'multi'` to layer types;
* Replaced function `conll_to_str` with `converters.conll.layer_to_conll`. For usage, see [this tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/C_syntax/01_syntax_preprocessing.ipynb) ("CONLL exporter");
* Refactored `DateTagger`, `AddressPartTagger`, `SyntaxIgnoreTagger`, `CompoundTokenTagger`: use new RegexTagger instead of the legacy one;
* Refactored `AdjectivePhrasePartTagger`: use `rule_taggers` instead of legacy `dict_taggers`;
* Updated `StanzaSyntax(Ensemble)Tagger`: random picking of ambiguous analyses is no longer deterministic: you'll get different results on each run if the input is morphologically ambiguous. However, if needed, you can use seed values to ensure repeatability. For details, see [stanza parser tutorials](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
* Updated `StanzaSyntaxEnsembleTagger`:
* if the user attempts to process sentences longer than 1000 words with GPU / CUDA, a guarding exception will be thrown. Pass parameter `gpu_max_words_in_sentence=None` to the tagger to disable the exception;
* added `aggregation_algorithm` parameter, which defaults to `'las_coherence'` (the same algorithm that has been used in the previous versions);
* added a new aggregation algorithm: `'majority_voting'`. With majority voting, the input is processed token-wise, and for each token the head & deprel that get the most votes from the models are picked. Note, however, that this method can produce invalid tree structures, as there is no mechanism ensuring that the majority-voted tokens form a valid tree;
* Renamed `SoftmaxEmbTagSumWebTagger` to `NeuralMorphDisambWebTagger` and made the following updates:
* `NeuralMorphDisambWebTagger` is now a `BatchProcessingWebTagger`;
* `NeuralMorphDisambWebTagger` is now also a `Retagger` and can be used to disambiguate an ambiguous `morph_analysis` layer. In the same vein, `NeuralMorphTagger` was also made a `Retagger` and can be used for disambiguation. For details on the usage, see [the neural morph tutorial](https://github.com/estnltk/estnltk/blob/b67ca34ef0702bb7d7fbe1b55639327dfda55830/tutorials/nlp_pipeline/B_morphology/08_neural_morph_tagger_py37.ipynb);
* `estnltk_neural` package requirements: removed explicit `tensorflow` requirement.
* Note, however, that `tensorflow <= 1.15.5` (along with Python `3.7`) is still required if you want to use `NeuralMorphTagger`;
* `Wordnet`:
* default database is no longer distributed with the package, wordnet now downloads the database automatically via `estnltk_resources`;
* alternatively, a local database can now be imported via parameter `local_dir`;
* updated wordnet database version to **2.6.0**;
* `HfstClMorphAnalyser`:
* the model is no longer distributed with the package, the analyser now downloads model automatically via `estnltk_resources`;
* Refactored `BatchProcessingWebTagger`:
* renamed parameter `batch_layer_max_size` -> `batch_max_size`;
* the tagger now has 2 working modes: a) batch splitting guided by a text size limit, b) batch splitting guided by a layer size limit (the old behaviour);
* Updated `vabamorf`'s function `syllabify_word`:
* made the compound word splitting heuristic more tolerant to mismatches; as a result, words whose root tokens do not match the surface form exactly can now be syllabified more properly. Examples: `kolmekümne` (`kol-me-küm-ne`), `paarisada` (`paa-ri-sa-da`), `ühesainsas` (`ü-hes-ain-sas`). However, if you need the old syllabification behaviour, pass parameter `tolerance=0` to the function, e.g. `syllabify_word('ühesainsas', tolerance=0)`.
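For example, assuming the function is imported from `estnltk.vabamorf.morf` (as in earlier versions), the new and old behaviours can be compared like this:

```python
from estnltk.vabamorf.morf import syllabify_word

# New, more tolerant default heuristic: syllabifies 'ühesainsas' as ü-hes-ain-sas
print(syllabify_word('ühesainsas'))

# Old behaviour: require exact matches between root tokens and the surface form
print(syllabify_word('ühesainsas', tolerance=0))
```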

Added

* `MultiLayerTagger` -- interface for taggers that create multiple layers at once;
* `NerWebTagger` that tags NER layers via [tartuNLP NER webservice](https://ner.tartunlp.ai/api) (uses EstBERTNER v1 model). See [this tutorial](https://github.com/estnltk/estnltk/blob/b970ea98532921a4e06022fff2cd3755fc181edf/tutorials/nlp_pipeline/D_information_extraction/02_named_entities.ipynb) for details;
* `EstBERTNERTagger` that tags NER layers using huggingface EstBERTNER models. See [this tutorial](https://github.com/estnltk/estnltk/blob/b970ea98532921a4e06022fff2cd3755fc181edf/tutorials/nlp_pipeline/D_information_extraction/02_named_entities.ipynb) for details;
* `RelationLayer` -- new type of layer for storing information about relations between entities mentioned in text, such as coreference relations between names and pronouns, or semantic roles/argument structures of verbs. However, `RelationLayer` has not yet been completely integrated with EstNLTK's tools, and it has the following limitations:
* you cannot access attributes of foreign layers (such as `lemmas` from `morph_analysis`) via spans of a relation layer;
* `estnltk_core.layer_operations` do not support `RelationLayer`;
* `estnltk.storage.postgres` does not support `RelationLayer`;
* `estnltk.visualisation` does not handle `RelationLayer`;

For usage examples, see the [RelationLayer's tutorial](https://github.com/estnltk/estnltk/blob/b8ad0932a852daedb1e3eddeb02c944dd1f292ee/tutorials/system/relation_layer.ipynb).

* `RelationTagger` -- interface for taggers creating `RelationLayer`-s. Instructions on how to create a `RelationTagger` can be found in [this tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/taggers/base_tagger.ipynb);
* `WebRelationTagger` & `BatchProcessingWebRelationTagger`, which allow creating web-based `RelationTagger`-s;
* `CoreferenceTagger`, which detects pronominal coreference relations. The tool is based on [Estonian Coreference System v1.0.0](https://github.com/SoimulPatriei/EstonianCoreferenceSystem) and currently relies on stanza 'et' models for pre-processing the input text. In the future, the tool will also become available via a web service (by `CoreferenceV1WebTagger`). For details, see the [coreference tutorial](https://github.com/estnltk/estnltk/blob/28814a3fa9ff869cd4cfc88308f6ce7e29157889/tutorials/nlp_pipeline/D_information_extraction/04_pronominal_coreference.ipynb);
* Updated `VabamorfDisambiguator`, `VabamorfTagger` & `VabamorfCorpusTagger`: added possibility to preserve phonetic mark-up (even with disambiguation);
* `UDMorphConverter` -- tagger that converts Vabamorf's morphology categories to Universal Dependencies morphological categories. Note that the conversion can introduce additional ambiguities as there is no disambiguation included, and roughly 3% to 9% of words do not obtain correct UD labels with this conversion. More details in [tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/B_morphology/06_morph_analysis_with_ud_categories.ipynb);
* `RobertaTagger` for tagging `EMBEDDIA/est-roberta` embeddings. The interface is analogous to that of `BertTagger`. [Tutorial](https://github.com/estnltk/estnltk/blob/e223a7e6245d29a6b1838335bfa3872a0aa92840/tutorials/nlp_pipeline/E_embeddings/bert_embeddings_tagger.ipynb).
* `BertTokens2WordsRewriter` -- tagger that rewrites BERT tokens layer to a layer enveloping EstNLTK's words layer. Can be useful for mapping Bert's output to EstNLTK's tokenization (currently used by `EstBERTNERTagger`).
* `PhraseExtractor` -- tagger for removing phrases and specific dependency relations based on UD-syntax;
* `ConsistencyDecorator` -- decorator for PhraseExtractor. Calculates syntax conservation scores after removing phrase from text;
* `StanzaSyntaxTaggerWithIgnore` -- entity ignore tagger. Retags text with StanzaSyntaxTagger and excludes phrases found by PhraseExtractor;
* `estnltk.resource_utils.delete_all_resources`. Apply it before uninstalling EstNLTK to remove all resources;
* `clauses_and_syntax_consistency` module, which allows to 1) detect potential errors in clauses layer using information from the syntax layer, 2) fix clause errors with the help of syntactic information. [Tutorial](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/nlp_pipeline/F_annotation_consistency/clauses_and_syntax_consistency.ipynb);
* `PostgresStorage` methods (see the sketch after this list):
* `add_collection`
* `refresh`
* `delete_collection`
* `PgCollection` methods:
* `refresh`
* `get_layer_names_by_type`
* `PgCollectionMeta` (provides views to `PgCollection`'s metadata, and allows to query metadata) and `PgCollectionMetaSelection` (read-only iterable selection over `PgCollection`'s metadata values);
* Parameter `remove_empty_nodes` to `conll_to_text` importer -- if switched on (default), then empty / null nodes (ellipsis in the enhanced representation) will be discarded (left out from textual content and also from annotations) while importing from conllu files;
* Added a simplified example about how to get whitespace tokenization for words to tutorial [`restoring_pretokenized_text.ipynb`](https://github.com/estnltk/estnltk/blob/4ba6d9896b851d0a922a6a43bf2cc08a09667802/tutorials/corpus_processing/restoring_pretokenized_text.ipynb);
* `pg_operations.drop_all`;
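A rough sketch of the new collection management workflow via `PostgresStorage.add_collection` and `delete_collection`; the connection parameters below are placeholders and should be adapted to your database setup:

```python
from estnltk.storage.postgres import PostgresStorage

# Placeholder connection settings
storage = PostgresStorage(host='localhost', port=5432, dbname='estnltk_db',
                          user='my_user', password='my_password',
                          schema='my_schema', create_schema_if_missing=True)

# Collections are now created via the storage object ...
collection = storage.add_collection('my_collection')

# ... insert Text objects, add layers, query the collection ...

# ... and removed via the storage object as well
storage.delete_collection('my_collection')
storage.close()
```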

Fixed

* `extract_(discontinuous_)sections`: should now also work on a non-ambiguous layer that has a parent;
* `BaseText.topological_sort`: should now also work on layers with malformed/unknown dependencies;
* `CompoundTokenTagger`: 2nd level compounding rules now also work on detached layers;
* `TimexTagger`'s rules: disabled extraction of too long year values (which could break _Joda-Time_ integer limits);
* Bug that caused collection metadata to disappear when using `PgCollection.insert` (related to `PgCollection.column_names` not returning automatically correct metadata column names on a loaded collection; newly introduced `PgCollectionMeta` solved that problem);
* `StanzaSyntax(Ensemble)Tagger`: should now also work on detached layers;
* Fixed `BaseLayer.diff`: now also takes account of a difference in `secondary_attributes`;
* Fixed `downloader._download_and_install_hf_resource`: disabled default behaviour and `use_symlinks` option, because it fails under Windows;
* Fixed `download`: made it more flexible on parsing (idiosyncratic) 'Content-Type' values;
* Fixed `BertTagger` tokenization: `BertTagger` can now better handle misalignments between bert tokens and word spans caused by emojis, letters with diacritics, and the invisible token `\xad`;

1.7.1

Changed

* Structure and organization of [EstNLTK's tutorials](https://github.com/estnltk/estnltk/tree/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials), including:
* Relocated introductory tutorials into the folder ['basics'](https://github.com/estnltk/estnltk/tree/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/basics);
* Relocated 'estner_training' tutorials to 'nlp_pipeline/D_information_extraction';
* Updated syntax tutorials and split them into parser-specific sub-tutorials:
* [maltparser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_maltparser.ipynb);
* [stanza's parser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_stanza.ipynb);
* [udpipe's parser tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_udpipe.ipynb);
* [vislcg3 tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/03_syntactic_analysis_with_vislcg3.ipynb);
* Updated `parse_enc` -- it can now be used for parsing [ENC 2021](https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2021-vert/f176ccc0d05511eca6e4fa163e9d454794df2849e11048bb9fa104f1fec2d03f/). See the details from [the tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/corpus_processing/importing_text_objects_from_corpora.ipynb);
* The function `parse_enc_file_iterator` now attempts to _automatically fix malformed paragraph annotations_. As a result, more words and sentences can be imported from corpora, but the side effect is that there will be artificially created paragraph annotations -- even for documents that do not have paragraph annotations originally. The setting can be turned off, if needed;
* Updated `get_resource_paths` function: added EstNLTK version checking. A resource description can now contain version specifiers, which declare the estnltk or estnltk_neural version required for using the resource. Using version constraints is optional, but if they are used and not satisfied, then `get_resource_paths` will neither download the resource nor return its path (see the sketch after this list);
* Relocated `estnltk.transformers` (`MorphAnalysisWebPipeline`) into `estnltk.web_taggers`;
* Refactoring: moved functions `_get_word_texts` & `_get_word_text` to `estnltk.common`;
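A sketch of how `get_resource_paths` is typically called; the argument names (`only_latest`, `download_missing`) and the resource name are assumptions here, so consult the resources tutorial for the exact interface:

```python
from estnltk import get_resource_paths

# Ask for the latest version of a resource and download it if it is missing.
# If the resource description declares version constraints that the installed
# estnltk / estnltk_neural do not satisfy, no path is returned.
path = get_resource_paths('stanzasyntaxtagger', only_latest=True, download_missing=True)
print(path)
```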

Added

* `ResourceView` class, which lists EstNLTK's resources as a table, and shows their download status. See the [resources tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/basics/estnltk_resources.ipynb) for details.
* `SyntaxIgnoreCutter` class, which cuts the input Text object into a smaller Text by leaving out all spans of the syntax_ignore layer (produced by `SyntaxIgnoreTagger`). The resulting Text can then be analysed syntactically while skipping parts of the text that may be difficult to analyse. For details, see the [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/02_syntax_preprocessing_with_ignoretagger.ipynb);
* function `add_syntax_layer_from_cut_text`, which can be used to carry over the syntactic analysis layer from the cut text (created by `SyntaxIgnoreCutter`) to the original text. For details, see the [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/02_syntax_preprocessing_with_ignoretagger.ipynb);

Fixed

* Syntax preprocessing [tutorial](https://github.com/estnltk/estnltk/blob/811978b24b9bacd4b53d301d379ffad2bd8b41e9/tutorials/nlp_pipeline/C_syntax/01_syntax_preprocessing.ipynb): updated to describe the current state of preprocessing;

1.7.0

Changed

* EstNLTK's tools that require large resources (e.g. syntactic parsers and neural analysers) can now download resources automatically upon initialization. This stops the program flow with an interactive prompt asking for the user's permission to download the resource. However, you can pre-download the resource in order to avoid the interruption; see this [tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/estnltk_resources.ipynb) for details.

* Structure and organization of [EstNLTK's tutorials](https://github.com/estnltk/estnltk/tree/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials). However, the work on updating tutorials is still not complete.

* `PgCollection`: now uses `CollectionStructure.v30` by default.

* Disambiguator (a system tagger): it's now a Retagger, but can work either as a retagger or a tagger, depending on the inputs. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/taggers/system/disambiguator.ipynb).

Added

* `downloader` & `resources_utils` for downloading additional resources and handling paths of downloaded resources. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/estnltk_resources.ipynb)

* Collocation net -- allows to find different connections between words based on the collocations each word was in. [Tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/collocation_net/tutorial.ipynb).

* `PgCollection`: added `CollectionStructure.v30` which allows creating sparse layer tables. Sparse layer tables do not store empty layers, which can save storage space and allow faster queries over tables & collections. The [main db tutorial](https://github.com/estnltk/estnltk/blob/cad31cc63b583bbef56b5f5fbcc3218ba8f5461c/tutorials/storage/storing_text_objects_in_postgres.ipynb) exemplifies the creation and usage of sparse layers (see also the sketch at the end of this list).

* `PgCollection.create_layer` & `PgCollection.add_layer` now take parameter `sparse=True` which turns layer into a sparse layer;

* `PgCollection.select` now has a boolean parameter `keep_all_texts`: turning the parameter off yields only texts with non-empty sparse layers;

* `PgSubCollection` now has methods `create_layer` and `create_layer_block` which can be used to create a sparse layer from specific subcollection;
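A sketch of how the sparse-layer options above fit together, assuming an existing collection with inserted documents; the connection settings are placeholders and the exact parameter placement may differ slightly from the tutorial:

```python
from estnltk.storage.postgres import PostgresStorage
from estnltk.taggers import VabamorfTagger

storage = PostgresStorage(host='localhost', port=5432, dbname='estnltk_db',
                          user='my_user', password='my_password', schema='my_schema')
collection = storage['my_collection']   # an existing collection with inserted texts
tagger = VabamorfTagger()

# Create the tagger's output layer as a sparse layer: empty layers are not stored,
# which saves space and speeds up queries
collection.create_layer(tagger=tagger, sparse=True)

# Turning keep_all_texts off yields only the texts with a non-empty sparse layer
for key, text in collection.select(layers=[tagger.output_layer], keep_all_texts=False):
    print(key, text[tagger.output_layer])

storage.close()
```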

Fixed

* `BaseText.__repr__` method using a wrong variable name;

* `NeuralMorphTagger`'s configuration reading and handling: model locations can now be freely customized;

* `TimexTagger`'s rules on detecting dates with roman numeral months & dates with slashes.

1.7.0rc0

EstNLTK has gone through a major package restructuring and refactoring process.

Package restructuring

EstNLTK has been split into 3 Python packages:

* `estnltk-core` -- package containing core datastructures, interfaces and data conversion functions of the EstNLTK library;
* `estnltk` -- the standard package, which contains basic linguistic analysis (including Vabamorf morphological analysis, syntactic parsing and information extraction models), system taggers and Postgres database tools;
* `estnltk-neural` -- package containing linguistic analysis based on neural models (Bert embeddings tagger, Stanza syntax taggers and neural morphological tagger);

Normally, end users only need to install `estnltk` (as `estnltk-core` will be installed automatically).

Tools in `estnltk-neural` require the installation of deep learning frameworks (`tensorflow`, `pytorch`) and are demanding in terms of computational resources; they also rely on large models (which need to be downloaded separately).

Changed

* `Text`:

* method `text.analyse` is deprecated and no longer functional. Use `text.tag_layer` to create layers. Calling `text.analyse` will display an error message with additional information on migrating from `analyse` to `tag_layer`;
* added instance variable `text.layer_resolver` which uses EstNLTK's default pipeline to create layers. The following new layers were added to the pipeline: `'timexes'`, `'address_parts'`, `'addresses'`, `'ner'`, `'maltparser_conll_morph'`, `'gt_morph_analysis'`, `'maltparser_syntax'`, `'verb_chains'`, `'np_chunks'`;
* Shallow copying of a `Text` is no longer allowed. Only `deepcopy` can be used;
* Renamed method: `text.list_layers` -> `text.sorted_layers`;
* Renamed property: `text.attributes` -> `text.layer_attributes`;
* `Text` is now a subclass of `BaseText` (from `estnltk-core`). `BaseText` stores raw text, metadata and layers, has methods for adding and removing layers, and provides layer access via indexing (square brackets). `Text` provides an alternative access to layers (layers as attributes) and allows calling text analysers / the NLP pipeline (`tag_layer`);

* `Layer`:
* Removed `to_dict()` and `from_dict()` methods. Use `layer_to_dict` and `dict_to_layer` from `estnltk.converters` instead;
* Shallow copying of a `Layer` is no longer allowed. Only `deepcopy` can be used;
* Renamed `Layer.attribute_list()` to `Layer.attribute_values()`;
* indexing attributes (`start`, `end`, `text`) should now be passed to the method via keyword argument `index_attributes`. They will be prepended to the selection of normal attributes;
* Renamed `Layer.metadata()` to `Layer.get_overview_dataframe()`;
* Method `Layer.add_annotation(base_span, annotations)`:
* now allows passing `annotations` as a dictionary (formerly, `annotations` could be passed only as keyword arguments); see the sketch after this list;
* `Annotation` object cannot be passed as a `base_span`;
* HTML representation: maximum length of a column is 100 characters and longer strings will be truncated; however, you can change the maximum length via `OUTPUT_CONFIG['html_str_max_len']` (a configuration dictionary in `estnltk_core.common`);
* `Layer` is now a subclass of `BaseLayer` (from `estnltk-core`). `BaseLayer` stores text's annotations, attributes of annotations and metadata, has methods for adding and removing annotations, and provides span/attribute access via indexing (square brackets). `Layer` adds layer operations (such as finding descendant and ancestor layers, and grouping spans or annotations of the layer), provides an alternative access to local attributes (via dot operator), and adds possibility to access foreign attributes (e.g. attributes of a parent layer).
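For instance, the dictionary form of `add_annotation` could be used as sketched below; the layer name, attribute and spans are made up for illustration:

```python
from estnltk import Text, Layer

text = Text('Tere, maailm!')
layer = Layer(name='greetings', attributes=['kind'], text_object=text)

# Annotations can now be passed as a dictionary ...
layer.add_annotation((0, 4), {'kind': 'greeting'})
# ... or, as before, as keyword arguments
layer.add_annotation((6, 12), kind='addressee')

text.add_layer(layer)
print(text['greetings'])
```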

* `SpanList`/`EnvelopingSpan`/`Span`/`Annotation`:
* Removed `to_records()`/`to_record()` methods. The same functionality is provided by function `span_to_records` (from `estnltk_core.converters`), but note that the conversion to records does not support all EstNLTK's data structures and may result in information loss. Therefore, we recommend converting via functions `layer_to_dict`/`text_to_dict` instead;
* Method `Span.add_annotation(annotation)` now allows passing `annotation` as a dictionary (formerly, `annotation` could be passed only as keyword arguments);
* Constructor `Annotation(span, attributes)` now allows passing `attributes` as a dictionary (formerly, `attributes` could be passed only as keyword arguments);

* `Tagger`:
* trying to `copy` or `deepcopy` a tagger now raises `NotImplementedError`. Copying a tagger is a specific operation that requires handling of the tagger's resources, and therefore no copying should be attempted by default. Instead, you should create a new tagger instance;

* `PgCollection`: Removed obsolete `create_layer_table` method. Use `add_layer` method instead.

* `estnltk.layer_operations`
* moved obsolete functions `compute_layer_intersection`, `apply_simple_filter`, `count_by_document`, `dict_to_df`, `group_by_spans`, `conflicts`, `iterate_conflicting_spans`, `combine`, `count_by`, `unique_texts`, `get_enclosing_spans`, `apply_filter`, `drop_annotations`, `keep_annotations`, `copy_layer` (former `Layer.copy()`) to `estnltk_core.legacy`;

* Renamed `Resolver` -> `LayerResolver` and changed:
* `default_layers` (used by `Text.tag_layer`) are held at the `LayerResolver` and can be changed;
* `DEFAULT_RESOLVER` is now available from `estnltk.default_resolver`. Former location `estnltk.resolve_layer_dag` was preserved for legacy purposes, but will be removed in future;
* Renamed property `list_layers` -> `layers`;
* HTML/string representations now display default_layers and a table, which lists names of creatable layers, their prerequisite layers, names of taggers responsible for creating the layers, and descriptions of corresponding taggers;
* Trying to `copy` or `deepcopy` a layer resolver results in an exception. You should only create new instances of `LayerResolver` -- use function `make_resolver()` from `estnltk.default_resolver` to create a new default resolver;

* Renamed `Taggers` -> `TaggersRegistry` and changed:
* now retaggers can also be added to the registry. For every tagger creating a layer, there can be 1 or more retaggers modifying the layer. Also, retaggers of a layer can be removed via `clear_retaggers`;
* taggers and retaggers can now be added as `TaggerLoader` objects: they declare input layers, output layer and importing path of a tagger, but do not load the tagger until explicitly demanded (_lazy loading_);

* Refactored `AnnotationRewriter`:
* tagger should now clearly define whether it only changes attribute values (default) or modifies the set of attributes in the layer;
* tagger should not add or delete annotations (this is a job for `SpanAnnotationsRewriter`);

* Restructured `estnltk.taggers` into 3 submodules:
* `standard` -- tools for standard NLP tasks in Estonian, such as text segmentation, morphological processing, syntactic parsing, named entity recognition and temporal expression tagging;
* `system` -- system level taggers for finding layer differences, flattening and merging layers, but also taggers for rule-based information extraction, such as phrase tagger and grammar parsing tagger;
* `miscellaneous` -- taggers made for very specific analysis purposes (such as date extraction from medical records), and experimental taggers (verb chain detection, noun phrase chunking);
* _Note_: this should not affect importing taggers: you can still import most of the taggers from `estnltk.taggers` (except neural ones, which are now in the separate package `estnltk-neural`);

* `serialisation_map` (in `estnltk.converters`) was replaced with `SERIALISATION_REGISTRY`:
* `SERIALISATION_REGISTRY` is a common registry used by all serialisation functions (such as `text_to_json` and `json_to_text` in `estnltk_core.converters`). The registry is defined in the package `estnltk_core` (contains only the `default` serialization module), and augmented in `estnltk` package (with `legacy_v0` and `syntax_v0` serialization modules);

* Renamed `estnltk.taggers.dict_taggers` -> `estnltk.taggers.system.rule_taggers` and changed:
* `Vocabulary` class is replaced by `Ruleset` and `AmbiguousRuleset` classes
* All taggers now follow a common structure based on a pipeline of static rules, dynamic rules and a global decorator
* Added new tagger `SubstringTagger` to tag occurrences of substrings in text
* Old versions of the taggers are moved to `estnltk.legacy` for backward compatibility

* Relocated TCF, CONLL and CG3 conversion utils to submodules in `estnltk.converters`;

* Relocated `estnltk.layer` to `estnltk_core.layer`;

* Relocated `estnltk.layer_operations` to `estnltk_core.layer_operations`;

* Moved functionality of `layer_operations.group_by_layer` into `GroupBy` class;

* Relocated `TextaExporter` to `estnltk.legacy` (not actively developed);

* Renamed `TextSegmentsTagger` -> `HeaderBasedSegmenter`;

* Renamed `DisambiguatingTagger` -> `Disambiguator`;

* Renamed `AttributeComparisonTagger` -> `AttributeComparator`;

* Relocated Vabamorf's default parameters from `estnltk.taggers.standard.morph_analysis.morf_common` to `estnltk.common`;

* Merged `EnvelopingGapTagger` into `GapTagger`:
* `GapTagger` now has 2 working modes:
* Default mode: look for sequences of consecutive characters not covered by input layers;
* EnvelopingGap mode: look for sequences of enveloped layer's spans not enveloped by input enveloping layers;

* Refactored `TimexTagger`:
* removed `TIMEXES_RESOLVER` and moved all necessary preprocessing (text segmentation and morphological analysis) inside `TimexTagger`;
* `'timexes'` is now a flat layer by default. It can be made enveloping `'words'`, but this can result in broken timex phrases due to differences in `TimexTagger`'s tokenization and EstNLTK's default tokenization;

* `Vabamorf`'s optimization:
* Disabled [Swig proxy classes](http://www.swig.org/Doc3.0/Python.html#Python_builtin_types). As a result, the morphological analysis is faster. However, this update is under testing and may not be permanent, because disabled proxy classes are known to cause conflicts with other Python Swig extensions compiled under different settings (for more details, see [here](https://stackoverflow.com/q/21103242) and [here](https://github.com/estnltk/estnltk/blob/b0d0ba6d943fb42b923fa6999c752fead927c992/dev_documentation/hfst_integration_problems/solving_stringvector_segfault.md));

* Dropped Python 3.6 support;


Added

* `Layer.secondary_attributes`: a list of the layer's attributes which will be skipped while comparing two layers. Usually this means that these attributes contain redundant information. Another reason for marking an attribute as _secondary_ is that the attribute is recursive, so skipping it avoids infinite recursion in the comparison;

* `Layer.span_level` property: an integer conveying the depth of the enveloping structure of this layer; `span_level=0` indicates no enveloping structure: spans of the layer mark raw text positions `(start, end)`, while `span_level > 0` indicates that spans of the layer envelop smaller-level spans (for details, see the `BaseSpan` docstring in `estnltk_core.layer.base_span`); see also the sketch at the end of this list;

* `Layer.clear_spans()` method that removes all spans (and annotations) from the layer. Note that clearing does not change the `span_level` of the layer, so spans added after the clearing must have the same level as before clearing;

* `find_layer_dependencies` function to `estnltk_core.layer_operations` -- finds all layers that the given layer depends on. Can also be used for reverse search: find all layers depending on the given layer (e.g. enveloping layers and child layers);

* `SpanAnnotationsRewriter` (a replacement for legacy `SpanRewriter`) -- a tagger that applies a modifying function on each span's annotations. The function takes span's annotations (a list of `Annotation` objects) as an input and is allowed to change, delete and add new annotations to the list. The function must return a list with modified annotations. Removing all annotations of a span is forbidden.
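A quick illustration of `span_level`, based on the definition above; the expected values follow from 'sentences' being an enveloping layer around 'words':

```python
from estnltk import Text

text = Text('Tere! Kuidas läheb?').tag_layer(['sentences'])

# 'words' spans mark raw text positions, so there is no enveloping structure
print(text['words'].span_level)      # expected: 0
# 'sentences' envelops spans of the 'words' layer
print(text['sentences'].span_level)  # expected: 1
```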

Fixed

* Property `Layer.end` giving wrong ending index;
* `Text` HTML representation: Fixed "FutureWarning: The frame.append method is deprecated /.../ Use pandas.concat instead";
* `Layer.ancestor_layers` and `Layer.descendant_layers` had their functionalities swapped (`ancestor_layers` returned descendants instead of ancestors); now they return what their names indicate;
* `Span.__repr__` now avoids overly long representations and renders fully only values of basic data types (such as `str`, `int`, `list`);
* `SyntaxDependencyRetagger` now marks `parent_span` and `children` as `secondary_attributes` in order to avoid infinite recursion in syntax layer comparison;
* `PgCollection`: `collection.layers` now returns `[]` in case of an empty collection;
* `PgCollection`: added proper exception throwing for cases where user wants to modify an empty collection;
