Estnltk

Latest version: v1.7.3

Safety actively analyzes 682404 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 2 of 4

1.6.9.1beta

This is an intermediate release of PyPI packages. The version 1.6.9b0 and 1.6.9.1b0 are equal considering the main functionalities, so no conda packages will be generated. The list of changes will be documented in the next release.

1.6.9beta

Changed

* `Tagger` API: added methods `get_layer_template` (returns an empty detached layer that contains all the proper attribute initializations), and `_make_layer_template` (a concrete implementation of `get_layer_template` in a derived class). New taggers should now provide `_make_layer_template` implementations in addition to `_make_layer` implementations, and it is recommended that `_make_layer` also uses `_make_layer_template` whenever it needs to create a new layer. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/f45ca4b546703b8e069c398403d81e825be5c186/tutorials/taggers/base_tagger.ipynb)

* Most `estnltk.taggers` were updated so that they have `_make_layer_template` implementations. However, in some system taggers (e.g. in `DiffTagger` and `DisambiguatingTagger`), the exact configuration of layer's attributes is not known before `_make_layer()` has been called, and thus `_make_layer_template` functionality is not supported;

* An important usage of layer templates is `PgCollection`'s new `add_layer` method, see [this tutorial](https://github.com/estnltk/estnltk/blob/3776a140b542855b5fef0b38a80828067e7b2022/tutorials/storage/storing_text_objects_in_postgres.ipynb);

* `PgCollection`'s method `create_layer_block`: an existing (unfinished) block can now be continued if `mode='append'` is passed to the method. For details about using `create_layer_block`, see [this tutorial](https://github.com/estnltk/estnltk/blob/3776a140b542855b5fef0b38a80828067e7b2022/tutorials/storage/storing_text_objects_in_postgres.ipynb)

* `layer_operations` function `flatten`: added parameter `disambiguation_strategy`. If `disambiguation_strategy='pick_first'` then the resulting layer will be disambiguated by preserving only the first annotation of every span. By default, the resulting layer will remain ambiguous.


Added

* method `add_layer` to `PgCollection` -- adds a new detached or fragmented layer to the collection. The method initializes a new empty layer before filling it with data. [Tutorial](https://github.com/estnltk/estnltk/blob/c7e093303566ce7691ffbf899841d741d654bc46/tutorials/storage/storing_text_objects_in_postgres.ipynb)

* new `layer_operations`: function `join_layers` concatenates a list of layers into one layer, and function `join_texts` concatenates a list of Text objects into a single Text object (roughly a reverse of `split_by` and `extract_sections` functionalities). Function `join_layers_while_reusing_spans` joins layers efficiently by reusing spans of the input layers (but as a result, the input layers will be broken). Details about `join_texts` are given in the [tutorial](https://github.com/estnltk/estnltk/blob/7256c8c692f7b718d7b6dee8504fcb705bd71997/tutorials/system/layer_operations.ipynb).

* `BatchProcessingWebTagger` -- a web tagger which processes the input via small batches ( splits large requests into smaller ones ). More detailed usage information [here](https://github.com/estnltk/estnltk/blob/7256c8c692f7b718d7b6dee8504fcb705bd71997/estnltk/taggers/web_taggers/v01/batch_processing_web_tagger.py).

* `BertEmbeddingsWebTagger`, `StanzaSyntaxEnsembleWebTagger` and `StanzaSyntaxWebTagger` were refactored in a way that they no longer throw `WebTaggerRequestTooLargeError` upon large requests, but use batch processing instead;


Fixed

* `MorphExtendedTagger`: now it is possible to customize input and output layer names. Also added unit tests for syntax preprocessing with customized layer names;

* `PostgresStorage` `collections` property: now collections list can be accessed even if it is empty;

* `MissingLayerQuery`: removed a major speed bottleneck;

* `BlockQuery`: added missing `required_layers` property and an unit test;

* `BufferedTableInsert`: buffer size limits are now checked before the insertion (to avoid large buffer limit exceedings);

1.6.8beta

Changed

* `HfstEstMorphAnalyser` was renamed to `HfstClMorphAnalyser` and it's engine was changed. It is no longer dependent on the PyPI `hfst` package, but uses [HFST command line tools](https://github.com/hfst/hfst/wiki/Command-Line-Tools) instead. Both file-based and stream-based communication modes are supported. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/hfst/morph_analysis_with_hfst_analyser.ipynb);

* `UserDictTagger`: dictionary entries should now be passed only via the constructor. Using methods `add_word()` and `add_words_from_csv_file()` directly is deprecated. [Tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_07a_morph_analysis_with_user_dict.ipynb);

* `CorpusBasedMorphDisambiguator`: the interface was changed in a way that the disambiguator now works with detached layers. The interface similar to `Retagger`'s `_change_layer` is provided by methods `_predisambiguate_detached_layers` and ` _postdisambiguate_detached_layers`. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_07b_morph_analysis_with_corpus-based_disambiguation.ipynb);

* Statistical syntactic parsers:
* `MaltParserTagger` now uses a model trained on `morph_analysis` layer by default. This is also the only model that is distributed with the package, other models need to be downloaded from the [github](https://github.com/estnltk/estnltk/tree/ba471626227238b2b83ef7a3479b315407c44807/estnltk/taggers/syntax/maltparser_tagger/java-res/maltparser);
* `UDPipeTagger`-s models now need to be downloaded separately from github. See [this tutorial](https://github.com/estnltk/estnltk/blob/ba471626227238b2b83ef7a3479b315407c44807/tutorials/syntax/syntax.ipynb) for details;

* Refactored `VabamorfTagger` & other morphology components:
* removed `re_sort_analyses` parameter, which is no longer relevant;
* added `use_reorderer` parameter, which turns on applying [`MorphAnalysisReorderer`](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_07c_morph_analysis_reordering.ipynb) as the last correction step. Note: the parameter is switched on by default, so that ambiguous morphological analyses are now always reordered (by their corpus frequency);

* `Layer.groupby` now accepts a string argument if you need to group by a single attribute. So, instead of `text.morph_analysis.groupby(['partofspeech'])`, you can write `text.morph_analysis.groupby('partofspeech')`. In addition, an attached layer can be grouped by other layer of the `Text` object simply by giving the name of the layer, e.g. `text.morph_analysis.groupby('sentences')`.

* `TIMEXES_RESOLVER`, which provides a pipeline for `TimeTagger` along with required tokenization fixes (which improve the quality of timex detection). For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/taggers/temporal_expression_tagger.ipynb);

* PostgreSQL interface was refactored:
* Removed `_select_by_key`, `get_layer_names` and `find_fingerprint` from `PgCollection`;
* Tabel insertion logic was moved from `PgCollection` to separate context manager classes: `BufferedTableInsert`, `CollectionTextObjectInserter` and `CollectionDetachedLayerInserter`;
* Refactored `Query.eval()` interface: `PgCollection` is now used as an input;
* `JsonbTextQuery` and `JsonbLayerQuery` were replaced by **`LayerQuery`**, which now works both on attached and detached layers. [Tutorial](https://github.com/estnltk/estnltk/blob/7efc7a60eb15be5775c6790c0b8ca5a06259e2ae/tutorials/storage/storing_text_objects_in_postgres.ipynb);

* Dropped Python 3.5 support, and added Python 3.8 support.

Added

* Function `make_userdict()` for automatically creating `UserDictTagger` based on a dictionary of mappings from incorrect spellings to correct spellings. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_07a_morph_analysis_with_user_dict.ipynb);

* Function `split_by_clauses`, which properly splits text into clauses,
considering all the discontinuous spans (embedded clauses) of the clauses layer. For this, a supporting function `extract_discontinuous_sections` was implemented, which allows to extract discontinuous sections from the `Text`. The function `split_by` was updated in a way that it applies `split_by_clauses` on the clauses layer by default. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/system/layer_operations.ipynb);

* Added parameters `predisambiguate` and `postdisambiguate` to `VabamorfTagger`, which can be used to turn on _text-based morphological disambiguation_ provided by `CorpusBasedMorphDisambiguator`. Note: by default, the parameters are not turned on. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_07b_morph_analysis_with_corpus-based_disambiguation.ipynb);

* Parameter `force_resolving_by_priority` to `GrammarParsingTagger`. This applies (experimental) post-resolving all conflicts by priority attributes of grammar rules. Details in [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/finite_grammar/introduction_to_finite_grammar.ipynb);

* `TokenSplitter` -- a retagger which can be used to make rule-based post-corrections to the tokens layer. Details are given in [this tutorial](https://github.com/estnltk/estnltk/blob/c5b30eb7b1c7eb6868ebda408a6c12a19f8dffa7/tutorials/nlp_pipeline/B_01_segmentation_tokens.ipynb);

* Added patterns for detecting hashtags and Twitter username mentions to `CompoundTokenTagger`. Can be switched on by the flag `tag_hashtags_and_usernames`. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_02_segmentation_compound_tokens.ipynb). Also, `PostMorphAnalysisTagger` was updated to provide some post-corrections on the corresponding tokens, see details [here](https://github.com/estnltk/estnltk/blob/2b47a4ce8e3a08b85b23b73a281836a23e20ac68/tutorials/nlp_pipeline/B_06_morphological_analysis.ipynb).

* The following new query types were added to PostgreSQL interface:
* `IndexQuery`, `SliceQuery` -- for querying `Text` objects by their collection indexes;
* `MetadataQuery` -- for querying `Text` objects by their metadata (collection or text metadata, for details, see this [tutorial](https://github.com/estnltk/estnltk/blob/7efc7a60eb15be5775c6790c0b8ca5a06259e2ae/tutorials/storage/storing_text_objects_in_postgres.ipynb);
* Updated `LayerNgramQuery`: added validation for the existence of `ngram_index` columns;

* Shuffling and sampling methods were added to `PgSubCollection`:
* method `permutate()` -- iterates texts in random order;
* method `sample()` -- selects a random sample of texts from the collection;
* method `sample_from_layer()` -- selects a random sample of spans from a specific layer of the collection;
* for details, see [this tutorial](https://github.com/estnltk/estnltk/blob/7efc7a60eb15be5775c6790c0b8ca5a06259e2ae/tutorials/storage/sampling_texts_and_layers_in_postgres.ipynb)

* Proper implementations of `head` and `tail` methods were added to `PgSubCollection`. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/7efc7a60eb15be5775c6790c0b8ca5a06259e2ae/tutorials/storage/storing_text_objects_in_postgres.ipynb);

* Possibility to append to an existing table with `PgCollection`'s `export_layer` method. Use the parameter `mode='append'` to switch on the appending mode. Read also [docstring of the method](https://github.com/estnltk/estnltk/blob/aadf87855f07b2a465eecdaebac95aef1caffeda/estnltk/storage/postgres/collection.py#L960-L1007).

* `StanzaSyntaxTagger` and `StanzaSyntaxEnsembleTagger` -- the syntax taggers use models trained with [Stanza](https://stanfordnlp.github.io/stanza/). See [tutorial](https://github.com/estnltk/estnltk/blob/ba471626227238b2b83ef7a3479b315407c44807/tutorials/syntax/syntax.ipynb) for usage, performance scores and links to models;

* `UDValidationRetagger` and `DeprelAgreementRetagger` -- retaggers for marking errors in parsed syntax. These only work for layers that make use of UD syntactic relations.
See [tutorial](https://github.com/estnltk/estnltk/blob/ba471626227238b2b83ef7a3479b315407c44807/tutorials/syntax/syntax.ipynb) for details and usage.


* WebTaggers -- taggers which annotate texts via webservices. Currently implemented web taggers: `VabamorfWebTagger`, `BertEmbeddingsWebTagger`, `SoftmaxEmbTagSumWebTagger`, `StanzaSyntaxWebTagger`, `StanzaSyntaxEnsembleWebTagger`. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/c5b30eb7b1c7eb6868ebda408a6c12a19f8dffa7/tutorials/taggers/web_taggers.ipynb);

* Wordnet method `get_synset_by_name` which can be used to retrieve a synset by its name attribute, and method `all_relation_types` for retrieving all relation types. Details in [this tutorial](https://github.com/estnltk/estnltk/blob/c5b30eb7b1c7eb6868ebda408a6c12a19f8dffa7/tutorials/wordnet/wordnet.ipynb).

Fixed

* `PostMorphAnalysisTagger`: added `input_words_layer` parameter, because it was required for checking `morph_analysis` parent;

* `Layer.display()` crashing on an empty layer;

* `extract_sections` crashing on discontinouos spans & `trim_overlapping=True`;

* estner `Trainer`:
* `trainer.train` is now called after features from all documents have been extracted (makes training much faster if the number of training documents is large);
* `ModelStorageUtil` now allows to save a newly trained model to the directory from which the settings were loaded. An exception is the default model dir, which should not be used.


* PostgreSQL interface:
* `pgpass_parsing`: reading password from the `pgpass_file` should work now;
* Fixed `PgCollection.create_layer`: missing\_layer is now properly specified in the data\_iterator;
* Fix in `SubstringQuery`: added missig `required_layers` property;
* Fix in `WhereClause`: added missing `_required_layers` for conjunction of `WhereClauses`;
* Fixes for `And` & `Or` queries: added missing `required_layers` properties;

1.6.7beta

Changed

* The module `parse_enc2017` was renamed to `parse_enc`. Tools were updated so that they can also be used to read *.vert files of the [Estonian National Corpus 2019](https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2019-vrt-vormingus/be71121e733b11eaa6e4005056b4002483e6e5cdf35343e595e6ba4576d839fb/). So, now both ENC 2017 corpus and ENC 2019 corpus files can be read. For details, see the renewed tutorial [here](https://github.com/estnltk/estnltk/blob/9535931e9ade45fe17d5bfd0f24e4411fac02999/tutorials/corpus_processing/importing_text_objects_from_corpora.ipynb).
* Updated `text.tag_layer(layer_names)` : `layer_names` can now also be a string. So, you can use `text.tag_layer('sentences')` or `text.tag_layer('morph_analysis')`;

Added

* Experimental noun phrase chunker, which can be used to detect non-overlapping noun phrases from the text. The tool was ported from the version 1.4.1; the tutorial is available [here](https://github.com/estnltk/estnltk/blob/9535931e9ade45fe17d5bfd0f24e4411fac02999/tutorials/taggers/noun_phrase_chunker.ipynb);
* `WordNet` synset definitions and examples. Interfaces of `WordNet` queries and similarity finding functions were also changed, see [this tutorial](https://github.com/estnltk/estnltk/blob/ca4be5942c991c380dea330a177226c35aa7dbb8/tutorials/wordnet/wordnet.ipynb) for details;
* `delete_layer()` to Postgres collection (can only be used with detached layers);

Fixed

* `CompoundTokenTagger`: removed a speed bottleneck that appeared while processing large texts;
* `CompoundTokenTagger`'s import of `EmptyDataError` (now it should work with pandas<=1.1.0)
* `VerbChainDetector`: removed legacy attributes;
* Postgres collection's `_repr_html_`: collections which names start with the same prefix are now displayed correctly;
* Issues in `conll_importer` and `maltparser_tagger` caused by updating the conllu package to version 3.0;

1.6.6beta

Changed

* Updated the C++ source code of EstNLTK's Vabamorf module to be compatible with [Vabamorf's commit 7a44b62dba](https://github.com/Filosoft/vabamorf/tree/7a44b62dba66cd39116edaad57db4f7c6afb34d9) (2020-01-22). The update also adds two new binary lexicons: `2020-01-22_nosp` and `2020-01-22_sp`. However, the old binary lexicons (`2015-05-06` and `2019-10-15`) can no longer be used with the new Vabamorf, so they were removed. For details about binary lexicons, see the section _"Changing Vabamorf's binary lexicons"_ in [this tutorial](https://github.com/estnltk/estnltk/blob/d4319cd65178fc1772c2873ee9b11d238f9745a6/tutorials/nlp_pipeline/B_06_morphological_analysis.ipynb);

* Improved `CorpusBasedMorphDisambiguator`'s `count_inside_compounds` heuristic: lemmas acquired from non-compound words are now also used for reducing ambiguities of last words of compound words. In addition, a user-definable lemma skip list (`ignore_lemmas_in_compounds`) was introduced to the heuristic in order to reduce erroneous disambiguation choices. The name of heuristic's flag was changed from `count_inside_compounds` to `disamb_compound_words`. For details on the usage, see [this tutorial](https://github.com/estnltk/estnltk/blob/45eb6536d36f77284c25a1d4c7c0852d5b033e63/tutorials/nlp_pipeline/B_07b_morph_analysis_with_corpus-based_disambiguation.ipynb).

* Changed Layer removing: in order to delete a layer from `Text` object, use `text.pop_layer( layer_name )`. Old ways of removing (via `del` and `delattr`) are no longer working;

* Reorganized tutorials (including cleaned up some old stuff):

* Removed `brief_intro_to_text_layers_and_tools.ipynb` (became redundant after `basics_of_estnltk.ipynb` was introduced);
* Removed `text_segmentation.ipynb` (it's content is covered by segmentation tutorials in `tutorials/nlp_pipeline`);
* Relocated tutorials:
* `importing_text_objects_from_corpora.ipynb` to `tutorials/corpus_processing`;
* `restoring_pretokenized_text.ipynb` to `tutorials/corpus_processing`;
* `estnltk_basic_concepts.ipynb` to `tutorials/system`;
* `layer_operations.ipynb` to `tutorials/system`;
* `low_level_layer_operations.ipynb` to `tutorials/system`;
* `MorphAnalyzedToken`'s tutorial to `tutorials/system`;
* Introduced `tutorials/miscellaneous` (currently contains tutorials about morphological synthesis and word syllabification):


Added

* `SpellCheckRetagger`: a Vabamorf-based spelling normalization for the words layer. For details on the usage, see [this tutorial](https://github.com/estnltk/estnltk/blob/5c5ce3a810f7a5e41602156b7044edb63e83532d/tutorials/nlp_pipeline/B_03_segmentation_words_spelling_normalization.ipynb);
* `MaltParserTagger` which analyses Text with EstNLTK's MaltParser and produces `maltparser_syntax` layer. This should be the most straightforward way of using the MaltParser. [Usage tutorial](https://github.com/estnltk/estnltk/blob/84b12bc7ece6bde6906e8d4fa0d5831d3149fce5/tutorials/syntax/syntax.ipynb);
* `VabamorfEstCatConverter` -- a tagger which can be used to convert Vabamorf's category names to Estonian (for educational purposes). For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/84b12bc7ece6bde6906e8d4fa0d5831d3149fce5/tutorials/nlp_pipeline/A_01_short_introduction_and_tutorial_for_linguists.ipynb);
* A heuristic that improves syllabification of compounds in the function `syllabify_word`. The heuristic (`split_compounds`) is switched on by default. For details, see [this tutorial](https://github.com/estnltk/estnltk/blob/5c5ce3a810f7a5e41602156b7044edb63e83532d/tutorials/miscellaneous/syllabification.ipynb);
* Flag `slang_lex` to `VabamorfTagger`. The flag switches on an extended version of Vabamorf's lexicon (namely, the lexicon `2020-01-22_nosp`), which contains extra entries for analysing most common spoken and slang words;
* Flag `overwrite_existing` to `UserDictTagger`: allows to turn off overwriting of existing analyses, so that only words with `None` analyses will obtain analyses from the user dictionary;
* Tutorial `basics_of_estnltk.ipynb`, which provides an overview of the API of EstNLTK 1.6. The tutorial also provides links to more detailed/advanced tutorials. The tutorial should be a "getting started" point for new users of EstNLTK, as well as place for more advanced users to find pointers to further information;
* Tutorial about Vabamorf-based morphological synthesis ([here](https://github.com/estnltk/estnltk/blob/5c5ce3a810f7a5e41602156b7044edb63e83532d/tutorials/miscellaneous/morphological_synthesis.ipynb)). The interface of the synthesizer is basically the same as in v1.4;
* Tutorial about Vabamorf-based word syllabification ([here](https://github.com/estnltk/estnltk/blob/5c5ce3a810f7a5e41602156b7044edb63e83532d/tutorials/miscellaneous/syllabification.ipynb));
* Updated tutorial `importing_text_objects_from_corpora` on how to import texts from Estonian UD treebank;
* Updated neural morphological tagger's tutorial: added information about where the models can be downloaded;
* `WordNet`: API that provides means to query Estonian WordNet. Compared to v1.4 queries for definition and examples are missing and synsets can be queried by wn[lemma] as opposed to v1.4's wn.synsets(lemma). Other aspects are comparable to v1.4;
* `UDPipeTagger` which analyses Text with EstNLTK's UDPipe and produces `udpipe_syntax` layer. [Usage tutorial](https://github.com/estnltk/estnltk/blob/84b12bc7ece6bde6906e8d4fa0d5831d3149fce5/tutorials/syntax/syntax.ipynb)
* Flag `fix_selfreferences` to `VislTagger`: removes self-references.
* `Named Entity Tagging`: Taggers for named entity recognition (NER). One tagger tags whole named entities and the other tagger tags each word separately. CRF model used is the same as in v1.4.

Fixed

* Vabamorf's initialization bug on Linux: merged fixes from [97](https://github.com/estnltk/estnltk/issues/97) and pull [#109](https://github.com/estnltk/estnltk/pull/109), and added some conditionals to keep the whole thing compiling and running under Windows (and macOS);
* Removed a major speed bottleneck from `CompoundTokenTagger`'s patterns (annotating `compound_tokens` should be much faster now);
* Rule customization in `CompoundTokenTagger`: 1st level and 2nd level patterns can now be updated via constructor parameters `patterns_1` and `patterns_2`;
* Rule customization in `SentenceTokenizer`: all merge patterns can now be updated via constructor parameter `patterns`;
* Rule prioritizing in `SyntaxIgnoreTagger` (the output no longer has fluctuations in rule types);
* Fix in `WordTagger`: 'words' layer can now be created from detached 'tokens' and 'compound_tokens' layers;
* `syllabify_word`: do not analyse dash and slash in order to avoid crash in Vabamorf's syllabifier;
* Bug related to setting `'display.max_colwidth'` in `pandas` (affects `pandas` versions >= 1.0.0);
* `convert_cg3_to_conll.py`: handles input `'"<s>"'` with analysis correctly

1.6.5.5beta

A pre-release only for developers. The list of changes will be documented in the next release.

Page 2 of 4

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.