Annif

Latest version: v1.3.1

0.53.1

This patch release [fixes](https://github.com/NatLibFi/Annif/pull/501) a bug which prevented training the SVC backend on a full-text corpus.

0.53.0

This release adds two new backends, YAKE and SVC. The YAKE backend is a wrapper around the [YAKE library](https://github.com/LIAAD/yake), which performs unsupervised lexical keyword extraction, so no training data is needed. See the [YAKE](https://github.com/NatLibFi/Annif/wiki/Backend%3A-YAKE) wiki page for more information. In future Annif releases, it may become possible to extend YAKE support so that it can be used to suggest new terms for a vocabulary (i.e. keywords that are not yet in the vocabulary).
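As a rough illustration of the kind of extraction the backend delegates to, here is a minimal sketch using the YAKE library directly; the parameter values are illustrative and do not reflect Annif's project configuration:

```python
# Minimal sketch of unsupervised keyword extraction with the yake library,
# which the YAKE backend wraps. Parameters are illustrative, not Annif's.
import yake

extractor = yake.KeywordExtractor(lan="en", n=3, top=10)  # keyphrases of up to 3 words
text = "Annif is a multi-algorithm tool for automated subject indexing of text documents."
for phrase, score in extractor.extract_keywords(text):
    # lower score = more relevant keyphrase
    print(f"{score:.4f}  {phrase}")
```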

The SVC backend implements Linear Support Vector Classification. It is well suited for multiclass (but not multilabel) classification, for example classifying documents with the Dewey Decimal Classification or the 20 Newsgroups classification. It requires relatively little training data, and is suitable for classifications of up to around 10,000 classes. See the [SVC](https://github.com/NatLibFi/Annif/wiki/Backend%3A-SVC) wiki page for more information.
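Since the backend uses scikit-learn, the general technique can be sketched with plain scikit-learn on the 20 Newsgroups data; this is a hedged sketch of linear support vector classification, not Annif's actual training pipeline:

```python
# Sketch of linear support vector classification with scikit-learn on the
# 20 Newsgroups data; the general technique, not Annif's own pipeline.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Each document has exactly one class label (multiclass, not multilabel).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))
```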

This release also upgrades many dependencies, which enables all Annif backends to run on Python 3.9 (previously the nn_ensemble backend was available only on Python 3.6-3.8). The Docker image now uses Python 3.8 instead of 3.7.

Note that nn_ensemble models are not compatible across Python versions: e.g. a model trained on Python 3.7 can be used only on Python 3.7. Training the nn_ensemble models shows a `CustomMaskWarning`, but it is harmless (caused by a [TensorFlow bug](https://github.com/tensorflow/tensorflow/issues/49754)) and can be ignored.

Due to the scikit-learn upgrade, TFIDF, MLLM or Omikuji models trained on older Annif versions will show warnings about the `TfidfVectorizer`. To the best of our knowledge, these warnings are harmless and can be ignored; retraining the models is only needed to make them go away.

This release also includes many minor improvements and bug fixes.

New features:
- #486 New SVC (support vector classification) backend using scikit-learn
- #439/#461 YAKE backend
- #490/#494 Make the `--version` option show the Annif version

Improvements:
- #488 Add support for ngram setting in omikuji backend

Maintenance:

0.52.0

This release includes a new MLLM backend, which is a Python implementation of the Maui-like Lexical Matching algorithm. It was inspired by the [Maui algorithm](https://hdl.handle.net/10289/3513) (by Alyona Medelyan) but is not a direct reimplementation. It is meant for long full-text documents and, like Maui, it needs to be trained with a relatively small number (hundreds or thousands) of manually indexed documents so that the algorithm can choose the mix of heuristics that achieves the best results on a particular document collection. See [the MLLM Wiki page](https://github.com/NatLibFi/Annif/wiki/Backend%3A-MLLM) for more information.
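For intuition only, the core idea of lexical matching is to look for vocabulary terms that occur in the document text. The toy sketch below shows that idea in its simplest form; it is not the MLLM implementation, which adds trained heuristics (term frequency, position, ambiguity handling and so on) on top:

```python
# Toy illustration of lexical matching: suggest vocabulary concepts whose
# label occurs in the text. The real MLLM backend is far more sophisticated
# and learns how to weight its matching heuristics from training documents.
def lexical_match(text: str, vocabulary: dict[str, str]) -> set[str]:
    """Return URIs of concepts whose label appears in the (lowercased) text."""
    text_lower = text.lower()
    return {uri for label, uri in vocabulary.items() if label.lower() in text_lower}

vocab = {
    "machine learning": "http://example.org/concept/ml",
    "information retrieval": "http://example.org/concept/ir",
}
print(lexical_match("A survey of machine learning methods for text.", vocab))
```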

New features include the possibility to configure two new project parameters (a short sketch of the token-length filter follows this list):
- `token_min_length` [can be set in the analyzer parameters](https://github.com/NatLibFi/Annif/wiki/Analyzers); e.g. setting the value to 2 allows the word "UK" to pass to a backend, while with the default value (3) the analyzer filters the word out
- `lr` can be set in the [neural-network ensemble](https://github.com/NatLibFi/Annif/wiki/Backend%3A-nn_ensemble) project configuration to define the learning rate.
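As a hedged sketch of what the token length threshold does in practice (this is not Annif's analyzer code, only the filtering idea):

```python
# Illustration of a token_min_length-style filter: tokens shorter than the
# threshold are dropped before they reach the backend. Not Annif's code.
def filter_tokens(tokens: list[str], token_min_length: int = 3) -> list[str]:
    return [t for t in tokens if len(t) >= token_min_length]

print(filter_tokens(["UK", "economy", "of"]))                      # ['economy']
print(filter_tokens(["UK", "economy", "of"], token_min_length=2))  # ['UK', 'economy', 'of']
```

The `lr` parameter, by contrast, simply controls how large the update steps are during nn_ensemble training; smaller values train more cautiously.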

The STWFSA backend has been updated to use a newer version of the [stwfsapy library](https://github.com/zbw/stwfsapy). Old STWFSA models are not compatible with the new version, so any existing STWFSA projects must be retrained. The release also includes several minor improvements and bug fixes.


New features:
- #462 New lexical backend MLLM
- #456/#468 Allow configuration of token min length (credit: [mo-fu](https://github.com/mo-fu))
- #475 Allow configuration of nn ensemble learning rate (credit: [mo-fu](https://github.com/mo-fu))

Improvements:
- #478/#479 Update stwfsa to 0.2.* (credit: [mo-fu](https://github.com/mo-fu))
- #472 Cleanup suggestion tests
- #480 Optimize check for deprecated subject IDs using a set

Maintenance:
- #474 Use GitHub Actions as CI service

Bug fixes:
- #470/#471 Make sure suggestion scores are in the range 0.0-1.0
- #477 Optimize the optimize command
- #481 Backwards compatibility fix for the token_min_length setting
- #482 MLLM fix: don't include use_hidden_labels in hyperopt, it won't have any effect

0.51.0

This release includes a new [STWFSA backend](https://github.com/NatLibFi/Annif/wiki/Backend%3A-STWFSA), which is a wrapper around [STWFSAPY](https://github.com/zbw/stwfsapy), a lexical algorithm based on finite state automata. It achieves the best results with short texts, e.g. titles and author keywords, and is best suited for English language data.

The NN ensemble backend has been improved with better handling of source weights. Retraining NN ensemble models after updating Annif to this version is recommended, since result quality can decrease if old models are used. A new option has been added to several CLI commands: `--docs-limit`/`-d` limits the number of documents to process, which is useful for example when creating learning-curve data. Several bugs have also been fixed.
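The idea behind the document limit can be pictured with a small sketch; the corpus and function names below are made up for illustration, only the "process at most N documents" behaviour comes from the release:

```python
# Sketch of the idea behind the --docs-limit/-d option: process at most N
# documents, e.g. to produce successive points of a learning curve.
from itertools import islice

def take_documents(documents, docs_limit=None):
    """Yield at most docs_limit documents (all of them when the limit is None)."""
    return documents if docs_limit is None else islice(documents, docs_limit)

corpus = (f"document {i}" for i in range(1000))   # stand-in for a document corpus
subset = list(take_documents(corpus, docs_limit=100))
print(len(subset))  # 100
```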

New features:
- #438 Lexical STWFSAPY backend (credit: mo-fu)
- #465 Limit document number CLI option

Improvements:
- #457/#458 Improved handling of source weights in NN ensemble

Bug fixes:
- #454/#455 Address SonarCloud complaints
- #459/#460 Pass limit parameter to Maui Server during train
- #463 Fix TruncatingCorpus iterator

0.50.0

This release introduces a setting for using only part of the input text for subject indexing: the new `input_limit` project parameter truncates the input text to the given number of characters. This can improve the quality of the suggestions, as the beginning of a long document typically contains an abstract and introduction. The default value of `input_limit` is zero, which means no truncation is performed.
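A minimal sketch of the truncation behaviour described above (not Annif's implementation; the function name is made up):

```python
# Sketch of input_limit-style truncation: keep only the first N characters of
# the input text; 0 means no truncation. The function name is illustrative.
def apply_input_limit(text: str, input_limit: int = 0) -> str:
    return text if input_limit == 0 else text[:input_limit]

long_doc = "Abstract: ... " + "body text " * 10_000
print(len(apply_input_limit(long_doc, input_limit=5000)))  # 5000
print(len(apply_input_limit(long_doc)) == len(long_doc))   # True (default: no limit)
```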

Improvements include better handling of cached data in nn_ensemble training and reduced memory usage in evaluation thanks to sparse matrices for suggested subjects. Many dependencies have been updated and a few minor issues have been fixed.
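To see why sparse matrices help here, note that each document receives nonzero scores for only a handful of the possibly tens of thousands of subjects. The dimensions below are made up, but the saving pattern is the point:

```python
# Illustration of why sparse matrices save memory in evaluation: most
# (document, subject) score entries are zero. Dimensions are made up.
import numpy as np
from scipy.sparse import csr_matrix

n_docs, n_subjects, per_doc = 10_000, 100_000, 10
rows = np.repeat(np.arange(n_docs), per_doc)             # ~10 suggestions per document
cols = np.random.randint(0, n_subjects, size=rows.size)
scores = np.random.rand(rows.size)

suggestions = csr_matrix((scores, (rows, cols)), shape=(n_docs, n_subjects))
dense_gb = n_docs * n_subjects * 8 / 1e9                 # equivalent float64 dense matrix
sparse_mb = (suggestions.data.nbytes + suggestions.indices.nbytes
             + suggestions.indptr.nbytes) / 1e6
print(f"dense: {dense_gb:.1f} GB vs sparse: {sparse_mb:.1f} MB")
```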

New features:
- #446 Add a backend parameter to limit input characters in suggest
- #452 Apply the input_limit backend parameter to texts in train & learn

Improvements:
- #441 Sparse subjects (credit: mo-fu)
- #443/#444 Allow use of cached data after cancelled training of nn_ensemble backend

Maintenance:
- #448 Upgrade dependencies
- #445 Upgrade LMDB dependency from 0.98 to 1.0.0
- #449 Resolve DeprecationWarning: change warn to warning

Bug fixes:
- #447 Fix missing default params in pav and nn ensemble
