Release notes
New models, query strategies and API changes
Important Changes
- Due to significant API changes, the log file version has been updated. As a result, log files created with older versions of ASReview cannot be read by the new version. Keep using the old version with these old log files (for reproducibility purposes, this is generally a good idea).
- Different options for installing the package are now available. In an effort to keep the number of dependencies in check, the dependencies of some models are optional. In order to use these models, it is necessary to install these packages manually (an error will be shown giving the name of the missing package). You can also use `pip install asreview[all]` to install all optional dependencies automatically.
New Features
- New Model: `nn-2-layer`
- Dense Neural Network consisting of two layers. Seems to work well with the new doc2vec feature extraction.
- New Model: `rf`
- Random Forest model (sklearn).
- New Model: `logistic`
- Logistic regression model.
- New Balance strategy: `double`
- This is the same strategy as the `triple` balance strategy, except there is
- New Query strategy: `cluster`
- This query strategy uses K-Means clustering to divide the papers into different clusters. It then randomly selects one of these clusters and picks the paper with the highest probability in that cluster.
- New Query strategy: `mixed`
- This is a new 'class' of query strategies, where query strategies can be mixed. Previously only `rand_max` was implemented, but now any two query strategies can be combined.
- New Feature Extraction: `doc2vec`
- Uses the doc2vec model from the `gensim` package.
- New Feature Extraction: `sbert`
- Uses the Sentence BERT model with a pretrained (provided by sbert) dataset. The pretrained dataset is probably not ideal for this task, and I haven't had much success with it so far.
- New Feature Extraction: `embedding-idf`
- Uses the average of word embeddings weighted by inverse document frequency.
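To make the `embedding-idf` idea above concrete, here is a minimal sketch of an IDF-weighted average of word vectors in plain Python. It is not the ASReview implementation: the function names, the toy 2-dimensional embeddings, the `log(N / df)` form of the IDF, and the choice to normalize by the sum of the weights are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Inverse document frequency per word: log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def embed_document(tokens, embeddings, idf):
    """IDF-weighted average of the word vectors of one document."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in tokens:
        if word not in embeddings or word not in idf:
            continue  # skip out-of-vocabulary words
        weight = idf[word]
        for i, value in enumerate(embeddings[word]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when no word carries any weight
    return [x / weight_sum for x in total]


# Toy, made-up 2-dimensional embeddings (not a real pretrained model).
embeddings = {
    "active": [1.0, 0.0],
    "learning": [0.0, 1.0],
    "review": [1.0, 1.0],
}
docs = [
    ["active", "learning", "review"],
    ["learning", "review"],
    ["review"],
]
idf = idf_weights(docs)
vectors = [embed_document(doc, embeddings, idf) for doc in docs]
```

In this toy corpus the word "review" occurs in every document, so its IDF is zero and it does not contribute to any document vector; rarer words dominate the average.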
API Changes
- Create abstract 'super' model above all types of models.
- Move feature extraction out of the models. This means that one can now use different feature extractors with the same model, although some restrictions apply.
- Remove modAL from the active learning process.
- We were not using modAL all that intensively anymore. The main reason for removal is that modAL uses a system that requires functions/arguments to be passed around. Now we're using classes, which improves the readability and maintainability.
- Align all types of models (train, query, balance, feature extraction) with a similar class structure.
- Improve and align the hyperparameter optimization of the different types of models.
- Remove the `get_data` member function from the `ASReviewData` class. It returned several pieces of data at once, and callers often used it to get one piece and throw away the other two. As a replacement, use `as_data.texts` to get the texts (title + abstract), and `as_data.labels` to get the labels.
- Many classes have been renamed. To locate a renamed class, it is generally advised to search the code with a string.
- The query strategy `rand` has been renamed to `random`.
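The move from passing functions around to a shared class structure, mentioned in the list above, can be sketched roughly as follows. This is only an illustration of the design idea: `BaseQueryStrategy`, `MaxQuery`, `RandomQuery`, `MixedQuery`, and the `mix_ratio` parameter are hypothetical names, not the actual ASReview classes or signatures.

```python
import random
from abc import ABC, abstractmethod


class BaseQueryStrategy(ABC):
    """Hypothetical common interface; not the actual ASReview class."""

    @abstractmethod
    def query(self, probabilities, pool, n_instances):
        """Return up to n_instances indices chosen from pool."""


class MaxQuery(BaseQueryStrategy):
    """Pick the pool items with the highest predicted probability."""

    def query(self, probabilities, pool, n_instances):
        ranked = sorted(pool, key=lambda i: probabilities[i], reverse=True)
        return ranked[:n_instances]


class RandomQuery(BaseQueryStrategy):
    """Pick pool items uniformly at random."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def query(self, probabilities, pool, n_instances):
        return self._rng.sample(list(pool), min(n_instances, len(pool)))


class MixedQuery(BaseQueryStrategy):
    """Combine any two strategies; mix_ratio is the share of the first."""

    def __init__(self, first, second, mix_ratio=0.5):
        self.first = first
        self.second = second
        self.mix_ratio = mix_ratio

    def query(self, probabilities, pool, n_instances):
        n_first = round(n_instances * self.mix_ratio)
        chosen = self.first.query(probabilities, pool, n_first)
        # The second strategy only sees items the first one did not pick.
        remaining = [i for i in pool if i not in chosen]
        n_second = n_instances - len(chosen)
        chosen += self.second.query(probabilities, remaining, n_second)
        return chosen


probabilities = [0.1, 0.9, 0.4, 0.8, 0.3, 0.7]
pool = [0, 1, 2, 3, 4, 5]
strategy = MixedQuery(MaxQuery(), RandomQuery(seed=42), mix_ratio=0.5)
selected = strategy.query(probabilities, pool, n_instances=4)
```

Because every strategy exposes the same `query` method, a combined strategy needs no knowledge of what it wraps, which is the kind of readability gain a class-based structure is meant to give over passing functions and arguments around.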
Misc
- A lot of documentation was added/updated.
- New and improved unit tests. Query models are now covered by unit tests and are tested for 'cheating'. Feature extraction models received their own tests.