This is a large release including many new features:
## New models

- Added support for arbitrary scikit-learn models via `SKLearnClassifier` and for TF-IDF as a baseline embedding approach via `TfidfEmbedder`.
- Added support for spaCy text categorizer models and spacy-transformers models via `SpaCyModel`.
- Upgraded from `pytorch_transformers` v1.0.0 to `transformers` v2.4.1, which adds support for several new models.
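To illustrate the idea behind the TF-IDF baseline embedding, here is a minimal, dependency-free sketch. This is not gobbli's `TfidfEmbedder` implementation (which wraps a full-featured vectorizer); the function name `tfidf_embed` is hypothetical:

```python
import math
from collections import Counter

def tfidf_embed(docs):
    """Embed documents as TF-IDF vectors over the corpus vocabulary.

    Each document becomes a vector with one component per vocabulary
    term: (term frequency in the doc) * log(num docs / docs containing term).
    Terms that appear in every document get weight 0.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of docs containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([
            (tf[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors
```

Because TF-IDF needs no training beyond a pass over the corpus, it makes a useful sanity-check baseline before reaching for neural embedders.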
## Interactive apps
gobbli now comes bundled with a few [Streamlit](https://www.streamlit.io/) apps that can be used to explore datasets, evaluate gobbli model performance, and generate local explanations for gobbli model predictions. See [the docs](https://gobbli.readthedocs.io/en/latest/interactive_apps.html) for more information.
## Overhauled benchmarks
Completely overhauled the benchmark framework. Benchmark output is now stored as Markdown files, which are much easier to read on GitHub, and benchmarks can be selectively rerun when new models are added. We also developed a "benchmark" for embeddings, which plots each model's embeddings in 2 dimensions for a qualitative assessment of how well the model differentiates between the classes in a dataset. See [the benchmark output folder](https://github.com/RTIInternational/gobbli/tree/master/benchmark/benchmark_output).
## Miscellaneous improvements
- Add new BERT weights from NCBI trained on PubMed data (`ncbi-bert-base-pubmed-uncased`, `ncbi-bert-base-pubmed-mimic-uncased`, `ncbi-bert-large-pubmed-uncased`, `ncbi-bert-large-pubmed-mimic-uncased`) (thanks pmbaumgartner!)
- Upgrade fastText to a more recent version that supports [autotuning parameters](https://gobbli.readthedocs.io/en/latest/auto/gobbli.model.fasttext.html#gobbli.model.fasttext.FastText.init)
- Add support for optional [gradient accumulation](https://gobbli.readthedocs.io/en/latest/auto/gobbli.model.transformer.html#gobbli.model.transformer.Transformer.init) in Transformer models, allowing for smaller batch sizes and larger models while retaining performance
- Upgrade [USE implementation](https://gobbli.readthedocs.io/en/latest/auto/gobbli.model.use.html#gobbli.model.use.USE) to the TensorFlow 2.0 version and add support for multilingual weights (`universal-sentence-encoder-multilingual`, `universal-sentence-encoder-multilingual-large`)
- Add a couple of utilities for [inspecting and cleaning up disk usage](https://gobbli.readthedocs.io/en/latest/advanced_usage.html#housekeeping)
- Fix memory issues with the USE model by batching input data
- Fix potential encoding issues with non-ASCII text in the USE model
- Reuse static pretrained weights across instances of models instead of redownloading every time
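The gradient accumulation option mentioned above trades batch size for memory: gradients from several small micro-batches are summed before a single parameter update, which is numerically equivalent to one step over the full batch. A minimal sketch with a 1-D linear model and mean-squared-error loss (all names here are illustrative, not gobbli's internals):

```python
def grad_mse(w, batch):
    """Mean gradient of (w*x - y)**2 over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, batch, micro_size, lr=0.05):
    """One optimizer step where the gradient is accumulated over
    micro-batches of `micro_size` examples and then averaged.

    Each micro-batch fits in less memory than the full batch, but the
    resulting update is identical to a single full-batch step.
    """
    micro_batches = [batch[i:i + micro_size]
                     for i in range(0, len(batch), micro_size)]
    accum = 0.0
    for mb in micro_batches:
        # Weight by micro-batch size so uneven final batches average correctly.
        accum += grad_mse(w, mb) * len(mb)
    return w - lr * accum / len(batch)
```

In a deep learning framework the same pattern appears as calling the backward pass per micro-batch and deferring the optimizer step until `accum_steps` micro-batches have been processed.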
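The USE memory and encoding fixes above boil down to two small ideas: feed the model fixed-size batches instead of the whole corpus at once, and normalize input text to valid UTF-8 first. A sketch of that preprocessing step (`preprocess_batches` is a hypothetical name, not gobbli's API):

```python
def preprocess_batches(texts, batch_size=64):
    """Yield the input texts in fixed-size batches, with each text
    round-tripped through UTF-8 so undecodable bytes are replaced
    rather than crashing the downstream model."""
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        yield [t.encode("utf-8", errors="replace").decode("utf-8")
               for t in batch]
```

Bounding the batch size bounds peak memory regardless of corpus size, which is why batching fixes out-of-memory errors on large inputs.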
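The disk-housekeeping utilities are documented at the link above; the core measurement they rely on can be sketched as a simple recursive size walk (the function name `dir_size_bytes` is hypothetical):

```python
import os

def dir_size_bytes(root):
    """Total size in bytes of all regular files under `root`,
    including files in nested subdirectories."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```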
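The weight-reuse change can be pictured as a download-once cache keyed by weight name. A minimal sketch, assuming a `download_fn` callable that returns the raw weight bytes (all names are hypothetical, not gobbli's actual internals):

```python
from pathlib import Path

def cached_weights(name, cache_dir, download_fn):
    """Return the local path for pretrained weights `name`,
    downloading them only if they are not already cached.

    Subsequent calls for the same name hit the cache and skip
    the download entirely.
    """
    path = Path(cache_dir) / name
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(download_fn(name))
    return path
```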