Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.
New Features
Support for new languages
Flair embeddings
We now include new language models for:
* [Swedish](https://github.com/zalandoresearch/flair/issues/3)
* [Polish](https://github.com/zalandoresearch/flair/issues/187)
* [Bulgarian](https://github.com/zalandoresearch/flair/issues/188)
* [Slovenian](https://github.com/zalandoresearch/flair/issues/202)
* [Dutch](https://github.com/zalandoresearch/flair/issues/224)
These are in addition to the existing models for English and German. For instance, you can load `FlairEmbeddings` for Dutch with:
```python
from flair.embeddings import FlairEmbeddings

flair_embeddings = FlairEmbeddings('dutch-forward')
```
Word Embeddings
We now include pre-trained [FastText Embeddings for 30 languages](https://github.com/zalandoresearch/flair/issues/234): English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.
Each language has embeddings trained over Wikipedia or web crawls. Instantiate them with:
```python
from flair.embeddings import WordEmbeddings

# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')

# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')
```
Named Entity Recognition
Thanks to the Flair community, we now include NER models for:
* [French](https://github.com/zalandoresearch/flair/issues/238)
* [Dutch](https://github.com/zalandoresearch/flair/issues/224)
These are in addition to the previous models for English and German.
Part-of-Speech Tagging
Thanks to the Flair community, we now include PoS models for:
* [German tweets](https://github.com/zalandoresearch/flair/issues/51)
Multilingual models
As a major new feature, we now include models that can tag text in various languages.
12-language Part-of-Speech Tagging
We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('pos-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')

# predict PoS tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
```
4-language Named Entity Recognition
We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('ner-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
```
This model also works to some degree on languages it was not trained on, such as French.
Pre-trained classification models ([issue 70](https://github.com/zalandoresearch/flair/issues/70))
Flair now also includes two pre-trained classification models:
* de-offensive-language: detecting offensive language in German text ([GermEval 2018 Task 1](https://projects.fzai.h-da.de/iggsa/projekt/))
* en-sentiment: detecting positive and negative sentiment in English text ([IMDB](http://ai.stanford.edu/~amaas/data/sentiment/))
Simply load the `TextClassifier` using the preferred model, such as:
```python
from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')
```
BERT and ELMo embeddings
We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.
BERT Embeddings ([issue 251](https://github.com/zalandoresearch/flair/issues/251))
We added [BERT embeddings](https://arxiv.org/pdf/1810.04805.pdf) to Flair. We are using the implementation of [huggingface](https://github.com/huggingface/pytorch-pretrained-BERT). The embeddings can be used as any other embedding type in Flair:
```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```
ELMo Embeddings ([issue 260](https://github.com/zalandoresearch/flair/issues/260))
Flair now also includes [ELMo embeddings](http://www.aclweb.org/anthology/N18-1202). We use the implementation of [AllenNLP](https://allennlp.org/elmo). As this implementation comes with a lot of sub-dependencies, you need to first install the library via `pip install allennlp` before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:
```python
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init embedding
embedding = ELMoEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```
Multi-Dataset Training ([issue 232](https://github.com/zalandoresearch/flair/issues/232))
You can now train a model on multiple datasets with the `MultiCorpus` object. We use this to train our multilingual models.
Just create multiple corpora and put them into `MultiCorpus`:
```python
from flair.data import MultiCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)

multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
```
The `multi_corpus` can now be used for training, just like any other corpus. Check [the tutorial](TUTORIAL_6_TRAINING_A_MODEL.md) for more details.
Parameter Selection using Hyperopt ([issue 242](https://github.com/zalandoresearch/flair/issues/242))
We built a wrapper around [hyperopt](http://hyperopt.github.io/hyperopt/) to allow you to search for the best hyperparameters for your downstream task.
Define your search space and start training using several different parameter settings. The results are written to a specific file called `param_selection.txt` in the result directory. Check [the tutorial](TUTORIAL_7_HYPER_PARAMETER.md) for more details.
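As a rough illustration of what "define a search space and try several settings" means, here is a minimal sketch in plain Python. It uses simple random search as a stand-in and does not call hyperopt or the Flair wrapper; the search space, `toy_objective` and `random_search` are hypothetical names invented for this example.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical search space: each parameter maps to its candidate values.
search_space: Dict[str, List] = {
    "learning_rate": [0.05, 0.1, 0.15, 0.2],
    "hidden_size": [32, 64, 128],
}


def toy_objective(params: Dict) -> float:
    """Stand-in for 'train a model and return its dev loss' (lower is better)."""
    return abs(params["learning_rate"] - 0.1) + abs(params["hidden_size"] - 64) / 64


def random_search(objective: Callable[[Dict], float], space: Dict[str, List],
                  max_evals: int, seed: int = 0) -> Tuple[float, Dict]:
    """Sample max_evals random settings and return the best (loss, params) pair."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_evals):
        params = {name: rng.choice(values) for name, values in space.items()}
        trials.append((objective(params), params))
    return min(trials, key=lambda trial: trial[0])


best_loss, best_params = random_search(toy_objective, search_space, max_evals=50)
print(best_loss, best_params)
```

In a real run the objective would train and evaluate a model for each sampled setting, which is why the number of evaluations is usually kept small.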
NLP Dataset Downloader ([issue 243](https://github.com/zalandoresearch/flair/issues/243))
To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:
python
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in [the tutorial](TUTORIAL_5_CORPUS.md).
Model training features
We added various other features to model training.
Saving training log ([issue 212](https://github.com/zalandoresearch/flair/issues/212))
The training log output will from now on be automatically saved in the result directory you provide for training.
The log will be saved in `training.log`.
Resuming training ([issue 217](https://github.com/zalandoresearch/flair/issues/217))
It is now possible to stop training at any point in time and to resume it later by training with `checkpoint` set to `True`. Check [the tutorial](TUTORIAL_6_TRAINING_A_MODEL.md) for more details.
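The general idea behind resuming from a checkpoint can be sketched in plain Python as follows. This is not Flair's implementation: the `train` function, the `stop_after` parameter and the `checkpoint.json` file name are assumptions made purely for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def train(result_dir: Path, max_epochs: int, stop_after: Optional[int] = None) -> int:
    """Toy training loop that writes a checkpoint after every epoch.

    Returns the number of epochs completed so far.
    """
    checkpoint_file = result_dir / "checkpoint.json"  # hypothetical file name
    state = {"epoch": 0}
    if checkpoint_file.exists():
        # Resume from the last saved state instead of starting over
        state = json.loads(checkpoint_file.read_text())
    while state["epoch"] < max_epochs:
        if stop_after is not None and state["epoch"] >= stop_after:
            break  # simulate an interrupted run
        state["epoch"] += 1  # one epoch of (pretend) training
        checkpoint_file.write_text(json.dumps(state))
    return state["epoch"]


with tempfile.TemporaryDirectory() as tmp:
    first_run = train(Path(tmp), max_epochs=10, stop_after=4)  # stops after epoch 4
    resumed_run = train(Path(tmp), max_epochs=10)              # continues from epoch 4
```

A real checkpoint would also need to store the model weights and optimizer state, not just the epoch counter.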
Custom Optimizers ([issue 220](https://github.com/zalandoresearch/flair/issues/220))
You can now choose other optimizers besides SGD, i.e. any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.
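Assuming SGDW and AdamW follow the decoupled weight decay idea that their names usually refer to, the difference from a classic L2 penalty can be sketched with a scalar Adam update in plain Python. The function `adam_step` and its parameters `l2` and `decoupled_wd` are hypothetical and are not taken from Flair or PyTorch.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One scalar Adam update.

    l2:           classic L2 penalty, added to the gradient before the moment estimates
    decoupled_wd: AdamW-style decay, applied directly to the weight
    """
    grad = grad + l2 * theta
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * decoupled_wd * theta
    return theta, m, v


# Same decay strength, zero loss gradient, first step from theta = 1.0
theta_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
theta_wd, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.1)
print(theta_l2, theta_wd)
```

With the L2 penalty the decay term is rescaled by Adam's adaptive normalization, so the weight moves by roughly the full learning rate regardless of the decay strength. With decoupled decay the weight shrinks in proportion to `lr * decoupled_wd`.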
Learning Rate Finder ([issue 228](https://github.com/zalandoresearch/flair/issues/228))
A new helper method assists you in finding a [good learning rate for model training](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_8_MODEL_OPTIMIZATION.md#finding-the-best-learning-rate).
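A common way to look for a good learning rate is a range test: train for a few steps while increasing the learning rate exponentially, and watch where the loss starts to blow up. The sketch below shows that idea on a toy loss in plain Python. It assumes the helper works along these lines, and the function `lr_range_test` and all of its parameters are hypothetical, not Flair's API.

```python
from typing import List, Tuple


def lr_range_test(start_lr: float = 1e-4, end_lr: float = 10.0,
                  steps: int = 50, stop_factor: float = 4.0) -> List[Tuple[float, float]]:
    """Run SGD on the toy loss f(w) = w**2 while growing the learning rate
    exponentially, and record (learning_rate, loss) after every step."""
    growth = (end_lr / start_lr) ** (1.0 / (steps - 1))
    w, lr, best_loss = 1.0, start_lr, float("inf")
    history = []
    for _ in range(steps):
        w -= lr * 2.0 * w  # gradient of w**2 is 2*w
        loss = w * w
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > stop_factor * best_loss:
            break  # loss is diverging, stop the sweep
        lr *= growth
    return history


history = lr_range_test()
for lr, loss in history[-3:]:
    print(f"lr={lr:.4f} loss={loss:.3e}")
```

One would typically plot the recorded losses against the learning rates and pick a value somewhat below the point where the loss starts to climb.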
Breaking Changes
This release introduces breaking changes. The most important are:
Unified Model Trainer ([issue 189](https://github.com/zalandoresearch/flair/issues/189))
Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely `ModelTrainer`. This replaces the earlier classes `SequenceTaggerTrainer` and `TextClassifierTrainer`.
Downstream task models now implement the new `flair.nn.Model` interface. So, both the `SequenceTagger` and `TextClassifier` now inherit from `flair.nn.Model`. This allows both models to be trained with the `ModelTrainer`, like this:
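```python
from flair.models import SequenceTagger, TextClassifier
from flair.trainers import ModelTrainer

# `embeddings`, `tag_dictionary`, `document_embedding`, `label_dict` and `corpus`
# are assumed to have been created beforehand

# Training sequence tagger
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')

# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')
```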
The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes it easier to understand.
Metric class
The metric class has been refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum `EvaluationMetric` which you can pass to the `ModelTrainer` to tell it what to use for evaluation.
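To illustrate the difference between the two averaging modes, here is a minimal sketch in plain Python. It does not use Flair's metric class; the helper names (`f1`, `micro_f1`, `macro_f1`) and the per-class counts are made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical per-class counts: (true positives, false positives, false negatives)
Counts = Dict[str, Tuple[int, int, int]]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: Counts) -> float:
    """Pool the counts over all classes, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Counts) -> float:
    """Compute F1 per class, then take the unweighted mean over classes."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


counts = {"PER": (90, 10, 10), "LOC": (1, 4, 5)}
print(round(micro_f1(counts), 3), round(macro_f1(counts), 3))  # 0.863 0.541
```

The micro average is dominated by the frequent `PER` class, while the macro average weights both classes equally and is pulled down by the rare, poorly predicted `LOC` class.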
Updates and Bug Fixes
Torch 1.0 ([issue 299](https://github.com/zalandoresearch/flair/issues/299))
Flair now builds on torch 1.0.
Use Pathlib ([issue 176](https://github.com/zalandoresearch/flair/issues/176))
Flair now uses `Path` wherever possible to allow easier operations on files/directories. However, our interfaces still allow you to pass a string, which will then be transformed into a `Path` by Flair.
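The pattern of accepting either a string or a `Path` can be sketched as follows. This is only an illustration of the idea with the standard library; the function name `resolve_output_dir` is hypothetical and not part of Flair.

```python
from pathlib import Path
from typing import Union


def resolve_output_dir(base_path: Union[str, Path]) -> Path:
    """Accept a string or a Path and always hand back a Path."""
    if isinstance(base_path, str):
        base_path = Path(base_path)
    return base_path


# Once everything is a Path, file names can be joined with the `/` operator
log_file = resolve_output_dir("results") / "training.log"
print(log_file.name)  # training.log
```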
Bug Fixes
* Fix: Non-whitespaced tokenized text results in an infinite loop ([issue 226](https://github.com/zalandoresearch/flair/issues/226))
* Fix: Getting IndexError: list index out of range error ([issue 233](https://github.com/zalandoresearch/flair/issues/233))
* Do not reset cache directory always to None ([issue 249](https://github.com/zalandoresearch/flair/issues/249))
* Filter sentences with zero tokens ([issue 266](https://github.com/zalandoresearch/flair/issues/266))