Flair


0.8

FLERT (2031 2032 2104)

This release adds the "FLERT" approach to train sequence tagging models using cross-sentence features as presented in [our recent paper](https://arxiv.org/abs/2011.06993). This yields new state-of-the-art models which we include in Flair, as well as the features to easily train your own "FLERT" models.

Pre-trained FLERT models (2130)

We add 5 new NER models for English (4-class and 18-class), German, Dutch and Spanish (4-class each). Load for instance with:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("ner-large")

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```


If you want to test these models in action, for instance the new large English Ontonotes model with 18 classes, you can now use the hosted inference API on the HF model hub, like [here](https://huggingface.co/flair/ner-english-ontonotes-large).


Contextualized Sentences

In order to enable cross-sentence context, we made some changes to the Sentence object and data readers:

1. `Sentence` objects now have `next_sentence()` and `previous_sentence()` methods that are set automatically if loaded through `ColumnCorpus`. This is a pointer system to navigate through sentences in a corpus:
```python
# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)

# get a sentence
sentence = corpus.test[123]
print(sentence)
# get the previous sentence
print(sentence.previous_sentence())
# get the sentence after that
print(sentence.next_sentence())
# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())
```

This allows dynamic computation of contexts in the embedding classes.

2. `Sentence` objects now have the `is_document_boundary` field which is set through the `ColumnCorpus`. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object.
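As an illustration of what the data readers do here, a minimal stdlib sketch (not Flair's actual implementation) that groups CoNLL-style column lines into sentences and flags "-DOCSTART-" document boundaries:

```python
def read_column_sentences(lines):
    """Group CoNLL-style column lines into sentences and flag
    '-DOCSTART-' marker sentences as document boundaries."""
    sentence = []
    for line in list(lines) + [""]:  # sentinel blank line flushes the tail
        line = line.strip()
        if line:
            sentence.append(line.split()[0])  # keep only the token column
        elif sentence:
            yield sentence, sentence[0] == "-DOCSTART-"
            sentence = []

lines = ["-DOCSTART- O", "", "George B-PER", "Washington I-PER"]
for tokens, is_boundary in read_column_sentences(lines):
    print(tokens, is_boundary)
```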


Refactored TransformerWordEmbeddings (breaking)

`TransformerWordEmbeddings` has been refactored for dynamic context, robustness to long sentences and readability. The names of some constructor arguments have changed for clarity: `pooling_operation` is now `subtoken_pooling` (to make clear that we pool subtokens), `use_scalar_mean` is now `layer_mean` (we only do a simple layer mean) and `use_context` can now optionally take an integer to indicate the length of the context. Some default arguments have also changed.

For instance, to create embeddings with a document-level context of 64 subtokens, init like this:
```python
embeddings = TransformerWordEmbeddings(
    model='bert-base-uncased',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=64,
)
```


Train your Own FLERT Models

You can train a FLERT-model like this:

```python
import torch

from flair.data import Sentence
from flair.datasets import CONLL_03, WNUT_17
from flair.embeddings import TransformerWordEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()

use_context = 64
hf_model = 'xlm-roberta-large'

embeddings = TransformerWordEmbeddings(
    model=hf_model,
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=use_context,
)

tag_dictionary = corpus.make_tag_dictionary('ner')

# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# train with XLM parameters (AdamW, 20 epochs, small LR)
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
from torch.optim.lr_scheduler import OneCycleLR

context_string = '+context' if use_context else ''

trainer.train(f"resources/flert",
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```


We recommend training FLERT this way if accuracy is your most important requirement: FLERT is quite slow since it works at the document level.


HuggingFace model hub integration (2040 2108 2115)

We now host Flair sequence tagging models on the HF model hub (thanks for all the support, HuggingFace!).

**Overview of all models.** There is a dedicated 'Flair' tag on the hub, so to get a list of all Flair models, check [here](https://huggingface.co/models?filter=flair).

The hub allows all users to upload and share their own models. Even better, the **Inference API** lets you test all models online without downloading and running them. For instance, you can test our new very powerful English 18-class NER model [here](https://huggingface.co/flair/ner-english-ontonotes-large).

To load any sequence tagger on the model hub, use the string identifier when instantiating a model. For instance, to load our English ontonotes model with the id "flair/ner-english-ontonotes-large", do

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

# make example sentence
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```



Other New Features

New Task: Recognizing Textual Entailment (2123)

Thanks to marcelmmm, we now support training textual entailment tasks (in fact, all pairwise sentence classification tasks) in Flair.

For instance, if you want to train on the RTE task of the GLUE benchmark, use this script:

```python
import torch

from flair.data import Corpus
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings

# 1. get the entailment corpus
corpus: Corpus = GLUE_RTE()

# 2. make the tag dictionary from the corpus
label_dictionary = corpus.make_label_dictionary()

# 3. initialize text pair tagger
from flair.models import TextPairClassifier

tagger = TextPairClassifier(
    document_embeddings=TransformerDocumentEmbeddings(),
    label_dictionary=label_dictionary,
)

# 4. train trainer with AdamW
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 5. run training
trainer.train('resources/taggers/glue-rte-english',
              learning_rate=2e-5,
              mini_batch_chunk_size=2,  # this can be removed if you have a big GPU
              train_with_dev=True,
              max_epochs=3)
```


Add possibility to specify empty label name to CSV corpora (2068)

Some CSV classification datasets contain a value that means "no class". We now extend the `CSVClassificationDataset` so that it is possible to specify which value should be skipped using the `no_class_label` argument.

For instance:

```python
# load corpus
corpus = CSVClassificationCorpus(
    data_folder='resources/tasks/code/',
    train_file='java_io.csv',
    skip_header=True,
    column_name_map={3: 'text', 4: 'label', 5: 'label', 6: 'label', 7: 'label', 8: 'label', 9: 'label'},
    no_class_label='NONE',
)
```


This causes all entries of NONE in one of the label columns to be skipped.
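Conceptually, the skipping works like this rough stdlib sketch (illustrative only, not Flair's implementation):

```python
import csv
import io

def read_labeled_rows(csv_text, label_column, no_class_label=None):
    """Skip rows whose label column equals the 'no class' value."""
    kept = []
    for row in csv.reader(io.StringIO(csv_text)):
        if no_class_label is not None and row[label_column] == no_class_label:
            continue
        kept.append(row)
    return kept

text = "some code,IO\nother code,NONE\nmore code,NETWORK\n"
print(read_labeled_rows(text, 1, no_class_label='NONE'))
```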

More options for splits in corpora and training (2034)

For various reasons, we might want to have a `Corpus` that does not define all three splits (train/dev/test). For instance, we might want to train a model over the entire dataset and not hold out any data for validation/evaluation.

We add several ways of doing so.

1. If a dataset has predefined splits, like most NLP datasets, you can pass the arguments `train_with_test` and `train_with_dev` to the `ModelTrainer`. This causes the trainer to train over all three splits (and do no evaluation):

```python
trainer.train('path/to/your/folder',
              learning_rate=0.1,
              mini_batch_size=16,
              train_with_dev=True,
              train_with_test=True,
              )
```


2. You can now also create a `Corpus` with fewer splits without having the missing splits automatically sampled. Pass `sample_missing_splits=False` to do this. For instance, to load the SemCor WSD corpus as training data only, do:

```python
semcor = WSD_UFSAC(train_file='semcor.xml', sample_missing_splits=False, autofind_splits=False)
```


Add TFIDF Embeddings (2086)

We added some old-school embeddings (thanks yosipk), namely the legendary TF-IDF document embeddings. These are often good baselines, and additionally they keep NLP veterans nostalgic, if not happy.

To initialize these embeddings, you must pass the train split of your training corpus, i.e.

```python
embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)
```


This triggers the process where the most common words are used to featurize documents.
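To make the idea concrete, here is a toy stdlib sketch of the underlying TF-IDF featurization (illustrative only; `DocumentTFIDFEmbeddings` uses its own vectorizer and parameters):

```python
import math
from collections import Counter

def tfidf_vectors(docs, max_features=10):
    """Toy TF-IDF featurizer: the most common words become features."""
    # document frequency of each term
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    # keep only the most frequent terms, like max_features
    vocab = [term for term, _ in df.most_common(max_features)]
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        # smoothed idf, as in common TF-IDF variants
        vectors.append([tf[t] * math.log((1 + n) / (1 + df[t])) for t in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["the cat sat", "the dog sat", "the cat ran"])
print(vocab)
```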

New Datasets

Hungarian NER Corpus (2045)

Added the Hungarian business news corpus annotated with NER information (thanks to alibektas).

```python
# load Hungarian business NER corpus
corpus = BUSINESS_HUN()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


StackOverflow NER Corpus (2052)

```python
# load StackOverflow NER corpus
corpus = STACKOVERFLOW_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


Added GermEval 18 Offensive Language dataset (2102)

```python
# load GermEval 2018 offensive language corpus
corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()
print(corpus)
print(corpus.make_label_dictionary())
```


Added RTE corpora of GLUE and SuperGLUE

```python
# load the recognizing textual entailment corpus of the GLUE benchmark
corpus = GLUE_RTE()
print(corpus)
print(corpus.make_label_dictionary())
```


Improvements

Allow newlines as Tokens in a Sentence (2070)

Newlines and tabs can now become Tokens in a Sentence:

```python
# make sentence with newlines and tabs
sentence: Sentence = Sentence(["I", "\t", "ich", "\n", "you", "\t", "du", "\n"], use_tokenizer=True)

# alternatively: sentence: Sentence = Sentence("I \t ich \n you \t du \n", use_tokenizer=False)

# print sentence and each token
print(sentence)
for token in sentence:
    print(token)
```


Improve transformer serialization (2046)

We improved the serialization of the `TransformerWordEmbeddings` class such that you can now train a model with one version of the transformers library and load it with another version. Previously, if you trained a model with transformers 3.5.1 and loaded it with 3.1.0, or trained with 3.5.1 and loaded with 4.1.1, or hit other version mismatches, there would either be errors or bad predictions.

**Migration guide:** If you have a model trained with an older version of Flair that uses `TransformerWordEmbeddings` you can save it in the new version-independent format by loading the model with the same transformers version you used to train it, and then saving it again. The newly saved model is then version-independent:

```python
# load old model, but use the *same transformers version you used when training this model*
tagger = SequenceTagger.load('path/to/old-model.pt')

# save the model; it is now version-independent and can, for instance, be loaded with transformers 4
tagger.save('path/to/new-model.pt')
```


Fix regression prediction errors (2067)

Fix of two problems in the regression model:
- the predict() method was unable to set labels and threw errors (see 2056)
- predicted labels had no label name

Now, you can set a label name either in the predict method or during instantiation of the regression model you want to train. So the full code for training a regression model and using it to predict is:

```python
# load regression dataset
corpus = WASSA_JOY()

# make simple document embeddings
embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')], fine_tune_mode='linear')

# init model and give name to label
model = TextRegressor(embeddings, label_name='happiness')

# target folder
output_folder = 'resources/taggers/regression_test/'

# run training
trainer = ModelTrainer(model, corpus)
trainer.train(
    output_folder,
    mini_batch_size=16,
    max_epochs=10,
)

# load model
model = TextRegressor.load(output_folder + 'best-model.pt')

# predict for sentence
sentence = Sentence('I am so happy')
model.predict(sentence)

# print sentence and prediction
print(sentence)
```


In my example run, this prints the following sentence + predicted value:
~~~
Sentence: "I am so happy" [− Tokens: 4 − Sentence-Labels: {'happiness': [0.9239126443862915 (1.0)]}]
~~~

Do not shuffle first epoch during training (2058)

Normally, we shuffle sentences at each epoch during training in the ModelTrainer class. However, in some cases it makes sense to see sentences in their natural order during the first epoch, and shuffle only from the second epoch onward.
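The behavior boils down to the following sketch (the flag name `shuffle_first_epoch` is illustrative; check the `ModelTrainer` signature of your Flair version):

```python
import random

def epoch_order(dataset, epoch, shuffle_first_epoch=False):
    """Return the sentence order for an epoch: natural order in the
    first epoch (unless overridden), shuffled from the second epoch on."""
    order = list(dataset)
    if epoch > 1 or shuffle_first_epoch:
        random.shuffle(order)
    return order

print(epoch_order(["s1", "s2", "s3"], epoch=1))  # prints in natural order: ['s1', 's2', 's3']
```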


Bug Fixes and Improvements

- Update to transformers 4 (2057)
- Fix the evaluate() method in the SimilarityLearner class (2113)
- Fix memory leak in WordEmbeddings (2018)
- Add support for Transformer-XL Embeddings (2009)
- Restrict numpy version to <1.20 for Python 3.6 (2014)
- Small formatting and variable declaration changes (2022)
- Fix document boundary offsets for Dutch CoNLL-03 (2061)
- Changed the torch version in requirements.txt: Torch>=1.5.0 (2063)
- Fix linear input dimension when reprojecting embeddings (2073)
- Various improvements for TARS (2090 2128)
- Added a link to the interpret-flair repo (2096)
- Improve documentation (2110)
- Update sentencepiece and gdown version (2131)
- Add to_plain_string method to Span class (2091)

0.7

Few-Shot and Zero-Shot Classification with TARS (1917 1926)

With TARS we add a major new feature to Flair for zero-shot and few-shot classification. Details on the approach can be found in our paper [Halder et al. (2020)](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf). Our approach allows you to classify text in cases in which you have little or even no training data at all.

This example illustrates how you predict new classes without training data:

```python
# 1. load our pre-trained TARS model for English
tars = TARSClassifier.load('tars-base')

# 2. prepare a test sentence
sentence = flair.data.Sentence("I am so glad you liked it!")

# 3. define some classes that you want to predict using descriptive names
classes = ["happy", "sad"]

# 4. predict for these classes
tars.predict_zero_shot(sentence, classes)

# print sentence with predicted labels
print(sentence)
```


For a full overview of TARS features, please refer to our new [TARS tutorial](/resources/docs/TUTORIAL_10_TRAINING_ZERO_SHOT_MODEL.md).


Other New Features

Option to set Flair seed (1979)

Adds the possibility to set a seed by wrapping the Hugging Face Transformers library's helper method (thanks stefan-it).

By specifying a seed with:

```python
import flair

flair.set_seed(42)
```


you can make experimental runs reproducible. The wrapped `set_seed` method sets seeds for `random`, `numpy` and `torch`. More details [here](https://github.com/huggingface/transformers/blob/08f534d2da47875a4b7eb1c125cfa7f0f3b79642/src/transformers/trainer_utils.py#L29-L48).

Control multi-word behavior in UD datasets (1981)

To better handle multi-words in UD corpora, we introduce the `split_multiwords` constructor argument to all UD corpora, which by default is set to `True`. It controls the handling of multiwords that are split into different tokens. For instance, the German "am" is split into two different tokens: "am" -> "an" + "dem". Or the French "aux" -> "à" + "les".

If `split_multiwords` is set to `True`, they are split as in UD. If set to `False`, we keep the original multiword as a single token. Example:

```python
# default mode: multiwords are split
corpus = UD_GERMAN(split_multiwords=True)
# print sentence 179
print(corpus.dev[179].to_plain_string())

# alternative mode: multiwords are kept as original
corpus = UD_GERMAN(split_multiwords=False)
# print sentence 179
print(corpus.dev[179].to_plain_string())
```


This prints

~~~
Ein Hotel zu dem Wohlfühlen.

Ein Hotel zum Wohlfühlen.
~~~

The latter is how it appears in text, the former is after splitting of multiwords.
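The two modes amount to something like this sketch (illustrative only; Flair does this while parsing the UD files):

```python
def apply_multiwords(tokens, multiword_map, split_multiwords=True):
    """Either split multiword tokens into their UD parts or keep
    the original surface form as a single token."""
    out = []
    for token in tokens:
        if split_multiwords and token in multiword_map:
            out.extend(multiword_map[token])
        else:
            out.append(token)
    return out

sentence = ["Ein", "Hotel", "zum", "Wohlfühlen", "."]
print(apply_multiwords(sentence, {"zum": ["zu", "dem"]}))
print(apply_multiwords(sentence, {"zum": ["zu", "dem"]}, split_multiwords=False))
```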

Pass pretokenized sentence to Sentence object (1965)

You can now pass a pretokenized sequence as a list of words (thanks ulf1):

```python
from flair.data import Sentence
sentence = Sentence(['The', 'grass', 'is', 'green', '.'])
print(sentence)
```


This should print:

```console
Sentence: "The grass is green ." [− Tokens: 5]
```


Map label names in sequence labeling datasets (1988)

You can now pass a label map to sequence labeling datasets to change label names (thanks pharnisch).

```python
# print tag dictionary with mapped names
corpus = CONLL_03_DUTCH(label_name_map={'PER': 'person', 'ORG': 'organization', 'LOC': 'location', 'MISC': 'other'})
print(corpus.make_tag_dictionary('ner'))

# print tag dictionary with original names
corpus = CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))
```


Data Sets

Universal Proposition Banks (1870 1866 1888)

Flair 0.7 adds support for 7 Universal Proposition Banks to train your own multilingual semantic role labelers (thanks to Dabendorf).

Load for instance with:

```python
# load English Universal Proposition Bank
corpus = UP_ENGLISH()
print(corpus)

# make dictionary of frames
frame_dictionary = corpus.make_tag_dictionary('frame')
print(frame_dictionary)
```


These are now available for Finnish, Chinese, Italian, French, German, Spanish and English.

NER Corpora

We add support for 6 new NER corpora:

Arabic NER Corpus (1901)

Added the ANER corpus for Arabic NER (thanks to megantosh).

```python
# load Arabic NER corpus
corpus = ANER_CORP()
print(corpus)
```


Movie NER Corpora (1912)

Added the MIT movie reviews corpora annotated with NER information, in simple and complex variants (thanks to pharnisch).

```python
# load simple movie NER corpus
corpus = MITMovieNERSimple()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

# load complex movie NER corpus
corpus = MITMovieNERComplex()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


Added SEC filings NER corpus (1922)

Added a corpus of SEC filings annotated with 4-class NER tags (thanks to samahakk).

```python
# load SEC filings corpus
corpus = SEC_FILLINGS()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


WNUT 2020 NER dataset support (1942)

Added a corpus of wet lab protocols annotated with NER information, used for the WNUT 2020 challenge (thanks to aynetdia).

```python
# load wet lab protocol data
corpus = WNUT_2020_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


Weibo NER dataset support (1944)

Added a dataset for NER on Chinese social media (thanks to 87302380).

```python
# load Weibo NER data
corpus = WEIBO_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


Added Finnish NER corpus (1946)

Added the TURKU corpus for Finnish NER (thanks to melvelet).

```python
# load Finnish NER data
corpus = TURKU_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```


Universal Dependency Treebanks

We add support for 11 new UD treebanks:

- Greek UD Treebank (1933, thanks malamasn)
- Livvi UD Treebank (1953, thanks hebecked)
- Naija UD Treebank (1952, thanks teddim420)
- Buryat UD Treebank (1954, thanks MaxDall)
- North Sami UD Treebank (1955, thanks dobbersc)
- Maltese UD Treebank (1957, thanks phkuep)
- Marathi UD Treebank (1958, thanks polarlyset)
- Afrikaans UD Treebank (1959, thanks QueStat)
- Gothic UD Treebank (1961, thanks wjSimon)
- Old French UD Treebank (1964, thanks Weyaaron)
- Wolof UD Treebank (1967, thanks LukasOpp)

Load each with language name, for instance:

```python
# load Gothic UD treebank data
corpus = UD_GOTHIC()
print(corpus)
print(corpus.test[0])
```


Added GoEmotions text classification corpus (1914)

Added the [GoEmotions dataset](https://github.com/google-research/google-research/tree/master/goemotions) containing 58k Reddit comments labeled with 27 emotion categories. Load with:

```python
# load GoEmotions corpus
corpus = GO_EMOTIONS()
print(corpus)
print(corpus.make_label_dictionary())
```


Enhancements and bug fixes
- Add handling for micro-average precision and recall (1935)
- Make dev and test splits in treebanks optional (1951)
- Updated communicative functions model (1857)
- Biomedical Data: Explicit encodings for Windows Support (1893)
- Fix wrong abstract method (1923 1940)
- Improve tutorial (1939)
- Fix requirements (1971)

0.6.1

Release 0.6.1 is a bugfix release that fixes the issues caused by moving the server that originally hosted the Flair models. Additionally, this release adds a ton of new NER datasets, including the XTREME corpus for 40 languages, and a new model for NER on German-language legal text.

New Model: Legal NER (1872)

Add legal NER model for German. Trained using the German legal NER dataset available [here](https://github.com/elenanereiss/Legal-Entity-Recognition) that can be loaded in Flair with the `LER_GERMAN` corpus object.

Uses German Flair and FastText embeddings and gets **96.35** F1 score.

Use like this:

```python
# load German LER tagger
tagger = SequenceTagger.load('de-ler')

# example text
text = "vom 6. August 2020. Alle Beschwerdeführer befinden sich derzeit gemeinsam im Urlaub auf der Insel Mallorca , die vom Robert-Koch-Institut als Risikogebiet eingestuft wird. Sie wollen am 29. August 2020 wieder nach Deutschland einreisen, ohne sich gemäß § 1 Abs. 1 bis Abs. 3 der Verordnung zur Testpflicht von Einreisenden aus Risikogebieten auf das SARS-CoV-2-Virus testen zu lassen. Die Verordnung sei wegen eines Verstoßes der ihr zugrunde liegenden gesetzlichen Ermächtigungsgrundlage, des § 36 Abs. 7 IfSG , gegen Art. 80 Abs. 1 Satz 1 GG verfassungswidrig."

sentence = Sentence(text)

# predict and print entities
tagger.predict(sentence)

for entity in sentence.get_spans('ner'):
    print(entity)
```


New Datasets

Add XTREME and WikiANN corpora for multilingual NER (1862)

These huge corpora provide training data for NER in 176 languages. You can either load the language-specific parts of it by supplying a language code:

```python
# load German Xtreme
german_corpus = XTREME('de')
print(german_corpus)

# load French Xtreme
french_corpus = XTREME('fr')
print(french_corpus)
```


Or you can load the default 40 languages at once into one huge MultiCorpus by not providing a language ID:

```python
# load Xtreme MultiCorpus for all
multi_corpus = XTREME()
print(multi_corpus)
```


Add Twitter NER Dataset (1850)

Dataset of [tweets](https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/ner.txt) annotated with NER tags. Load with:

```python
# load twitter dataset
corpus = TWITTER_NER()

# print example tweet
print(corpus.test[0])
```


Add German Europarl NER Dataset (1849)

Dataset of German-language speeches in the European parliament annotated with standard NER tags like person and location. Load with:

```python
# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)

# print a test sentence
print(corpus.test[1])
```


Add MIT Restaurant NER Dataset (1177)

Dataset of English restaurant reviews annotated with entities like "dish", "location" and "rating". Load with:

```python
# load restaurant dataset
corpus = MIT_RESTAURANTS()

# print example sentence
print(corpus.test[0])
```


Add Universal Propositions Banks for French and German (1866)

Our kickoff into supporting the [Universal Proposition Banks](https://github.com/System-T/UniversalPropositions) adds the first two UP datasets to Flair. Load with:

```python
# load German UP
corpus = UP_GERMAN()
print(corpus)

# print example sentence
print(corpus.dev[1])
```


Add Universal Dependencies Dataset for Chinese (1880)

Adds the Kyoto dataset for Chinese. Load with:

```python
# load Chinese UD dataset
corpus = UD_CHINESE_KYOTO()

# print example sentence
print(corpus.test[0])
```


Bug fixes

- Move models to HU server (1834 1839 1842)
- Fix deserialization issues in transformer tokenizers (1865)
- Documentation fixes (1819 1821 1836 1852)
- Add link to a repo with examples of Flair on GCP (1825)
- Correct variable names (1875)
- Fix problem with custom delimiters in ColumnDataset (1876)
- Fix offensive language detection model (1877)
- Correct Dutch NER model (1881)

0.6

Biomedical Models and Datasets:

Most of the biomedical models and datasets were developed together with the [Knowledge Management in Bioinformatics](https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi) group at the HU Berlin, in particular leonweber and mariosaenger. [This page](https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR.md) gives an overview of the new models and datasets, and example tutorials. Some highlights:


Biomedical NER models (1790)

Flair now has pre-trained models for biomedical NER trained over unified versions of 31 different biomedical corpora. Because they are trained on so many different datasets, the models are very robust on new datasets, outperforming all previously available off-the-shelf models. If you want to load a model to detect "diseases" in text, for instance, do:

```python
# make a sentence
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load disease tagger and predict
tagger = SequenceTagger.load("hunflair-disease")
tagger.predict(sentence)
```


Done! Let's print the diseases found by the tagger:

```python
for entity in sentence.get_spans():
    print(entity)
```

This should print:
~~~
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
~~~

You can also get one model that finds 5 biomedical entity types (diseases, genes, species, chemicals and cell lines), like this:

```python
# load bio-NER tagger and predict
tagger = MultiTagger.load("hunflair")
tagger.predict(sentence)
```

This should print:
~~~
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
Span [5]: "Fmr1" [− Labels: Gene (0.838)]
Span [7]: "Mouse" [− Labels: Species (0.9979)]
~~~

So it now also finds genes and species. As explained [here](https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR.md) these models work best if you use them together with a biomedical tokenizer.


Biomedical NER datasets (1790)

Flair now supports 31 biomedical NER datasets out of the box, both in their standard versions as well as the "Huner" splits for reproducibility of experiments. For a full list of datasets, refer to [this page](https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR_CORPORA.md).

You can load a dataset like this:

```python
# load one of the bioinformatics corpora
corpus = JNLPBA()

# print statistics and one sentence
print(corpus)
print(corpus.train[0])
```


We also include "huner" corpora that combine many different biomedical datasets into a single corpus. For instance, if you execute the following line:

```python
# load combined chemicals corpus
corpus = HUNER_CHEMICAL()
```


This loads a combination of 6 different corpora that contain annotation of chemicals into a single corpus. This allows you to train stronger cross-corpus models since you now combine training data from many sources. See more info [here](https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR_CORPORA.md#huner-data-sets).


POS model for Portuguese clinical text (1789)

Thanks to LucasFerroHAILab, we now include a model for part-of-speech tagging in Portuguese clinical text. Run this model like this:

```python
# load your tagger
tagger = SequenceTagger.load('pt-pos-clinical')

# example sentence
sentence = Sentence('O vírus Covid causa fortes dores .')
tagger.predict(sentence)
print(sentence)
```


You can find more details in their paper [here](https://link.springer.com/article/10.1007/s42600-020-00067-7).


Model for negation and speculation in biomedical literature (1758)

Using the BioScope corpus, we trained a model to recognize negation and speculation in biomedical literature. Use it like this:

```python
sentence = Sentence("The picture most likely reflects airways disease")

tagger = SequenceTagger.load("negation-speculation")
tagger.predict(sentence)

for entity in sentence.get_spans():
    print(entity)
```


This should print:

~~~
Span [4,5,6,7]: "likely reflects airways disease" [− Labels: SPECULATION (0.9992)]
~~~

Thus indicating that this portion of the sentence is speculation.


Other New Features:

MultiTagger (1791)

We added support for tagging text with multiple models at the same time. This can reduce memory usage and increase tagging speed.

For instance, if you want to POS tag, chunk, NER and detect frames in your text at the same time, do:

```python
# load tagger for POS, chunking, NER and frame detection
tagger = MultiTagger.load(['pos', 'upos', 'chunk', 'ner', 'frame'])

# example sentence
sentence = Sentence("George Washington was born in Washington")

# predict
tagger.predict(sentence)

print(sentence)
```


This will give you a sentence annotated with 5 different layers of annotation.

Sentence splitting

Flair now includes convenience methods for sentence splitting. For instance, to use segtok to split and tokenize a text into sentences, use the following code:

```python
from flair.tokenization import SegtokSentenceSplitter

# example text with many sentences
text = "This is a sentence. This is another sentence. I love Berlin."

# initialize sentence splitter
splitter = SegtokSentenceSplitter()

# use splitter to split text into list of sentences
sentences = splitter.split(text)
```


We also ship other splitters, such as `SpacySentenceSplitter` (requires SpaCy to be installed).

Japanese tokenization (1786)

Thanks to himkt we now have expanded support for Japanese tokenization in Flair. For instance, use the following code to tokenize a Japanese sentence without installing extra libraries:

```python
from flair.data import Sentence
from flair.tokenization import JapaneseTokenizer

# init japanese tokenizer
tokenizer = JapaneseTokenizer("janome")

# make sentence (and tokenize)
sentence = Sentence("私はベルリンが好き", use_tokenizer=tokenizer)

# output tokenized sentence
print(sentence)
```


One-Cycle Learning (1776)

Thanks to lucaventurini2, Flair now supports one-cycle learning, which may give quicker convergence. For instance, train a model in 20 epochs using the code below:

```python
# train as always
trainer = ModelTrainer(tagger, corpus)

# set one-cycle LR as scheduler
trainer.train('onecycle_ner',
              scheduler=OneCycleLR,
              max_epochs=20)
```


Improvements:

Changes in convention

Turn on tokenizer by default in `Sentence` object (1806)

The `Sentence` object now executes tokenization (`use_tokenizer=True`) by default:

```python
# tokenizes by default
sentence = Sentence("I love Berlin.")
print(sentence)

# i.e. this is equivalent to
sentence = Sentence("I love Berlin.", use_tokenizer=True)
print(sentence)

# i.e. if you don't want to use tokenization, set it to False
sentence = Sentence("I love Berlin.", use_tokenizer=False)
print(sentence)
```


`TransformerWordEmbeddings` now handle long documents by default

Previously, you had to set `allow_long_sentences=True` to enable handling of long sequences (greater than 512 subtokens) in `TransformerWordEmbeddings`. This is no longer necessary, as this value is now set to `True` by default.
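The long-sequence handling follows the usual sliding-window idea, roughly like this sketch (illustrative only; Flair's actual window and stride sizes may differ):

```python
def chunk_long_sequence(subtoken_ids, max_len=512, overlap=64):
    """Split an over-long subtoken sequence into overlapping windows
    so each window fits the transformer's maximum input length."""
    chunks, start = [], 0
    while True:
        chunks.append(subtoken_ids[start:start + max_len])
        if start + max_len >= len(subtoken_ids):
            break
        start += max_len - overlap  # overlap keeps context across window boundaries
    return chunks

chunks = chunk_long_sequence(list(range(1000)))
print([len(c) for c in chunks])
```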


Bug fixes
- Fix serialization of `BytePairEmbeddings` (1802)
- Fix issues with loading models that use `ELMoEmbeddings` (1803)
- Allow longer lengths in transformers that can handle more than 512 subtokens (1804)
- Fix encoding for WASSA datasets (1766)
- Update BPE package (1764)
- Improve documentation (1752 1778)
- Fix evaluation of `TextClassifier` if no `label_type` is passed (1748)
- Remove torch version checks that throw errors (1744)
- Update DaNE dataset URL (1800)
- Fix weight extraction error for empty sentences (1805)

0.5.1

New Features and Enhancements:

TransformerWordEmbeddings can now process long sentences (1680)

Adds a heuristic as a workaround for the maximum sequence length of some transformer embeddings, making it possible to embed sequences of arbitrary length if you set `allow_long_sentences=True`, like so:

python
embedding = TransformerWordEmbeddings(
    allow_long_sentences=True,  # set allow_long_sentences to True to enable this feature
)


Setting random seeds (1671)

It is now possible to set seeds when loading and downsampling corpora, so that the sample is always the same:

python
# set a random seed
import random
random.seed(4)

# load and downsample corpus
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)

# print first sentence of dev and test
print(corpus.dev[0])
print(corpus.test[0])


Make reprojection layer optional (1676)

Makes the reprojection layer optional in the `SequenceTagger`. You can control this behavior through the `reproject_embeddings` parameter. If you set it to `True`, embeddings are reprojected via a linear map to a vector of identical size. If set to `False`, no reprojection happens. If you set this parameter to an integer, the linear map projects embedding vectors to vectors of this size.

python
# tagger with standard reprojection
tagger = SequenceTagger(
    hidden_size=256,
    # [...]
    reproject_embeddings=True,
)

# tagger without reprojection
tagger = SequenceTagger(
    hidden_size=256,
    # [...]
    reproject_embeddings=False,
)

# reprojection to vectors of length 128
tagger = SequenceTagger(
    hidden_size=256,
    # [...]
    reproject_embeddings=128,
)


Set label name when predicting (1671)

You can now optionally specify the "label name" of the predicted label. This may be useful if, for instance, you want to run two different NER models on the same sentence:

python
sentence = Sentence('I love Berlin')

# load two NER taggers
tagger_1 = SequenceTagger.load('ner')
tagger_2 = SequenceTagger.load('ontonotes-ner')

# specify label name of tagger_1 to be 'conll03_ner'
tagger_1.predict(sentence, label_name='conll03_ner')

# specify label name of tagger_2 to be 'onto_ner'
tagger_2.predict(sentence, label_name='onto_ner')

print(sentence)


This is useful if you have multiple NER taggers and wish to tag the same sentence with all of them, since it lets you distinguish the tags by tagger. Note that it is also no longer possible to pass a string to the predict method; you must now pass a sentence.

Sentence Transformers (1696)

Adds the `SentenceTransformerDocumentEmbeddings` class so you can get embeddings from the [`sentence-transformer`](https://github.com/UKPLab/sentence-transformers) library. Use as follows:

python
from flair.data import Sentence
from flair.embeddings import SentenceTransformerDocumentEmbeddings

# init embedding
embedding = SentenceTransformerDocumentEmbeddings('bert-base-nli-mean-tokens')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)


You can find a full list of their pretrained models [here](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0).

Other enhancements
- Update to transformers 3.0.0 (1727)
- Better Memory mode presets for classification corpora (1701)
- `ClassificationDataset` now also accepts lines with "\t" separator in addition to blank spaces (1654)
- Change default fine-tuning in DocumentPoolEmbeddings to "none" (1675)
- Short-circuit the embedding loop (1684)
- Add option to pass kwargs into transformer models when initializing model (1694)


New Datasets and Models

Two new dutch NER models (1687)

The new default model is a BERT-based RNN model with the highest accuracy:

python
from flair.data import Sentence
from flair.models import SequenceTagger

# load the default BERT-based model
tagger = SequenceTagger.load('nl-ner')

# tag sentence
sentence = Sentence('Ik hou van Amsterdam')
tagger.predict(sentence)


You can also load a Flair-based RNN model (might be faster on some setups):

python
# load the Flair-based RNN model
tagger = SequenceTagger.load('nl-ner-rnn')


Corpus of communicative functions (1683) and pre-trained model (1706)

Adds a corpus of communicative functions in scientific literature, described in this [LREC paper](https://www.researchgate.net/publication/339658767_An_Evaluation_Dataset_for_Identifying_Communicative_Functions_of_Sentences_in_English_Scholarly_Papers) and available [here](https://github.com/Alab-NII/FECFevalDataset). Load with:

python
corpus = COMMUNICATIVE_FUNCTIONS()
print(corpus)


We also ship a pre-trained model on this corpus, which you can load with:
python
# load communicative function tagger
tagger = TextClassifier.load('communicative-functions')

# make example sentence
sentence = Sentence("However, previous approaches are limited in scalability .")

# predict and print labels
tagger.predict(sentence)
print(sentence.labels)



Keyword Extraction Corpora (1629) and pre-trained model (1689)

Added 3 datasets available for [keyphrase extraction](https://github.com/midas-research/keyphrase-extraction-as-sequence-labeling-data) via sequence labeling: [Inspec](https://github.com/midas-research/keyphrase-extraction-as-sequence-labeling-data/tree/master/Inspec), [SemEval-2017](https://github.com/midas-research/keyphrase-extraction-as-sequence-labeling-data/tree/master/SemEval-2017) and [Processed SemEval-2010](https://github.com/midas-research/keyphrase-extraction-as-sequence-labeling-data/tree/master/processed_semeval-2010)

Load like this:

python
inspec_corpus = INSPEC()
semeval_2010_corpus = SEMEVAL2010()
semeval_2017 = SEMEVAL2017()


We also ship a pre-trained model on this corpus, which you can load with:

python
# load keyphrase tagger
tagger = SequenceTagger.load('keyphrase')

# make example sentence
sentence = Sentence("Here, we describe the engineering of a new class of ECHs through the "
                    "functionalization of non-conductive polymers with a conductive choline-based "
                    "bio-ionic liquid (Bio-IL).", use_tokenizer=True)

# predict and print labels
tagger.predict(sentence)
print(sentence)


Swedish NER (1652)

Adds a corpus for Swedish NER using the dataset at https://github.com/klintan/swedish-ner-corpus/. Load with:

python
corpus = NER_SWEDISH()
print(corpus)


German Legal Named Entity Recognition (1697)

Adds corpus of legal named entities for German. Load with:
python
corpus = LER_GERMAN()
print(corpus)


Refactoring of evaluation

We made a number of refactorings to the evaluation routines in Flair. In short: whenever possible, we now use the evaluation methods of sklearn instead of our own implementations, which kept accumulating issues. This applies to text classification and (most) sequence tagging.
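For text classification, this boils down to standard sklearn reporting calls. A small illustrative example with toy labels (the label values here are made up for demonstration):

```python
from sklearn.metrics import classification_report, f1_score

# toy gold and predicted labels, just to show the kind of sklearn-based
# reporting the refactored evaluation relies on
y_true = ['POSITIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'POSITIVE']
y_pred = ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE']

# per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=4))

# micro-F1 equals accuracy for single-label classification
print(f1_score(y_true, y_pred, average='micro'))
```

Delegating to sklearn keeps the metrics consistent with the wider ecosystem instead of maintaining a parallel implementation.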

A notable exception is "span-F1" which is used to evaluate NER because there is no good way of counting true negatives. After this PR, our implementation should now exactly mirror the original `conlleval` script of the CoNLL-02 challenge. In addition to using our reimplementation, an output file is now automatically generated that can be directly used with the `conlleval` script.

In more detail, this PR makes the following changes:

- `Span` is now a list of `Token` and can now be iterated like a sentence
- `flair.DataLoader` is now used throughout
- The `evaluate()` interface in the `Model` base class is changed so that it no longer requires a data loader; it can now run either over a list of `Sentence` or a `Dataset`
- `SequenceTagger.evaluate()` now explicitly distinguishes between F1 and Span-F1. In the latter case, no TN are counted (1663) and a non-sklearn implementation is used.
- In the `evaluate()` method of the `SequenceTagger` and `TextClassifier`, we now explicitly call the `.predict()` method.

Bug fixes:

- Fix figsize issue (1622)
- Allow strings to be passed instead of Path (1637)
- Fix segtok tokenization issue (1653)
- Serialize dropout in `SequenceTagger` (1659)
- Fix serialization error in `DocumentPoolEmbeddings` (1671)
- Fix subtokenization issues in transformers (1674)
- Add new datasets to __init__.py (1677)
- Fix deprecation warnings due to invalid escape sequences. (1678)
- Fix PooledFlairEmbeddings deserialization error (1604)
- Fix transformer tokenizer deserialization (1686)
- Fix issues caused by embedding mode and lambda functions in ELMoEmbeddings (1692)
- Fix serialization error in PooledFlairEmbeddings (1593)
- Fix mean pooling in PooledFlairEmbeddings (1698)
- Fix condition to assign whitespace_after attribute in the build_spacy_tokenizer wrapper (1700)
- Fix WIKINER encoding for windows (1713)
- Detect and ignore empty sentences in BERT embeddings (1716)
- Fix error in returning multiple classes (1717)

0.5

Not secure
Transformer Word Embeddings

If you want to embed the words in a sentence with transformers, do it like this:

python
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)


If instead you want to use RoBERTa, do:

python
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)


Transformer Document Embeddings

To get a single embedding for the whole document with BERT, do:

python
from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)


If instead you want to use RoBERTa, do:

python
from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)


Text classification by fine-tuning a transformer

Importantly, you can now fine-tune transformers to get state-of-the-art accuracies in text classification tasks.
Use `TransformerDocumentEmbeddings` for this and set `fine_tune=True`. Then, use the following example code:


python
from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('resources/taggers/trec',
              learning_rate=3e-5,       # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4,  # optionally set this if transformer is too much for your machine
              max_epochs=5,             # terminate after 5 epochs
              )


New Taggers, Embeddings and Datasets

Flair 0.5 adds a ton of new taggers, embeddings and datasets.

New Taggers

New sentiment models (1613)

We added new sentiment models for English. The new models are trained over a combined corpus of sentiment datasets, including Amazon product reviews. They should therefore be applicable to more domains than the old sentiment models, which were trained only on movie reviews.

There are two new models, a transformer-based model you can load like this:

python
# load tagger
classifier = TextClassifier.load('sentiment')

# predict for example sentence
sentence = Sentence("enormously entertaining for moviegoers of any age .")
classifier.predict(sentence)

# check prediction
print(sentence)


And a faster, slightly less accurate model based on RNNs you can load like this:

python
classifier = TextClassifier.load('sentiment-fast')


Fine-grained POS models for English (1625)

Adds fine-grained POS models for English so you now have the option between 'pos' and 'upos' models for fine-grained and universal dependencies respectively. Load like this:

python
# fine-grained POS model
tagger = SequenceTagger.load('pos')

# fine-grained POS model (fast variant)
tagger = SequenceTagger.load('pos-fast')

# universal POS model
tagger = SequenceTagger.load('upos')

# universal POS model (fast variant)
tagger = SequenceTagger.load('upos-fast')


Added Malayalam POS and XPOS tagger model (1522)

Added taggers for historical German speech and thought (1532)

New Embeddings

Added language models for historical German by redewiedergabe (1507)

Load the language models with:

python
embeddings_forward = FlairEmbeddings('de-historic-rw-forward')
embeddings_backward = FlairEmbeddings('de-historic-rw-backward')


Added Malayalam flair embeddings models (1458)

python
embeddings_forward = FlairEmbeddings('ml-forward')
embeddings_backward = FlairEmbeddings('ml-backward')


Added Flair Embeddings from CLEF HIPE Shared Task (1554)

Adds the recently trained Flair embeddings on historic newspapers for German/English/French provided by the [CLEF HIPE shared task](https://impresso.github.io/CLEF-HIPE-2020/).

New Datasets

Added NER dataset for Finnish (1620)

You can now load a Finnish NER corpus with
python
ner_finnish = flair.datasets.NER_FINNISH()


Added DaNE dataset (1425)

You can now load a Danish NER corpus with
python
dane = flair.datasets.DANE()


Added SentEval classification datasets (1454)

Adds 6 SentEval classification datasets to Flair:

python
senteval_corpus_1 = flair.datasets.SENTEVAL_CR()
senteval_corpus_2 = flair.datasets.SENTEVAL_MR()
senteval_corpus_3 = flair.datasets.SENTEVAL_SUBJ()
senteval_corpus_4 = flair.datasets.SENTEVAL_MPQA()
senteval_corpus_5 = flair.datasets.SENTEVAL_SST_BINARY()
senteval_corpus_6 = flair.datasets.SENTEVAL_SST_GRANULAR()


Added Sentiment Datasets (1545)

Adds two new sentiment datasets to Flair, namely AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment.

python
amazon_reviews = flair.datasets.AMAZON_REVIEWS()
sentiment_140 = flair.datasets.SENTIMENT_140()


Added BIOfid dataset (1589)
python
biofid = flair.datasets.BIOFID()


Refactorings

Any DataPoint can now be labeled (1450)

Refactored the `DataPoint` class and classes that inherit from it (`Token`, `Sentence`, `Image`, `Span`, etc.) so that all have the same methods for adding and accessing labels.

- The `DataPoint` base class now defines the labeling methods (closes 1449)
- Labels can no longer be passed to `Sentence` constructor, so instead of:
python
sentence_1 = Sentence("this is great", labels=[Label("POSITIVE")])

you should now do:
python
sentence_1 = Sentence("this is great")
sentence_1.add_label('sentiment', 'POSITIVE')

or:
python
sentence_1 = Sentence("this is great").add_label('sentiment', 'POSITIVE')


Note that Sentence labels now have a `label_type` (in the example that's 'sentiment').

- The `Corpus` method `_get_class_to_count` is renamed to `_count_sentence_labels`
- The `Corpus` method `_get_tag_to_count` is renamed to `_count_token_labels`
- `Span` is now a `DataPoint` (so it has an `embedding` and `labels`)

Embeddings module was split into smaller submodules (1588)

Split the previously huge `embeddings.py` into several submodules organized in an `embeddings/` folder. The submodules are:

- `token.py` for all `TokenEmbeddings` classes
- `document.py` for all `DocumentEmbeddings` classes
- `image.py` for all `ImageEmbeddings` classes
- `legacy.py` for embeddings that are now deprecated
- `base.py` for remaining basic classes

All embeddings are still exposed through the embeddings package, so the command to load them doesn't change, e.g.:

python
from flair.embeddings import FlairEmbeddings
embeddings = FlairEmbeddings('news-forward')

so specifying the submodule is not needed.

Datasets module was split into smaller submodules (1510)

Split the previously huge `datasets.py` into several submodules organized in a `datasets/` folder. The submodules are:

- `sequence_labeling.py` for all sequence labeling datasets
- `document_classification.py` for all document classification datasets
- `treebanks.py` for all dependency parsed corpora (UD treebanks)
- `text_text.py` for all bi-text datasets (currently only parallel corpora)
- `text_image.py` for all paired text-image datasets (currently only Feidegger)
- `base.py` for remaining basic classes

All datasets are still exposed through the datasets package, so it is still possible to load corpora with
python
from flair.datasets import TREC_6

without specifying the submodule.

Other refactorings

- Refactor datasets for code legibility (1394)

Small refactorings on `flair.datasets` for easier code legibility and fewer redundancies, removing about 100 lines of code: (1) Moved the default sampling logic from all corpora classes to the parent `Corpus` class. You can now instantiate a `Corpus` only with a train file which will trigger the sampling. (2) Moved the default logic for identifying train, dev and test files into a dedicated method to avoid duplicates in code.

- Extend string output of Sentence (1452)

Other

New Features

Add option to specify document delimiter for language model training (1541)

You now have the option of specifying a `document_delimiter` when training a `LanguageModel`. Say you have a corpus of textual lists and use "[SEP]" to mark boundaries between two lists, like this:


Colors:
- blue
- green
- red
[SEP]
Cities:
- Berlin
- Munich
[SEP]
...


Then you can now train a language model by setting the `document_delimiter` in the `TextCorpus` and `LanguageModel` objects. This will make sure only documents as a whole will get shuffled during training (i.e. the lists in the above example):

python
# your document delimiter
delimiter = '[SEP]'

# set it when you load the corpus
corpus = TextCorpus(
    "data/corpora/conala-corpus/",
    dictionary,
    is_forward_lm,
    character_level=True,
    document_delimiter=delimiter,
)

# set it when you init the language model
language_model = LanguageModel(
    dictionary,
    is_forward_lm=True,
    hidden_size=512,
    nlayers=1,
    document_delimiter=delimiter,
)

# train your language model as always
trainer = LanguageModelTrainer(language_model, corpus)

Allow column delimiter to be set in ColumnCorpus (1526)

Added the possibility to set a different column delimiter for `ColumnCorpus`, i.e.

python
corpus = ColumnCorpus(
    Path("/path/to/corpus/"),
    column_format={0: 'text', 1: 'ner'},
    column_delimiter='\t',  # set a different delimiter
)


if you want to read a tab-separated column corpus.

Improvements in classification corpus datasets (1545)

There are a number of improvements for the `ClassificationCorpus` and `ClassificationDataset` classes:
- It is now possible to select from three memory modes ('full', 'partial' and 'disk'). Use 'full' if the entire dataset and all objects fit into memory. Use 'partial' if they do not, and 'disk' if even 'partial' does not fit.
- It is also now possible to provide "name maps" to rename labels in datasets. For instance, some sentiment analysis datasets use '0' and '1' as labels, while some others use 'POSITIVE' and 'NEGATIVE'. By providing name maps you can rename labels so they are consistent across datasets.
- You can now choose which splits to downsample (for instance you might want to downsample 'train' and 'dev' but not 'test')
- You can now specify the option `filter_if_longer_than` to filter out all sentences with more than the given number of whitespace-separated tokens. This is useful to limit corpus size, as some sentiment analysis datasets are gigantic.
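As an illustration of the name-map idea, here is what such a mapping does conceptually (pure-Python sketch; the dictionary contents are made-up example labels, and Flair applies a map like this while reading the dataset):

```python
# a name map translates raw dataset labels to canonical ones, so that
# differently-labeled sentiment corpora become consistent with each other
label_name_map = {'0': 'NEGATIVE', '1': 'POSITIVE'}

raw_labels = ['0', '1', '1', '0']
mapped = [label_name_map[label] for label in raw_labels]
print(mapped)  # prints ['NEGATIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE']
```

With all corpora mapped onto one label vocabulary, they can be combined into a single training corpus.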

Added different ways to combine ELMo layers (1547)

Improved default annealing scheme to anneal against score and loss (1570)

Adds a new scheduler that anneals against the dev score as its main metric, but additionally uses the dev loss as a tie-breaker in case two epochs have the same dev score.
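The tie-breaking rule can be sketched as follows (pure-Python illustration of the idea, not Flair's actual scheduler code; the function name is a made-up example):

```python
def is_improvement(best, current):
    """Decide whether `current` beats `best`.

    Each value is a (dev_score, dev_loss) pair. A higher score wins
    outright; on a score tie, the lower loss wins. Illustrative sketch
    of the scheduler's tie-breaking idea only.
    """
    best_score, best_loss = best
    score, loss = current
    if score != best_score:
        return score > best_score
    return loss < best_loss

print(is_improvement((0.91, 0.30), (0.91, 0.25)))  # prints True: same score, lower loss
print(is_improvement((0.91, 0.30), (0.90, 0.10)))  # prints False: lower score loses
```

Using the loss as a secondary signal avoids annealing the learning rate prematurely when the score plateaus but the model is still improving.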

Added option for hidden state position in FlairEmbeddings (1571)

Adds the option to choose which hidden state to use in FlairEmbeddings: either the state at the end of each word, or the state at the whitespace after. Default is the state at the whitespace after.

You can change the default like this:
python
embeddings = FlairEmbeddings('news-forward', with_whitespace=False)


This configuration seems to be better for syntactic tasks. For POS tagging, it seems that you should set `with_whitespace=False`. For instance, on UD_ENGLISH POS-tagging, we get **96.56 +- 0.03** with whitespace and **96.72 +- 0.04** without, averaged over three runs.

See the discussion in 1362 for more details.

Other features

- Added the option of passing different tokenizers when loading classification datasets (1579)

- Added option for true whitespaces in ColumnCorpus 1583

- Configurable cache_root from environment variable (507)
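For example, you can redirect Flair's download cache before starting Python (the variable name `FLAIR_CACHE_ROOT` is the one used by recent Flair versions; treat it as an assumption for the release described here):

```shell
# point Flair's model/dataset cache at a custom directory;
# Flair reads this environment variable when resolving its cache root
export FLAIR_CACHE_ROOT=/data/flair_cache
echo "$FLAIR_CACHE_ROOT"
```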

Performance improvements

- Improve performance for loading not-in-memory corpus (1413)

- A new lmdb based alternative backend for word embeddings (1515 1536)

- Slim down requirements (1419)

Bug Fixes

- Fix issue where flair was crashing for cpu only version of pytorch (1393 1418)

- Fix GPU memory error in PooledFlairEmbeddings (1417)

- Various small fixes (1402 1533 1511 1560 1616)

- Improve documentation (1446 1447 1520 1525 1556)

- Fix various issues in classification datasets (1499)
