Flair

Latest version: v0.15.1

0.12.2

Not secure

Another follow-up release to 0.12 that fixes several bugs and adds a new multilingual frame tagger. Further, our new documentation website at https://flairnlp.github.io/docs/intro is now online!

New frame tagging model 3172

Adds a new model for detecting PropBank frames. The model is trained using the "FLERT" approach, so it is much stronger than the previous 'frame' model. We also added some training data from the Universal Proposition Bank to improve multilingual frame detection.

Use it like this:

python
from flair.data import Sentence
from flair.nn import Classifier

# load the large frame model
model = Classifier.load('frame-large')

# English sentence with the verb "return" in two different senses
sentence = Sentence("Dirk returned to Berlin to return his hat.")
model.predict(sentence)
print(sentence)

# German sentence with the verb "trug" in two different senses
sentence_de = Sentence("Dirk trug einen Koffer und trug einen Hut.")
model.predict(sentence_de)
print(sentence_de)

This should print:

console
Sentence[9]: "Dirk returned to Berlin to return his hat." → ["returned"/return.01, "return"/return.02]

Sentence[9]: "Dirk trug einen Koffer und trug einen Hut." → ["trug"/carry.01, "trug"/wear.01]

The printout tells us that the verbs in both sentences are correctly disambiguated.

Documentation
- adds a pointer to the new Flair documentation website at https://flairnlp.github.io/docs/intro
- adds a night mode Flair logo 3145

Enhancements / New Features
- more consistent behavior of context dropout and FLERT token 3168
- setting device through environment variable 3148 (thanks HallerPatrick)
- modify Sentence.to_original_text() to take into account Sentence.start_position for whitespace calculation 3150 (thanks mauryaland)
- gather dev and test labels if the dataset is available 3162 (thanks helpmefindaname)

Bug fixes
- fix bugs caused by wrong data point equality and caching 3157
- fix transformer smaller training vocab 3155 (thanks helpmefindaname)
- update scispacy version 3144 (thanks mariosaenger)
- unpin huggingface-hub 3149 (thanks marctorsoc)

0.12.1

Not secure

This is a quick follow-up release to 0.12 that fixes a few small bugs and includes an improved version of our Zelda entity linker.

New Entity Linking model

We include a new version of our Zelda entity linker with improved predictions. Try it as follows:

python
from flair.nn import Classifier
from flair.data import Sentence

# load the model
tagger = Classifier.load('linker')

# make a sentence
sentence = Sentence('Kirk and Spock met on the Enterprise.')

# predict NER tags
tagger.predict(sentence)

# print predicted entities
for label in sentence.get_labels():
    print(label)

This should print:
console

0.12

Not secure

New Features

Simplify Flair model usage 3067

You can now load any Flair model through its parent class. Since most models inherit from `Classifier`, you can load and run multiple different models with exactly the same code. So, to run three different taggers for sentiment, entities and frames, do:

python
from flair.data import Sentence
from flair.nn import Classifier

# load three taggers to tag entities, frames and sentiment
tagger_1 = Classifier.load('ner')
tagger_2 = Classifier.load('frame')
tagger_3 = Classifier.load('sentiment')

# example sentence
sentence = Sentence('Dirk celebrated in Essen')

# predict with all three models
tagger_1.predict(sentence)
tagger_2.predict(sentence)
tagger_3.predict(sentence)

# print all predictions
for label in sentence.get_labels():
    print(label)

With this change, users no longer need to know which model classes implement which model. For more advanced users who do know this, the regular way for loading a model still works:
python
sentiment_tagger = TextClassifier.load('sentiment')
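
To illustrate the idea of loading any model through a shared parent class, here is a minimal toy sketch of a name-to-class registry. It is not Flair's implementation: the `Toy*` classes and the `model_names` attribute are hypothetical and only show how a parent `load` can dispatch to the right subclass.

```python
from typing import Dict, Tuple, Type


class ToyClassifier:
    """Hypothetical parent class that knows which subclass serves which model name."""

    _registry: Dict[str, Type["ToyClassifier"]] = {}
    model_names: Tuple[str, ...] = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # register every model name the subclass declares
        for name in cls.model_names:
            ToyClassifier._registry[name] = cls

    @classmethod
    def load(cls, name: str) -> "ToyClassifier":
        target = ToyClassifier._registry[name]
        # loading through a subclass only works for that subclass's own models
        if not issubclass(target, cls):
            raise ValueError(f"'{name}' is not a {cls.__name__} model")
        return target()


class ToySequenceTagger(ToyClassifier):
    model_names = ("ner", "frame")


class ToyTextClassifier(ToyClassifier):
    model_names = ("sentiment",)


# the caller does not need to know which class implements which model
for name in ("ner", "frame", "sentiment"):
    print(name, "->", type(ToyClassifier.load(name)).__name__)
```

Loading through the specific class (`ToyTextClassifier.load("sentiment")`) still works in this sketch, mirroring the "regular way" mentioned above.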

Entity Linking (BETA)

As of Flair 0.12 we ship an **experimental entity linker** trained on the [Zelda dataset](https://github.com/flairNLP/zelda). The linker not only tags entities, but also attempts to link each entity to the corresponding Wikipedia URL if one exists.

To illustrate, let's use a short example text with two mentions of "Barcelona". The first refers to the football club "FC Barcelona", the second to the city "Barcelona".

python
from flair.nn import Classifier
from flair.data import Sentence

# load the model
tagger = Classifier.load('linker')

# make a sentence
sentence = Sentence('Bayern played against Barcelona. The match took place in Barcelona.')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence)

This should print:
console
Sentence[12]: "Bayern played against Barcelona. The match took place in Barcelona." → ["Bayern"/FC_Bayern_Munich, "Barcelona"/FC_Barcelona, "Barcelona"/Barcelona]

As we can see, the linker can resolve what the two mentions of "Barcelona" refer to:
- the first mention "Barcelona" is linked to "FC_Barcelona"
- the second mention "Barcelona" is linked to "Barcelona"

Additionally, the mention "Bayern" is linked to "FC_Bayern_Munich", telling us that here the football club is meant.

Entity linking support includes:
- Support for the ZELDA candidate lists 3108 3111
- Support for the ZELDA training and evaluation dataset 3088

Support for Ukrainian language 3026

This version adds support for Ukrainian taggers, embeddings and datasets. For instance, to do NER and POS tagging of a Ukrainian sentence, do:

python
# Load Ukrainian NER and POS taggers
from flair.models import SequenceTagger

ner_tagger = SequenceTagger.load('ner-ukrainian')
pos_tagger = SequenceTagger.load('pos-ukrainian')

# Tag a sentence
from flair.data import Sentence
sentence = Sentence("Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди.")

ner_tagger.predict(sentence)
pos_tagger.predict(sentence)

print(sentence)
# "Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди." →
# ["Сьогодні"/ADV, "в"/ADP, "Знам’янці"/LOC, "Знам’янці"/PROPN, "проживають"/VERB, "нащадки"/NOUN, "поета"/NOUN, "—"/PUNCT, "родина"/NOUN, "Шкоди"/PERS, "Шкоди"/PROPN, "."/PUNCT]

Multitask Learning (2910 3085 3101)
We add support for multitask learning in Flair (closes 2508 and closes 1260) with hopefully a simple syntax to define multiple tasks that share parts of the model.

The most common part to share is the transformer, which you might want to fine-tune across several tasks. Instantiate a transformer embedding and pass it to two separate models that you instantiate as before:

python
# imports added for completeness (module paths assumed from the Flair package layout)
from flair.datasets import SENTEVAL_CR, SENTEVAL_SST_GRANULAR
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.nn.multitask import make_multitask_model_and_corpus
from flair.trainers import ModelTrainer

# --- Embeddings that are shared by both models ---
shared_embedding = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)

# --- Task 1: Sentiment Analysis (5-class) ---
corpus_1 = SENTEVAL_SST_GRANULAR()

model_1 = TextClassifier(shared_embedding,
                         label_dictionary=corpus_1.make_label_dictionary("class"),
                         label_type="class")

# --- Task 2: Binary Sentiment Analysis on Customer Reviews ---
corpus_2 = SENTEVAL_CR()

model_2 = TextClassifier(shared_embedding,
                         label_dictionary=corpus_2.make_label_dictionary("sentiment"),
                         label_type="sentiment",
                         )

# --- Define mapping (which tagger should train on which corpus) ---
multitask_model, multicorpus = make_multitask_model_and_corpus(
    [
        (model_1, corpus_1),
        (model_2, corpus_2),
    ]
)

# --- Create model trainer and train ---
trainer = ModelTrainer(multitask_model, multicorpus)
trainer.fine_tune("resources/taggers/multitask_test")

The mapping part here defines which tagger should be trained on which corpus. By calling `make_multitask_model_and_corpus` with a mapping, you get a corpus and model object that you can train as before.

Explicit context boundaries in Transformer embeddings 3073 3078

We improve our FLERT model by now explicitly marking up context boundaries using a new `[FLERT]` special token in our transformer embeddings. Our experiments show that the context marker leads to improved NER results:

| Transformer | Context-Marker | CoNLL-03 Test F1 |
|----------|:-------------:|------:|
| bert-base-uncased | _none_ | 91.52 +- 0.16 |
| | `[SEP]` | 91.38 +- 0.18 |
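
As a rough illustration of what "explicitly marking up context boundaries" can mean, the sketch below surrounds a sentence with context tokens and puts a marker between context and sentence. Only the `[FLERT]` marker string comes from the text above; the function name, the placement of one marker on each side, and the example tokens are assumptions for illustration.

```python
from typing import List

# Marker string taken from the release note; everything else is illustrative.
FLERT_MARKER = "[FLERT]"


def add_context(
    sentence: List[str],
    left_context: List[str],
    right_context: List[str],
    marker: str = FLERT_MARKER,
) -> List[str]:
    """Surround a sentence with context tokens, separated from it by a marker."""
    return left_context + [marker] + sentence + [marker] + right_context


tokens = add_context(
    sentence=["Dirk", "lives", "in", "Essen", "."],
    left_context=["He", "moved", "last", "year", "."],
    right_context=["The", "city", "is", "in", "Germany", "."],
)
print(" ".join(tokens))
```

With the marker in place, a model can tell which tokens belong to the sentence being tagged and which are only surrounding context.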

0.11

Not secure

New Features

Regular Expression Tagger (2533)

You can now do sequence labeling in Flair with regular expressions! Simply define a `RegexpTagger` and add some regular expressions, like in the example below:

python
# sentence with a number and two quotes
sentence = Sentence('Figure 11 is both "too colorful" and "not informative enough".')

# instantiate regex tagger with a quote matching pattern
tagger = RegexpTagger(mapping=(r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'))

# also add a number mapping
tagger.register_labels(mapping=(r'\b\d+\b', 'NUMBER'))

# tag sentence
tagger.predict(sentence)

# check out matches
for entity in sentence.get_labels():
    print(entity)
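
To see what the two patterns above match on their own, the following sketch applies them to the raw example string with Python's standard `re` module. It does not use Flair and does not show how `RegexpTagger` maps matches onto tokens; the helper `find_matches` is a hypothetical name.

```python
import re
from typing import List, Tuple

# The two patterns from the example above, applied with the standard library only.
PATTERNS = [
    (r'(["\'])(?:(?=(\\?))\2.)*?\1', "QUOTE"),
    (r"\b\d+\b", "NUMBER"),
]


def find_matches(text: str) -> List[Tuple[str, str]]:
    """Return (matched text, label) pairs for every pattern, in pattern order."""
    found = []
    for pattern, label in PATTERNS:
        for match in re.finditer(pattern, text):
            found.append((match.group(0), label))
    return found


text = 'Figure 11 is both "too colorful" and "not informative enough".'
matches = find_matches(text)
for span_text, label in matches:
    print(f"{span_text} -> {label}")
```

The quote pattern matches each quoted phrase including its quotation marks, and the number pattern matches the standalone `11`.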

Clustering with Flair (2573 2619)

Flair now supports clustering by way of sklearn. Embed your sentences with a pre-trained embedding as shown below, then cluster them with any algorithm. The example below uses sentence transformers and k-means clustering. A 'trained' clustering model can be saved and loaded for prediction, just like any other Flair classifier:

python
from sklearn.cluster import KMeans

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel

embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)

clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)

# fit the model on a corpus
clustering_model.fit(corpus)

# save the model
clustering_model.save(model_file="clustering_model.pt")

# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")

# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')

# predict for sentence
model.predict(sentence)

# print sentence with prediction
print(sentence)

Dataset Manipulations

You can now change label names, ignore labels and add custom preprocessing when loading a dataset.

For instance, the standard WNUT_17 dataset comes with 7 NER labels:

python
corpus = WNUT_17(in_memory=False)
print(corpus.make_label_dictionary('ner'))

which prints:

console
Dictionary with 7 tags: <unk>, person, location, group, corporation, product, creative-work

With the following code, you rename some labels ('person' is renamed to 'PER' and 'location' to 'LOC'), merge 2 labels into 1 ('group' and 'corporation' are merged into 'ORG'), and ignore 2 other labels ('creative-work' and 'product' are ignored):

python
corpus = WNUT_17(in_memory=False, label_name_map={
    'person': 'PER',
    'location': 'LOC',
    'group': 'ORG',
    'corporation': 'ORG',
    'product': 'O',
    'creative-work': 'O',  # by renaming to 'O' this tag gets ignored
})
print(corpus.make_label_dictionary('ner'))

which prints:

console
Dictionary with 4 tags: <unk>, PER, LOC, ORG

You can manipulate the data even more with custom preprocessing functions. See the example in 2708.
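
The effect of such a label map can be sketched in plain Python, independent of Flair. The map below is copied from the example above; the helper names `remap_label` and `remap_all` are hypothetical, and the rule that mapping to 'O' drops a label follows the comment in the example.

```python
from typing import Dict, List, Optional

LABEL_NAME_MAP: Dict[str, str] = {
    "person": "PER",
    "location": "LOC",
    "group": "ORG",
    "corporation": "ORG",
    "product": "O",
    "creative-work": "O",
}


def remap_label(label: str, name_map: Dict[str, str]) -> Optional[str]:
    """Rename a label; return None if it maps to 'O' (i.e. it is ignored)."""
    new_label = name_map.get(label, label)
    return None if new_label == "O" else new_label


def remap_all(labels: List[str], name_map: Dict[str, str]) -> List[str]:
    """Apply the map, drop ignored labels, keep first-seen order without duplicates."""
    seen: List[str] = []
    for label in labels:
        mapped = remap_label(label, name_map)
        if mapped is not None and mapped not in seen:
            seen.append(mapped)
    return seen


original = ["person", "location", "group", "corporation", "product", "creative-work"]
remapped = remap_all(original, LABEL_NAME_MAP)
print(remapped)
```

Six original labels collapse to three, which together with `<unk>` gives the four entries in the dictionary printed above.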

Other New Features and Data Sets

- A new `WordTagger` class for simple word-level predictions (2607)
- Classic `WordEmbeddings` can now be fine-tuned in Flair (2491) by setting `fine_tune=True`. Also adds the fine-tuning mode of https://arxiv.org/abs/2110.02861, which seems to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
- Add `NER_MULTI_CONER` Dataset (2507)
- Add support for HIPE 2022 (2675)
- Allow trainer to work with multiple learning rates (2641)
- Update hyperparameter tuning (2633)

Preview Features

Some preview features in beta stage, use at your own risk.

Prototypical networks in Flair (2627)

Prototype networks learn prototypes for each target class. For each data point to be classified, the network predicts a vector in class-prototype-space, which is then compared to all class prototypes. The prediction is then the closest class prototype. See paper [Prototypical Networks for Few-shot Learning](https://arxiv.org/abs/1703.05175) for more info.
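
The "closest class prototype" step can be sketched in a few lines of plain Python. This is only the prediction rule with a Euclidean distance, not the `PrototypicalDecoder` itself; the prototype vectors and class names below are made up for illustration.

```python
import math
from typing import Dict, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_prototype(point: Sequence[float], prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the name of the class whose prototype is nearest to the point."""
    return min(prototypes, key=lambda name: euclidean(point, prototypes[name]))


# made-up 2-dimensional prototypes for three classes
prototypes = {
    "return.01": (0.0, 0.0),
    "return.02": (4.0, 0.0),
    "carry.01": (0.0, 4.0),
}

# a made-up vector predicted for one data point
prediction = closest_prototype((3.0, 1.0), prototypes)
print(prediction)
```

In a trained network both the prototypes and the mapping into prototype space are learned; here they are fixed by hand.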

plonerma implemented a custom decoder that can be added to any Flair model that inherits from `DefaultClassifier` (i.e. nearly all Flair models). For instance, use this script:

python
from flair.data import Corpus
from flair.datasets import UP_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import WordTagger
from flair.nn import PrototypicalDecoder
from flair.trainers import ModelTrainer

# what tag do we want to predict?
tag_type = 'frame'

# get a corpus
corpus: Corpus = UP_ENGLISH().downsample(0.1)

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

# initialize simple embeddings
embeddings = TransformerWordEmbeddings(model="distilbert-base-uncased",
                                       fine_tune=True,
                                       layers='-1')

# initialize prototype decoder
decoder = PrototypicalDecoder(num_prototypes=len(tag_dictionary),
                              embeddings_size=embeddings.embedding_length,
                              distance_function='euclidean',
                              normal_distributed_initial_prototypes=True,
                              )

# initialize the WordTagger, but pass the prototype decoder
tagger = WordTagger(embeddings,
                    tag_dictionary,
                    tag_type,
                    decoder=decoder)

# initialize trainer
trainer = ModelTrainer(tagger, corpus)

# run training
trainer.fine_tune('resources/taggers/prototypical_decoder')

Other Beta features

- Dependency Parsing in Flair (2486 2579)
- Lemmatization in Flair (2531)
- Initial implementation of JsonCorpora and Datasets (2653)

Major Refactorings

With Flair expanding to many new NLP tasks (relation extraction, entity linking, etc.) and model types, we made a number of refactorings to reduce redundancy and make it easier to extend Flair.

Major refactoring of Label Logic in Flair (2607 2609 2645)

The labeling logic was growing too complex to accommodate new tasks. With this release, we refactored this logic such that complex label classes like `SpanLabel`, `RelationLabel` etc. are removed in favor of a single `Label` class for all types of label. The `Sentence` object will now be automatically aware of all labels added to it.

To illustrate the difference, consider a before-and-after of how to add an entity label to a sentence.

Before:

python
example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

create span for "Humboldt Universität zu Berlin"
span = Span(sentence[0:4])

make a Span-label
span_label = SpanLabel(span=span, value='University')

add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)

Now:

python
example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

directly add a label to the span "Humboldt Universität zu Berlin"
sentence[0:4].add_label("ner", "Organization")

So you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.

Refactoring of printouts (2704)

We changed and unified printouts across all Flair data points and labels, and updated the documentation to reflect this. Printouts should hopefully now be more concise. Let us know what you think.

Unified classes to reduce redundancy

Next to too many Label classes (see above), we also had too many corpora that essentially do the same thing, two partially overlapping transformer embedding classes and too much redundancy in our tokenization classes. This release makes many refactorings to make the code more maintainable:

- *Unify Corpora* (2607): Unifies several corpora into a single object. Before, we had `ColumnCorpus`, `UniversalDependenciesCorpus`, `CoNNLuCorpus`, and `EntityLinkingCorpus`, which resulted in too much redundancy. Now, there is only the `ColumnCorpus` for all such datasets
- *Unify Transformer Embeddings* (2558, 2584, 2586): There was too much redundancy and inconsistency between the two Transformer-based embeddings classes `TransformerWordEmbedding` and `TransformerDocumentEmbedding`. Thanks to helpmefindaname, they now both inherit from the same base object and now share all features.
- *Unify Tokenizers* (2607) : The `Tokenizer` classes no longer return lists of `Token`, rather lists of strings that the `Sentence` object converts to tokens, centralizing the offset and whitespace_after detection in one place.
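
The tokenizer change can be pictured as follows: the tokenizer only returns strings, and a single place recovers offsets and trailing whitespace by aligning those strings against the original text. The sketch below is a simplified illustration in plain Python, not the logic of the `Sentence` class; `TokenInfo` and `align_tokens` are hypothetical names.

```python
from typing import List, NamedTuple


class TokenInfo(NamedTuple):
    text: str
    start_position: int
    whitespace_after: bool


def align_tokens(text: str, tokens: List[str]) -> List[TokenInfo]:
    """Find each token string in the text, left to right, and record its offset."""
    aligned = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if the token is missing
        end = start + len(token)
        followed_by_space = end < len(text) and text[end].isspace()
        aligned.append(TokenInfo(token, start, followed_by_space))
        cursor = end
    return aligned


text = "Dirk lives in Essen."
aligned = align_tokens(text, ["Dirk", "lives", "in", "Essen", "."])
for info in aligned:
    print(info)
```

Keeping this alignment in one place means every tokenizer gets consistent offsets without reimplementing them.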

Simplifications to DefaultClassifier

The `DefaultClassifier` is the base class for nearly all models in Flair. With this release, we make a number of simplifications to reduce redundancy across classes and make it more modular.
- `forward_pass` simplified to return 3 instead of 4 arguments
- `forward_pass` returns embeddings instead of logits allowing us to easily switch out the decoder (see Beta feature on Prototype Networks above)
- removed the unintuitive `spawn` logic we no longer need due to Label refactoring
- unify dropouts across all classes (2669)

Sequence tagger refactoring (2361 2550 2561 2564 2585 2565)

Major refactoring of `SequenceTagger` for better modularity and code readability.

Refactoring of Span Logic (2607 2609 2645)

Spans are no longer stored as word-level 'bioes' tags, but rather directly stored as span-level annotations. The `SequenceTagger` will still internally use BIO/BIOES tags, but the corpora and sentences no longer explicitly store this information.

So you now choose the labeling format when instantiating the `SequenceTagger`, i.e.:
python
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    tag_format="BIOES",  # choose if you want to use BIOES or BIO internally
)

Internally, this refactoring makes a number of changes and simplifications:
- a number of fields have been added or moved up to the `DataPoint` class, for convenience, including properties to get `start_position` and `end_position` of datapoints, their `text`, their `tag` and `score` (if they have only one tag) and an `unlabeled_identifier`
- moves up `set_embedding()` and `to()` from the data point classes (`Sentence`, `Token`, etc.) to their parent `DataPoint`
- a number of methods like `get_tag` and `add_tag` have been removed from Token in favor of the `get_label` and `add_label` method of the parent DataPoint class
- The `ColumnCorpus` will automatically identify which columns are span labels and treat them accordingly
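
To make the difference between span-level annotations and word-level tags concrete, the sketch below converts a list of labeled spans into BIO or BIOES tags. It is a plain-Python illustration of the two tagging schemes, not the conversion code used inside the `SequenceTagger`; the `(start, end, label)` span format and the function name are assumptions.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def spans_to_tags(num_tokens: int, spans: List[Span], tag_format: str = "BIOES") -> List[str]:
    """Turn span-level annotations into one tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if tag_format == "BIOES" and end - start == 1:
            tags[start] = f"S-{label}"  # single-token span
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
        if tag_format == "BIOES":
            tags[end - 1] = f"E-{label}"  # last token of a multi-token span
    return tags


tokens = "Humboldt Universität zu Berlin is located in Berlin .".split()
spans = [(0, 4, "ORG"), (7, 8, "LOC")]

bioes = spans_to_tags(len(tokens), spans, tag_format="BIOES")
bio = spans_to_tags(len(tokens), spans, tag_format="BIO")
print(list(zip(tokens, bioes)))
print(list(zip(tokens, bio)))
```

The spans carry the full information; the tag format is only a choice of encoding, which is why it can be set on the tagger rather than stored in the corpus.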

Code Quality Checks (2611)

They are back and stricter than ever! Thanks to helpmefindaname, we now include mypy and formatting tests as part of our build process, which led to many changes in the code and a much greater chance of catching errors early.

Speed and Memory Improvements:
- `EntityLinker` class refactored for speed (2607)
- Performance improvements in standard `evaluate()` method, especially for large datasets (2607)
- `ColumnCorpus` no longer does disk reads when `in_memory=False`, it simply stores the raw data in memory leading to significant speed-ups on large datasets (2607)
- Memory management improvements for embeddings (2645)
- Efficiency improvements for WordEmbeddings (2491) and OneHotEmbeddings (2490)

Bug Fixes and Improvements
- Add equality method to `Dictionary` (2532)
- Fix encoding error in lemmatizer (2539)
- Fixed printing and logging inconsistencies. (2665)
- Readme (2525 2618 2617 2662)
- Fix bug in `WSD_UFSAC` corpus (2521)
- change position of model saving in between epochs (2548)
- Fix loss weights in TextPairClassifier and RelationExtractor models (2576)
- Fix token positions on column corpus (2440)
- long sequence transformers of any kind (2599)
- The deprecated data_fetcher is finally removed (2607)
- Small lm training improvements (2590)
- Remove minor bug in NEL_ENGLISH_AIDA corpus (2615)
- Fix module import bug (2616)
- Fix reloading fast tokenizers (2622)
- Fix two small bugs (2634)
- Fix .pre-commit-config.yaml (2651)
- patch the missing document_delmiter for lm.__get_state__() (2658)
- `DocumentPoolEmbeddings` class can now be instantiated only with a single embedding (2645)
- You can now specify a `min_count` when computing the label dictionary. Labels below that count will be UNK'ed. (e.g. `tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)`) (2607)
- The `Dictionary` will now compute count statistics for labels in a corpus (2607)
- The `ColumnCorpus` can now handle relation annotation, dependency tree information and UD feats and misc (2607)
- Embeddings are stored as a torch Embedding instead of a gensim keyedvector. That way it will never come to version issues, if gensim doesn't ensure backwards compatibility
- Make transformer offset calculation more robust (2714)
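
The `min_count` option mentioned in the list above can be illustrated with a small counting sketch. This is plain Python rather than Flair's `Dictionary`; the helper names and the label counts are made up, and labels are kept when their count is at least `min_count`.

```python
from collections import Counter
from typing import Iterable, List

UNK = "<unk>"


def make_label_vocab(labels: Iterable[str], min_count: int = 1) -> List[str]:
    """Keep labels seen at least min_count times; everything else falls under UNK."""
    counts = Counter(labels)
    kept = sorted(label for label, n in counts.items() if n >= min_count)
    return [UNK] + kept


def apply_vocab(label: str, vocab: List[str]) -> str:
    """Map a label to itself if it is in the vocabulary, otherwise to UNK."""
    return label if label in vocab else UNK


# made-up label occurrences: PER x12, LOC x10, MISC x3
observed = ["PER"] * 12 + ["LOC"] * 10 + ["MISC"] * 3
vocab = make_label_vocab(observed, min_count=10)
print(vocab)
print(apply_vocab("MISC", vocab))
```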

0.10

Not secure

This release adds several new features such as in-built "model cards" for all Flair models, the first pre-trained models for Relation Extraction, better support for fine-tuning and a refactoring of the model training methods for more flexibility. It also fixes a number of critical bugs that were introduced by the refactorings in Flair 0.9.

Model Trainer Enhancements

_Breaking change_: We changed the `ModelTrainer` such that you now no longer pass the optimizer during initialization. Rather, it is now passed as a parameter of the `train` or `fine_tune` method.

**Old syntax**:

python
# 1. initialize trainer with AdamW optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.AdamW)

# 2. run training with small learning rate and mini-batch size
trainer.train('resources/taggers/question-classification-with-transformer',
              learning_rate=5.0e-5,
              mini_batch_size=4,
              )

**New syntax** (optimizer is parameter of train method):

python
# 1. initialize trainer
trainer = ModelTrainer(classifier, corpus)

# 2. run training with AdamW, small learning rate and mini-batch size
trainer.train('resources/taggers/question-classification-with-transformer',
              learning_rate=5.0e-5,
              mini_batch_size=4,
              optimizer=torch.optim.AdamW,
              )

Convenience function for fine-tuning (2439)

Adds a `fine_tune` routine that sets default parameters used for fine-tuning (AdamW optimizer, small learning rate, few epochs, cyclic learning rate scheduling, etc.). Uses the new linear scheduler with warmup (2415).
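
For readers unfamiliar with a linear schedule with warmup, here is a minimal sketch of the general shape: the learning rate rises linearly from zero during warmup, then decays linearly back to zero. The function name, the warmup length and the step counts are assumptions for illustration, not the defaults used by `fine_tune`.

```python
def linear_warmup_factor(step: int, total_steps: int, warmup_steps: int) -> float:
    """Multiplier for the base learning rate at a given training step."""
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps  # linear increase during warmup
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return 1.0
    return max(0.0, (total_steps - step) / remaining)  # linear decay afterwards


base_learning_rate = 5.0e-5
for step in (0, 5, 10, 55, 100):
    factor = linear_warmup_factor(step, total_steps=100, warmup_steps=10)
    print(step, base_learning_rate * factor)
```

The peak learning rate is reached exactly at the end of warmup, which is why a small base learning rate such as `5.0e-5` is typically paired with this kind of schedule.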

**New syntax** with `fine_tune` method:

python
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. what label do we want to predict?
label_type = 'question_class'

# 3. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type=label_type)

# 4. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)

# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)

# 7. run training with fine-tuning
trainer.fine_tune('resources/taggers/question-classification-with-transformer',
                  learning_rate=5.0e-5,
                  mini_batch_size=4,
                  )

Model Cards (2457)

When you train any Flair model, a "model card" will now automatically be saved that stores all training parameters and versions used to train this model. Later when you load a Flair model, you can print the model card and understand how the model was trained.

The following example trains a small POS-tagger and prints the model card in the end:

python
# initialize corpus and make label dictionary for POS tags
corpus = UD_ENGLISH().downsample(0.01)
tag_type = "pos"
tag_dictionary = corpus.make_label_dictionary(tag_type)

# simple sequence tagger
tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings("glove"),
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

# initialize model trainer and experiment path
trainer = ModelTrainer(tagger, corpus)
path = 'resources/taggers/model-card'

# train for a few epochs
trainer.train(path,
              max_epochs=20,
              )

# load best model and print "model card"
trained_model = SequenceTagger.load(path + '/best-model.pt')
trained_model.print_model_card()

This should print a model card like:
~~~
------------------------------------
--------- Flair Model Card ---------
------------------------------------
- this Flair model was trained with:
-- Flair version 0.9
-- PyTorch version 1.7.1
-- Transformers version 4.8.1
------------------------------------
------- Training Parameters: -------
------------------------------------
-- base_path = resources/taggers/model-card
-- learning_rate = 0.1
-- mini_batch_size = 32
-- mini_batch_chunk_size = None
-- max_epochs = 20
-- train_with_dev = False
-- train_with_test = False
[... shortened ...]
------------------------------------
~~~

Resume training any model (2457)

Previously, we distinguished between checkpoints and model files. Now all models can function as checkpoints, meaning you can load them and continue training them. Say you want to load the model above (trained to epoch 20) and continue training it to epoch 25. Do it like this:

python
# resume training best model, but this time until epoch 25
trainer.resume(trained_model,
               base_path=path + '-resume',
               max_epochs=25,
               )

Pass optimizer and scheduler instance

You can also now pass an initialized optimizer and scheduler to the train and fine_tune methods.

Multi-Label Predictions and Confidence Threshold in TARS models (2430)

Adding the possibility to set confidence thresholds on multi-label prediction in TARS, and setting whether a problem is single-label or multi-label:

python
from flair.models import TARSClassifier
from flair.data import Sentence

# 1. Load our pre-trained TARS model for English
tars: TARSClassifier = TARSClassifier.load('tars-base')

# switch to a multi-label task (emotion detection)
tars.switch_to_task('GO_EMOTIONS')

# sentence with two emotions
sentence = Sentence("I am happy and sad")

# predict normally
tars.predict(sentence)
print(sentence)

# predict with lower label threshold (you can set this to 0. to get all labels)
tars.predict(sentence, label_threshold=0.01)
print(sentence)

# predict and enforce a single-label prediction
tars.predict(sentence, label_threshold=0.01, multi_label=False)
print(sentence)
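
The interplay of `label_threshold` and `multi_label` can be pictured with a small selection function over per-label scores. This is a simplified sketch in plain Python, not the TARS prediction code: the scores are made up, and the rule that the single-label case returns the best-scoring label above the threshold is an assumption.

```python
from typing import Dict, List


def select_labels(
    scores: Dict[str, float],
    label_threshold: float = 0.5,
    multi_label: bool = True,
) -> List[str]:
    """Pick labels from scores: all above the threshold, or only the best one."""
    above = [label for label, score in scores.items() if score > label_threshold]
    ranked = sorted(above, key=lambda label: scores[label], reverse=True)
    if multi_label:
        return ranked
    return ranked[:1]  # single-label: keep at most the top-scoring label


# made-up confidence scores for one sentence
scores = {"joy": 0.62, "sadness": 0.31, "anger": 0.004}

print(select_labels(scores))                                          # default threshold
print(select_labels(scores, label_threshold=0.01))                    # lower threshold
print(select_labels(scores, label_threshold=0.01, multi_label=False)) # single label
```

Lowering the threshold admits more labels in the multi-label case, while the single-label setting caps the output at one label regardless of the threshold.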

Relation Extraction ( 2471 2492)

We refactored the RelationExtractor for more options, hopefully better code clarity and small speed improvements.

We also added two new relation extraction models, trained over a modified version of TACRED: `relations` and `relations-fast`. To use these models, you also need an entity tagger. The tagger identifies entities, then the relation extractor predicts possible relations between them.

For instance use this code:

python
from flair.data import Sentence
from flair.models import RelationExtractor, SequenceTagger

# 1. make example sentence
sentence = Sentence("George was born in Washington")

# 2. load entity tagger and predict entities
tagger = SequenceTagger.load('ner-fast')
tagger.predict(sentence)

# check which entities have been found in the sentence
entities = sentence.get_labels('ner')
for entity in entities:
    print(entity)

# 3. load relation extractor
extractor: RelationExtractor = RelationExtractor.load('relations-fast')

# predict relations
extractor.predict(sentence)

# check which relations have been found
relations = sentence.get_labels('relation')
for relation in relations:
    print(relation)

Embeddings

- Refactoring of WordEmbeddings to avoid gensim version issues and enable further fine-tuning of pre-trained embeddings (2491)
- Refactoring of OneHotEmbeddings to fix errors caused by some corpora and enable "stable embeddings" (2490 )

Other Enhancements and Bug Fixes

- Compatibility with gensim 4 and Python 3.9 (2496)
- Fix TransformerWordEmbeddings if model_max_length not set in Tokenizer (2502)
- Fix TransformerWordEmbeddings handling of lang ids (2417)
- Fix attention mask for special Transformer architectures (2485)
- Fix regression model (2424)
- Fix problems caused by refactoring of Dictionary (2429 2435 2453)
- Fix infinite loop in Span::to_original_text (2462)
- Fix result object in ModelTrainer (2519)
- Fix bug in wsd_ufsac corpus (2521)
- Fix bugs in TARS and simple sequence tagger (2468)
- Add Amharic FLAIR EMBEDDING model (2494)
- Add MultiCoNer Dataset (2507)
- Add Korean Flair Tutorials (2516 2517)
- Remove hyperparameter features (2518)
- Make it optional to create logfiles and loss files (2421)
- Small simplification of TransformerWordEmbeddings (2425)

0.9

Not secure

With release 0.9 we are refactoring Flair for simplicity and speed, to make Flair faster and more easily scale to new NLP tasks. The first new tasks included in this release are **Relation Extraction** (RE), support for **GLUE benchmark** tasks and **Entity Linking** - all in *beta for early adopters*! We're working towards a Flair 1.0 release that will span the whole suite of standard NLP tasks. Also included is a new approach for **Zero-Shot Sequence Labeling** based on TARS! This release also includes a wealth of new datasets for all these tasks and tons of other new features and bug fixes.

Zero-Shot Sequence Labeling with TARS (2260)

We extend the TARS zero-shot learning approach to sequence labeling and ship a pre-trained model for English NER. Try defining some classes and see if the model can find them:

python
# imports added for completeness (module paths assumed from the Flair package layout)
from flair.data import Sentence
from flair.models import TARSTagger

# 1. Load zero-shot NER tagger
tars = TARSTagger.load('tars-ner')

# 2. Prepare some test sentences
sentences = [
    Sentence("The Humboldt University of Berlin is situated near the Spree in Berlin, Germany"),
    Sentence("Bayern Munich played against Real Madrid"),
    Sentence("I flew with an Airbus A380 to Peru to pick up my Porsche Cayenne"),
    Sentence("Game of Thrones is my favorite series"),
]

# 3. Define some classes of named entities such as "soccer teams", "TV shows" and "rivers"
labels = ["Soccer Team", "University", "Vehicle", "River", "City", "Country", "Person", "Movie", "TV Show"]
tars.add_and_switch_to_new_task('task 1', labels, label_type='ner')

# 4. Predict for these classes and print results
for sentence in sentences:
    tars.predict(sentence)
    print(sentence.to_tagged_string("ner"))

This should print:

console
The Humboldt <B-University> University <I-University> of <I-University> Berlin <E-University> is situated near the Spree <S-River> in Berlin <S-City> , Germany <S-Country>

Bayern <B-Soccer Team> Munich <E-Soccer Team> played against Real <B-Soccer Team> Madrid <E-Soccer Team>

I flew with an Airbus <B-Vehicle> A380 <E-Vehicle> to Peru <S-City> to pick up my Porsche <B-Vehicle> Cayenne <E-Vehicle>

Game <B-TV Show> of <I-TV Show> Thrones <E-TV Show> is my favorite series

So in these examples, we are finding entity classes such as "TV show" (_Game of Thrones_), "vehicle" (_Airbus A380_ and _Porsche Cayenne_), "soccer team" (_Bayern Munich_ and _Real Madrid_) and "river" (_Spree_), even though the model was never explicitly trained for this. Note that this is ongoing research and the examples are a bit cherry-picked. We expect the zero-shot model to improve quite a bit until the next release.
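
The tagged strings printed above use B/I/E/S prefixes per token. As a reading aid, the sketch below parses such a string back into whole entity spans using only the standard library. The regular expression and the function name are assumptions tailored to the printed format shown here, including labels that contain spaces such as "Soccer Team".

```python
import re
from typing import List, Tuple

# a token, optionally followed by a tag like "<B-Soccer Team>"
TOKEN_TAG = re.compile(r"(\S+)(?: <([BIES])-([^>]+)>)?")


def decode_spans(tagged: str) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from a BIOES-tagged string."""
    spans = []
    current: List[str] = []
    for match in TOKEN_TAG.finditer(tagged):
        token, prefix, label = match.groups()
        if prefix is None:  # untagged token: outside any entity
            current = []
            continue
        if prefix in ("B", "S"):  # a new entity starts here
            current = [token]
        else:  # "I" or "E": continue the running entity
            current.append(token)
        if prefix in ("E", "S"):  # entity is complete
            spans.append((" ".join(current), label))
            current = []
    return spans


tagged = "Bayern <B-Soccer Team> Munich <E-Soccer Team> played against Real <B-Soccer Team> Madrid <E-Soccer Team>"
entities = decode_spans(tagged)
print(entities)
```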

New NLP Tasks and Datasets

We prototypically now support new tasks such as GLUE benchmark, Relation Extraction and Entity Linking. With this, we ship the datasets and model classes you need to train your own models. But we are still tweaking both methods, meaning that we don't ship any pre-trained models as-of-yet.

GLUE Benchmark (2149 2363)

A standard benchmark to evaluate progress in language understanding, mostly consisting of single and pairwise sentence classification tasks.

New datasets in Flair:

- 'GLUE_COLA' - The Corpus of Linguistic Acceptability from GLUE benchmark
- 'GLUE_MNLI' - The Multi-Genre Natural Language Inference Corpus from the GLUE benchmark
- 'GLUE_RTE' - The RTE task from the GLUE benchmark
- 'GLUE_QNLI' - The Stanford Question Answering Dataset formatted as NLI task from the GLUE benchmark
- 'GLUE_WNLI' - The Winograd Schema Challenge formatted as NLI task from the GLUE benchmark
- 'GLUE_MRPC' - The MRPC task from GLUE benchmark
- 'GLUE_QQP' - The Quora Question Pairs dataset where the task is to determine whether a pair of questions are semantically equivalent

Initialize datasets like so:

python
from flair.datasets import GLUE_QNLI

# load corpus
corpus = GLUE_QNLI()

# print corpus
print(corpus)

# print first sentence-pair of training data split
print(corpus.train[0])

# print all labels in corpus
print(corpus.make_label_dictionary("entailment"))

Relation Extraction (2333 2352)

Relation extraction determines whether a relation holds between two entities in a text and, if so, which one.
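To make the task concrete, here is a minimal sketch of the usual first step: enumerating the entity pairs a classifier would have to label. This is an illustration only and makes no claim about how `RelationExtractor` works internally. The `Entity` type, the `candidate_pairs` helper and the example entities are all hypothetical.

```python
from itertools import permutations
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str


def candidate_pairs(entities):
    """Return every ordered (head, tail) pair of distinct entities.

    Order matters because most relations are directional: "lives in"
    can hold for (person, place) but not for (place, person).
    """
    return list(permutations(entities, 2))


# Hypothetical entities for the sentence "Dirk lives in Berlin."
entities = [Entity("Dirk", "PER"), Entity("Berlin", "LOC")]

for head, tail in candidate_pairs(entities):
    # A trained classifier would assign each pair a relation label or "no relation".
    print(f"{head.text} -> {tail.text}")
```

With n entities in a sentence there are n * (n - 1) ordered pairs, which is why many approaches filter candidates by entity type before classifying them.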

Model class: `RelationExtractor`

Datasets in Flair:
- 'RE_ENGLISH_CONLL04' - the [CoNLL-04](https://github.com/bekou/multihead_joint_entity_relation_extraction/tree/master/data/CoNLL04) Relation Extraction dataset (#2333)
- 'RE_ENGLISH_SEMEVAL2010' - the [SemEval-2010 Task 8](https://aclanthology.org/S10-1006.pdf) dataset on Multi-Way Classification of Semantic Relations Between Pairs of Nominals (#2333)
- 'RE_ENGLISH_TACRED' - the [TAC Relation Extraction Dataset](https://nlp.stanford.edu/projects/tacred/) with 41 relations (download required) (#2333)
- 'RE_ENGLISH_DRUGPROT' - the [DrugProt corpus from Biocreative VII Track 1](https://zenodo.org/record/5119892#.YSdSaVuxU5k/) on drug and chemical-protein interactions (2340 2352)

Initialize datasets like so:

python
from flair.datasets import RE_ENGLISH_CONLL04

# initialize CoNLL-04 corpus for relation extraction
corpus = RE_ENGLISH_CONLL04()
print(corpus)

# print first sentence of training split with annotations
sentence = corpus.train[0]
print(sentence)

# print label dictionary
label_dict = corpus.make_label_dictionary("relation")
print(label_dict)

Entity Linking (2375)

Entity Linking goes one step further than NER and uniquely links entities to knowledge bases such as Wikipedia.
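The difference to NER is easiest to see with an ambiguous mention. The toy sketch below is not Flair's `EntityLinker`: it swaps in a simple keyword-overlap heuristic, and the candidate table, cue words and `link` function are all made up for illustration.

```python
# Hypothetical candidate table: surface form -> possible Wikipedia titles,
# each with a few context words that hint at that reading.
CANDIDATES = {
    "Paris": [
        ("Paris", {"france", "city", "seine"}),
        ("Paris_Hilton", {"hilton", "celebrity", "hotel"}),
    ],
}


def link(mention, sentence):
    """Pick the candidate title whose cue words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    options = CANDIDATES.get(mention)
    if not options:
        return None  # mention is not in the toy knowledge base
    title, _ = max(options, key=lambda option: len(option[1] & context))
    return title


print(link("Paris", "Paris is a city on the Seine."))
print(link("Paris", "Paris Hilton opened a new hotel."))
```

An NER tagger would only mark "Paris" as an entity in both sentences, whereas a linker has to decide which knowledge base entry each occurrence refers to.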

Model class: `EntityLinker`

Datasets in Flair:
- 'NEL_ENGLISH_AIDA' - the [AIDA CoNLL-YAGO Entity Linking corpus](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/ambiverse-nlu/aida/downloads) on the CoNLL-03 dataset for English
- 'NEL_ENGLISH_AQUAINT' - the Aquaint Entity Linking corpus introduced in [Milne and Witten (2008)](https://www.cms.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningToLinkWithWikipedia.pdf)
- 'NEL_ENGLISH_IITB' - the IITB Entity Linking corpus introduced in [Sayali et al. (2009)](https://dl.acm.org/doi/10.1145/1557019.1557073)
- 'NEL_ENGLISH_REDDIT' - the Reddit Entity Linking corpus introduced in [Botzer et al. (2021)](https://arxiv.org/abs/2101.01228v2) (only gold annotations)
- 'NEL_ENGLISH_TWEEKI' - the Tweeki Entity Linking corpus introduced in [Harandizadeh and Singh (2020)](https://aclanthology.org/2020.wnut-1.29.pdf)
- 'NEL_GERMAN_HIPE' - the [HIPE](https://impresso.github.io/CLEF-HIPE-2020/) Entity Linking corpus for historical German as a [sentence-segmented version](https://github.com/stefan-it/clef-hipe)

python
from flair.datasets import NEL_ENGLISH_REDDIT

# load corpus
corpus = NEL_ENGLISH_REDDIT()

# print corpus
print(corpus)

# print a sentence of training data split
print(corpus.train[3])

New NER Datasets
- 'NER_ARABIC_ANER' - [Arabic Named Entity Recognition Corpus](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp) 4-class NER (#2188)
- 'NER_ARABIC_AQMAR' - [American and Qatari Modeling of Arabic](http://www.cs.cmu.edu/~ark/AQMAR/) 4-class NER (modified) (#2188)
- 'NER_ENGLISH_PERSON' - NER for [person names](https://github.com/das-sudeshna/genid) (#2271)
- 'NER_ENGLISH_WEBPAGES' - 4-class NER on web pages from [Ratinov and Roth (2009)](https://aclanthology.org/W09-1119/) (#2232)
- 'NER_GERMAN_POLITICS' - [NEMGP](https://www.thomas-zastrow.de/nlp/) corpus for German politics (#2341)
- 'NER_JAPANESE' - [Japanese NER](https://github.com/Hironsan/IOB2Corpus) dataset automatically generated from Wikipedia (#2154)
- 'NER_MASAKHANE' - [MasakhaNER: Named Entity Recognition for African Languages](https://github.com/masakhane-io/masakhane-ner) corpora (#2212, 2214, 2227, 2229, 2230, 2231, 2222, 2234, 2242, 2243)

Other datasets

- 'YAHOO_ANSWERS' - The [10 largest main categories](https://course.fast.ai/datasets#nlp) from Yahoo! Answers (2198)
- Various Universal Dependencies datasets (2211, 2216, 2219, 2221, 2244, 2245, 2246, 2247, 2223, 2248, 2235, 2236, 2239, 2226)

New Functionality

Support for Arabic NER (2188)

Flair now supports NER and POS tagging for Arabic. To tag an Arabic sentence, just load the appropriate model:

python

from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('ar-ner')

# make Arabic sentence
sentence = Sentence("احب برلين")

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
for entity in sentence.get_labels('ner'):
    print(entity)

This should print:
console
