# FLERT (#2031 #2032 #2104)
This release adds the "FLERT" approach to train sequence tagging models using cross-sentence features as presented in [our recent paper](https://arxiv.org/abs/2011.06993). This yields new state-of-the-art models which we include in Flair, as well as the features to easily train your own "FLERT" models.
## Pre-trained FLERT models (#2130)
We add five new NER models for English (4-class and 18-class), German, Dutch and Spanish (4-class each). Load one, for instance, with:
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("ner-large")

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```
If you want to test these models in action, for instance the new large English Ontonotes model with 18 classes, you can now use the hosted inference API on the HF model hub, like [here](https://huggingface.co/flair/ner-english-ontonotes-large).
## Contextualized Sentences
In order to enable cross-sentence context, we made some changes to the Sentence object and data readers:
1. `Sentence` objects now have `next_sentence()` and `previous_sentence()` methods that are set automatically if loaded through `ColumnCorpus`. This is a pointer system to navigate through sentences in a corpus:
```python
from flair.datasets import MIT_MOVIE_NER_SIMPLE

# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)

# get a sentence
sentence = corpus.test[123]
print(sentence)

# get the previous sentence
print(sentence.previous_sentence())

# get the sentence after that
print(sentence.next_sentence())

# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())
```
This allows dynamic computation of contexts in the embedding classes.
2. `Sentence` objects now have the `is_document_boundary` field which is set through the `ColumnCorpus`. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object.
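For example, you can use this field to find out how many document boundary markers a split contains. A minimal sketch, assuming you have the CoNLL-03 data set up locally:

```python
from flair.datasets import CONLL_03

# load a column corpus that contains -DOCSTART- boundary markers
corpus = CONLL_03()

# count the sentences in the training split that mark a document boundary
boundaries = 0
for i in range(len(corpus.train)):
    if corpus.train[i].is_document_boundary:
        boundaries += 1
print(f'{boundaries} document boundaries in the training split')
```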
## Refactored TransformerWordEmbeddings (breaking)
`TransformerWordEmbeddings` was refactored for dynamic context, robustness to long sentences, and readability. Some constructor arguments were renamed for clarity: `pooling_operation` is now `subtoken_pooling` (to make clear that we pool subtokens), `use_scalar_mean` is now `layer_mean` (we only do a simple layer mean), and `use_context` can now optionally take an integer to indicate the length of the context. Default arguments have also changed.
For instance, to create embeddings with a document-level context of 64 subtokens, init like this:
```python
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='bert-base-uncased',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=64,
)
```
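To see the context in action, embed a sentence that was loaded through a `ColumnCorpus`, so that its `previous_sentence()`/`next_sentence()` pointers are set. A minimal sketch, reusing the MIT movie corpus from above (corpus choice and sentence index are arbitrary):

```python
from flair.datasets import MIT_MOVIE_NER_SIMPLE

corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)
sentence = corpus.test[123]

# embed the sentence; subtokens from neighboring sentences are used as additional context
embeddings.embed(sentence)
print(sentence[0].embedding.shape)
```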
## Train your Own FLERT Models
You can train a FLERT-model like this:
```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()

use_context = 64
hf_model = 'xlm-roberta-large'

embeddings = TransformerWordEmbeddings(
    model=hf_model,
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=use_context,
)

tag_dictionary = corpus.make_tag_dictionary('ner')

# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# train with XLM parameters (AdamW, 20 epochs, small LR)
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

trainer.train("resources/flert",
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```
We recommend training FLERT this way if accuracy is by far the most important thing you need: FLERT is quite slow since it works at the document level.
# HuggingFace model hub integration (#2040 #2108 #2115)
We now host Flair sequence tagging models on the HF model hub (thanks for all the support, HuggingFace!).
**Overview of all models.** There is a dedicated 'Flair' tag on the hub, so to get a list of all Flair models, check [here](https://huggingface.co/models?filter=flair).
The hub allows all users to upload and share their own models. Even better, the **Inference API** lets you test all models online without downloading and running them. For instance, you can test our new very powerful English 18-class NER model [here](https://huggingface.co/flair/ner-english-ontonotes-large).
To load any sequence tagger from the model hub, use its string identifier when instantiating a model. For instance, to load our English Ontonotes model with the id "flair/ner-english-ontonotes-large", do:
```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

# make example sentence
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```
# Other New Features
## New Task: Recognizing Textual Entailment (#2123)
Thanks to marcelmmm, we now support training textual entailment tasks (in fact, all pairwise sentence classification tasks) in Flair.
For instance, if you want to train on the RTE task of the GLUE benchmark, use this script:
```python
import torch

from flair.data import Corpus
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

# 1. get the entailment corpus
corpus: Corpus = GLUE_RTE()

# 2. make the label dictionary from the corpus
label_dictionary = corpus.make_label_dictionary()

# 3. initialize text pair classifier
tagger = TextPairClassifier(
    document_embeddings=TransformerDocumentEmbeddings(),
    label_dictionary=label_dictionary,
)

# 4. initialize trainer with AdamW optimizer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 5. run training
trainer.train('resources/taggers/glue-rte-english',
              learning_rate=2e-5,
              mini_batch_chunk_size=2,  # this can be removed if you have a big GPU
              train_with_dev=True,
              max_epochs=3)
```
## Add possibility to specify empty label name to CSV corpora (#2068)
Some CSV classification datasets contain a value that means "no class". We now extend the `CSVClassificationDataset` so that it is possible to specify which value should be skipped using the `no_class_label` argument.
For instance:
```python
from flair.datasets import CSVClassificationCorpus

# load corpus
corpus = CSVClassificationCorpus(
    data_folder='resources/tasks/code/',
    train_file='java_io.csv',
    skip_header=True,
    column_name_map={3: 'text', 4: 'label', 5: 'label', 6: 'label', 7: 'label', 8: 'label', 9: 'label'},
    no_class_label='NONE',
)
```
This causes all entries of NONE in one of the label columns to be skipped.
## More options for splits in corpora and training (#2034)
For various reasons, we might want to have a `Corpus` that does not define all three splits (train/dev/test). For instance, we might want to train a model over the entire dataset and not hold out any data for validation/evaluation.
We add several ways of doing so.
1. If a dataset has predefined splits, like most NLP datasets, you can pass the arguments `train_with_test` and `train_with_dev` to the `train()` method of the `ModelTrainer`. This causes the trainer to train over all three splits (and do no evaluation):
```python
trainer.train("path/to/your/folder",
              learning_rate=0.1,
              mini_batch_size=16,
              train_with_dev=True,
              train_with_test=True,
              )
```
2. You can also now create a `Corpus` with fewer splits without having the missing splits automatically sampled. Pass `sample_missing_splits=False` as argument to do this. For instance, to load the SemCor WSD corpus only as training data, do:
```python
from flair.datasets import WSD_UFSAC

semcor = WSD_UFSAC(train_file='semcor.xml', sample_missing_splits=False, autofind_splits=False)
```
## Add TFIDF Embeddings (#2086)
We added some old-school embeddings (thanks yosipk), namely the legendary TF-IDF document embeddings. These are often good baselines, and additionally they keep NLP veterans nostalgic, if not happy.
To initialize these embeddings, you must pass the train split of your training corpus, i.e.
```python
from flair.embeddings import DocumentTFIDFEmbeddings

embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)
```
This triggers the process where the most common words are used to featurize documents.
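Like any other document embedding, these can then be used to embed sentences. A minimal usage sketch, reusing the `embeddings` object from above (the example sentence is just a placeholder):

```python
from flair.data import Sentence

# embed an example document with the TF-IDF embeddings initialized above
sentence = Sentence('TF-IDF is a classic baseline for document classification.')
embeddings.embed(sentence)

# the sentence now carries a TF-IDF feature vector
print(sentence.embedding.shape)
```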
# New Datasets
## Hungarian NER Corpus (#2045)
Added the Hungarian business news corpus annotated with NER information (thanks to alibektas).
```python
from flair.datasets import BUSINESS_HUN

# load Hungarian business NER corpus
corpus = BUSINESS_HUN()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```
## StackOverflow NER Corpus (#2052)
```python
from flair.datasets import STACKOVERFLOW_NER

# load StackOverflow NER corpus
corpus = STACKOVERFLOW_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```
## Added GermEval 18 Offensive Language dataset (#2102)
```python
from flair.datasets import GERMEVAL_2018_OFFENSIVE_LANGUAGE

# load GermEval 2018 Offensive Language corpus
corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()
print(corpus)
print(corpus.make_label_dictionary())
```
## Added RTE corpora of GLUE and SuperGLUE
```python
from flair.datasets import GLUE_RTE

# load the recognizing textual entailment corpus of the GLUE benchmark
corpus = GLUE_RTE()
print(corpus)
print(corpus.make_label_dictionary())
```
# Improvements
## Allow newlines as Tokens in a Sentence (#2070)
Newlines and tabs can now become Tokens in a Sentence:
```python
from flair.data import Sentence

# make sentence with newlines and tabs
sentence: Sentence = Sentence(["I", "\t", "ich", "\n", "you", "\t", "du", "\n"], use_tokenizer=True)

# alternatively: sentence: Sentence = Sentence("I \t ich \n you \t du \n", use_tokenizer=False)

# print sentence and each token
print(sentence)
for token in sentence:
    print(token)
```
## Improve transformer serialization (#2046)
We improved the serialization of the `TransformerWordEmbeddings` class such that you can now train a model with one version of the transformers library and load it with another version. Previously, if you trained a model with transformers 3.5.1 and loaded it with 3.1.0, or trained with 3.5.1 and loaded with 4.1.1, or any other version mismatch, there would either be errors or bad predictions.
**Migration guide:** If you have a model trained with an older version of Flair that uses `TransformerWordEmbeddings` you can save it in the new version-independent format by loading the model with the same transformers version you used to train it, and then saving it again. The newly saved model is then version-independent:
```python
from flair.models import SequenceTagger

# load old model, but use the *same transformers version you used when training this model*
tagger = SequenceTagger.load('path/to/old-model.pt')

# save the model: it is now version-independent and can, for instance, be loaded with transformers 4
tagger.save('path/to/new-model.pt')
```
## Fix regression prediction errors (#2067)
This fixes two problems in the regression model:
- the predict() method was unable to set labels and threw errors (see #2056)
- predicted labels had no label name
Now, you can set a label name either in the predict method or during instantiation of the regression model you want to train. So the full code for training a regression model and using it to predict is:
```python
from flair.data import Sentence
from flair.datasets import WASSA_JOY
from flair.embeddings import DocumentPoolEmbeddings, WordEmbeddings
from flair.models.text_regression_model import TextRegressor
from flair.trainers import ModelTrainer

# load regression dataset
corpus = WASSA_JOY()

# make simple document embeddings
embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')], fine_tune_mode='linear')

# init model and give name to label
model = TextRegressor(embeddings, label_name='happiness')

# target folder
output_folder = 'resources/taggers/regression_test/'

# run training
trainer = ModelTrainer(model, corpus)
trainer.train(
    output_folder,
    mini_batch_size=16,
    max_epochs=10,
)

# load best model
model = TextRegressor.load(output_folder + 'best-model.pt')

# predict for sentence
sentence = Sentence('I am so happy')
model.predict(sentence)

# print sentence and prediction
print(sentence)
```
In my example run, this prints the following sentence + predicted value:
~~~
Sentence: "I am so happy" [− Tokens: 4 − Sentence-Labels: {'happiness': [0.9239126443862915 (1.0)]}]
~~~
## Do not shuffle first epoch during training (#2058)
Normally, we shuffle sentences at each epoch during training in the `ModelTrainer` class. However, in some cases it makes sense to see sentences in their natural order during the first epoch, and shuffle only from the second epoch onward.
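A self-contained illustration of the scheduling idea (this is not Flair's actual trainer code, just a sketch of the behavior):

```python
from torch.utils.data import DataLoader

# stand-in for the training sentences
dataset = list(range(8))
max_epochs = 3

for epoch in range(1, max_epochs + 1):
    # keep the natural order in epoch 1, shuffle from epoch 2 onward
    loader = DataLoader(dataset, batch_size=4, shuffle=(epoch > 1))
    print(f"epoch {epoch}:", [batch.tolist() for batch in loader])
```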
# Bug Fixes and Improvements
- Update to transformers 4 (#2057)
- Fix the evaluate() method in the SimilarityLearner class (#2113)
- Fix memory leak in WordEmbeddings (#2018)
- Add support for Transformer-XL Embeddings (#2009)
- Restrict numpy version to <1.20 for Python 3.6 (#2014)
- Small formatting and variable declaration changes (#2022)
- Fix document boundary offsets for Dutch CoNLL-03 (#2061)
- Changed the torch version in requirements.txt to torch>=1.5.0 (#2063)
- Fix linear input dimension if embeddings are reprojected (#2073)
- Various improvements for TARS (#2090 #2128)
- Added a link to the interpret-flair repo (#2096)
- Improve documentation (#2110)
- Update sentencepiece and gdown versions (#2131)
- Add to_plain_string method to Span class (#2091)