Transformer Word Embeddings
If you want to embed the words in a sentence with transformers, do it like this:
python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
If instead you want to use RoBERTa, do:
python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
Transformer Document Embeddings
To get a single embedding for the whole document with BERT, do:
python
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)
If instead you want to use RoBERTa, do:
python
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)
Text classification by fine-tuning a transformer
Importantly, you can now fine-tune transformers to get state-of-the-art accuracies in text classification tasks.
Use `TransformerDocumentEmbeddings` for this and set `fine_tune=True`. Then, use the following example code:
python
from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('resources/taggers/trec',
              learning_rate=3e-5,  # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4,  # optionally set this if transformer is too much for your machine
              max_epochs=5,  # terminate after 5 epochs
              )
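The `mini_batch_chunk_size` option above trades speed for memory: each mini-batch is processed in smaller chunks, so fewer sentences are on the device at once. The following is a minimal plain-Python sketch of the chunking idea only, not Flair's trainer code. `split_into_chunks` is a hypothetical helper and the sentences are placeholders.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(batch: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Split one mini-batch into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [list(batch[i:i + chunk_size]) for i in range(0, len(batch), chunk_size)]


# A mini-batch of 16 placeholder sentences, processed 4 at a time.
mini_batch = [f"sentence {i}" for i in range(16)]
chunks = split_into_chunks(mini_batch, chunk_size=4)

# In a real training loop, gradients would be accumulated over all chunks
# before a single optimizer step, so the effective batch size stays at 16.
print(len(chunks), [len(chunk) for chunk in chunks])
```

With `mini_batch_size=16` and `mini_batch_chunk_size=4`, the model sees four chunks of four sentences per optimizer step.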
New Taggers, Embeddings and Datasets
Flair 0.5 adds a ton of new taggers, embeddings and datasets.
New Taggers
New sentiment models (1613)
We added new sentiment models for English. The new models are trained over a combined corpus of sentiment datasets, including Amazon product reviews, so they should be applicable to more domains than the old sentiment models, which were trained only on movie reviews.
There are two new models, a transformer-based model you can load like this:
python
from flair.data import Sentence
from flair.models import TextClassifier

# load tagger
classifier = TextClassifier.load('sentiment')

# predict for example sentence
sentence = Sentence("enormously entertaining for moviegoers of any age .")
classifier.predict(sentence)

# check prediction
print(sentence)
And a faster, slightly less accurate model based on RNNs you can load like this:
python
classifier = TextClassifier.load('sentiment-fast')
Fine-grained POS models for English (1625)
Adds fine-grained POS models for English, so you can now choose between the 'pos' models (fine-grained POS tags) and the 'upos' models (universal POS tags). Load like this:
python
from flair.models import SequenceTagger

# Fine-grained POS model
tagger = SequenceTagger.load('pos')

# Fine-grained POS model (fast variant)
tagger = SequenceTagger.load('pos-fast')

# Universal POS model
tagger = SequenceTagger.load('upos')

# Universal POS model (fast variant)
tagger = SequenceTagger.load('upos-fast')
Added Malayalam POS and XPOS tagger model (1522)
Added taggers for historical German speech and thought (1532)
New Embeddings
Added language models for historical German by redewiedergabe (1507)
Load the language models with:
python
embeddings_forward = FlairEmbeddings('de-historic-rw-forward')
embeddings_backward = FlairEmbeddings('de-historic-rw-backward')
Added Malayalam flair embeddings models (1458)
python
embeddings_forward = FlairEmbeddings('ml-forward')
embeddings_backward = FlairEmbeddings('ml-backward')
Added Flair Embeddings from CLEF HIPE Shared Task (1554)
Adds the recently trained Flair embeddings on historic newspapers for German/English/French provided by the [CLEF HIPE shared task](https://impresso.github.io/CLEF-HIPE-2020/).
New Datasets
Added NER dataset for Finnish (1620)
You can now load a Finnish NER corpus with
python
ner_finnish = flair.datasets.NER_FINNISH()
Added DaNE dataset (1425)
You can now load a Danish NER corpus with
python
dane = flair.datasets.DANE()
Added SentEval classification datasets (1454)
Adds 6 SentEval classification datasets to Flair:
python
senteval_corpus_1 = flair.datasets.SENTEVAL_CR()
senteval_corpus_2 = flair.datasets.SENTEVAL_MR()
senteval_corpus_3 = flair.datasets.SENTEVAL_SUBJ()
senteval_corpus_4 = flair.datasets.SENTEVAL_MPQA()
senteval_corpus_5 = flair.datasets.SENTEVAL_SST_BINARY()
senteval_corpus_6 = flair.datasets.SENTEVAL_SST_GRANULAR()
Added Sentiment Datasets (1545)
Adds two new sentiment datasets to Flair, namely AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment.
python
amazon_reviews = flair.datasets.AMAZON_REVIEWS()
sentiment_140 = flair.datasets.SENTIMENT_140()
Added BIOfid dataset (1589)
python
biofid = flair.datasets.BIOFID()
Refactorings
Any DataPoint can now be labeled (1450)
Refactored the `DataPoint` class and classes that inherit from it (`Token`, `Sentence`, `Image`, `Span`, etc.) so that all have the same methods for adding and accessing labels.
- `DataPoint` base class now defines labeling methods (closes 1449)
- Labels can no longer be passed to `Sentence` constructor, so instead of:
python
sentence_1 = Sentence("this is great", labels=[Label("POSITIVE")])
you should now do:
python
sentence_1 = Sentence("this is great")
sentence_1.add_label('sentiment', 'POSITIVE')
or:
python
sentence_1 = Sentence("this is great").add_label('sentiment', 'POSITIVE')
Note that Sentence labels now have a `label_type` (in the example that's 'sentiment').
- The `Corpus` method `_get_class_to_count` is renamed to `_count_sentence_labels`
- The `Corpus` method `_get_tag_to_count` is renamed to `_count_token_labels`
- `Span` is now a `DataPoint` (so it has an `embedding` and `labels`)
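To make the idea of typed labels concrete, here is a small toy model of a data point that stores its labels per `label_type`. It is a sketch for illustration, not Flair's implementation: `ToyDataPoint`, `ToySentence`, the `get_labels` accessor and the 'topic' label are all hypothetical, and only the `add_label(label_type, value)` call shape is taken from the examples above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Label(NamedTuple):
    value: str
    score: float = 1.0


class ToyDataPoint:
    """Hypothetical stand-in for a data point that keeps labels per label type."""

    def __init__(self) -> None:
        self.annotation_layers: Dict[str, List[Label]] = defaultdict(list)

    def add_label(self, label_type: str, value: str, score: float = 1.0) -> "ToyDataPoint":
        self.annotation_layers[label_type].append(Label(value, score))
        return self  # returning self allows chained calls

    def get_labels(self, label_type: str) -> List[Label]:
        # .get() avoids creating an empty layer for unknown label types
        return list(self.annotation_layers.get(label_type, []))


class ToySentence(ToyDataPoint):
    def __init__(self, text: str) -> None:
        super().__init__()
        self.text = text


# The same sentence can carry labels of several types side by side.
sentence = ToySentence("this is great").add_label("sentiment", "POSITIVE")
sentence.add_label("topic", "REVIEW")
print([label.value for label in sentence.get_labels("sentiment")])
```

Because every subclass inherits the same two methods, a token, a span or a sentence can be labeled and queried in exactly the same way.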
Embeddings module was split into smaller submodules (1588)
Split the previously huge `embeddings.py` into several submodules organized in an `embeddings/` folder. The submodules are:
- `token.py` for all `TokenEmbeddings` classes
- `document.py` for all `DocumentEmbeddings` classes
- `image.py` for all `ImageEmbeddings` classes
- `legacy.py` for embeddings that are now deprecated
- `base.py` for remaining basic classes
All embeddings are still exposed through the embeddings package, so the command to load them doesn't change, e.g.:
python
from flair.embeddings import FlairEmbeddings
embeddings = FlairEmbeddings('news-forward')
so specifying the submodule is not needed.
Datasets module was split into smaller submodules (1510)
Split the previously huge `datasets.py` into several submodules organized in a `datasets/` folder. The submodules are:
- `sequence_labeling.py` for all sequence labeling datasets
- `document_classification.py` for all document classification datasets
- `treebanks.py` for all dependency parsed corpora (UD treebanks)
- `text_text.py` for all bi-text datasets (currently only parallel corpora)
- `text_image.py` for all paired text-image datasets (currently only Feidegger)
- `base.py` for remaining basic classes
All datasets are still exposed through the datasets package, so it is still possible to load corpora with
python
from flair.datasets import TREC_6
without specifying the submodule.
Other refactorings
- Refactor datasets for code legibility (1394)
Small refactorings on `flair.datasets` for easier code legibility and fewer redundancies, removing about 100 lines of code: (1) Moved the default sampling logic from all corpora classes to the parent `Corpus` class. You can now instantiate a `Corpus` only with a train file which will trigger the sampling. (2) Moved the default logic for identifying train, dev and test files into a dedicated method to avoid duplicates in code.
- Extend string output of Sentence (1452)
Other
New Features
Add option to specify document delimiter for language model training (1541)
You now have the option of specifying a `document_delimiter` when training a `LanguageModel`. Say you have a corpus of textual lists and use "[SEP]" to mark the boundary between two lists, like this:
Colors:
- blue
- green
- red
[SEP]
Cities:
- Berlin
- Munich
[SEP]
...
Then you can now train a language model by setting the `document_delimiter` in the `TextCorpus` and `LanguageModel` objects. This will make sure only documents as a whole will get shuffled during training (i.e. the lists in the above example):
python
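from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# assumed setup (not part of the original snippet): default character
# dictionary and a forward language model
dictionary = Dictionary.load('chars')
is_forward_lm = True

# your document delimiter
delimiter = '[SEP]'

# set it when you load the corpus
corpus = TextCorpus(
    "data/corpora/conala-corpus/",
    dictionary,
    is_forward_lm,
    character_level=True,
    document_delimiter=delimiter,
)

# set it when you init the language model
language_model = LanguageModel(
    dictionary,
    is_forward_lm=True,
    hidden_size=512,
    nlayers=1,
    document_delimiter=delimiter
)

# train your language model as always
trainer = LanguageModelTrainer(language_model, corpus)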
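What "only documents as a whole get shuffled" means can be sketched in plain Python. This is an illustration of the idea under simple assumptions (the whole corpus fits in one string), not Flair's implementation. `split_documents` and `shuffle_documents` are hypothetical helpers.

```python
import random
from typing import List


def split_documents(text: str, delimiter: str) -> List[str]:
    """Split raw corpus text into documents at the delimiter, dropping empty pieces."""
    documents = [document.strip() for document in text.split(delimiter)]
    return [document for document in documents if document]


def shuffle_documents(text: str, delimiter: str, seed: int) -> str:
    """Shuffle whole documents; the lines inside each document keep their order."""
    documents = split_documents(text, delimiter)
    random.Random(seed).shuffle(documents)
    return f"\n{delimiter}\n".join(documents)


raw = "Colors:\n- blue\n- green\n- red\n[SEP]\nCities:\n- Berlin\n- Munich\n[SEP]\n"
documents = split_documents(raw, "[SEP]")
shuffled = shuffle_documents(raw, "[SEP]", seed=0)
print(len(documents))
```

Each list stays intact after shuffling, so the language model never sees a color item spliced into the cities list.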
Allow column delimiter to be set in ColumnCorpus (1526)
Added the possibility to set a different column delimiter for `ColumnCorpus`, e.g.
python
from pathlib import Path

from flair.datasets import ColumnCorpus

corpus = ColumnCorpus(
    Path("/path/to/corpus/"),
    column_format={0: 'text', 1: 'ner'},
    column_delimiter='\t',  # set a different delimiter
)
if you want to read a tab-separated column corpus.
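Why the delimiter matters can be shown with a single line of a column file. The sketch below is not Flair's parser. `parse_column_line` is a hypothetical helper, and treating the delimiter as a regular expression with a whitespace default is an assumption made for this illustration.

```python
import re
from typing import Dict


def parse_column_line(
    line: str,
    column_format: Dict[int, str],
    column_delimiter: str = r"\s+",  # assumed default: any run of whitespace
) -> Dict[str, str]:
    """Split one line of a column file and name the fields via column_format."""
    fields = re.split(column_delimiter, line.rstrip("\n"))
    return {name: fields[index] for index, name in column_format.items()}


column_format = {0: "text", 1: "ner"}

# With a tab delimiter, a token that contains a space stays in one field.
row = parse_column_line("New York\tB-LOC\n", column_format, column_delimiter="\t")
print(row)
```

Splitting the same line on any whitespace would cut "New York" in two and shift the tag into the wrong column, which is the case a tab delimiter avoids.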
Improvements in classification corpus datasets (1545)
There are a number of improvements for the `ClassificationCorpus` and `ClassificationDataset` classes:
- It is now possible to select from three memory modes ('full', 'partial' and 'disk'). Use 'full' if the entire dataset and all objects fit into memory. Use 'partial' if they do not, and use 'disk' if even 'partial' does not fit.
- It is also now possible to provide "name maps" to rename labels in datasets. For instance, some sentiment analysis datasets use '0' and '1' as labels, while some others use 'POSITIVE' and 'NEGATIVE'. By providing name maps you can rename labels so they are consistent across datasets.
- You can now choose which splits to downsample (for instance you might want to downsample 'train' and 'dev' but not 'test')
- You can now specify the option "filter_if_longer_than" to filter out all sentences that contain more than the given number of whitespaces. This is useful to limit corpus size, as some sentiment analysis datasets are gigantic.
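The effect of name maps and length filtering can be sketched in a few lines of plain Python. This illustrates the behaviour described in the list above and is not the `ClassificationCorpus` code. `prepare_examples` and its parameter names are chosen for this sketch, and the example data is made up.

```python
from typing import Dict, Iterable, List, Tuple


def prepare_examples(
    examples: Iterable[Tuple[str, str]],
    label_name_map: Dict[str, str],
    filter_if_longer_than: int = -1,  # -1 disables filtering in this sketch
) -> List[Tuple[str, str]]:
    """Rename labels via the name map and drop texts with too many whitespaces."""
    prepared = []
    for text, label in examples:
        if filter_if_longer_than > 0 and text.count(" ") > filter_if_longer_than:
            continue  # text is too long, skip it
        prepared.append((text, label_name_map.get(label, label)))
    return prepared


raw_examples = [
    ("great movie", "1"),
    ("a " * 50 + "very long review", "0"),
    ("terrible", "0"),
]
cleaned = prepare_examples(
    raw_examples,
    label_name_map={"0": "NEGATIVE", "1": "POSITIVE"},
    filter_if_longer_than=10,
)
print(cleaned)
```

After this step, a dataset labeled with '0' and '1' uses the same label names as one labeled with 'NEGATIVE' and 'POSITIVE', so the two can be combined.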
Added different ways to combine ELMo layers (1547)
Improved default annealing scheme to anneal against score and loss (1570)
Add new scheduler that uses dev score as main metric to anneal against, but additionally uses dev loss in case two epochs have the same dev score.
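The tie-breaking rule can be written down as a small comparison function. This is a minimal sketch of the rule as described here, not the scheduler shipped with Flair. `is_improvement` and `epochs_without_improvement` are hypothetical names and the history values are invented.

```python
from typing import List, Tuple

Epoch = Tuple[float, float]  # (dev_score, dev_loss)


def is_improvement(current: Epoch, best: Epoch) -> bool:
    """Higher dev score wins; on equal dev score, lower dev loss wins."""
    score, loss = current
    best_score, best_loss = best
    if score != best_score:
        return score > best_score
    return loss < best_loss


def epochs_without_improvement(history: List[Epoch]) -> int:
    """Count consecutive epochs since the last improvement (drives annealing)."""
    if not history:
        return 0
    best, bad_epochs = history[0], 0
    for epoch in history[1:]:
        if is_improvement(epoch, best):
            best, bad_epochs = epoch, 0
        else:
            bad_epochs += 1
    return bad_epochs


# Same dev score in the first three epochs; only the loss tells them apart.
history = [(0.90, 0.40), (0.90, 0.35), (0.90, 0.37), (0.89, 0.30)]
print(epochs_without_improvement(history))
```

A scheduler would reduce the learning rate once this counter exceeds its patience.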
Added option for hidden state position in FlairEmbeddings (1571)
Adds the option to choose which hidden state to use in FlairEmbeddings: either the state at the end of each word, or the state at the whitespace after. Default is the state at the whitespace after.
You can change the default like this:
python
embeddings = FlairEmbeddings('news-forward', with_whitespace=False)
Using the state at the end of each word (`with_whitespace=False`) seems to work better for syntactic tasks such as POS tagging. For instance, on UD_ENGLISH POS-tagging, we get **96.56 ± 0.03** with whitespace and **96.72 ± 0.04** without, averaged over three runs.
See the discussion in #1362 for more details.
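The two hidden state positions differ by exactly one character. The sketch below computes, for a forward language model reading space-joined tokens, which character index would be used for each word under either setting. It only illustrates the positions. It ignores any padding and the backward direction, and `extraction_offsets` is a hypothetical helper, not part of Flair.

```python
from typing import List


def extraction_offsets(tokens: List[str], with_whitespace: bool = True) -> List[int]:
    """Character index whose hidden state would represent each token."""
    offsets = []
    position = 0
    for token in tokens:
        position += len(token)  # now points at the whitespace after the token
        offsets.append(position if with_whitespace else position - 1)
        position += 1  # step over the whitespace
    return offsets


tokens = ["The", "grass", "is", "green"]
text = " ".join(tokens) + " "  # trailing space so the last word has a whitespace too

after_whitespace = extraction_offsets(tokens, with_whitespace=True)
end_of_word = extraction_offsets(tokens, with_whitespace=False)
print(after_whitespace, end_of_word)
```

With `with_whitespace=True` every index points at a space, and with `with_whitespace=False` it points at the last letter of the word.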
Other features
- Added the option of passing different tokenizers when loading classification datasets (1579)
- Added option for true whitespaces in ColumnCorpus (1583)
- Configurable cache_root from environment variable (507)
Performance improvements
- Improve performance for loading not-in-memory corpus (1413)
- A new lmdb based alternative backend for word embeddings (1515 1536)
- Slim down requirements (1419)
Bug Fixes
- Fix issue where flair was crashing for cpu only version of pytorch (1393 1418)
- Fix GPU memory error in PooledFlairEmbeddings (1417)
- Various small fixes (1402 1533 1511 1560 1616)
- Improve documentation (1446 1447 1520 1525 1556)
- Fix various issues in classification datasets (1499)