New tokenizer API, TensorFlow improvements, enhanced documentation & tutorials
Breaking changes since `v2`
- In 4874 the language modeling BERT has been split in two: `BertForMaskedLM` and `BertLMHeadModel`. `BertForMaskedLM` therefore cannot do causal language modeling anymore, and cannot accept the `lm_labels` argument.
- The `Trainer` data collator is now a method instead of a class
- Directly setting a tokenizer's special token attributes (e.g. `tokenizer.mask_token = '<mask>'`) now only associates the token with that attribute of the tokenizer but doesn't add the token to the vocabulary if it is not already there. Tokens are only added to the vocabulary through the `tokenizer.add_special_tokens()` and `tokenizer.add_tokens()` methods (see the sketch after this list)
- The `prepare_for_model` method was removed as part of the new tokenizer API.
- The truncation method is now `only_first` by default.
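As an illustration, here is a minimal sketch of the new special-token behavior; the `bert-base-uncased` checkpoint is an assumption for the example, not something prescribed by the notes above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Setting the attribute only associates the string with the attribute;
# it no longer adds the token to the vocabulary.
tokenizer.mask_token = "<mask>"

# Growing the vocabulary now has to go through add_special_tokens() / add_tokens().
num_added = tokenizer.add_special_tokens({"mask_token": "<mask>"})
print(num_added)  # 1 if "<mask>" was not already in the vocabulary
```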
New Tokenizer API (n1t0, thomwolf, mfuntowicz)
The tokenizers have evolved quickly in version 2, with the addition of Rust tokenizers. They now have a simpler and more flexible API, aligned between the Python (slow) and Rust (fast) tokenizers. This new API lets you control truncation and padding more precisely, allowing things like dynamic padding or padding to a multiple of 8.
The redesigned API is explained in detail here 4510 and here: https://huggingface.co/transformers/master/preprocessing.html
Notable changes:
- it's now possible to truncate to the maximum input length of a model while padding to the longest sequence in a batch
- padding and truncation are decoupled and easier to control
- it's possible to pad to a multiple of a predefined length, e.g. 8 which can give significant speeds up on recent NVIDIA GPU (V100)
- a generic entry point, `tokenizer.__call__`, can be used for all cases (single sequences, pairs of sequences, batches, etc.); see the sketch after this list
- tokenizers now accept pre-tokenized inputs (when the input is already split into words, e.g. for NER)
- All the Rust tokenizers are now fully tested like slow tokenizers
- A new class `AddedToken` can be used for more fine-grained control over how added tokens behave during tokenization. In particular, the user can control (1) whether left and right spaces are removed around the token during tokenization, (2) whether the token will be identified inside another word, and (3) whether the token will be recognized in normalized forms (e.g. in lower case if the tokenizer uses lower-casing)
- Serialization issues were fixed
- Possibility to create NumPy tensors when using the `return_tensors` parameter on tokenizers.
- Introduced a new enum `TensorType` to map all the possible tensor backends we support: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`
- Tokenizers now accept the `TensorType` enum in the `encode(...)`, `encode_plus(...)` and `batch_encode_plus(...)` methods for the `return_tensors` parameter.
- `BatchEncoding` new property `is_fast` indicates if the `BatchEncoding` comes from a Python (slow) tokenizer or a Rust (fast) tokenizer.
- Slow and Fast Tokenizers are now picklable. So is their output, the dict sub-class `BatchEncoding`.
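As an illustration, a hedged sketch of the unified `__call__` entry point; the `bert-base-uncased` checkpoint and the concrete argument values are assumptions for the example, not part of the notes above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Hello world!", "A slightly longer second sentence for the batch."],
    padding="longest",      # dynamic padding to the longest sequence in the batch
    truncation=True,        # truncate to the model's maximum input length
    pad_to_multiple_of=8,   # can give significant speed-ups on recent NVIDIA GPUs (V100)
    return_tensors="pt",    # "tf" and "np" are also accepted
)
print(batch["input_ids"].shape)
print(batch.is_fast)  # True when a Rust (fast) tokenizer produced this BatchEncoding
```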
Several PRs to make the API more stable have been made:
- [tokenizers] Fix 5081 and improve backward compatibility 5125 (thomwolf)
- Tokenizers API developments 5103 (thomwolf)
- Clearer error message in the use-case of 5169 (thomwolf)
- Add more tests on tokenizers serialization - fix bugs 5056 (thomwolf)
- [Tokenization] Fix 5181 - make 5155 more explicit - move back the default logging level in tests to WARNING 5252 (thomwolf)
- [tokenizers] Several small improvements and bug fixes 5287
- Add `pad_to_multiple_of` on tokenizers (reimport) 5054 (mfuntowicz)
- [tokenizers] Updates data processors, docstring, examples and model cards to the new API 5308
TensorFlow improvements (jplu, dzorlu, LysandreJik)
Very big release for TensorFlow!
- TensorFlow models can now compute the loss themselves, using the `TFPreTrainedModel.compute_loss` method 4530 (see the sketch after this list)
- Can now resize token embeddings in TensorFlow 4351
- Cleaning TensorFlow models 5229
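A hedged sketch of the two additions above, assuming the `bert-base-uncased` checkpoint and a sequence classification head (both are illustrative choices, not named in the notes):

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

# Resize the token embeddings after growing the vocabulary (4351).
tokenizer.add_tokens(["<new_token>"])  # "<new_token>" is a made-up example token
model.resize_token_embeddings(len(tokenizer))

# Let the model compute its own loss by passing labels (4530).
inputs = tokenizer("This release is great.", return_tensors="tf")
outputs = model(inputs, labels=tf.constant([1]))
loss = outputs[0]  # the loss is the first element of the returned tuple
```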
Enhanced documentation (sgugger)
We welcome sgugger as a team member in New York. He has already introduced a lot of very cool documentation changes:
- Added a [model summary](https://huggingface.co/transformers/master/model_summary.html) #4789
- Expose classes used in documentation 4808
- Explain how to preview the docs in a PR 4795
- Clean documentation 4849
- Remove old doc page and add note about cache in installation 5027
- Fix all sphinx warnings 5068 (sgugger)
- Update pipeline examples to doctest syntax 5030
- Reorganize documentation 5064
- Update installation page and add contributing to the doc 5084
- Update glossary 5148
- Quick tour 5145
- Switch master/stable doc and add older releases 5193
- Add version control menu 5222
- Don't recreate old docs 5243
- Tokenization tutorial 5257
- Remove links for all docs 5280
- New model sharing tutorial 5323
Training & fine-tuning quickstart
- Our own joeddav added a training & fine-tuning quickstart to the documentation 5034!
MobileBERT
The MobileBERT from [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, was added to the library for both PyTorch and TensorFlow.
A single checkpoint is added: `mobilebert-uncased`, which is the `uncased_L-24_H-128_B-512_A-4_F-4_OPT` checkpoint converted to our API.
This model was first implemented in PyTorch by lonePatient, ported to the library by vshampor, then finalized and implemented in TensorFlow by LysandreJik.
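A minimal loading sketch; the namespaced identifier `google/mobilebert-uncased` is an assumption here (the notes above refer to the checkpoint simply as `mobilebert-uncased`):

```python
from transformers import MobileBertTokenizer, MobileBertModel

# The namespaced checkpoint name is assumed; adjust it to whatever the hub exposes.
tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
model = MobileBertModel.from_pretrained("google/mobilebert-uncased")

inputs = tokenizer("MobileBERT is a compact task-agnostic BERT.", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs[0]
```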
Eli5 examples (yjernite) 4968
- The examples/eli5 folder contains training code for the dense retriever and for fine-tuning a BART model, the Jupyter notebook for the blog post, and the code for the live demo.
- The RetriBert model implements the dense passage retriever. It's basically a wrapper around two BERT models and projection matrices, but it handles gradient checkpointing in a way that is very different from a concurrent PR, so Yacine thought it would be easier to give it its own class for now and see whether it can be merged into the BART code later.
Enhanced examples/seq2seq (sshleifer)
- the `examples/seq2seq` [folder](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md) is a combination of the old `examples/summarization` and `examples/translation` folders.
- Fine-tuning works well for summarization; more experiments are needed for translation. Fine-tuning works on multi-GPU, saves ROUGE scores during validation, and provides `--freeze_encoder` and `--freeze_embeds` options. These options make fine-tuning BART 5x faster on the CNN/DailyMail dataset.
- Distilbart code is added in `distillation.py`. It only supports summarization for now.
- Evaluation works well for both summarization and translation.
- New Weights & Biases [shared task](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md#xsum-shared-task) for collaboration on the XSUM summarization task
Distilbart (sshleifer)
- Distilbart models are smaller versions of `bart-large-cnn` and `bart-large-xsum`. They can be loaded using `BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-xsum-12-6')`, for example. See this [tweet](https://twitter.com/sam_shleifer/status/1276160367853547522?s=20) for more info on available models and their speed/performance.
- Commands to reproduce are available in the `examples/seq2seq` [folder](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md)
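For instance, a hedged summarization sketch with one of the distilled checkpoints; the generation hyper-parameters are illustrative, and it is assumed the checkpoint ships the standard BART tokenizer files:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-xsum-12-6")

article = "A long news article to summarize ..."
inputs = tokenizer([article], return_tensors="pt", truncation=True)
summary_ids = model.generate(
    inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```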
BERT Loses Patience (JetRunner)
Add BERT Loses Patience (Patience-based Early Exit) based on the paper https://arxiv.org/abs/2006.04152 and the official implementation https://github.com/JetRunner/PABEE
Unifying `label` arguments (sgugger) 4722
- Deprecate all label arguments that aren't called `labels` (like `masked_lm_labels`, `lm_labels`, etc.) in favor of a unified `labels` argument.
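A small before/after sketch of the unified argument, assuming `bert-base-uncased` and a masked-LM head purely for illustration:

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
# Old (now deprecated): model(**inputs, masked_lm_labels=inputs["input_ids"])
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs[0]
# In practice, positions that should not contribute to the loss are set to -100.
```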
NumPy type in tokenizers (mfuntowicz) 4585
Introduce NumPy as a new tensor type for the `return_tensors` parameter of tokenizers.
- As we're introducing more than two tensor backend alternatives, a new enum `TensorType` lists all the possible tensor types we can create: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`. This might help newcomers who don't know about "tf" and "pt".
*Note: `TensorType` values are compatible with the previous "tf", "pt" and now "np" strings, to preserve backward compatibility (+ unit tests)*
- NumPy is now a possible target when creating tensors. This is useful for JAX. See the short sketch below.
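A short hedged sketch; whether `TensorType` is importable from the top-level package is an assumption (it may need to be imported from the tokenization utilities module instead):

```python
from transformers import BertTokenizer, TensorType  # top-level TensorType export assumed

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

np_batch = tokenizer("NumPy output, please.", return_tensors="np")                # string form
np_batch_enum = tokenizer("NumPy output, please.", return_tensors=TensorType.NUMPY)  # enum form
print(type(np_batch["input_ids"]))  # <class 'numpy.ndarray'>
```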
Community notebooks
- Adding notebooks for Fine Tuning 4732 (abhimishra91):
- Multi-class classification: Using DistilBert
- Multi-label classification: Using Bert
- Summarization: Using T5 - Model Tracking with WandB
- [Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb) #5195 (pommedeterresautee)
- [How to use Benchmarks](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) (patrickvonplaten) #5312
Benchmarks (patrickvonplaten)
The benchmark script was consolidated and some features were added:
Adds the ability to measure the following configurations for TensorFlow and PyTorch (4912); a usage sketch follows the list below:
- TensorFlow:
- Inference: CPU, GPU, GPU + XLA, GPU + eager mode, CPU + eager mode, TPU
- PyTorch:
- Inference: CPU, CPU + torchscript, GPU, GPU + torchscript, GPU + mixed precision, Torch/XLA TPU
- Training: CPU, GPU, GPU + mixed precision, Torch/XLA TPU
- [Benchmark] Add encoder decoder to benchmark and clean labels 4810
- [Benchmark] add tpu and torchscript for benchmark 4850
- [Benchmark] Extend Benchmark to all model type extensions 5241
- [Benchmarks] improve Example Plotter 5245
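A hedged usage sketch for the PyTorch side; the class and argument names are assumed from the consolidated benchmark utilities (4912), and the concrete values are illustrative:

```python
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["bert-base-uncased"],
    batch_sizes=[8],
    sequence_lengths=[128],
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()  # reports inference time and memory for each configuration
```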
Hidden states, attentions and cache
Before v3.0.0, the way to handle attentions, model hidden states, and whether to use the cache (in models that have it for sequential decoding) was to specify an argument in the configuration. In v3.0.0, while we maintain those configuration arguments for backwards compatibility, we introduce a new way of handling them directly through the `forward` and `call` methods (see the sketch after the list below).
- Output attentions 4538 (Bharat123rox)
- Output hidden states 4978 (drjosephliu)
- Use cache 5194 (patrickvonplaten)
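A hedged sketch of the new per-call flags, using `bert-base-uncased` as an illustrative checkpoint:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Attentions and hidden states on demand.", return_tensors="pt")
# The flags can now be passed per call instead of being fixed in the configuration.
outputs = model(**inputs, output_attentions=True, output_hidden_states=True)
# With both flags set, hidden_states and attentions are appended to the usual outputs.
```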
Revamped `AutoModel`s (patrickvonplaten)
`AutoModelWithLMHead` encompasses all models with a language modeling head, without distinguishing between causal, masked and seq2seq models. Three new auto models are added (see the sketch after this list):
- `AutoModelForCausalLM` for Autoregressive models
- `AutoModelForMaskedLM` for Autoencoding models
- `AutoModelForSeq2SeqLM` for Sequence-to-sequence models with causal LM for the decoder
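A brief sketch of the new auto classes; the checkpoint names are illustrative choices, not prescribed by the notes:

```python
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM

causal_lm = AutoModelForCausalLM.from_pretrained("gpt2")                 # autoregressive
masked_lm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")    # autoencoding
seq2seq_lm = AutoModelForSeq2SeqLM.from_pretrained("t5-small")           # encoder-decoder
```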
New model & tokenizer architectures
- XLMRobertaForQuestionAnswering 4855 (sgugger)
- ElectraForQuestionAnswering 4913 (patil-suraj)
- Add AlbertForMultipleChoice 4959 (sgugger)
- BartForQuestionAnswering 4908 (patil-suraj)
- BartTokenizerFast 4878 (patil-suraj)
- Add DistilBertForMultipleChoice 5032 (sgugger)
- ElectraForMultipleChoice 4954 (sgugger)
ONNX
- Fixed a bug causing invalid ordering of the inputs in the underlying ONNX IR.
- Increased logging to give the user more information about the exported variables.
Bug fixes and improvements
- TFRobertaModelIntegrationTest requires tf 4726 (sshleifer)
- Cleanup glue for TPU 4621 (jysohn23)
- [Reformer] Improved memory if input is shorter than chunk length 4720 (patrickvonplaten)
- Pipelines: miscellanea of QoL improvements and small features 4632 (julien-c)
- Fix bug when changing the <EOS> token for generate 4745 (patrickvonplaten)
- never_split on slow tokenizers should not split 4723 (mfuntowicz)
- PretrainedModel.generate: remove unused kwargs 4761 (sshleifer)
- Codecov is now setup differently to have better insights into code coverage 4768 (LysandreJik)
- Don't access pad_token_id if there is no pad_token 4773 (sgugger)
- Removed deprecated use of Variable API from pplm example 4619 (prajjwal1)
- Add drop_last arg for data loader 4757 4925 (setu4993)
- No silent error when XLNet's `d_head` is already in the configuration 4747 (LysandreJik)
- MarianTokenizer: delete unused constants 4802 (sshleifer)
- NER: Add new WNUT’17 example 4681 (stefan-it)
- [EncoderDecoderConfig] automatically set decoder config to decoder 4809 (patrickvonplaten)
- Add matplotlib to known 3rd party dependencies 4800 (sshleifer)
- Pipelines test and new kwarg 4812 (sshleifer)
- Updated path "cd examples/text-generation/pplm" 4778 (Mr-Ruben)
- [marian tests] pass device to pipeline 4815 (sshleifer)
- Export PretrainedBartModel from __init__ 4819 (BramVanroy)
- Updates args in tf squad example. 4820 (daniel-shan)
- [Generate] beam search should generate without replacement (patrickvonplaten)
- TFTrainer: Align how the checkpoints are managed with the PyTorch trainer. 4831 (jplu)
- [Longformer] Remove redundant code 4839 (ZhuBaohe)
- [cleanup] consolidate some prune_heads logic 4799 (sshleifer)
- Fix the __getattr__ method in BatchEncoding 4772 (jplu)
- Consolidate summarization examples 4837 (aretius)
- Fix a bug in the initialization and serialization of TFRobertaClassificationHead 4884 (harkous)
- [examples] Cleanup summarization docs 4876 (sshleifer)
- run_pplm.py bug fix 4867 (songyouwei)
- Remove unused arguments in Multiple Choice example 4853 (sgugger)
- Deal with multiple choice in common tests 4886 (sgugger)
- Fix the CI 4903 (sgugger)
- [All models] fix docs after adding output attentions to all forward functions 4909 (patrickvonplaten)
- Add more models to common tests 4910 (sgugger)
- [ctrl] fix pruning of MultiHeadAttention 4904 (aretius)
- Don't init TPU device twice 4916 (patrickvonplaten)
- Run a single wandb instance per TPU run 4851 (LysandreJik)
- check type before logging in trainer to ensure values are scalars 4883 (mgoldey)
- Split LMBert model in two 4874 (sgugger)
- Make multiple choice models work with input_embeds 4921 (sgugger)
- Fix resize_token_embeddings for Transformer-XL 4759 (RafaelWO)
- [mbart] Fix fp16 testing logic 4949 (sshleifer)
- Hans data with newer tokenizer API 4854 (sgugger)
- Fix parameter 'output_attentions' docstring 4976 (ZhuBaohe)
- Improve ONNX logging 4999 (mfuntowicz)
- NER: fix construction of input examples for RoBERTa 4943 (stefan-it)
- Possible fix to make AMP work with DDP in the trainer 4728 (BramVanroy)
- Make DataCollator a callable 5015 (sgugger)
- Increase pipeline support for ONNX export. 5005 (mfuntowicz)
- Fix importing transformers on Windows - SIGKILL not defined 4997 (mfuntowicz)
- TFTrainer: improve logging 4946 (borisdayma)
- Add position_ids in TFElectra models docstring 5021 (sgugger)
- [Bart] Question Answering Model is added to tests 5024 (patrickvonplaten)
- Ability to pickle/unpickle BatchEncoding pickle (reimport) 5039 (mfuntowicz)
- refactor(wandb): consolidate import 5044 (borisdayma)
- [cleanup] Hoist ModelTester objects to top level 4939 (aretius)
- Convert hans to Trainer 5025 (sgugger)
- Fix marian tokenizer save pretrained 5043 (sshleifer)
- [cleanup] examples test_run_squad uses tiny model 5059 (sshleifer)
- Add header and fix command for HANS 5082 (sgugger)
- [examples] SummarizationModule improvements 4951 (sshleifer)
- Some changes to simplify the generation function 5031 (yjernite)
- Make default_data_collator more flexible and deprecate old behavior 5060 (sgugger)
- [MarianTokenizer] Switch to sacremoses for punc normalization 5092 (sshleifer)
- [style] add pandas to setup.cfg 5093 (sshleifer)
- [ElectraForQuestionAnswering] fix qa example in doc 4929 (patil-suraj)
- Fixing TPU training by disabling wandb.watch gradients logging 4926 (patil-suraj)
- [docs] fix T5 training doc 5080 (patil-suraj)
- support local_files_only option for tf models 5116 (ogarin)
- [cleanup] generate_beam_search comments 5115 (sshleifer)
- [fix] Move _adjust_logits above postprocess to fix Marian.generate 5126 (sshleifer)
- Pin `sphinx-rtd-theme` 5128 (LysandreJik)
- Add missing arg in 02-transformers notebook 5085 (pri-ax)
- [cleanup] remove redundant code in SummarizationDataset 5119 (sshleifer)
- AutoTokenizer supports mbart-large-en-ro 5121 (sshleifer)
- Fix in Reformer Config documentation 5138 (erickrf)
- [bart-mnli] Fix class flipping bug 5141 (sshleifer)
- [MobileBert] fix dropout 5150 (ZhuBaohe)
- SummarizationPipeline: init required task name 5086 (julien-c)
- [examples] fixes arguments for summarization finetune scripts 5157 (ieBoytsov)
- Fixing docs for Encoder Decoder Config 5171 (mikaelsouza)
- fix bart doc 5132 (fuzihaofzh)
- Added feature to move added tokens in vocabulary for Transformer-XL 4953 (RafaelWO)
- Add support for gradient checkpointing in BERT 4659 (ibeltagy)
- Fix for IndexError when Roberta Tokenizer is called on empty text 4209 (malteos)
- Add TF auto model to the docs + fix sphinx warnings (again) 5187 (sgugger)
- Have documentation fail on warning 5189 (LysandreJik)
- Cleaner warning when loading pretrained models 4557 (thomwolf)
- Upgrade examples to pl=0.8.1 5146 (sshleifer)
- [fix] mobilebert had wrong path, causing slow test failure 5205 (sshleifer)
- [fix] remove unused import 5206 (sshleifer)
- [Reformer] Axial Pos Emb Improve mem usage reformer 5209 (patrickvonplaten)
- [pl_examples] revert deletion of optimizer_step 5227 (sshleifer)
- [bart] add config.extra_pos_embeddings to facilitate reuse 5190 (sshleifer)
- Only put tensors on a device 5223 (sgugger)
- Fix PABEE division by zero error 5233 (JetRunner)
- Use the script in utils 5224 (sgugger)
- Delay decay schedule until the end of warmup 4940 (amodaresi)
- Replace pad_token with -100 for LM loss calculation 4718 (setu4993)
- examples/seq2seq supports translation 5202 (sshleifer)
- Fix convert_graph_to_onnx script 5230 (n1t0)
- Refactor Code samples; Test code samples 5036 (LysandreJik)
- [Generation] fix docs for decoder_input_ids 5306 (patrickvonplaten)
- [pipelines] Change summarization default to distilbart-cnn-12-6 5289 (sshleifer)
- Add BART-base modeling and configuration 5315 (JetRunner)
- CircleCI stores cleaner output at test_outputs.txt 5291 (sshleifer)
- [pl_examples] default warmup steps=0 5316 (sshleifer)