New model architecture: DistilBERT
Adding Hugging Face's new transformer architecture, **DistilBERT**, described in [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5) by Victor Sanh, Lysandre Debut and Thomas Wolf.
This new model architecture comes with two pretrained checkpoints:
- `distilbert-base-uncased`: the base DistilBERT model
- `distilbert-base-uncased-distilled-squad`: DistilBERT model fine-tuned with distillation on SQuAD.
An awaited new pretrained checkpoint: GPT-2 large (774M parameters)
The third OpenAI GPT-2 checkpoint (GPT-2 large) is available in the library under the shortcut name `gpt2-large`: 774M parameters, 36 layers, and 20 heads.
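
As a sanity check on the 774M figure, the parameter count can be rebuilt from the layer and head counts above. This is a minimal sketch, not library code: the function name is hypothetical, and the head size of 64, the vocabulary of 50257 tokens, the 1024 positions and the tied output layer are assumptions taken from the published GPT-2 configuration rather than from this release note.

```python
def gpt2_parameter_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Count weights and biases of a GPT-2 style decoder (output layer tied to embeddings)."""
    token_embeddings = vocab_size * n_embd
    position_embeddings = n_positions * n_embd

    # Attention: fused query/key/value projection plus the output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * n_embd)

    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * n_embd
    return token_embeddings + position_embeddings + n_layer * per_block + final_layer_norm


# Assumption: 20 heads of size 64 give a hidden size of 1280.
n_heads, head_dim = 20, 64
total = gpt2_parameter_count(n_layer=36, n_embd=n_heads * head_dim)
print(f"{total:,}")  # 774,030,080
```

Under these assumptions the 36 transformer blocks account for about 708M of the parameters and the token embeddings for about 64M.
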
New XLM multilingual pretrained checkpoints in 17 and 100 languages
We have added two new [XLM models in 17 and 100 languages](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
New dependency: `sacremoses`
Support for XLM is improved by carefully reproducing the original tokenization workflow (work by shijie-wu in #1092). We now rely on [`sacremoses`](https://github.com/alvations/sacremoses), a Python port of the Moses tokenizer, truecaser and normalizer by alvations, for XLM word tokenization.
In a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These dependencies are optional at the library level: using the XLM tokenizer in these languages without them raises an error with installation instructions. The optional dependencies are:
- pythainlp: Thai tokenizer
- kytea: Japanese tokenizer, a wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
- jieba: Chinese tokenizer *
\* XLM used the Stanford Segmenter. However, its wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead and will be deprecated. Jieba is a lot faster and pip-installable, but its output does not always match the Stanford Segmenter's. A workaround could be an argument that lets users segment the sentence themselves and bypass the segmenter. As a reference, I also include nltk.tokenize.stanford_segmenter in this PR.
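
The optional-dependency behaviour and the bypass workaround described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `require_optional`, `segment_chinese` and the `bypass_tokenizer` argument are hypothetical names, and only `jieba.cut` is a real call from the jieba package.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import an optional dependency, or fail with installation instructions."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is needed to tokenize this language. "
            f"Install it with: {install_hint}"
        ) from err


def segment_chinese(text, bypass_tokenizer=False):
    """Return a list of words, skipping jieba when the text is already segmented."""
    if bypass_tokenizer:
        # The caller has already separated the words with spaces.
        return text.split()
    # The import happens only here, so jieba stays optional for other languages.
    jieba = require_optional("jieba", "pip install jieba")
    return list(jieba.cut(text))


# Pre-segmented input works even when jieba is not installed.
print(segment_chinese("我 爱 北京", bypass_tokenizer=True))
```

Deferring the import to the first call that needs it is what keeps the dependency optional: users who never tokenize Chinese text never hit the error.
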
Bug fixes and improvements to the library modules
- Bertology script has seen major improvements (tuvuumass)
- Iterative tokenization is now faster and accepts arbitrary numbers of added tokens (samvelyan)
- Added RoBERTa to AutoModels and AutoTokenizers (LysandreJik)
- Added GPT-2 Large 774M model (thomwolf)
- Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (LysandreJik, thomwolf)
- Multi-GPU training has been patched (FeiWang96)
- Scripts are updated to reflect PyTorch 1.1.0 changes (scheduler, optimizer) (Morizeyao, adai183)
- Updated the in-depth BERT fine-tuning scripts to `pytorch-transformers` (Morizeyao)
- Models with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (LysandreJik, thomwolf)
- Add `proxies` and `force_download` options to `from_pretrained()` method to be able to use proxies and update cached models/tokenizers (thomwolf)
- Add a shortcut to the id of each special token with `_id` properties (e.g. `tokenizer.cls_token_id` for the id in the vocabulary of `tokenizer.cls_token`) (thomwolf)
- Fix GPT-2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see note by [fairseq authors](https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py#L38-L56)) (thomwolf)
- Fix and clean up byte-level BPE tests (thomwolf)
- Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests (LysandreJik)
- Fix a warning raised when the decode method is called for a model with no `sep_token` like GPT-2 (LysandreJik)
- Updated the tokenizers saving method (boy2000-007man)
- spaCy tokenizers have been updated in the tokenizers (GuillemGSubies)
- Stable `EnvironmentErrors` have been added to utility files (abhishekraok)
- Fixed distributed barrier hang (VictorSanh)
- Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (LysandreJik)
- Change layer norm code to PyTorch's native layer norm (dhpollack)
- Improve tokenization of XLM for multilingual inputs (shijie-wu)
- Add language input and access to language to id conversion in XLM tokenizer (thomwolf)
- Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (thomwolf)
- Added new AutoModels: `AutoModelWithLMHead`, `AutoModelForSequenceClassification`, `AutoModelForQuestionAnswering` (LysandreJik)
- Torch.hub is now based on AutoModels (LysandreJik, thomwolf)
- Fix Transformer-XL attention mask dtype to be bool (CrafterKolyan)
- Adding DistilBERT model architecture and checkpoints (VictorSanh, LysandreJik, thomwolf)
- Fixes to DistilBERT configuration and training script (stefan-it)
- Fix XLNet attention mask for fp16 (ziliwang)
- Documentation auto-deploy (LysandreJik)
- Fix to add a tuple of tokens (epwalsh)
- Update fp16 apex implementation in scripts (anhnt170489)
- Fix XLNet bias resizing when adding/removing tokens (LysandreJik)
- Fix tokenizer reloading in example scripts (rabeehk)
- Fix byte-level decoding error when using added tokens (thomwolf, LysandreJik)
- Fix epsilon value in RoBERTa pretrained checkpoints (julien-c)
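
To illustrate the `_id` shortcut properties mentioned in the list above, here is a minimal sketch of the pattern. `ToyTokenizer` and its vocabulary are hypothetical stand-ins written for this example, not the library's classes; the only point is that each `*_token_id` property looks up the matching `*_token` string in the vocabulary.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a library tokenizer, for illustration only."""

    def __init__(self, vocab, cls_token="[CLS]", sep_token="[SEP]", unk_token="[UNK]"):
        self.vocab = dict(vocab)
        self.cls_token = cls_token
        self.sep_token = sep_token
        self.unk_token = unk_token

    def convert_tokens_to_ids(self, token):
        # Tokens missing from the vocabulary fall back to the unknown-token id.
        return self.vocab.get(token, self.vocab[self.unk_token])

    @property
    def cls_token_id(self):
        # Shortcut for convert_tokens_to_ids(cls_token).
        return self.convert_tokens_to_ids(self.cls_token)

    @property
    def sep_token_id(self):
        # Shortcut for convert_tokens_to_ids(sep_token).
        return self.convert_tokens_to_ids(self.sep_token)


tokenizer = ToyTokenizer({"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3})
input_ids = [
    tokenizer.cls_token_id,
    tokenizer.convert_tokens_to_ids("hello"),
    tokenizer.sep_token_id,
]
print(input_ids)  # [1, 3, 2]
```

Because the ids are computed from the token strings on each access, they stay correct if a special token is reassigned after construction.
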