Transformers


3.5.0

Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method

Model versioning

We host more and more of the community's models, which is awesome ❤️. To scale this sharing, we needed to change the infrastructure to both support more models and unlock powerful new features.

To that effect, we have rebuilt the storage backend that we use for models (currently S3) into **our own git repos (using S3 as a git-lfs endpoint for large files), with one model = one repo**.

The benefits of this switch are:

* **built-in versioning** (I mean… it's git. It's pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
* **access control** (will unlock private models, private datasets, etc)
* **scalability** (our usage of S3 to maintain lists of models was starting to bottleneck)

Let's dive into the actual changes:

I. On the website
---

You'll now see a "Browse files and versions" tab or button on each model page. (design is not final, we'll make it more prominent/streamlined in the near future)

This is what this page looks like:

![Screenshot 2020-11-06 at 19.23.05|584x500](https://aws1.discourse-cdn.com/standard14/uploads/hellohellohello/optimized/1X/02272b7436c2db622eb873d9ca2b61349cdebfa6_2_1168x1000.jpeg)

**The UX should look familiar and self-explanatory**, but we'll add more ML-specific features in the future.

You can:
* see commit histories and diffs of changes made to any text file, like `config.json`:
  * changes made by the HuggingFace team will be way clearer – we can perform updates to the models to ensure they work well with the library(ies) (you'll be able to opt out from those changes)
* Large binary files are stored using https://git-lfs.github.com/, which is pretty standard now and interoperable out of the box with git
* update your text files, like your README.md model card, directly on the website!
  * with instant preview 🔥

II. In the transformers library
---

The PR to enable this new storage mode in the `transformers` library is available here: **https://github.com/huggingface/transformers/pull/8324**

This PR has two parts:

**1. changes to the file downloading code** used in `from_pretrained()` methods to use the new file URLs.
Large files are stored in an S3 bucket and served by CloudFront, so downloads should be as fast as they are right now.

In addition, you now have a way to pin a specific version of a model, to a commit hash, tag or branch.

For instance:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1",  # tag name, or branch name, or commit hash
)
```

Finally, the networking code is more robust and doesn't gobble up errors anymore, so in case you have trouble downloading a specific file you'll know exactly why.
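
The same `revision` argument is accepted by the model `from_pretrained()` methods as well. A minimal sketch, reusing the repo and tag from the snippet above:

```python
from transformers import AutoModel

# Pin the model weights to the same tag used for the tokenizer above
model = AutoModel.from_pretrained("julien-c/EsperBERTo-small", revision="v2.0.1")
```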

**2. changes to the model upload CLI** to create a model repo then be able to git clone and git push to it.
We are intentionally not wrapping `git` too much, because we expect most model authors to be familiar with git (and possibly git-lfs) – let us know if that's not the case.

To create a repo:
```bash
transformers-cli repo create your-model-name
```


Then you'll get a repo url that you'll be able to clone:
```bash
git clone https://huggingface.co/username/your-model-name
```

Then commit as usual:

```bash
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
```

A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.
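
As a hedged sketch of that upload flow (the DistilBERT checkpoint below just stands in for whatever model you fine-tuned, and we assume the created repo tracks large files with git-lfs), you can save a model and tokenizer straight into the clone before committing and pushing:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in checkpoint; any model/tokenizer pair works the same way
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Writes config.json, pytorch_model.bin, tokenizer files, etc. into the local clone;
# git-lfs then handles the large binary when you `git add`, `git commit` and `git push`
model.save_pretrained("your-model-name")
tokenizer.save_pretrained("your-model-name")
```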

By the way, again, **every model is its own repo**. So you can git clone any public model if you'd like:

```bash
git clone https://huggingface.co/gpt2
```


But you won't be able to push unless it's one of your models (or one of your orgs').

III. Backward compatibility
---

* **Backward compatibility on model downloads is expected**, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
* **⚠️ Model uploads using the current system won't work anymore**: you'll need to upgrade your transformers installation to the next release, `v3.5.0`, or to build from `master`.
Alternatively, in the next week or so we'll add the ability to create a repo from the website directly so you'll be able to push even without the transformers library.

TFMarian, TFMbart, TFPegasus, TFBlenderbot

- Add tensorflow 2.0 functionality for SOTA seq2seq transformers 7987 (sshleifer)


New and updated scripts

We're working on giving examples of how to leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library and the Trainer API. These scripts are meant as examples that are easy to customize, with lots of comments explaining the various steps. The following tasks are now covered (a minimal sketch of the common pattern follows the list):

- Text classification : New run glue script 7917 (sgugger)
- Causal Language Modeling: New run_clm script 8105 (sgugger)
- Masked Language Modeling: Add line by line option to mlm/plm scripts 8240 (sgugger)
- Token classification: Add new token classification example 8340 (sgugger)
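
As a rough sketch of the pattern these scripts follow (not one of the scripts themselves; the checkpoint, dataset and hyperparameters are only illustrative): load a dataset with 🤗 Datasets, tokenize it, and hand it to the `Trainer`.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative task: SST-2 sentiment classification
raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

encoded = raw_datasets.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output", num_train_epochs=1),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```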

Seq2Seq Trainer

A child of `Trainer` specialized for training seq2seq models, from patil-suraj, stas00 and sshleifer. Accessible through `examples/seq2seq/finetune_trainer.py`. API is similar to `examples/seq2seq/finetune.py`, but API support is better. Example scripts are in `examples/seq2seq/builtin_trainer`.


- [seq2seq testing] multigpu test run via subprocess 7281 (stas00)
- [s2s trainer] tests to use distributed on multi-gpu machine 7965 (stas00)
- [Seq2Seq] Allow EncoderDecoderModels to be trained with Seq2Seq 7809 (patrickvonplaten)
- [Seq2Seq Trainer] Make sure padding is implemented for models without pad_token 8043 (patrickvonplaten)
- [Seq2SeqTrainer] Move import to init to make file self-contained 8194 (patrickvonplaten)
- [s2s test] cleanup 8131 (stas00)
- [Seq2Seq] Correct import in Seq2Seq Trainer 8254 (patrickvonplaten)
- [Seq2Seq] Make Seq2SeqArguments an independent file 8267 (patrickvonplaten)
- [Seq2SeqDataCollator] don't pass add_prefix_space=False to all tokenizers 8329 (sshleifer)

Seq2Seq Testing and Documentation Improvements
- [s2s] create doc for pegasus/fsmt replication 7934 (stas00)
- [s2s] test_distributed_eval 8315 (stas00)
- [s2s] test_bash_script.py - actually learn something 8318 (stas00)
- [s2s examples test] fix data path 8398 (stas00)
- [s2s test_finetune_trainer] failing multigpu test 8400 (stas00)
- [s2s/distill] remove run_distiller.sh, fix xsum script 8412 (sshleifer)


Docs for DistilBART Paper Replication
Re-run experiments from [the paper](https://arxiv.org/abs/2010.13002) [here](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md#distilbart).
- [s2s] distillBART docs for paper replication 8150 (sshleifer)

Refactoring the `generate()` function

The `generate()` method now has a new design so that the user can directly call the methods
`sample()`, `greedy_search()`, `beam_search()` and `beam_sample()`. The code was made more readable, and beam search was sped up by *ca.* 5-10%.

Refactoring the generate() function 6949 (patrickvonplaten)
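
A hedged sketch of the resulting user-facing behavior (GPT-2 and the prompt are just placeholders); `generate()` remains the main entry point and dispatches to the new methods:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("Hello, my name is", return_tensors="pt")["input_ids"]

greedy = model.generate(input_ids, max_length=20)                                      # -> greedy_search()
sampled = model.generate(input_ids, max_length=20, do_sample=True)                     # -> sample()
beam = model.generate(input_ids, max_length=20, num_beams=4)                           # -> beam_search()
beam_sampled = model.generate(input_ids, max_length=20, num_beams=4, do_sample=True)   # -> beam_sample()

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```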

Notebooks

- added qg evaluation notebook 7958 (zolekode)
- adding beginner-friendly notebook on text classification with DistilBERT/TF 7964 (peterbayerle)
- [Notebooks] Add new encoder-decoder notebooks 8246 (patrickvonplaten)

General improvements and bugfixes

- Respect the 119 line chars 7928 (LysandreJik)
- PPL guide code snippet minor fix 7938 (joeddav)
- [ProphetNet] Add Question Generation Model + Test 7942 (patrickvonplaten)
- [multiple models] skip saving/loading deterministic state_dict keys 7878 (stas00)
- Add missing comma 7870 (mrm8488)
- TensorBoard/Wandb/optuna/raytune integration improvements. 7935 (madlag)
- [ProphetNet] Correct Doc string example 7944 (patrickvonplaten)
- [GPT2 batch generation] Make test clearer. `do_sample=True` is not deterministic. 7947 (patrickvonplaten)
- fix 'encode_plus' docstring for 'special_tokens_mask' (0s and 1s were reversed) 7949 (epwalsh)
- Herbert tokenizer auto load 7968 (rmroczkowski)
- [testing] slow tests should be marked as slow 7895 (stas00)
- support relative path for best_model_checkpoint 7973 (HaebinShin)
- Disable inference API for t5-11b 7978 (julien-c)
- [fsmt test] basic config test with online model + super tiny model 7860 (stas00)
- Add whole word mask support for lm fine-tune 7925 (wlhgtc)
- [PretrainedConfig] Fix save pretrained config for edge case 7943 (patrickvonplaten)
- GPT2 - Remove else branch adding 0 to the hidden state if token_type_embeds is None. 7977 (mfuntowicz)
- Fixing the "translation", "translation_XX_to_YY" pipelines. 7975 (Narsil)
- FillMaskPipeline: support passing top_k on __call__ 7971 (julien-c)
- Only log total_flos at the end of training 7981 (sgugger)
- add zero shot pipeline tags & examples 7983 (joeddav)
- Reload checkpoint 7984 (sgugger)
- [gh ci] less output ( --durations=50) 7989 (sshleifer)
- Move NoLayerEmbedTokens 7945 (sshleifer)
- update zero shot default widget example 7992 (joeddav)
- [RAG] Handle the case when title is None while loading own datasets 7941 (lalitpagaria)
- [tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups 7970 (thomwolf)
- [Reformer] remove reformer pad_token_id 7991 (patrickvonplaten)
- Fix BatchEncoding.word_to_tokens for removed tokens 7939 (n1t0)
- Handling longformer model_type 7990 (ethanjperez)
- [doc prepare_seq2seq_batch] fix docs 8013 (patil-suraj)
- [tokenizers] Fixing 8001 - Adding tests on tokenizers serialization 8006 (thomwolf)
- Add mixed precision evaluation 8036 (luyug)
- [docs] [testing] distributed training 7993 (stas00)
- [fix] FSMT slow test uses lists instead of torch tensors 8031 (sshleifer)
- update version for scipy 7998 (suliuzh)
- [cleanup] pegasus,marian,mbart pytorch tests 8033 (sshleifer)
- Fix label name in DataCollatorForNextSentencePrediction test 8048 (sgugger)
- Tiny TF Bart fixes 8023 (LysandreJik)
- Mlflow integration callback 8016 (noise-field)
- Minor error fix of 'bart-large-cnn' details in the pretrained_models doc 8053 (forest1988)
- invalid argument wwm passed to the run_language_modeling.py file 8050 (mohammadreza-Banaei73)
- Fix + Test 8049 (LysandreJik)
- [testing] fixing crash in deberta 8057 (stas00)
- [TF] from_pt should respect authorized_unexpected_keys 8056 (sshleifer)
- Fix TF training arguments instantiation 8063 (LysandreJik)
- Doc fixes in preparation for the docstyle PR 8061 (sgugger)
- Doc styling 8067 (sgugger)
- Fix doc examples 8082 (mymusise)
- Fix comet_ml import and add ensure availability 7933 (dsblank)
- Doc styling fixes 8074 (sgugger)
- Fix DeBERTa docs 8092 (LysandreJik)
- [CI] generate separate report files as artifacts 7995 (stas00)
- Move style_doc to extra_quality_checks 8081 (sshleifer)
- Fix IterableDataset with __len__ in Trainer 8095 (cccntu)
- Fix assertion error message for MLflowCallback 8091 (harupy)
- Fix a bug for `CallbackHandler.callback_list` 8052 (harupy)
- [setup] update/add setup targets 8076 (stas00)
- DEP: pinned sentencepiece to 0.1.91 in setup.py to fix build issues with newer versions 8069 (jmwoloso)
- infer entailment label id on zero shot pipeline 8059 (joeddav)
- Fully remove codecov 8093 (LysandreJik)
- Add AzureML in integrations via dedicated callback 8062 (davidefiocco)
- Adjust setup so that all extras run on Windows 8102 (sgugger)
- Move installation instructions to the top 8106 (sgugger)
- [gh actions] run artifacts job always 8110 (stas00)
- [testing] port test_trainer_distributed to distributed pytest + TestCasePlus enhancements 8107 (stas00)
- [DOC] Improve pipeline() docstrings for config and tokenizer 8123 (BramVanroy)
- Document the various LM Auto models 8118 (sgugger)
- Rename add_start_docstrings_to_callable 8120 (sgugger)
- Update CI cache 8126 (LysandreJik)
- Upgrade PyTorch Lightning to 1.0.2 7852 (SeanNaren)
- Document tokenizer_class in configurations 8152 (sgugger)
- Smarter prediction loop and no- -> no_ in console args 8151 (sgugger)
- Add a template for examples and apply it for mlm and plm examples 8153 (sgugger)
- [testing] distributed: correct subprocess output checking 8157 (stas00)
- Fix eval ref miss in Chinese WWM. 8115 (wlhgtc)
- [CI] Better reports 2 8163 (stas00)
- Fixing some warnings in DeBerta 8176 (Narsil)
- Ci test tf super slow 8007 (LysandreJik)
- Doc fixes and filter warning in wandb 8189 (sgugger)
- Finalize lm examples 8188 (sgugger)
- Replace swish with silu 8166 (TFUsers)
- Remove deprecated arguments from new run_clm 8197 (sgugger)
- Minor style improvements for the Flax BERT and RoBERTa examples 8178 (avital)
- Fix two bugs with --logging_first_step 8193 (abisee)
- [Bug fix] Fixed value for BlenderBot pad token 8205 (guillaume-be)
- Fix the behaviour of DefaultArgumentHandler (removing it). 8180 (Narsil)
- Fix ignore files behavior in doctests 8213 (bryant1410)
- Patch reports 8238 (LysandreJik)
- Fix bad import with PyTorch <= 1.4.1 8237 (sgugger)
- Fix TensorBoardCallback for older versions of PyTorch 8239 (sgugger)
- Add XLMProphetNetTokenizer to tokenization auto 8245 (LysandreJik)
- [EncoderDecoder] fix encoder decoder config model type bug 8243 (patrickvonplaten)
- [bart] 2 SinusoidalPositionalEmbedding fixes 8226 (stas00)
- [fix] Skip tatoeba tests if Tatoeba-Challenge not cloned 8260 (sshleifer)
- [FIX] TextGenerationPipeline is currently broken. 8256 (Narsil)
- Updated ConversationalPipeline to work with encoder-decoder models 8207 (guillaume-be)
- [distributed testing] forward the worker stderr to the parent process 8262 (stas00)
- [examples] minimal version requirement run-time check in PL 8133 (stas00)
- Clean Trainer tests and datasets dep 8268 (sgugger)
- improve documentation of training_args.py 8270 (PhilipMay)
- Data collator for token classification 8274 (sgugger)
- [CIs] Better reports everywhere 8275 (stas00)
- [blenderbot] regex fix 8282 (stas00)
- [Generate Test] fix greedy generate test 8293 (patrickvonplaten)
- Fix validation file loading in scripts 8298 (sgugger)
- Improve QA pipeline error handling 8286 (Narsil)
- Speedup doc build 8301 (sgugger)
- Fix path to old run_language_modeling.py script 8302 (mrm8488)
- Clean up data collators and datasets 8308 (sgugger)
- change TokenClassificationTask class methods to static methods 7902 (donchev7)
- Output global_attentions in Longformer models 7562 (gui11aume)
- Make Trainer evaluation handle dynamic seq_length 8336 (sgugger)
- Docs bart training ref 8330 (lvwerra)
- Some added tests for TokenClassificationArgumentHandler 8366 (Narsil)
- [All Seq2Seq model + CLM models that can be used with EncoderDecoder] Add cross-attention weights to outputs 8071 (ysgit)
- [TF generate] Cut encoder outputs to just last hidden states for now 8368 (patrickvonplaten)
- [make] rewrite modified_py_files in python to be cross-platform 8371 (stas00)
- Fix DataCollatorForWholeWordMask 8379 (cccntu)
- Fix DataCollatorForWholeWordMask again 8397 (cccntu)
- comet_ml init weirdness 8410 (stas00)
- updating tag for exbert viz 8408 (smanjil)
- Fix some tooling for windows 8359 (jplu)
- examples/docs: caveat that PL examples don't work on TPU 8309 (sshleifer)
- add evaluate doc - trainer.evaluate returns 'epoch' from training 8273 (PhilipMay)
- Bug fix for permutation language modelling 8409 (shngt)
- [fsmt tokenizer] support lowercase tokenizer 8389 (stas00)
- Bump tokenizers 8419 (sgugger)
- [fsmt convert script] fairseq broke chkpt data - fixing that 8377 (stas00)
- Deprecate old data/metrics functions 8420 (sgugger)
- [Tests] Add Common Test for Training + Fix a couple of bugs 8415 (patrickvonplaten)
- [docs] remove sshleifer from issue-template :( 8418 (sshleifer)
- Fix bart shape comment 8423 (sshleifer)
- [docs] [testing] gpu decorators table 8422 (stas00)
- Check all models are in an auto class 8425 (sgugger)
- [github CI] add a multi-gpu job for all example tests 8341 (stas00)
- Changing XLNet default from not using memories to 512 context size following paper 8417 (TevenLeScao)
- Patch token classification pipeline 8364 (LysandreJik)

3.4.0

ProphetNet, Blenderbot, SqueezeBERT, DeBERTa

ProphetNet

Two new models are released as part of the ProphetNet implementation: `ProphetNet` and `XLM-ProphetNet`.

ProphetNet is an encoder-decoder model that can predict n future tokens ("ngram" language modeling) instead of just the next token.

XLM-ProphetNet is an encoder-decoder model with an identical architecture to ProphetNet, but the model was trained on the multi-lingual "wiki100" Wikipedia dump.

The ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063), by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

It was added to the library in PyTorch with the following checkpoints:

- `microsoft/xprophetnet-large-wiki100-cased-xglue-ntg`
- `microsoft/prophetnet-large-uncased`
- `microsoft/prophetnet-large-uncased-cnndm`
- `microsoft/xprophetnet-large-wiki100-cased`
- `microsoft/xprophetnet-large-wiki100-cased-xglue-qg`
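
As a hedged illustration (not part of the release notes themselves), the CNN/DailyMail checkpoint above can be used for summarization; the article text and generation settings are placeholders:

```python
from transformers import ProphetNetForConditionalGeneration, ProphetNetTokenizer

tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased-cnndm")
model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased-cnndm")

article = "The US state of California has announced ..."  # placeholder article text
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```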

Contributions:
- ProphetNet 7157 (qiweizhen, patrickvonplaten)

BlenderBot

Blenderbot is an encoder-decoder model for open-domain chat. It uses a standard seq2seq model transformer-based architecture.

The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.

It was added to the library in PyTorch with the following checkpoints:
- `facebook/blenderbot-90M`
- `facebook/blenderbot-3B`
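
A hedged sketch of a single chat turn with the 90M checkpoint (the utterance is a placeholder; the small checkpoint pairs with the `BlenderbotSmallTokenizer`):

```python
from transformers import BlenderbotForConditionalGeneration, BlenderbotSmallTokenizer

name = "facebook/blenderbot-90M"
model = BlenderbotForConditionalGeneration.from_pretrained(name)
tokenizer = BlenderbotSmallTokenizer.from_pretrained(name)

utterance = "My friends are cool but they eat too many carbs."  # placeholder input
inputs = tokenizer([utterance], return_tensors="pt")
reply_ids = model.generate(**inputs)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```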

Contributions:
- Blenderbot 7418 (sshleifer)


SqueezeBERT

The SqueezeBERT model was proposed in [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It’s a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.

It was added to the library in PyTorch with the following checkpoints:

- `squeezebert/squeezebert-mnli`
- `squeezebert/squeezebert-uncased`
- `squeezebert/squeezebert-mnli-headless`

Contributions:
- SqueezeBERT architecture 7083 (forresti)
- Fix squeezebert docs 7587 (LysandreJik)

DeBERTa

The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's BERT model released in 2018 and Facebook's RoBERTa model released in 2019.

It was added to the library in PyTorch with the following checkpoints:

- `microsoft/deberta-base`
- `microsoft/deberta-large`

Contributions:
- Add DeBERTa model 5929 (BigBird01)
- Fix DeBERTa integration tests 7729 (LysandreJik)

Both SentencePiece and Tokenizers are now optional libraries

Support for SentencePiece is now part of the `tokenizers` library! Thanks to this we now have near-full support of fast tokenizers in the library.

With this new feature, we slightly change the paradigm regarding installation:
- SentencePiece is now an optional dependency, paving the way to a fully-featured conda install in the near future
- Tokenizers is now also an optional dependency, making it possible to install and use the library even when rust cannot be compiled on the machine.

- [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies 7659 (thomwolf)

The main `__init__` has been improved to always import the same functions and classes. If someone then tries to use a class that requires an optional dependency, an `ImportError` will be raised at init (with instructions on how to install the missing dependency) 7537 (sgugger)

Improvements made to the `Trainer`

The `Trainer` API has been improved to work with models requiring several labels or returning several outputs, and to have clearer progress tracking. A new `TrainerCallback` class has been added to let the user easily customize the default training loop (see the sketch after the list below).

- Remove config assumption in Trainer 7464 (sgugger)
- Clean the Trainer state 7490 (sgugger)
- Small QOL improvements to TrainingArguments 7475 (sgugger)
- Allow nested tensors in predicted logits 7542 (sgugger)
- Trainer callbacks 7596 (sgugger)
- Add specific notebook ProgressCalback 7793 (sgugger)
- Small fixes to NotebookProgressCallback 7813 (sgugger)
- Add predict step accumulation 7767 (sgugger)
- Don't use `store_xxx` on optional bools 7786 (sgugger)
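
A minimal sketch of hooking into the loop with the new `TrainerCallback`; the callback itself is purely illustrative:

```python
from transformers import TrainerCallback

class PrintLossCallback(TrainerCallback):
    """Illustrative callback that prints the training loss every time the Trainer logs."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "loss" in logs:
            print(f"step {state.global_step}: loss = {logs['loss']:.4f}")

# Passed at construction time, e.g.:
# trainer = Trainer(model=model, args=training_args, callbacks=[PrintLossCallback()])
```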

Seq2Seq Trainer
A child of `Trainer` specialized for training seq2seq models, from patil-suraj and sshleifer. Accessible through `examples/seq2seq/finetune_trainer.py`.
- example scripts at `examples/seq2seq/builtin_trainer/`
- same functionality as `examples/seq2seq/finetune.py`, but better TPU support.
- [examples/s2s] clean up finetune_trainer 7509 (patil-suraj)
- [s2s] trainer scripts: Remove --run_name, thanks sylvain! 7521 (sshleifer)
- [s2s] Adafactor support for builtin trainer 7522 (sshleifer)
- [s2s] add config params like Dropout in Seq2SeqTrainingArguments 7532 (patil-suraj)
- Distributed Trainer: 2 little fixes 7461 (sshleifer)
- [s2sTrainer] test + code cleanup 7467 (sshleifer)
- Seq2SeqDataset: avoid passing src_lang everywhere 7470 (amanpreet692)
- [s2strainer] fix eval dataset loading 7477 (patil-suraj)
- [pseudolabels] cleanup markdown table 7653 (sshleifer)

Distributed Generation
- You can run `model.generate` in pytorch on a large dataset and split the work across multiple GPUs, using `examples/seq2seq/run_distributed_eval.py`
- [s2s] release pseudolabel links and instructions 7639 (sshleifer)
- [s2s] Fix t5 warning for distributed eval 7487 (sshleifer)
- [s2s] fix kwargs style 7488 (sshleifer)
- [s2s] fix lockfile and peg distillation constants 7545 (sshleifer)
- [s2s] fix nltk pytest race condition with FileLock 7515 (sshleifer)

Notebooks

- Train T5 in TensorFlow 2 Community Notebook 7428 (HarrisDePerceptron)

General improvements and bugfixes

- remove codecov PR comments 7400 (sshleifer)
- Get a better error when check_copies fails 7457 (sgugger)
- Multi-GPU Testing setup 7453 (LysandreJik)
- Fix LXMERT with DataParallel 7471 (LysandreJik)
- Number of GPUs for multi-gpu 7472 (LysandreJik)
- Make transformers install check positive 7473 (FremyCompany)
- Alphabetize model lists 7478 (sgugger)
- Bump isort version. 7484 (sgugger)
- Add forgotten return_dict argument in the docs 7483 (sgugger)
- Enable pegasus fp16 by clamping large activations 7243 (sshleifer)
- Update LayoutLM doc 7388 (al31415)
- Report Tune metrics in final evaluation 7507 (krfricke)
- Fix Ray Tune progress_reporter kwarg 7508 (krfricke)
- [Seq2Seq] Fix a couple of bugs and clean examples 7474 (patrickvonplaten)
- [Attention Mask] Fix data type 7513 (patrickvonplaten)
- Fix seq2seq example test 7518 (sgugger)
- Remove labels from the RagModel example 7560 (sgugger)
- added script for fine-tuning roberta for sentiment analysis task 7505 (DhavalTaunk08)
- LayoutLM: add exception handling for bbox values 7452 (al31415)
- Cleanup documentation for BART, Marian, MBART and Pegasus 7523 (sgugger)
- Add Electra unexpected keys 7569 (LysandreJik)
- Fix tokenization in SQuAD for RoBERTa, Longformer, BART 7387 (tholor)
- docs(pretrained_models): fix num parameters 7575 (amineabdaoui)
- Update Code example according to deprecation of AutoModeWithLMHead 7555 (jshamg)
- Allow soft dependencies in the namespace with ImportErrors at use 7537 (sgugger)
- Fix post_init of some TrainingArguments 7525 (sgugger)
- Check and update model list in index.rst automatically 7527 (sgugger)
- Expand test to locate flakiness 7580 (sgugger)
- Custom TF weights loading 7422 (jplu)
- Documentation fixes 7585 (sgugger)
- Documentation framework toggle should stick 7586 (LysandreJik)
- Support T5 Distillation w/hidden state supervision 7599 (sshleifer)
- [makefile] check only .py files 7588 (stas00)
- [TF generation] Fix typo 7582 (SidJain1412)
- change return dictionary for DataCollatorForNextSentencePrediction from masked_lm_labels to labels 7595 (gmihaila)
- Docker GPU Images: Add NVIDIA/apex to the cuda images with pytorch 7598 (AdrienDS)
- typo fix 7611 (agemagician)
- [bart] fix config.classif_dropout 7593 (sshleifer)
- [s2s] save first batch to json for debugging purposes 6810 (sshleifer)
- Add GPT2ForSequenceClassification based on DialogRPT 7501 (LysandreJik)
- Fix wrong reference name/filename in docstring of `SquadProcessor` 7616 (phiyodr)
- Fix tokenizer UnboundLocalError when padding is set to PaddingStrategy.MAX_LENGTH 7610 (GabrielePicco)
- Add GPT2 to sequence classification auto model 7630 (LysandreJik)
- Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load 6935 (w4nderlust)
- Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer 7141 (thomwolf)
- Green tests: update torch-hub test dependencies (add protobuf and pin tokenizer 0.9.0-RC2) 7658 (thomwolf)
- Fix RobertaForCausalLM docs 7642 (LysandreJik)
- [s2s] configure lr_scheduler from command line 7641 (patil-suraj)
- [pseudo] Switch URLS to CDN 7661 (sshleifer)
- [s2s] Switch README urls to cdn 7670 (sshleifer)
- fix nn.DataParallel compatibility with PyTorch 1.5 7671 (guhur)
- Update XLM-RoBERTa pretrained model details 7669 (noahtren)
- Fix dataset cardinality 7678 (jplu)
- [pegasus] Faster tokenizer tests 7672 (stas00)
- Delete extra test file in repo root 7681 (sshleifer)
- Better links for models in README and doc index 7680 (sgugger)
- Import integration libraries first 7650 (dsblank)
- Fix title level in Blenderbot doc 7687 (sgugger)
- Fix flaky test in test_trainer 7689 (sgugger)
- Adds license information for default and distilbert models 7688 (ankane)
- Fix docstring in AutoModel class 7694 (al31415)
- [examples] bump pl=0.9.0 7053 (sshleifer)
- Corrected typo: maked → masked 7703 (miggymigz)
- fixed typo in warning line 207. 7718 (Berowne)
- Fix typo in all model docs 7714 (sgugger)
- Fix check for xla in PreTrainedModel.save_pretrained() 7699 (fteufel)
- Minor spelling corrections in docstrings. "information" is uncountable in English and has no plural. 7696 (AndreaSottana)
- The input training data files (multiple files in glob format). 7717 (kfkelvinng)
- Fix trainer callback 7720 (cccntu)
- Fix tf text class 7724 (jplu)
- Fix 7731 7732 (LysandreJik)
- Fix 3 failing slow bart/blender tests 7652 (sshleifer)
- Add license info to nlptown/bert-base-multilingual-uncased-sentiment 7738 (alexcombessie)
- [marian] Automate Tatoeba-Challenge conversion 7709 (sshleifer)
- ElectraTokenizerFast 7754 (LysandreJik)
- Gpt1 for sequence classification 7683 (fmcurti)
- [Rag] Fix loading of pretrained Rag Tokenizer 7756 (patrickvonplaten)
- Do not softmax when num_labels==1 7726 (LysandreJik)
- Avoid unnecessary DDP synchronization when gradient_accumulation_steps > 1 7742 (noamwies)
- fixed lots of typos. 7758 (NieTiger)
- Adding optional trial argument to model_init 7759 (madlag)
- Faster pegasus tokenization test with reduced data size 7762 (sshleifer)
- Fix bert position ids in DPR convert script 7776 (lhoestq)
- Add batch inferencing support for GPT2LMHeadModel 7552 (cccntu)
- fix examples/rag imports, tests 7712 (sshleifer)
- Fix TF savedmodel in Roberta 7795 (jplu)
- Improving Pipelines by defaulting to framework='tf' when pytorch seems unavailable. 7728 (Narsil)
- Upgrading in pipelines TFAutoModelWithLMHead to new Causal/Masked/Seq2Seq LM classes 7730 (Narsil)
- fix wandb/comet problems 7830 (stas00)
- [utils/check_copies.py] fix DeprecationWarning 7834 (stas00)
- [DOC] Typo and fix the input of labels to `cross_entropy` 7841 (katarinaslama)
- [seq2seq] get_git_info fails gracefully 7843 (stas00)
- [Pipelines] Fix links to model lists 7826 (julien-c)
- Herbert polish model 7798 (rmroczkowski)
- [cleanup] assign todos, faster bart-cnn test 7835 (sshleifer)
- Remove masked_lm_labels from returned dictionary in DataCollatorForNextSentencePrediction 7818 (vblagoje)
- [testing] fix/hide warnings 7837 (stas00)
- Small fixes to HP search 7839 (sgugger)
- [testing] disable FutureWarning in examples tests 7842 (stas00)
- Fix missing reference titles in retrieval evaluation of RAG 7817 (lhoestq)
- [seq2seq testing] improve readability 7845 (stas00)
- [s2s testing] turn all to unittests, use auto-delete temp dirs 7859 (stas00)
- Fix Rag example docstring 7872 (patrickvonplaten)
- Remove duplicated mish activation function 7856 (Razcle)
- [tests] fix slow bart cnn test, faster marian tests 7888 (sshleifer)
- Fix small type hinting error 7820 (AndreaSottana)
- Add support to provide initial tokens to decoder of encoder-decoder type models 7577 (ayushtiku5)
- style: fix typo 7883 (rememberYou)
- [testing] remove USE_CUDA 7861 (stas00)
- [CIs] report slow tests add --durations=0 to some pytest jobs 7884 (stas00)
- style: fix typo in the README 7882 (rememberYou)
- [RAG] Propagating of n_docs as parameter to all RagModel's related functions 7891 (lalitpagaria)
- Trainer with Iterable Dataset 7858 (j-rossi-nl)
- Allow Custom Dataset in RAG Retriever 7763 (lhoestq)
- Modelling Encoder-Decoder | Error :- `decoder_config` used before initialisation 7903 (ayubSubhaniya)
- [Docstring] fix t5 training docstring 7911 (patrickvonplaten)
- Raise error when using AMP on non-CUDA device 7869 (BramVanroy)
- [EncoderDecoder] Fix Typo 7915 (patrickvonplaten)
- [testing] rename skip targets + docs 7863 (stas00)

3.3.1

Fixes errors due to name conflicts between the `datasets` library and local folders or modules named `datasets`.

3.3.0

RAG

RAG Model

The RAG model is a retrieval-augmented generation model that can be leveraged for question-answering tasks using `RagTokenForGeneration` or `RagSequenceForGeneration`, as proposed in [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.

It was added to the library in PyTorch with the following checkpoints:

- `facebook/rag-token-nq`
- `facebook/rag-sequence-nq`
- `facebook/rag-token-base`
- `facebook/rag-sequence-base`
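
A hedged usage sketch with the token variant (the dummy index keeps it lightweight; a real index additionally requires the `datasets` and `faiss` libraries, and the question is a placeholder):

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```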

Contributions:
- **RAG** 6813 (ola13)
- [RAG] Add `attention_mask` to RAG generate 7373 (patrickvonplaten)
- [RAG] Add missing doc and attention_mask to rag 7382 (patrickvonplaten)
- [Rag] Fix wrong usage of `num_beams` and `bos_token_id` in Rag Sequence generation 7386 (patrickvonplaten)
- [RAG] Fix retrieval offset in RAG's HfIndex and better integration tests 7372 (lhoestq)
- [RAG] Remove dependency on `examples/seq2seq` from rag 7395 (ola13)
- [Rag] fix rag retriever save_pretrained method 7399 (patrickvonplaten)
- [RAG] Clean Rag readme in examples 7413 (ola13)
- [RAG] Model cards - clean cards 7420 (patrickvonplaten)
- Document RAG again 7377 (sgugger)

Bug fixes and improvements

- Mark big downloads slow 7325 (sgugger)
- [Bug Fix] The actual batch_size is inconsistent with the settings. 7235 (HuangLianzhe)
- Fixed results of SQuAD-FR evaluation 7313 (psorianom)
- [s2s] add supported architecures to MD 7252 (sshleifer)
- Add num workers cli arg 7322 (chadykamar)
- [s2s] add src_lang kwarg for distributed eval 7300 (sshleifer)
- [s2s] only save metrics.json from rank zero 7331 (sshleifer)
- [code quality] fix confused flake8 7309 (stas00)
- [testing] skip decorators: docs, tests, bugs 7334 (stas00)
- Fixed evaluation_strategy on epoch end bug 7340 (WissamAntoun)
- Models doc 7345 (sgugger)
- Ensure that integrations are imported before transformers or ml libs 7330 (dsblank)
- [Benchmarks] Change all args to from `no_...` to their positive form 7075 (fmcurti)
- Remove reference to args in XLA check 7344 (ZeroCool2u)
- wip: Code to add lang tags to marian model cards 6586 (sshleifer)
- Expand a bit the documentation doc 7350 (sgugger)
- Check decorator order 7326 (sgugger)
- Update modeling_tf_longformer.py 7359 (Line290)
- Update tokenization_auto.py 6870 (hjptriplebee)
- Update the TF models to remove their interdependencies 7238 (jplu)
- Make PyTorch model files independent from each other 7352 (sgugger)
- Clean RAG docs and template docs 7348 (sgugger)
- Fixing case in which `Trainer` hung while saving model in distributed training 7365 (TevenLeScao)
- Formatter 7368 (LysandreJik)
- [seq2seq] make it easier to run the scripts 7274 (stas00)
- Remove mentions of RAG from the docs 7376 (sgugger)
- [fsmt] build/test scripts 7257 (stas00)
- [s2s] distributed eval allows num_return_sequences > 1 7254 (sshleifer)
- Seq2SeqTrainer 6769 (patil-suraj)
- modeling_bart: 3 small cleanups that dont change outputs 7381 (sshleifer)
- Check config type using `type` instead of `isinstance` 7363 (LysandreJik)
- [s2s, examples] minor doc changes 7385 (patil-suraj)
- Remove unhelpful bart warning 7391 (sshleifer)
- [code quality] new make target that combines style and quality targets 7310 (stas00)
- Speedup check_copies script 7394 (sgugger)
- Fix BartModel output documentation 7390 (sgugger)
- Fix FP16 and attention masks in FunnelTransformer 7374 (sgugger)
- [Longformer, Bert, Roberta, ...] Fix multi gpu training 7272 (patrickvonplaten)
- [s2s] add create student script 7290 (patil-suraj)
- [s2s] rougeLSum expects \n between sentences 7410 (sshleifer)
- [T5] allow config.decoder_layers to control decoder size 7409 (sshleifer)
- Flos fix 7384 (marrrcin)
- Catch PyTorch warning when saving/loading scheduler 7401 (sgugger)
- Pull request template 7392 (LysandreJik)
- Reorganize documentation navbar 7423 (sgugger)

3.2.0

Bert Seq2Seq models, FSMT, Funnel Transformer, LXMERT

BERT Seq2seq models

The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using EncoderDecoderModel as proposed in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

It was added to the library in PyTorch with the following checkpoints:

- `google/roberta2roberta_L-24_bbc`
- `google/roberta2roberta_L-24_gigaword`
- `google/roberta2roberta_L-24_cnn_daily_mail`
- `google/roberta2roberta_L-24_discofuse`
- `google/roberta2roberta_L-24_wikisplit`
- `google/bert2bert_L-24_wmt_de_en`
- `google/bert2bert_L-24_wmt_en_de`

Contributions:

- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. 6594 (patrickvonplaten)

FSMT (FairSeq MachineTranslation)

FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR’s WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616) by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.

It was added to the library in PyTorch, with the following checkpoints:

- `facebook/wmt19-en-ru`
- `facebook/wmt19-en-de`
- `facebook/wmt19-ru-en`
- `facebook/wmt19-de-en`

Contributions:

- [ported model] FSMT (FairSeq MachineTranslation) 6940 (stas00)
- build/eval/gen-card scripts for fsmt 7155 (stas00)
- skip failing FSMT CUDA tests until investigated 7220 (stas00)
- [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test 7224 (stas00)
- [s2s] adjust finetune + test to work with fsmt 7263 (stas00)
- [fsmt] SinusoidalPositionalEmbedding no need to pass device 7292 (stas00)
- Adds FSMT to LM head AutoModel 7312 (LysandreJik)

LayoutLM

The LayoutLM model was proposed in [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.

It was added to the library in PyTorch with the following checkpoints:

- `layoutlm-base-uncased`
- `layoutlm-large-uncased`

Contributions:

- Add LayoutLM Model 7064 (liminghao1630)
- Fixes for LayoutLM 7318 (sgugger)

Funnel Transformer

The Funnel Transformer model was proposed in the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236). It is a bidirectional transformer model, like BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks (CNN) in computer vision.

It was added to the library in both PyTorch and TensorFlow, with the following checkpoints:

- `funnel-transformer/small`
- `funnel-transformer/small-base`
- `funnel-transformer/medium`
- `funnel-transformer/medium-base`
- `funnel-transformer/intermediate`
- `funnel-transformer/intermediate-base`
- `funnel-transformer/large`
- `funnel-transformer/large-base`
- `funnel-transformer/xlarge`
- `funnel-transformer/xlarge-base`

Contributions:

- Funnel transformer 6908 (sgugger)
- Add TF Funnel Transformer 7029 (sgugger)

LXMERT

The LXMERT model was proposed in [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities) pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.

It was added to the library in TensorFlow with the following checkpoints:

- `unc-nlp/lxmert-base-uncased`
- `unc-nlp/lxmert-vqa-uncased`
- `unc-nlp/lxmert-gqa-uncased`

Contributions

- Adding the LXMERT pretraining model (MultiModal languageXvision) to HuggingFace's suite of models 5793 (eltoto1219)
- [LXMERT] Fix tests on gpu 6946 (patrickvonplaten)

New pipelines

The following pipeline was added to the library (a short usage sketch follows the list):

- [pipelines] Text2TextGenerationPipeline 6744 (patil-suraj)
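
A hedged usage sketch for the new pipeline; the checkpoint and prompt are illustrative:

```python
from transformers import pipeline

text2text = pipeline("text2text-generation", model="t5-small")
print(text2text("translate English to German: Hello, how are you?"))
# -> [{'generated_text': ...}]
```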

Notebooks

The following community notebooks were contributed to the library:

- Demoing LXMERT with raw images by incorporating the FRCNN model for ROI-pooled extraction and bounding-box prediction on the GQA answer set. 6986 (eltoto1219)
- [Community notebooks] Add notebook on fine-tuning GPT-2 Model with Trainer Class 7005 (philschmid)
- Add "Fine-tune ALBERT for sentence-pair classification" notebook to the community notebooks 7255 (NadirEM)
- added multilabel text classification notebook using distilbert to community notebooks 7201 (DhavalTaunk08)

Encoder-decoder architectures

An additional encoder-decoder architecture was added:

- [EncoderDecoder] Add xlm-roberta to encoder decoder 6878 (patrickvonplaten)

Bug fixes and improvements

- TF Flaubert w/ pre-norm 6841 (LysandreJik)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task 6644 (HuangLianzhe)
- Fix in Adafactor docstrings 6845 (sgugger)
- Fix resuming training for Windows 6847 (sgugger)
- Only access loss tensor every logging_steps 6802 (jysohn23)
- Marian distill scripts + integration test 6799 (sshleifer)
- Add checkpointing to Ray Tune HPO 6747 (krfricke)
- Split hp search methods 6857 (sgugger)
- Update ONNX notebook to include section on quantization. 6831 (mfuntowicz)
- Fix marian slow test 6854 (sshleifer)
- [s2s] command line args for faster val steps 6833 (sshleifer)
- Bart can make decoder_input_ids from labels 6758 (sshleifer)
- add a final report to all pytest jobs 6861 (stas00)
- Logging doc 6852 (sgugger)
- Restore PaddingStrategy.MAX_LENGTH on QAPipeline while no v2. 6875 (mfuntowicz)
- [Generate] Facilitate PyTorch generate using `ModelOutputs` 6735 (patrickvonplaten)
- Add cache_dir to save features TextDataset 6879 (jysohn23)
- [Docs, Examples] Fix QA example for PT 6890 (patrickvonplaten)
- Update modeling_bert.py 6897 (parthe)
- [Electra] fix warning for position ids 6884 (patrickvonplaten)
- minor docs grammar fixes 6889 (harrywang)
- Fix error class instantiation 6634 (tamuhey)
- Output attention takes an s 6903 (sgugger)
- [testing] fix ambiguous test 6898 (stas00)
- test_tf_common: remove un_used mixin class parameters 6866 (PuneethaPai)
- Template updates 6914 (sgugger)
- Changed link to the correct paper in the second paragraph 6905 (sengl)
- tweak tar command in readme 6919 (brettkoonce)
- [s2s]: script to convert pl checkpoints to hf checkpoints 6911 (sshleifer)
- [s2s] allow task_specific_params=summarization_xsum 6923 (sshleifer)
- move wandb/comet logger init to train() to allow parallel logging 6850 (krfricke)
- [s2s] use --eval_beams command line arg 6926 (sshleifer)
- [s2s] support early stopping based on loss, rather than rouge 6927 (sshleifer)
- Fix mixed precision issue in TF DistilBert 6915 (chiapas)
- [docstring] misc arg doc corrections 6932 (stas00)
- [s2s] distill: --normalize_hidden --supervise_forward 6834 (sshleifer)
- [s2s] run_eval.py parses generate_kwargs 6948 (sshleifer)
- [doc] remove the implied defaults to :obj:`None`, s/True/ :obj:`True/, etc. 6956 (stas00)
- [s2s] warn if --fp16 for torch 1.6 6977 (sshleifer)
- feat: allow prefix for any generative model 5885 (borisdayma)
- Trainer with grad accum 6930 (sgugger)
- Cannot index `None` 6984 (LysandreJik)
- [docstring] missing arg 6933 (stas00)
- [testing] add dependency: parametrize 6958 (stas00)
- Fixed the default number of attention heads in Reformer Configuration 6973 (tznurmin)
- [gen utils] missing else case 6980 (stas00)
- match CI's version of flake8 6941 (stas00)
- Conversion scripts shouldn't have relative imports 6991 (LysandreJik)
- Add missing arguments for BertWordPieceTokenizer 5810 (monologg)
- fixed trainer tr_loss memory leak 6999 (StuartMesham)
- Floating-point operations logging in trainer 6768 (TevenLeScao)
- Fixing FLOPS merge by checking if torch is available 7013 (LysandreJik)
- [Longformer] Fix longformer documentation 7016 (patrickvonplaten)
- pegasus.rst: fix expected output 7017 (sshleifer)
- adding TRANSFORMERS_VERBOSITY env var 6961 (stas00)
- [generation] consistently add eos tokens 6982 (stas00)
- [from_pretrained] Allow tokenizer_type ≠ model_type 6995 (julien-c)
- replace torch.triu with onnx compatible code 6929 (HenryDashwood)
- Batch encode plus and overflowing tokens fails when non existing overflowing tokens for a sequence 6677 (LysandreJik)
- add -y to bypass prompt for transformers-cli upload 7035 (stas00)
- Fix confusing warnings during TF2 import from PyTorch 6623 (jcrocholl)
- Albert pretrain datasets/ datacollator 6168 (yl-to)
- Fix template 7040 (LysandreJik)
- Small fixes in tf template 7044 (sgugger)
- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. 6594 (patrickvonplaten)
- fix to ensure that returned tensors after the tokenization is Long 7039 (GeetDsa)
- [BertGeneration] Correct Doc Title 7048 (patrickvonplaten)
- [BertGeneration, Docs] Fix another old name in docs 7050 (patrickvonplaten)
- [xlm tok] config dict: fix str into int to match definition 7034 (stas00)
- [s2s] --eval_max_generate_length 7018 (sshleifer)
- Fix CI with change of name of nlp 7054 (sgugger)
- [wip/s2s] DistributedSortishSampler 7056 (sshleifer)
- these tests require non-multigpu env 7059 (stas00)
- [BertGeneration] Clean naming 7068 (patrickvonplaten)
- Document the dependency on datasets 7058 (sgugger)
- Automate the lists in auto-xxx docs 7061 (sgugger)
- Add tests and fix various bugs in ModelOutput 7073 (sgugger)
- Compute loss method 7074 (sgugger)
- [T5Tokenizer] remove prefix_tokens 7078 (patil-suraj)
- [s2s] run_eval supports --prefix clarg. 6953 (sshleifer)
- fix bug in pegasus converter 7094 (sshleifer)
- [s2s] two stage run_distributed_eval.py 7105 (sshleifer)
- Update xsum length penalty to better values 7107 (sshleifer)
- [s2s] distributed eval cleanup 7110 (sshleifer)
- [s2s distill] allow pegasus-12-12 7104 (sshleifer)
- Temporarily skip failing tests due to dependency change 7118 (LysandreJik)
- fix link to paper 7116 (btel)
- ignore FutureWarning in tests 7079 (stas00)
- fix deprecation warnings 7033 (stas00)
- [examples testing] restore code 7099 (stas00)
- Clean up autoclass doc 7081 (sgugger)
- Add Mirror Option for Downloads 6679 (JetRunner)
- [s2s] distributed eval in one command 7124 (sshleifer)
- [QOL] add signature for prepare_seq2seq_batch 7108 (sshleifer)
- Fix reproducible tests in Trainer 7119 (sgugger)
- [logging] remove no longer needed verbosity override 7100 (stas00)
- Fix TF Trainer loss calculation 6998 (chiapas)
- Add quotes to paths in MeCab arguments 7142 (polm)
- Multi predictions trainer 7126 (sgugger)
- fix ZeroDivisionError and epoch counting 7125 (chiapas)
- [EncoderDecoderModel] fix indentation error 7131 (patrickvonplaten)
- [docs] add testing documentation 7101 (stas00)
- Refactoring the TF activations functions 7150 (jplu)
- fix the warning message of overflowed sequence 7151 (xiye17)
- [doc] [testing] improve/expand the Parametrization section 7156 (stas00)
- Add empty random document case to DataCollatorForNextSentencePrediction 7161 (choidongyeon)
- [s2s run_eval] new features 7109 (stas00)
- use the correct add_start_docstrings 7174 (stas00)
- [s2s] distributed eval cleanup 7186 (sshleifer)
- remove duplicated code 7173 (stas00)
- remove deprecated flag 7171 (stas00)
- Transformer-XL: Remove unused parameters 7087 (RafaelWO)
- Trainer multi label 7191 (sgugger)
- Change to use relative imports in some files & Add python prompt symbols to example codes 7202 (soheeyang)
- [s2s] run_eval/run_eval_search tweaks 7192 (stas00)
- [s2s] dynamic batch size with --max_tokens_per_batch 7030 (sshleifer)
- [s2s] remove double assert 7223 (sshleifer)
- Add customized text to widget 7204 (mrm8488)
- Rewrites BERT in Flax to the new Linen API 7211 (marcvanzee)
- token-classification: update url of GermEval 2014 dataset 6571 (stefan-it)
- Fix a few countings (steps / epochs) in trainer_tf.py 7175 (chiapas)
- Add new pre-trained models BERTweet and PhoBERT 6129 (datquocnguyen)
- [s2s] distributed_eval.py saves better speed info 7242 (sshleifer)
- [testing doc] slow has to be last 7251 (stas00)
- examples/seq2seq/__init__.py mutates sys.path 7194 (stas00)
- [Bug fix] Fixed target_mapping preparation for XLNet (Pytorch) 7267 (guillaume-be)
- [example/glue] fix compute_metrics_fn for bart like models 7248 (patil-suraj)
- Disable missing weight warning for RobertaForMaskedLM/CamembertForMaskedLM 7282 (raphael0202)
- Fix 7284 7289 (sgugger)
- [s2s tests] fix test_run_eval_search 7297 (stas00)
- [s2s] s/alpha_loss_encoder/alpha_encoder_loss/ 7298 (stas00)
- [s2s] save hostname with repo info 7301 (sshleifer)
- Copy code from Bert to Roberta and add safeguard script 7219 (sgugger)
- Fix 7304 7305 (sgugger)
- Fix saving TF custom models 7291 (jplu)
- is_pretokenized -> is_split_into_words 7236 (sgugger)
- Add possibility to evaluate every epoch 7302 (sgugger)
- Support for Windows in check_copies 7316 (sgugger)
- Create an XLA parameter and fix the mixed precision 7311 (jplu)

3.1.0

Pegasus, mBART, DPR, self-documented outputs and new pipelines

Pegasus

The Pegasus model from [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.

Model implemented as a collaboration between Jingqing Zhang and sshleifer in 6340

- PegasusForConditionalGeneration (torch version) 6340
- add pegasus finetuning script 6811 [script](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh) (warning: very slow)


DPR

The DPR model from [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.

- Add DPR model 5279 (lhoestq)
- Fix tests imports dpr 5576 (lhoestq)

DeeBERT

The DeeBERT model from [DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference](https://www.aclweb.org/anthology/2020.acl-main.204/) by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the `examples/` folder alongside its training script, in PyTorch.

- Add DeeBERT (entropy-based early exiting for *BERT) 5477 (Ji-Xin)

Self-documented outputs

As well as returning tuples, PyTorch and TensorFlow models now return a subclass of `ModelOutput` that is appropriate. A `ModelOutput` is a dataclass containing all model returns. This allows for easier inspection, and for self-documenting model outputs.

- Change model outputs types to self-document outputs 5438 (sgugger)
- Tf model outputs 6247 (sgugger)

Models return tuples by default, and return self-documented outputs if the `return_dict` configuration flag is set to `True` or if the `return_dict=True` keyword argument is passed to the forward/call method.

Summary of the behavior:
```python
# The new outputs are opt-in, you have to activate them explicitly with `return_dict=True`,
# either at instantiation
model = BertForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)
# or when calling the model
outputs = model(**inputs, return_dict=True)

# You can access the elements of the outputs with
# (1) named attributes
loss = outputs.loss
logits = outputs.logits

# (2) their names as strings, like a dict
loss = outputs["loss"]
logits = outputs["logits"]

# (3) their index as integers or slices, as in the pre-3.1.0 output tuples
loss = outputs[0]
logits = outputs[1]
loss, logits = outputs[:2]

# One breaking behavior of these new outputs (which is the reason you have to opt in to use them):
# iterating on the outputs now returns the names (keys) instead of the values
print([element for element in outputs])
# >>> ['loss', 'logits']
# Thus you cannot unpack the output like pre-3.1.0 (you would get the string names instead of the values),
# but you can still query a slice as indicated in (3) above
loss_key, logits_key = outputs
```


Encoder-Decoder framework

The encoder-decoder framework has been enhanced to allow more encoder-decoder model combinations, *e.g.*:
Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, ... (a minimal sketch follows the list below).

- [EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer 6411 (patrickvonplaten)
- [EncoderDecoder] Add Cross Attention for GPT2 6415 (patrickvonplaten)
- [EncoderDecoder] Add functionality to tie encoder decoder weights 6538 (patrickvonplaten)
- Multiple combinations of EncoderDecoder models have been fine-tuned and evaluated on CNN/Daily-Mail summarization: https://huggingface.co/models?search=cnn_dailymail-fp16 (patrickvonplaten)
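
A minimal sketch of building one such combination (Bert2Bert) with the framework; the checkpoint is illustrative and the resulting model still needs fine-tuning:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The decoder copy of BERT is instantiated with cross-attention layers and causal masking
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

inputs = tokenizer("A sentence to encode and reconstruct.", return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=inputs["input_ids"])
```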

TensorFlow as a first-class citizen

As we continue working towards having TensorFlow be a first-class citizen, we continually improve on our TensorFlow API and models.

- [Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile 5395 (patrickvonplaten)
- [Benchmark] Add benchmarks for TF Training 5594 (patrickvonplaten)

Machine Translation


MarianMTModel

- [en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh?text=My+name+is+Wolfgang+and+I+live+in+Berlin) and **357** other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (sshleifer + jorgtied). There are now > 1300 supported pairs for machine translation (a short usage sketch follows this list).
- Marian converter updates 6342 (sshleifer)
- Marian distill scripts + integration test 6799 (sshleifer)
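
A hedged sketch of translating with the en-zh checkpoint linked above (the sentence mirrors the inference-widget example):

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer.prepare_seq2seq_batch(["My name is Wolfgang and I live in Berlin"], return_tensors="pt")
translated_ids = model.generate(**batch)
print(tokenizer.decode(translated_ids[0], skip_special_tokens=True))
```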

mBART

The mBART model from [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) can now be accessed through `MBartForConditionalGeneration`.

- Add mbart-large-cc25, support translation finetuning 5129 (sshleifer)
- [mbart] prepare_translation_batch passes **kwargs to allow DeprecationWarning 5581 (sshleifer)
- MBartForConditionalGeneration 6441 (patil-suraj)
- [fix] mbart_en_ro_generate test now identical to fairseq 5731 (sshleifer)
- [Doc] explaining romanian postprocessing for MBART BLEU hacking 5943 (sshleifer)
- [test] partial coverage for train_mbart_enro_cc25.sh 5976 (sshleifer)
- MbartTokenizer: do not hardcode vocab size 5998 (sshleifer)
- MBART: support summarization tasks where max_src_len > max_tgt_len 6003 (sshleifer)
- Fix 6096: MBartTokenizer's mask token 6098 (sshleifer)
- [s2s] Document better mbart finetuning command 6229 (sshleifer)
- mBART Conversion script 6230 (sshleifer)
- [s2s] add BartTranslationDistiller for distilling mBART 6363 (sshleifer)
- [Doc] add more MBart and other doc 6490 (patil-suraj)

examples/seq2seq

- examples/seq2seq/finetune.py supports --task translation
- All sequence-to-sequence tokenizers (T5, Bart, Marian, Pegasus) expose a `prepare_seq2seq_batch` method that makes batches for sequence-to-sequence training.

PRs:

- Seq2SeqDataset uses linecache to save memory 5792 (Pradhy729)
- [examples/seq2seq]: add --label_smoothing option 5919 (sshleifer)
- seq2seq/run_eval.py can take decoder_start_token_id 5949 (sshleifer)
- [examples (seq2seq)] fix preparing decoder_input_ids for T5 5994 (patil-suraj)
- [s2s] add support for overriding config params 6149 (stas00)
- s2s: fix LR logging, remove some dead code. 6205 (sshleifer)
- [s2s] tiny QOL improvement: run_eval prints scores 6341 (sshleifer)
- [s2s] fix label_smoothed_nll_loss 6344 (patil-suraj)
- [s2s] fix --gpus clarg collision 6358 (sshleifer)
- [s2s] Script to save wmt data to disk 6403 (sshleifer)
- rename prepare_translation_batch -> prepare_seq2seq_batch 6103 (sshleifer)
- Mult rouge by 100: standard units 6359 (sshleifer)
- allow spaces in bash args with "$" 6521 (sshleifer)
- [seq2seq] MAX_LEN env var for MT commands 5837 (sshleifer)
- [seq2seq] distillation.py accepts trainer arguments 5865 (sshleifer)
- [s2s]Use prepare_translation_batch for Marian finetuning 6293 (sshleifer)
- [BartTokenizer] add prepare s2s batch 6212 (patil-suraj)
- [T5Tokenizer] add prepare_seq2seq_batch method 6122 (patil-suraj)
- [s2s] round runtime in run_eval 6798 (sshleifer)
- [s2s README] Add more dataset download instructions 6737 (sshleifer)
- [s2s] round bleu, rouge to 4 digits 6704 (sshleifer)
- [s2s] command line args for faster val steps 6833


New documentation

Several new documentation pages have been added and older documentation has been tweaked to be more accurate and understandable. An "Open in Colab" button has been added on the tutorial pages.

- Guide to fixed-length model perplexity evaluation 5449 (joeddav)
- Improvements to PretrainedConfig documentation 5642 (sgugger)
- Document model outputs 5673 (sgugger)
- docs(wandb): explain how to use W&B integration 5607 (borisdayma)
- Model utils doc 6005 (sgugger)
- ONNX documentation 5992 (mfuntowicz)
- Tokenizer documentation 6110 (sgugger)
- Pipeline documentation 6175 (sgugger)
- Encoder decoder config docs 6195 (afcruzs)
- Colab button 6389 (sgugger)
- Generation documentation 6470 (sgugger)
- Add custom datasets tutorial 6466 (joeddav)
- Logging documentation 6852 (sgugger)

Trainer updates

New additions to the `Trainer`

- Added data collator for permutation (XLNet) language modeling and related calls 5522 (shngt)
- Trainer support for iterabledataset 5834 (Pradhy729)
- Adding PaddingDataCollator 6442 (sgugger)
- Add hyperparameter search to Trainer 6576 (sgugger)
- [examples] Add trainer support for question-answering 4829 (patil-suraj)
- Adds comet_ml to the list of auto-experiment loggers 6176 (dsblank)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task 6644 (HuangLianzhe)

New models & model architectures

The following model architectures have been added to the library

- FlaubertForTokenClassification 5644 (stas00)
- TFXLMForTokenClassification 5614 (LysandreJik)
- TFXLMForMultipleChoice 5614 (LysandreJik)
- TFFlaubertForTokenClassification 5614 (LysandreJik)
- TFFlaubertForMultipleChoice 5614 (LysandreJik)
- TFElectraForSequenceClassification 6227 (jplu)
- TFElectraForMultipleChoice 6227 (jplu)
- TF Longformer 5764 (patrickvonplaten)
- CamembertForCausalLM 6577 (patil-suraj)

Regression testing on TPU & TPU CI

Thanks to zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the `Trainer`, and offers very simple regression testing on model training performance.

- Test XLA examples 5583
- Add setup for TPU CI to run every hour. 6219 (zcain117)
- Add missing docker arg for TPU CI. 6393 (zcain117)
- Get GKE logs via kubectl logs instead of gcloud logging read. 6446 (zcain117)

New pipelines

New pipelines have been added:

- Zero shot classification pipeline 5760 (joeddav); a sketch follows this list
- Addition of a DialoguePipeline 5516 (guillaume-be)
- Add targets arg to fill-mask pipeline 6239 (joeddav)
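
For instance, a minimal sketch of the zero-shot classification pipeline; the input text and candidate labels are arbitrary, and the pipeline's default model is downloaded on first use:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "The release adds TPU CI and a centralized logging API.",
    candidate_labels=["software", "sports", "cooking"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label and its score
```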

Community notebooks

- [Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) 6321 (elsanns)
- Update ONNX notebook to include section on quantization. 6831 (mfuntowicz)

Centralized logging

Logging is now centralized. The library exposes methods to control the verbosity of every logger it defines; see the logging documentation for details. A short usage sketch follows the PR below:

- Centralize logging 6434 (LysandreJik)
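
A minimal sketch of the new verbosity controls; the import path below assumes the helpers live in `transformers.utils.logging` as introduced for centralized logging (check the logging documentation for the exact path in your version):

```python
from transformers.utils import logging

logging.set_verbosity_info()     # show INFO-level messages from all library loggers
print(logging.get_verbosity())   # current numeric verbosity level
logging.set_verbosity_error()    # only report errors from the library
```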

Bug fixes and improvements

- [Reformer] Adapt Reformer MaskedLM Attn mask 5560 (patrickvonplaten)
- Make T5 compatible with ONNX 5518 (abelriboulot)
- [Bart] enable test_torchscript, update test_tie_weights 5457 (sshleifer)
- [docs] fix model_doc links in model summary 5566 (patil-suraj)
- [Benchmark] Readme for benchmark 5363 (patrickvonplaten)
- Fix Inconsistent NER Grouping (Pipeline) 4987 (enzoampil)
- QA pipeline BART compatible 5496 (mfuntowicz)
- More explicit error when failing to tensorize overflowing tokens 5633 (LysandreJik)
- Should check that torch TPU is available 5636 (LysandreJik)
- Add forum link in the docs 5637 (sgugger)
- Fixed TextGenerationPipeline on torch + GPU 5629 (TevenLeScao)
- Fixed use of memories in XLNet (caching for language generation + warning when loading improper memoryless model) 5632 (TevenLeScao)
- [squad] add version tag to squad cache 5669 (lazovich)
- Deprecate old past arguments 5671 (sgugger)
- Pipeline model type check 5679 (JetRunner)
- rename the functions to match the rest of the test convention 5692 (stas00)
- doc improvements 5688 (stas00)
- Fix Trainer in DataParallel setting 5685 (sgugger)
- [Longformer] fix longformer global attention output 5659 (patrickvonplaten)
- [Fix] github actions CI by reverting 5138 5686 (sshleifer)
- [Reformer classification head] Implement the reformer model classification head for text classification 5198 (as-stevens)
- Cleanup bart caching logic 5640 (sshleifer)
- [AutoModels] Fix config params handling of all PT and TF AutoModels 5665 (patrickvonplaten)
- [cleanup] T5 test, warnings 5761 (sshleifer)
- [fix] T5 ONNX test: model.to(torch_device) 5769 (mfuntowicz)
- [Benchmark] fix benchmark non standard model 5801 (patrickvonplaten)
- [Benchmark] Fix models without `architectures` param in config 5808 (patrickvonplaten)
- [Longformer] fix longformer slow-down 5811 (patrickvonplaten)
- [seq2seq] pack_dataset.py rewrites dataset in max_tokens format 5819 (sshleifer)
- [seq2seq] Don't copy self.source in sortishsampler 5818 (sshleifer)
- [cleanups] make Marian save as Marian 5830 (sshleifer)
- [Reformer] - Cache hidden states and buckets to speed up inference 5578 (patrickvonplaten)
- Lightning Updates for v0.8.5 5798 (nateraw)
- Update tokenizers to 0.8.1.rc to fix Mac OS X issues 5867 (sepal)
- Xlnet outputs 5883 (TevenLeScao)
- DataParallel fixes 5733 (stas00)
- [cleanup] squad processor 5868 (sshleifer)
- Improve doc of use_cache 5912 (sgugger)
- [Fix] seq2seq pack_dataset.py actually packs 5913 (sshleifer)
- Add AlbertForPretraining to doc 5914 (sgugger)
- DataParallel fix: multi gpu evaluation 5926 (csarron)
- Clarify arg class 5916 (sgugger)
- [CI] self-scheduled runner tests examples/ 5927 (sshleifer)
- Update doc to new model outputs 5946 (sgugger)
- [CI] Install examples/requirements.txt 5956 (sshleifer)
- Expose padding_strategy on squad processor to fix QA pipeline performance regression 5932 (mfuntowicz)
- [docs] Add integration test example to copy pasta template 5961 (sshleifer)
- Cleanup Trainer and expose customization points 5982 (sgugger)
- Avoid unnecessary warnings when loading pretrained model 5922 (sgugger)
- Ensure OpenAI GPT position_ids is correctly initialized and registered at init. 5773 (mfuntowicz)
- [CI] Don't test apex 6021 (sshleifer)
- add a summary report flag for run_examples on CI 6035 (stas00)
- don't complain about missing W&B when WANDB_DISABLED=true 6036 (stas00)
- Allow to set Adam beta1, beta2 in TrainingArgs 5592 (gonglinyuan)
- Fix the return documentation rendering for all model outputs 6022 (sgugger)
- Fix typo (model saving TF) 5734 (Colanim)
- Add new AutoModel classes in pipeline 6062 (patil-suraj)
- [pack_dataset] don't sort before packing, only pack train 5954 (sshleifer)
- CL util to convert models to fp16 before upload 5953 (sshleifer)
- Add fire to setup.cfg to make isort happy 6066 (sgugger)
- [fix] no warning for position_ids buffer 6063 (sshleifer)
- Pipelines should use tuples instead of namedtuples 6061 (LysandreJik)
- Moving transformers package import statements to relative imports in some files 5796 (afcruzs)
- github issue template suggests who to tag 5790 (sshleifer)
- Make all data collators accept dict 6065 (sgugger)
- Add inference widget examples 5825 (clmnt)
- [s2s] Delete useless method, log tokens_per_batch 6081 (sshleifer)
- Logs should not be hidden behind a logger.info 6097 (LysandreJik)
- Fix zero-shot pipeline single seq output shape 6104 (joeddav)
- [fix] add bart to LM_MAPPING 6099 (sshleifer)
- [Fix] position_ids tests again 6100 (sshleifer)
- Fix deebert tests 6102 (sshleifer)
- Use FutureWarning to deprecate 6111 (sgugger)
- Added capability to quantize a model while exporting through ONNX. 6089 (mfuntowicz)
- XLNet PLM Readme 6121 (LysandreJik)
- Fix TF CTRL model naming 6134 (jplu)
- Use google style to document properties 6130 (sgugger)
- Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} 5614
- Rework TF trainer 6038 (jplu)
- Actually the extra_id are from 0-99 and not from 1-100 5967 (orena1)
- add another e.g. to avoid confusion 6055 (orena1)
- Tf trainer cleanup 6143 (sgugger)
- Switch from return_tuple to return_dict 6138 (sgugger)
- Fix FlauBERT GPU test 6142 (LysandreJik)
- Enable ONNX/ONNXRuntime optimizations through converter script 6131 (mfuntowicz)
- Add Pytorch Native AMP support in Trainer 6151 (prajjwal1)
- enable easy checkout switch 5645 (stas00)
- Replace mecab-python3 with fugashi for Japanese tokenization 6086 (polm)
- parse arguments from dict 4869 (patil-suraj)
- Harmonize both Trainers API 6157 (sgugger)
- Model output test 6155 (sgugger)
- [s2s] clean up + doc 6184 (stas00)
- Add script to convert BERT tf2.x checkpoint to PyTorch 5791 (mar-muel)
- Empty assert hunt 6056 (TevenLeScao)
- Fix saved model creation 5468 (jplu)
- Adds train_batch_size, eval_batch_size, and n_gpu to to_sanitized_dict output for logging. 5331 (jaymody)
- [DataCollatorForLanguageModeling] fix labels 6213 (patil-suraj)
- Fix _shift_right function in TFT5PreTrainedModel 6214 (maurice-g)
- Remove outdated BERT tips 6217 (JetRunner)
- run_hans label fix 6221 (VictorSanh)
- Make the order of additional special tokens deterministic 5704 (gonglinyuan)
- cleanup torch unittests 6196 (stas00)
- test_tokenization_common.py: Remove redundant coverage 6224 (sshleifer)
- [Reformer] fix reformer fp16 test 6237 (patrickvonplaten)
- [Reformer] Make random seed generator available on random seed and not on model device 6244 (patrickvonplaten)
- Update to match renamed attributes in fairseq master 5972 (LilianBordeau)
- [WIP] lightning_base: support --lr_scheduler with multiple possibilities 6232 (stas00)
- Trainer + wandb quality of life logging tweaks 6241 (TevenLeScao)
- Add strip_accents to basic BertTokenizer. 6280 (PhilipMay)
- Argument to set GPT2 inner dimension 6296 (TevenLeScao)
- [Reformer] fix default generators for pytorch < 1.6 6300 (patrickvonplaten)
- Remove redundant line in run_pl_glue.py 6305 (xujiaze13)
- [Fix] text-classification PL example 6027 (bhashithe)
- fix the shuffle argument usage and the default 6307 (stas00)
- CI dependency wheel caching 6287 (LysandreJik)
- Patch GPU failures 6281 (LysandreJik)
- fix consistency CrossEntropyLoss in modeling_bart 6265 (idoh)
- Add a script to check all models are tested and documented 6298 (sgugger)
- Fix the tests for Electra 6284 (jplu)
- [examples] consistently use --gpus, instead of --n_gpu 6315 (stas00)
- refactor almost identical tests 6339 (stas00)
- Small docfile fixes 6328 (sgugger)
- Patch models 6326 (LysandreJik)
- Ci GitHub caching 6382 (LysandreJik)
- Fix links for open in colab 6391 (sgugger)
- [EncoderDecoderModel] add a `add_cross_attention` boolean to config 6377 (patrickvonplaten)
- Feed forward chunking 6024 (Pradhy729)
- add pl_glue example test 6034 (stas00)
- testing utils: capturing std streams context manager 6231 (stas00)
- Fix tokenizer saving and loading error 6026 (yobekiko)
- Warn if debug requested without TPU 6390 (dmlap)
- [Performance improvement] "Bad tokens ids" optimization 6064 (guillaume-be)
- pl version: examples/requirements.txt is single source of truth 6309 (stas00)
- [s2s] wmt download script use less ram 6405 (stas00)
- [pl] restore lr logging behavior for glue, ner examples 6314 (stas00)
- lr_schedulers: add get_polynomial_decay_schedule_with_warmup 6361 (stas00)
- [examples] add pytest dependency 6425 (sshleifer)
- [test] replace capsys with the more refined CaptureStderr/CaptureStdout 6422 (stas00)
- Fixes to make life easier with the nlp library 6423 (sgugger)
- Move prediction_loss_only to TrainingArguments 6426 (sgugger)
- Activate check on the CI 6427 (sgugger)
- cleanup tf unittests: part 2 6260 (stas00)
- Fix docs and bad word tokens generation_utils.py 6387 (ZhuBaohe)
- Test model outputs equivalence 6445 (LysandreJik)
- add LongformerTokenizerFast in AutoTokenizer 6463 (patil-suraj)
- add BartTokenizerFast in AutoTokenizer 6464 (patil-suraj)
- Add POS tagging and Phrase chunking token classification examples 6457 (vblagoje)
- Clean directory after script testing 6453 (JetRunner)
- Use hash to clean the test dirs 6475 (JetRunner)
- Sort unique_no_split_tokens to make it deterministic 6461 (lhoestq)
- Fix TPU Convergence bug 6488 (jysohn23)
- Support additional dictionaries for BERT Japanese tokenizers 6515 (singletongue)
- [doc] Summary of the models fixes 6511 (stas00)
- Remove deprecated assertEquals 6532 (JetRunner)
- [testing] a new TestCasePlus subclass + get_auto_remove_tmp_dir() 6494 (stas00)
- [sched] polynomial_decay_schedule use default power=1.0 6473 (stas00)
- Fix flaky ONNX tests 6531 (mfuntowicz)
- [doc] make the text more readable, fix some typos, add some disambiguation 6508 (stas00)
- [doc] multiple corrections to "Summary of the tasks" 6509 (stas00)
- replace _ with __ rst links 6541 (stas00)
- Fixed label datatype for STS-B 6492 (amodaresi)
- fix incorrect codecov reports 6553 (stas00)
- [docs] Fix wrong newline in the middle of a paragraph 6573 (romainr)
- [docs] Fix number of 'ug' occurrences in tokenizer_summary 6574 (romainr)
- add BartConfig.force_bos_token_to_be_generated 6526 (sshleifer)
- Fix bart base test 6587 (sshleifer)
- Feed forward chunking others 6365 (Pradhy729)
- tf generation utils: remove unused kwargs 6591 (sshleifer)
- [BartTokenizerFast] add prepare_seq2seq_batch 6543 (patil-suraj)
- [doc] lighter 'make test' 6512 (stas00)
- [docs] Copy code button misses '...' prefixed code 6518 (romainr)
- removed redundant arg in prepare_inputs 6614 (prajjwal1)
- add intro to nlp lib & dataset links to custom datasets tutorial 6583 (joeddav)
- Add tests to Trainer 6605 (sgugger)
- TFTrainer dataset doc & fix evaluation bug 6618 (joeddav)
- Add tests/test_tokenization_reformer.py 6485 (D-Roberts)
- [Tests] fix attention masks in Tests 6621 (patrickvonplaten)
- XLNet Bug when training with apex 16-bit precision 6567 (johndolgov)
- Move threshold up for flaky test with Electra 6622 (sgugger)
- Regression test for pegasus bugfix 6606 (sshleifer)
- Trainer automatically drops unused columns in nlp datasets 6449 (sgugger)
- [Docs model summaries] Add pegasus to docs 6640 (patrickvonplaten)
- [Doc model summary] add MBart model summary 6649 (patil-suraj)
- Specify config filename in HfArgumentParser 6626 (jarednielsen)
- Don't reset the dataset type + plug for rm unused columns 6683 (sgugger)
- Fixed DataCollatorForLanguageModeling not accepting lists of lists 6685 (TevenLeScao)
- Update repo to isort v5 6686 (sgugger)
- Fix PL token classification examples 6682 (vblagoje)
- Last fix for Ray HP search 6691 (sgugger)
- Create PULL_REQUEST_TEMPLATE.md 6660 (stas00)
- [doc] remove BartForConditionalGeneration.generate 6659 (stas00)
- Move unused args to kwargs 6694 (sgugger)
- [fixdoc] Add import to pegasus usage doc 6698 (sshleifer)
- Fix hyperparameter_search doc 6695 (sgugger)
- Remove hard-coded uses of float32 to fix mixed precision use 6648 (schmidek)
- Add DPR to models summary 6690 (lhoestq)
- Add typing.overload for convert_ids_tokens 6637 (tamuhey)
- Allow tests in examples to use cuda or fp16,if they are available 5512 (Joel-hanson)
- ci/gh/self-scheduled: add newline to make examples tests run even if src/ tests fail 6706 (sshleifer)
- Use separate tqdm progressbars 6696 (sgugger)
- More tests to Trainer 6699 (sgugger)
- Add tokenizer to Trainer 6689 (sgugger)
- tensor.nonzero() is deprecated in PyTorch 1.6 6715 (mfuntowicz)
- [Albert] Add position ids to allowed uninitialized weights 6719 (patrickvonplaten)
- Fix ONNX test_quantize unittest 6716 (mfuntowicz)
- [squad] make examples and dataset accessible from SquadDataset object 6710 (lazovich)
- Fix pegasus-xsum integration test 6726 (sshleifer)
- T5Tokenizer adds EOS token if not already added 5866 (sshleifer)
- Install nlp for github actions test 6728 (sgugger)
- [Torchscript] Fix docs 6740 (patrickvonplaten)
- Add "tie_word_embeddings" config param 6692 (patrickvonplaten)
- Fix tf boolean mask in graph mode 6741 (JayYip)
- Fix TF optimizer 6717 (jplu)
- [TF Longformer] Improve Speed for TF Longformer 6447 (patrickvonplaten)
- add __init__.py to utils 6754 (joeddav)
- [s2s] run_eval.py QOL improvements and cleanup 6746 (sshleifer)
- s2s distillation uses AutoModelForSeqToSeqLM 6761 (sshleifer)
- Add AdaFactor optimizer from fairseq 6722 (moscow25)
- Adds Adafactor to the docs and slightly fixes the formatting 6765 (LysandreJik)
- Fix the TF Trainer gradient accumulation and the TF NER example 6713 (jplu)
- Fix run_squad.py to work with BART 6756 (tomgrek)
- Add NLP install to self-scheduled CI 6767 (sshleifer)
- [testing] replace hardcoded paths to allow running tests from anywhere 6523 (stas00)
- [test schedulers] adjust to test the first step's reading 6429 (stas00)
- new Makefile target: docs 6510 (stas00)
- [transformers-cli] fix logger getter 6777 (stas00)
- PL: --adafactor option 6776 (sshleifer)
- [style] set the minimal required version for `black` 6784 (stas00)
- Transformer-XL: Improved tokenization with sacremoses 6322 (RafaelWO)
- prepare_seq2seq_batch makes labels/ decoder_input_ids made later. 6654 (sshleifer)
- t5 model should make decoder_attention_mask 6800 (sshleifer)
- [s2s] Test hub configs in self-scheduled CI 6809 (sshleifer)
- [bart] rename self-attention -> attention 6708 (sshleifer)
- [tests] fix typos in inputs 6818 (stas00)
- Fixed open in colab link 6825 (PandaWhoCodes)
- clarify shuffle 6312 (xujiaze13)
- TF Flaubert w/ pre-norm 6841 (LysandreJik)
- Fix resuming training for Windows 6847 (sgugger)
- Only access loss tensor every logging_steps 6802 (jysohn23)
- Add checkpointing to Ray Tune HPO 6747 (krfricke)
- Split hp search methods 6857 (sgugger)
- Fix marian slow test 6854 (sshleifer)
- Bart can make decoder_input_ids from labels 6758 (sshleifer)
- add a final report to all pytest jobs 6861 (stas00)
- Restore PaddingStrategy.MAX_LENGTH on QAPipeline while no v2. 6875 (mfuntowicz)
- [Generate] Facilitate PyTorch generate using `ModelOutputs` 6735 (patrickvonplaten)
