Transformers

Latest version: v4.50.3

Safety actively analyzes 723158 Python packages for vulnerabilities to keep your Python projects secure.

Page 26 of 33

3.2.0

Not secure

Bert Seq2Seq models, FSMT, Funnel Transformer, LXMERT

BERT Seq2seq models

The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using EncoderDecoderModel as proposed in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

It was added to the library in PyTorch with the following checkpoints:

- `google/roberta2roberta_L-24_bbc`
- `google/roberta2roberta_L-24_gigaword`
- `google/roberta2roberta_L-24_cnn_daily_mail`
- `google/roberta2roberta_L-24_discofuse`
- `google/roberta2roberta_L-24_wikisplit`
- `google/bert2bert_L-24_wmt_de_en`
- `google/bert2bert_L-24_wmt_en_de`

Contributions:

- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. 6594 (patrickvonplaten)

FSMT (FairSeq MachineTranslation)

FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR’s WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616) by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.

It was added to the library in PyTorch, with the following checkpoints:

- `facebook/wmt19-en-ru`
- `facebook/wmt19-en-de`
- `facebook/wmt19-ru-en`
- `facebook/wmt19-de-en`

Contributions:

- [ported model] FSMT (FairSeq MachineTranslation) 6940 (stas00)
- build/eval/gen-card scripts for fsmt 7155 (stas00)
- skip failing FSMT CUDA tests until investigated 7220 (stas00)
- [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test 7224 (stas00)
- [s2s] adjust finetune + test to work with fsmt 7263 (stas00)
- [fsmt] SinusoidalPositionalEmbedding no need to pass device 7292 (stas00)
- Adds FSMT to LM head AutoModel 7312 (LysandreJik)

LayoutLM

The LayoutLM model was proposed in [LayoutLM: Pre-training of Text and Layout for Document Image Understandin](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It’s a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.

It was added to the library in PyTorch with the following checkpoints:

- `layoutlm-base-uncased`
- `layoutlm-large-uncased`

Contributions:

- Add LayoutLM Model 7064 (liminghao1630)
- Fixes for LayoutLM 7318 (sgugger)
Funnel Transformer

The Funnel Transformer model was proposed in the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236). It is a bidirectional transformer model, like BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks (CNN) in computer vision.

It was added to the library in both PyTorch and TensorFlow, with the following checkpoints:

- `funnel-transformer/small`
- `funnel-transformer/small-base`
- `funnel-transformer/medium`
- `funnel-transformer/medium-base`
- `funnel-transformer/intermediate`
- `funnel-transformer/intermediate-base`
- `funnel-transformer/large`
- `funnel-transformer/large-base`
- `funnel-transformer/xlarge`
- `funnel-transformer/xlarge-base`

Contributions:

- Funnel transformer 6908 (sgugger)
- Add TF Funnel Transformer 7029 (sgugger)

LXMERT

The LXMERT model was proposed in [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities) pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.

It was added to the library in TensorFlow with the following checkpoints:

- `unc-nlp/lxmert-base-uncased`
- `unc-nlp/lxmert-vqa-uncased`
- `unc-nlp/lxmert-gqa-uncased`

Contributions

- Adding the LXMERT pretraining model (MultiModal languageXvision) to HuggingFace's suite of models 5793 (eltoto1219)
- [LXMERT] Fix tests on gpu 6946 (patrickvonplaten)

New pipelines

The following pipeline was added to the library:

- [pipelines] Text2TextGenerationPipeline 6744 (patil-suraj)

Notebooks

The following community notebooks were contributed to the library:

- Demoing LXMERT with raw images by incorporating the FRCNN model for roi-pooled extraction and bounding-box predction on the GQA answer set. 6986 (eltoto1219)
- [Community notebooks] Add notebook on fine-tuning GPT-2 Model with Trainer Class 7005 (philschmid)
- Add "Fine-tune ALBERT for sentence-pair classification" notebook to the community notebooks 7255 (NadirEM)
- added multilabel text classification notebook using distilbert to community notebooks 7201 (DhavalTaunk08)

Encoder-decoder architectures

An additional encoder-decoder architecture was added:

- [EncoderDecoder] Add xlm-roberta to encoder decoder 6878 (patrickvonplaten)

Bug fixes and improvements

- TF Flaubert w/ pre-norm 6841 (LysandreJik)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task 6644 (HuangLianzhe)
- Fix in Adafactor docstrings 6845 (sgugger)
- Fix resuming training for Windows 6847 (sgugger)
- Only access loss tensor every logging_steps 6802 (jysohn23)
- Marian distill scripts + integration test 6799 (sshleifer)
- Add checkpointing to Ray Tune HPO 6747 (krfricke)
- Split hp search methods 6857 (sgugger)
- Update ONNX notebook to include section on quantization. 6831 (mfuntowicz)
- Fix marian slow test 6854 (sshleifer)
- [s2s] command line args for faster val steps 6833 (sshleifer)
- Bart can make decoder_input_ids from labels 6758 (sshleifer)
- add a final report to all pytest jobs 6861 (stas00)
- Logging doc 6852 (sgugger)
- Restore PaddingStrategy.MAX_LENGTH on QAPipeline while no v2. 6875 (mfuntowicz)
- [Generate] Facilitate PyTorch generate using `ModelOutputs` 6735 (patrickvonplaten)
- Add cache_dir to save features TextDataset 6879 (jysohn23)
- [Docs, Examples] Fix QA example for PT 6890 (patrickvonplaten)
- Update modeling_bert.py 6897 (parthe)
- [Electra] fix warning for position ids 6884 (patrickvonplaten)
- minor docs grammar fixes 6889 (harrywang)
- Fix error class instantiation 6634 (tamuhey)
- Output attention takes an s 6903 (sgugger)
- [testing] fix ambiguous test 6898 (stas00)
- test_tf_common: remove un_used mixin class parameters 6866 (PuneethaPai)
- Template updates 6914 (sgugger)
- Changed link to the correct paper in the second paragraph 6905 (sengl)
- tweak tar command in readme 6919 (brettkoonce)
- [s2s]: script to convert pl checkpoints to hf checkpoints 6911 (sshleifer)
- [s2s] allow task_specific_params=summarization_xsum 6923 (sshleifer)
- move wandb/comet logger init to train() to allow parallel logging 6850 (krfricke)
- [s2s] use --eval_beams command line arg 6926 (sshleifer)
- [s2s] support early stopping based on loss, rather than rouge 6927 (sshleifer)
- Fix mixed precision issue in TF DistilBert 6915 (chiapas)
- [docstring] misc arg doc corrections 6932 (stas00)
- [s2s] distill: --normalize_hidden --supervise_forward 6834 (sshleifer)
- [s2s] run_eval.py parses generate_kwargs 6948 (sshleifer)
- [doc] remove the implied defaults to :obj:`None`, s/True/ :obj:`True/, etc. 6956 (stas00)
- [s2s] warn if --fp16 for torch 1.6 6977 (sshleifer)
- feat: allow prefix for any generative model 5885 (borisdayma)
- Trainer with grad accum 6930 (sgugger)
- Cannot index `None` 6984 (LysandreJik)
- [docstring] missing arg 6933 (stas00)
- [testing] add dependency: parametrize 6958 (stas00)
- Fixed the default number of attention heads in Reformer Configuration 6973 (tznurmin)
- [gen utils] missing else case 6980 (stas00)
- match CI's version of flake8 6941 (stas00)
- Conversion scripts shouldn't have relative imports 6991 (LysandreJik)
- Add missing arguments for BertWordPieceTokenizer 5810 (monologg)
- fixed trainer tr_loss memory leak 6999 (StuartMesham)
- Floating-point operations logging in trainer 6768 (TevenLeScao)
- Fixing FLOPS merge by checking if torch is available 7013 (LysandreJik)
- [Longformer] Fix longformer documentation 7016 (patrickvonplaten)
- pegasus.rst: fix expected output 7017 (sshleifer)
- adding TRANSFORMERS_VERBOSITY env var 6961 (stas00)
- [generation] consistently add eos tokens 6982 (stas00)
- [from_pretrained] Allow tokenizer_type ≠ model_type 6995 (julien-c)
- replace torch.triu with onnx compatible code 6929 (HenryDashwood)
- Batch encore plus and overflowing tokens fails when non existing overflowing tokens for a sequence 6677 (LysandreJik)
- add -y to bypass prompt for transformers-cli upload 7035 (stas00)
- Fix confusing warnings during TF2 import from PyTorch 6623 (jcrocholl)
- Albert pretrain datasets/ datacollator 6168 (yl-to)
- Fix template 7040 (LysandreJik)
- Small fixes in tf template 7044 (sgugger)
- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. 6594 (patrickvonplaten)
- fix to ensure that returned tensors after the tokenization is Long 7039 (GeetDsa)
- [BertGeneration] Correct Doc Title 7048 (patrickvonplaten)
- [BertGeneration, Docs] Fix another old name in docs 7050 (patrickvonplaten)
- [xlm tok] config dict: fix str into int to match definition 7034 (stas00)
- [s2s] --eval_max_generate_length 7018 (sshleifer)
- Fix CI with change of name of nlp 7054 (sgugger)
- [wip/s2s] DistributedSortishSampler 7056 (sshleifer)
- these tests require non-multigpu env 7059 (stas00)
- [BertGeneration] Clean naming 7068 (patrickvonplaten)
- Document the dependcy on datasets 7058 (sgugger)
- Automate the lists in auto-xxx docs 7061 (sgugger)
- Add tests and fix various bugs in ModelOutput 7073 (sgugger)
- Compute loss method 7074 (sgugger)
- [T5Tokenizer] remove prefix_tokens 7078 (patil-suraj)
- [s2s] run_eval supports --prefix clarg. 6953 (sshleifer)
- fix bug in pegasus converter 7094 (sshleifer)
- [s2s] two stage run_distributed_eval.py 7105 (sshleifer)
- Update xsum length penalty to better values 7107 (sshleifer)
- [s2s] distributed eval cleanup 7110 (sshleifer)
- [s2s distill] allow pegasus-12-12 7104 (sshleifer)
- Temporarily skip failing tests due to dependency change 7118 (LysandreJik)
- fix link to paper 7116 (btel)
- ignore FutureWarning in tests 7079 (stas00)
- fix deprecation warnings 7033 (stas00)
- [examples testing] restore code 7099 (stas00)
- Clean up autoclass doc 7081 (sgugger)
- Add Mirror Option for Downloads 6679 (JetRunner)
- [s2s] distributed eval in one command 7124 (sshleifer)
- [QOL] add signature for prepare_seq2seq_batch 7108 (sshleifer)
- Fix reproducible tests in Trainer 7119 (sgugger)
- [logging] remove no longer needed verbosity override 7100 (stas00)
- Fix TF Trainer loss calculation 6998 (chiapas)
- Add quotes to paths in MeCab arguments 7142 (polm)
- Multi predictions trainer 7126 (sgugger)
- fix ZeroDivisionError and epoch counting 7125 (chiapas)
- [EncoderDecoderModel] fix indentation error 7131 (patrickvonplaten)
- [docs] add testing documentation 7101 (stas00)
- Refactoring the TF activations functions 7150 (jplu)
- fix the warning message of overflowed sequence 7151 (xiye17)
- [doc] [testing] improve/expand the Parametrization section 7156 (stas00)
- Add empty random document case to DataCollatorForNextSentencePrediction 7161 (choidongyeon)
- [s2s run_eval] new features 7109 (stas00)
- use the correct add_start_docstrings 7174 (stas00)
- [s2s] distributed eval cleanup 7186 (sshleifer)
- remove duplicated code 7173 (stas00)
- remove deprecated flag 7171 (stas00)
- Transformer-XL: Remove unused parameters 7087 (RafaelWO)
- Trainer multi label 7191 (sgugger)
- Change to use relative imports in some files & Add python prompt symbols to example codes 7202 (soheeyang)
- [s2s] run_eval/run_eval_search tweaks 7192 (stas00)
- [s2s] dynamic batch size with --max_tokens_per_batch 7030 (sshleifer)
- [s2s] remove double assert 7223 (sshleifer)
- Add customized text to widget 7204 (mrm8488)
- Rewrites BERT in Flax to the new Linen API 7211 (marcvanzee)
- token-classification: update url of GermEval 2014 dataset 6571 (stefan-it)
- Fix a few countings (steps / epochs) in trainer_tf.py 7175 (chiapas)
- Add new pre-trained models BERTweet and PhoBERT 6129 (datquocnguyen)
- [s2s] distributed_eval.py saves better speed info 7242 (sshleifer)
- [testing doc] slow has to be last 7251 (stas00)
- examples/seq2seq/__init__.py mutates sys.path 7194 (stas00)
- [Bug fix] Fixed target_mapping preparation for XLNet (Pytorch) 7267 (guillaume-be)
- [example/glue] fix compute_metrics_fn for bart like models 7248 (patil-suraj)
- Disable missing weight warning for RobertaForMaskedLM/CamembertForMaskedLM 7282 (raphael0202)
- Fix 7284 7289 (sgugger)
- [s2s tests] fix test_run_eval_search 7297 (stas00)
- [s2s] s/alpha_loss_encoder/alpha_encoder_loss/ 7298 (stas00)
- [s2s] save hostname with repo info 7301 (sshleifer)
- Copy code from Bert to Roberta and add safeguard script 7219 (sgugger)
- Fix 7304 7305 (sgugger)
- Fix saving TF custom models 7291 (jplu)
- is_pretokenized -> is_split_into_words 7236 (sgugger)
- Add possibility to evaluate every epoch 7302 (sgugger)
- Support for Windows in check_copies 7316 (sgugger)
- Create an XLA parameter and fix the mixed precision 7311 (jplu)

3.1

Mistral 3.1 is heavily referenced in the following [model-based release](https://github.com/huggingface/transformers/releases/tag/v4.49.0-Mistral-3) and we recommend reading these if you want all the information relative to that model.

![image](https://github.com/user-attachments/assets/2f3ae3f3-87e8-4a64-a72e-6167ff6d90e5)

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.

* Add Mistral3 by Cyrilvallez in 36790

Smol VLM 2

SmolVLM-2 is heavily referenced in the following [model-based release](https://github.com/huggingface/transformers/releases/tag/v4.49.0-SmolVLM-2) and we recommend reading these if you want all the information relative to that model.

![image](https://github.com/user-attachments/assets/dbdac096-f8cd-467a-8bfb-70af4c1e12c8)

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

- It uses SmolLM2 for the text model.
- It supports multi-image and video inputs

* SmolVLM2 by orrzohar in 36126

SigLIP-2

SigLIP-2 is heavily referenced in the following [model-based release](https://github.com/huggingface/transformers/releases/tag/v4.49.0-SigLIP-2) and we recommend reading these if you want all the information relative to that model.

![image](https://github.com/user-attachments/assets/63122c75-4bfd-469d-8031-da40fc18ed0d)

The SigLIP2 model was proposed in [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://huggingface.co/papers/2502.14786) by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin,
Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen,
Andreas Steiner and Xiaohua Zhai.

The model comes in two variants

1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in `transformers`)

* Add SigLIP 2 by qubvel in 36323

Prompt Depth Anything

PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.

![image](https://promptda.github.io/assets/teaser.jpg)

* Add Prompt Depth Anything Model by haotongl in 35401

New tool: attention visualization

We add a new tool to `transformers` to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:

py

from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")

visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")

visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer("<img> You are an assistant.", suffix = "What is on the image?")

visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me") we should have slidiing on non sliding side by side

visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("<img>You are an assistant. Make sure you print me") we should have slidiing on non sliding side by side

![image](https://github.com/user-attachments/assets/0a926a20-b084-4420-8bda-5a43cd30ef2f)

* Add attention visualization tool by ArthurZucker in 36630

Deprecating transformers.agents in favor of smolagents

We are deprecating `transformers.agents` in favour of the `smolagents` library. Read more about smolagents [here](https://huggingface.co/docs/smolagents/index).

* Deprecate transformers.agents by aymeric-roucher in 36415

Quantization

We support adding custom quantization method by using the `register_quantization_config` and `register_quantizer` decorator:

python
register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
pass

register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
pass

quantized_model = AutoModelForCausalLM.from_pretrained(
"facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)

* Added Support for Custom Quantization by keetrap in 35915
* Add Example for Custom quantization by MekkCyber in 36286

AMD is developing its in-house quantizer named [Quark](https://quark.docs.amd.com/latest/) released under MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized by quark library:

python
pip install amd-quark

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")

* Support loading Quark quantized models in Transformers by fxmarty-amd and BowenBao in 36372

Torchao is augmented with `autoquant` support, CPU-quantization, as well as new `AOBaseConfig` object instances for more advanced configuration.

* Add autoquant support for torchao quantizer by jerryzh168 in 35503
* enable torchao quantization on CPU by jiqing-feng in 36146
* Add option for ao base configs by drisspg in 36526

Tensor Parallelism implementation changes

At loading time, the parallelization is now applied module-by-module, so that no memory overhead is required compared to what the final weight distribution will be!

* TP initialization module-by-module by Cyrilvallez in 35996

Generation

This release includes two speed upgrades to `generate`:
1. Assisted generation now works with ANY model as an assistant, even with `do_sample=True`;

py
from transformers import pipeline
import torch

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

pipe = pipeline(
"text-generation",
model=checkpoint,
assistant_model=assistant_checkpoint,
do_sample=True
)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])

2. Beam search was vectorized, and should be significantly faster with a large `num_beams`. The speedup is more visible on smaller models, where `model.forward` doesn't dominate the total run time.

* Universal Speculative Decoding `CandidateGenerator` by keyboardAnt, jmamou, and gauravjain14 in 35029
* [generate] ✨ vectorized beam search ✨ by gante in 35802

Documentation

A significant redesign of our documentation has wrapped-up. The goal was to greatly simplify the `transformers` documentation, making it much more easy to navigate. Let us know what you think!

* [docs] Redesign by stevhliu in 31757

Notable repo maintenance

The research examples folder that was hosted in `transformers` is no more. We have moved it out of `transformers` and in the following repo: github.com/huggingface/transformers-research-projects/

* Remove research projects by Rocketknight1 in 36645

We have updated our flex attention support so as to have it be on-par with our Flash Attention 2 support.

* Proper_flex by ArthurZucker in 36643

More models support flex attention now thanks to qubvel

* Refactor Attention implementation for ViT-based models by qubvel in 36545

First integration of hub kernels for deformable detr!

- Use deformable_detr kernel from the Hub (36853) by danieldk

Bugfixes and improvements

* [tests] fix `EsmModelIntegrationTest::test_inference_bitsandbytes` by faaany in 36225
* Fix `LlavaForConditionalGenerationModelTest::test_config` after 36077 by ydshieh in 36230
* AMD DeepSpeed image additional HIP dependencies by ivarflakstad in 36195
* [generate] remove cache v4.47 deprecations by gante in 36212
* Add missing atol to torch.testing.assert_close where rtol is specified by ivarflakstad in 36234
* [tests] remove tf/flax tests in `/generation` by gante in 36235
* [generate] Fix encoder decoder models attention mask by eustlb in 36018
* Add compressed tensor in quant dockerfile by SunMarc in 36239
* [tests] remove `test_export_to_onnx` by gante in 36241
* Au revoir flaky `test_fast_is_faster_than_slow` by ydshieh in 36240
* Fix TorchAoConfig not JSON serializable by andrewor14 in 36206
* Remove flakiness in VLMs by zucchini-nlp in 36242
* feat: add support for tensor parallel training workflow with accelerate by kmehant in 34194
* Fix XGLM loss computation (PyTorch and TensorFlow) by damianoamatruda in 35878
* GitModelIntegrationTest - flatten the expected slice tensor by ivarflakstad in 36260
* Added Support for Custom Quantization by keetrap in 35915
* Qwen2VL fix cos,sin dtypes to float when used with deepspeed by ArdalanM in 36188
* Uniformize LlavaNextVideoProcessor kwargs by yonigozlan in 35613
* Add support for post-processing kwargs in image-text-to-text pipeline by yonigozlan in 35374
* Add dithering to the `Speech2TextFeatureExtractor` API. by KarelVesely84 in 34638
* [tests] remove `pt_tf` equivalence tests by gante in 36253
* TP initialization module-by-module by Cyrilvallez in 35996
* [tests] deflake dither test by gante in 36284
* [tests] remove flax-pt equivalence and cross tests by gante in 36283
* [tests] make `test_from_pretrained_low_cpu_mem_usage_equal` less flaky by gante in 36255
* Add Example for Custom quantization by MekkCyber in 36286
* docs: Update README_zh-hans.md by hyjbrave in 36269
* Fix callback handler reference by SunMarc in 36250
* Make cache traceable by IlyasMoutawwakil in 35873
* Fix broken CI on release branch due to missing conversion files by ydshieh in 36275
* Ignore conversion files in test fetcher by ydshieh in 36251
* SmolVLM2 by orrzohar in 36126
* Fix typo in Pixtral example by 12v in 36302
* fix: prevent second save in the end of training if last step was saved already by NosimusAI in 36219
* [smolvlm] make CI green by gante in 36306
* Fix default attention mask of generate in MoshiForConditionalGeneration by cyan-channel-io in 36171
* VLMs: even more clean-up by zucchini-nlp in 36249
* Add SigLIP 2 by qubvel in 36323
* [CI] Check test if the `GenerationTesterMixin` inheritance is correct 🐛 🔫 by gante in 36180
* [tests] make quanto tests device-agnostic by faaany in 36328
* Uses Collection in transformers.image_transforms.normalize by CalOmnie in 36301
* Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese by Rocketknight1 in 36121
* [tests] enable bnb tests on xpu by faaany in 36233
* Improve model loading for compressed tensor models by rahul-tuli in 36152
* Change slack channel for mi250 CI to amd-hf-ci by ivarflakstad in 36346
* Add autoquant support for torchao quantizer by jerryzh168 in 35503
* Update amd pytorch index to match base image by ivarflakstad in 36347
* fix(type): padding_side type should be Optional[str] by shenxiangzhuang in 36326
* [Modeling] Reduce runtime when loading missing keys by kylesayrs in 36312
* notify new model merged to `main` by ydshieh in 36375
* Update modeling_llava_onevision.py by yinsong1986 in 36391
* Load models much faster on accelerator devices!! by Cyrilvallez in 36380
* [modular] Do not track imports in functions by Cyrilvallez in 36279
* Fix `is_causal` fail with compile by Cyrilvallez in 36374
* enable torchao quantization on CPU by jiqing-feng in 36146
* Update _get_eval_sampler to reflect Trainer.tokenizer is deprecation self.tokenizer -> self.processing_class by yukiman76 in 36315
* Fix doc formatting in forward passes & modular by Cyrilvallez in 36243
* Added handling for length <2 of suppress_tokens for whisper by andreystarenky in 36336
* addressing the issue 34611 to make FlaxDinov2 compatible with any batch size by MHRDYN7 in 35138
* tests: revert change of torch_require_multi_gpu to be device agnostic by dvrogozh in 35721
* [tests] enable autoawq tests on XPU by faaany in 36327
* fix audio classification pipeline fp16 test on cuda by jiqing-feng in 36359
* chore: fix function argument descriptions by threewebcode in 36392
* Fix pytorch integration tests for SAM by qubvel in 36397
* [CLI] add import guards by gante in 36376
* Fix convert_to_rgb for SAM ImageProcessor by MSt-10 in 36369
* Security fix for `benchmark.yml` by ydshieh in 36402
* Fixed VitDet for non-squre Images by cjfghk5697 in 35969
* Add retry hf hub decorator by muellerzr in 35213
* Deprecate transformers.agents by aymeric-roucher in 36415
* Fixing the docs corresponding to the breaking change in torch 2.6. by Narsil in 36420
* add recommendations for NPU using flash_attn by zheliuyu in 36383
* fix: prevent model access error during Optuna hyperparameter tuning by emapco in 36395
* Universal Speculative Decoding `CandidateGenerator` by keyboardAnt in 35029
* Fix compressed tensors config by MekkCyber in 36421
* Update form pretrained to make TP a first class citizen by ArthurZucker in 36335
* Fix Expected output for compressed-tensors tests by MekkCyber in 36425
* restrict cache allocator to non quantized model by SunMarc in 36428
* Change PR to draft when it is (re)opened by ydshieh in 36417
* Fix permission by ydshieh in 36443
* Fix another permission by ydshieh in 36444
* Add `contents: write` by ydshieh in 36445
* [save_pretrained ] Skip collecting duplicated weight by wejoncy in 36409
* [generate] `torch.distributed`-compatible `DynamicCache` by gante in 36373
* Lazy import libraries in `src/transformers/image_utils.py` by hmellor in 36435
* Fix `hub_retry` by ydshieh in 36449
* [GroundingDino] Fix grounding dino loss 🚨 by EduardoPach in 31828
* Fix loading models with mismatched sizes by qubvel in 36463
* [docs] fix bug in deepspeed config by faaany in 36081
* Add Got-OCR 2 Fast image processor and refactor slow one by yonigozlan in 36185
* Fix couples of issues from 36335 by SunMarc in 36453
* Fix _load_state_dict_into_meta_model with device_map=None by hlky in 36488
* Fix loading zero3 weights by muellerzr in 36455
* Check `TRUST_REMOTE_CODE` for `RealmRetriever` for security by ydshieh in 36511
* Fix kwargs UserWarning in SamImageProcessor by MSt-10 in 36479
* fix torch_dtype, contiguous, and load_state_dict regression by SunMarc in 36512
* Fix some typos in docs by co63oc in 36502
* chore: fix message descriptions in arguments and comments by threewebcode in 36504
* Fix pipeline+peft interaction by Rocketknight1 in 36480
* Fix edge case for continue_final_message by Rocketknight1 in 36404
* [Style] fix E721 warnings by kashif in 36474
* Remove unused code by Rocketknight1 in 36459
* [docs] Redesign by stevhliu in 31757
* Add aya by ArthurZucker in 36521
* chore: Fix typos in docs and examples by co63oc in 36524
* Fix bamba tests amd by ivarflakstad in 36535
* Fix links in quantization doc by MekkCyber in 36528
* chore: enhance messages in docstrings by threewebcode in 36525
* guard torch version for uint16 by SunMarc in 36520
* Fix typos in tests by co63oc in 36547
* Fix typos . by zhanluxianshen in 36551
* chore: enhance message descriptions in parameters,comments,logs and docstrings by threewebcode in 36554
* Delete redundancy if case in model_utils by zhanluxianshen in 36559
* Modular Conversion --fix_and_overwrite on Windows by hlky in 36583
* Integrate SwanLab for offline/online experiment tracking and local visualization by ShaohonChen in 36433
* [bark] fix loading of generation config by gante in 36587
* [XGLM] tag tests as slow by gante in 36592
* fix: argument by ariG23498 in 36558
* Mention UltraScale Playbook 🌌 in docs by NouamaneTazi in 36589
* avoid errors when the size of `input_ids` passed to `PrefixConstrainedLogitsProcessor` is zero by HiDolen in 36489
* Export base streamer. by AndreasAbdi in 36500
* Github action for auto-assigning reviewers by Rocketknight1 in 35846
* Update chat_extras.md with content correction by krishkkk in 36599
* Update "who to tag" / "who can review" by gante in 36394
* Fixed datatype related issues in `DataCollatorForLanguageModeling` by capemox in 36457
* Fix check for XPU. PyTorch >= 2.6 no longer needs ipex. by tripzero in 36593
* [`HybridCache`] disable automatic compilation by gante in 36620
* Fix auto-assign reviewers by Rocketknight1 in 36631
* chore: fix typos in language models by threewebcode in 36586
* [docs] Serving LLMs by stevhliu in 36522
* Refactor some core stuff by ArthurZucker in 36539
* Fix bugs in mllama image processing by tjohnson31415 in 36156
* Proper_flex by ArthurZucker in 36643
* Fix AriaForConditionalGeneration flex attn test by ivarflakstad in 36604
* Remove remote code warning by Rocketknight1 in 36285
* Stop warnings from unnecessary torch.tensor() overuse by Rocketknight1 in 36538
* [docs] Update docs dependency by stevhliu in 36635
* Remove research projects by Rocketknight1 in 36645
* Fix gguf docs by SunMarc in 36601
* fix typos in the docs directory by threewebcode in 36639
* Gemma3 by RyanMullins in 36658
* HPU support by IlyasMoutawwakil in 36424
* fix block mask typing by ArthurZucker in 36661
* [CI] gemma 3 `make fix-copies` by gante in 36664
* Fix bnb regression due to empty state dict by SunMarc in 36663
* [core] Large/full refactor of `from_pretrained` by Cyrilvallez in 36033
* Don't accidentally mutate the base_model_tp_plan by Rocketknight1 in 36677
* Fix Failing GPTQ tests by MekkCyber in 36666
* Remove hardcoded slow image processor class in processors supporting fast ones by yonigozlan in 36266
* [quants] refactor logic for modules_to_not_convert by SunMarc in 36672
* Remove differences between init and preprocess kwargs for fast image processors by yonigozlan in 36186
* Refactor siglip2 fast image processor by yonigozlan in 36406
* Fix rescale normalize inconsistencies in fast image processors by yonigozlan in 36388
* [Cache] Don't initialize the cache on `meta` device by gante in 36543
* Update config.torch_dtype correctly by SunMarc in 36679
* Fix slicing for 0-dim param by SunMarc in 36580
* Changing the test model in Quanto kv cache by MekkCyber in 36670
* fix wandb hp search unable to resume from sweep_id by bd793fcb in 35883
* Upgrading torch version and cuda version in quantization docker by MekkCyber in 36264
* Change Qwen2_VL image processors to have init and call accept the same kwargs by yonigozlan in 36207
* fix type annotation for ALL_ATTENTION_FUNCTIONS by WineChord in 36690
* Fix dtype for params without tp_plan by Cyrilvallez in 36681
* chore: fix typos in utils module by threewebcode in 36668
* [CI] Automatic rerun of certain test failures by gante in 36694
* Add loading speed test by Cyrilvallez in 36671
* fix: fsdp sharded state dict wont work for save_only_model knob by kmehant in 36627
* Handling an exception related to HQQ quantization in modeling by MekkCyber in 36702
* Add GGUF support to T5-Encoder by Isotr0py in 36700
* Final CI cleanup by Rocketknight1 in 36703
* Add support for fast image processors in add-new-model-like CLI by yonigozlan in 36313
* Gemma3 processor typo by Kuangdd01 in 36710
* Make the flaky list a little more general by Rocketknight1 in 36704
* Cleanup the regex used for doc preprocessing by Rocketknight1 in 36648
* [model loading] don't `gc.collect()` if only 1 shard is used by gante in 36721
* Fix/best model checkpoint fix by seanswyi in 35885
* Try working around the processor registration bugs by Rocketknight1 in 36184
* [tests] Parameterized `test_eager_matches_sdpa_inference` by gante in 36650
* 🌐 [i18n-KO] Translated codegen.md to Korean by maximizemaxwell in 36698
* Fix post_init() code duplication by Cyrilvallez in 36727
* Fix grad accum arbitrary value by IlyasMoutawwakil in 36691
* [Generation, Gemma 3] When passing a custom `generation_config`, overwrite default values with the model's base `generation_config` by gante in 36684
* 🚨🚨🚨 Fix sdpa in SAM and refactor relative position embeddings by geetu040 in 36422
* enable/disable compile for quants methods by SunMarc in 36519
* fix can_generate by jiqing-feng in 36570
* Allow ray datasets to be used with trainer by FredrikNoren in 36699
* fix xpu tests by jiqing-feng in 36656
* Fix test isolation for clear_import_cache utility by sambhavnoobcoder in 36345
* Fix `TrainingArguments.torch_empty_cache_steps` post_init check by pkuderov in 36734
* [MINOR:TYPO] Update hubert.md by cakiki in 36733
* [CI] remove redundant checks in `test_eager_matches_sdpa_inference` by gante in 36740
* [docs] Update README by stevhliu in 36265
* doc: Clarify `is_decoder` usage in PretrainedConfig documentation by d-kleine in 36724
* fix typos in the tests directory by threewebcode in 36717
* chore: fix typos in tests directory by threewebcode in 36785
* Fixing typo in gemma3 image_processor_fast and adding a small test by Zebz13 in 36776
* Fix gemma3_text tokenizer in mapping by LysandreJik in 36793
* Add Mistral3 by Cyrilvallez in 36790
* fix hqq due to recent modeling changes by SunMarc in 36771
* Update SHA for `tj-actions/changed-files` by ydshieh in 36795
* Loading optimizations by Cyrilvallez in 36742
* Fix Mistral3 tests by yonigozlan in 36797
* Fix casting dtype for qunatization by SunMarc in 36799
* Fix chameleon's TypeError because inputs_embeds may None by YenFuLin in 36673
* Support custom dosctrings in modular by yonigozlan in 36726
* [generate] ✨ vectorized beam search ✨ by gante in 35802
* Expectations test utils by ivarflakstad in 36569
* fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model by yao-matrix in 36572
* Remove `dist": "loadfile"` for `pytest` in CircleCI jobs by ydshieh in 36811
* Fix Device map for bitsandbytes tests by MekkCyber in 36800
* [Generation] remove leftover code from end-to-end compilation by gante in 36685
* Add attention visualization tool by ArthurZucker in 36630
* Add option for ao base configs by drisspg in 36526
* enable OffloadedCache on XPU from PyTorch 2.7 by yao-matrix in 36654
* [gemma 3] multimodal checkpoints + AutoModelForCausalLM by gante in 36741
* One more fix for reviewer assignment by Rocketknight1 in 36829
* Support tracable dynamicKVcache by tugsbayasgalan in 36311
* Add Space to Bitsandbytes doc by MekkCyber in 36834
* quick fix fast_image_processor register error by JJJYmmm in 36716
* Update configuration_qwen2.py by michaelfeil in 36735
* Just import torch AdamW instead by Rocketknight1 in 36177
* Move the warning to the documentation for DataCollatorWithFlattening by qgallouedec in 36707
* Fix swanlab global step by Zeyi-Lin in 36728
* Disable inductor config setter by default by HDCharles in 36608
* [ForCausalLMLoss] allow users to pass shifted labels by stas00 in 36607
* fix tiktoken convert to pass AddedToken to Tokenizer by itazap in 36566
* Saving `Trainer.collator.tokenizer` in when `Trainer.processing_class` is `None` by innerNULL in 36552
* Pass num_items_in_batch directly to loss computation by eljandoubi in 36753
* Fix fp16 ONNX export for RT-DETR and RT-DETRv2 by qubvel in 36460
* Update deprecated Jax calls by rasmi in 35919
* [qwen2 audio] remove redundant code and update docs by gante in 36282
* Pass state dict by phos-phophy in 35234
* [modular] Sort modular skips by gante in 36304
* [generate] clarify docstrings: when to inherit `GenerationMixin` by gante in 36605
* Update min safetensors bis by SunMarc in 36823
* Fix import for torch 2.0, 2.1 - guard typehint for "device_mesh" by qubvel in 36768
* Gemma 3: Adding explicit GenerationConfig and refactoring conversion … by RyanMullins in 36833
* Fix: remove the redundant snippet of _whole_word_mask by HuangBugWei in 36759
* Shieldgemma2 by RyanMullins in 36678
* Fix ONNX export for sequence classification head by echarlaix in 36332
* Fix hqq skipped modules and dynamic quant by mobicham in 36821
* Use pyupgrade --py39-plus to improve code by cyyever in 36843
* Support loading Quark quantized models in Transformers by fxmarty-amd in 36372
* DeepSpeed tensor parallel+ZeRO by inkcherry in 36825
* Refactor Attention implementation for ViT-based models by qubvel in 36545
* Add Prompt Depth Anything Model by haotongl in 35401
* Add model visual debugger by molbap in 36798
* [torchao] revert to get_apply_tensor_subclass by SunMarc in 36849
* Gemma3: fix test by zucchini-nlp in 36820
* [CI] fix update metadata job by gante in 36850
* Add support for seed in `DataCollatorForLanguageModeling` by capemox in 36497
* Refactor Aya Vision with modular by yonigozlan in 36688
* Mllama: raise better error by zucchini-nlp in 35934
* [CI] doc builder without custom image by gante in 36862
* FIX FSDP plugin update for QLoRA by BenjaminBossan in 36720
* Remove call to `.item` in `get_batch_samples` by regisss in 36861
* chore: fix typos in the tests directory by threewebcode in 36813
* Make ViTPooler configurable by sebbaur in 36517
* Revert "Update deprecated Jax calls by ArthurZucker in 35919)"
* [generate] model defaults being inherited only happens for newer models by gante in 36881
* :red_circle: :red_circle: :red_circle: supersede paligemma forward to shift pos id indexing by molbap in 36859
* Gemma 3 tests expect greedy decoding by molbap in 36882
* Use `deformable_detr` kernel from the Hub by danieldk in 36853
* Minor Gemma 3 fixes by molbap in 36884
* Fix: dtype cannot be str by zucchini-nlp in 36262

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* IlyasMoutawwakil
* Make cache traceable (35873)
* HPU support (36424)
* Fix grad accum arbitrary value (36691)
* orrzohar
* SmolVLM2 (36126)
* threewebcode
* chore: fix function argument descriptions (36392)
* chore: fix message descriptions in arguments and comments (36504)
* chore: enhance messages in docstrings (36525)
* chore: enhance message descriptions in parameters,comments,logs and docstrings (36554)
* chore: fix typos in language models (36586)
* fix typos in the docs directory (36639)
* chore: fix typos in utils module (36668)
* fix typos in the tests directory (36717)
* chore: fix typos in tests directory (36785)
* chore: fix typos in the tests directory (36813)
* aymeric-roucher
* Deprecate transformers.agents (36415)
* keyboardAnt
* Universal Speculative Decoding `CandidateGenerator` (35029)
* EduardoPach
* [GroundingDino] Fix grounding dino loss 🚨 (31828)
* co63oc
* Fix some typos in docs (36502)
* chore: Fix typos in docs and examples (36524)
* Fix typos in tests (36547)
* RyanMullins
* Gemma3 (36658)
* Gemma 3: Adding explicit GenerationConfig and refactoring conversion … (36833)
* Shieldgemma2 (36678)
* cyyever
* Use pyupgrade --py39-plus to improve code (36843)
* haotongl
* Add Prompt Depth Anything Model (35401)
* danieldk
* Use `deformable_detr` kernel from the Hub (36853)

v4.49.0-Mistral-3
A new model is added to transformers: Mistral 3.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformersv4.49.0-Mistral-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Mistral 3

![image](https://github.com/user-attachments/assets/2f3ae3f3-87e8-4a64-a72e-6167ff6d90e5)

The model is detailed in the following [blog post](https://mistral.ai/news/mistral-small-3-1).
The models are available on the Hub with the following tag: [`mistral3`](https://huggingface.co/models?other=mistral3)

Overview

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.

This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez) and [yonigozlan](https://huggingface.co/yonigozlan).

The original code can be found [here](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/pixtral.py) and [here](https://github.com/mistralai/mistral-common).

Usage example

Inference with Pipeline

Here is how you can use the `image-text-to-text` pipeline to perform inference with the `Mistral3` models in just a few lines of code:
python
>>> from transformers import pipeline

>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]

>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'

Inference on a single image

This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.

python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...

Text-only generation
This example shows how to generate text using the Mistral3 model without providing any image input.

`python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = ".mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

>>> messages = [
... {"role": "system", "content": SYSTEM_PROMPT},
... {"role": "user", "content": user_prompt},
... ]

>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(0, dtype=torch.float16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!

/\_/\
( o.o )
> ^ <
"
`

Batched image and text inputs
Mistral3 models also support batched image and text inputs.

python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path"
, "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]

Batched multi-image input and quantization with BitsAndBytes
This implementation of the Mistral3 models supports batched text-images inputs with different number of images for each text.
This example also how to use `BitsAndBytes` to load the model in 4bit quantization.

python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
... model_checkpoint, quantization_config=quantization_config
... )

>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
>>> ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]

v4.49.0-Gemma-3
A new model is added to transformers: Gemma 3.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformersv4.49.0-Gemma-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Gemma 3

![image](https://github.com/user-attachments/assets/51d0fffa-9cd0-4616-b80b-9a303eeb1b25)

The model is detailed in the following [blog post](https://huggingface.co/blog/gemma3).
The models and demos using the model are available in the following [collection](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d).

A Space to play around with the [12B-it flavor is available here](https://huggingface.co/spaces/huggingface-projects/gemma-3-12b-it).

Overview

The Gemma 3 model was proposed by Google. It is a vision-language model composed by a [SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip) vision encoder and a [Gemma 2](https://huggingface.co/docs/transformers/model_doc/gemma_2) language decoder linked by a multimodal linear projection.

It cuts an image into a fixed number of tokens same way as Siglip if the image does not exceed certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller pacthes and concatenates them with the base image embedding.

One particularity is that the model uses bidirectional attention on all the image tokens. Also, the model interleaves sliding window local attention with full causal attention in the language backbone, where each sixth layer is a full causal attention layer.

Usage tips

- For image+text and image-only inputs use `Gemma3ForConditionalGeneration`.
- For text-only inputs use `Gemma3ForCausalLM` for generation to avoid loading the vision tower.
- Each sample can contain multiple images, and the number of images can vary between samples. However make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
- The text passed to the processor should have the `"<start_of_image_>"` token where the images should be inserted.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from `apply_chat_template`. See the examples below for more details on how to use it.

Image cropping for high resolution images

The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set `do_pan_and_scan=True` to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.

Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.

python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are a helpful assistant."}
]
},
{
"role": "user", "content": [
{"type": "image", "url": url},
{"type": "text", "text": "What is shown in this image?"},
]
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
return_dict=True,
return_tensors="pt",
add_generation_prompt=True,
do_pan_and_scan=True,
).to(model.device)

Usage Example

Single-image Inference

python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are a helpful assistant."}
]
},
{
"role": "user", "content": [
{"type": "image", "url": url},
{"type": "text", "text": "What is shown in this image?"},
]
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
return_dict=True,
return_tensors="pt",
add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])

Multi-image Inference

python
from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are a helpful assistant."}
]
},
{
"role": "user", "content": [
{"type": "image", "url": url_cow},
{"type": "image", "url": url_stop},
{"type": "text", "text": "Are these two images identical?"},
]
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
return_dict=True,
return_tensors="pt",
add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True)[inputs.input_ids.shape[1]: ])

Text-only inference

python
from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")

input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(text)

v4.49.0-AyaVision
A new model is added to transformers: Aya Vision.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformersv4.49.0-AyaVision

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Aya Vision

![image](https://github.com/user-attachments/assets/8a90d406-ed2e-435c-931c-07c5eaed9f62)

The model is detailed in the following [blog post](https://huggingface.co/blog/aya-vision).

Overview

The Aya Vision 8B and 32B models is a state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.

Aya Vision 8B combines the `Siglip2-so400-384-14` vision encoder with the Cohere CommandR-7B language model further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Whereas, Aya Vision 32B uses Aya Expanse 32B as the language model.

Key features of Aya Vision include:
- Multimodal capabilities in 23 languages
- Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
- High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
- Seamless integration of visual and textual information in 23 languages.

Usage Example

Here's an example usage of the Aya Vision model.

py
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereForAI/aya-vision-32b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, device_map="auto", torch_dtype=torch.float16
)

Format message with the aya-vision chat template
messages = [
{"role": "user",
"content": [
{"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
{"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
]},
]

inputs = processor.apply_chat_template(
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
**inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.3,
)

print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

v4.49.0-SigLIP-2
A new model is added to transformers: SigLIP-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: `v4.49.0-SigLIP-2`.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformersv4.49.0-SigLIP-2

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

SigLIP2

![image](https://github.com/user-attachments/assets/63122c75-4bfd-469d-8031-da40fc18ed0d)

The paper page for the model is available [here](https://huggingface.co/papers/2502.14786).
It is detailed in the following [blog post](https://huggingface.co/blog/siglip2).

The models and demos using the model are available in the following [collection](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107).

Overview

The SigLIP2 model was proposed in [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://huggingface.co/papers/2502.14786) by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin,
Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen,
Andreas Steiner and Xiaohua Zhai.

The model comes in two variants

1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in `transformers`)

The abstract from the paper is the following:

*We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success
of the original SigLIP. In this second iteration, we extend the original image-text training objective with
several prior, independently developed techniques into a unified recipe—this includes decoder-based
pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With
these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities,
including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot
accuracy), image-text retrieval, and transfer performance when extracting visual representations for
Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements
on localization and dense prediction tasks. We also train variants which support multiple resolutions
and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that
includes de-biasing techniques, leading to much better multilingual understanding and improved fair-
ness. To provide users with the ability to trade-off inference cost with performance, we release model
checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).*

Usage tips

- Usage of SigLIP2 is similar to [SigLIP](siglip) and [CLIP](clip). The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
- Training is supported but does not use `torch.distributed` utilities which may limit the scalability of batch size. However, DDP and FDSP works on single-node multi-gpu setup.
- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
- Model was trained with *lowercased* text, make sure you make the same preprocessing for your text labels.
- To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used.
- The NaFlex variant supports processing images at higher resolutions by adjusting the `max_num_patches` parameter in the `Processor`. The default value is `max_num_patches=256`. Increasing `max_num_patches` to 1024 (4x) will approximately double processed image height and width, while preserving the aspect ratio.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip2_metrics_table.png"
alt="drawing" width="600"/>

This model was contributed by [qubvel](https://huggingface.co/qubvel-hf).
The original code can be found [here](https://github.com/google-research/big_vision/tree/main).

Usage example

There are 2 main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the `Siglip2Model` class yourself.

FixRes variant

**Pipeline API**

The pipeline allows to use the model in a few lines of code:

python
>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> load pipe
>>> image_classifier = pipeline(
... task="zero-shot-image-classification",
... model="google/siglip2-base-patch16-224",
... )

>>> load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inference
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]

**Using the model yourself**

If you want to do the pre- and postprocessing yourself, here's how to do that:

python
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch

>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]

IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

>>> with torch.no_grad():
... outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")

3.1.0

Not secure

Pegasus, mBART, DPR, self-documented outputs and new pipelines

Pegasus

The Pegasus model from [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.

Model implemented as a collaboration between Jingqing Zhang and sshleifer in 6340

- PegasusForConditionalGeneration (torch version) 6340
- add pegasus finetuning script 6811 [script](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh). (warning very slow)

DPR

The DPR model from [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.

- Add DPR model 5279 (lhoestq)
- Fix tests imports dpr 5576 (lhoestq)

DeeBERT

The DeeBERT model from [DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference](https://www.aclweb.org/anthology/2020.acl-main.204/) by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the `examples/` folder alongside its training script, in PyTorch.

- Add DeeBERT (entropy-based early exiting for *BERT) 5477 (Ji-Xin)

Self-documented outputs

As well as returning tuples, PyTorch and TensorFlow models now return a subclass of `ModelOutput` that is appropriate. A `ModelOutput` is a dataclass containing all model returns. This allows for easier inspection, and for self-documenting model outputs.

- Change model outputs types to self-document outputs 5438 (sgugger)
- Tf model outputs 6247 (sgugger)

Models return tuples by default, and return self-documented outputs if the `return_dict` configuration flag is set to `True` or if the `return_dict=True` keyword argument is passed to the forward/call method.

Summary of the behavior:
python
The new outputs are opt-in, you have to activate them explicitly with `return_dict=True`
Either at instantiation
model = BertForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)
Or when calling the model
output = model(**inputs, return_dict=True)

You can access the elements of the outputs with
(1) named attributes
loss = outputs.loss
logits = outputs.logits

(2) their names as strings like a dict
loss = outputs["loss"]
logits = outputs["logits"]

(3) their index as integers or slices in the pre-3.1.0 outputs tuples
loss = outputs[0]
logits = outputs[1]
loss, logits = outputs[:2]

One **breaking behavior** of these new outputs (which is the reason you have to opt-in to use these new outputs:
Iterating on the outputs now return the names (keys) instead of the values:
print([element for element in outputs])
>>> ['loss', 'logits']
Thus you cannot unpack the output like pre-3.1.0 (you get the string names instead of the values):
(But you can query a slice like indicated in (3) above)
loss_keys, logits_key = outputs

Encoder-Decoder framework
The encoder-decoder framework has been enhanced to allow more encoder decoder model combinations, *e.g.*:
Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, ....

- [EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer 6411 (patrickvonplaten)
- [EncoderDecoder] Add Cross Attention for GPT2 6415 (patrickvonplaten)
- [EncoderDecoder] Add functionality to tie encoder decoder weights 6538 (patrickvonplaten)
- Multiple combinations of EncoderDecoder models have been fine-tuned and evaluated on CNN/Daily-Mail summarization: https://huggingface.co/models?search=cnn_dailymail-fp16 (patrickvonplaten)

TensorFlow as a first-class citizen

As we continue working towards having TensorFlow be a first-class citizen, we continually improve on our TensorFlow API and models.

- [Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile 5395 (patrickvonplaten)
- [Benchmark] Add benchmarks for TF Training 5594 (patrickvonplaten)
Machine Translation

MarianMTModel

- [en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh?text=My+name+is+Wolfgang+and+I+live+in+Berlin) and **357** other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (sshleifer + jorgtied). There are now > 1300 supported pairs for machine translation.
- Marian converter updates 6342 (sshleifer)
- Marian distill scripts + integration test 6799 (sshleifer)

mBART

The mBART model from [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) was can now be accessed through `MBartForConditionalGeneration`.

- Add mbart-large-cc25, support translation finetuning 5129 (sshleifer)
- [mbart] prepare_translation_batch passes **kwargs to allow DeprecationWarning 5581 (sshleifer)
- MBartForConditionalGeneration 6441 (patil-suraj)
- [fix] mbart_en_ro_generate test now identical to fairseq 5731 (sshleifer)
- [Doc] explaining romanian postprocessing for MBART BLEU hacking 5943 (sshleifer)
- [test] partial coverage for train_mbart_enro_cc25.sh 5976 (sshleifer)
- MbartTokenizer: do not hardcode vocab size 5998 (sshleifer)
- MBART: support summarization tasks where max_src_len > max_tgt_len 6003 (sshleifer)
- Fix 6096: MBartTokenizer's mask token 6098 (sshleifer)
- [s2s] Document better mbart finetuning command 6229 (sshleifer)
- mBART Conversion script 6230 (sshleifer)
- [s2s] add BartTranslationDistiller for distilling mBART 6363 (sshleifer)
- [Doc] add more MBart and other doc 6490 (patil-suraj)

examples/seq2seq

- examples/seq2seq/finetune.py supports --task translation
- All sequence to sequence tokenizers (T5, Bart, Marian, Pegasus) expose a `prepare_seq2seq_batch` method that makes batches for sequence to sequence trianing.

PRs:

- Seq2SeqDataset uses linecache to save memory 5792 (Pradhy729)
- [examples/seq2seq]: add --label_smoothing option 5919 (sshleifer)
- seq2seq/run_eval.py can take decoder_start_token_id 5949 (sshleifer)
- [examples (seq2seq)] fix preparing decoder_input_ids for T5 5994 (patil-suraj)
- [s2s] add support for overriding config params 6149 (stas00)
- s2s: fix LR logging, remove some dead code. 6205 (sshleifer)
- [s2s] tiny QOL improvement: run_eval prints scores 6341 (sshleifer)
- [s2s] fix label_smoothed_nll_loss 6344 (patil-suraj)
- [s2s] fix --gpus clarg collision 6358 (sshleifer)
- [s2s] Script to save wmt data to disk 6403 (sshleifer)
- rename prepare_translation_batch -> prepare_seq2seq_batch 6103 (sshleifer)
- Mult rouge by 100: standard units 6359 (sshleifer)
- allow spaces in bash args with "$" 6521 (sshleifer)
- [seq2seq] MAX_LEN env var for MT commands 5837 (sshleifer)
- [seq2seq] distillation.py accepts trainer arguments 5865 (sshleifer)
- [s2s]Use prepare_translation_batch for Marian finetuning 6293 (sshleifer)
- [BartTokenizer] add prepare s2s batch 6212 (patil-suraj)
- [T5Tokenizer] add prepare_seq2seq_batch method 6122 (patil-suraj)
- [s2s] round runtime in run_eval 6798 (sshleifer)
- [s2s README] Add more dataset download instructions 6737 (sshleifer)
- [s2s] round bleu, rouge to 4 digits 6704 (sshleifer)
- [s2s] command line args for faster val steps 6833

New documentation

Several new documentation pages have been added and older documentation has been tweaked to be more accurate and understandable. An open in colab button has been added on the tutorial pages.

- Guide to fixed-length model perplexity evaluation 5449 (joeddav)
- Improvements to PretrainedConfig documentation 5642 (sgugger)
- Document model outputs 5673 (sgugger)
- docs(wandb): explain how to use W&B integration 5607 (borisdayma)
- Model utils doc 6005 (sgugger)
- ONNX documentation 5992 (mfuntowicz)
- Tokenizer documentation 6110 (sgugger)
- Pipeline documentation 6175 (sgugger)
- Encoder decoder config docs 6195 (afcruzs)
- Colab button 6389 (sgugger)
- Generation documentation 6470 (sgugger)
- Add custom datasets tutorial 6466 (joeddav)
- Logging documentation 6852 (sgugger)

Trainer updates

New additions to the `Trainer`

- Added data collator for permutation (XLNet) language modeling and related calls 5522 (shngt)
- Trainer support for iterabledataset 5834 (Pradhy729)
- Adding PaddingDataCollator 6442 (sgugger)
- Add hyperparameter search to Trainer 6576 (sgugger)
- [examples] Add trainer support for question-answering 4829 (patil-suraj)
- Adds comet_ml to the list of auto-experiment loggers 6176 (dsblank)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task 6644 (HuangLianzhe)

New models & model architectures

The following model architectures have been added to the library

- FlaubertForTokenClassification 5644 (stas00)
- TFXLMForTokenClassification 5614 (LysandreJik)
- TFXLMForMultipleChoice 5614 (LysandreJik)
- TFFlaubertForTokenClassification 5614 (LysandreJik)
- TFFlaubertForMultipleChoice 5614 (LysandreJik)
- TFElectraForSequenceClassification 6227 (jplu)
- TFElectraForMultipleChoice 6227 (jplu)
- TF Longformer 5764 (patrickvonplaten)
- CamembertForCausalLM 6577 (patil-suraj)

Regression testing on TPU & TPU CI

Thanks to zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the `Trainer`, and offers very simple regression testing on model training performance.

- Test XLA examples 5583
- Add setup for TPU CI to run every hour. 6219 (zcain117)
- Add missing docker arg for TPU CI. 6393 (zcain117)
- Get GKE logs via kubectl logs instead of gcloud logging read. 6446 (zcain117)

New pipelines

New pipelines have been added:

- Zero shot classification pipeline 5760 (joeddav)
- Addition of a DialoguePipeline 5516 (guillaume-be)
- Add targets arg to fill-mask pipeline 6239 (joeddav)

Community notebooks

- [Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) #6321 (elsanns)
- Update ONNX notebook to include section on quantization. 6831 (mfuntowicz)

Centralized logging

Logging is now centralized. The library offers methods to handle the verbosity level of all loggers contained in the library. [Link to logging doc here]:

- Centralize logging 6434 (LysandreJik)

Bug fixes and improvements

- [Reformer] Adapt Reformer MaskedLM Attn mask 5560 (patrickvonplaten)
- Make T5 compatible with ONNX 5518 (abelriboulot)
- [Bart] enable test_torchscript, update test_tie_weights 5457 (sshleifer)
- [docs] fix model_doc links in model summary 5566 (patil-suraj)
- [Benchmark] Readme for benchmark 5363 (patrickvonplaten)
- Fix Inconsistent NER Grouping (Pipeline) 4987 (enzoampil)
- QA pipeline BART compatible 5496 (mfuntowicz)
- More explicit error when failing to tensorize overflowing tokens 5633 (LysandreJik)
- Should check that torch TPU is available 5636 (LysandreJik)
- Add forum link in the docs 5637 (sgugger)
- Fixed TextGenerationPipeline on torch + GPU 5629 (TevenLeScao)
- Fixed use of memories in XLNet (caching for language generation + warning when loading improper memoryless model) 5632 (TevenLeScao)
- [squad] add version tag to squad cache 5669 (lazovich)
- Deprecate old past arguments 5671 (sgugger)
- Pipeline model type check 5679 (JetRunner)
- rename the functions to match the rest of the test convention 5692 (stas00)
- doc improvements 5688 (stas00)
- Fix Trainer in DataParallel setting 5685 (sgugger)
- [Longformer] fix longformer global attention output 5659 (patrickvonplaten)
- [Fix] github actions CI by reverting 5138 5686 (sshleifer)
- [Reformer classification head] Implement the reformer model classification head for text classification 5198 (as-stevens)
- Cleanup bart caching logic 5640 (sshleifer)
- [AutoModels] Fix config params handling of all PT and TF AutoModels 5665 (patrickvonplaten)
- [cleanup] T5 test, warnings 5761 (sshleifer)
- [fix] T5 ONNX test: model.to(torch_device) 5769 (mfuntowicz)
- [Benchmark] fix benchmark non standard model 5801 (patrickvonplaten)
- [Benchmark] Fix models without `architectures` param in config 5808 (patrickvonplaten)
- [Longformer] fix longformer slow-down 5811 (patrickvonplaten)
- [seq2seq] pack_dataset.py rewrites dataset in max_tokens format 5819 (sshleifer)
- [seq2seq] Don't copy self.source in sortishsampler 5818 (sshleifer)
- [cleanups] make Marian save as Marian 5830 (sshleifer)
- [Reformer] - Cache hidden states and buckets to speed up inference 5578 (patrickvonplaten)
- Lightning Updates for v0.8.5 5798 (nateraw)
- Update tokenizers to 0.8.1.rc to fix Mac OS X issues 5867 (sepal)
- Xlnet outputs 5883 (TevenLeScao)

- DataParallel fixes 5733 (stas00)
- [cleanup] squad processor 5868 (sshleifer)
- Improve doc of use_cache 5912 (sgugger)
- [Fix] seq2seq pack_dataset.py actually packs 5913 (sshleifer)
- Add AlbertForPretraining to doc 5914 (sgugger)
- DataParallel fix: multi gpu evaluation 5926 (csarron)
- Clarify arg class 5916 (sgugger)

- [CI] self-scheduled runner tests examples/ 5927 (sshleifer)
- Update doc to new model outputs 5946 (sgugger)
- [CI] Install examples/requirements.txt 5956 (sshleifer)
- Expose padding_strategy on squad processor to fix QA pipeline performance regression 5932 (mfuntowicz)
- [docs] Add integration test example to copy pasta template 5961 (sshleifer)
- Cleanup Trainer and expose customization points 5982 (sgugger)
- Avoid unnecessary warnings when loading pretrained model 5922 (sgugger)
- Ensure OpenAI GPT position_ids is correctly initialized and registered at init. 5773 (mfuntowicz)
- [CI] Don't test apex 6021 (sshleifer)
- add a summary report flag for run_examples on CI 6035 (stas00)
- don't complain about missing W&B when WANDB_DISABLED=true 6036 (stas00)
- Allow to set Adam beta1, beta2 in TrainingArgs 5592 (gonglinyuan)
- Fix the return documentation rendering for all model outputs 6022 (sgugger)

- Fix typo (model saving TF) 5734 (Colanim)
- Add new AutoModel classes in pipeline 6062 (patil-suraj)
- [pack_dataset] don't sort before packing, only pack train 5954 (sshleifer)
- CL util to convert models to fp16 before upload 5953 (sshleifer)
- Add fire to setup.cfg to make isort happy 6066 (sgugger)
- [fix] no warning for position_ids buffer 6063 (sshleifer)
- Pipelines should use tuples instead of namedtuples 6061 (LysandreJik)
- Moving
transformers package import statements to relative imports in some files 5796 (afcruzs)
- github issue template suggests who to tag 5790 (sshleifer)
- Make all data collators accept dict 6065 (sgugger)
- Add inference widget examples 5825 (clmnt)
- [s2s] Delete useless method, log tokens_per_batch 6081 (sshleifer)
- Logs should not be hidden behind a logger.info 6097 (LysandreJik)
- Fix zero-shot pipeline single seq output shape 6104 (joeddav)
- [fix] add bart to LM_MAPPING 6099 (sshleifer)
- [Fix] position_ids tests again 6100 (sshleifer)
- Fix deebert tests 6102 (sshleifer)
- Use FutureWarning to deprecate 6111 (sgugger)
- Added capability to quantize a model while exporting through ONNX. 6089 (mfuntowicz)
- XLNet PLM Readme 6121 (LysandreJik)
- Fix TF CTRL model naming 6134 (jplu)
- Use google style to document properties 6130 (sgugger)
- Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} 5614
- Rework TF trainer 6038 (jplu)
- Actually the extra_id are from 0-99 and not from 1-100 5967 (orena1)
- add another e.g. to avoid confusion 6055 (orena1)
- Tf trainer cleanup 6143 (sgugger)
- Switch from return_tuple to return_dict 6138 (sgugger)
- Fix FlauBERT GPU test 6142 (LysandreJik)
- Enable ONNX/ONNXRuntime optimizations through converter script 6131 (mfuntowicz)
- Add Pytorch Native AMP support in Trainer 6151 (prajjwal1)
- enable easy checkout switch 5645 (stas00)
- Replace mecab-python3 with fugashi for Japanese tokenization 6086 (polm)
- parse arguments from dict 4869 (patil-suraj)
- Harmonize both Trainers API 6157 (sgugger)
- Model output test 6155 (sgugger)
- [s2s] clean up + doc 6184 (stas00)
- Add script to convert BERT tf2.x checkpoint to PyTorch 5791 (mar-muel)
- Empty assert hunt 6056 (TevenLeScao)
- Fix saved model creation 5468 (jplu)
- Adds train_batch_size, eval_batch_size, and n_gpu to to_sanitized_dict output for logging. 5331 (jaymody)
- [DataCollatorForLanguageModeling] fix labels 6213 (patil-suraj)
- Fix _shift_right function in TFT5PreTrainedModel 6214 (maurice-g)
- Remove outdated BERT tips 6217 (JetRunner)
- run_hans label fix 6221 (VictorSanh)
- Make the order of additional special tokens deterministic 5704 (gonglinyuan)
- cleanup torch unittests 6196 (stas00)
- test_tokenization_common.py: Remove redundant coverage 6224 (sshleifer)
- [Reformer] fix reformer fp16 test 6237 (patrickvonplaten)
- [Reformer] Make random seed generator available on random seed and not on model device 6244 (patrickvonplaten)
- Update to match renamed attributes in fairseq master 5972 (LilianBordeau)
- [WIP] lightning_base: support --lr_scheduler with multiple possibilities 6232 (stas00)
- Trainer + wandb quality of life logging tweaks 6241 (TevenLeScao)
- Add strip_accents to basic BertTokenizer. 6280 (PhilipMay)
- Argument to set GPT2 inner dimension 6296 (TevenLeScao)
- [Reformer] fix default generators for pytorch < 1.6 6300 (patrickvonplaten)
- Remove redundant line in run_pl_glue.py 6305 (xujiaze13)
- [Fix] text-classification PL example 6027 (bhashithe)
- fix the shuffle agrument usage and the default 6307 (stas00)
- CI dependency wheel caching 6287 (LysandreJik)
- Patch GPU failures 6281 (LysandreJik)
- fix consistency CrossEntropyLoss in modeling_bart 6265 (idoh)
- Add a script to check all models are tested and documented 6298 (sgugger)
- Fix the tests for Electra 6284 (jplu)
- [examples] consistently use --gpus, instead of --n_gpu 6315 (stas00)

- refactor almost identical tests 6339 (stas00)
- Small docfile fixes 6328 (sgugger)
- Patch models 6326 (LysandreJik)
- Ci GitHub caching 6382 (LysandreJik)
- Fix links for open in colab 6391 (sgugger)
- [EncoderDecoderModel] add a `add_cross_attention` boolean to config 6377 (patrickvonplaten)

- Feed forward chunking 6024 (Pradhy729)
- add pl_glue example test 6034 (stas00)
- testing utils: capturing std streams context manager 6231 (stas00)
- Fix tokenizer saving and loading error 6026 (yobekiko)
- Warn if debug requested without TPU 6390 (dmlap)
- [Performance improvement] "Bad tokens ids" optimization 6064 (guillaume-be)
- pl version: examples/requirements.txt is single source of truth 6309 (stas00)
- [s2s] wmt download script use less ram 6405 (stas00)

- [pl] restore lr logging behavior for glue, ner examples 6314 (stas00)
- lr_schedulers: add get_polynomial_decay_schedule_with_warmup 6361 (stas00)
- [examples] add pytest dependency 6425 (sshleifer)
- [test] replace capsys with the more refined CaptureStderr/CaptureStdout 6422 (stas00)
- Fixes to make life easier with the nlp library 6423 (sgugger)
- Move prediction_loss_only to TrainingArguments 6426 (sgugger)
- Activate check on the CI 6427 (sgugger)
- cleanup tf unittests: part 2 6260 (stas00)
- Fix docs and bad word tokens generation_utils.py 6387 (ZhuBaohe)
- Test model outputs equivalence 6445 (LysandreJik)
- add LongformerTokenizerFast in AutoTokenizer 6463 (patil-suraj)
- add BartTokenizerFast in AutoTokenizer 6464 (patil-suraj)
- Add POS tagging and Phrase chunking token classification examples 6457 (vblagoje)

- Clean directory after script testing 6453 (JetRunner)
- Use hash to clean the test dirs 6475 (JetRunner)
- Sort unique_no_split_tokens to make it deterministic 6461 (lhoestq)
- Fix TPU Convergence bug 6488 (jysohn23)
- Support additional dictionaries for BERT Japanese tokenizers 6515 (singletongue)
- [doc] Summary of the models fixes 6511 (stas00)
- Remove deprecated assertEquals 6532 (JetRunner)
- [testing] a new TestCasePlus subclass + get_auto_remove_tmp_dir() 6494 (stas00)
- [sched] polynomial_decay_schedule use default power=1.0 6473 (stas00)
- Fix flaky ONNX tests 6531 (mfuntowicz)

- [doc] make the text more readable, fix some typos, add some disambiguation 6508 (stas00)

- [doc] multiple corrections to "Summary of the tasks" 6509 (stas00)
- replace _ with __ rst links 6541 (stas00)
- Fixed label datatype for STS-B 6492 (amodaresi)
- fix incorrect codecov reports 6553 (stas00)
- [docs] Fix wrong newline in the middle of a paragraph 6573 (romainr)
- [docs] Fix number of 'ug' occurrences in tokenizer_summary 6574 (romainr)
- add BartConfig.force_bos_token_to_be_generated 6526 (sshleifer)
- Fix bart base test 6587 (sshleifer)
- Feed forward chunking others 6365 (Pradhy729)
- tf generation utils: remove unused kwargs 6591 (sshleifer)
- [BartTokenizerFast] add prepare_seq2seq_batch 6543 (patil-suraj)
- [doc] lighter 'make test' 6512 (stas00)
- [docs] Copy code button misses '...' prefixed code 6518 (romainr)
- removed redundant arg in prepare_inputs 6614 (prajjwal1)
- add intro to nlp lib & dataset links to custom datasets tutorial 6583 (joeddav)
- Add tests to Trainer 6605 (sgugger)
- TFTrainer dataset doc & fix evaluation bug 6618 (joeddav)
- Add tests/test_tokenization_reformer.py 6485 (D-Roberts)
- [Tests] fix attention masks in Tests 6621 (patrickvonplaten)
- XLNet Bug when training with apex 16-bit precision 6567 (johndolgov)
- Move threshold up for flaky test with Electra 6622 (sgugger)
- Regression test for pegasus bugfix 6606 (sshleifer)
- Trainer automatically drops unused columns in nlp datasets 6449 (sgugger)
- [Docs model summaries] Add pegasus to docs 6640 (patrickvonplaten)
- [Doc model summary] add MBart model summary 6649 (patil-suraj)
- Specify config filename in HfArgumentParser 6626 (jarednielsen)
- Don't reset the dataset type + plug for rm unused columns 6683 (sgugger)
- Fixed DataCollatorForLanguageModeling not accepting lists of lists 6685 (TevenLeScao)
- Update repo to isort v5 6686 (sgugger)
- Fix PL token classification examples 6682 (vblagoje)
- Lat fix for Ray HP search 6691 (sgugger)
- Create PULL_REQUEST_TEMPLATE.md 6660 (stas00)
- [doc] remove BartForConditionalGeneration.generate 6659 (stas00)
- Move unused args to kwargs 6694 (sgugger)
- [fixdoc] Add import to pegasus usage doc 6698 (sshleifer)
- Fix hyperparameter_search doc 6695 (sgugger)
- Remove hard-coded uses of float32 to fix mixed precision use 6648 (schmidek)
- Add DPR to models summary 6690 (lhoestq)
- Add typing.overload for convert_ids_tokens 6637 (tamuhey)
- Allow tests in examples to use cuda or fp16,if they are available 5512 (Joel-hanson)
- ci/gh/self-scheduled: add newline to make examples tests run even if src/ tests fail 6706 (sshleifer)
- Use separate tqdm progressbars 6696 (sgugger)
- More tests to Trainer 6699 (sgugger)
- Add tokenizer to Trainer 6689 (sgugger)
- tensor.nonzero() is deprecated in PyTorch 1.6 6715 (mfuntowicz)
- [Albert] Add position ids to allowed uninitialized weights 6719 (patrickvonplaten)
- Fix ONNX test_quantize unittest 6716 (mfuntowicz)
- [squad] make examples and dataset accessible from SquadDataset object 6710 (lazovich)
- Fix pegasus-xsum integration test 6726 (sshleifer)
- T5Tokenizer adds EOS token if not already added 5866 (sshleifer)
- Install nlp for github actions test 6728 (sgugger)
- [Torchscript] Fix docs 6740 (patrickvonplaten)
- Add "tie_word_embeddings" config param 6692 (patrickvonplaten)
- Fix tf boolean mask in graph mode 6741 (JayYip)
- Fix TF optimizer 6717 (jplu)
- [TF Longformer] Improve Speed for TF Longformer 6447 (patrickvonplaten)
- add __init__.py to utils 6754 (joeddav)
- [s2s] run_eval.py QOL improvements and cleanup 6746 (sshleifer)
- s2s distillation uses AutoModelForSeqToSeqLM 6761 (sshleifer)
- Add AdaFactor optimizer from fairseq 6722 (moscow25)
- Adds Adafactor to the docs and slightly fixes the formatting 6765 (LysandreJik)
- Fix the TF Trainer gradient accumulation and the TF NER example 6713 (jplu)
- Fix run_squad.py to work with BART 6756 (tomgrek)
- Add NLP install to self-scheduled CI 6767 (sshleifer)
- [testing] replace hardcoded paths to allow running tests from anywhere 6523 (stas00)
- [test schedulers] adjust to test the first step's reading 6429 (stas00)
- new Makefile target: docs 6510 (stas00)
- [transformers-cli] fix logger getter 6777 (stas00)
- PL: --adafactor option 6776 (sshleifer)
- [style] set the minimal required version for `black` 6784 (stas00)
- Transformer-XL: Improved tokenization with sacremoses 6322 (RafaelWO)
- prepare_seq2seq_batch makes labels/ decoder_input_ids made later. 6654 (sshleifer)
- t5 model should make decoder_attention_mask 6800 (sshleifer)
- [s2s] Test hub configs in self-scheduled CI 6809 (sshleifer)
- [bart] rename self-attention -> attention 6708 (sshleifer)
- [tests] fix typos in inputs 6818 (stas00)
- Fixed open in colab link 6825 (PandaWhoCodes)
- clarify shuffle 6312 (xujiaze13)
- TF Flaubert w/ pre-norm 6841 (LysandreJik)
- Fix resuming training for Windows 6847 (sgugger)
- Only access loss tensor every logging_steps 6802 (jysohn23)
- Add checkpointing to Ray Tune HPO 6747 (krfricke)
- Split hp search methods 6857 (sgugger)
- Fix marian slow test 6854 (sshleifer)
- Bart can make decoder_input_ids from labels 6758 (sshleifer)
- add a final report to all pytest jobs 6861 (stas00)
- Restore PaddingStrategy.MAX_LENGTH on QAPipeline while no v2. 6875 (mfuntowicz)
- [Generate] Facilitate PyTorch generate using `ModelOutputs` 6735 (patrickvonplaten)

3.0.2

Not secure

Tokenizer fixes

Fixes bugs introduced by v3.0.0 and v3.0.1 in tokenizers.

3.0.1

Not secure

Better backward-compatibility for tokenizers following v3.0.0 refactoring

Version v3.0.0, included a refactoring of the tokenizers' backend to allow a simpler and more flexible [user-facing API](https://huggingface.co/transformers/preprocessing.html).

This refactoring was conducted with a particular focus on keeping backward compatibility for the v2.X encoding, truncation and padding API but still led to two breaking changes that could have been avoided.

This patch aims to bring back better backward compatibility, by implementing the following updates:
- the `prepare_for_model` method is now publicly exposed again for both slow and fast tokenizers with an API compatible with both the v2.X truncation/padding API and [the v3.0 recommended API](https://huggingface.co/transformers/preprocessing.html).
- the truncation strategy now defaults again to `longest_first` instead of `first_only`.

Bug fixes and improvements:
- Better support for TransfoXL tokenizer when using TextGenerationPipeline https://github.com/huggingface/transformers/pull/5465 (TevenLeScao)
- Fix use of meme Transformer-XL generations https://github.com/huggingface/transformers/pull/4826 (tommccoy1)
- Fixing a bug in the NER pipeline which lead to discarding the last identified entity https://github.com/huggingface/transformers/pull/5439 (mfuntowicz and enzoampil)
- Better QAPipelines https://github.com/huggingface/transformers/pull/5429 (mfuntowicz)
- Add Question-Answering and MLM heads to the Reformer model https://github.com/huggingface/transformers/pull/5433 (patrickvonplaten)
- Refactoring the LongFormer https://github.com/huggingface/transformers/pull/5219 (patrickvonplaten)
- Various fixes on tokenizers and tests (sshleifer)
- Many improvements to the doc and tutorials (sgugger)
- Fix TensorFlow dataset generator in run_glue https://github.com/huggingface/transformers/pull/4881 (jplu)
- Update Bertabs example to work again https://github.com/huggingface/transformers/pull/5355 (MichaelJanz)
- Move GenerationMixin to separate file https://github.com/huggingface/transformers/pull/5254 (yjernite)

3.0.0

Not secure

New tokenizer API, TensorFlow improvements, enhanced documentation & tutorials

Breaking changes since `v2`

- In 4874 the language modeling BERT has been split in two: `BertForMaskedLM` and `BertLMHeadModel`. `BertForMaskedLM` therefore cannot do causal language modeling anymore, and cannot accept the `lm_labels` argument.
- The `Trainer` data collator is now a method instead of a class
- Directly setting a tokenizer special token attributes (e.g. `tokenizer.mask_token = '<mask>'` now only associate the token to the attribute of the tokenizer but doesn't add the token to the vocabulary if it is not in the vocabulary. Tokens are only added by using the `tokenizer.add_special_tokens()` and `tokenizer.add_tokens()` methods
- The `prepare_for_model` method was removed as part of the new tokenizer API.
- The truncation method is now `only_first` by default.

New Tokenizer API (n1t0, thomwolf, mfuntowicz)

The tokenizers has evolved quickly in version 2, with the addition of rust tokenizers. It now has a simpler and more flexible API aligned between Python (slow) and Rust (fast) tokenizers. This new API let you control truncation and padding deeper allowing things like dynamic padding or padding to a multiple of 8.

The redesigned API is explained in detail here 4510 and here: https://huggingface.co/transformers/master/preprocessing.html

Notable changes:

- it's now possible to truncate to the max input length of a model while padding the longest sequence in a batch
- padding and truncation are decoupled and easier to control
- it's possible to pad to a multiple of a predefined length, e.g. 8 which can give significant speeds up on recent NVIDIA GPU (V100)
- a generic wrapper using `tokenizer.__call__` can be used for all case (single sequence, pair of sequences to groups, batches, etc...)
- tokenizers now accept pre-tokenized inputs (when the input is already split in word strings e.g. for NER)
- All the Rust tokenizers are now fully tested like slow tokenizers
- A new class `AddedToken` can be used to have a more fine-grained control on how added tokens behave during tokenization. In particular the user can control (1) whether left and right spaces are removed around the token during tokenization (2) whether the token will be identified inside another word and (3) whether the token will be recognized in normalized forms (e.g. in lower case if the tokenizer uses lower-casing)
- Serialization issues where fixed
- Possiblity to create NumPy tensors when using `return_tensors` parameter on tokenizers.
- Introduced a new enum `TensorType` to map all the possible tensor backends we support: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`
- Tokenizers now accept `TensorType` enum on `encode(...)`, `encode_plus(...)`, `batch_encode_plus(...)` tokenizer method for `return_tensors` parameters.
- `BatchEncoding` new property `is_fast` indicates if the `BatchEncoding` comes from a Python (slow) tokenizer or a Rust (fast) tokenizer.
- Slow and Fast Tokenizers are now picklable. So is their output, the dict sub-class `BatchEncoding`.

Several PRs to make the API more stable have been made:

- [tokenizers] Fix 5081 and improve backward compatibility 5125 (thomwolf)
- Tokenizers API developments 5103 (thomwolf)
- Clearer error message in the use-case of 5169 (thomwolf)
- Add more tests on tokenizers serialization - fix bugs 5056 (thomwolf)
- [Tokenization] Fix 5181 - make 5155 more explicit - move back the default logging level in tests to WARNING 5252 (thomwolf)
- [tokenizers] Several small improvements and bug fixes 5287
- Add `pad_to_multiple_of` on tokenizers (reimport) 5054 (mfuntowicz)
- [tokenizers] Updates data processors, docstring, examples and model cards to the new API 5308

TensorFlow improvements (jplu, dzorlu, LysandreJik)

Very big release for TensorFlow!
- TensorFlow models can now compute the loss themselves, using the `TFPretrainedModel.compute_loss` method. 4530
- Can now resize token embeddings in TensorFlow 4351
- Cleaning TensorFlow models 5229

Enhanced documentation (sgugger)

We welcome sgugger as a team member in New York. He already introduced a lot of very cool documentation changes:

- Added a [model summary](https://huggingface.co/transformers/master/model_summary.html) #4789
- Expose classes used in documentation 4808
- Explain how to preview the docs in a PR 4795
- Clean documentation 4849
- Remove old doc page and add note about cache in installation 5027
- Fix all sphynx warnings 5068 (sgugger)
- Update pipeline examples to doctest syntax 5030
- Reorganize documentation 5064
- Update installation page and add contributing to the doc 5084
- Update glossary 5148
- Quick tour 5145
- Switch master/stable doc and add older releases 5193
- Add version control menu 5222
- Don't recreate old docs 5243
- Tokenization tutorial 5257
- Remove links for all docs 5280
- New model sharing tutorial 5323

Training & fine-tuning quickstart

- Our own joeddav added a training & fine-tuning quickstart to the documentation 5034!

MobileBERT

The MobileBERT from [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, was added to the library for both PyTorch and TensorFlow.

A single checkpoint is added: `mobilebert-uncased` which is the `uncased_L-24_H-128_B-512_A-4_F-4_OPT` checkpoint converted to our API.

This model was first implemented in PyTorch by lonePatient, ported to the library by vshampor, then finalized and implemented in Tensorflow by LysandreJik.

Eli5 examples (yjernite) 4968

- The examples/eli5 folder contains training code for the dense retriever and to fine-tune a BART model, the jupyter notebook for the blog post, and the code for the live demo.

- The RetriBert model implements the dense passage retriever. It's basically a wrapper for two Bert models and projection matrices, but it does gradient checkpointing in a way that is very different from a concurrent PR and Yacine thought it would be easier to write its own class for now and see if we can merge into the BART code later.

Enhanced examples/seq2seq (sshleifer)

- the `examples/seq2seq` [folder](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md) is a combination of the old `examples/summarization` and `examples/translation` folders.
- Finetuning works well for summarization, more experiments needed for translation. Finetuning works on multi-gpu, saves rouge scores during validation, and provides `--freeze_encoder` and `--freeze_embeds` options. These options make finetuning BART 5x faster on the cnn/dailymail dataset.
- Distillbart code is added in distillation.py. It only supports summarization, for now.
- Evaluation works well for both summarization and translation.
- New weights and biases [shared task](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md#xsum-shared-task) for collaboration on the XSUM summarization task

Distilbart (sshleifer)
- Distilbart models are smaller versions of `bart-large-cnn` and `bart-large-xsum`. They can be loaded using `BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-xsum-12-6')`, for example See this [tweet](https://twitter.com/sam_shleifer/status/1276160367853547522?s=20) for more info on available models and their speed/performance.
- Commands to reproduce are available in the `examples/seq2seq` [folder](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md)

BERT Loses Patience (JetRunner)

Add BERT Loses Patience (Patience-based Early Exit) based on the paper https://arxiv.org/abs/2006.04152 and the official implementation https://github.com/JetRunner/PABEE

Unifying `label` arguments (sgugger) 4722

- Deprecate any argument that's not `labels` (like `masked_lm_labels`, `lm_labels`, etc.) to `labels`.

NumPy type in tokenizers (mfuntowicz) 4585

Introduce a new tensor type for return_tensors on tokenizer for NumPy.

- As we're introducing more than two tensor backend alternatives I created an enum TensorType listing all the possible tensor we can create TensorType.TENSORFLOW, TensorType.PYTORCH, TensorType.NUMPY. This might help newcomers who don't know about "tf", "pt".
*Note: TensorType are compatible with previous "tf", "pt" and now "np" str to allow backward compatibility (+unittest)*

- Numpy is now a possible target when creating tensors. This is usefull for JAX.

Community notebooks

- Adding notebooks for Fine Tuning 4732 (abhimishra91):
- Multi-class classification: Using DistilBert
- Multi-label classification: Using Bert
- Summarization: Using T5 - Model Tracking with WandB
- [Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb) #5195 (pommedeterresautee)
- [How to use Benchmarks](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) (patrickvonplaten) #5312

Benchmarks (patrickvonplaten)

The benchmark script was consolidated and some features were added:

Adds the functionality to measure the following functionalities for TF and PT (4912):

- Tensorflow:
- Inference: CPU, GPU, GPU + XLA, GPU + eager mode, CPU + eager mode, TPU
- PyTorch:
- Inference: CPU, CPU + torchscript, GPU, GPU + torchscript, GPU + mixed precision, Torch/XLA TPU
- Training: CPU, GPU, GPU + mixed precision, Torch/XLA TPU

- [Benchmark] Add encoder decoder to benchmark and clean labels 4810
- [Benchmark] add tpu and torchscipt for benchmark 4850
- [Benchmark] Extend Benchmark to all model type extensions 5241
- [Benchmarks] improve Example Plotter 5245

Hidden states, attentions and cache

Before v3.0.0, the way to handle attentions, model hidden states, and whether to use the cache in models that have it for sequential decoding was to specify an argument in the configuration. In version v3.0.0, while we do maintain that argument for backwards compatibility, we introduce a new way of handling these through the `forward` and `call` methods.

- Output attentions 4538 (Bharat123rox)
- Output hidden states 4978 (drjosephliu)
- Use cache 5194 (patrickvonplaten)

Revamped `AutoModel`s (patrickvonplaten)

The `AutoModelWithLMHead` encompasses all models with a language modeling head, not making the distinction between causal, masked and seq2seq models. Three new auto models are added:

- `AutoModelForCausalLM` for Autoregressive models
- `AutoModelForMaskedLM` for Autoencoding models
- `AutoModelForSeq2SeqCausalLM` for Sequence-to-sequence models with causal LM for the decoder

New model & tokenizer architectures

- XLMRobertaForQuestionAnswering 4855 (sgugger)
- ElectraForQuestionAnswering 4913 (patil-suraj)
- Add AlbertForMultipleChoice 4959 (sgugger)
- BartForQuestionAnswering 4908 (patil-suraj)
- BartTokenizerFast 4878 (patil-suraj)
- Add DistilBertForMultipleChoice 5032 (sgugger)
- ElectraForMultipleChoice 4954 (sgugger)

ONNX

- Fixed a bug causing invalid ordering of the inputs in the underlying ONNX IR.
- Increased logging to giv ethe user more information about the exported variables.

Bug fixes and improvements

- TFRobertaModelIntegrationTest requires tf 4726 (sshleifer)
- Cleanup glue for TPU 4621 (jysohn23)
- [Reformer] Improved memory if input is shorter than chunk length 4720 (patrickvonplaten)
- Pipelines: miscellanea of QoL improvements and small features 4632 (julien-c)
- Fix bug when changing the <EOS> token for generate 4745 (patrickvonplaten)
- never_split on slow tokenizers should not split 4723 (mfuntowicz)
- PretrainedModel.generate: remove unused kwargs 4761 (sshleifer)
- Codecov is now setup differently to have better insights into code coverage 4768 (LysandreJik)
- Don't access pad_token_id if there is no pad_token 4773 (sgugger)
- Removed deprecated use of Variable API from pplm example 4619 (prajjwal1)
- Add drop_last arg for data loader 4757 4925 (setu4993)
- No silent error when XLNet's `d_head` is already in the configuration 4747 (LysandreJik)
- MarianTokenizer: delete unused constants 4802 (sshleifer)
- NER: Add new WNUT’17 example 4681 (stefan-it)
- [EncoderDecoderConfig] automatically set decoder config to decoder 4809 (patrickvonplaten)
- Add matplotlib to known 3rd party dependencies 4800 (sshleifer)
- Pipelines test and new kwarg 4812 (sshleifer)
- Updated path "cd examples/text-generation/pplm" 4778 (Mr-Ruben)
- [marian tests] pass device to pipeline 4815 (sshleifer)
- Export PretrainedBartModel from __init__ 4819 (BramVanroy)
- Updates args in tf squad example. 4820 (daniel-shan)
- [Generate] beam search should generate without replacement (patrickvonplaten)
- TFTrainer: Align how the checkpoints are managed the same way than in the PyTorch trainer. 4831 (jplu)
- [Longformer] Remove redundant code 4839 (ZhuBaohe)
- [cleanup] consolidate some prune_heads logic 4799 (sshleifer)
- Fix the __getattr__ method in BatchEncoding 4772 (jplu)
- Consolidate summarization examples 4837 (aretius)
- Fix a bug in the initialization and serialization of TFRobertaClassificationHead 4884 (harkous)
- [examples] Cleanup summarization docs 4876 (sshleifer)
- run_pplm.py bug fix 4867 (songyouwei)
- Remove unused arguments in Multiple Choice example 4853 (sgugger)
- Deal with multiple choice in common tests 4886 (sgugger
- Fix the CI 4903 (sgugger)
- [All models] fix docs after adding output attentions to all forward functions 4909 (patrickvonplaten)
- Add more models to common tests 4910 (sgugger)
- [ctrl] fix pruning of MultiHeadAttention 4904 (aretius)
- Don't init TPU device twice 4916 (patrickvonplaten)
- Run a single wandb instance per TPU run 4851 (LysandreJik)
- check type before logging in trainer to ensure values are scalars 4883 (mgoldey)
- Split LMBert model in two 4874 (sgugger)
- Make multiple choice models work with input_embeds 4921 (sgugger)
- Fix resize_token_embeddings for Transformer-XL 4759 (RafaelWO)
- [mbart] Fix fp16 testing logic 4949 (sshleifer)
- Hans data with newer tokenizer API 4854 (sgugger)
- Fix parameter 'output_attentions' docstring 4976 (ZhuBaohe)
- Improve ONNX logging 4999 (mfuntowicz)
- NER: fix construction of input examples for RoBERTa 4943 (stefan-it)
- Possible fix to make AMP work with DDP in the trainer 4728 (BramVanroy)
- Make DataCollator a callable 5015 (sgugger)
- Increase pipeline support for ONNX export. 5005 (mfuntowicz)
- Fix importing transformers on Windows - SIGKILL not defined 4997 (mfuntowicz)
- TFTrainer: improve logging 4946 (borisdayma)
- Add position_ids in TFElectra models docstring 5021 (sgugger)
- [Bart] Question Answering Model is added to tests 5024 (patrickvonplaten)
- Ability to pickle/unpickle BatchEncoding pickle (reimport) 5039 (mfuntowicz)
- refactor(wandb): consolidate import 5044 (borisdayma)
- [cleanup] Hoist ModelTester objects to top level 4939 (aretius)
- Convert hans to Trainer 5025 (sgugger)
- Fix marian tokenizer save pretrained 5043 (sshleifer)
- [cleanup] examples test_run_squad uses tiny model 5059 (sshleifer)
- Add header and fix command for HANS 5082 (sgugger)
- [examples] SummarizationModule improvements 4951 (sshleifer)
- Some changes to simplify the generation function 5031 (yjernite)
- Make default_data_collator more flexible and deprecate old behavior 5060 (sgugger)
- [MarianTokenizer] Switch to sacremoses for punc normalization 5092 (sshleifer)
- [style] add pandas to setup.cfg 5093 (sshleifer)
- [ElectraForQuestionAnswering] fix qa example in doc 4929 (patil-suraj)
- Fixing TPU training by disabling wandb.watch gradients logging 4926 (patil-suraj)
- [docs] fix T5 training doc 5080 (patil-suraj)
- support local_files_only option for tf models 5116 (ogarin)
- [cleanup] generate_beam_search comments 5115 (sshleifer)
- [fix] Move _adjust_logits above postprocess to fix Marian.generate 5126 (sshleifer)
- Pin `sphinx-rtd-theme` 5128 (LysandreJik)
- Add missing arg in 02-transformers notebook 5085 (pri-ax)
- [cleanup] remove redundant code in SummarizationDataset 5119 (sshleifer)
- AutoTokenizer supports mbart-large-en-ro 5121(sshleifer)
- Fix in Reformer Config documentation 5138 (erickrf)
- [bart-mnli] Fix class flipping bug 5141 (sshleifer)
- [MobileBert] fix dropout 5150 (ZhuBaohe)
- SummarizationPipeline: init required task name 5086 (julien-c)
- [examples] fixes arguments for summarization finetune scripts 5157 (ieBoytsov)
- Fixing docs for Encoder Decoder Config 5171 (mikaelsouza)
- fix bart doc 5132 (fuzihaofzh)
- Added feature to move added tokens in vocabulary for Transformer-XL 4953 (RafaelWO)
- Add support for gradient checkpointing in BERT 4659 (ibeltagy)
- Fix for IndexError when Roberta Tokenizer is called on empty text 4209 (malteos)
- Add TF auto model to the docs + fix sphinx warnings (again) 5187 (sgugger)
- Have documentation fail on warning 5189 (LysandreJik)
- Cleaner warning when loading pretrained models 4557 (thomwolf)
- Upgrade examples to pl=0.8.1 5146 (sshleifer)
- [fix] mobilebert had wrong path, causing slow test failure 5205 (sshleifer)
- [fix] remove unused import 5206 (sshleifer)
- [Reformer] Axial Pos Emb Improve mem usage reformer 5209 (patrickvonplaten)
- [pl_examples] revert deletion of optimizer_step 5227 (sshleifer)
- [bart] add config.extra_pos_embeddings to facilitate reuse 5190 (sshleifer)
- Only put tensors on a device 5223 (sgugger)
- Fix PABEE division by zero error 5233 (JetRunner)
- Use the script in utils 5224 (sgugger)
- Delay decay schedule until the end of warmup 4940 (amodaresi)
- Replace pad_token with -100 for LM loss calculation 4718 (setu4993)
- examples/seq2seq supports translation 5202 (sshleifer)
- Fix convert_graph_to_onnx script 5230 (n1t0)
- Refactor Code samples; Test code samples 5036 (LysandreJik)
- [Generation] fix docs for decoder_input_ids 5306 (patrickvonplaten)
- [pipelines] Change summarization default to distilbart-cnn-12-6 5289 (sshleifer)
- Add BART-base modeling and configuration 5315 (JetRunner)
- CircleCI stores cleaner output at test_outputs.txt 5291 (sshleifer)
- [pl_examples] default warmup steps=0 5316 (sshleifer)

Page 26 of 33

Releases

Has known vulnerabilities

Previous Next

Transformers

Page 26 of 33

3.2.0

3.1

3.1.0

3.0.2

3.0.1

3.0.0

Page 26 of 33

Links

Releases