Transformers


4.45.0

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

![image](https://github.com/user-attachments/assets/2b09ca55-b21c-4cea-80e7-32afc5ce8a76)

* Add MLLama by qubvel, zucchini-nlp and ArthurZucker in 33703
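
A minimal usage sketch is below; the checkpoint name, image URL and the `<|image|>` placeholder in the prompt are illustrative assumptions rather than details taken from these notes:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint name
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# any RGB image works; the URL is a placeholder
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
inputs = processor(images=image, text="<|image|>Describe this image.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```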

Qwen2-VL

Qwen2-VL is a major update over the previous Qwen-VL from the Qwen team.

An extract from the Qwen2-VL blog post is as follows:

Qwen2-VL is the latest version of the vision-language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL has the following capabilities:
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

![image](https://github.com/user-attachments/assets/d5689792-a5dd-4989-b66c-2cf4d398e89e)

* support qwen2-vl by simonJJJ in 32318
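
A sketch of single-image inference follows; the checkpoint name and local image path are assumptions, and the chat-template call mirrors the usual processor API:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint name
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# build a chat message containing one image and one question
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is in this picture?"}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("picture.png")  # placeholder path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```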

Qwen2-Audio

Qwen2-Audio is the new series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or producing direct textual responses to speech instructions.

They introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
- audio analysis: users could provide audio and text instructions for analysis during the interaction

![image](https://github.com/user-attachments/assets/221d8815-6657-4e25-b161-c1ca9728f89e)

* Add Qwen2-Audio by faychu in 32137
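
A rough audio-analysis sketch follows; the checkpoint name, local audio path and the audio placeholder tokens in the prompt are assumptions, not details from these notes:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"  # assumed checkpoint name
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# load a local clip at the feature extractor's sampling rate (path is a placeholder)
audio, _ = librosa.load("speech.wav", sr=processor.feature_extractor.sampling_rate)
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>What is being said in this clip?"
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```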

OLMoE

OLMoE is a series of **O**pen **L**anguage **M**odels using sparse **M**ixture-**o**f-**E**xperts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

![image](https://github.com/user-attachments/assets/948f5f52-7be6-47e2-9790-4d07cac26859)

* Add OLMoE by Muennighoff in 32406
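
A minimal text-generation sketch, assuming an illustrative OLMoE checkpoint name:

```python
from transformers import pipeline

# checkpoint name is an assumption; use whichever OLMoE checkpoint the team released
generator = pipeline("text-generation", model="allenai/OLMoE-1B-7B-0924")
print(generator("Mixture-of-experts language models are", max_new_tokens=30)[0]["generated_text"])
```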

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better handle high-resolution images and capture as much detail as possible. Videos, by contrast, are pooled to a total sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes (0.5B, 7B and 72B) and achieves remarkable performance on benchmark evaluations.

![image](https://github.com/user-attachments/assets/3c9e64a0-8ac9-4449-ba0e-a46cd434908e)

* Llava Onevision: add model by zucchini-nlp in 32673
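
A minimal single-image sketch; the checkpoint name and image path are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed checkpoint name
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

conversation = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=Image.open("photo.jpg"), text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```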

FalconMamba

The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.

The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and math data.

The team releases an accompanying [blog post](https://huggingface.co/blog/falconmamba).

![image](https://github.com/user-attachments/assets/b1f081c6-36b8-4f66-9091-e760163c8a61)

* Add new model by younesbelkada in 32615
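
A minimal generation sketch; the checkpoint name is taken from the accompanying blog post and should be treated as illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Technology Innovation Institute is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```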

Granite Language Models

The Granite model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.

![image](https://github.com/user-attachments/assets/2104b054-2490-41ec-ae09-bb37aad82fcc)

* Granite language models by mayank31398 in 31502
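
A minimal loading sketch via the new model class; the checkpoint name is an illustrative assumption:

```python
import torch
from transformers import AutoTokenizer, GraniteForCausalLM

model_id = "ibm/PowerLM-3b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GraniteForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```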

Granite MOE

The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.

* Granitemoe by mayank31398 in 33207

Descript-Audio-Codec

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

![image](https://github.com/user-attachments/assets/2cd49392-c3dc-4c57-bfc5-dab41b7d0861)

* Add Descript-Audio-Codec model by kamilakesbi in 31494
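
A sketch of the encode/decode round trip; the checkpoint name is an assumption and the input is just silence to exercise the API:

```python
import numpy as np
from transformers import AutoProcessor, DacModel

model_id = "descript/dac_44khz"  # assumed checkpoint name
model = DacModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# one second of silence, just to exercise the encode/decode round trip
waveform = np.zeros(processor.sampling_rate, dtype=np.float32)
inputs = processor(raw_audio=waveform, sampling_rate=processor.sampling_rate, return_tensors="pt")

encoded = model.encode(inputs["input_values"])  # discrete audio codes
reconstructed = model.decode(encoded.quantized_representation).audio_values  # waveform back from codes
```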

Pixtral

The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the [Llava](https://huggingface.co/docs/transformers/main/en/model_doc/llava) family, meaning image embeddings take the place of the [IMG] token placeholders.

The model uses [PixtralVisionModel](https://huggingface.co/docs/transformers/main/en/model_doc/pixtral#transformers.PixtralVisionModel) for its vision encoder, and [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM) for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).

* Add support for Pixtral by ArthurZucker in 33449

Mimi

The Mimi model was proposed in [Moshi: a speech-text foundation model for real-time dialogue](https://kyutai.org/Moshi.pdf) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.

![image](https://github.com/user-attachments/assets/2a45b304-5bcb-4c7b-984e-6c76f970b56f)

* Codec integration by ylacombe in 33565
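
A sketch of mapping a waveform to audio tokens and back; the checkpoint name is an assumption and the input is just silence to exercise the API:

```python
import numpy as np
from transformers import AutoFeatureExtractor, MimiModel

model_id = "kyutai/mimi"  # assumed checkpoint name
model = MimiModel.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# one second of silence at the model's sampling rate
waveform = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(raw_audio=waveform, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

codes = model.encode(inputs["input_values"]).audio_codes  # the "audio tokens"
audio = model.decode(codes).audio_values                  # waveform reconstructed from the tokens
```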

OmDet-Turbo

The OmDet-Turbo model was proposed in [Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892) by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.

![image](https://github.com/user-attachments/assets/848e91e3-81b9-4362-955a-519eaf9a871d)

* Add OmDet-Turbo by yonigozlan in 31843
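
A minimal open-vocabulary detection sketch; the checkpoint name and query classes are illustrative assumptions:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

model_id = "omlab/omdet-turbo-swin-tiny-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = OmDetTurboForObjectDetection.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
classes = ["cat", "remote"]  # open-vocabulary: any class names can be queried
inputs = processor(image, text=classes, return_tensors="pt")
outputs = model(**inputs)

# boxes, scores and labels can then be extracted with the processor's
# grounded object detection post-processing utilities
```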

Quantization

GGUF

GGUF support continues to be enhanced in the library: GGUF models can be loaded within `transformers` by dequantizing them, and later re-quantized for re-use within the GGUF/GGML ecosystem. A loading sketch follows the list of PRs below.

* Add Qwen2Moe GGUF loading support by VladOS95-cyber in 33264
* Fix incorrect vocab size retrieval in GGUF config by Isotr0py in 32551
* Add chat_template for tokenizer extracted from GGUF model by Isotr0py in 32908
* 🚨 Support dequantization for most GGML types by Isotr0py in 32625
* Add support for GGUF Phi-3 by a8nova in 31844
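
A loading sketch using the `gguf_file` argument; the repository and file names are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# repository and file names below are placeholders -- point them at the GGUF checkpoint you use
model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)  # weights are dequantized on load
```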

Torch AO

An ongoing effort is to add the ability to use `torchao` as a quantization backend. Future PRs will enable saving and fine-tuning with `peft`.

* Add TorchAOHfQuantizer by jerryzh168 in 32306
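
A minimal sketch of loading a model with a `torchao` quantization config; the checkpoint name and group size are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

# int4 weight-only quantization; checkpoint name and group size are illustrative
quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
```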

Liger Kernel

The Liger kernel is now supported in the `Trainer` class.

* Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by JasonZhu1313 in 32860
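
A minimal sketch, assuming the `use_liger_kernel` flag on `TrainingArguments`; the output directory is a placeholder:

```python
from transformers import TrainingArguments

# enabling the kernels is a single flag; the output directory is a placeholder
args = TrainingArguments(output_dir="out", use_liger_kernel=True)
```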

Modular Transformers

This PR introduces modularity for `transformers`, something that has so far been deliberately disallowed in the library (see the [blog post](https://huggingface.co/blog/transformers-design-philosophy) for the accompanying design philosophy).

The core idea behind this PR is to facilitate model additions by enabling Pythonic inheritance while staying true to our single-file policy, in which models/processors must be contained within a single file so that users can work with the whole object without going through 10 layers of abstraction.

It is heavily recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248

![image](https://github.com/user-attachments/assets/307f7415-54d2-4680-b056-aa88a6459777)

* Modular `transformers`: modularity and inheritance for new model additions by ArthurZucker in 33248
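
A rough, hypothetical sketch of what a modular definition file could look like; the file name, class names and exact conventions here are illustrative assumptions, and the PR description is the reference for the actual mechanism:

```python
# Hypothetical modular_mynewmodel.py: classes inherit from an existing model and only
# the differences are spelled out; the modular converter then expands this into a
# self-contained modeling file that still respects the single-file policy.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaMLP


class MyNewModelConfig(LlamaConfig):
    model_type = "mynewmodel"


class MyNewModelMLP(LlamaMLP):
    pass  # override only what differs from Llama


class MyNewModelForCausalLM(LlamaForCausalLM):
    config_class = MyNewModelConfig
```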

Agents

`Agents` continue to be improved with each release; this time, it becomes much simpler to leverage a local engine through a local Transformers Engine.

* Multi agents with manager by aymeric-roucher in 32687
* Add new documentation page for advanced agent usage by aymeric-roucher in 33265
* Create local Transformers Engine by aymeric-roucher in 33218
* Agents use grammar by aymeric-roucher in 31735

Dynamic cache for decoder-only models

This PR adds support for the dynamic cache to all decoder-only models (except for XLNet).

The documentation for the Dynamic cache can be found [here](https://huggingface.co/docs/transformers/main/en/internal/generation_utils#transformers.DynamicCache), and documentation related to the KV cache in `transformers` in general can be found [here](https://huggingface.co/docs/transformers/main/en/kv_cache).

* Cache: new Cache format in decoder-only models by zucchini-nlp in 31421
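
A short usage sketch, passing a `DynamicCache` object explicitly to `generate`; `gpt2` is just an illustrative decoder-only checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dynamic cache grows with", return_tensors="pt")
past_key_values = DynamicCache()  # pass a Cache object instead of the legacy tuple format
outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=10)
print(past_key_values.get_seq_length())  # number of cached key/value positions after generation
```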

Chat templates updates

We've made several updates to our handling of chat models and chat templates. The most noticeable change is that **assistant prefill** is now supported. This means you can end a chat with an `assistant` message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model_checkpoint)

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'}
]

output = pipe(chat)  # The model will continue outputting JSON!
```


We've also enabled several new functionalities in Jinja that will allow more powerful templates in future, including [Loop Controls](https://jinja.palletsprojects.com/en/3.0.x/templates/#loop-controls) and a `strftime_now` function that can get the current date and time, which is commonly used in system messages. For more details, see the updated [chat template docs](https://huggingface.co/docs/transformers/main/en/chat_templating).

* Enable some Jinja extensions and add datetime capabilities by Rocketknight1 in 32684
* Update Jinja docs with new functions and general cleanup by Rocketknight1 in 33097
* Add assistant prefill for chat templates and TextGenerationPipeline by Rocketknight1 in 33198
* Add a warning to the chat template docs about the tool_calls format by Rocketknight1 in 33277
* Add tip to clarify tool calling by Rocketknight1 in 32883


Bugfixes and improvements

* 🌐 [i18n-KO] Translated `mask_generation.md` to Korean by jeongiin in 32257
* 🌐 [i18n-KO] Translated `idefics.md` to Korean by boyunJang in 32258
* 🌐 [i18n-KO] Translated `image_to_image.md` to Korean by shinhyunji36 in 32327
* Gemma2: add cache warning by zucchini-nlp in 32279
* enable xla fsdp by hanwen-sun in 32048
* Fix typo in tokenization_utils_base.py by blubitz in 32484
* fix broken link in docs by jorahn in 32491
* Docs: alert for the possibility of manipulating logits by gante in 32467
* 🌐 [i18n-KO] Translated `gptq.md` to Korean by 1kmmk1 in 32293
* 🌐 [i18n-KO] Translated `prompting.md` to Korean by chhaewxn in 32294
* 🌐 [i18n-KO] Translated `quantization/quanto.md` to Korean by fabxoe in 32281
* 🌐 [i18n-KO] Translated `image_feature_extraction.md` to Korean by mreraser in 32239
* Fix references to model google mt5 small by JuanFKurucz in 32497
* Docs: Fixed WhisperModel.forward’s docstring link by Sai-Suraj-27 in 32498
* 🌐 [i18n-KO] Translated `chat_templating.md` to Korean by enchantee00 in 32362
* Fix link to autoclass_tutorial.md in i18n.md by JuanFKurucz in 32501
* Fix typo: depracted -> deprecated by tomaarsen in 32489
* Fix issue 32518: Update llm_tutorial.md by doomdagadiggiedahdah in 32523
* Change Phi3 `_supports_sdpa` to True by pocca2048 in 32457
* Uniformize kwargs for processors - GroundingDINO by SangbumChoi in 31964
* Fix add-new-model-like by molbap in 31773
* filter flash_attn optional imports loading remote code by eaidova in 30954
* 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean by 010kim in 32372
* 🌐 [i18n-KO] Translated `trainer.md` to Korean by cjfghk5697 in 32260
* 🌐 [i18n-KO] Translated `eetq.md` to Korean by jun048098 in 32352
* 🌐 [i18n-KO] Translated `fsdp.md` to Korean by win2dvp21 in 32261
* 🌐 [i18n-KO] Translated `bitsandbytes.md` to Korean by SeungAhSon in 32408
* Fix generate with `inputs_embeds` as input by molbap in 32493
* Fixed test `test_static_cache_exportability` with torch 2.4.0 by guangy10 in 32516
* Fix code example to load bigcode starcoder2 7b by JuanFKurucz in 32474
* [docs] Translation guide by stevhliu in 32547
* Gemma2: fix FA2 generation by zucchini-nlp in 32553
* Fix a bug in Qwen2Audio by faychu in 32552
* fix slow integration gemma2 test by ArthurZucker in 32534
* fix non contiguous tensor value error in save_pretrained by congcongke in 32422
* 🌐 [i18n-KO] Translated `agent.md` to Korean by Jwaminju in 32351
* Fix: FA2 with packed training by zucchini-nlp in 32487
* Fix sliding window attention used in Gemma2FlashAttention2 by brcps12 in 32522
* fix: Fixed conditional check for `encodec` model names by Sai-Suraj-27 in 32581
* Fix `.push_to_hub(..., create_pr=True, revision="my-branch")` when creating PR on not-owned repo by Wauplin in 32094
* Cleanup tool calling documentation and rename doc by Rocketknight1 in 32337
* 🌐 [i18n-KO] Translated `deepspeed.md` to Korean by 4N3MONE in 32431
* 🌐 [i18n-KO] Translated `awq.md`to Korean by ahnjj in 32324
* fix: Fixed failing `test_find_base_model_checkpoint` by Sai-Suraj-27 in 32638
* "to be not" -> "not to be" by qgallouedec in 32636
* fix: Updated the `is_torch_mps_available()` function to include `min_version` argument by Sai-Suraj-27 in 32545
* Expand inputs in processors for VLMs by zucchini-nlp in 30962
* Automatically add `transformers` tag to the modelcard by LysandreJik in 32623
* Fix tests by molbap in 32649
* fix tensors on different devices in `WhisperGenerationMixin` by faaany in 32316
* Add support for GrokAdamW optimizer by ehartford in 32521
* Add Depth Anything V2 Metric models by bt2513 in 32126
* Fix: Fixed directory path for utils folder in `test_tokenization_utils.py` by Sai-Suraj-27 in 32601
* Modify ProcessorTesterMixin for better generalization by yonigozlan in 32637
* TF_Deberta supporting mixed precision by pinesnow72 in 32618
* Fix tests recurrent by molbap in 32651
* Support MUSA (Moore Threads GPU) backend in transformers by fmo-mt in 31913
* fix: Fixed failing tests in `tests/utils/test_add_new_model_like.py` by Sai-Suraj-27 in 32678
* Update translation docs review by stevhliu in 32662
* Fix `JetMoeIntegrationTest` by ydshieh in 32332
* Update the distributed CPU training on Kubernetes documentation by dmsuehir in 32669
* fix: Fixed unknown pytest config option `doctest_glob` by Sai-Suraj-27 in 32475
* Unpin deepspeed in Docker image/tests by muellerzr in 32572
* Updated workflows to the latest versions by Sai-Suraj-27 in 32405
* reopen: llava-next fails to consider padding_side during Training by jp1924 in 32679
* fix: Corrected ` falcon-mamba-7b` model checkpoint name by Sai-Suraj-27 in 32837
* fix: update doc link for runhouse in README.md by muddlebee in 32664
* VLMs: small clean-up for cache class by zucchini-nlp in 32417
* add back the position ids by ArthurZucker in 32554
* Use head_dim if in config for RoPE by suiyoubi in 32495
* Generate: unify `LogitsWarper` and `LogitsProcessor` by gante in 32626
* [tests] make test_sdpa_equivalence device-agnostic by faaany in 32520
* Cache: use `batch_size` instead of `max_batch_size` by gante in 32657
* Fix AutoConfig and AutoModel support for Llava-Next-Video by TKONIY in 32844
* improve _get_is_as_tensor_fns by zrr1999 in 32596
* Revert PR 32299, flag users when Zero-3 was missed by muellerzr in 32851
* fix multi-gpu with static cache by SunMarc in 32543
* Reduce the error log when using core models that need their weights renamed, and provide a step forward by muellerzr in 32656
* Make beam_constraints.Constraint.advance() docstring more accurate by alex-calderwood in 32674
* generate: missing `to` in DoLa body, causing exceptions in multi-gpu generation by gante in 32856
* Add Flax Dinov2 by MHRDYN7 in 31960
* support torch-speech by itazap in 32537
* [tests] make `test_sdpa_can_compile_dynamic` device-agnostic by faaany in 32519
* Add __repr__ for Conv1D by AaronZLT in 32425
* Support save/load ckpt for XLA FSDP by yitongh in 32311
* RT-DETR parameterized batchnorm freezing by AlanBlanchet in 32631
* Mamba / FalconMamba: Fix mamba left padding by younesbelkada in 32677
* Fix: Mamba2 generation mismatch between input_ids and inputs_embeds by vasqu in 32694
* Docs: Fixed `whisper-large-v2` model link in docs by Sai-Suraj-27 in 32871
* Allow-head-dim by ArthurZucker in 32857
* 🚨🚨🚨 Update min version of accelerate to 0.26.0 by SunMarc in 32627
* Fix repr for conv by ArthurZucker in 32897
* fix: jamba cache fails to use torch.nn.module by xgal in 32894
* Fix: Mamba2 `norm_before_gate` usage by vasqu in 32686
* Replace `tensor.norm()` with decomposed version for CLIP executorch export by qubvel in 32887
* link for optimizer names by nbroad1881 in 32400
* [i18n-ar] add README_ar.md to README.md by AhmedAlmaghz in 32583
* fix: [whisper] don't overwrite GenerationConfig's `return_timestamps` when `return_timestamps` is not passed to `generate` function by hrl in 31296
* Update docker image building by ArthurZucker in 32918
* Jamba: update integration tests by gante in 32250
* fix: Added missing `huggingface_hub` installation to workflows by Sai-Suraj-27 in 32891
* fix: no need to dtype A in jamba by xgal in 32924
* FEAT / Trainer: Add adamw 4bit optimizer by SunMarc in 31865
* CI: separate step to download nltk files by gante in 32935
* FIX / Hub: Also catch for `exceptions.ConnectionError` by younesbelkada in 31469
* Add SynCode to llm_tutorial by shubhamugare in 32884
* Fix benchmark script by ydshieh in 32635
* Improve greedy search memory usage by regisss in 32895
* fix: (issue 32689) `AttributeError` raised when using `Trainer` with `eval_on_start=True` in Jupyter Notebook. by fshp971 in 32849
* Gemma2: eager attention by default by gante in 32865
* [run_slow] idefics2 by andimarafioti in 32840
* Fix regression on `Processor.save_pretrained` caused by 31691 by leloykun in 32921
* 🌐 [i18n-KO] Translated `knowledge_distillation_for_image_classification.md to Korean" by JinukHong in 32334
* Generate: Deprecate returning legacy cache by default; Handle `use_cache=False` by gante in 32863
* docs: fix outdated link to TF32 explanation by anakin87 in 32947
* Reducing memory usage: removing useless logits computation in generate() by Cyrilvallez in 31292
* Forbid `PretrainedConfig` from saving `generate` parameters; Update deprecations in `generate`-related code 🧹 by gante in 32659
* DeviceGuard added to use Deformable Attention more safely on multi-GPU by DonggeunYu in 32910
* added doctring to SchedulerType class by Arunprakash-A in 32898
* Updated the custom_models.md changed cross_entropy code by S-M-J-I in 33118
* CI: add torchvision to the consistency image by gante in 32941
* Test: add higher `atol` in `test_forward_with_num_logits_to_keep` by gante in 33093
* mps: add `isin_mps_friendly`, a wrapper function for `torch.isin` by gante in 33099
* Add changes for uroman package to handle non-Roman characters by nandwalritik in 32404
* fix: Fixed `pydantic` required version in dockerfiles to make it compatible with DeepSpeed by Sai-Suraj-27 in 33105
* quickfix documentation by molbap in 32566
* Fixup py 38 type hints for mps friendly by muellerzr in 33128
* fix: Fixed CodeGenTokenizationTest::test_truncation failing test by Sai-Suraj-27 in 32850
* fix: multilingual midel convert to tflite get wrong token by Ayaa17 in 32079
* disable scheduled daily CI temporarily by ydshieh in 33136
* CI: fix `efficientnet` pipeline timeout and prevent future similar issues due to large image size by gante in 33123
* Log additional test metrics with the CometCallback by Lothiraldan in 33124
* [docs] add quick usage snippet to Whisper. by Vaibhavs10 in 31289
* Update stateful_callbacks state before saving checkpoint by pedrobrs in 32115
* fix Idefics2VisionConfig type annotation by chenzizhao in 33103
* Add a fix for custom code tokenizers in pipelines by Rocketknight1 in 32300
* Llama: make slow tests green 🟢 by gante in 33138
* fix redundant checkpointing in example training scripts by eminorhan in 33131
* update torch req for 4-bit optimizer by SunMarc in 33144
* 🌐 [i18n-KO] Translated `conversations.md` to Korean by newfull5 in 32468
* Very small change to one of the function parameters by alisalamatian1 in 32548
* 🚨 Add Blip2ForImageTextRetrieval by jpizarrom in 29261
* fix model name and copyright by mayank31398 in 33152
* Fix: Jamba batched generation by vasqu in 32914
* [whisper] pass attention_mask to generate_with_fallback() by benniekiss in 33145
* [RoBERTa-based] Add support for sdpa by hackyon in 30510
* Fix import paths for test_module by rasmi in 32888
* Zero-shot pipelines: minor doc changes by pcuenca in 33127
* Customise the separator used for splicing in DataCollatorWithFlattening by beep-bebop in 33114
* Fix spell mistakes by matsuo1234567 in 33149
* update push CI workflow files for security by ydshieh in 33142
* added quick clarification by DuyguA in 33166
* pass module to Params4bit.from_prequantized to ensure quant_state by winglian in 32524
* Mamba2 conversion script for original models by vasqu in 32580
* Add a static cache that offloads to the CPU or other device by gerbenvv in 32161
* use a single for loop by ArthurZucker in 33148
* Pipeline: fix bad generation kwargs docs by gante in 33205
* Add missing quotes in modeling_llava_next_video.py by juliendenize in 33214
* Add warning for stop string edge case by Rocketknight1 in 33169
* Fix local repos with remote code not registering for pipelines by Rocketknight1 in 33100
* Refactor CI: more explicit by ArthurZucker in 30674
* 🌐 [i18n-KO] Translated `llm_optims.md` to Korean by yijun-lee in 32325
* Fix red amin by ArthurZucker in 33220
* Test fetcher: missing return on filtered tests; don't write empty files by gante in 33224
* Generate: throw warning when `return_dict_in_generate` is False but should be True by gante in 33146
* Add video text to text docs by merveenoyan in 33164
* Add GraniteRMSNorm by NielsRogge in 33177
* Add duckduckgo search tool by aymeric-roucher in 32882
* Fix: Suppressed 'use_reentrant=False' warning by ankush13r in 33208
* docs: Replace package abbreviations with full name(`bitsandbytes`) in docstrings by rapsealk in 33230
* Generate: fix assistant in different device by gante in 33257
* remove to restriction for 4-bit model by SunMarc in 33122
* Fixed typo repeated word in DETR docs by sergiopaniego in 33250
* Fix: use `torch.from_numpy()` to create tensors for np.ndarrays by shinyano in 33201
* remove torch input dependant control flow by ArthurZucker in 33245
* Fix: `num_logits_to_keep` in composite models by zucchini-nlp in 33168
* Fix Bark saving by ylacombe in 33266
* Update chat template docs to remove Blenderbot by Rocketknight1 in 33254
* Add sdpa support for Albert by OmarManzoor in 32092
* Only disallow DeepSpeed Zero-3 for auto bs finder by muellerzr in 31731
* fix the parallel number of CI nodes when it is smaller than number of tests by ArthurZucker in 33276
* Repo checks: check documented methods exist by gante in 32320
* Fix: multigpu training by zucchini-nlp in 33271
* Cache docs: update by zucchini-nlp in 32929
* Config: unified logic to retrieve text config by gante in 33219
* [fix] LlavaNextProcessor '_get_unpadded_features' method by laurentd-lunit in 33263
* wait 15m before SSH into runner workflow stops by ydshieh in 33300
* Bugfix/alexsherstinsky/fix none check for attention factor in rope scaling 2024 08 28 0 by alexsherstinsky in 33188
* [InstructBLIP] qformer_tokenizer is required input by amyeroberts in 33222
* [BUG] fix upper nltk version by ylacombe in 33301
* Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading by matthewdouglas in 33154
* Add validate images and text inputs order util for processors and test_processing_utils by yonigozlan in 33285
* Fix: Fix `FalconMamba` training issues due to incompatible kernels by younesbelkada in 33195
* Add paper link by Muennighoff in 33305
* 🚨 Fix `torch.jit.trace` for `interpolate_pos_encoding` in all vision models by xenova in 33226
* Update SECURITY.md by Michellehbn in 32680
* simple align qwen2vl kv_seq_len calculation with qwen2 by simonJJJ in 33161
* Add a community notebook for fine-tuning with QLoRA, PEFT, and MLflow by daniellok-db in 33319
* Fix: StaticCache & `inputs_embeds` by zucchini-nlp in 32932
* Docs: add more cross-references to the KV cache docs by gante in 33323
* [whisper] alternative fix for long-form timestamps by sanchit-gandhi in 32131
* fix qwen2vl vision eager-attention by simonJJJ in 33213
* Load dynamic module (remote code) only once if code isn't change by XuehaiPan in 33162
* support loading model without config.json file by itazap in 32356
* Add validation for maximum sequence length in modeling_whisper.py by AmirMohammadFakhimi in 33196
* add self.head_dim for VisionAttention in Qwen2-VL by GeLee-Q in 33211
* support 3D attention mask in bert by gathierry in 32105
* Support reading tiktoken tokenizer.model file by itazap in 31656
* red-ci on main, fix copies by ArthurZucker in 33356
* RoPE: fix BC warning by gante in 33331
* Fix Prefill docs by Rocketknight1 in 33352
* Update author for QLorA/PEFT community notebook by daniellok-db in 33338
* add sdpa mbart by nbroad1881 in 32033
* Fix quantized cache tests by zucchini-nlp in 33351
* schedulefree optimizers by winglian in 30079
* Add visit webpage tool by aymeric-roucher in 33353
* Fixed Majority of the Typos in `transformers[en]` Documentation by nnilayy in 33350
* Compile compatibilty for decoder-only models by zucchini-nlp in 32617
* Adjust templates by LysandreJik in 33384
* Remove repeated prepare_images in processor tests by amyeroberts in 33163
* Fix import of `FalconMambaForCausalLM` by younesbelkada in 33381
* Import structure & first three model refactors by LysandreJik in 31329
* VLM: fixes after refactor by zucchini-nlp in 32907
* fixed Mask2Former image processor segmentation maps handling by maciej-adamiak in 33364
* Bug Fix: Update hub.py to fix NoneType error by rishiraj in 33315
* Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour by bruno-hays in 33390
* Make StaticCache configurable at model construct time by guangy10 in 32830
* use diff internal model in tests by itazap in 33387
* Fix `FbgemmFp8Linear` not preserving tensor shape by vgel in 33239
* Fix failing windows by LysandreJik in 33436
* Remove deprecated task in load_dataset by albertvillanova in 33433
* Dynamic number of speculative tokens in order to accelerate speculative decoding by jmamou in 33258
* Fix: Cast prefetch_bucket_size to integer for deepspeed >= 0.15 by kiddj in 33402
* [docs] add the missing huggingface hub username by faaany in 33431
* [docs] add the missing tokenizer when pushing models to huggingface hub by faaany in 33428
* Update stale.yml by LysandreJik in 33434
* Docs - update formatting of llama3 model card by MichaelCurrin in 33438
* Fix incomplete sentence in `Zero-shot object detection` documentation by sergiopaniego in 33430
* Fix flax whisper tokenizer bug by hannan72 in 33151
* Clean-up deprecated code by zucchini-nlp in 33446
* Fix default revision for pipelines by ankane in 33395
* Revive AMD scheduled CI by ydshieh in 33448
* Allow send `SSH into runner` info. to DM by ydshieh in 33346
* Correct Whisper's beam search scores computation by ylacombe in 32336
* Qwen2-VL: clean-up and add more tests by zucchini-nlp in 33354
* [whisper] Clarify error message when setting max_new_tokens by benniekiss in 33324
* [docs] refine the doc for `train with a script` by faaany in 33423
* Return image hidden states by zucchini-nlp in 33426
* add a callback hook right before the optimizer step by winglian in 33444
* Enable `padding_side` as call time kwargs by zucchini-nlp in 33385
* Mitigate a conflict when using sentencepiece by tengomucho in 33327
* [Phi-3] Bug on stale kv cache by garg-amit in 33129
* Fix the initialization of the cache when we have multi gpu by SunMarc in 33303
* Enable finetuning with torchao quantized model by SunMarc in 33361
* Corrected `Agents and tools` documentation links typos by sergiopaniego in 33471
* chore: fix typo in comment in tokenization_utils_base.py by DavidLemayian in 33466
* Cohere: update RoPE structure by gante in 33408
* Fix SSH workflow by ydshieh in 33451
* Add keypoint-detection task guide by merveenoyan in 33274
* Uniformize kwargs for LLaVa processor and update docs by yonigozlan in 32858
* `Agents, supercharged - Multi-agents, External tools, and more` docs typo fixed by sergiopaniego in 33478
* [i18n-ar] Add File : `docs/source/ar/_toctree.yml` by AhmedAlmaghz in 32696
* [Whisper test] Fix some failing tests by ylacombe in 33450
* Fix: Qwen2-VL training on video datasets by hiyouga in 33307
* Updated Trainer's liger-kernel integration to call correct patching API by shimizust in 33502
* Replace `accelerator.use_fp16` in examples by hlky in 33513
* Fix parametrization-based weight norm by ylacombe in 33275
* Fix number of patch check for different vision feature select strategy by insujang in 32494
* chore: migrate coverage cfg to pyproject.toml by SauravMaheshkar in 32650
* idefics2 enable_input_require_grads not aligned with disable_input_re… by sywangyi in 33194
* Update chameleon.md — fix runtime type error by maxwbuckley in 33494
* Add explicit example for RAG chat templating by A-Duss in 33503
* CI Build image - move runners by glegendre01 in 33530
* fix to jamba config, asserting attention and expert offset by ErezSC42 in 33316
* Fix missing `sequences_scores` in the Whisper beam search output by Nik-Kras in 32970
* Uniformize kwargs for Pixtral processor by yonigozlan in 33521
* Add revision to trainer push_to_hub by teamclouday in 33482
* fix patch_attention_mask incorrect setting which leads to the differe… by sywangyi in 33499
* Support LLaVa-OV-Chat by zucchini-nlp in 33532
* Decorator for easier tool building by aymeric-roucher in 33439
* Fix for slow the bug tokenizer adding spaces to single id decodes by DuyguA in 32564
* Chat template: save and load correctly for processors by zucchini-nlp in 33462
* Fix missing head_dim in llama config from gguf model by Isotr0py in 33526
* [i18n-ur] Added README_ur.md file by akkefa in 33461
* fix the wandb logging issue by ZIYU-DEEP in 33464
* Fix tests in ASR pipeline by ylacombe in 33545
* Added support for bfloat16 to zero-shot classification pipeline by umarbutler in 33554
* Pipeline: no side-effects on `model.config` and `model.generation_config` 🔫 by gante in 33480
* Return attention mask in ASR pipeline to avoid warnings by Rocketknight1 in 33509
* enforce original size to be a list by dom-dziela in 33564
* Improve compiled RT-DETR inference speed by yonigozlan in 33412
* Fix bnb dequantization by SunMarc in 33546
* Load and save video-processor from separate folder by zucchini-nlp in 33562
* VLMs: enable generation tests by zucchini-nlp in 33533
* rag: fix CI by gante in 33578
* Cache: don't show warning in forward passes when `past_key_values` is None by gante in 33541
* fix tests with main revision and read token by molbap in 33560
* add uniform processors for altclip + chinese_clip by molbap in 31198
* Generate: check that `attention_mask` is 2D by gante in 33575
* change sequence_bias type of SequenceBiasLogitsProcessor to list, add… by VladOS95-cyber in 33375
* [`Mamba2`] Move dt calculations to kernel by vasqu in 33520
* Cache: don't throw warnings on `gemma2` when instantiating a new cache by gante in 33595
* Uniformize kwargs for Paligemma processor and update docs by yonigozlan in 33571
* [tests] skip tests for xpu by faaany in 33553
* [tests] enable GemmaIntegrationTest on XPU by faaany in 33555
* Fix Llama 3 TikToken conversion by pcuenca in 33538
* Docs: add the ability to manually trigger jobs by gante in 33598
* Fix CircleCI nightly run by ydshieh in 33558
* Allow CI could be run on private forked repositories (e.g. new model additions) by ydshieh in 33594
* [tests] make more tests device-agnostic by faaany in 33580
* Update modeling_mamba2.py, fix pad size by klae01 in 32599
* Generate: remove flakyness in `test_generate_from_inputs_embeds_decoder_only` by gante in 33602
* Remove unnecessary CPM model tests by amyeroberts in 33621
* Add sdpa for BioGpt by OmarManzoor in 33592
* VLM generate: tests can't generate image/video tokens by gante in 33623
* Fix missing test in `torch_job` by ydshieh in 33593
* Add support for args to ProcessorMixin for backward compatibility by yonigozlan in 33479
* Fix contrastive search to correctly handle input with padding by ducviet00 in 33507
* Generate: assistant should sample when the main model samples by gante in 33534
* Fix some missing tests in circleci by ydshieh in 33559
* Update daily ci to use new cluster by ydshieh in 33627
* Fix qwen2vl float16 inference bug by GeLee-Q in 33312
* Fix typos by litianjian in 33583
* enable low-precision pipeline by jiqing-feng in 31625
* Pixtral update example checkpoint by amyeroberts in 33633
* Sdpa dino v2 by avishaiElmakies in 33403
* Clean up Unpack imports by molbap in 33631
* Fix DPT /Dinov2 sdpa regression on main by molbap in 33660
* handle dependency errors in check_imports by molbap in 33622
* add back self.max_position_embeddings = config.max_position_embeddings by chengchengpei in 33550
* Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower by Isotr0py in 33613
* Uniformize kwargs for Udop processor and update docs by yonigozlan in 33628
* Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin` by gante in 33203
* Enable BNB multi-backend support by jiqing-feng in 31098
* Fix error string after refactoring into get_chat_template by tibor-reiss in 33652
* uniformize git processor by yonigozlan in 33668
* Fix CIs post merging modular transformers by ArthurZucker in 33681
* Fixed docstring for cohere model regarding unavailability of prune_he… by mnauf in 33253
* Generation tests: update imagegpt input name, remove unused functions by gante in 33663
* Improve Error Messaging for Flash Attention 2 on CPU by sizhky in 33655
* Gemma2: fix config initialization (`cache_implementation`) by gante in 33684
* Fix ByteLevel alphabet missing when Sequence pretokenizer is used by umarbutler in 33556
* Uniformize kwargs for image-text-to-text processors by yonigozlan in 32544
* 🚨🚨 Setting default behavior of assisted decoding by jmamou in 33657
* tests: fix pytorch tensor placement errors by dvrogozh in 33485
* bump tokenizers, fix added tokens fast by ArthurZucker in 32535
* [Pixtral] Improve docs, rename model by NielsRogge in 33491

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* enchantee00
    * 🌐 [i18n-KO] Translated `chat_templating.md` to Korean (32362)
* faychu
    * Add Qwen2-Audio (32137)
    * Fix a bug in Qwen2Audio (32552)
* 010kim
    * 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean (32372)
* cjfghk5697
    * 🌐 [i18n-KO] Translated `trainer.md` to Korean (32260)
* younesbelkada
    * Add new model (32615)
    * Mamba / FalconMamba: Fix mamba left padding (32677)
    * FIX / Hub: Also catch for `exceptions.ConnectionError` (31469)
    * Fix: Fix `FalconMamba` training issues due to incompatible kernels (33195)
    * Fix import of `FalconMambaForCausalLM` (33381)
* 4N3MONE
    * 🌐 [i18n-KO] Translated `deepspeed.md` to Korean (32431)
* jerryzh168
    * Add TorchAOHfQuantizer (32306)
* MHRDYN7
    * Add Flax Dinov2 (31960)
* kamilakesbi
    * Add Descript-Audio-Codec model (31494)
* Isotr0py
    * Fix incorrect vocab size retrieval in GGUF config (32551)
    * Add chat_template for tokenizer extracted from GGUF model (32908)
    * 🚨 Support dequantization for most GGML types (32625)
    * Fix missing head_dim in llama config from gguf model (33526)
    * Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (33613)
* AhmedAlmaghz
    * [i18n-ar] add README_ar.md to README.md (32583)
    * [i18n-ar] Add File : `docs/source/ar/_toctree.yml` (32696)
* simonJJJ
    * support qwen2-vl (32318)
    * simple align qwen2vl kv_seq_len calculation with qwen2 (33161)
    * fix qwen2vl vision eager-attention (33213)
* jpizarrom
    * 🚨 Add Blip2ForImageTextRetrieval (29261)
* mayank31398
    * Granite language models (31502)
    * fix model name and copyright (33152)
    * Granitemoe (33207)
* hackyon
    * [RoBERTa-based] Add support for sdpa (30510)
* Muennighoff
    * Add OLMoE (32406)
    * Add paper link (33305)
* VladOS95-cyber
    * Add Qwen2Moe GGUF loading support (33264)
    * change sequence_bias type of SequenceBiasLogitsProcessor to list, add… (33375)
* jiqing-feng
    * enable low-precision pipeline (31625)
    * Enable BNB multi-backend support (31098)

4.44.2

Patch release v4.44.2, mostly fixing 2 regressions that were not caught, for Jamba and for processors!

- Fix: Jamba cache fails to use torch.nn.module (32894) Authored by xgal
- Fix: No need to dtype A in Jamba (32924) Authored by xgal
- Fix: Regression on Processor.save_pretrained caused by 31691 (32921) Authored by leloykun

4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

- is_torchdynamo_compiling -- cast a wide exception net (32476) by gante
- Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (32276)" (32477) by gante and matthewdouglas
- Gemma2: fix FA2 generation (32553) by zucchini-nlp
- Fix: FA2 with packed training (32487) by zucchini-nlp
- Fix sliding window attention used in Gemma2FlashAttention2 (32522) by brcps12
- Automatically add transformers tag to the modelcard (32623) by LysandreJik
- add back the position ids (32554) by ArthurZucker
- Use head_dim if in config for RoPE (32495) by suiyoubi and ArthurZucker
- Revert PR 32299, flag users when Zero-3 was missed (32851) by muellerzr
- fix multi-gpu with static cache (32543) by SunMarc
- Reduce the error log when using core models that need their weights r… (32656) by muellerzr
- Fix VLM generation issues (32836) by zucchini-nlp
- Fix generate with inputs_embeds as input (32493) (this PR includes some cherry-picked changes)

**Full Changelog**: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1

4.44.0

This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performance for everyone!

All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to gante, sanchit-gandhi and xenova

💥 End-to-end generation compile
*Generate: end-to-end compilation 30788 by gante*: `model.generate` now supports compiling! There are a few limitations, but here is a small snippet:

```python3
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
```



⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)
*3-5x faster torch.compile forward compilation for autoregressive decoder models 32227* by fxmarty.
As documented on the PR, this makes the whole generation a lot faster when you re-use the cache!
You can see this when you run `model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)`

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀
*Offloaded KV Cache 31325* by n17s: you just have to set `cache_implementation="offloaded"` when calling `generate`, or use it like this:
```python3
from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as
    num_beams=4, num_beam_groups=2, num_return_sequences=4,
    diversity_penalty=1.0, max_new_tokens=50, early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
```


📦 Torch export for static cache
The `pytorch` team gave us a great gift: you can now use `torch.export`, directly compatible with [Executorch](https://pytorch.org/executorch/main/index.html)! Find examples [here](https://github.com/huggingface/transformers/pull/31706).

* Make static cache compatible with torch.export 32168 by guangy10

This also unlocks support for prompt reuse:
```python3
import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
```


Gemma2: assisted decoding
*Gemma 2: support assisted generation 32357* by gante

We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it [here](https://huggingface.co/blog/gemma-july-update#assisted-generation).

```py
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
    "assistant_model": assistant_model,
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```


Nemotron support
![image](https://github.com/user-attachments/assets/512d3fbe-909b-4e45-9927-cab78e0f522a)
> Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to suiyoubi. See:
* Add Nemotron HF Support 31699


Codestral support
![image](https://github.com/user-attachments/assets/2827f950-f6c5-4fb8-8569-e8008aa79651)
> Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

It uses the Mamba2 architecture; it was a bit of a pain to remove all the einops, but we hope we made it better for everyone!

* Add codestral mamba2 32080 by molbap and vasqu

Breaking changes:
We removed the default chat templates **in the code**; they should all be on the Hub!
* 🚨 No more default chat templates 31733 by Rocketknight1

Long-form decoding for whisper, even faster:
Our great sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in
* [whisper] compile compatibility with long-form decoding 31772




What's Changed
* Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by RhuiDih in https://github.com/huggingface/transformers/pull/31629
* Updated `ruff` to the latest version by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
* fix by gante in https://github.com/huggingface/transformers/pull/32162
* fix: Fixed an if condition that is always evaluating to true by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32160
* [docs] change temperature to a positive value by faaany in https://github.com/huggingface/transformers/pull/32077
* adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by rohitdwivedula in https://github.com/huggingface/transformers/pull/32171
* fix: default value reflects the runtime environment variables rather than the ones present at import time. by junrae6454 in https://github.com/huggingface/transformers/pull/32153
* Update qwen2.md by ArtificialZeng in https://github.com/huggingface/transformers/pull/32108
* Remove conversational pipeline tests by amyeroberts in https://github.com/huggingface/transformers/pull/32099
* RoPE: relaxed rope validation by gante in https://github.com/huggingface/transformers/pull/32182
* let's not warn when someone is running a forward by ArthurZucker in https://github.com/huggingface/transformers/pull/32176
* Fix resize embedding with Deepspeed by zucchini-nlp in https://github.com/huggingface/transformers/pull/32192
* Fix float8_e4m3fn in modeling_utils by SunMarc in https://github.com/huggingface/transformers/pull/32193
* Support dequantizing GGUF FP16 format by PenutChen in https://github.com/huggingface/transformers/pull/31783
* :rotating_light: No more default chat templates by Rocketknight1 in https://github.com/huggingface/transformers/pull/31733
* fix: Replaced deprecated `unittest method` with the correct one by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
* [whisper] fix short-form output type by sanchit-gandhi in https://github.com/huggingface/transformers/pull/32178
* remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 by statelesshz in https://github.com/huggingface/transformers/pull/32210
* Update question_answering.py by avlewis in https://github.com/huggingface/transformers/pull/32208
* [BigBird Pegasus] set _supports_param_buffer_assignment to False by kashif in https://github.com/huggingface/transformers/pull/32222
* [warnings] fix E721 warnings by kashif in https://github.com/huggingface/transformers/pull/32223
* Follow up for 31973 by ydshieh in https://github.com/huggingface/transformers/pull/32025
* translate philosophy.md to chinese by statelesshz in https://github.com/huggingface/transformers/pull/32177
* Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac by jrhe in https://github.com/huggingface/transformers/pull/31846
* Fix code snippet for Grounding DINO by qubvel in https://github.com/huggingface/transformers/pull/32229
* Generation: stop at `eos` for assisted decoding by zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
* Llava: generate without images by zucchini-nlp in https://github.com/huggingface/transformers/pull/32183
* Resize embeds with DeepSpeed by zucchini-nlp in https://github.com/huggingface/transformers/pull/32214
* don't log base model architecture in wandb if log model is false by joaonadkarni in https://github.com/huggingface/transformers/pull/32143
* Refactor: Removed un-necessary `object` base class by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
* Adds: extra_repr for RMSNorm layers in most models by rohitdwivedula in https://github.com/huggingface/transformers/pull/32204
* Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 by catalys1 in https://github.com/huggingface/transformers/pull/31934
* [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` by faaany in https://github.com/huggingface/transformers/pull/32039
* Flash-Attn: fix generation when no attention mask or no pading by zucchini-nlp in https://github.com/huggingface/transformers/pull/32241
* More flexible trigger condition by ydshieh in https://github.com/huggingface/transformers/pull/32251
* Llama 3.1: replace for loop by tensor ops at inv_freq initialization by gante in https://github.com/huggingface/transformers/pull/32244
* 🚨 Bloom support for cache class by zucchini-nlp in https://github.com/huggingface/transformers/pull/31445
* Upload new model failure report to Hub by ydshieh in https://github.com/huggingface/transformers/pull/32264
* Optimize t5 tokenize logic to avoid redundant calls by leejet in https://github.com/huggingface/transformers/pull/32270
* fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
* Repo: remove exceptions in `check_docstrings` by gante in https://github.com/huggingface/transformers/pull/32259
* make `p_mask` a numpy array before passing to `select_starts_ends` by faaany in https://github.com/huggingface/transformers/pull/32076
* fix(docs): Fixed a link in docs by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32274
* Generate: end-to-end compilation by gante in https://github.com/huggingface/transformers/pull/30788
* Whisper tokenizer word level timestamps by kamilakesbi in https://github.com/huggingface/transformers/pull/32197
* [pipeline] fix padding for 1-d tensors by sanchit-gandhi in https://github.com/huggingface/transformers/pull/31776
* Make static cache compatible with torch.export by guangy10 in https://github.com/huggingface/transformers/pull/32168
* Add stream messages from agent run for gradio chatbot by aymeric-roucher in https://github.com/huggingface/transformers/pull/32142
* use torch 2.4 in 2 CI jobs by ydshieh in https://github.com/huggingface/transformers/pull/32302
* Docs: fix GaLore optimizer code example by gil2rok in https://github.com/huggingface/transformers/pull/32249
* Fix GGUF dequantize for `gguf==0.9.1` by Isotr0py in https://github.com/huggingface/transformers/pull/32298
* Cast epochs_trained to int when resuming training by teddy-f-47 in https://github.com/huggingface/transformers/pull/32286
* feat(ci): set `fetch-depth: 0` in trufflehog checkout step by McPatate in https://github.com/huggingface/transformers/pull/31663
* Fix M4T for ASR pipeline by ylacombe in https://github.com/huggingface/transformers/pull/32296
* Docs: formatting nits by gante in https://github.com/huggingface/transformers/pull/32247
* Alternative agent plan by plaggy in https://github.com/huggingface/transformers/pull/32295
* fix: Added missing raise keyword for few exceptions by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32333
* fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit by winglian in https://github.com/huggingface/transformers/pull/32276
* fixes 32329 : The Torch code is correct - to get an average of 10% o… by fkrasnov2 in https://github.com/huggingface/transformers/pull/32335
* Repo checks: skip docstring checks if not in the diff by gante in https://github.com/huggingface/transformers/pull/32328
* Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process by xenova in https://github.com/huggingface/transformers/pull/32191
* LLaVA-NeXT: fix anyres shapes by zucchini-nlp in https://github.com/huggingface/transformers/pull/32314
* Gemma2 and flash-attention by zucchini-nlp in https://github.com/huggingface/transformers/pull/32188
* Llama 3.1: Fix incorrect `inv_freq` assignment by gante in https://github.com/huggingface/transformers/pull/32330
* [Idefics2] - Fix FA2 call for Perceiver layer by amyeroberts in https://github.com/huggingface/transformers/pull/32275
* Gemma 2: support assisted generation by gante in https://github.com/huggingface/transformers/pull/32357
* Fix error when streaming to gradio with non-string tool arguments by aymeric-roucher in https://github.com/huggingface/transformers/pull/32360
* >3-5x faster torch.compile forward compilation for autoregressive decoder models by fxmarty in https://github.com/huggingface/transformers/pull/32227
* fix: Fixed `staticmethods` with self as first argument by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
* fix: warmup_steps check for training_args by Ricardo-L-C in https://github.com/huggingface/transformers/pull/32236
* LLaVa: add cache class attribute by zucchini-nlp in https://github.com/huggingface/transformers/pull/32278
* [enc-dec cache] fix bug in indexing by sanchit-gandhi in https://github.com/huggingface/transformers/pull/32370
* [whisper] compile compatibility with long-form decoding by sanchit-gandhi in https://github.com/huggingface/transformers/pull/31772
* Remove size check between attn_weights and kv_seq_len for phi3 by helunwencser in https://github.com/huggingface/transformers/pull/32339
* add missing attribute _supports_param_buffer_assignment for gpt-j. by nv-guomingz in https://github.com/huggingface/transformers/pull/32359
* Check device map for saving tokenizer config on TPU (fix for issue 31971) by ayukh in https://github.com/huggingface/transformers/pull/32043
* update clean_up_tokenization_spaces warning by itazap in https://github.com/huggingface/transformers/pull/32371
* Empty list in defaults for LLaMA special tokens during weights conversion by ViktorooReps in https://github.com/huggingface/transformers/pull/32342
* Fix conflicting key in init kwargs in PreTrainedTokenizerBase by OmarManzoor in https://github.com/huggingface/transformers/pull/31233
* Offloaded KV Cache by n17s in https://github.com/huggingface/transformers/pull/31325 (usage sketch after this list)
* Docker: add `speech` dep to the consistency docker image by gante in https://github.com/huggingface/transformers/pull/32374
* Fixed Hybrid Cache Shape Initialization. by OsamaS99 in https://github.com/huggingface/transformers/pull/32163
* Yell at the user if zero-3 init wasn't performed, but expected to have been done by muellerzr in https://github.com/huggingface/transformers/pull/32299
* Update docs by zucchini-nlp in https://github.com/huggingface/transformers/pull/32368
* RoPE: Add numerical tests ✨ by gante in https://github.com/huggingface/transformers/pull/32380
* [generate] only require an attention mask for mps with torch<2.4 by sanchit-gandhi in https://github.com/huggingface/transformers/pull/32367
* fix: (issue 32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. by fshp971 in https://github.com/huggingface/transformers/pull/32157
* MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. by Luke20000429 in https://github.com/huggingface/transformers/pull/31500
* Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer by dependabot in https://github.com/huggingface/transformers/pull/32393
* fix: SeamlessM4TFeatureExtractor stride remainder by TechInterMezzo in https://github.com/huggingface/transformers/pull/32088
* Phi3 tests: fix typing for Python 3.8 by zucchini-nlp in https://github.com/huggingface/transformers/pull/32388
* 32184 save total_vocab_size by itazap in https://github.com/huggingface/transformers/pull/32240
* add values for neftune by nbroad1881 in https://github.com/huggingface/transformers/pull/32399
* Fix documentation references to google/bit-50 model by JuanFKurucz in https://github.com/huggingface/transformers/pull/32407
* Persist embedding type of BART and mBART models after resize by AbdiHaryadi in https://github.com/huggingface/transformers/pull/32242
* fix: Updated `test_embeded_special_tokens` for luke and mluke models by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
* Respect the config's attn_implementation if set by amyeroberts in https://github.com/huggingface/transformers/pull/32383
* Fix documentation links and code reference to model llava-next by JuanFKurucz in https://github.com/huggingface/transformers/pull/32434
* Cache: create docs by zucchini-nlp in https://github.com/huggingface/transformers/pull/32150
* Llava: fix checkpoint_doc by RUFFY-369 in https://github.com/huggingface/transformers/pull/32458
* add the missing flash attention test marker by faaany in https://github.com/huggingface/transformers/pull/32419
* Update kwargs validation for `preprocess` with decorator by qubvel in https://github.com/huggingface/transformers/pull/32024
* Fix get large model config for Switch Transformer encoder only tester by JuanFKurucz in https://github.com/huggingface/transformers/pull/32438
* Dependencies: fix typo by gante in https://github.com/huggingface/transformers/pull/32389
* Add Nemotron HF Support by suiyoubi in https://github.com/huggingface/transformers/pull/31699
* Generate: fix end to end compilation by gante in https://github.com/huggingface/transformers/pull/32465
* Add codestral mamba2 by molbap in https://github.com/huggingface/transformers/pull/32080
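
As noted next to the Offloaded KV Cache entry above, here is a minimal usage sketch. It assumes the `cache_implementation="offloaded"` option exposed through `generate()` for this feature; the checkpoint and prompt are placeholders, not part of the PR.

```python
# Minimal sketch of the offloaded KV cache from 31325 (checkpoint and prompt
# are placeholders; offloading assumes a CUDA device is available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Summarize the plot of Hamlet.", return_tensors="pt").to(model.device)

# The cache lives on the CPU; only the layer currently computing attention keeps
# its keys/values on the GPU, trading some throughput for device memory.
out = model.generate(**inputs, max_new_tokens=64, cache_implementation="offloaded")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```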

New Contributors
* RhuiDih made their first contribution in https://github.com/huggingface/transformers/pull/31629
* rohitdwivedula made their first contribution in https://github.com/huggingface/transformers/pull/32171
* ArtificialZeng made their first contribution in https://github.com/huggingface/transformers/pull/32108
* avlewis made their first contribution in https://github.com/huggingface/transformers/pull/32208
* jrhe made their first contribution in https://github.com/huggingface/transformers/pull/31846
* joaonadkarni made their first contribution in https://github.com/huggingface/transformers/pull/32143
* catalys1 made their first contribution in https://github.com/huggingface/transformers/pull/31934
* leejet made their first contribution in https://github.com/huggingface/transformers/pull/32270
* guangy10 made their first contribution in https://github.com/huggingface/transformers/pull/32168
* gil2rok made their first contribution in https://github.com/huggingface/transformers/pull/32249
* teddy-f-47 made their first contribution in https://github.com/huggingface/transformers/pull/32286
* plaggy made their first contribution in https://github.com/huggingface/transformers/pull/32295
* fkrasnov2 made their first contribution in https://github.com/huggingface/transformers/pull/32335
* helunwencser made their first contribution in https://github.com/huggingface/transformers/pull/32339
* nv-guomingz made their first contribution in https://github.com/huggingface/transformers/pull/32359
* ayukh made their first contribution in https://github.com/huggingface/transformers/pull/32043
* n17s made their first contribution in https://github.com/huggingface/transformers/pull/31325
* OsamaS99 made their first contribution in https://github.com/huggingface/transformers/pull/32163
* fshp971 made their first contribution in https://github.com/huggingface/transformers/pull/32157
* Luke20000429 made their first contribution in https://github.com/huggingface/transformers/pull/31500
* TechInterMezzo made their first contribution in https://github.com/huggingface/transformers/pull/32088
* AbdiHaryadi made their first contribution in https://github.com/huggingface/transformers/pull/32242
* RUFFY-369 made their first contribution in https://github.com/huggingface/transformers/pull/32458
* suiyoubi made their first contribution in https://github.com/huggingface/transformers/pull/31699

**Full Changelog**: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0

4.43.4

There was a mix-up; the DeepSpeed issue is now properly patched with:
- Resize embeds with DeepSpeed https://github.com/huggingface/transformers/pull/32214
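
For context, the workflow this fix unblocks looks roughly like the sketch below; the checkpoint and added token are placeholders, and the resize call is the step that now behaves correctly when the model is sharded with DeepSpeed ZeRO-3.

```python
# Rough sketch of the embedding-resize flow the DeepSpeed fix targets
# (placeholder checkpoint and token).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Adding tokens grows the vocabulary, so the embedding matrix must be resized
# before training; this previously misbehaved under DeepSpeed ZeRO-3.
tokenizer.add_special_tokens({"additional_special_tokens": ["<tool_call>"]})
model.resize_token_embeddings(len(tokenizer))
```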

🤗 Enjoy the holidays

4.43.3

We still saw some bugs, so zucchini-nlp added:
- ~~Resize embeds with DeepSpeed 32214~~ (this one actually shipped in 4.43.4, see above)
- don't log base model architecture in wandb if log model is false 32143


Other fixes:
- [whisper] fix short-form output type 32178, by sanchit-gandhi, which fixes the temperature fallback for short audio!
- [BigBird Pegasus] set _supports_param_buffer_assignment to False 32222 by kashif. This is mostly related to the new super-fast init: some models have to opt out by setting this attribute to False. If you see weird behavior, look for that 😉
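
If you maintain a custom architecture and run into the same behavior, the opt-out is a single class attribute. A minimal, illustrative sketch (not the actual BigBird Pegasus code):

```python
# Illustrative only: opting a model class out of the fast buffer-assignment
# loading path, as done for BigBird Pegasus in 32222.
from transformers import PretrainedConfig, PreTrainedModel

class MyModel(PreTrainedModel):
    config_class = PretrainedConfig

    # Fall back to the regular copy-based weight loading instead of the new
    # super-fast init.
    _supports_param_buffer_assignment = False
```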
