Transformers


4.45.0

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

![image](https://github.com/user-attachments/assets/2b09ca55-b21c-4cea-80e7-32afc5ce8a76)

* Add MLLama by qubvel, zucchini-nlp and ArthurZucker in 33703
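
A minimal usage sketch is below; the checkpoint name, image URL and the `<|image|>` placeholder in the prompt are illustrative assumptions rather than details taken from these notes:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint name
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# any RGB image works; the URL is a placeholder
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
inputs = processor(images=image, text="<|image|>Describe this image.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```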

Qwen2-VL

Qwen2-VL is a major update over the previous Qwen-VL from the Qwen team.

An extract from the Qwen2-VL blog post is as follows:

Qwen2-VL is the latest version of the vision-language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL has the following capabilities:
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

![image](https://github.com/user-attachments/assets/d5689792-a5dd-4989-b66c-2cf4d398e89e)

* support qwen2-vl by simonJJJ in 32318
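
A sketch of single-image inference follows; the checkpoint name and local image path are assumptions, and the chat-template call mirrors the usual processor API:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint name
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# build a chat message containing one image and one question
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is in this picture?"}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("picture.png")  # placeholder path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```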

Qwen2-Audio

Qwen2-Audio is the new series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or producing direct textual responses to speech instructions.

They introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
- audio analysis: users could provide audio and text instructions for analysis during the interaction

![image](https://github.com/user-attachments/assets/221d8815-6657-4e25-b161-c1ca9728f89e)

* Add Qwen2-Audio by faychu in 32137
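
A rough audio-analysis sketch follows; the checkpoint name, local audio path and the audio placeholder tokens in the prompt are assumptions, not details from these notes:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"  # assumed checkpoint name
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# load a local clip at the feature extractor's sampling rate (path is a placeholder)
audio, _ = librosa.load("speech.wav", sr=processor.feature_extractor.sampling_rate)
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>What is being said in this clip?"
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```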

OLMoE

OLMoE is a series of **O**pen **L**anguage **M**odels using sparse **M**ixture-**o**f-**E**xperts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

![image](https://github.com/user-attachments/assets/948f5f52-7be6-47e2-9790-4d07cac26859)

* Add OLMoE by Muennighoff in 32406
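
A minimal text-generation sketch, assuming an illustrative OLMoE checkpoint name:

```python
from transformers import pipeline

# checkpoint name is an assumption; use whichever OLMoE checkpoint the team released
generator = pipeline("text-generation", model="allenai/OLMoE-1B-7B-0924")
print(generator("Mixture-of-experts language models are", max_new_tokens=30)[0]["generated_text"])
```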

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better handle high-resolution images and capture as much detail as possible. Videos, by contrast, are pooled to a total sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes (0.5B, 7B and 72B) and achieves remarkable performance on benchmark evaluations.

![image](https://github.com/user-attachments/assets/3c9e64a0-8ac9-4449-ba0e-a46cd434908e)

* Llava Onevision: add model by zucchini-nlp in 32673
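
A minimal single-image sketch; the checkpoint name and image path are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed checkpoint name
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

conversation = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=Image.open("photo.jpg"), text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```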

FalconMamba

The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.

The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and math data.

The team releases an accompanying [blog post](https://huggingface.co/blog/falconmamba).

![image](https://github.com/user-attachments/assets/b1f081c6-36b8-4f66-9091-e760163c8a61)

* Add new model by younesbelkada in 32615
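
A minimal generation sketch; the checkpoint name is taken from the accompanying blog post and should be treated as illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Technology Innovation Institute is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```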

Granite Language Models

The Granite model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.

![image](https://github.com/user-attachments/assets/2104b054-2490-41ec-ae09-bb37aad82fcc)

* Granite language models by mayank31398 in 31502
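
A minimal loading sketch via the new model class; the checkpoint name is an illustrative assumption:

```python
import torch
from transformers import AutoTokenizer, GraniteForCausalLM

model_id = "ibm/PowerLM-3b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GraniteForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```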

Granite MOE

The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.

* Granitemoe by mayank31398 in 33207

Descript-Audio-Codec

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

![image](https://github.com/user-attachments/assets/2cd49392-c3dc-4c57-bfc5-dab41b7d0861)

* Add Descript-Audio-Codec model by kamilakesbi in 31494
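
A sketch of the encode/decode round trip; the checkpoint name is an assumption and the input is just silence to exercise the API:

```python
import numpy as np
from transformers import AutoProcessor, DacModel

model_id = "descript/dac_44khz"  # assumed checkpoint name
model = DacModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# one second of silence, just to exercise the encode/decode round trip
waveform = np.zeros(processor.sampling_rate, dtype=np.float32)
inputs = processor(raw_audio=waveform, sampling_rate=processor.sampling_rate, return_tensors="pt")

encoded = model.encode(inputs["input_values"])  # discrete audio codes
reconstructed = model.decode(encoded.quantized_representation).audio_values  # waveform back from codes
```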

Pixtral

The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the [Llava](https://huggingface.co/docs/transformers/main/en/model_doc/llava) family, meaning image embeddings take the place of the [IMG] token placeholders.

The model uses [PixtralVisionModel](https://huggingface.co/docs/transformers/main/en/model_doc/pixtral#transformers.PixtralVisionModel) for its vision encoder, and [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM) for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).

* Add support for Pixtral by ArthurZucker in 33449

Mimi

The Mimi model was proposed in [Moshi: a speech-text foundation model for real-time dialogue](https://kyutai.org/Moshi.pdf) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.

![image](https://github.com/user-attachments/assets/2a45b304-5bcb-4c7b-984e-6c76f970b56f)

* Codec integration by ylacombe in 33565
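
A sketch of mapping a waveform to audio tokens and back; the checkpoint name is an assumption and the input is just silence to exercise the API:

```python
import numpy as np
from transformers import AutoFeatureExtractor, MimiModel

model_id = "kyutai/mimi"  # assumed checkpoint name
model = MimiModel.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# one second of silence at the model's sampling rate
waveform = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(raw_audio=waveform, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

codes = model.encode(inputs["input_values"]).audio_codes  # the "audio tokens"
audio = model.decode(codes).audio_values                  # waveform reconstructed from the tokens
```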

OmDet-Turbo

The OmDet-Turbo model was proposed in [Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892) by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.

![image](https://github.com/user-attachments/assets/848e91e3-81b9-4362-955a-519eaf9a871d)

* Add OmDet-Turbo by yonigozlan in 31843
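
A minimal open-vocabulary detection sketch; the checkpoint name and query classes are illustrative assumptions:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

model_id = "omlab/omdet-turbo-swin-tiny-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = OmDetTurboForObjectDetection.from_pretrained(model_id)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
classes = ["cat", "remote"]  # open-vocabulary: any class names can be queried
inputs = processor(image, text=classes, return_tensors="pt")
outputs = model(**inputs)

# boxes, scores and labels can then be extracted with the processor's
# grounded object detection post-processing utilities
```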

Quantization

GGUF

GGUF support continues to be enhanced in the library: GGUF models can be loaded within `transformers` by dequantizing them, and later re-quantized for re-use within the GGUF/GGML ecosystem. A loading sketch follows the list of PRs below.

* Add Qwen2Moe GGUF loading support by VladOS95-cyber in 33264
* Fix incorrect vocab size retrieval in GGUF config by Isotr0py in 32551
* Add chat_template for tokenizer extracted from GGUF model by Isotr0py in 32908
* 🚨 Support dequantization for most GGML types by Isotr0py in 32625
* Add support for GGUF Phi-3 by a8nova in 31844
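
A loading sketch using the `gguf_file` argument; the repository and file names are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# repository and file names below are placeholders -- point them at the GGUF checkpoint you use
model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)  # weights are dequantized on load
```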

Torch AO

An ongoing effort is to add the ability to use `torchao` as a quantization backend. Future PRs will enable saving and fine-tuning with `peft`.

* Add TorchAOHfQuantizer by jerryzh168 in 32306
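
A minimal sketch of loading a model with a `torchao` quantization config; the checkpoint name and group size are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

# int4 weight-only quantization; checkpoint name and group size are illustrative
quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
```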

Liger Kernel

The Liger kernel is now supported in the `Trainer` class.

* Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by JasonZhu1313 in 32860
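
A minimal sketch, assuming the `use_liger_kernel` flag on `TrainingArguments`; the output directory is a placeholder:

```python
from transformers import TrainingArguments

# enabling the kernels is a single flag; the output directory is a placeholder
args = TrainingArguments(output_dir="out", use_liger_kernel=True)
```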

Modular Transformers

This PR introduces modularity for `transformers`, something that has so far been deliberately disallowed in the library (see the [blog post](https://huggingface.co/blog/transformers-design-philosophy) for the accompanying design philosophy).

The core idea behind this PR is to facilitate model additions by enabling Pythonic inheritance while staying true to our single-file policy, in which models/processors must be contained within a single file so that users can work with the whole object without going through 10 layers of abstraction.

It is heavily recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248

![image](https://github.com/user-attachments/assets/307f7415-54d2-4680-b056-aa88a6459777)

* Modular `transformers`: modularity and inheritance for new model additions by ArthurZucker in 33248
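
A rough, hypothetical sketch of what a modular definition file could look like; the file name, class names and exact conventions here are illustrative assumptions, and the PR description is the reference for the actual mechanism:

```python
# Hypothetical modular_mynewmodel.py: classes inherit from an existing model and only
# the differences are spelled out; the modular converter then expands this into a
# self-contained modeling file that still respects the single-file policy.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaMLP


class MyNewModelConfig(LlamaConfig):
    model_type = "mynewmodel"


class MyNewModelMLP(LlamaMLP):
    pass  # override only what differs from Llama


class MyNewModelForCausalLM(LlamaForCausalLM):
    config_class = MyNewModelConfig
```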

Agents

`Agents` continue to be improved with each release; this time, it becomes much simpler to leverage a local engine through a local Transformers Engine.

* Multi agents with manager by aymeric-roucher in 32687
* Add new documentation page for advanced agent usage by aymeric-roucher in 33265
* Create local Transformers Engine by aymeric-roucher in 33218
* Agents use grammar by aymeric-roucher in 31735

Dynamic cache for decoder-only models

This PR adds support for the dynamic cache to all decoder-only models (except for XLNet).

The documentation for the Dynamic cache can be found [here](https://huggingface.co/docs/transformers/main/en/internal/generation_utils#transformers.DynamicCache), and documentation related to the KV cache in `transformers` in general can be found [here](https://huggingface.co/docs/transformers/main/en/kv_cache).

* Cache: new Cache format in decoder-only models by zucchini-nlp in 31421
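
A short usage sketch, passing a `DynamicCache` object explicitly to `generate`; `gpt2` is just an illustrative decoder-only checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dynamic cache grows with", return_tensors="pt")
past_key_values = DynamicCache()  # pass a Cache object instead of the legacy tuple format
outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=10)
print(past_key_values.get_seq_length())  # number of cached key/value positions after generation
```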

Chat templates updates

We've made several updates to our handling of chat models and chat templates. The most noticeable change is that **assistant prefill** is now supported. This means you can end a chat with an `assistant` message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model_checkpoint)

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'}
]

output = pipe(chat)  # The model will continue outputting JSON!
```


We've also enabled several new functionalities in Jinja that will allow more powerful templates in future, including [Loop Controls](https://jinja.palletsprojects.com/en/3.0.x/templates/#loop-controls) and a `strftime_now` function that can get the current date and time, which is commonly used in system messages. For more details, see the updated [chat template docs](https://huggingface.co/docs/transformers/main/en/chat_templating).

* Enable some Jinja extensions and add datetime capabilities by Rocketknight1 in 32684
* Update Jinja docs with new functions and general cleanup by Rocketknight1 in 33097
* Add assistant prefill for chat templates and TextGenerationPipeline by Rocketknight1 in 33198
* Add a warning to the chat template docs about the tool_calls format by Rocketknight1 in 33277
* Add tip to clarify tool calling by Rocketknight1 in 32883


Bugfixes and improvements

* 🌐 [i18n-KO] Translated `mask_generation.md` to Korean by jeongiin in 32257
* 🌐 [i18n-KO] Translated `idefics.md` to Korean by boyunJang in 32258
* 🌐 [i18n-KO] Translated `image_to_image.md` to Korean by shinhyunji36 in 32327
* Gemma2: add cache warning by zucchini-nlp in 32279
* enable xla fsdp by hanwen-sun in 32048
* Fix typo in tokenization_utils_base.py by blubitz in 32484
* fix broken link in docs by jorahn in 32491
* Docs: alert for the possibility of manipulating logits by gante in 32467
* 🌐 [i18n-KO] Translated `gptq.md` to Korean by 1kmmk1 in 32293
* 🌐 [i18n-KO] Translated `prompting.md` to Korean by chhaewxn in 32294
* 🌐 [i18n-KO] Translated `quantization/quanto.md` to Korean by fabxoe in 32281
* 🌐 [i18n-KO] Translated `image_feature_extraction.md` to Korean by mreraser in 32239
* Fix references to model google mt5 small by JuanFKurucz in 32497
* Docs: Fixed WhisperModel.forward’s docstring link by Sai-Suraj-27 in 32498
* 🌐 [i18n-KO] Translated `chat_templating.md` to Korean by enchantee00 in 32362
* Fix link to autoclass_tutorial.md in i18n.md by JuanFKurucz in 32501
* Fix typo: depracted -> deprecated by tomaarsen in 32489
* Fix issue 32518: Update llm_tutorial.md by doomdagadiggiedahdah in 32523
* Change Phi3 `_supports_sdpa` to True by pocca2048 in 32457
* Uniformize kwargs for processors - GroundingDINO by SangbumChoi in 31964
* Fix add-new-model-like by molbap in 31773
* filter flash_attn optional imports loading remote code by eaidova in 30954
* 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean by 010kim in 32372
* 🌐 [i18n-KO] Translated `trainer.md` to Korean by cjfghk5697 in 32260
* 🌐 [i18n-KO] Translated `eetq.md` to Korean by jun048098 in 32352
* 🌐 [i18n-KO] Translated `fsdp.md` to Korean by win2dvp21 in 32261
* 🌐 [i18n-KO] Translated `bitsandbytes.md` to Korean by SeungAhSon in 32408
* Fix generate with `inputs_embeds` as input by molbap in 32493
* Fixed test `test_static_cache_exportability` with torch 2.4.0 by guangy10 in 32516
* Fix code example to load bigcode starcoder2 7b by JuanFKurucz in 32474
* [docs] Translation guide by stevhliu in 32547
* Gemma2: fix FA2 generation by zucchini-nlp in 32553
* Fix a bug in Qwen2Audio by faychu in 32552
* fix slow integration gemma2 test by ArthurZucker in 32534
* fix non contiguous tensor value error in save_pretrained by congcongke in 32422
* 🌐 [i18n-KO] Translated `agent.md` to Korean by Jwaminju in 32351
* Fix: FA2 with packed training by zucchini-nlp in 32487
* Fix sliding window attention used in Gemma2FlashAttention2 by brcps12 in 32522
* fix: Fixed conditional check for `encodec` model names by Sai-Suraj-27 in 32581
* Fix `.push_to_hub(..., create_pr=True, revision="my-branch")` when creating PR on not-owned repo by Wauplin in 32094
* Cleanup tool calling documentation and rename doc by Rocketknight1 in 32337
* 🌐 [i18n-KO] Translated `deepspeed.md` to Korean by 4N3MONE in 32431
* 🌐 [i18n-KO] Translated `awq.md`to Korean by ahnjj in 32324
* fix: Fixed failing `test_find_base_model_checkpoint` by Sai-Suraj-27 in 32638
* "to be not" -> "not to be" by qgallouedec in 32636
* fix: Updated the `is_torch_mps_available()` function to include `min_version` argument by Sai-Suraj-27 in 32545
* Expand inputs in processors for VLMs by zucchini-nlp in 30962
* Automatically add `transformers` tag to the modelcard by LysandreJik in 32623
* Fix tests by molbap in 32649
* fix tensors on different devices in `WhisperGenerationMixin` by faaany in 32316
* Add support for GrokAdamW optimizer by ehartford in 32521
* Add Depth Anything V2 Metric models by bt2513 in 32126
* Fix: Fixed directory path for utils folder in `test_tokenization_utils.py` by Sai-Suraj-27 in 32601
* Modify ProcessorTesterMixin for better generalization by yonigozlan in 32637
* TF_Deberta supporting mixed precision by pinesnow72 in 32618
* Fix tests recurrent by molbap in 32651
* Support MUSA (Moore Threads GPU) backend in transformers by fmo-mt in 31913
* fix: Fixed failing tests in `tests/utils/test_add_new_model_like.py` by Sai-Suraj-27 in 32678
* Update translation docs review by stevhliu in 32662
* Fix `JetMoeIntegrationTest` by ydshieh in 32332
* Update the distributed CPU training on Kubernetes documentation by dmsuehir in 32669
* fix: Fixed unknown pytest config option `doctest_glob` by Sai-Suraj-27 in 32475
* Unpin deepspeed in Docker image/tests by muellerzr in 32572
* Updated workflows to the latest versions by Sai-Suraj-27 in 32405
* reopen: llava-next fails to consider padding_side during Training by jp1924 in 32679
* fix: Corrected ` falcon-mamba-7b` model checkpoint name by Sai-Suraj-27 in 32837
* fix: update doc link for runhouse in README.md by muddlebee in 32664
* VLMs: small clean-up for cache class by zucchini-nlp in 32417
* add back the position ids by ArthurZucker in 32554
* Use head_dim if in config for RoPE by suiyoubi in 32495
* Generate: unify `LogitsWarper` and `LogitsProcessor` by gante in 32626
* [tests] make test_sdpa_equivalence device-agnostic by faaany in 32520
* Cache: use `batch_size` instead of `max_batch_size` by gante in 32657
* Fix AutoConfig and AutoModel support for Llava-Next-Video by TKONIY in 32844
* improve _get_is_as_tensor_fns by zrr1999 in 32596
* Revert PR 32299, flag users when Zero-3 was missed by muellerzr in 32851
* fix multi-gpu with static cache by SunMarc in 32543
* Reduce the error log when using core models that need their weights renamed, and provide a step forward by muellerzr in 32656
* Make beam_constraints.Constraint.advance() docstring more accurate by alex-calderwood in 32674
* generate: missing `to` in DoLa body, causing exceptions in multi-gpu generation by gante in 32856
* Add Flax Dinov2 by MHRDYN7 in 31960
* support torch-speech by itazap in 32537
* [tests] make `test_sdpa_can_compile_dynamic` device-agnostic by faaany in 32519
* Add __repr__ for Conv1D by AaronZLT in 32425
* Support save/load ckpt for XLA FSDP by yitongh in 32311
* RT-DETR parameterized batchnorm freezing by AlanBlanchet in 32631
* Mamba / FalconMamba: Fix mamba left padding by younesbelkada in 32677
* Fix: Mamba2 generation mismatch between input_ids and inputs_embeds by vasqu in 32694
* Docs: Fixed `whisper-large-v2` model link in docs by Sai-Suraj-27 in 32871
* Allow-head-dim by ArthurZucker in 32857
* 🚨🚨🚨 Update min version of accelerate to 0.26.0 by SunMarc in 32627
* Fix repr for conv by ArthurZucker in 32897
* fix: jamba cache fails to use torch.nn.module by xgal in 32894
* Fix: Mamba2 `norm_before_gate` usage by vasqu in 32686
* Replace `tensor.norm()` with decomposed version for CLIP executorch export by qubvel in 32887
* link for optimizer names by nbroad1881 in 32400
* [i18n-ar] add README_ar.md to README.md by AhmedAlmaghz in 32583
* fix: [whisper] don't overwrite GenerationConfig's `return_timestamps` when `return_timestamps` is not passed to `generate` function by hrl in 31296
* Update docker image building by ArthurZucker in 32918
* Jamba: update integration tests by gante in 32250
* fix: Added missing `huggingface_hub` installation to workflows by Sai-Suraj-27 in 32891
* fix: no need to dtype A in jamba by xgal in 32924
* FEAT / Trainer: Add adamw 4bit optimizer by SunMarc in 31865
* CI: separate step to download nltk files by gante in 32935
* FIX / Hub: Also catch for `exceptions.ConnectionError` by younesbelkada in 31469
* Add SynCode to llm_tutorial by shubhamugare in 32884
* Fix benchmark script by ydshieh in 32635
* Improve greedy search memory usage by regisss in 32895
* fix: (issue 32689) `AttributeError` raised when using `Trainer` with `eval_on_start=True` in Jupyter Notebook. by fshp971 in 32849
* Gemma2: eager attention by default by gante in 32865
* [run_slow] idefics2 by andimarafioti in 32840
* Fix regression on `Processor.save_pretrained` caused by 31691 by leloykun in 32921
* 🌐 [i18n-KO] Translated `knowledge_distillation_for_image_classification.md to Korean" by JinukHong in 32334
* Generate: Deprecate returning legacy cache by default; Handle `use_cache=False` by gante in 32863
* docs: fix outdated link to TF32 explanation by anakin87 in 32947
* Reducing memory usage: removing useless logits computation in generate() by Cyrilvallez in 31292
* Forbid `PretrainedConfig` from saving `generate` parameters; Update deprecations in `generate`-related code 🧹 by gante in 32659
* DeviceGuard added to use Deformable Attention more safely on multi-GPU by DonggeunYu in 32910
* added doctring to SchedulerType class by Arunprakash-A in 32898
* Updated the custom_models.md changed cross_entropy code by S-M-J-I in 33118
* CI: add torchvision to the consistency image by gante in 32941
* Test: add higher `atol` in `test_forward_with_num_logits_to_keep` by gante in 33093
* mps: add `isin_mps_friendly`, a wrapper function for `torch.isin` by gante in 33099
* Add changes for uroman package to handle non-Roman characters by nandwalritik in 32404
* fix: Fixed `pydantic` required version in dockerfiles to make it compatible with DeepSpeed by Sai-Suraj-27 in 33105
* quickfix documentation by molbap in 32566
* Fixup py 38 type hints for mps friendly by muellerzr in 33128
* fix: Fixed CodeGenTokenizationTest::test_truncation failing test by Sai-Suraj-27 in 32850
* fix: multilingual midel convert to tflite get wrong token by Ayaa17 in 32079
* disable scheduled daily CI temporarily by ydshieh in 33136
* CI: fix `efficientnet` pipeline timeout and prevent future similar issues due to large image size by gante in 33123
* Log additional test metrics with the CometCallback by Lothiraldan in 33124
* [docs] add quick usage snippet to Whisper. by Vaibhavs10 in 31289
* Update stateful_callbacks state before saving checkpoint by pedrobrs in 32115
* fix Idefics2VisionConfig type annotation by chenzizhao in 33103
* Add a fix for custom code tokenizers in pipelines by Rocketknight1 in 32300
* Llama: make slow tests green 🟢 by gante in 33138
* fix redundant checkpointing in example training scripts by eminorhan in 33131
* update torch req for 4-bit optimizer by SunMarc in 33144
* 🌐 [i18n-KO] Translated `conversations.md` to Korean by newfull5 in 32468
* Very small change to one of the function parameters by alisalamatian1 in 32548
* 🚨 Add Blip2ForImageTextRetrieval by jpizarrom in 29261
* fix model name and copyright by mayank31398 in 33152
* Fix: Jamba batched generation by vasqu in 32914
* [whisper] pass attention_mask to generate_with_fallback() by benniekiss in 33145
* [RoBERTa-based] Add support for sdpa by hackyon in 30510
* Fix import paths for test_module by rasmi in 32888
* Zero-shot pipelines: minor doc changes by pcuenca in 33127
* Customise the separator used for splicing in DataCollatorWithFlattening by beep-bebop in 33114
* Fix spell mistakes by matsuo1234567 in 33149
* update push CI workflow files for security by ydshieh in 33142
* added quick clarification by DuyguA in 33166
* pass module to Params4bit.from_prequantized to ensure quant_state by winglian in 32524
* Mamba2 conversion script for original models by vasqu in 32580
* Add a static cache that offloads to the CPU or other device by gerbenvv in 32161
* use a single for loop by ArthurZucker in 33148
* Pipeline: fix bad generation kwargs docs by gante in 33205
* Add missing quotes in modeling_llava_next_video.py by juliendenize in 33214
* Add warning for stop string edge case by Rocketknight1 in 33169
* Fix local repos with remote code not registering for pipelines by Rocketknight1 in 33100
* Refactor CI: more explicit by ArthurZucker in 30674
* 🌐 [i18n-KO] Translated `llm_optims.md` to Korean by yijun-lee in 32325
* Fix red amin by ArthurZucker in 33220
* Test fetcher: missing return on filtered tests; don't write empty files by gante in 33224
* Generate: throw warning when `return_dict_in_generate` is False but should be True by gante in 33146
* Add video text to text docs by merveenoyan in 33164
* Add GraniteRMSNorm by NielsRogge in 33177
* Add duckduckgo search tool by aymeric-roucher in 32882
* Fix: Suppressed 'use_reentrant=False' warning by ankush13r in 33208
* docs: Replace package abbreviations with full name(`bitsandbytes`) in docstrings by rapsealk in 33230
* Generate: fix assistant in different device by gante in 33257
* remove to restriction for 4-bit model by SunMarc in 33122
* Fixed typo repeated word in DETR docs by sergiopaniego in 33250
* Fix: use `torch.from_numpy()` to create tensors for np.ndarrays by shinyano in 33201
* remove torch input dependant control flow by ArthurZucker in 33245
* Fix: `num_logits_to_keep` in composite models by zucchini-nlp in 33168
* Fix Bark saving by ylacombe in 33266
* Update chat template docs to remove Blenderbot by Rocketknight1 in 33254
* Add sdpa support for Albert by OmarManzoor in 32092
* Only disallow DeepSpeed Zero-3 for auto bs finder by muellerzr in 31731
* fix the parallel number of CI nodes when it is smaller than number of tests by ArthurZucker in 33276
* Repo checks: check documented methods exist by gante in 32320
* Fix: multigpu training by zucchini-nlp in 33271
* Cache docs: update by zucchini-nlp in 32929
* Config: unified logic to retrieve text config by gante in 33219
* [fix] LlavaNextProcessor '_get_unpadded_features' method by laurentd-lunit in 33263
* wait 15m before SSH into runner workflow stops by ydshieh in 33300
* Bugfix/alexsherstinsky/fix none check for attention factor in rope scaling 2024 08 28 0 by alexsherstinsky in 33188
* [InstructBLIP] qformer_tokenizer is required input by amyeroberts in 33222
* [BUG] fix upper nltk version by ylacombe in 33301
* Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading by matthewdouglas in 33154
* Add validate images and text inputs order util for processors and test_processing_utils by yonigozlan in 33285
* Fix: Fix `FalconMamba` training issues due to incompatible kernels by younesbelkada in 33195
* Add paper link by Muennighoff in 33305
* 🚨 Fix `torch.jit.trace` for `interpolate_pos_encoding` in all vision models by xenova in 33226
* Update SECURITY.md by Michellehbn in 32680
* simple align qwen2vl kv_seq_len calculation with qwen2 by simonJJJ in 33161
* Add a community notebook for fine-tuning with QLoRA, PEFT, and MLflow by daniellok-db in 33319
* Fix: StaticCache & `inputs_embeds` by zucchini-nlp in 32932
* Docs: add more cross-references to the KV cache docs by gante in 33323
* [whisper] alternative fix for long-form timestamps by sanchit-gandhi in 32131
* fix qwen2vl vision eager-attention by simonJJJ in 33213
* Load dynamic module (remote code) only once if code isn't change by XuehaiPan in 33162
* support loading model without config.json file by itazap in 32356
* Add validation for maximum sequence length in modeling_whisper.py by AmirMohammadFakhimi in 33196
* add self.head_dim for VisionAttention in Qwen2-VL by GeLee-Q in 33211
* support 3D attention mask in bert by gathierry in 32105
* Support reading tiktoken tokenizer.model file by itazap in 31656
* red-ci on main, fix copies by ArthurZucker in 33356
* RoPE: fix BC warning by gante in 33331
* Fix Prefill docs by Rocketknight1 in 33352
* Update author for QLorA/PEFT community notebook by daniellok-db in 33338
* add sdpa mbart by nbroad1881 in 32033
* Fix quantized cache tests by zucchini-nlp in 33351
* schedulefree optimizers by winglian in 30079
* Add visit webpage tool by aymeric-roucher in 33353
* Fixed Majority of the Typos in `transformers[en]` Documentation by nnilayy in 33350
* Compile compatibilty for decoder-only models by zucchini-nlp in 32617
* Adjust templates by LysandreJik in 33384
* Remove repeated prepare_images in processor tests by amyeroberts in 33163
* Fix import of `FalconMambaForCausalLM` by younesbelkada in 33381
* Import structure & first three model refactors by LysandreJik in 31329
* VLM: fixes after refactor by zucchini-nlp in 32907
* fixed Mask2Former image processor segmentation maps handling by maciej-adamiak in 33364
* Bug Fix: Update hub.py to fix NoneType error by rishiraj in 33315
* Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour by bruno-hays in 33390
* Make StaticCache configurable at model construct time by guangy10 in 32830
* use diff internal model in tests by itazap in 33387
* Fix `FbgemmFp8Linear` not preserving tensor shape by vgel in 33239
* Fix failing windows by LysandreJik in 33436
* Remove deprecated task in load_dataset by albertvillanova in 33433
* Dynamic number of speculative tokens in order to accelerate speculative decoding by jmamou in 33258
* Fix: Cast prefetch_bucket_size to integer for deepspeed >= 0.15 by kiddj in 33402
* [docs] add the missing huggingface hub username by faaany in 33431
* [docs] add the missing tokenizer when pushing models to huggingface hub by faaany in 33428
* Update stale.yml by LysandreJik in 33434
* Docs - update formatting of llama3 model card by MichaelCurrin in 33438
* Fix incomplete sentence in `Zero-shot object detection` documentation by sergiopaniego in 33430
* Fix flax whisper tokenizer bug by hannan72 in 33151
* Clean-up deprecated code by zucchini-nlp in 33446
* Fix default revision for pipelines by ankane in 33395
* Revive AMD scheduled CI by ydshieh in 33448
* Allow send `SSH into runner` info. to DM by ydshieh in 33346
* Correct Whisper's beam search scores computation by ylacombe in 32336
* Qwen2-VL: clean-up and add more tests by zucchini-nlp in 33354
* [whisper] Clarify error message when setting max_new_tokens by benniekiss in 33324
* [docs] refine the doc for `train with a script` by faaany in 33423
* Return image hidden states by zucchini-nlp in 33426
* add a callback hook right before the optimizer step by winglian in 33444
* Enable `padding_side` as call time kwargs by zucchini-nlp in 33385
* Mitigate a conflict when using sentencepiece by tengomucho in 33327
* [Phi-3] Bug on stale kv cache by garg-amit in 33129
* Fix the initialization of the cache when we have multi gpu by SunMarc in 33303
* Enable finetuning with torchao quantized model by SunMarc in 33361
* Corrected `Agents and tools` documentation links typos by sergiopaniego in 33471
* chore: fix typo in comment in tokenization_utils_base.py by DavidLemayian in 33466
* Cohere: update RoPE structure by gante in 33408
* Fix SSH workflow by ydshieh in 33451
* Add keypoint-detection task guide by merveenoyan in 33274
* Uniformize kwargs for LLaVa processor and update docs by yonigozlan in 32858
* `Agents, supercharged - Multi-agents, External tools, and more` docs typo fixed by sergiopaniego in 33478
* [i18n-ar] Add File : `docs/source/ar/_toctree.yml` by AhmedAlmaghz in 32696
* [Whisper test] Fix some failing tests by ylacombe in 33450
* Fix: Qwen2-VL training on video datasets by hiyouga in 33307
* Updated Trainer's liger-kernel integration to call correct patching API by shimizust in 33502
* Replace `accelerator.use_fp16` in examples by hlky in 33513
* Fix parametrization-based weight norm by ylacombe in 33275
* Fix number of patch check for different vision feature select strategy by insujang in 32494
* chore: migrate coverage cfg to pyproject.toml by SauravMaheshkar in 32650
* idefics2 enable_input_require_grads not aligned with disable_input_re… by sywangyi in 33194
* Update chameleon.md — fix runtime type error by maxwbuckley in 33494
* Add explicit example for RAG chat templating by A-Duss in 33503
* CI Build image - move runners by glegendre01 in 33530
* fix to jamba config, asserting attention and expert offset by ErezSC42 in 33316
* Fix missing `sequences_scores` in the Whisper beam search output by Nik-Kras in 32970
* Uniformize kwargs for Pixtral processor by yonigozlan in 33521
* Add revision to trainer push_to_hub by teamclouday in 33482
* fix patch_attention_mask incorrect setting which leads to the differe… by sywangyi in 33499
* Support LLaVa-OV-Chat by zucchini-nlp in 33532
* Decorator for easier tool building by aymeric-roucher in 33439
* Fix for slow the bug tokenizer adding spaces to single id decodes by DuyguA in 32564
* Chat template: save and load correctly for processors by zucchini-nlp in 33462
* Fix missing head_dim in llama config from gguf model by Isotr0py in 33526
* [i18n-ur] Added README_ur.md file by akkefa in 33461
* fix the wandb logging issue by ZIYU-DEEP in 33464
* Fix tests in ASR pipeline by ylacombe in 33545
* Added support for bfloat16 to zero-shot classification pipeline by umarbutler in 33554
* Pipeline: no side-effects on `model.config` and `model.generation_config` 🔫 by gante in 33480
* Return attention mask in ASR pipeline to avoid warnings by Rocketknight1 in 33509
* enforce original size to be a list by dom-dziela in 33564
* Improve compiled RT-DETR inference speed by yonigozlan in 33412
* Fix bnb dequantization by SunMarc in 33546
* Load and save video-processor from separate folder by zucchini-nlp in 33562
* VLMs: enable generation tests by zucchini-nlp in 33533
* rag: fix CI by gante in 33578
* Cache: don't show warning in forward passes when `past_key_values` is None by gante in 33541
* fix tests with main revision and read token by molbap in 33560
* add uniform processors for altclip + chinese_clip by molbap in 31198
* Generate: check that `attention_mask` is 2D by gante in 33575
* change sequence_bias type of SequenceBiasLogitsProcessor to list, add… by VladOS95-cyber in 33375
* [`Mamba2`] Move dt calculations to kernel by vasqu in 33520
* Cache: don't throw warnings on `gemma2` when instantiating a new cache by gante in 33595
* Uniformize kwargs for Paligemma processor and update docs by yonigozlan in 33571
* [tests] skip tests for xpu by faaany in 33553
* [tests] enable GemmaIntegrationTest on XPU by faaany in 33555
* Fix Llama 3 TikToken conversion by pcuenca in 33538
* Docs: add the ability to manually trigger jobs by gante in 33598
* Fix CircleCI nightly run by ydshieh in 33558
* Allow CI could be run on private forked repositories (e.g. new model additions) by ydshieh in 33594
* [tests] make more tests device-agnostic by faaany in 33580
* Update modeling_mamba2.py, fix pad size by klae01 in 32599
* Generate: remove flakyness in `test_generate_from_inputs_embeds_decoder_only` by gante in 33602
* Remove unnecessary CPM model tests by amyeroberts in 33621
* Add sdpa for BioGpt by OmarManzoor in 33592
* VLM generate: tests can't generate image/video tokens by gante in 33623
* Fix missing test in `torch_job` by ydshieh in 33593
* Add support for args to ProcessorMixin for backward compatibility by yonigozlan in 33479
* Fix contrastive search to correctly handle input with padding by ducviet00 in 33507
* Generate: assistant should sample when the main model samples by gante in 33534
* Fix some missing tests in circleci by ydshieh in 33559
* Update daily ci to use new cluster by ydshieh in 33627
* Fix qwen2vl float16 inference bug by GeLee-Q in 33312
* Fix typos by litianjian in 33583
* enable low-precision pipeline by jiqing-feng in 31625
* Pixtral update example checkpoint by amyeroberts in 33633
* Sdpa dino v2 by avishaiElmakies in 33403
* Clean up Unpack imports by molbap in 33631
* Fix DPT /Dinov2 sdpa regression on main by molbap in 33660
* handle dependency errors in check_imports by molbap in 33622
* add back self.max_position_embeddings = config.max_position_embeddings by chengchengpei in 33550
* Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower by Isotr0py in 33613
* Uniformize kwargs for Udop processor and update docs by yonigozlan in 33628
* Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin` by gante in 33203
* Enable BNB multi-backend support by jiqing-feng in 31098
* Fix error string after refactoring into get_chat_template by tibor-reiss in 33652
* uniformize git processor by yonigozlan in 33668
* Fix CIs post merging modular transformers by ArthurZucker in 33681
* Fixed docstring for cohere model regarding unavailability of prune_he… by mnauf in 33253
* Generation tests: update imagegpt input name, remove unused functions by gante in 33663
* Improve Error Messaging for Flash Attention 2 on CPU by sizhky in 33655
* Gemma2: fix config initialization (`cache_implementation`) by gante in 33684
* Fix ByteLevel alphabet missing when Sequence pretokenizer is used by umarbutler in 33556
* Uniformize kwargs for image-text-to-text processors by yonigozlan in 32544
* 🚨🚨 Setting default behavior of assisted decoding by jmamou in 33657
* tests: fix pytorch tensor placement errors by dvrogozh in 33485
* bump tokenizers, fix added tokens fast by ArthurZucker in 32535
* [Pixtral] Improve docs, rename model by NielsRogge in 33491

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* enchantee00
    * 🌐 [i18n-KO] Translated `chat_templating.md` to Korean (32362)
* faychu
    * Add Qwen2-Audio (32137)
    * Fix a bug in Qwen2Audio (32552)
* 010kim
    * 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean (32372)
* cjfghk5697
    * 🌐 [i18n-KO] Translated `trainer.md` to Korean (32260)
* younesbelkada
    * Add new model (32615)
    * Mamba / FalconMamba: Fix mamba left padding (32677)
    * FIX / Hub: Also catch for `exceptions.ConnectionError` (31469)
    * Fix: Fix `FalconMamba` training issues due to incompatible kernels (33195)
    * Fix import of `FalconMambaForCausalLM` (33381)
* 4N3MONE
    * 🌐 [i18n-KO] Translated `deepspeed.md` to Korean (32431)
* jerryzh168
    * Add TorchAOHfQuantizer (32306)
* MHRDYN7
    * Add Flax Dinov2 (31960)
* kamilakesbi
    * Add Descript-Audio-Codec model (31494)
* Isotr0py
    * Fix incorrect vocab size retrieval in GGUF config (32551)
    * Add chat_template for tokenizer extracted from GGUF model (32908)
    * 🚨 Support dequantization for most GGML types (32625)
    * Fix missing head_dim in llama config from gguf model (33526)
    * Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (33613)
* AhmedAlmaghz
    * [i18n-ar] add README_ar.md to README.md (32583)
    * [i18n-ar] Add File : `docs/source/ar/_toctree.yml` (32696)
* simonJJJ
    * support qwen2-vl (32318)
    * simple align qwen2vl kv_seq_len calculation with qwen2 (33161)
    * fix qwen2vl vision eager-attention (33213)
* jpizarrom
    * 🚨 Add Blip2ForImageTextRetrieval (29261)
* mayank31398
    * Granite language models (31502)
    * fix model name and copyright (33152)
    * Granitemoe (33207)
* hackyon
    * [RoBERTa-based] Add support for sdpa (30510)
* Muennighoff
    * Add OLMoE (32406)
    * Add paper link (33305)
* VladOS95-cyber
    * Add Qwen2Moe GGUF loading support (33264)
    * change sequence_bias type of SequenceBiasLogitsProcessor to list, add… (33375)
* jiqing-feng
    * enable low-precision pipeline (31625)
    * Enable BNB multi-backend support (31098)

4.44.2

Patch release v4.44.2, mostly fixing 2 regressions that were not caught, for Jamba and for processors!

- Fix: Jamba cache fails to use torch.nn.module (32894) Authored by xgal
- Fix: No need to dtype A in Jamba (32924) Authored by xgal
- Fix: Regression on Processor.save_pretrained caused by 31691 (32921) Authored by leloykun

4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

- is_torchdynamo_compiling -- cast a wide exception net (32476) by gante
- Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (32276)" (32477) by gante and matthewdouglas
- Gemma2: fix FA2 generation (32553) by zucchini-nlp
- Fix: FA2 with packed training (32487) by zucchini-nlp
- Fix sliding window attention used in Gemma2FlashAttention2 (32522) by brcps12
- Automatically add transformers tag to the modelcard (32623) by LysandreJik
- add back the position ids (32554) by ArthurZucker
- Use head_dim if in config for RoPE (32495) by suiyoubi and ArthurZucker
- Revert PR 32299, flag users when Zero-3 was missed (32851) by muellerzr
- fix multi-gpu with static cache (32543) by SunMarc
- Reduce the error log when using core models that need their weights r… (32656) by muellerzr
- Fix VLM generation issues (32836) by zucchini-nlp
- Fix generate with inputs_embeds as input (32493) (this PR includes some cherry-picked changes)

**Full Changelog**: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1

4.44.0

This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performance for everyone!

All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to gante, sanchit-gandhi and xenova

💥 End-to-end generation compile
*Generate: end-to-end compilation 30788 by gante*: `model.generate` now supports compiling! There are a few limitations, but here is a small snippet:

```python3
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
```



⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)
*3-5x faster torch.compile forward compilation for autoregressive decoder models 32227* by fxmarty.
As documented on the PR, this makes the whole generation a lot faster when you re-use the cache!
You can see this when you run `model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)`

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀
*Offloaded KV Cache 31325* by n17s: you just have to set `cache_implementation="offloaded"` when calling `generate`, or use it like this:
```python3
from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as
    num_beams=4, num_beam_groups=2, num_return_sequences=4,
    diversity_penalty=1.0, max_new_tokens=50, early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
```


📦 Torch export for static cache
The `pytorch` team gave us a great gift: you can now use `torch.export`, directly compatible with [Executorch](https://pytorch.org/executorch/main/index.html)! Find examples [here](https://github.com/huggingface/transformers/pull/31706).

* Make static cache compatible with torch.export 32168 by guangy10

This also unlocks support for prompt reuse:
```python3
import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
```


Gemma2: assisted decoding
*Gemma 2: support assisted generation 32357* by gante

We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it [here](https://huggingface.co/blog/gemma-july-update#assisted-generation).

```py
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
    "assistant_model": assistant_model,
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```


Nemotron support
![image](https://github.com/user-attachments/assets/512d3fbe-909b-4e45-9927-cab78e0f522a)
> Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to suiyoubi. See:
* Add Nemotron HF Support 31699


Codestral support
![image](https://github.com/user-attachments/assets/2827f950-f6c5-4fb8-8569-e8008aa79651)
> Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

It uses the Mamba2 architecture; it was a bit of a pain to remove all the einops, but we hope we made it better for everyone!

* Add codestral mamba2 32080 by molbap and vasqu

Breaking changes:
We removed the default chat templates **in the code**; they should all be on the Hub!
* 🚨 No more default chat templates 31733 by Rocketknight1

Long-form decoding for whisper, even faster:
Our great sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in
* [whisper] compile compatibility with long-form decoding 31772




What's Changed
* Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by RhuiDih in https://github.com/huggingface/transformers/pull/31629
* Updated `ruff` to the latest version by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
* fix by gante in https://github.com/huggingface/transformers/pull/32162
* fix: Fixed an if condition that is always evaluating to true by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32160
* [docs] change temperature to a positive value by faaany in https://github.com/huggingface/transformers/pull/32077
* adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by rohitdwivedula in https://github.com/huggingface/transformers/pull/32171
* fix: default value reflects the runtime environment variables rather than the ones present at import time. by junrae6454 in https://github.com/huggingface/transformers/pull/32153
* Update qwen2.md by ArtificialZeng in https://github.com/huggingface/transformers/pull/32108
* Remove conversational pipeline tests by amyeroberts in https://github.com/huggingface/transformers/pull/32099
* RoPE: relaxed rope validation by gante in https://github.com/huggingface/transformers/pull/32182
* let's not warn when someone is running a forward by ArthurZucker in https://github.com/huggingface/transformers/pull/32176
* Fix resize embedding with Deepspeed by zucchini-nlp in https://github.com/huggingface/transformers/pull/32192
* Fix float8_e4m3fn in modeling_utils by SunMarc in https://github.com/huggingface/transformers/pull/32193
* Support dequantizing GGUF FP16 format by PenutChen in https://github.com/huggingface/transformers/pull/31783
* :rotating_light: No more default chat templates by Rocketknight1 in https://github.com/huggingface/transformers/pull/31733
* fix: Replaced deprecated `unittest method` with the correct one by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
* [whisper] fix short-form output type by sanchit-gandhi in https://github.com/huggingface/transformers/pull/32178
* remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 by statelesshz in https://github.com/huggingface/transformers/pull/32210
* Update question_answering.py by avlewis in https://github.com/huggingface/transformers/pull/32208
* [BigBird Pegasus] set _supports_param_buffer_assignment to False by kashif in https://github.com/huggingface/transformers/pull/32222
* [warnings] fix E721 warnings by kashif in https://github.com/huggingface/transformers/pull/32223
* Follow up for 31973 by ydshieh in https://github.com/huggingface/transformers/pull/32025
* translate philosophy.md to chinese by statelesshz in https://github.com/huggingface/transformers/pull/32177
* Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac by jrhe in https://github.com/huggingface/transformers/pull/31846
* Fix code snippet for Grounding DINO by qubvel in https://github.com/huggingface/transformers/pull/32229
* Generation: stop at `eos` for assisted decoding by zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
* Llava: generate without images by zucchini-nlp in https://github.com/huggingface/transformers/pull/32183
* Resize embeds with DeepSpeed by zucchini-nlp in https://github.com/huggingface/transformers/pull/32214
* don't log base model architecture in wandb if log model is false by joaonadkarni in https://github.com/huggingface/transformers/pull/32143
* Refactor: Removed un-necessary `object` base class by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
* Adds: extra_repr for RMSNorm layers in most models by rohitdwivedula in https://github.com/huggingface/transformers/pull/32204
* Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 by catalys1 in https://github.com/huggingface/transformers/pull/31934
* [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` by faaany in https://github.com/huggingface/transformers/pull/32039
* Flash-Attn: fix generation when no attention mask or no pading by zucchini-nlp in https://github.com/huggingface/transformers/pull/32241
* More flexible trigger condition by ydshieh in https://github.com/huggingface/transformers/pull/32251
* Llama 3.1: replace for loop by tensor ops at inv_freq initialization by gante in https://github.com/huggingface/transformers/pull/32244
* 🚨 Bloom support for cache class by zucchini-nlp in https://github.com/huggingface/transformers/pull/31445
* Upload new model failure report to Hub by ydshieh in https://github.com/huggingface/transformers/pull/32264
* Optimize t5 tokenize logic to avoid redundant calls by leejet in https://github.com/huggingface/transformers/pull/32270
* fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
* Repo: remove exceptions in `check_docstrings` by gante in https://github.com/huggingface/transformers/pull/32259
* make `p_mask` a numpy array before passing to `select_starts_ends` by faaany in https://github.com/huggingface/transformers/pull/32076
* fix(docs): Fixed a link in docs by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32274
* Generate: end-to-end compilation by gante in https://github.com/huggingface/transformers/pull/30788
* Whisper tokenizer word level timestamps by kamilakesbi in https://github.com/huggingface/transformers/pull/32197
* [pipeline] fix padding for 1-d tensors by sanchit-gandhi in https://github.com/huggingface/transformers/pull/31776
* Make static cache compatible with torch.export by guangy10 in https://github.com/huggingface/transformers/pull/32168
* Add stream messages from agent run for gradio chatbot by aymeric-roucher in https://github.com/huggingface/transformers/pull/32142
* use torch 2.4 in 2 CI jobs by ydshieh in https://github.com/huggingface/transformers/pull/32302
* Docs: fix GaLore optimizer code example by gil2rok in https://github.com/huggingface/transformers/pull/32249
* Fix GGUF dequantize for `gguf==0.9.1` by Isotr0py in https://github.com/huggingface/transformers/pull/32298
* Cast epochs_trained to int when resuming training by teddy-f-47 in https://github.com/huggingface/transformers/pull/32286
* feat(ci): set `fetch-depth: 0` in trufflehog checkout step by McPatate in https://github.com/huggingface/transformers/pull/31663
* Fix M4T for ASR pipeline by ylacombe in https://github.com/huggingface/transformers/pull/32296
* Docs: formatting nits by gante in https://github.com/huggingface/transformers/pull/32247
* Alternative agent plan by plaggy in https://github.com/huggingface/transformers/pull/32295
* fix: Added missing raise keyword for few exceptions by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32333
* fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit by winglian in https://github.com/huggingface/transformers/pull/32276
* fixes 32329 : The Torch code is correct - to get an average of 10% o… by fkrasnov2 in https://github.com/huggingface/transformers/pull/32335
* Repo checks: skip docstring checks if not in the diff by gante in https://github.com/huggingface/transformers/pull/32328
* Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process by xenova in https://github.com/huggingface/transformers/pull/32191
* LLaVA-NeXT: fix anyres shapes by zucchini-nlp in https://github.com/huggingface/transformers/pull/32314
* Gemma2 and flash-attention by zucchini-nlp in https://github.com/huggingface/transformers/pull/32188
* Llama 3.1: Fix incorrect `inv_freq` assignment by gante in https://github.com/huggingface/transformers/pull/32330
* [Idefics2] - Fix FA2 call for Perceiver layer by amyeroberts in https://github.com/huggingface/transformers/pull/32275
* Gemma 2: support assisted generation by gante in https://github.com/huggingface/transformers/pull/32357
* Fix error when streaming to gradio with non-string tool arguments by aymeric-roucher in https://github.com/huggingface/transformers/pull/32360
* >3-5x faster torch.compile forward compilation for autoregressive decoder models by fxmarty in https://github.com/huggingface/transformers/pull/32227
* fix: Fixed `staticmethods` with self as first argument by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
* fix: warmup_steps check for training_args by Ricardo-L-C in https://github.com/huggingface/transformers/pull/32236
* LLaVa: add cache class attribute by zucchini-nlp in https://github.com/huggingface/transformers/pull/32278
* [enc-dec cache] fix bug in indexing by sanchit-gandhi in https://github.com/huggingface/transformers/pull/32370
* [whisper] compile compatibility with long-form decoding by sanchit-gandhi in https://github.com/huggingface/transformers/pull/31772
* Remove size check between attn_weights and kv_seq_len for phi3 by helunwencser in https://github.com/huggingface/transformers/pull/32339
* add missing attribute _supports_param_buffer_assignment for gpt-j. by nv-guomingz in https://github.com/huggingface/transformers/pull/32359
* Check device map for saving tokenizer config on TPU (fix for issue 31971) by ayukh in https://github.com/huggingface/transformers/pull/32043
* update clean_up_tokenization_spaces warning by itazap in https://github.com/huggingface/transformers/pull/32371
* Empty list in defaults for LLaMA special tokens during weights conversion by ViktorooReps in https://github.com/huggingface/transformers/pull/32342
* Fix conflicting key in init kwargs in PreTrainedTokenizerBase by OmarManzoor in https://github.com/huggingface/transformers/pull/31233
* Offloaded KV Cache by n17s in https://github.com/huggingface/transformers/pull/31325 (usage sketch after this list)
* Docker: add `speech` dep to the consistency docker image by gante in https://github.com/huggingface/transformers/pull/32374
* Fixed Hybrid Cache Shape Initialization. by OsamaS99 in https://github.com/huggingface/transformers/pull/32163
* Yell at the user if zero-3 init wasn't performed, but expected to have been done by muellerzr in https://github.com/huggingface/transformers/pull/32299
* Update docs by zucchini-nlp in https://github.com/huggingface/transformers/pull/32368
* RoPE: Add numerical tests ✨ by gante in https://github.com/huggingface/transformers/pull/32380
* [generate] only require an attention mask for mps with torch<2.4 by sanchit-gandhi in https://github.com/huggingface/transformers/pull/32367
* fix: (issue 32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. by fshp971 in https://github.com/huggingface/transformers/pull/32157
* MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. by Luke20000429 in https://github.com/huggingface/transformers/pull/31500
* Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer by dependabot in https://github.com/huggingface/transformers/pull/32393
* fix: SeamlessM4TFeatureExtractor stride remainder by TechInterMezzo in https://github.com/huggingface/transformers/pull/32088
* Phi3 tests: fix typing for Python 3.8 by zucchini-nlp in https://github.com/huggingface/transformers/pull/32388
* 32184 save total_vocab_size by itazap in https://github.com/huggingface/transformers/pull/32240
* add values for neftune by nbroad1881 in https://github.com/huggingface/transformers/pull/32399
* Fix documentation references to google/bit-50 model by JuanFKurucz in https://github.com/huggingface/transformers/pull/32407
* Persist embedding type of BART and mBART models after resize by AbdiHaryadi in https://github.com/huggingface/transformers/pull/32242
* fix: Updated `test_embeded_special_tokens` for luke and mluke models by Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
* Respect the config's attn_implementation if set by amyeroberts in https://github.com/huggingface/transformers/pull/32383
* Fix documentation links and code reference to model llava-next by JuanFKurucz in https://github.com/huggingface/transformers/pull/32434
* Cache: create docs by zucchini-nlp in https://github.com/huggingface/transformers/pull/32150
* Llava: fix checkpoint_doc by RUFFY-369 in https://github.com/huggingface/transformers/pull/32458
* add the missing flash attention test marker by faaany in https://github.com/huggingface/transformers/pull/32419
* Update kwargs validation for `preprocess` with decorator by qubvel in https://github.com/huggingface/transformers/pull/32024
* Fix get large model config for Switch Transformer encoder only tester by JuanFKurucz in https://github.com/huggingface/transformers/pull/32438
* Dependencies: fix typo by gante in https://github.com/huggingface/transformers/pull/32389
* Add Nemotron HF Support by suiyoubi in https://github.com/huggingface/transformers/pull/31699
* Generate: fix end to end compilation by gante in https://github.com/huggingface/transformers/pull/32465
* Add codestral mamba2 by molbap in https://github.com/huggingface/transformers/pull/32080
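
As noted next to the Offloaded KV Cache entry above, here is a minimal usage sketch. It assumes the `cache_implementation="offloaded"` option exposed through `generate()` for this feature; the checkpoint and prompt are placeholders, not part of the PR.

```python
# Minimal sketch of the offloaded KV cache from 31325 (checkpoint and prompt
# are placeholders; offloading assumes a CUDA device is available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Summarize the plot of Hamlet.", return_tensors="pt").to(model.device)

# The cache lives on the CPU; only the layer currently computing attention keeps
# its keys/values on the GPU, trading some throughput for device memory.
out = model.generate(**inputs, max_new_tokens=64, cache_implementation="offloaded")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```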

New Contributors
* RhuiDih made their first contribution in https://github.com/huggingface/transformers/pull/31629
* rohitdwivedula made their first contribution in https://github.com/huggingface/transformers/pull/32171
* ArtificialZeng made their first contribution in https://github.com/huggingface/transformers/pull/32108
* avlewis made their first contribution in https://github.com/huggingface/transformers/pull/32208
* jrhe made their first contribution in https://github.com/huggingface/transformers/pull/31846
* joaonadkarni made their first contribution in https://github.com/huggingface/transformers/pull/32143
* catalys1 made their first contribution in https://github.com/huggingface/transformers/pull/31934
* leejet made their first contribution in https://github.com/huggingface/transformers/pull/32270
* guangy10 made their first contribution in https://github.com/huggingface/transformers/pull/32168
* gil2rok made their first contribution in https://github.com/huggingface/transformers/pull/32249
* teddy-f-47 made their first contribution in https://github.com/huggingface/transformers/pull/32286
* plaggy made their first contribution in https://github.com/huggingface/transformers/pull/32295
* fkrasnov2 made their first contribution in https://github.com/huggingface/transformers/pull/32335
* helunwencser made their first contribution in https://github.com/huggingface/transformers/pull/32339
* nv-guomingz made their first contribution in https://github.com/huggingface/transformers/pull/32359
* ayukh made their first contribution in https://github.com/huggingface/transformers/pull/32043
* n17s made their first contribution in https://github.com/huggingface/transformers/pull/31325
* OsamaS99 made their first contribution in https://github.com/huggingface/transformers/pull/32163
* fshp971 made their first contribution in https://github.com/huggingface/transformers/pull/32157
* Luke20000429 made their first contribution in https://github.com/huggingface/transformers/pull/31500
* TechInterMezzo made their first contribution in https://github.com/huggingface/transformers/pull/32088
* AbdiHaryadi made their first contribution in https://github.com/huggingface/transformers/pull/32242
* RUFFY-369 made their first contribution in https://github.com/huggingface/transformers/pull/32458
* suiyoubi made their first contribution in https://github.com/huggingface/transformers/pull/31699

**Full Changelog**: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0

4.43.4

There was a mix-up; the DeepSpeed issue is now properly patched with:
- Resize embeds with DeepSpeed https://github.com/huggingface/transformers/pull/32214
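
For context, the workflow this fix unblocks looks roughly like the sketch below; the checkpoint and added token are placeholders, and the resize call is the step that now behaves correctly when the model is sharded with DeepSpeed ZeRO-3.

```python
# Rough sketch of the embedding-resize flow the DeepSpeed fix targets
# (placeholder checkpoint and token).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Adding tokens grows the vocabulary, so the embedding matrix must be resized
# before training; this previously misbehaved under DeepSpeed ZeRO-3.
tokenizer.add_special_tokens({"additional_special_tokens": ["<tool_call>"]})
model.resize_token_embeddings(len(tokenizer))
```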

🤗 Enjoy the holidays

4.43.3

We still saw some bugs, so zucchini-nlp added:
- ~~Resize embeds with DeepSpeed 32214~~ (this one actually shipped in 4.43.4, see above)
- don't log base model architecture in wandb if log model is false 32143


Other fixes:
- [whisper] fix short-form output type 32178, by sanchit-gandhi, which fixes the temperature fallback for short audio!
- [BigBird Pegasus] set _supports_param_buffer_assignment to False 32222 by kashif. This is mostly related to the new super-fast init: some models have to opt out by setting this attribute to False. If you see weird behavior, look for that 😉
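
If you maintain a custom architecture and run into the same behavior, the opt-out is a single class attribute. A minimal, illustrative sketch (not the actual BigBird Pegasus code):

```python
# Illustrative only: opting a model class out of the fast buffer-assignment
# loading path, as done for BigBird Pegasus in 32222.
from transformers import PretrainedConfig, PreTrainedModel

class MyModel(PreTrainedModel):
    config_class = PretrainedConfig

    # Fall back to the regular copy-based weight loading instead of the new
    # super-fast init.
    _supports_param_buffer_assignment = False
```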
