TRL


0.7.3

`IterativeTrainer`, NEFTune and major bugfixes for `DPOTrainer` and Distributed Training

In this release we introduce two new features, `IterativeTrainer` (contributed by gaetanlop) and NEFTune, together with important bugfixes for distributed training.

IterativeTrainer

Iterative fine-tuning is a training method that lets you perform custom actions (for example, generation and filtering) between optimization steps. TRL provides an easy-to-use API to fine-tune your models iteratively in just a few lines of code.

Read more about it here: https://huggingface.co/docs/trl/iterative_sft_trainer

* Introducing the Iterative Trainer by gaetanlop in https://github.com/huggingface/trl/pull/737
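Below is a minimal sketch of the iterative workflow, assuming the `IterativeSFTTrainer` class described in the linked docs; the model, the candidate texts, and the filtering rule are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import IterativeSFTTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

trainer = IterativeSFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="iterative-sft", per_device_train_batch_size=2, report_to="none"),
)

for _ in range(3):
    # Custom actions between optimization steps: e.g. generate candidates, filter them,
    # then run one optimization step on the kept texts.
    candidate_texts = ["The capital of France is Paris.", "2 + 2 equals 4."]  # placeholder generations
    kept_texts = [t for t in candidate_texts if t.endswith(".")]              # placeholder filtering rule
    trainer.step(texts=kept_texts)
```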

NEFTune

NEFTune is a technique to boost the performance of chat models, introduced in the paper [“NEFTune: Noisy Embeddings Improve Instruction Finetuning”](https://arxiv.org/abs/2310.05914) by Jain et al. It consists of adding noise to the embedding vectors during training.

* [`SFTTrainer`] Adds NEFTune into `SFTTrainer` by younesbelkada in https://github.com/huggingface/trl/pull/871
* [`NEFTune`] Make use of forward hooks instead by younesbelkada in https://github.com/huggingface/trl/pull/889
* Generalize NEFTune for FSDP, DDP, ... by younesbelkada in https://github.com/huggingface/trl/pull/924

Read more about it [here](https://huggingface.co/docs/trl/sft_trainer#enhance-models-performances-using-neftune)
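As a rough illustration, assuming the `SFTTrainer` API at the time of this release (where `neftune_noise_alpha` is a trainer argument; in later versions it moved to the config), enabling NEFTune can look like this; the dataset and alpha value are placeholders:

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset with a "text" column

trainer = SFTTrainer(
    model="facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5,  # scale of the noise added to embedding vectors during training
)
trainer.train()
```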

Major bugfixes

This release addresses several major issues with distributed training and gradient checkpointing; a minimal configuration sketch follows the list below.

* [`DPO`] fix DPO + GC issues by younesbelkada in https://github.com/huggingface/trl/pull/927
* [`core` / `DDP`] Fix RM trainer + DDP + quantization + propagate `gradient_checkpointing_kwargs` in SFT & DPO by younesbelkada in https://github.com/huggingface/trl/pull/912
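For reference, a hedged sketch of how `gradient_checkpointing_kwargs` is typically propagated, assuming a `transformers` version whose `TrainingArguments` exposes this field:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./out",
    gradient_checkpointing=True,
    # Forwarded through the trainers to model.gradient_checkpointing_enable();
    # use_reentrant=False is the usual fix for DDP + gradient checkpointing issues.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```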

DPOTrainer enhancements and fixes

The `DPOTrainer` now comes with multiple enhancements and bugfixes. Check them out below:

* [DPO] add SLiC hinge loss to DPOTrainer by kashif in https://github.com/huggingface/trl/pull/866
* Fix DPOTrainer + PEFT by younesbelkada in https://github.com/huggingface/trl/pull/941
* [DPO] Merge initial peft model if trainer has a peft_config by kashif in https://github.com/huggingface/trl/pull/956
* Adds model kwargs to SFT and DPO trainers by edbeeching in https://github.com/huggingface/trl/pull/951
* fix: dpo trainer ds config by mengban in https://github.com/huggingface/trl/pull/957
* hotfix for dpo trainer by mnoukhov in https://github.com/huggingface/trl/pull/919
* Fix dpo_llama2.py by younesbelkada in https://github.com/huggingface/trl/pull/934


0.7.2

In this release we provide minor bugfixes and a smoother user experience for all public classes. We also added some clarifications to the documentation on how to use Flash Attention with `SFTTrainer`.

How to use Flash Attention with `SFTTrainer`:

* Update sft_trainer.mdx to highlight Flash Attention features by younesbelkada in https://github.com/huggingface/trl/pull/807
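As a rough sketch (not the exact snippet from the docs), loading the model with Flash Attention 2 and handing it to `SFTTrainer` can look like the following; it assumes the `flash-attn` package is installed and a recent `transformers` release (older releases used `use_flash_attention_2=True` instead of `attn_implementation`):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset with a "text" column

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn and a supported GPU
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```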


0.7.1

Patch release: fix bug with `PPOTrainer` and `log_stats`

Fixed a bug with `log_stats` of `PPOTrainer` to avoid it breaking training runs

* [`PPOTrainer`] A workaround for failing log_stats by younesbelkada in https://github.com/huggingface/trl/pull/708


0.7.0

Text environments, LLMs with tools and agents!

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv.png">
</div>

Text environments provide a learning ground for language agents. They allow a language model to use tools to accomplish a task, such as using a Python interpreter to answer math questions or using a search index for trivia questions. Having access to tools allows language models to solve tasks that would be very hard for the model itself but can be trivial with the appropriate tools.

We are excited to bring to the community a complete set of functionalities and full examples to train LLMs to use tools!


Check out the documentation page [here](https://huggingface.co/docs/trl/text_environments) and a few examples below:
* [fine-tune an LLM to learn to use a simple calculator tool](https://github.com/huggingface/trl/blob/main/examples/research_projects/tools/calculator.py)
* [fine-tune an LLM to learn to use a question-answering tool to answer general knowledge questions](https://github.com/huggingface/trl/blob/main/examples/research_projects/tools/triviaqa.py)
* [fine-tune an LLM to learn to use a Python interpreter](https://github.com/huggingface/trl/blob/main/examples/research_projects/tools/python_interpreter.py)
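A hedged sketch of the general pattern, loosely following the calculator example above; the tool, prompt, and reward function are illustrative placeholders, and it assumes a `transformers` version that still ships the `load_tool` helper:

```python
from transformers import AutoTokenizer, load_tool
from trl import AutoModelForCausalLMWithValueHead, TextEnvironment

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def reward_fn(responses, answers):
    # Placeholder reward: 1.0 when the final answer matches the reference exactly.
    return [float(r.strip() == a.strip()) for r, a in zip(responses, answers)]

env = TextEnvironment(
    model,
    tokenizer,
    tools={"SimpleCalculatorTool": load_tool("ybelkada/simple-calculator")},
    reward_fn=reward_fn,
    prompt="Use the calculator tool to answer the question.\n",
    max_turns=2,
    generation_kwargs={"max_new_tokens": 32, "pad_token_id": tokenizer.eos_token_id},
)

# Each run rolls out the model, executes the tool calls it emits, and returns
# query/response tensors, masks, and rewards that can then be fed to a PPO step.
queries, responses, masks, rewards, histories = env.run(["What is 13 + 29?"], answers=["42"])
```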


0.6.0

DDPO for diffusion models

We are excited to welcome the first RLHF algorithm for diffusion models into TRL, used to refine the generations of diffusion models.
Read more about it directly [in the docs](https://huggingface.co/docs/trl/ddpo_trainer).

| Before | After DDPO finetuning |
| --- | --- |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_squirrel.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_squirrel.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_starfish.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_starfish.png"/></div> |

* Denoising Diffusion Policy Optimization by metric-space in https://github.com/huggingface/trl/pull/508
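A very rough sketch of the wiring, with placeholder prompt and reward functions and assumed constructor arguments (see the docs for the real reward functions used in the examples):

```python
import torch
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline

def prompt_fn():
    # Returns one prompt plus arbitrary metadata per sample.
    return "a photo of a squirrel", {}

def reward_fn(images, prompts, metadata):
    # Toy reward (mean pixel brightness); real setups use e.g. an aesthetic scorer.
    return torch.stack([img.float().mean() for img in images]), {}

pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5")
config = DDPOConfig(num_epochs=1, sample_batch_size=1, train_batch_size=1)

trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
trainer.train()
```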

Bug fixes and other enhancements

The release also comes with multiple bug fixes reported and/or contributed by the community; check out the commit history below.


0.5.0

This release introduces the `DPOTrainer`, includes multiple important bugfixes (`SFTTrainer`, `PPOTrainer`), and extends the current `DataCollatorForCompletionOnlyLM` to support chat-like training.

DPO Trainer

The DPO (Direct Preference Optimization) algorithm was introduced by Rafailov et al. in [this paper](https://arxiv.org/abs/2305.18290) and provides a way of optimizing directly on preference data without having to rely on a reward model. The `DPOTrainer` is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors!

* DPO Trainer by kashif in https://github.com/lvwerra/trl/pull/416
* [DPO] make sure all the concated batches are on same device by kashif in https://github.com/lvwerra/trl/pull/528
* [DPO] remove response/pairs from the DPO side by kashif in https://github.com/lvwerra/trl/pull/540
* [DPO] remove unnecessary batch size arg to Collator by kashif in https://github.com/lvwerra/trl/pull/554
* [`DPO`] Resolve logging for DPOTrainer by tomaarsen in https://github.com/lvwerra/trl/pull/570
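A minimal usage sketch, assuming the `DPOTrainer` API at the time of this release (where `beta` is a trainer argument) and a tiny placeholder preference dataset with `prompt`/`chosen`/`rejected` columns:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tiny placeholder preference dataset: each row pairs a preferred and a rejected completion.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})

trainer = DPOTrainer(
    model,
    ref_model,
    args=TrainingArguments(output_dir="dpo-demo", per_device_train_batch_size=1, report_to="none"),
    beta=0.1,  # strength of the implicit KL constraint towards the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```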

What's Changed

* Reward trainer multi-gpu eval bug by rlindskog in https://github.com/lvwerra/trl/pull/513
* Use local process index for `_get_current_device()` by lewtun in https://github.com/lvwerra/trl/pull/515

Extending the `DataCollatorForCompletionOnlyLM`

You can now mask out the user prompts in the `DataCollatorForCompletionOnlyLM` data collator and train only on chat completions. Check out the PR below or the appropriate section of the documentation to learn more about it!

* Introducing DataCollatorForChatCompletionOnlyLM by gaetanlop in https://github.com/lvwerra/trl/pull/456
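A minimal sketch of the idea, with placeholder instruction/response templates that should match the format of your own chat dataset:

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

collator = DataCollatorForCompletionOnlyLM(
    response_template="### Assistant:",   # tokens that open an assistant turn (kept in the loss)
    instruction_template="### Human:",    # tokens that open a user turn (masked out of the loss)
    tokenizer=tokenizer,
    mlm=False,
)
# Pass data_collator=collator to SFTTrainer so the loss is computed only on assistant completions.
```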

Important bug fixes

Multiple bugs in the supported trainers have been raised by the community and fixed in the PRs below:

* [`core`] Fix offline case by younesbelkada in https://github.com/lvwerra/trl/pull/538
* Relax reward trainer constraint by younesbelkada in https://github.com/lvwerra/trl/pull/539
* ADD: num_proc to SFTTrainer by BramVanroy in https://github.com/lvwerra/trl/pull/547
* [`SFTTrainer`] Add warning for wrong padding_side by younesbelkada in https://github.com/lvwerra/trl/pull/550
* Minor typo and whitespace fixes by tmm1 in https://github.com/lvwerra/trl/pull/559
* [`SFTTrainer`] Add epochs and num steps on CLI by younesbelkada in https://github.com/lvwerra/trl/pull/562
* Add `DataCollatorForCompletionOnlyLM` in the docs by younesbelkada in https://github.com/lvwerra/trl/pull/565
* Add comment to explain how the sentiment pipeline is used to run the … by jvhoffbauer in https://github.com/lvwerra/trl/pull/555
* Fix model output dim in reward trainer example by liutianlin0121 in https://github.com/lvwerra/trl/pull/566
* Computes the KL penalty using the entire distribution by edbeeching in https://github.com/lvwerra/trl/pull/541
* Add missing max_seq_length arg to example sft_trainer.py by SharkWipf in https://github.com/lvwerra/trl/pull/585
* [`PPO`] fix corner cases with PPO batch size and forward_batch_size by younesbelkada in https://github.com/lvwerra/trl/pull/563
* Update the example sft_trainer.py by ZeusFSX in https://github.com/lvwerra/trl/pull/587
* docs: Replace SFTTrainer with RewardTrainer in comment by tomaarsen in https://github.com/lvwerra/trl/pull/589
* Fix comparison in DataCollatorForCompletionOnlyLM (588) by RyujiTamaki in https://github.com/lvwerra/trl/pull/594
* refactor grad accum by vwxyzjn in https://github.com/lvwerra/trl/pull/546

Big refactor of examples and documentation

The examples and documentation have been refactored; check the PRs below for more details:

* [`examples`] Big refactor of examples and documentation by younesbelkada in https://github.com/lvwerra/trl/pull/509
* [`examples`] Fix sentiment nit by younesbelkada in https://github.com/lvwerra/trl/pull/517
* [`examples`] make the sft script more modulable by younesbelkada in https://github.com/lvwerra/trl/pull/543
* Add `use_auth_token` arg to sft_trainer example by corey-lambda in https://github.com/lvwerra/trl/pull/544


New Contributors

* rlindskog made their first contribution in https://github.com/lvwerra/trl/pull/513
* corey-lambda made their first contribution in https://github.com/lvwerra/trl/pull/544
* tmm1 made their first contribution in https://github.com/lvwerra/trl/pull/559
* jvhoffbauer made their first contribution in https://github.com/lvwerra/trl/pull/555
* liutianlin0121 made their first contribution in https://github.com/lvwerra/trl/pull/566
* SharkWipf made their first contribution in https://github.com/lvwerra/trl/pull/585
* ZeusFSX made their first contribution in https://github.com/lvwerra/trl/pull/587
* gaetanlop made their first contribution in https://github.com/lvwerra/trl/pull/456
* RyujiTamaki made their first contribution in https://github.com/lvwerra/trl/pull/594

**Full Changelog**: https://github.com/lvwerra/trl/compare/v0.4.7...v0.5.0
