We are excited to introduce the new v0.9.3 release, which brings many exciting new features and algorithms. The highlights are as follows:
1. **RLOO Trainer**: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by [Ahmadian et al. from Cohere](https://cohere.com/research/papers/back-to-basics-revisiting-reinforce-style-optimization-for-learning-from-human-feedback-in-llms-2024-02-23). Check out our docs [here](https://huggingface.co/docs/trl/rloo_trainer) to get started, and see the usage sketch after this list.
2. **PPOv2 Trainer**: We are introducing a new experimental PPOv2 trainer that is more closely aligned with OpenAI's PPO implementation, based on https://arxiv.org/abs/2403.17031. Check out our docs [here](https://huggingface.co/docs/trl/ppov2_trainer) to get started.
3. **Reward model visualization**: Reward model training now includes visualization of predictions on the eval dataset, as shown below.
https://github.com/huggingface/trl/assets/5555347/6575a879-cb2f-4e2e-bb84-a76707f9de84
4. **New losses in the DPO Trainer**: `DPOTrainer` now supports the losses for Self-Play Preference Optimization, Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment (see the configuration sketch after this list).
5. **New losses in the KTO Trainer**: `KTOTrainer` now includes the loss for Binary Classifier Optimization (BCO).
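
To make the first two highlights more concrete, below is a minimal sketch of how the new `RLOOTrainer` can be wired up (the `PPOv2Trainer` exposes a very similar interface via `PPOv2Config`). The model name, the toy prompt dataset, and the exact keyword arguments are illustrative assumptions based on the docs linked above, so double-check them against your installed TRL version.

```python
# Minimal, illustrative RLOO sketch - argument names follow the RLOO docs,
# but verify them against your TRL version before relying on this.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RLOOConfig, RLOOTrainer

base_model = "EleutherAI/pythia-160m"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the padding token

policy = AutoModelForCausalLM.from_pretrained(base_model)
ref_policy = AutoModelForCausalLM.from_pretrained(base_model)
# num_labels=1 gives the sequence classifier a scalar reward head
# (in practice you would load an already-trained reward model here)
reward_model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

# RLOO expects tokenized prompts (an "input_ids" column); this toy dataset is a stand-in
prompts = ["The weather today is", "My favourite book is", "TRL makes RLHF"] * 4
train_dataset = Dataset.from_dict({"prompt": prompts}).map(
    lambda ex: {"input_ids": tokenizer(ex["prompt"])["input_ids"]},
    remove_columns=["prompt"],
)

trainer = RLOOTrainer(
    config=RLOOConfig(output_dir="rloo-example"),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # reuse the toy set just to exercise evaluation
)
trainer.train()
```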
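Similarly, the new DPO losses are selected through the `loss_type` field of the new `DPOConfig`. The sketch below uses the Self-Play Preference Optimization hard-label loss (`"sppo_hard"`); the toy preference data, the base model, and the other option names mentioned in the comments are illustrative assumptions, so treat this as a starting point rather than the definitive API.

```python
# Illustrative sketch: picking one of the new DPO losses via DPOConfig.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "gpt2"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the padding token

# Toy preference data with the prompt/chosen/rejected columns DPOTrainer expects
train_dataset = Dataset.from_dict(
    {
        "prompt": ["What is the capital of France?"],
        "chosen": ["The capital of France is Paris."],
        "rejected": ["France does not have a capital."],
    }
)

# "sppo_hard" selects the Self-Play Preference Optimization (hard-label) loss;
# "robust" and "nca_pair" are among the other additions in this release.
args = DPOConfig(
    output_dir="dpo-sppo-example",
    loss_type="sppo_hard",
    beta=0.1,
    max_length=128,
    max_prompt_length=64,
)

trainer = DPOTrainer(
    model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```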
## What's Changed
* set dev version by younesbelkada in https://github.com/huggingface/trl/pull/1568
* fix add_special_tokens issue for data with template by edixiong in https://github.com/huggingface/trl/pull/1509
* [DPO] add 'bco_pair' loss_type by seanexp in https://github.com/huggingface/trl/pull/1524
* [DPO] DPOConfig class by kashif in https://github.com/huggingface/trl/pull/1554
* [SFT] add SFT Trainer Config dataclass by kashif in https://github.com/huggingface/trl/pull/1530
* FIX: Fix CI on transformers main by younesbelkada in https://github.com/huggingface/trl/pull/1576
* [`SFTTrainer`] Add warning in SFTTrainer when dataset already processed by younesbelkada in https://github.com/huggingface/trl/pull/1577
* Fix typo in detoxifying doc by qgallouedec in https://github.com/huggingface/trl/pull/1594
* Core: removed nonexistent `SftArgumentParser` by younesbelkada in https://github.com/huggingface/trl/pull/1602
* [`KTOTrainer`] add BCO (reward shift and underlying distribution matching) by seanexp in https://github.com/huggingface/trl/pull/1599
* [CLI] Use auto device map for model load by lewtun in https://github.com/huggingface/trl/pull/1596
* Removing `tests/` from package data by jamesbraza in https://github.com/huggingface/trl/pull/1607
* Docs: Fix build main documentation by younesbelkada in https://github.com/huggingface/trl/pull/1604
* support loss function for Self-play Preference Optimization by winglian in https://github.com/huggingface/trl/pull/1612
* Update HH dataset on helpful only subset by vwxyzjn in https://github.com/huggingface/trl/pull/1613
* corrects loss function for Self-play Preference Optimization hard label version by angelahzyuan in https://github.com/huggingface/trl/pull/1615
* Fix ZeRO-3 generation context manager by lewtun in https://github.com/huggingface/trl/pull/1617
* fixed adding bos and eos token unconditionally by jasonyux in https://github.com/huggingface/trl/pull/1591
* visualize rm prediction by vwxyzjn in https://github.com/huggingface/trl/pull/1636
* [ORPO] Correct label mask for pad tokens by IlyaGusev in https://github.com/huggingface/trl/pull/1625
* Update sft_llama2.py to work with the latest API by xianbaoqian in https://github.com/huggingface/trl/pull/1637
* Fixed wrong logs prefixes in KTOTrainer by bartoszzuk in https://github.com/huggingface/trl/pull/1641
* Pairwise Noise Contrastive Alignment by winglian in https://github.com/huggingface/trl/pull/1632
* don't cast the trainable lora layers to half precision by pacman100 in https://github.com/huggingface/trl/pull/1644
* PPO / Reinforce Trainers by vwxyzjn in https://github.com/huggingface/trl/pull/1540
* Apply deprecated `evaluation_strategy` by muellerzr in https://github.com/huggingface/trl/pull/1559
* FEAT: Add support for training collator in PPOTrainer by younesbelkada in https://github.com/huggingface/trl/pull/1658
* Correct Documentation for cDPO Usage by AliBakly in https://github.com/huggingface/trl/pull/1655
* Fix inheritance order in PPOv2Config by Nicolinho in https://github.com/huggingface/trl/pull/1659
* [DPO] Add 'robust' loss_type by Abilityguy in https://github.com/huggingface/trl/pull/1653
* 🤫 TR-DPO implementation by syrn1k in https://github.com/huggingface/trl/pull/1593
* Do not upcast adapters when using FSDP+QLoRA by pacman100 in https://github.com/huggingface/trl/pull/1654
* [Tests] update eval_strategy API by kashif in https://github.com/huggingface/trl/pull/1662
* Fix ppov2 test case by vwxyzjn in https://github.com/huggingface/trl/pull/1661
* FIX / PPO: Fix `enable_input_require_grads` issues with PPO models by younesbelkada in https://github.com/huggingface/trl/pull/1664
* fix dataset load error by sywangyi in https://github.com/huggingface/trl/pull/1670
* FIX / SFTTrainer: Fix SFTTrainer with `args=None` by younesbelkada in https://github.com/huggingface/trl/pull/1678
* Fix max_completion_length for encoder_decoder models in KTO Trainer by samuki in https://github.com/huggingface/trl/pull/1588
* Initial RPO loss by kashif in https://github.com/huggingface/trl/pull/1686
* Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig by alexisrozhkov in https://github.com/huggingface/trl/pull/1690
* Skip packing validation by alex-jw-brooks in https://github.com/huggingface/trl/pull/1673
* Fix typo in DPOTrainer's warnings by qgallouedec in https://github.com/huggingface/trl/pull/1688
* Quick fix on GPT4-eval by vwxyzjn in https://github.com/huggingface/trl/pull/1696