# QLoRA RLHF, SFTTrainer and RewardTrainer
This new version of TRL adds support for training larger models with QLoRA (4-bit quantization through `bitsandbytes`), along with the brand new `SFTTrainer` and `RewardTrainer` classes, so you can run your RLHF projects end-to-end!
## Introducing `SFTTrainer` and `RewardTrainer`
Use the brand new trainers to train your supervised fine-tuned (SFT) model and your reward model with just a few lines of code, as sketched below!
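A minimal SFT sketch; the dataset, base checkpoint and hyperparameters below are illustrative placeholders to swap for your own:

```python
from datasets import load_dataset
from trl import SFTTrainer

# any dataset with a plain-text column works; "imdb" is only an example
dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",        # example base model
    train_dataset=dataset,
    dataset_text_field="text",  # name of the column holding the raw text
    max_seq_length=512,
)
trainer.train()
```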
* [`core`] officially support SFT (Supervised Finetuning) by younesbelkada in https://github.com/lvwerra/trl/pull/323
* [`SFT`] Fix sft issues by younesbelkada in https://github.com/lvwerra/trl/pull/336
* [`docs`] fix SFT doc by younesbelkada in https://github.com/lvwerra/trl/pull/367
* [`core`] Officially Support Reward Modeling by younesbelkada in https://github.com/lvwerra/trl/pull/303
* Resolve broken evaluation/prediction for RewardTrainer by tomaarsen in https://github.com/lvwerra/trl/pull/404
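A companion sketch for `RewardTrainer` on pairwise preference data; the tiny inline dataset, the `facebook/opt-350m` backbone and the chosen/rejected column names are illustrative assumptions about the expected format:

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer

# a single-logit sequence classifier acts as the reward model
model = AutoModelForSequenceClassification.from_pretrained("facebook/opt-350m", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# toy preference pairs; in practice load a real chosen/rejected dataset
raw = Dataset.from_dict({
    "chosen": ["The movie was a delight from start to finish."],
    "rejected": ["movie good"],
})

def tokenize_pairs(batch):
    chosen = tokenizer(batch["chosen"], truncation=True)
    rejected = tokenizer(batch["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

train_dataset = raw.map(tokenize_pairs, batched=True, remove_columns=["chosen", "rejected"])

trainer = RewardTrainer(model=model, tokenizer=tokenizer, train_dataset=train_dataset)
trainer.train()
```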
## QLoRA integration
Pass 4-bit models directly into `PPOTrainer` for more memory-efficient training, as sketched below.
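A minimal sketch of the QLoRA path, assuming a recent `bitsandbytes`/`peft`/`transformers` stack with GPU support; the checkpoint and LoRA hyperparameters are placeholders:

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_id = "facebook/opt-350m"  # example checkpoint, swap in your own

# LoRA adapter trained on top of the frozen, quantized base weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# load_in_4bit quantizes the base model through bitsandbytes (QLoRA)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",        # requires accelerate
    peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# with a peft model no separate ref_model is needed: the adapter is disabled
# internally to recover the reference policy
ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=16),
    model=model,
    tokenizer=tokenizer,
    # pass your dataset / data_collator here as usual for the PPO loop
)
```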
* [`core`] Add 4bit QLora by younesbelkada in https://github.com/lvwerra/trl/pull/383
* [`bnb`] fix 4 bit SFT by younesbelkada in https://github.com/lvwerra/trl/pull/396
## Updated StackLLaMA example
Great work by mnoukhov, who fixed the issues with StackLLaMA caused by the new versions of `accelerate`, `peft` and `transformers`. The fully reproducible examples are listed below:
* StackLLaMA: correctly merge peft model by mnoukhov in https://github.com/lvwerra/trl/pull/398
* StackLlama: fixed RL training and added args by mnoukhov in https://github.com/lvwerra/trl/pull/400
* Fixed some type annotations of trl.trainer.PPOTrainer by JulesGM in https://github.com/lvwerra/trl/pull/392
* StackLLaMA: fix supervised finetuning and reward model training by mnoukhov in https://github.com/lvwerra/trl/pull/399
## Bug fixes and improvements
* [`core`] refactor peft API by younesbelkada in https://github.com/lvwerra/trl/pull/231
* Batched generation by lvwerra in https://github.com/lvwerra/trl/pull/228
* Reduce memory consumption in batched_forward_pass by ohashi56225 in https://github.com/lvwerra/trl/pull/234
* [`core`] Add warning when negative KL by younesbelkada in https://github.com/lvwerra/trl/pull/239
* adds early stopping by edbeeching in https://github.com/lvwerra/trl/pull/238
* PPO config __init__ is bloated by GauravVirmani in https://github.com/lvwerra/trl/pull/241
* feat(ci): enable `pip` cache by SauravMaheshkar in https://github.com/lvwerra/trl/pull/198
* Improve logging for PPO + Docs page by natolambert in https://github.com/lvwerra/trl/pull/243
* Fix typo by heya5 in https://github.com/lvwerra/trl/pull/253
* Using batched generate in sentiment scripts by GauravVirmani in https://github.com/lvwerra/trl/pull/249
* [`core`] Fix DeepSpeed zero-3 issue by younesbelkada in https://github.com/lvwerra/trl/pull/182
* [`distributed`] Fix early stopping and DP by younesbelkada in https://github.com/lvwerra/trl/pull/254
* [`core`] Fix ds issue by younesbelkada in https://github.com/lvwerra/trl/pull/260
* Add LlaMa in tests + `create_reference_model` by younesbelkada in https://github.com/lvwerra/trl/pull/261
* Use active model to generate response in example on README (269) by rmill040 in https://github.com/lvwerra/trl/pull/271
* stack-llama by edbeeching in https://github.com/lvwerra/trl/pull/273
* Adding pointer back to Meta's LLaMA. by meg-huggingface in https://github.com/lvwerra/trl/pull/277
* fix doc string problem in ppo trainer loss function by thuwyh in https://github.com/lvwerra/trl/pull/279
* Add LLaMA tutorial to docs by natolambert in https://github.com/lvwerra/trl/pull/278
* Fix swapped helper texts by philipp-classen in https://github.com/lvwerra/trl/pull/284
* fix typo in gpt2-sentiment.ipynb by eltociear in https://github.com/lvwerra/trl/pull/293
* add functionality to push best models to the hub during training by Bearnardd in https://github.com/lvwerra/trl/pull/275
* Small improvements / fixes to toxicity example by natolambert in https://github.com/lvwerra/trl/pull/266
* Fix arguments description by lvzii in https://github.com/lvwerra/trl/pull/298
* [`t5`] Fix negative kl issue by younesbelkada in https://github.com/lvwerra/trl/pull/262
* Log Token distribution of Query / Response by natolambert in https://github.com/lvwerra/trl/pull/295
* clean examples folder by natolambert in https://github.com/lvwerra/trl/pull/294
* fixed typo in error message by soerenarlt in https://github.com/lvwerra/trl/pull/312
* fix DS for peft ref_model in ppo trainer by halfrot in https://github.com/lvwerra/trl/pull/309
* [`CI`] Fix broken tests by younesbelkada in https://github.com/lvwerra/trl/pull/318
* [`Docs`] Add details on multi-GPU / multi-node by younesbelkada in https://github.com/lvwerra/trl/pull/320
* Give a key to the wandb PPOConfig config entry by JulesGM in https://github.com/lvwerra/trl/pull/315
* added doc for using torch.distributed.launch/run by oroojlooy in https://github.com/lvwerra/trl/pull/324
* Fix argument's description by vinhkhuc in https://github.com/lvwerra/trl/pull/339
* stack_llama: update instructions in README, fix broken _get_submodules and save tokenizer by teticio in https://github.com/lvwerra/trl/pull/358
* stack_llama: add parameter to control max_length (to mitigate OOM errors) by teticio in https://github.com/lvwerra/trl/pull/359
* [`PPO`] Relax negative KL constraint by younesbelkada in https://github.com/lvwerra/trl/pull/352
* [`PPOTrainer`] Fix tensorboard issue by younesbelkada in https://github.com/lvwerra/trl/pull/330
* 140/best n sampling by metric-space in https://github.com/lvwerra/trl/pull/326
* Fix bug when loading local peft model by Opdoop in https://github.com/lvwerra/trl/pull/342
* add is_trainable in kwargs by Opdoop in https://github.com/lvwerra/trl/pull/363
* Remove obsolete layer_norm_names parameter and add peft>=0.3.0 to requirements by teticio in https://github.com/lvwerra/trl/pull/366
* Delete test_training.py by younesbelkada in https://github.com/lvwerra/trl/pull/371
* [`core`] Fix warning issue by younesbelkada in https://github.com/lvwerra/trl/pull/377
* Update customization.mdx by binganao in https://github.com/lvwerra/trl/pull/390
* fix dataloader typo in ppo_trainer.py by LZY-the-boys in https://github.com/lvwerra/trl/pull/389
* from_pretrained with peft adapter on the hub (#379) by glerzing in https://github.com/lvwerra/trl/pull/380
* keep state_dict kwargs instead of popping it in save_pretrained by rizar in https://github.com/lvwerra/trl/pull/393
* Remove unused imports in docs. by vwxyzjn in https://github.com/lvwerra/trl/pull/406
## New Contributors
* ohashi56225 made their first contribution in https://github.com/lvwerra/trl/pull/234
* GauravVirmani made their first contribution in https://github.com/lvwerra/trl/pull/241
* SauravMaheshkar made their first contribution in https://github.com/lvwerra/trl/pull/198
* heya5 made their first contribution in https://github.com/lvwerra/trl/pull/253
* rmill040 made their first contribution in https://github.com/lvwerra/trl/pull/271
* thuwyh made their first contribution in https://github.com/lvwerra/trl/pull/279
* philipp-classen made their first contribution in https://github.com/lvwerra/trl/pull/284
* Bearnardd made their first contribution in https://github.com/lvwerra/trl/pull/275
* lvzii made their first contribution in https://github.com/lvwerra/trl/pull/298
* soerenarlt made their first contribution in https://github.com/lvwerra/trl/pull/312
* halfrot made their first contribution in https://github.com/lvwerra/trl/pull/309
* oroojlooy made their first contribution in https://github.com/lvwerra/trl/pull/324
* vinhkhuc made their first contribution in https://github.com/lvwerra/trl/pull/339
* teticio made their first contribution in https://github.com/lvwerra/trl/pull/358
* metric-space made their first contribution in https://github.com/lvwerra/trl/pull/326
* Opdoop made their first contribution in https://github.com/lvwerra/trl/pull/342
* binganao made their first contribution in https://github.com/lvwerra/trl/pull/390
* LZY-the-boys made their first contribution in https://github.com/lvwerra/trl/pull/389
* glerzing made their first contribution in https://github.com/lvwerra/trl/pull/380
* rizar made their first contribution in https://github.com/lvwerra/trl/pull/393
* mnoukhov made their first contribution in https://github.com/lvwerra/trl/pull/398
* tomaarsen made their first contribution in https://github.com/lvwerra/trl/pull/404
* vwxyzjn made their first contribution in https://github.com/lvwerra/trl/pull/406
**Full Changelog**: https://github.com/lvwerra/trl/compare/v0.4.1...v0.4.2