by kashif in https://github.com/huggingface/trl/pull/2221
## Use pairwise judges for online methods
The `OnlineDPOTrainer` and any trainers that inherit from it (`NashMDTrainer` and `XPOTrainer`) can now accept an initialized `PairwiseJudge` instead of a reward model.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
by kashif in https://github.com/huggingface/trl/pull/2243
## Rename trainer arg `tokenizer` to `processing_class`
The `tokenizer` argument in the trainers has been renamed to `processing_class` to better reflect the fact that it can be not only a tokenizer but also a processor.
```diff
- trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)
+ trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
```
`tokenizer` is still supported by `SFTTrainer` and `DPOTrainer`, but it is deprecated and will be removed in the next release.
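Since `processing_class` accepts any processing object, a multimodal processor can be passed in place of a tokenizer. The sketch below is purely illustrative: the model and dataset names are placeholders and are not taken from the PR.

```python
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

# Placeholder names for illustration; use your own vision-language model and preference dataset.
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
train_dataset = load_dataset("my-org/my-vlm-preference-dataset", split="train")

training_args = DPOConfig(output_dir="idefics2-8b-DPO")
trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=processor,  # a processor, not only a tokenizer
)
trainer.train()
```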
by qgallouedec in https://github.com/huggingface/trl/pull/2162
## Adding weighted preference optimization (WPO) to DPO
The [WPO](https://huggingface.co/papers/2406.11827) paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the `use_weighting` flag to `True` in the `DPOConfig`.
```python
DPOConfig(..., use_weighting=True)
```
<img width="1112" alt="Screenshot 2024-11-04 at 10 59 38" src="https://github.com/user-attachments/assets/544ddc02-bd09-4f21-b8a4-b81c21561a9b">
<img width="539" alt="Screenshot 2024-11-04 at 10 59 22" src="https://github.com/user-attachments/assets/8d5afe9e-89bd-4d00-8483-dd7ba98997e7">
by gaetanlop in https://github.com/huggingface/trl/pull/2141
## Model card for TRL
Using `trainer.push_to_hub()` now automatically creates a model card that includes:
- A link to the base model used
- A link to the dataset used for training
- A link to the TRL repository
- Sample demo code
- A link to the associated Weights & Biases run
- A link to the paper detailing the training procedure
- Versions of dependencies
- BibTeX citations for both the training procedure and TRL
All links are properly formatted to allow cross-referencing, enabling traceability back to sources (e.g., the model appears linked on the paper's page).
https://github.com/user-attachments/assets/b903964e-9087-45cc-8fb0-2418fdd87b72
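As a usage note, pushing a trained model from any TRL trainer triggers the model card generation; the commit message below is just an example.

```python
# Pushing to the Hub also generates the model card described above.
trainer.push_to_hub(commit_message="Model trained with TRL")
```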
by qgallouedec in https://github.com/huggingface/trl/pull/2123
## Minor
### Conversational dataset support
You can now use conversational datasets directly, without needing to apply a chat template beforehand, for the following trainers:
- `BCOTrainer` (by qgallouedec in PR 2107)
- `CPOTrainer` (by qgallouedec in PR 2144)
- `DPOTrainer` (by qgallouedec in PR 2131)
- `KTOTrainer` (by qgallouedec in PR 2248)
- `ORPOTrainer` (by qgallouedec in PR 2184)
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(dataset_name, split="train")

# Not needed anymore:
#
# def process(row):
#     prompt = tokenizer.apply_chat_template(row["prompt"], tokenize=False, add_generation_prompt=True)
#     prompt_chosen = tokenizer.apply_chat_template(row["prompt"] + row["chosen"], tokenize=False)
#     chosen = prompt_chosen[len(prompt):]
#     prompt_rejected = tokenizer.apply_chat_template(row["prompt"] + row["rejected"], tokenize=False)
#     rejected = prompt_rejected[len(prompt):]
#     return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
#
# dataset = dataset.map(process)

training_args = DPOConfig(output_dir="...")
trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```
### Refactor DPO data processing
For more information, see PR 2209.
### `trl env` for printing system info
You can now use `trl env` to print system information, including the platform, Python version, PyTorch version, CUDA device(s), and versions of various libraries.
```
$ trl env

Copy-paste the following information when reporting an issue:

- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.47.0.dev0
- Accelerate version: 0.19.0
- Accelerate config: not found
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.0+14ef1ab
- bitsandbytes version: 0.44.1
- DeepSpeed version: 0.15.3
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.13.2
```
by qgallouedec in https://github.com/huggingface/trl/pull/2104
### Sequence-Level KD