## New Features
### Sequence parallelism support via ring-flash-attn
Sequence parallelism enables long-context training by distributing each sequence across GPUs, reducing per-device memory requirements and allowing near-linear scaling of context length with the number of GPUs. This complements Axolotl's other parallelism features, including FSDP and DeepSpeed. See our documentation [here](https://axolotl-ai-cloud.github.io/axolotl/docs/sequence_parallelism.html).
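As a rough sketch, enabling sequence parallelism is a small config change. The keys below mirror the documented options, but exact names and requirements may vary by Axolotl version, so verify against the linked docs:

```yaml
# Sketch: shard each sequence across 4 GPUs for long-context training.
# Option names are assumptions -- check the sequence parallelism docs above.
base_model: NousResearch/Meta-Llama-3-8B
sequence_len: 32768             # long context made feasible by sharding
sequence_parallel_degree: 4     # number of GPUs each sequence is split across
flash_attention: true           # ring-flash-attn builds on flash attention
```

With a degree of 4, each GPU holds roughly a quarter of every sequence's activations, which is what drops per-device memory.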
<img width="763" alt="Screenshot 2025-04-02 at 9 17 14 AM" src="https://github.com/user-attachments/assets/308db66d-084e-45b1-87c3-1a7b405390bc" />
### Gemma-3

Gemma-3 support has landed, alongside several features to help you fine-tune Gemma-3 models:
- Cut cross entropy
- Liger kernel
- Multimodal
- Fixed loss calculation for gradient accumulation
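A minimal sketch of a Gemma-3 config using these features. The plugin paths and flags are assumptions modeled on Axolotl's existing integration keys; consult the docs for the exact names in your version:

```yaml
# Sketch: Gemma-3 fine-tune with cut cross entropy and Liger kernels enabled.
# Plugin paths and option names below are assumptions, not verified API.
base_model: google/gemma-3-4b-it
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.liger.LigerPlugin
cut_cross_entropy: true     # memory-efficient loss computation
liger_rms_norm: true        # fused Liger kernels
liger_glu_activation: true
```

Cut cross entropy and Liger target different parts of the forward pass (loss vs. norm/activation kernels), so they can typically be combined.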
### Multimodal
- Beta support for a variety of multi-modal models:
  - Mllama
  - Pixtral
  - Llava-1.5
  - Mistral-Small-3.1
  - Gemma-3
  - Qwen2-VL
  - Qwen2.5-VL
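A hypothetical starting point for one of these models. Since multimodal support is beta, the keys below (`processor_type`, `skip_prepare_dataset`, and the example dataset) are assumptions for illustration only; check the multimodal docs for your model family:

```yaml
# Sketch: beta multimodal fine-tune. Keys and dataset are illustrative assumptions.
base_model: Qwen/Qwen2.5-VL-7B-Instruct
processor_type: AutoProcessor   # load the paired image/text processor
skip_prepare_dataset: true      # let the collator handle image preprocessing
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft   # example image-chat dataset
    type: chat_template
```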
### Additional Features
- Updated cut-cross-entropy patches for several models: Cohere, Cohere-2, Gemma, Gemma-2, Gemma-3, Mistral-3, and Mllama
- Support for the REX learning rate scheduler ([arXiv:2107.04197](https://arxiv.org/abs/2107.04197))
- Tokenizer Overrides - you can now fine-tune with custom values in tokenizers using reserved tokens
- Single-GPU and DDP support for the Muon optimizer
- Sequential sample packing for curriculum learning
- Faster GRPO training with distributed vLLM: you can now use `axolotl vllm-serve path/to/config.yaml` to serve a separate vLLM instance, which can use multiple GPUs to speed up trajectory generation during GRPO.
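The distributed-vLLM workflow runs two processes against the same config: the server on the generation GPUs and the trainer on the rest. The `trl` keys below are assumptions mirroring trl's vLLM-server options; verify them against the docs before use:

```yaml
# Sketch: GRPO with a separate vLLM server for trajectory generation.
# Terminal 1 (generation GPUs):  axolotl vllm-serve path/to/config.yaml
# Terminal 2 (training GPUs):    axolotl train path/to/config.yaml
# The trl.* keys are assumptions -- confirm names in the GRPO docs.
rl: grpo
trl:
  use_vllm: true              # offload rollouts to the vLLM server
  vllm_server_host: 0.0.0.0   # where the trainer reaches the server
  vllm_server_port: 8000
```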
## Notes
v0.8.x will be the last series of releases to officially support torch<=2.4.1. With PyTorch 2.7 releasing this month, we aim to support the latest two stable releases of PyTorch.
We expect FSDP2 support to be a fast follow, and we'll include it in v0.8.1 once we can fix and validate issues such as checkpoint saving.
## What's Changed
* `train.py` refactor by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2371
* fix(doc): add installation for cce to docs by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2375
* chore(docs): remove phorm by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2378
* feat(doc): add docker images explanation by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2379
* feat(doc): document drop_system_message and clarify limitation by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2381
* chore(doc): add clarification about mpi4py error on single gpu deepspeed by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2383
* fix(doc): add missing low_cpu_mem_usage config to docs by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2369
* feat(grpo): add reward_weights config and refactor by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2365
* Add REX LR Scheduler by xzuyn in https://github.com/axolotl-ai-cloud/axolotl/pull/2380
* Update Tokenizer Overrides Handling in models.py by mhenrichsen in https://github.com/axolotl-ai-cloud/axolotl/pull/1549
* various fixes 20250305 by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2384
* Optimizer refactor and add Muon support by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2367
* remove lion-pytorch as it's already handled upstream by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2389
* refactor: trl grpo configs to have descriptions by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2386
* feat(doc): add more info on RewardModel datasets by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2391
* chore(doc): add faq when having no default chat_template by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2398
* Use Latest Cut Cross Entropy by xzuyn in https://github.com/axolotl-ai-cloud/axolotl/pull/2392
* fix: create mount folder on modal if not exist by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2390
* include iproute2 and nvtop in cloud image by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2393
* fix(modal): add git pull when getting branch files by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2399
* pass additional info for fix untrained tokens when using distributed + offloading by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2388
* use max of 32 dataset processes if not explicit by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2403
* build cloud images with torch 2.6.0 by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2413
* only validate hf user token on rank 0 by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2408
* fixes against upstream main branches by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2407
* chore(docs): add cookbook/blog link to docs by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2410
* Feat: minor docs improvements for RLHF and faq on embeddings by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2401
* Update README.md by SicariusSicariiStuff in https://github.com/axolotl-ai-cloud/axolotl/pull/2360
* use default torch fused adamw optimizer as default as adamw_hf is deprecated by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2425
* bump HF versions except for trl by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2427
* add 12.8.1 cuda to the base matrix by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2426
* add run on novita ai by liyiligang in https://github.com/axolotl-ai-cloud/axolotl/pull/2421
* chore(doc): add instructions on adding custom integrations by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2422
* Fixing KTO+QLoRA+multi-GPU by SalmanMohammadi in https://github.com/axolotl-ai-cloud/axolotl/pull/2420
* adding pre-commit auto-update GH action and bumping plugin versions by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2428
* chore(doc): add explanation on fsdp_transformer_layer_cls_to_wrap by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2429
* Autodoc generation with quartodoc by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2419
* Sequence parallelism by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2412
* installing axolotl prior to quartodoc build by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2434
* Fix failing test by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2436
* Feat: Add support for gemma3_text and add e2e for gemma2 by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2406
* Feat: Rework multimodal support (mllama, llava, pixtral, qwen2, qwen25, gemma3, mistral3) by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2435
* feat: add CCE for gemma3, cohere, and cohere2 by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2443
* chore: minor optim changes (add apollo, improve docs, remove lion-pytorch) by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2444
* fix(doc): document `do_causal_lm_eval` required to run `eval_causal_lm_metrics` by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2445
* Set the pytorch_cuda_alloc_conf env in the train module by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2447
* add override of upstream fix for multi-gpu orpo by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2440
* hf offline decorator for tests to workaround rate limits by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2452
* bump liger to 0.5.5 by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2448
* use offline for precached stream dataset by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2453
* fix streaming packing test by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2454
* fix: minor patches for multimodal by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2441
* Sequence parallelism quick follow-ups; remove ModelCallback by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2450
* destroy process group on Ctrl+C / training or eval run by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2457
* Ray train bugfix by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2458
* Updates for trl 0.16.0 - mostly for GRPO by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2437
* Fix(doc): Clarify doc on attention configs and missing pad_token by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2455
* Sequential sample packing by DreamGenX in https://github.com/axolotl-ai-cloud/axolotl/pull/2404
* gemma3 packing fixes by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2449
* Release update 20250331 by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2460
* Fix(doc): Minor doc changes for peft and modal by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2462
* Fix: remove the numerous sequential log by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2461
* Validation for Muon optimizer with DS/FSDP by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2464
* fixing eval for SP by djsaunde in https://github.com/axolotl-ai-cloud/axolotl/pull/2468
* fix: downgrade deepspeed to fix grad checkpoint oom by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2465
* fix: set rl=None during inference by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2463
* torch 2.7.0 base image for testing by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2467
* fix: pydantic warning validator not returning self by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2474
* feat: add support for multimodal in lora kernels by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2472
* fix: gemma3 loss in forward pass by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2473
* fix: disable SP during merge by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2470
* fix: separate gemma3 text and vision example config by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2471
* fix(doc): document offload gradient_checkpointing option by NanoCode012 in https://github.com/axolotl-ai-cloud/axolotl/pull/2475
* set release version 0.8.0 by winglian in https://github.com/axolotl-ai-cloud/axolotl/pull/2476
## New Contributors
* SicariusSicariiStuff made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2360
* liyiligang made their first contribution in https://github.com/axolotl-ai-cloud/axolotl/pull/2421
**Full Changelog**: https://github.com/axolotl-ai-cloud/axolotl/compare/v0.7.1...v0.8.0