What's New
1. Torch 2.4 Compatibility ([#3542](https://github.com/mosaicml/composer/pull/3542), [#3549](https://github.com/mosaicml/composer/pull/3549), [#3553](https://github.com/mosaicml/composer/pull/3553), [#3552](https://github.com/mosaicml/composer/pull/3552), [#3565](https://github.com/mosaicml/composer/pull/3565))
Composer now supports Torch 2.4! We are tracking a few checkpointing-related issues in the latest PyTorch release, which we have raised with the PyTorch team:
- \[[PyTorch Issue](https://github.com/pytorch/pytorch/issues/133415)\] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using `composer.optim.DecoupledSGDW` as a workaround.
- \[[PyTorch Issue](https://github.com/pytorch/pytorch/issues/133923)\] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this using the following [planner](https://github.com/mosaicml/composer/pull/3565), but this may break custom planner loading.
2. New checkpointing APIs ([#3447](https://github.com/mosaicml/composer/pull/3447), [#3474](https://github.com/mosaicml/composer/pull/3474), [#3488](https://github.com/mosaicml/composer/pull/3488), [#3452](https://github.com/mosaicml/composer/pull/3452))
We've added new checkpointing APIs for downloading, uploading, loading, and saving, so that checkpointing can be used outside of a `Trainer` object. We will be fully migrating to these new APIs in the next minor release.
3. Improved Auto-microbatching ([#3510](https://github.com/mosaicml/composer/pull/3510), [#3522](https://github.com/mosaicml/composer/pull/3522))
We've fixed deadlocks that occurred when using auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved by enabling sync hooks wherever a training run might OOM while searching for the correct microbatch size, and disabling these hooks for the rest of training.
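The microbatch-size search mentioned above can be illustrated with a small, single-process sketch. This is not Composer's implementation: `find_microbatch_size`, `make_fake_step`, and `FakeOOM` are hypothetical names, a plain exception stands in for a CUDA out-of-memory error, and a simple halving strategy is assumed. The FSDP sync hooks that keep ranks in agreement during the search are omitted entirely.

```python
class FakeOOM(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def find_microbatch_size(try_step, batch_size):
    """Halve the microbatch size until `try_step` succeeds.

    Returns the first size that fits, or raises if even size 1 fails.
    """
    microbatch_size = batch_size
    while True:
        try:
            try_step(microbatch_size)
            return microbatch_size
        except FakeOOM:
            if microbatch_size == 1:
                raise RuntimeError("Even a microbatch size of 1 does not fit in memory.")
            # Assumed strategy: retry with half the microbatch size.
            microbatch_size = max(1, microbatch_size // 2)


def make_fake_step(memory_limit):
    """Build a fake training step that 'OOMs' above `memory_limit` samples."""
    def step(microbatch_size):
        if microbatch_size > memory_limit:
            raise FakeOOM(f"microbatch size {microbatch_size} exceeds limit")
    return step


# Tries 64, 32, 16, then 8, which is the first size within the fake limit.
found = find_microbatch_size(make_fake_step(memory_limit=10), batch_size=64)
print(found)  # 8
```

In a real distributed run every rank has to agree on the chosen size, which is where the sync hooks described above come in. Once a working size is found, that extra synchronization is no longer needed for the rest of training.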
Bug Fixes
1. Fix checkpoint symlink uploads ([#3376](https://github.com/mosaicml/composer/pull/3376))
Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.
2. Optimizer tracks same parameters after FSDP wrapping ([#3502](https://github.com/mosaicml/composer/pull/3502))
When only a subset of parameters should be tracked by the optimizer, FSDP wrapping no longer interferes.
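The upload ordering from the first fix above can be sketched as follows. This is only an illustration of the ordering guarantee, not Composer's uploader: `upload_checkpoint`, the file names, and the in-memory `uploaded` list standing in for an object store are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_checkpoint(checkpoint_files, symlink_file, upload):
    """Upload every checkpoint file, then the symlink file that points at them."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() waits for all uploads to finish and re-raises any failure,
        # so the symlink is never published for an incomplete checkpoint.
        list(pool.map(upload, checkpoint_files))
    # Only reached once every checkpoint file has been uploaded.
    upload(symlink_file)


uploaded = []  # stand-in for a remote object store; records upload order
upload_checkpoint(
    checkpoint_files=["ep1/rank0.distcp", "ep1/rank1.distcp", "ep1/.metadata"],
    symlink_file="latest.symlink",
    upload=uploaded.append,
)
print(uploaded[-1])  # latest.symlink
```

Because the symlink is written last, a reader that follows it should only ever find a fully uploaded checkpoint.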
What's Changed
* Bump ipykernel from 6.29.2 to 6.29.5 by dependabot in https://github.com/mosaicml/composer/pull/3459
* Update torchmetrics requirement from <1.3.3,>=0.10.0 to >=1.4.0.post0,<1.4.1 by dependabot in https://github.com/mosaicml/composer/pull/3460
* [Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload by bigning in https://github.com/mosaicml/composer/pull/3376
* Bump databricks-sdk from 0.28.0 to 0.29.0 by dependabot in https://github.com/mosaicml/composer/pull/3456
* Remove Log Exception by jjanezhang in https://github.com/mosaicml/composer/pull/3464
* Corrected docs for MFU in SpeedMonitor by JackZ-db in https://github.com/mosaicml/composer/pull/3469
* [checkpoint v2] Download api by bigning in https://github.com/mosaicml/composer/pull/3447
* Upload api by bigning in https://github.com/mosaicml/composer/pull/3474
* [Checkpoint V2] Upload API by bigning in https://github.com/mosaicml/composer/pull/3488
* Load api by eracah in https://github.com/mosaicml/composer/pull/3452
* Add helpful comment explaining HSDP initialization seeding by mvpatel2000 in https://github.com/mosaicml/composer/pull/3470
* Add fit start to mosaicmllogger by ethanma-db in https://github.com/mosaicml/composer/pull/3467
* Remove OOM-Driven FSDP Deadlocks and Increase Throughput of Automicrobatching by JackZ-db in https://github.com/mosaicml/composer/pull/3510
* Move hooks and fsdp modules onto state rather than trainer by JackZ-db in https://github.com/mosaicml/composer/pull/3522
* Bump coverage[toml] from 7.5.4 to 7.6.0 by dependabot in https://github.com/mosaicml/composer/pull/3471
* revert a wip PR by bigning in https://github.com/mosaicml/composer/pull/3475
* Change FP8 Eval to default to activation dtype by j316chuck in https://github.com/mosaicml/composer/pull/3454
* Get a shared file system safe signal file name by dakinggg in https://github.com/mosaicml/composer/pull/3485
* Bumping flash attention version to v2.6.2 by ShashankMosaicML in https://github.com/mosaicml/composer/pull/3489
* Bump to Pytorch 2.4 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3542
* Add Torch 2.4 Tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/3549
* Fix torch 2.4 images for tests by snarayan21 in https://github.com/mosaicml/composer/pull/3553
* Fix torch 2.4 tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/3552
* Fix bug when subset of model parameters is passed into optimizer with FSDP by sashaDoubov in https://github.com/mosaicml/composer/pull/3502
* Correctly process `parallelism_config['tp']` when it's a dict by snarayan21 in https://github.com/mosaicml/composer/pull/3434
* [torch2.4] Fix sharded checkpointing backward compatibility issue by bigning in https://github.com/mosaicml/composer/pull/3565
* [fix-daily] Use composer get_model_state_dict instead of torch's by eracah in https://github.com/mosaicml/composer/pull/3492
* Load Microbatches instead of Entire Batches to GPU by JackZ-db in https://github.com/mosaicml/composer/pull/3487
* Make Pytest log in color in Github Action by eitanturok in https://github.com/mosaicml/composer/pull/3505
* Revert "Load Microbatches instead of Entire Batches to GPU " by JackZ-db in https://github.com/mosaicml/composer/pull/3508
* Bump transformers version by dakinggg in https://github.com/mosaicml/composer/pull/3511
* Fix FSDP Config Validation by mvpatel2000 in https://github.com/mosaicml/composer/pull/3530
* Add FSDP input validation for use_orig_params and activation_cpu_offload flag by j316chuck in https://github.com/mosaicml/composer/pull/3515
* Fix checkpoint events by b-chu in https://github.com/mosaicml/composer/pull/3468
* Patch conf.py for readthedocs sphinx injection deprecation. by mvpatel2000 in https://github.com/mosaicml/composer/pull/3491
* save load path in state and pass to mosaicmllogger by ethanma-db in https://github.com/mosaicml/composer/pull/3506
* Disable gcs azure daily test by bigning in https://github.com/mosaicml/composer/pull/3514
* Update huggingface-hub requirement from <0.24,>=0.21.2 to >=0.21.2,<0.25 by dependabot in https://github.com/mosaicml/composer/pull/3481
* restore version on dev by XiaohanZhangCMU in https://github.com/mosaicml/composer/pull/3451
* Deprecate deepspeed by dakinggg in https://github.com/mosaicml/composer/pull/3512
* Update importlib-metadata requirement from <7,>=5.0.0 to >=5.0.0,<9 by dependabot in https://github.com/mosaicml/composer/pull/3519
* Update peft requirement from <0.12,>=0.10.0 to >=0.10.0,<0.13 by dependabot in https://github.com/mosaicml/composer/pull/3518
* Use gloo as part of DeviceGPU's process group backend by snarayan21 in https://github.com/mosaicml/composer/pull/3509
* Add a monitor of mlflow logger so that it sets run status as failed if main thread exits unexpectedly by chenmoneygithub in https://github.com/mosaicml/composer/pull/3449
* Revert "Use gloo as part of DeviceGPU's process group backend (#3509)" by snarayan21 in https://github.com/mosaicml/composer/pull/3523
* Fix autoresume docstring (save_overwrite) by eracah in https://github.com/mosaicml/composer/pull/3526
* Unpin pip by dakinggg in https://github.com/mosaicml/composer/pull/3524
* hasattr check for Wandb 0.17.6 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3531
* Remove dev on github workflows by mvpatel2000 in https://github.com/mosaicml/composer/pull/3536
* Remove dev branch in GPU workflows by mvpatel2000 in https://github.com/mosaicml/composer/pull/3539
* restore google cloud object store test by bigning in https://github.com/mosaicml/composer/pull/3538
* Update moto[s3] requirement from <5,>=4.0.1 to >=4.0.1,<6 by dependabot in https://github.com/mosaicml/composer/pull/3516
* use s3 boto3 Adaptive retry as default retry mode by bigning in https://github.com/mosaicml/composer/pull/3543
* Use python 3.11 in GAs by eitanturok in https://github.com/mosaicml/composer/pull/3529
* Implement ruff rules enforcing pep 585 by snarayan21 in https://github.com/mosaicml/composer/pull/3551
* Update numpy requirement from <2.1.0,>=1.21.5 to >=1.21.5,<2.2.0 by dependabot in https://github.com/mosaicml/composer/pull/3556
* Bump databricks-sdk from 0.29.0 to 0.30.0 by dependabot in https://github.com/mosaicml/composer/pull/3559
* Update Optim to DecoupledSGD in Notebooks by mvpatel2000 in https://github.com/mosaicml/composer/pull/3554
* Remove lambda code eval testing by mvpatel2000 in https://github.com/mosaicml/composer/pull/3560
* Restore Azure Tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/3561
* Remove tokens for `to_next_epoch` by mvpatel2000 in https://github.com/mosaicml/composer/pull/3562
* Change iteration timestamp for old checkpoints by b-chu in https://github.com/mosaicml/composer/pull/3563
* Fix typo in `composer_collect_env` by dakinggg in https://github.com/mosaicml/composer/pull/3566
* Add default value to get_device() by coryMosaicML in https://github.com/mosaicml/composer/pull/3568
* add ghcr and update build matrix generator by KevDevSha in https://github.com/mosaicml/composer/pull/3465
* Bump aws_ofi_nccl to 1.11.0 by willgleich in https://github.com/mosaicml/composer/pull/3569
* allow listed runners by KevDevSha in https://github.com/mosaicml/composer/pull/3486
* fix runner linux-ubuntu > ubuntu-latest by KevDevSha in https://github.com/mosaicml/composer/pull/3571
* Bump version to v0.24.0 + deprecations by snarayan21 in https://github.com/mosaicml/composer/pull/3570
New Contributors
* ethanma-db made their first contribution in https://github.com/mosaicml/composer/pull/3467
* KevDevSha made their first contribution in https://github.com/mosaicml/composer/pull/3465
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.23.5...v0.24.0