What's New
**1. New Events (2264)**
Composer now has the events `EVAL_BEFORE_ALL` and `EVAL_AFTER_ALL`, which let users control logging of bespoke evaluation information across all evaluators.
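As a rough illustration of how these events might be used, the sketch below defines a callback that measures the wall-clock time spent across all evaluators. It assumes that callbacks expose hook methods named after the lowercase event names (`eval_before_all`, `eval_after_all`) and that the logger offers `log_metrics`; the class name `EvalWallClockLogger` and the metric key are hypothetical. A stand-in base class is used when Composer is not installed so the snippet still runs.

```python
import time
from typing import Any, Optional

try:
    from composer.core import Callback
except ImportError:
    # Stand-in base class so this sketch runs without Composer installed.
    class Callback:  # type: ignore[no-redef]
        pass


class EvalWallClockLogger(Callback):
    """Hypothetical callback: logs total time spent across all evaluators."""

    def __init__(self) -> None:
        self._start: Optional[float] = None

    def eval_before_all(self, state: Any, logger: Any) -> None:
        # Assumed to fire once, before the first evaluator runs.
        self._start = time.perf_counter()

    def eval_after_all(self, state: Any, logger: Any) -> None:
        # Assumed to fire once, after the last evaluator finishes.
        if self._start is None:
            return
        elapsed = time.perf_counter() - self._start
        # The metric key below is made up for illustration.
        logger.log_metrics({'eval/all_evaluators_wall_clock_s': elapsed})
        self._start = None
```

If the assumptions hold, the callback would be passed to the trainer through its `callbacks` argument, for example `Trainer(..., callbacks=[EvalWallClockLogger()])`.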
**2. Elastic Sharded Checkpointing**
Traditionally, checkpoints are stored as giant monoliths. For large model training, moving the entire model to one node may be infeasible, and writing one large file from one node may be slow. Composer now supports elastic sharded checkpoints with FSDP, where every rank writes a single shard of the checkpoint. This checkpointing strategy is elastic: even if you resume on a different number of GPUs, Composer will handle resumption. To enable sharded checkpointing, specify `'state_dict_type': 'sharded'` in the FSDP config:
from composer import Trainer
# MyComposerModel is a placeholder for your own ComposerModel subclass
composer_model = MyComposerModel(n_layers=3)
fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'sharded',
    # Saves each set of checkpoint shards to a unique folder based on batch
    'sharded_ckpt_prefix_dir': 'ba{batch}-shards',
}
trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_interval='2ba',
    ...
)
See the [docs](https://docs.mosaicml.com/projects/composer/en/latest/notes/distributed_training.html#saving-and-loading-sharded-checkpoints-with-fsdp) for more information on how to integrate this with your project.
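To make the idea of elasticity concrete, here is a toy sketch in plain Python. It is not Composer's or FSDP's implementation: the function names (`shard_bounds`, `save_shards`, `load_shard`) and the contiguous-slice layout are assumptions chosen for illustration. Each rank saves one contiguous slice of a flat parameter list, and a run resumed with a different world size reads only the saved slices that overlap its new range.

```python
from typing import Dict, List, Tuple


def shard_bounds(n: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) range of a length-n list owned by `rank`."""
    per_rank = -(-n // world_size)  # ceiling division
    start = min(rank * per_rank, n)
    end = min(start + per_rank, n)
    return start, end


def save_shards(params: List[float], world_size: int) -> Dict[int, List[float]]:
    """Simulate every rank writing only its own shard."""
    shards = {}
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        shards[rank] = params[start:end]
    return shards


def load_shard(saved: Dict[int, List[float]], n: int,
               new_world_size: int, rank: int) -> List[float]:
    """Rebuild the shard for `rank` under a new world size.

    Only the saved shards that overlap the new range are read, so no
    rank ever needs to materialize the whole parameter list.
    """
    old_world_size = len(saved)
    start, end = shard_bounds(n, new_world_size, rank)
    out: List[float] = []
    for old_rank in range(old_world_size):
        old_start, old_end = shard_bounds(n, old_world_size, old_rank)
        lo, hi = max(start, old_start), min(end, old_end)
        if lo < hi:
            out.extend(saved[old_rank][lo - old_start:hi - old_start])
    return out


# Save on 4 "GPUs", then resume on 2.
params = [float(i) for i in range(10)]
saved = save_shards(params, world_size=4)
resumed = [load_shard(saved, len(params), new_world_size=2, rank=r) for r in range(2)]
```

In this toy setup, concatenating the two resumed shards reproduces the original parameter list, which is the property that lets a run resume on a different number of GPUs.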
Bug Fixes
* Fixes runtime estimator when using multiple evaluators in https://github.com/mosaicml/composer/pull/2331
* Fix autoresume docs link in https://github.com/mosaicml/composer/pull/2332
* Use Enum value when logging hyper-parameters in https://github.com/mosaicml/composer/pull/2386
* Fix GCSObjectStore to match function signatures of other object stores in https://github.com/mosaicml/composer/pull/2445
* Cast to float32 before numpy() to avoid bf16 errors in https://github.com/mosaicml/composer/pull/2441
What's Changed
* Update numpy requirement from <1.25.0,>=1.21.5 to >=1.21.5,<1.26.0 by dependabot in https://github.com/mosaicml/composer/pull/2316
* Bump ipykernel from 6.23.1 to 6.23.2 by dependabot in https://github.com/mosaicml/composer/pull/2317
* Bump sphinxcontrib-katex from 0.9.5 to 0.9.6 by dependabot in https://github.com/mosaicml/composer/pull/2319
* Pin Apex by mvpatel2000 in https://github.com/mosaicml/composer/pull/2322
* CodeQL on PRs by mvpatel2000 in https://github.com/mosaicml/composer/pull/2323
* Add secrets check as part of pre-commit by karan6181 in https://github.com/mosaicml/composer/pull/2324
* Update local rank 0 to be elastic by mvpatel2000 in https://github.com/mosaicml/composer/pull/2321
* Bump pytest from 7.3.1 to 7.4.0 by dependabot in https://github.com/mosaicml/composer/pull/2330
* Bump ipykernel from 6.23.2 to 6.23.3 by dependabot in https://github.com/mosaicml/composer/pull/2329
* Auto add mosaicml logger by mvpatel2000 in https://github.com/mosaicml/composer/pull/2325
* Add precision config arg for FP8 by julian-q in https://github.com/mosaicml/composer/pull/2335
* Fixes daily test failures with respect to autoadd mosaicml logger by mvpatel2000 in https://github.com/mosaicml/composer/pull/2339
* In-line group to avoid OOM by mvpatel2000 in https://github.com/mosaicml/composer/pull/2320
* Set offload_to_cpu True for state_dict_type=sharded by eracah in https://github.com/mosaicml/composer/pull/2338
* Update version to 15.1 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2341
* Fix mapi mocking by mvpatel2000 in https://github.com/mosaicml/composer/pull/2342
* Change gpu timeout by rishab-partha in https://github.com/mosaicml/composer/pull/2343
* Fix test_fsdp_load_old_checkpoint test to fix daily tests by eracah in https://github.com/mosaicml/composer/pull/2347
* Add spaces between sentences in eval label warning by srstevenson in https://github.com/mosaicml/composer/pull/2327
* Avoid overwriting seed==0 by tbenthompson in https://github.com/mosaicml/composer/pull/2352
* Small Documentation Typo Fixes by sarthak-314 in https://github.com/mosaicml/composer/pull/2349
* Fix wandb error with autoresume issue by eracah in https://github.com/mosaicml/composer/pull/2353
* Bump ipykernel from 6.23.3 to 6.24.0 by dependabot in https://github.com/mosaicml/composer/pull/2360
* raise min mcli by mvpatel2000 in https://github.com/mosaicml/composer/pull/2362
* Add node rank to signal files by mvpatel2000 in https://github.com/mosaicml/composer/pull/2363
* Move pydantic pin to deepspeed by mvpatel2000 in https://github.com/mosaicml/composer/pull/2366
* Batch log metrics calls in speed_monitor.py by prithvikannan in https://github.com/mosaicml/composer/pull/2367
* Read Composer run name env var by mvpatel2000 in https://github.com/mosaicml/composer/pull/2372
* Fix typing for args in streaming by dakinggg in https://github.com/mosaicml/composer/pull/2373
* Add distributed sync during wait_for_workers to avoid timeout for large checkpoints by dakinggg in https://github.com/mosaicml/composer/pull/2368
* Update torchmetrics requirement from <0.12,>=0.10.0 to >=0.10.0,<1.1 by dependabot in https://github.com/mosaicml/composer/pull/2358
* Add code eval dataset and metric by rishab-partha in https://github.com/mosaicml/composer/pull/2301
* Isolate env var in unit tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/2379
* Add extra steps for space free up by XiaohanZhangCMU in https://github.com/mosaicml/composer/pull/2382
* regex changed in time.py by megha95 in https://github.com/mosaicml/composer/pull/2378
* Support no param models by making optimizer optional by mvpatel2000 in https://github.com/mosaicml/composer/pull/2374
* pin identify version to resolve codequality failures by XiaohanZhangCMU in https://github.com/mosaicml/composer/pull/2391
* Add ls to object stores by dakinggg in https://github.com/mosaicml/composer/pull/2376
* Change transformers by rishab-partha in https://github.com/mosaicml/composer/pull/2383
* Respect MLflow experiment environment variable by aspfohl in https://github.com/mosaicml/composer/pull/2377
* Change code eval apikey by rishab-partha in https://github.com/mosaicml/composer/pull/2394
* Moves pytest-cpu slack notifications to issues from helpdesk by mvpatel2000 in https://github.com/mosaicml/composer/pull/2398
* Add code eval docs by rishab-partha in https://github.com/mosaicml/composer/pull/2397
* fixed pre-commit issues with modifications to pretty-format-json args. by snarayan21 in https://github.com/mosaicml/composer/pull/2392
* Fix LOCAL_WORLD_SIZE in pytest by rishab-partha in https://github.com/mosaicml/composer/pull/2407
* Add code eval secrets to workflows by rishab-partha in https://github.com/mosaicml/composer/pull/2399
* Enable Elastic Sharded Checkpointing by eracah in https://github.com/mosaicml/composer/pull/2262
* Remove compute_on_step from MAP by priba in https://github.com/mosaicml/composer/pull/2390
* Save metadata and integration when save_weights_only is set by eracah in https://github.com/mosaicml/composer/pull/2396
* remove unused Trainer docstring arg load_fsdp_monolith_rank0_only by eracah in https://github.com/mosaicml/composer/pull/2408
* torch2.0.1 custom auto wrap by vchiley in https://github.com/mosaicml/composer/pull/2400
* Add ruff pre-commit by Skylion007 in https://github.com/mosaicml/composer/pull/2414
* Switch google cloud backend from libcloud to google cloud storage API by XiaohanZhangCMU in https://github.com/mosaicml/composer/pull/2340
* Updates GPU test timeout to use mcloud flag by mvpatel2000 in https://github.com/mosaicml/composer/pull/2420
* Add `EVAL_STANDALONE_START` and `EVAL_STANDALONE_END` events and change RUD to not `wait_for_workers` every eval by dakinggg in https://github.com/mosaicml/composer/pull/2418
* Throttle optimizer monitor by mvpatel2000 in https://github.com/mosaicml/composer/pull/2419
* Adding extra condition to avoid running eval_train_metrics by furkanbiten in https://github.com/mosaicml/composer/pull/2411
* fp8 on Ada by dskhudia in https://github.com/mosaicml/composer/pull/2424
* Bump coverage[toml] from 7.2.7 to 7.3.0 by dependabot in https://github.com/mosaicml/composer/pull/2432
* Bump cryptography from 38.0.4 to 41.0.3 by dependabot in https://github.com/mosaicml/composer/pull/2436
* Bump ipykernel from 6.24.0 to 6.25.1 by dependabot in https://github.com/mosaicml/composer/pull/2434
* Multilingual compatibility and batching for Code Evaluation by rishab-partha in https://github.com/mosaicml/composer/pull/2410
* Update max duration on tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/2429
* Update timeout by rishab-partha in https://github.com/mosaicml/composer/pull/2438
* add dist.barrier to rotate_checkpoints by eracah in https://github.com/mosaicml/composer/pull/2440
* Bump version to 0.16 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2439
* Fix notebooks by rishab-partha in https://github.com/mosaicml/composer/pull/2446
* Fix notebooks v2 by rishab-partha in https://github.com/mosaicml/composer/pull/2448
New Contributors
* eltociear made their first contribution in https://github.com/mosaicml/composer/pull/2333
* antoinebrl made their first contribution in https://github.com/mosaicml/composer/pull/2334
* julian-q made their first contribution in https://github.com/mosaicml/composer/pull/2335
* srstevenson made their first contribution in https://github.com/mosaicml/composer/pull/2327
* tbenthompson made their first contribution in https://github.com/mosaicml/composer/pull/2352
* sarthak-314 made their first contribution in https://github.com/mosaicml/composer/pull/2349
* prithvikannan made their first contribution in https://github.com/mosaicml/composer/pull/2367
* XiaohanZhangCMU made their first contribution in https://github.com/mosaicml/composer/pull/2382
* megha95 made their first contribution in https://github.com/mosaicml/composer/pull/2378
* snarayan21 made their first contribution in https://github.com/mosaicml/composer/pull/2392
* priba made their first contribution in https://github.com/mosaicml/composer/pull/2390
* Skylion007 made their first contribution in https://github.com/mosaicml/composer/pull/2414
* furkanbiten made their first contribution in https://github.com/mosaicml/composer/pull/2411
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.15.0...v0.16.0