What's New
**1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)**
Composer now supports Hybrid Sharded Data Parallel (HSDP), in which a model is sharded within blocks of devices and replicated across those blocks, with a configurable block size. By default, the model is sharded within each node and replicated across nodes; to use custom shard and replicate sizes, pass a tuple of process groups. HSDP is enabled through the FSDP config:
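```python
from composer import Trainer

# MyComposerModel stands in for any ComposerModel subclass.
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    # ... remaining Trainer arguments
)
```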
With `HYBRID_SHARD`, the model is sharded using `FULL_SHARD` within each shard block, whereas `_HYBRID_SHARD_ZERO2` uses `SHARD_GRAD_OP` within the shard block.
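To make the default layout concrete, here is a minimal sketch of how global ranks could be split into shard groups and replicate groups. It assumes ranks are numbered consecutively within a node and that the shard block is one node. `hsdp_groups` is a hypothetical helper written for illustration only; it is not part of Composer or PyTorch, and it creates no process groups.

```python
from typing import List, Tuple


def hsdp_groups(world_size: int, shard_size: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (shard_groups, replicate_groups) as lists of global ranks.

    Hypothetical helper for illustration; not part of Composer or PyTorch.
    """
    if world_size % shard_size != 0:
        raise ValueError('world_size must be divisible by shard_size')
    num_blocks = world_size // shard_size
    # Each shard group is one block of consecutive ranks (e.g. one node).
    shard_groups = [
        list(range(block * shard_size, (block + 1) * shard_size))
        for block in range(num_blocks)
    ]
    # Each replicate group links the ranks that hold the same shard in every block.
    replicate_groups = [
        list(range(local_rank, world_size, shard_size))
        for local_rank in range(shard_size)
    ]
    return shard_groups, replicate_groups


# Two nodes with four GPUs each: shard within a node, replicate across nodes.
shard_groups, replicate_groups = hsdp_groups(world_size=8, shard_size=4)
print(shard_groups)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replicate_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In this layout, parameters are sharded across the four ranks of each shard group, and each replicate group connects the ranks that hold the same shard on different nodes.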
**2. Train Loss NaN Monitor (#2704)**
Composer has a new callback that raises a value error if the training loss becomes NaN. This avoids wasting compute when a training run diverges or fails for numerical reasons.
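```python
from composer import Trainer
from composer.callbacks import NaNMonitor

# MyComposerModel stands in for any ComposerModel subclass.
composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    # ... remaining Trainer arguments
)
```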
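For intuition, the kind of check such a callback performs can be sketched in plain Python. This is an illustrative approximation, not Composer's implementation: `contains_nan` and `check_loss` are hypothetical names, and losses are modeled as plain floats, sequences, or dicts rather than tensors.

```python
import math
from typing import Any


def contains_nan(loss: Any) -> bool:
    """Return True if a scalar loss, or any entry of a nested loss, is NaN."""
    if isinstance(loss, dict):
        return any(contains_nan(value) for value in loss.values())
    if isinstance(loss, (list, tuple)):
        return any(contains_nan(value) for value in loss)
    return math.isnan(float(loss))


def check_loss(loss: Any) -> None:
    """Raise as soon as the loss contains a NaN, so the run stops early."""
    if contains_nan(loss):
        raise ValueError('Train loss contains a NaN.')


check_loss({'ce': 0.7, 'aux': 0.1})  # all values finite, so nothing is raised

try:
    check_loss({'ce': float('nan'), 'aux': 0.1})
except ValueError as err:
    print(err)  # Train loss contains a NaN.
```

Running a check like this after every loss computation is cheap compared with a training step, and it stops a diverged run at the first bad batch.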
Bug Fixes
* Fix MPS with dict loss by mvpatel2000 in https://github.com/mosaicml/composer/pull/2706
* Squelch Memory Monitor warnings if device=meta by hanlint in https://github.com/mosaicml/composer/pull/2529
* Switch mosaicml logger to use futures to enable better error handling by j316chuck in https://github.com/mosaicml/composer/pull/2702
What's Changed
* Add partial state dict functionality for FSDP by b-chu in https://github.com/mosaicml/composer/pull/2637
* Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by dependabot in https://github.com/mosaicml/composer/pull/2643
* Bump pytest-codeblocks from 0.16.1 to 0.17.0 by dependabot in https://github.com/mosaicml/composer/pull/2645
* Remove checkpoint on close by mvpatel2000 in https://github.com/mosaicml/composer/pull/2646
* Update latest to 2.1 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2650
* HSDP Support by mvpatel2000 in https://github.com/mosaicml/composer/pull/2648
* Log profile averages by j316chuck in https://github.com/mosaicml/composer/pull/2647
* Daily API key by mvpatel2000 in https://github.com/mosaicml/composer/pull/2655
* Add automatic remote uploader downloader for composer profiler by j316chuck in https://github.com/mosaicml/composer/pull/2653
* Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by willgleich in https://github.com/mosaicml/composer/pull/2651
* Fix GCP tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/2658
* Allow no eval_loader when eval is disabled by b-chu in https://github.com/mosaicml/composer/pull/2657
* Gate HSDP by torch 2.1.0 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2656
* Fix FSDP arg default to match torch by mvpatel2000 in https://github.com/mosaicml/composer/pull/2660
* Bump pypandoc from 1.11 to 1.12 by dependabot in https://github.com/mosaicml/composer/pull/2664
* Bump vit-pytorch from 0.35.8 to 1.6.1 by dependabot in https://github.com/mosaicml/composer/pull/2662
* Upgrade to transformers 4.34.1 by dakinggg in https://github.com/mosaicml/composer/pull/2635
* Update docker readme by mvpatel2000 in https://github.com/mosaicml/composer/pull/2669
* Add script to validate remote object store paths by irenedea in https://github.com/mosaicml/composer/pull/2667
* Torch 2.1 Resumption Support by mvpatel2000 in https://github.com/mosaicml/composer/pull/2665
* Bump gitpython from 3.1.37 to 3.1.40 by dependabot in https://github.com/mosaicml/composer/pull/2663
* Fix dist by mvpatel2000 in https://github.com/mosaicml/composer/pull/2670
* Add torch nightly for torch 2.2.0 10-24 by j316chuck in https://github.com/mosaicml/composer/pull/2671
* Adding Model Data Init and Training Progress to MosaicMLLogger by jjanezhang in https://github.com/mosaicml/composer/pull/2633
* Bump pytest from 7.4.2 to 7.4.3 by dependabot in https://github.com/mosaicml/composer/pull/2678
* Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by dependabot in https://github.com/mosaicml/composer/pull/2677
* Bump traitlets from 5.10.0 to 5.12.0 by dependabot in https://github.com/mosaicml/composer/pull/2674
* Bump cryptography from 41.0.4 to 41.0.5 by dependabot in https://github.com/mosaicml/composer/pull/2675
* Secure Code Eval changes by mvpatel2000 in https://github.com/mosaicml/composer/pull/2679
* Lazy validation of code eval metric by mvpatel2000 in https://github.com/mosaicml/composer/pull/2681
* Upgrade transformers to 4.35 by dakinggg in https://github.com/mosaicml/composer/pull/2684
* Bump traitlets from 5.12.0 to 5.13.0 by dependabot in https://github.com/mosaicml/composer/pull/2687
* Bump ipykernel from 6.25.2 to 6.26.0 by dependabot in https://github.com/mosaicml/composer/pull/2686
* Add Kwargs to upload_object by nik-mosaic in https://github.com/mosaicml/composer/pull/2692
* Add version number to composer metadata logs by j316chuck in https://github.com/mosaicml/composer/pull/2565
* Add distributed barrier test fixture to ensure pytest cleans up resources properly by j316chuck in https://github.com/mosaicml/composer/pull/2694
* Properly handle empty metric_names passed to Trainer._filter_metrics by irenedea in https://github.com/mosaicml/composer/pull/2700
* Train loss NaN checking callback by coryMosaicML in https://github.com/mosaicml/composer/pull/2704
* Adding logging and force flushing for run events by jjanezhang in https://github.com/mosaicml/composer/pull/2703
* [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by eracah in https://github.com/mosaicml/composer/pull/2705
* Fix MPS with dict loss by mvpatel2000 in https://github.com/mosaicml/composer/pull/2706
* Update types to follow PEP 585 by b-chu in https://github.com/mosaicml/composer/pull/2697
* Bump yamllint from 1.32.0 to 1.33.0 by dependabot in https://github.com/mosaicml/composer/pull/2708
* Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by dependabot in https://github.com/mosaicml/composer/pull/2709
* Squelch Memory Monitor warnings if device=meta by hanlint in https://github.com/mosaicml/composer/pull/2529
* Fix NaN monitor for loss dicts. by coryMosaicML in https://github.com/mosaicml/composer/pull/2712
* Switch mosaicml logger to use futures to enable better error handling by j316chuck in https://github.com/mosaicml/composer/pull/2702
* Fetching arguments for FSDP by mvpatel2000 in https://github.com/mosaicml/composer/pull/2710
* Bump version to 0.17 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2711
New Contributors
* willgleich made their first contribution in https://github.com/mosaicml/composer/pull/2651
* jjanezhang made their first contribution in https://github.com/mosaicml/composer/pull/2633
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.16.4...v0.17.0