What's New
**1. Parallelism V2 + Tensor Parallel (3335)**
Composer now supports PyTorch's implementation of [tensor parallelism](https://pytorch.org/docs/stable/distributed.tensor.parallel.html). As part of this, we've revamped and simplified how Composer does distributed training. Previously, the Trainer accepted an `fsdp_config` argument:

    trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})

As we generalize to more forms of parallelism, we've deprecated `fsdp_config` in favor of `parallelism_config`:

    trainer = Trainer(
        model = model,
        ...
        parallelism_config = {
            'fsdp': {
                'sharding_strategy': 'FULL_SHARD',
                'data_parallel_shard_degree': 2,      # Size of shard dimension
                'data_parallel_replicate_degree': 2,  # Size of replicate dimension
            },
            'tp_config': {
                'tensor_parallel_degree': 2,  # Size of TP dimension
                'layer_plan': ...,            # Describes how to apply TP to layers
            }
        }
    )

As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so this change migrates to the new backend, which avoids various checkpointing bugs.
See the [tensor parallel docs](https://docs.mosaicml.com/projects/composer/en/latest/notes/distributed_training.html#tensor-parallel-tp) for more information. Note that tensor parallelism is still experimental and may be subject to breaking API changes. Some checkpointing features may not yet work with tensor parallelism.
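For existing configs, the migration described above amounts to nesting the old `fsdp_config` dictionary under the `'fsdp'` key. The following is a minimal sketch of that idea in plain Python, not Composer's own code: both helper names are hypothetical, and `required_world_size` assumes that the three degree settings form the dimensions of a single device mesh, so that their product is the number of devices the config expects.

```python
from math import prod
from typing import Any, Optional


def to_parallelism_config(
    fsdp_config: Optional[dict[str, Any]] = None,
    tp_config: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Nest legacy-style settings under the keys used in the example above.

    Hypothetical helper for illustration; it is not part of Composer.
    """
    config: dict[str, Any] = {}
    if fsdp_config is not None:
        config['fsdp'] = dict(fsdp_config)  # Copy so the caller's dict is untouched
    if tp_config is not None:
        config['tp_config'] = dict(tp_config)
    return config


def required_world_size(parallelism_config: dict[str, Any]) -> int:
    """Multiply the degree settings that are present.

    Assumption: a degree that is not set contributes a factor of 1 here.
    The library may choose a different default for unset degrees.
    """
    fsdp = parallelism_config.get('fsdp', {})
    tp = parallelism_config.get('tp_config', {})
    return prod([
        fsdp.get('data_parallel_shard_degree', 1),
        fsdp.get('data_parallel_replicate_degree', 1),
        tp.get('tensor_parallel_degree', 1),
    ])
```

Under these assumptions, the example config above (shard degree 2, replicate degree 2, TP degree 2) would describe a 2 x 2 x 2 mesh.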
**2. MLFlow API Simplification**
Previously, the MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:

    mlflow_logger = MLFlowLogger(
        tracking_uri = 'databricks',
        experiment_name = '/Users/xxx.yyyzzz.com/my-first-project/'
    )
    trainer = Trainer(
        model = model,
        ...
        loggers = mlflow_logger,
    )

Now, if you are using Databricks secrets as an environment variable, Composer will autopopulate `tracking_uri` and the `experiment_name` prefix:

    trainer = Trainer(
        model = model,
        ...
        loggers = MLFlowLogger(experiment_name='my-first-project'),
    )

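To make the defaulting behavior concrete, here is a rough sketch of the kind of logic involved. It is not Composer's implementation: `resolve_mlflow_defaults` is a hypothetical helper, the use of the `DATABRICKS_TOKEN` environment variable as the signal is an assumption, and the caller supplies the user name rather than having it looked up from Databricks.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_mlflow_defaults(
    experiment_name: str,
    user: str,
    tracking_uri: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> tuple[Optional[str], str]:
    """Fill in Databricks defaults when credentials are present in the environment.

    Hypothetical helper: the environment variable name and the way the user
    name is obtained are assumptions made for this illustration.
    """
    env = os.environ if env is None else env
    if 'DATABRICKS_TOKEN' not in env:
        # No Databricks credentials: leave both values exactly as given.
        return tracking_uri, experiment_name
    if tracking_uri is None:
        tracking_uri = 'databricks'
    if not experiment_name.startswith('/'):
        # A relative name gets the per-user prefix; absolute paths are kept.
        experiment_name = f'/Users/{user}/{experiment_name}'
    return tracking_uri, experiment_name
```

With credentials present, a relative name such as `'my-first-project'` is expanded to a path under `/Users/<user>/`, while an absolute path or an explicit `tracking_uri` is left alone.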
**3. Wallclock Save Interval**
Composer now supports setting a save interval in wallclock time:

    trainer = Trainer(
        model = model,
        ...
        save_interval='30m',
    )

Note that most durations, such as `max_duration`, do not accept wallclock time. The initial version of this feature is limited to a subset of time settings, such as `save_interval`.
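Conceptually, a wallclock save interval triggers a checkpoint once enough elapsed time has accumulated since the last save. The sketch below illustrates that idea in plain Python and is not Composer's implementation: `parse_wallclock` and `WallclockSaveInterval` are hypothetical names, and the set of accepted suffixes (`s`, `m`, `h`) is an assumption, since only `'30m'` appears above.

```python
import re
from datetime import timedelta

# Assumed suffixes for this sketch; the library may accept a different set.
_UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours'}


def parse_wallclock(interval: str) -> timedelta:
    """Parse strings such as '30m' into a timedelta."""
    match = re.fullmatch(r'(\d+)([smh])', interval.strip())
    if match is None:
        raise ValueError(f'Unrecognized wallclock interval: {interval!r}')
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


class WallclockSaveInterval:
    """Report when at least one full interval has passed since the last save."""

    def __init__(self, interval: str, start_seconds: float = 0.0):
        self.interval_seconds = parse_wallclock(interval).total_seconds()
        self.last_save = start_seconds

    def should_save(self, elapsed_seconds: float) -> bool:
        # elapsed_seconds is the total training wallclock time so far.
        if elapsed_seconds - self.last_save >= self.interval_seconds:
            self.last_save = elapsed_seconds
            return True
        return False
```

In a training loop, `should_save` would be called at each batch boundary with the elapsed training time, so saves land on the first boundary after each interval rather than at an exact timestamp.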
Bug Fixes
* Don't close the engine if it's already closed in https://github.com/mosaicml/composer/pull/3143
* Fix HF tests with Pin in https://github.com/mosaicml/composer/pull/3248
* Fix backwards compatibility tests in https://github.com/mosaicml/composer/pull/3252
* Fix unexpected remote checkpointing downloading in https://github.com/mosaicml/composer/pull/3271
* Fix HSDP with ShardDegree < 8 in https://github.com/mosaicml/composer/pull/3313
What's Changed
* Remove CPU offload for DDP/single-gpu by mvpatel2000 in https://github.com/mosaicml/composer/pull/3242
* Adding more checkpoint backwards compatibility tests by snarayan21 in https://github.com/mosaicml/composer/pull/3244
* Don't close the engine if it's already closed by dakinggg in https://github.com/mosaicml/composer/pull/3143
* Replace `evaluator.dataloader.device_eval_batch_size` with `evaluator.device_eval_microbatch_size` by ShashankMosaicML in https://github.com/mosaicml/composer/pull/3247
* Fix HF tests with Pin by mvpatel2000 in https://github.com/mosaicml/composer/pull/3248
* Remove ICL metrics by mvpatel2000 in https://github.com/mosaicml/composer/pull/3243
* Add offset and length arguments for checkpoint validation functions by irenedea in https://github.com/mosaicml/composer/pull/3246
* Fix backwards compatibility tests, raise error for torch version mismatch by snarayan21 in https://github.com/mosaicml/composer/pull/3252
* Bump cryptography from 41.0.5 to 42.0.6 by dependabot in https://github.com/mosaicml/composer/pull/3256
* Bump databricks-sdk from 0.25.1 to 0.27.0 by dependabot in https://github.com/mosaicml/composer/pull/3257
* Improve GCS Object Store by mvpatel2000 in https://github.com/mosaicml/composer/pull/3251
* add retry to gcs.upload_file by bigning in https://github.com/mosaicml/composer/pull/3232
* Add unit test support for full state dict + load_weights_only and save_weights_only by eracah in https://github.com/mosaicml/composer/pull/3260
* will/bump_aws_ofi_nccl by willgleich in https://github.com/mosaicml/composer/pull/3253
* Fix daily GCS tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/3268
* Fix: SAM not working with FSDP/DeepSpeed and LR scheduler. by Joqsan in https://github.com/mosaicml/composer/pull/3259
* Add upload timeout patch to mlflow on azure by dakinggg in https://github.com/mosaicml/composer/pull/3265
* Add option to stagger uploads based on local rank by dakinggg in https://github.com/mosaicml/composer/pull/3275
* explicit close by dakinggg in https://github.com/mosaicml/composer/pull/3276
* Update NCCL_ASYNC_ERROR_HANDLING env variable by priba in https://github.com/mosaicml/composer/pull/3267
* new dist_cp save planner to fix issue that each rank needs to download all checkpoint files by bigning in https://github.com/mosaicml/composer/pull/3271
* Bump to torch 2.2.2 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3283
* Fix UCObjectStore.list_objects by dakinggg in https://github.com/mosaicml/composer/pull/3284
* Update peft version by dakinggg in https://github.com/mosaicml/composer/pull/3287
* replace `load_fsdp_monolith_` with `load_monolith_` by milocress in https://github.com/mosaicml/composer/pull/3288
* Return PyTorch Latest by mvpatel2000 in https://github.com/mosaicml/composer/pull/3290
* Fix daily tests by filtering a warning by mvpatel2000 in https://github.com/mosaicml/composer/pull/3291
* remove orig_params check by milocress in https://github.com/mosaicml/composer/pull/2981
* [ckpt-rewr] Get Model State Dict Util Function by eracah in https://github.com/mosaicml/composer/pull/3250
* Skip compression check with symlink files by mvpatel2000 in https://github.com/mosaicml/composer/pull/3300
* Monkeypatch Device Mesh ND Slicing by mvpatel2000 in https://github.com/mosaicml/composer/pull/3302
* Bump coverage[toml] from 7.4.4 to 7.5.1 by dependabot in https://github.com/mosaicml/composer/pull/3305
* Bump databricks-sdk from 0.27.0 to 0.27.1 by dependabot in https://github.com/mosaicml/composer/pull/3306
* Update transformers requirement from !=4.34.0,<4.41,>=4.11 to >=4.11,!=4.34.0,<4.42 by dependabot in https://github.com/mosaicml/composer/pull/3307
* Allow overwrite on upload retry in remote uploader downloader by irenedea in https://github.com/mosaicml/composer/pull/3310
* Update platform references by aspfohl in https://github.com/mosaicml/composer/pull/3304
* Fix cometml unit tests by j316chuck in https://github.com/mosaicml/composer/pull/3314
* Fix HSDP with ShardDegree < 8 by bigning in https://github.com/mosaicml/composer/pull/3313
* Update docstring for get_model_state_dict by eracah in https://github.com/mosaicml/composer/pull/3318
* Tensor Parallelism Integration by mvpatel2000 in https://github.com/mosaicml/composer/pull/3269
* Bugfixes to FSDP + TP by mvpatel2000 in https://github.com/mosaicml/composer/pull/3323
* Wct save interval by KuuCi in https://github.com/mosaicml/composer/pull/3264
* Wrap ChunkedEncodingError from UCObjectStore by irenedea in https://github.com/mosaicml/composer/pull/3321
* Add checkpoint events to mosaicml logger by b-chu in https://github.com/mosaicml/composer/pull/3316
* Bump timeout to fix daily tests by j316chuck in https://github.com/mosaicml/composer/pull/3325
* Fix FSDP ckpt by filtering User Warning by j316chuck in https://github.com/mosaicml/composer/pull/3327
* Revert TP integration by dakinggg in https://github.com/mosaicml/composer/pull/3328
* Bump databricks-sdk from 0.27.1 to 0.28.0 by dependabot in https://github.com/mosaicml/composer/pull/3331
* Bump sphinxcontrib-katex from 0.9.6 to 0.9.10 by dependabot in https://github.com/mosaicml/composer/pull/3333
* Update peft requirement from <0.11,>=0.10.0 to >=0.10.0,<0.12 by dependabot in https://github.com/mosaicml/composer/pull/3332
* Bump coverage[toml] from 7.5.1 to 7.5.2 by dependabot in https://github.com/mosaicml/composer/pull/3330
* Update protobuf requirement from <5.27 to <5.28 by dependabot in https://github.com/mosaicml/composer/pull/3329
* Improving memory snapshot by cli99 in https://github.com/mosaicml/composer/pull/3315
* Add A10 to speed monitor by mvpatel2000 in https://github.com/mosaicml/composer/pull/3336
* change ComposerModel output type by hyenal in https://github.com/mosaicml/composer/pull/3341
* Remove evaluator state by snarayan21 in https://github.com/mosaicml/composer/pull/3339
* [ckpt-rewr] Generate Metadata State Dict API by eracah in https://github.com/mosaicml/composer/pull/3311
* Tensor Parallelism v2 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3335
* Migrate Type Hints for PEP 585 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3344
* [checkpoint v2] add remote uploader class by bigning in https://github.com/mosaicml/composer/pull/3303
* Raise errors on all ranks for checkpoint download failures by irenedea in https://github.com/mosaicml/composer/pull/3345
* Add return type annotation when __init__ doesn't take any argument by antoinebrl in https://github.com/mosaicml/composer/pull/3347
* [ckpt-rewr] Get Optim State Dict Util API by eracah in https://github.com/mosaicml/composer/pull/3299
* Fix type check issue with device train microbatch size by mvpatel2000 in https://github.com/mosaicml/composer/pull/3349
* Add torch distributed checkpointing monkeypatches to enable TE checkpointing for extra_state attribute by j316chuck in https://github.com/mosaicml/composer/pull/3298
* Bump coverage[toml] from 7.5.2 to 7.5.3 by dependabot in https://github.com/mosaicml/composer/pull/3353
* Update wandb requirement from <0.17,>=0.13.2 to >=0.13.2,<0.18 by dependabot in https://github.com/mosaicml/composer/pull/3352
* Optional `CheckpointSaver` instantiation inside the `Trainer` by antoinebrl in https://github.com/mosaicml/composer/pull/3334
* MLFlow better experiment defaults by mvpatel2000 in https://github.com/mosaicml/composer/pull/3356
* Rename metadata keys by mvpatel2000 in https://github.com/mosaicml/composer/pull/3354
* Dataclasses for ParallelismConfig by mvpatel2000 in https://github.com/mosaicml/composer/pull/3346
* Upgrade Mofed with apt by willgleich in https://github.com/mosaicml/composer/pull/3340
* Multi gpu ci test by KuuCi in https://github.com/mosaicml/composer/pull/3312
* Autoresume Validation with Max Duration by mvpatel2000 in https://github.com/mosaicml/composer/pull/3358
* Deprecate and bump version to 0.23.0 by bigning in https://github.com/mosaicml/composer/pull/3359
New Contributors
* Joqsan made their first contribution in https://github.com/mosaicml/composer/pull/3259
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.22.0...v0.23.0