**Note:** The `mosaicml==0.13.0` PyPI package was yanked because of minor packaging issues discovered after release. The package was re-released as Composer v0.13.1, so these release notes cover both v0.13.0 and v0.13.1.
New Features
1. **🤙 New and Updated Callbacks**
* *New `HealthChecker` Callback (2002)*
The callback will log a warning if the GPUs on a given node appear to be in poor health (low utilization). The callback can also be configured to send a Slack message!
```python
from composer import Trainer
from composer.callbacks import HealthChecker

# Warn if GPU utilization difference drops below 10%
health_checker = HealthChecker(
    threshold=10,
)

# Construct Trainer
trainer = Trainer(
    ...,
    callbacks=health_checker,
)

# Train!
trainer.fit()
```
* *Updated `MemoryMonitor` to use GigaBytes (GB) units (1940)*
* *New `RuntimeEstimator` Callback (1991)*
Estimate the remaining runtime of your job! The callback approximates the time remaining by observing the throughput and comparing it to the number of batches remaining.
```python
from composer import Trainer
from composer.callbacks import RuntimeEstimator

# Construct trainer with RuntimeEstimator callback
trainer = Trainer(
    ...,
    callbacks=RuntimeEstimator(),
)

# Train!
trainer.fit()
```
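The estimate described above comes down to simple extrapolation. The following is a minimal standalone sketch of that arithmetic, not the callback's actual implementation; the function name `estimate_remaining_seconds` and its parameters are hypothetical and are not part of the Composer API.

```python
def estimate_remaining_seconds(elapsed_seconds: float, batches_done: int, total_batches: int) -> float:
    """Extrapolate remaining wall-clock time from the average throughput so far.

    Hypothetical helper for illustration only; it is not a Composer function.
    """
    if batches_done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need at least one completed batch and a positive elapsed time")
    # Average throughput observed so far, in batches per second.
    batches_per_second = batches_done / elapsed_seconds
    # Work left to do; clamp at zero in case training ran past the target.
    batches_remaining = max(total_batches - batches_done, 0)
    return batches_remaining / batches_per_second


# Example: 250 of 1000 batches completed in 500 seconds.
remaining = estimate_remaining_seconds(elapsed_seconds=500.0, batches_done=250, total_batches=1000)
print(f"estimated time remaining: {remaining / 3600:.2f} h")
```

A real estimator would likely also account for time spent in evaluation and for throughput that changes over the run; this sketch assumes constant throughput.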
* *Updated `SpeedMonitor` throughput metrics (1987)*
Expands the throughput metrics to cover several units of work, both in aggregate and per device:
* `throughput/batches_per_sec` and `throughput/device/batches_per_sec`
* `throughput/tokens_per_sec` and `throughput/device/tokens_per_sec`
* `throughput/flops_per_sec` and `throughput/device/flops_per_sec`
* `throughput/device/samples_per_sec`
Also adds a `throughput/device/mfu` metric to compute per-device MFU. Simply enable the `SpeedMonitor` callback as usual to log these new metrics! Please see the [SpeedMonitor](https://docs.mosaicml.com/en/latest/api_reference/generated/composer.callbacks.SpeedMonitor.html#composer.callbacks.SpeedMonitor) documentation for more information.
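To show how the aggregate, per-device, and MFU metrics relate to one another, here is a rough standalone sketch. It is not Composer's implementation: the `ThroughputWindow` container, the `throughput_metrics` function, and the example numbers are all assumptions made for illustration. Only the metric key names are taken from the list above. MFU is assumed here to be the achieved per-device FLOPs per second divided by the device's peak FLOPs per second.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ThroughputWindow:
    """Totals observed across all devices in one measurement window (hypothetical container)."""
    batches: int
    samples: int
    tokens: int
    flops: float
    elapsed_seconds: float


def throughput_metrics(window: ThroughputWindow, world_size: int,
                       peak_flops_per_device: float) -> Dict[str, float]:
    """Derive aggregate and per-device rates, plus per-device MFU (illustrative only)."""
    totals = {
        "batches": window.batches,
        "samples": window.samples,
        "tokens": window.tokens,
        "flops": window.flops,
    }
    metrics: Dict[str, float] = {}
    for unit, total in totals.items():
        per_sec = total / window.elapsed_seconds
        metrics[f"throughput/{unit}_per_sec"] = per_sec
        # Per-device rate: the aggregate rate split evenly across devices.
        metrics[f"throughput/device/{unit}_per_sec"] = per_sec / world_size
    # MFU: fraction of the device's peak FLOPs/sec actually achieved.
    metrics["throughput/device/mfu"] = (
        metrics["throughput/device/flops_per_sec"] / peak_flops_per_device
    )
    return metrics


# Example with made-up numbers: 8 devices, each with an assumed peak of 312 TFLOPs/sec.
window = ThroughputWindow(batches=20, samples=640, tokens=1_310_720,
                          flops=1.248e16, elapsed_seconds=10.0)
metrics = throughput_metrics(window, world_size=8, peak_flops_per_device=312e12)
for name, value in sorted(metrics.items()):
    print(f"{name}: {value:.4g}")
```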
1. **⣿ FSDP Sharded Checkpoints (1902)**
Users can now specify the `state_dict_type` in the `fsdp_config` dictionary to enable sharded checkpoints. For example:
```python
from composer import Trainer

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'local',
}

trainer = Trainer(
    ...,
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_filename='ba{batch}_rank{rank}.pt',
    save_interval='10ba',
)
```
Please see the [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.state_dict) docs and Composer's [Distributed Training notes](https://docs.mosaicml.com/en/latest/notes/distributed_training.html#saving-and-loading-sharded-checkpoints-with-fsdp) for more information.
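With sharded checkpoints, each rank writes its own file, which is why the `save_filename` in the example includes `{rank}`. The sketch below illustrates the idea using plain `str.format`; Composer performs its own filename formatting internally, so treat this as an approximation. The helper name `shard_filenames` is hypothetical.

```python
from typing import List

# Same template as the `save_filename` argument in the example above.
SAVE_FILENAME = "ba{batch}_rank{rank}.pt"


def shard_filenames(batch: int, world_size: int) -> List[str]:
    """Return one checkpoint filename per rank for a given batch (illustrative helper)."""
    return [SAVE_FILENAME.format(batch=batch, rank=rank) for rank in range(world_size)]


names = shard_filenames(batch=10, world_size=4)
# Every rank needs a distinct filename, otherwise shards would overwrite each other.
assert len(set(names)) == len(names)
print(names)
```

If the template omitted `{rank}`, every rank would resolve to the same filename in this sketch, which is the collision the rank placeholder avoids.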
1. **🤗 HuggingFace Improvements**
* Update `HuggingFaceModel` class to support encoder-decoder batches without `decoder_input_ids` (1950)
* Allow evaluation metrics to be passed to `HuggingFaceModel` directly (1971)
* Add a utility function to load a Composer checkpoint of a `HuggingFaceModel` and write out the expected `config.json` and `pytorch_model.bin` in the HuggingFace pretrained folder (1974)
1. **🛟 Nvidia H100 Alpha Support - Added `amp_fp8` data type**
In preparation for H100's arrival, we've added the `amp_fp8` precision type. Currently, setting `amp_fp8` specifies a new precision context using `transformer_engine.pytorch.fp8_autocast`. For more details, please see Nvidia's new [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html) and the specific [fp8 recipe](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#id1) we utilize.
```python
from composer import Trainer

trainer = Trainer(
    ...,
    precision='amp_fp8',
)
```
API changes
* The `torchmetrics` package has been upgraded to 0.11.x.
The `torchmetrics.Accuracy` metric now requires a `task` argument which can take on a value of `binary`, `multiclass` or `multilabel`. Please see [Torchmetrics Accuracy](https://torchmetrics.readthedocs.io/en/v0.11.3/classification/accuracy.html) docs for details.
Additionally, since specifying `task='multiclass'` requires `num_classes` to be specified as well, we've updated `ComposerClassifier` to accept a `num_classes` argument. Please see PRs 2017 and 2025 for additional details.
* Surgery algorithms used in functional form return a value of `None` (1543)
Deprecations
* Deprecate HFCrossEntropy and Perplexity (1857)
* Remove Jenkins CI (1943, 1954)
* Change Deprecation Warnings to Warnings for specifying `ProgressBarLogger` and `ConsoleLogger` to loggers (1846)
Bug Fixes
* Fixed an issue introduced in 0.12.1 where `HuggingFaceModel` crashes if `config.return_dict = False` (1948)
* Refactor EMA to improve memory efficiency (1941)
* Make wandb checkpoint logging compatible with wandb model registry (1973)
* Fix ICL race conditions (1978)
* Update `epoch` metric name to `trainer/epoch` (1986)
* Reset scaler state (1999)
* Sync optimization logger across ranks (1970)
* Update Docker images to resolve vulnerability scan issues (2007)
* Fix eval duplicate logging issue (2018)
* Extend test and patch bug (2028)
* Protect for missing slack_sdk import (2031)
Known Issues
* Docker Image Security Vulnerability
* [CVE-2022-45907](https://github.com/advisories/GHSA-47fc-vmwq-366v): The `mosaicml/pytorch:1.12.1*`, `mosaicml/pytorch:1.11.0*`, `mosaicml/pytorch_vision:1.12.1*` and `mosaicml/pytorch_vision:1.11.0*` images are impacted and currently supported for legacy use cases. **We recommend users upgrade to images with PyTorch >1.13. The affected images will be removed in the next Composer release.**
What's Changed
* Raise error if max duration is in epochs and dataloader is infinite by dakinggg in https://github.com/mosaicml/composer/pull/1942
* Bump traitlets from 5.8.0 to 5.9.0 by dependabot in https://github.com/mosaicml/composer/pull/1946
* Deprecate HFCrossEntropy and Perplexity by dakinggg in https://github.com/mosaicml/composer/pull/1857
* Change functional surgery method return values to None by nik-mosaic in https://github.com/mosaicml/composer/pull/1543
* Retire Jenkins by bandish-shah in https://github.com/mosaicml/composer/pull/1943
* Update MCP GHA Name by mvpatel2000 in https://github.com/mosaicml/composer/pull/1951
* update memory monitor by mvpatel2000 in https://github.com/mosaicml/composer/pull/1940
* Move ffcv up in test order by dskhudia in https://github.com/mosaicml/composer/pull/1953
* Fix memory monitor test by mvpatel2000 in https://github.com/mosaicml/composer/pull/1957
* Fix model surgery failure due to functional API change by nik-mosaic in https://github.com/mosaicml/composer/pull/1949
* Change how we check for forwards args in models for HF models by bcui19 in https://github.com/mosaicml/composer/pull/1955
* add return dict false test and bug fix by dakinggg in https://github.com/mosaicml/composer/pull/1948
* remove jenkins ci by mvpatel2000 in https://github.com/mosaicml/composer/pull/1954
* add support for enc-dec batches without decoder_input_ids by dakinggg in https://github.com/mosaicml/composer/pull/1950
* Refactor EMA to improve memory efficiency by coryMosaicML in https://github.com/mosaicml/composer/pull/1941
* Add warning for untrusted checkpoints by mvpatel2000 in https://github.com/mosaicml/composer/pull/1959
* permit opt tokenizer by bmosaicml in https://github.com/mosaicml/composer/pull/1958
* GHA Docker build flow for PR's by bandish-shah in https://github.com/mosaicml/composer/pull/1883
* Update download badge link to pepy by karan6181 in https://github.com/mosaicml/composer/pull/1966
* Update python version in setup.py and fixed pypi download badge by karan6181 in https://github.com/mosaicml/composer/pull/1969
* allow eval metrics to be passed in to HuggingFaceModel directly by dakinggg in https://github.com/mosaicml/composer/pull/1971
* Make wandb checkpoint logging compatible with wandb model registry by growlix in https://github.com/mosaicml/composer/pull/1973
* Add support for FP8 on H100 using NVidia's TransformerEngine by dskhudia in https://github.com/mosaicml/composer/pull/1965
* Util for writing HuggingFace save_pretrained from a composer checkpoint by dakinggg in https://github.com/mosaicml/composer/pull/1974
* Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by eracah in https://github.com/mosaicml/composer/pull/1902
* Bump custom-inherit from 2.4.0 to 2.4.1 by dependabot in https://github.com/mosaicml/composer/pull/1981
* Bump gitpython from 3.1.30 to 3.1.31 by dependabot in https://github.com/mosaicml/composer/pull/1982
* Fix ICL race conditions by dakinggg in https://github.com/mosaicml/composer/pull/1978
* add map location to huggingface utils by dakinggg in https://github.com/mosaicml/composer/pull/1980
* fix log epoch by mvpatel2000 in https://github.com/mosaicml/composer/pull/1986
* GHA release workflow, refactor PR and Daily workflows by bandish-shah in https://github.com/mosaicml/composer/pull/1968
* Remove python-version input from Daily CPU tests by bandish-shah in https://github.com/mosaicml/composer/pull/1989
* Add some logic to pass the correct github ref to mcp script by bandish-shah in https://github.com/mosaicml/composer/pull/1990
* Fix typo in docstring for eval with missing space by mvpatel2000 in https://github.com/mosaicml/composer/pull/1992
* Fix failing sharded_checkpoint tests that fail when pytorch 1.13 is not installed by eracah in https://github.com/mosaicml/composer/pull/1988
* Add merge_group event trigger to GHA daily workflow by bandish-shah in https://github.com/mosaicml/composer/pull/1996
* Runtime estimator by mvpatel2000 in https://github.com/mosaicml/composer/pull/1991
* Reset scaler state by mvpatel2000 in https://github.com/mosaicml/composer/pull/1999
* Speed monitor refactor by mvpatel2000 in https://github.com/mosaicml/composer/pull/1987
* Test hf fsdp by dakinggg in https://github.com/mosaicml/composer/pull/1972
* Bug/sync optimization logger across ranks by bmosaicml in https://github.com/mosaicml/composer/pull/1970
* Fix optimizer monitor test gating with FSDP by mvpatel2000 in https://github.com/mosaicml/composer/pull/2000
* Low precision groupnorm by mvpatel2000 in https://github.com/mosaicml/composer/pull/1976
* Bump coverage[toml] from 7.1.0 to 7.2.1 by dependabot in https://github.com/mosaicml/composer/pull/2008
* Update docs to include runtime estimator by mvpatel2000 in https://github.com/mosaicml/composer/pull/2009
* Tag surgery algorithms LPLN and LPGN by mvpatel2000 in https://github.com/mosaicml/composer/pull/2011
* Update SpeedMonitor short-description for docs table by mvpatel2000 in https://github.com/mosaicml/composer/pull/2010
* Update Low Precision LayerNorm arguments by nik-mosaic in https://github.com/mosaicml/composer/pull/1994
* Medical Segmentation Example Typo by mvpatel2000 in https://github.com/mosaicml/composer/pull/2014
* Update wallclock logging to default hours by mvpatel2000 in https://github.com/mosaicml/composer/pull/2005
* Add HealthChecker Callback by hanlint in https://github.com/mosaicml/composer/pull/2002
* Allow FX graph mode post-training dynamic quantisation of BlurConv2d operations. by BrettRyland in https://github.com/mosaicml/composer/pull/1995
* Add multi-gpu testing to test_algorithm_resumption by eracah in https://github.com/mosaicml/composer/pull/2016
* Add backwards compatible checkpoint loading for EMA by coryMosaicML in https://github.com/mosaicml/composer/pull/2012
* fsdp with custom process groups by vchiley in https://github.com/mosaicml/composer/pull/2006
* Patch Speed Monitor MFU by mvpatel2000 in https://github.com/mosaicml/composer/pull/2013
* Remove runtime estimator state dict by mvpatel2000 in https://github.com/mosaicml/composer/pull/2015
* Update Docker images to fix resolve vulnerability scan issues by bandish-shah in https://github.com/mosaicml/composer/pull/2007
* Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers by eracah in https://github.com/mosaicml/composer/pull/1846
* Fix eval duplicate logging issue by mvpatel2000 in https://github.com/mosaicml/composer/pull/2018
* Add workflow_dispatch trigger to pr-docker workflow by bandish-shah in https://github.com/mosaicml/composer/pull/2019
* Bump streaming version to less than 0.4.0 by karan6181 in https://github.com/mosaicml/composer/pull/2020
* Upgrade ipython installed in Docker images by bandish-shah in https://github.com/mosaicml/composer/pull/2021
* Upgrade torchmetrics by nik-mosaic in https://github.com/mosaicml/composer/pull/2017
* Complete upgrade of torchmetrics accuracy by nik-mosaic in https://github.com/mosaicml/composer/pull/2025
* Bump version to v0.13.0 by bandish-shah in https://github.com/mosaicml/composer/pull/2024
New Contributors
* BrettRyland made their first contribution in https://github.com/mosaicml/composer/pull/1995
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.12.1...v0.13.1