What's New
**1. Aggregate Memory Monitoring (3042)**
The `MemoryMonitor` callback can now aggregate memory statistics across nodes and log the combined values at a user-specified frequency. Summary statistics for a run's memory usage across the cluster make it much easier to debug straggler nodes or non-homogeneous workloads.
Example:

    from composer import Trainer
    from composer.callbacks import MemoryMonitor

    trainer = Trainer(
        model=model,
        train_dataloader=train_dataloader,
        optimizers=optimizer,
        max_duration="1ep",
        callbacks=[
            MemoryMonitor(
                dist_aggregate_batch_interval=10,  # aggregate every 10 batches
            )
        ],
    )
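
To make the idea of aggregation concrete, here is a minimal, library-independent sketch of combining per-rank memory readings into summary values. It is not Composer's implementation: the function name `aggregate_memory_stats`, the `peak_allocated_gb` key, and the sample numbers are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def aggregate_memory_stats(per_rank_stats):
    """Combine per-rank memory stats into min / max / avg summaries.

    Hypothetical helper for illustration; not part of Composer.
    """
    if not per_rank_stats:
        return {}
    summary = {}
    for key in per_rank_stats[0]:
        values = [stats[key] for stats in per_rank_stats]
        summary[f"{key}_min"] = min(values)
        summary[f"{key}_max"] = max(values)
        summary[f"{key}_avg"] = mean(values)
    return summary


# Hypothetical peak allocated memory (GB) reported by four ranks.
per_rank = [
    {"peak_allocated_gb": 10.0},
    {"peak_allocated_gb": 12.0},
    {"peak_allocated_gb": 11.0},
    {"peak_allocated_gb": 19.0},  # one rank using far more memory
]

summary = aggregate_memory_stats(per_rank)
print(summary)
```

A large gap between the minimum and maximum values is the kind of signal that can point to a straggler or an unevenly distributed workload.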
**2. Advanced Compression Options (3118)**
Large model checkpoints can be expensive to store and transfer. In this release, we've upgraded our compression support to accept several new formats, implemented using CLI tools, that offer better tradeoffs between compression ratio and compression time. To use compression, append a compression extension to your checkpoint filename. We now support the following extensions:
- bz2
- gz
- lz4
- lzma
- lzo
- xz
- zst
Example:

    from composer import Trainer

    trainer = Trainer(
        model=model,
        train_dataloader=train_dataloader,
        optimizers=optimizer,
        max_duration="1ep",
        save_filename='ep{epoch}-ba{batch}-rank{rank}.pt.lz4',
    )
Thank you to mbway for adding this support!
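
To get a rough feel for the tradeoff between compression ratio and time, the sketch below compares three of the listed formats using only Python's standard library. It does not go through Composer or the CLI tools mentioned above, and it omits `lz4`, `lzo`, and `zst`, which have no standard-library codec. The function name `compare_codecs` and the synthetic payload are assumptions made for illustration; real checkpoints will compress very differently.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(payload: bytes):
    """Measure compression ratio and wall-clock time for a few stdlib codecs."""
    codecs = {
        "gz": gzip.compress,
        "bz2": bz2.compress,
        "xz": lzma.compress,  # lzma.compress writes the .xz container by default
    }
    results = {}
    for ext, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(payload)
        elapsed = time.perf_counter() - start
        results[ext] = {
            "ratio": len(payload) / len(compressed),
            "seconds": elapsed,
        }
    return results


# Synthetic, highly repetitive payload standing in for checkpoint bytes.
payload = b"checkpoint-weights-" * 50_000

results = compare_codecs(payload)
for ext, stats in results.items():
    print(f"{ext}: ratio={stats['ratio']:.1f}, seconds={stats['seconds']:.4f}")
```

Running a comparison like this on a sample of your own checkpoint data is one way to decide which extension to use in `save_filename`.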
What's Changed
* Rename composer_run_name tag to run_name when logging to MLflow by jerrychen109 in https://github.com/mosaicml/composer/pull/3040
* enable aggregate mem monitoring by vchiley in https://github.com/mosaicml/composer/pull/3042
* Bump junitparser from 3.1.1 to 3.1.2 by dependabot in https://github.com/mosaicml/composer/pull/3056
* Add SHARD_GRAD_OP to device mesh error check by mvpatel2000 in https://github.com/mosaicml/composer/pull/3058
* Add torch 2.2.1 support by mvpatel2000 in https://github.com/mosaicml/composer/pull/3059
* Use testing repo actions for linting by b-chu in https://github.com/mosaicml/composer/pull/3060
* Link autoresume docs back to watchdog by aspfohl in https://github.com/mosaicml/composer/pull/3052
* Deprecate get_state and remove deprecations by b-chu in https://github.com/mosaicml/composer/pull/3017
* Bump version to 0.20.1 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3061
* Remove s3_bucket pytest cli flag by b-chu in https://github.com/mosaicml/composer/pull/3064
* Remove s3_bucket flag from gpu test by b-chu in https://github.com/mosaicml/composer/pull/3065
* Clean Up OOM Observer Remote Uploader Download path by j316chuck in https://github.com/mosaicml/composer/pull/3070
* Fix daily test for iteration by b-chu in https://github.com/mosaicml/composer/pull/3068
* Remove "generation_length" in favor of "generation_kwargs" by maxisawesome in https://github.com/mosaicml/composer/pull/3014
* Bump packaging by mvpatel2000 in https://github.com/mosaicml/composer/pull/3072
* Use ci-testing repo for CPU and GPU tests by b-chu in https://github.com/mosaicml/composer/pull/3062
* Add new torch monkeypatches to Composer by mvpatel2000 in https://github.com/mosaicml/composer/pull/3063
* Add initial support for neuron devices by bfontain in https://github.com/mosaicml/composer/pull/3049
* Stripping whitespaces as default for QATask ICL eval by ksreenivasan in https://github.com/mosaicml/composer/pull/3073
* Add ICL base class to __all__ by mvpatel2000 in https://github.com/mosaicml/composer/pull/3079
* pass prelimiter into ALL ICL datasets by eitanturok in https://github.com/mosaicml/composer/pull/3069
* Bump sentencepiece from 0.1.99 to 0.2.0 by dependabot in https://github.com/mosaicml/composer/pull/3083
* Add Iteration related Events to callbacks by b-chu in https://github.com/mosaicml/composer/pull/3077
* Add Iteration related Events by b-chu in https://github.com/mosaicml/composer/pull/3076
* Bump CI/CD to v3 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3086
* Add docstring to _iteration_length by b-chu in https://github.com/mosaicml/composer/pull/3088
* Check FSDP module has _device_mesh before getting it by eracah in https://github.com/mosaicml/composer/pull/3091
* Bump minor version in base image by mvpatel2000 in https://github.com/mosaicml/composer/pull/3092
* Enforce async logging flush in mlflow logger at `post_close` call by chenmoneygithub in https://github.com/mosaicml/composer/pull/3093
* Warning log to info log by aspfohl in https://github.com/mosaicml/composer/pull/3096
* Bump transformers by dakinggg in https://github.com/mosaicml/composer/pull/3095
* Change style for splitting on commas by b-chu in https://github.com/mosaicml/composer/pull/3078
* Remove slash by b-chu in https://github.com/mosaicml/composer/pull/3098
* Allowing for fractional number of samples per rank by ShashankMosaicML in https://github.com/mosaicml/composer/pull/3075
* Output eval logging (batch level) by maxisawesome in https://github.com/mosaicml/composer/pull/2977
* Replace errors with warnings for eval args by mvpatel2000 in https://github.com/mosaicml/composer/pull/3100
* Ability to load sharded checkpoints with remote symlink load_path by eracah in https://github.com/mosaicml/composer/pull/3097
* Improvements to `NeptuneLogger` by AleksanderWWW in https://github.com/mosaicml/composer/pull/3085
* Revert "Improvements to `NeptuneLogger`" by mvpatel2000 in https://github.com/mosaicml/composer/pull/3111
* Bump mlflow min pin by dakinggg in https://github.com/mosaicml/composer/pull/3110
* Fix rounding issue in interval calculation by dakinggg in https://github.com/mosaicml/composer/pull/3109
* Bump coverage[toml] from 7.4.1 to 7.4.3 by dependabot in https://github.com/mosaicml/composer/pull/3102
* Uses v0.0.4 of ci-testing by b-chu in https://github.com/mosaicml/composer/pull/3112
* Add versioned deprecation warning by irenedea in https://github.com/mosaicml/composer/pull/2984
* Update Flash Attention to 2.5.5 by Skylion007 in https://github.com/mosaicml/composer/pull/3113
* Setting the max duration to current timestamp in the same units as cu… by ShashankMosaicML in https://github.com/mosaicml/composer/pull/3090
* Making default_split_batch public by ShashankMosaicML in https://github.com/mosaicml/composer/pull/3116
* Adding log exception to Mosaic Logger by jjanezhang in https://github.com/mosaicml/composer/pull/3089
* Add checks to schedulers by b-chu in https://github.com/mosaicml/composer/pull/3115
* Removed default attrs from exception class in the attrs dict by jjanezhang in https://github.com/mosaicml/composer/pull/3126
* Bump coverage[toml] from 7.4.3 to 7.4.4 by dependabot in https://github.com/mosaicml/composer/pull/3121
* Refactor initialization by Practicinginhell in https://github.com/mosaicml/composer/pull/3127
* Bump databricks sdk version by dakinggg in https://github.com/mosaicml/composer/pull/3128
* Update packaging requirement from <23.3,>=21.3.0 to >=21.3.0,<24.1 by dependabot in https://github.com/mosaicml/composer/pull/3122
* Remove rng from save_weights_only ckpt by eracah in https://github.com/mosaicml/composer/pull/3129
* More compression options by mbway in https://github.com/mosaicml/composer/pull/3118
* Only broadcast distcp files by mvpatel2000 in https://github.com/mosaicml/composer/pull/3130
* Bump version to 0.21 by mvpatel2000 in https://github.com/mosaicml/composer/pull/3132
New Contributors
* ksreenivasan made their first contribution in https://github.com/mosaicml/composer/pull/3073
* eitanturok made their first contribution in https://github.com/mosaicml/composer/pull/3069
* Practicinginhell made their first contribution in https://github.com/mosaicml/composer/pull/3127
* mbway made their first contribution in https://github.com/mosaicml/composer/pull/3118
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.20.1...v0.21.0