What's New
1. New Neptune Logger
Composer now supports logging training data to [neptune.ai](https://neptune.ai/) using the `NeptuneLogger`. To get started:
```python
from composer.loggers import NeptuneLogger

neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)
```
We also have an [example project](https://app.neptune.ai/o/showcase/org/mosaicml-composer/runs/details?viewId=standard-view&detailsTab=dashboard&dashboardId=9b1f1fae-f543-41d1-a778-8604c9b6503d&shortId=MMLCOMP-3) demonstrating all the awesome things you can do with this integration!
![image](https://github.com/mosaicml/composer/assets/17102158/d887b674-7163-4c90-b380-282eb543aa7f)
Additional information on the `NeptuneLogger` can be found in the [docs](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.loggers.NeptuneLogger.html).
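Hard-coding an API token, as in the snippet above, is fine for a quick test, but credentials are usually better read from the environment. The sketch below shows one possible way to assemble the same constructor arguments from environment variables. The helper `neptune_logger_kwargs` is hypothetical and not part of Composer, and the variable names `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` are assumed to follow the Neptune client's convention; adjust them to your setup.

```python
import os
from typing import Mapping, Optional


def neptune_logger_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for NeptuneLogger from environment variables.

    Hypothetical helper for illustration; not part of Composer.
    """
    env = os.environ if env is None else env
    project = env.get('NEPTUNE_PROJECT')
    api_token = env.get('NEPTUNE_API_TOKEN')
    if not project or not api_token:
        raise ValueError('Set NEPTUNE_PROJECT and NEPTUNE_API_TOKEN first.')
    # The remaining values mirror the example in these release notes.
    return {
        'project': project,
        'api_token': api_token,
        'rank_zero_only': False,
        'mode': 'debug',
        'upload_artifacts': True,
    }


# Example with an explicit mapping instead of the real environment.
kwargs = neptune_logger_kwargs({
    'NEPTUNE_PROJECT': 'test_project',
    'NEPTUNE_API_TOKEN': 'test_token',
})
print(kwargs['project'])

# With Composer and the Neptune client installed, the arguments can be passed on:
#   from composer.loggers import NeptuneLogger
#   neptune_logger = NeptuneLogger(**kwargs)
```

Passing a plain mapping keeps the helper easy to test without touching the real environment.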
2. OOM observer callback with memory visualizations
Composer now has an OOM observer callback. When a model runs out of memory, this callback produces a trace that identifies memory allocations, which can be critical for designing strategies to reduce memory usage.
Example:
```python
from composer import Trainer
from composer.callbacks import OOMObserver

# Construct the Trainer with the OOMObserver callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)
```
OOM Visualization:
![Screenshot 2024-02-23 at 9.30.03 AM](https://hackmd.io/_uploads/BkDQULUnp.png)
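The `filename` and `remote_filename` arguments in the example above are templates whose `{rank}` and `{run_name}` placeholders are filled in per process. The sketch below imitates that expansion with plain `str.format` to show roughly which base names each rank would use. It is an illustration only, not Composer's own formatting code (which may support other placeholders or append file suffixes); the run name `my-run` and the helper `expand_trace_names` are made up for the example.

```python
from typing import List, Tuple


def expand_trace_names(
    filename: str,
    remote_filename: str,
    run_name: str,
    world_size: int,
) -> List[Tuple[str, str]]:
    """Return (local, remote) base names for every rank.

    Hypothetical helper; it only mimics placeholder substitution.
    """
    names = []
    for rank in range(world_size):
        # str.format ignores keyword arguments the template does not use.
        local = filename.format(rank=rank, run_name=run_name)
        remote = remote_filename.format(rank=rank, run_name=run_name)
        names.append((local, remote))
    return names


names = expand_trace_names(
    filename='rank{rank}_oom',
    remote_filename='oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom',
    run_name='my-run',  # assumed run name, for illustration only
    world_size=2,
)
for local, remote in names:
    print(local, '->', remote)
```

Including `{rank}` in both templates keeps the traces from different processes from overwriting one another.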
3. Log all gpu rank stdout/err to MosaicML platform
Composer has expanded its integration with the MosaicML platform. We can now view the stdout and stderr of every GPU rank with MCLI logs, enabling more comprehensive analysis of jobs.
Example:
```bash
mcli logs <run-name> --node x --gpu x
```
Note, this defaults to node rank 0 if `--node` is not provided.
Also, we can find the logs of any global gpu rank with the command:
```bash
mcli logs <run-name> --global-gpu-rank x
```
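When many ranks need inspecting, it can help to generate these commands programmatically. The sketch below builds argument lists for the two forms shown above, using only the flags that appear in this section. The helper names are hypothetical, and the mapping from a global rank to a node and local GPU assumes that every node has the same number of GPUs and that ranks are numbered node by node, which may not hold on every cluster.

```python
import shlex
from typing import List, Optional, Tuple


def logs_by_node(run_name: str, gpu: int, node: Optional[int] = None) -> List[str]:
    """Build `mcli logs` arguments for a GPU on a given node (hypothetical helper)."""
    cmd = ['mcli', 'logs', run_name]
    if node is not None:
        # When --node is omitted, the notes say node rank 0 is used.
        cmd += ['--node', str(node)]
    cmd += ['--gpu', str(gpu)]
    return cmd


def logs_by_global_rank(run_name: str, global_rank: int) -> List[str]:
    """Build `mcli logs` arguments for a global GPU rank (hypothetical helper)."""
    return ['mcli', 'logs', run_name, '--global-gpu-rank', str(global_rank)]


def split_global_rank(global_rank: int, gpus_per_node: int) -> Tuple[int, int]:
    """Assumed layout: global_rank = node * gpus_per_node + local_gpu."""
    return divmod(global_rank, gpus_per_node)


node, gpu = split_global_rank(global_rank=11, gpus_per_node=8)
print(shlex.join(logs_by_node('my-run', gpu=gpu, node=node)))
print(shlex.join(logs_by_global_rank('my-run', 11)))
```

Building the command as a list rather than a string means it can be handed directly to `subprocess.run` without shell quoting concerns.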
Bug Fixes
* Only save RNG on rank 0 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2998
* [Auto-microbatch fix] FSDP reshard and cleanup after OOM to fix the cuda memory leak by bigning in https://github.com/mosaicml/composer/pull/3030
* Fix skip_first for profiler during resumption by bigning in https://github.com/mosaicml/composer/pull/2986
* Race condition fix in checkpoint loading util by jessechancy in https://github.com/mosaicml/composer/pull/3001
What's Changed
* Remove .ci folder and move FILE_HEADER and CODEOWNERS by irenedea in https://github.com/mosaicml/composer/pull/2957
* Modify UCObjectStore.list_objects to list all files recursively by irenedea in https://github.com/mosaicml/composer/pull/2959
* Refactor MemorySnapshot by cli99 in https://github.com/mosaicml/composer/pull/2960
* Log all gpu rank stdout/err to MosaicML platform by jjanezhang in https://github.com/mosaicml/composer/pull/2839
* Add Torch 2.2 tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/2970
* Memory snapshot dump pickle by cli99 in https://github.com/mosaicml/composer/pull/2968
* Neptune logger by AleksanderWWW in https://github.com/mosaicml/composer/pull/2447
* Fix torch pins in tests by mvpatel2000 in https://github.com/mosaicml/composer/pull/2973
* Add a register_model_with_run_id api to MLflowLogger by dakinggg in https://github.com/mosaicml/composer/pull/2967
* Remove bespoke codeowners by mvpatel2000 in https://github.com/mosaicml/composer/pull/2971
* Add a BEFORE_LOAD event by snarayan21 in https://github.com/mosaicml/composer/pull/2974
* More torch 2.2 fixes by mvpatel2000 in https://github.com/mosaicml/composer/pull/2975
* Adding the step argument to logger.log_table by ShashankMosaicML in https://github.com/mosaicml/composer/pull/2961
* Fix daily tests for torch 2.2 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2980
* Format load_path with name by mvpatel2000 in https://github.com/mosaicml/composer/pull/2978
* Bump to 0.19.1 by mvpatel2000 in https://github.com/mosaicml/composer/pull/2979
* Fix UC object store bugfix by nancyhung in https://github.com/mosaicml/composer/pull/2982
* [Bugfix][UC] Add back the full object path by nancyhung in https://github.com/mosaicml/composer/pull/2988
* Minor cleanup of UC get_object_size by dakinggg in https://github.com/mosaicml/composer/pull/2989
* Pin UC to earlier version by dakinggg in https://github.com/mosaicml/composer/pull/2990
* Revert "fix skip_first for resumption" by bigning in https://github.com/mosaicml/composer/pull/2991
* Broadcast files for HSDP by mvpatel2000 in https://github.com/mosaicml/composer/pull/2914
* Bump ipykernel from 6.29.0 to 6.29.2 by dependabot in https://github.com/mosaicml/composer/pull/2994
* Bump yamllint from 1.33.0 to 1.34.0 by dependabot in https://github.com/mosaicml/composer/pull/2995
* Refactor `update_metric` by maxisawesome in https://github.com/mosaicml/composer/pull/2965
* Add azure integration test by mvpatel2000 in https://github.com/mosaicml/composer/pull/2996
* Fix Profiler schedule skip_first by bigning in https://github.com/mosaicml/composer/pull/2992
* Remove planner validation by mvpatel2000 in https://github.com/mosaicml/composer/pull/2985
* Fix load for non-HSDP device mesh by mvpatel2000 in https://github.com/mosaicml/composer/pull/2997
* Update NCCL arg since torch deprecated old one by mvpatel2000 in https://github.com/mosaicml/composer/pull/3000
* Add bias argument to LPLN by mvpatel2000 in https://github.com/mosaicml/composer/pull/2999
* Revert "Add bias argument to LPLN" by mvpatel2000 in https://github.com/mosaicml/composer/pull/3003
* Revert "Update NCCL arg since torch deprecated old one" by mvpatel2000 in https://github.com/mosaicml/composer/pull/3004
* Add torch 2.3 image for aws cluster by j316chuck in https://github.com/mosaicml/composer/pull/3002
* Patch torch 2.3 aws naming by j316chuck in https://github.com/mosaicml/composer/pull/3006
* Add debug log before training loop starts by mvpatel2000 in https://github.com/mosaicml/composer/pull/3005
* Deprecate ffcv code by j316chuck in https://github.com/mosaicml/composer/pull/3007
* Remove log for mosaicml logger by mvpatel2000 in https://github.com/mosaicml/composer/pull/3008
* [EASY] Always log 1st batch when resuming training by bigning in https://github.com/mosaicml/composer/pull/3009
* Use reusable actions for linting by b-chu in https://github.com/mosaicml/composer/pull/2948
* Make CodeEval respect device_eval_batch_size by josejg in https://github.com/mosaicml/composer/pull/2969
* Use Mosaic constant for GPU file prefix by jjanezhang in https://github.com/mosaicml/composer/pull/3018
* Fall back to normal logging when gpu prefix is not present by jjanezhang in https://github.com/mosaicml/composer/pull/3020
* Revert "Use reusable actions for linting" to fix CI/CD by mvpatel2000 in https://github.com/mosaicml/composer/pull/3023
* Change to pull_request_target by b-chu in https://github.com/mosaicml/composer/pull/3025
* Bump gitpython from 3.1.41 to 3.1.42 by dependabot in https://github.com/mosaicml/composer/pull/3031
* Bump yamllint from 1.34.0 to 1.35.1 by dependabot in https://github.com/mosaicml/composer/pull/3034
* Update torchmetrics requirement from <1.3.1,>=0.10.0 to >=0.10.0,<1.3.2 by dependabot in https://github.com/mosaicml/composer/pull/3035
* Bump pypandoc from 1.12 to 1.13 by dependabot in https://github.com/mosaicml/composer/pull/3033
* Add tensorboard images support by Menduist in https://github.com/mosaicml/composer/pull/3021
* Add sorted to logs for checkpoint broadcast by mvpatel2000 in https://github.com/mosaicml/composer/pull/3036
* Friendlier device mesh error by mvpatel2000 in https://github.com/mosaicml/composer/pull/3039
* Upgrade to python3.11 for torch nightly by j316chuck in https://github.com/mosaicml/composer/pull/3038
* Download symlink once by mvpatel2000 in https://github.com/mosaicml/composer/pull/3043
* Add min size to OCI download by mvpatel2000 in https://github.com/mosaicml/composer/pull/3044
* Lint fix by mvpatel2000 in https://github.com/mosaicml/composer/pull/3045
* Revert "Change to pull_request_target " by mvpatel2000 in https://github.com/mosaicml/composer/pull/3047
* Bump composer version 0.19.2 by j316chuck in https://github.com/mosaicml/composer/pull/3048
* Update XLA support by bfontain in https://github.com/mosaicml/composer/pull/2964
* Bump composer version 0.20.0 by j316chuck in https://github.com/mosaicml/composer/pull/3051
* Update ruff. Fix PLE & LOG lints by Skylion007 in https://github.com/mosaicml/composer/pull/3050
New Contributors
* AleksanderWWW made their first contribution in https://github.com/mosaicml/composer/pull/2447
* ShashankMosaicML made their first contribution in https://github.com/mosaicml/composer/pull/2961
* nancyhung made their first contribution in https://github.com/mosaicml/composer/pull/2982
* bigning made their first contribution in https://github.com/mosaicml/composer/pull/2986
* jessechancy made their first contribution in https://github.com/mosaicml/composer/pull/3001
* josejg made their first contribution in https://github.com/mosaicml/composer/pull/2969
* Menduist made their first contribution in https://github.com/mosaicml/composer/pull/3021
* bfontain made their first contribution in https://github.com/mosaicml/composer/pull/2964
**Full Changelog**: https://github.com/mosaicml/composer/compare/v0.19.1...v0.20.0