What's Changed
* [docs] add zero++ paper link by jeffra in https://github.com/microsoft/DeepSpeed/pull/3974
* Avoid race condition with port selection in unit tests by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/3975
* Remove duplicated inference unit tests by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/3951
* Switch to torch.linalg.norm by loadams in https://github.com/microsoft/DeepSpeed/pull/3984
* Simplify chain comparisons, remove redundant parentheses by digger-yu in https://github.com/microsoft/DeepSpeed/pull/3912
* [CPU] Support HBM flatmode and fakenuma mode by delock in https://github.com/microsoft/DeepSpeed/pull/3918
* Fix checkpoint conversion when model layers share weights by awaelchli in https://github.com/microsoft/DeepSpeed/pull/3825
* fixing flops profiler formatting, units and precision by clumsy in https://github.com/microsoft/DeepSpeed/pull/3927
* Specify language=python in pre-commit hook by wangruohui in https://github.com/microsoft/DeepSpeed/pull/3994
* [CPU] Skip CPU support unimplemented error by Yejing-Lai in https://github.com/microsoft/DeepSpeed/pull/3633
* ZeRO Gradient Accumulation Dtype. by jomayeri in https://github.com/microsoft/DeepSpeed/pull/2847
* [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) by delock in https://github.com/microsoft/DeepSpeed/pull/3919
* Re-enable skipped unit tests by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/3939
* Make AMD/ROCm apex install to /blob to save test/compile time. by loadams in https://github.com/microsoft/DeepSpeed/pull/3997
* Option to exclude frozen weights for checkpoint save by tjruwase in https://github.com/microsoft/DeepSpeed/pull/3953
* Allow user to select name of .deepspeed_env by loadams in https://github.com/microsoft/DeepSpeed/pull/4006
* Silence backend warning by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4009
* Fix user arg parsing in single node deployment by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4007
* Specify triton 2.0.0 requirement by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4008
* Re-enable elastic training for torch 2+ by loadams in https://github.com/microsoft/DeepSpeed/pull/4010
* add /dev/shm size to ds_report by jeffra in https://github.com/microsoft/DeepSpeed/pull/4015
* Make Ascend NPU available by hipudding in https://github.com/microsoft/DeepSpeed/pull/3831
* RNNprofiler: fix gates size retrieval logic in _rnn_flops by pinstripe-potoroo in https://github.com/microsoft/DeepSpeed/pull/3921
* fix typo in SECURITY.md by jstan327 in https://github.com/microsoft/DeepSpeed/pull/4019
* add llama2 autoTP support in replace_module by dc3671 in https://github.com/microsoft/DeepSpeed/pull/4022
* [zero_to_fp32] 3x less cpu memory requirements by stas00 in https://github.com/microsoft/DeepSpeed/pull/4025
* [CPU] FusedAdam and CPU training support by delock in https://github.com/microsoft/DeepSpeed/pull/3991
* remove duplicate check for pp and zero stage by inkcherry in https://github.com/microsoft/DeepSpeed/pull/4033
* Pass missing positional arguments in `DeepSpeedHybridEngine.generate()` by XuehaiPan in https://github.com/microsoft/DeepSpeed/pull/4026
* Remove print of weight parameter in RMS norm by puneeshkhanna in https://github.com/microsoft/DeepSpeed/pull/4031
* Monitored Loss Calculations by jomayeri in https://github.com/microsoft/DeepSpeed/pull/4030
* fix(pipe): make pipe module `load_state_dir` non-strict-mode work by hughpu in https://github.com/microsoft/DeepSpeed/pull/4020
* polishing timers and log_dist by clumsy in https://github.com/microsoft/DeepSpeed/pull/3996
* Engine side fix for loading llama checkpoint fine-tuned with zero3 by minjiaz in https://github.com/microsoft/DeepSpeed/pull/3981
* fix: Remove duplicate word the by digger-yu in https://github.com/microsoft/DeepSpeed/pull/4051
* [Bug Fix] Fix comm logging for inference by delock in https://github.com/microsoft/DeepSpeed/pull/4043
* fix opt-350m shard loading issue in AutoTP by sywangyi in https://github.com/microsoft/DeepSpeed/pull/3600
* enable autoTP for MPT by sywangyi in https://github.com/microsoft/DeepSpeed/pull/3861
* autoTP for fused qkv weight by inkcherry in https://github.com/microsoft/DeepSpeed/pull/3844
* [CPU] Faster reduce kernel for SHM allreduce by delock in https://github.com/microsoft/DeepSpeed/pull/4049
* Multiple zero stage 3 related fixes by tjruwase in https://github.com/microsoft/DeepSpeed/pull/3886
* Fix deadlock when SHM based allreduce spin too fast by delock in https://github.com/microsoft/DeepSpeed/pull/4048
* [MiCS] [Bugfix] set self.save_non_zero_checkpoint=True only for first partition group by zarzen in https://github.com/microsoft/DeepSpeed/pull/3787
* add reproducible compilation environment by fecet in https://github.com/microsoft/DeepSpeed/pull/3943
* fix: remove unnecessary `` punct in the second `sed` command by hughpu in https://github.com/microsoft/DeepSpeed/pull/4061
* Refactor autoTP inference for HE by molly-smith in https://github.com/microsoft/DeepSpeed/pull/4040
* Fix transformers unit tests by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4079
* Fix Stable Diffusion Injection by lekurile in https://github.com/microsoft/DeepSpeed/pull/4078
* Spread layers more uniformly when using partition_uniform by marcobellagente93 in https://github.com/microsoft/DeepSpeed/pull/4053
* fix typo: change polciies to policies by digger-yu in https://github.com/microsoft/DeepSpeed/pull/4090
* update ut/doc for glm/codegen by inkcherry in https://github.com/microsoft/DeepSpeed/pull/4057
* zero_to_fp32 script adds support for tag argument by EeyoreLee in https://github.com/microsoft/DeepSpeed/pull/4089
* add type checker ignore by EeyoreLee in https://github.com/microsoft/DeepSpeed/pull/4102
* Fix generate config validation error on inference unit tests by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4107
* use correct ckpt path when base_dir not available by polisettyvarma in https://github.com/microsoft/DeepSpeed/pull/4101
* Disable z3 tracing profiler by tjruwase in https://github.com/microsoft/DeepSpeed/pull/4106
* Pass correct node size for ZeRO++ by cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/4085
* add deepspeed chat arxiv report by conglongli in https://github.com/microsoft/DeepSpeed/pull/4110
* enable pipeline checkpoint loading mode by leiwen83 in https://github.com/microsoft/DeepSpeed/pull/3629
* Fix Issue 4083 by jomayeri in https://github.com/microsoft/DeepSpeed/pull/4084
* Add full list of DS_BUILD_* by loadams in https://github.com/microsoft/DeepSpeed/pull/4119
* Update nightly workflows to open an issue if CI fails by loadams in https://github.com/microsoft/DeepSpeed/pull/3952
* Update torch1.9 tests to 1.10 to match latest accelerate. by loadams in https://github.com/microsoft/DeepSpeed/pull/4126
* Handle PermissionError in os.chmod Call - Update engine.py by M-Chris in https://github.com/microsoft/DeepSpeed/pull/4139
* Generalize frozen weights unit test by tjruwase in https://github.com/microsoft/DeepSpeed/pull/4140
* Respect memory pinning config by tjruwase in https://github.com/microsoft/DeepSpeed/pull/4131
* Remove incorrect async-io library checking code. by loadams in https://github.com/microsoft/DeepSpeed/pull/4150
* Return nn.parameter type for weights and biases by molly-smith in https://github.com/microsoft/DeepSpeed/pull/4146
* Fixes 4151 by saforem2 in https://github.com/microsoft/DeepSpeed/pull/4152
* Handling for SIGTERM as well by loadams in https://github.com/microsoft/DeepSpeed/pull/4160
* Fix CI Badges by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4162
* Add DS-Chat CI workflow by lekurile in https://github.com/microsoft/DeepSpeed/pull/4127
* [CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend by delock in https://github.com/microsoft/DeepSpeed/pull/4115
* Add DSE branch input to nv-ds-chat by lekurile in https://github.com/microsoft/DeepSpeed/pull/4173
* Pin transformers by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/4174
New Contributors
* awaelchli made their first contribution in https://github.com/microsoft/DeepSpeed/pull/3825
* wangruohui made their first contribution in https://github.com/microsoft/DeepSpeed/pull/3994
* jstan327 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4019
* XuehaiPan made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4026
* puneeshkhanna made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4031
* hughpu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4020
* fecet made their first contribution in https://github.com/microsoft/DeepSpeed/pull/3943
* marcobellagente93 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4053
* polisettyvarma made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4101
* leiwen83 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/3629
* M-Chris made their first contribution in https://github.com/microsoft/DeepSpeed/pull/4139
**Full Changelog**: https://github.com/microsoft/DeepSpeed/compare/v0.10.0...v0.10.1