What's Changed
* Update version.txt after 0.14.0 release by mrwyattii in https://github.com/microsoft/DeepSpeed/pull/5238
* FP6 blog Chinese by xiaoxiawu-microsoft in https://github.com/microsoft/DeepSpeed/pull/5239
* Add contributed HW support into README by delock in https://github.com/microsoft/DeepSpeed/pull/5240
* Set tp world size to 1 in ckpt load, if MPU is not provided by samadejacobs in https://github.com/microsoft/DeepSpeed/pull/5243
* Make op builder detection adapt to accelerator change by delock in https://github.com/microsoft/DeepSpeed/pull/5206
* Replace HIP_PLATFORM_HCC with HIP_PLATFORM_AMD by rraminen in https://github.com/microsoft/DeepSpeed/pull/5264
* Add CI for Habana Labs HPU/Gaudi2 by loadams in https://github.com/microsoft/DeepSpeed/pull/5244
* Fix attention mask handling in the Hybrid Engine Bloom flow by deepcharm in https://github.com/microsoft/DeepSpeed/pull/5101
* Skip 1Bit Compression and sparsegrad tests for HPU by vshekhawat-hlab in https://github.com/microsoft/DeepSpeed/pull/5270
* Enabled LMCorrectness inference tests on HPU by vshekhawat-hlab in https://github.com/microsoft/DeepSpeed/pull/5271
* Added HPU backend support for torch.compile tests by vshekhawat-hlab in https://github.com/microsoft/DeepSpeed/pull/5269
* Average only valid part of the ipg buffer by BacharL in https://github.com/microsoft/DeepSpeed/pull/5268
* Add HPU accelerator support in unit tests by vshekhawat-hlab in https://github.com/microsoft/DeepSpeed/pull/5162
* Fix loading a universal checkpoint by tohtana in https://github.com/microsoft/DeepSpeed/pull/5263
* Add Habana Gaudi2 CI badge to the README by loadams in https://github.com/microsoft/DeepSpeed/pull/5286
* Add Intel Gaudi to contributed HW in README by BacharL in https://github.com/microsoft/DeepSpeed/pull/5300
* Fixed Accelerate Link by wkaisertexas in https://github.com/microsoft/DeepSpeed/pull/5314
* Enable Mixtral 8x7B AutoTP by Yejing-Lai in https://github.com/microsoft/DeepSpeed/pull/5257
* Support bf16_optimizer MoE expert parallel training and fix MoE EP grad_scale/grad_norm by inkcherry in https://github.com/microsoft/DeepSpeed/pull/5259
* fix comms dtype by mayank31398 in https://github.com/microsoft/DeepSpeed/pull/5297
* Modified regular expression by igeni in https://github.com/microsoft/DeepSpeed/pull/5306
* Docs typos fix and grammar suggestions by Gr0g0 in https://github.com/microsoft/DeepSpeed/pull/5322
* Added Gaudi2 CI tests by vshekhawat-hlab in https://github.com/microsoft/DeepSpeed/pull/5275
* Improve universal checkpoint by tohtana in https://github.com/microsoft/DeepSpeed/pull/5289
* Increase coverage for HPU by loadams in https://github.com/microsoft/DeepSpeed/pull/5324
* Add NFS path check for default deepspeed triton cache directory by HeyangQin in https://github.com/microsoft/DeepSpeed/pull/5323
* Correct typo in checking on bf16 unit test support by loadams in https://github.com/microsoft/DeepSpeed/pull/5317
* Make NFS warning print only once by HeyangQin in https://github.com/microsoft/DeepSpeed/pull/5345
* resolve KeyError: 'PDSH_SSH_ARGS_APPEND' by Lzhang-hub in https://github.com/microsoft/DeepSpeed/pull/5318
* BF16 optimizer: Clear lp grads after updating hp grads in hook by YangQun1 in https://github.com/microsoft/DeepSpeed/pull/5328
* Fix sort of zero checkpoint files by tohtana in https://github.com/microsoft/DeepSpeed/pull/5342
* Add `distributed_port` for `deepspeed.initialize` by LZHgrla in https://github.com/microsoft/DeepSpeed/pull/5260
* [fix] fix typo s/simultanenously /simultaneously by digger-yu in https://github.com/microsoft/DeepSpeed/pull/5359
* Update container version for Gaudi2 CI by raza-sikander in https://github.com/microsoft/DeepSpeed/pull/5360
* compute global norm on device by BacharL in https://github.com/microsoft/DeepSpeed/pull/5125
* logger update with torch master changes by rogerxfeng8 in https://github.com/microsoft/DeepSpeed/pull/5346
* Ensure capacity does not exceed number of tokens by jeffra in https://github.com/microsoft/DeepSpeed/pull/5353
* Update workflows that use cu116 to cu117 by loadams in https://github.com/microsoft/DeepSpeed/pull/5361
* FP [6,8,12] quantizer op by jeffra in https://github.com/microsoft/DeepSpeed/pull/5336
* Improve CPU SHM-based inference_all_reduce by delock in https://github.com/microsoft/DeepSpeed/pull/5320
* Auto convert moe param groups by jeffra in https://github.com/microsoft/DeepSpeed/pull/5354
* Support MoE for pipeline models by mosheisland in https://github.com/microsoft/DeepSpeed/pull/5338
* Update pytest and transformers with fixes for pytest>= 8.0.0 by loadams in https://github.com/microsoft/DeepSpeed/pull/5164
* Increase CI coverage for Gaudi2 accelerator by vshekhawat-hlab in https://github.com/microsoft/DeepSpeed/pull/5358
* Add CI for Intel XPU/Max1100 by Liangliang-Ma in https://github.com/microsoft/DeepSpeed/pull/5376
* Update path name on xpu-max1100.yml, add badge in README by loadams in https://github.com/microsoft/DeepSpeed/pull/5386
* Update checkout action on workflows on ubuntu 20.04 by loadams in https://github.com/microsoft/DeepSpeed/pull/5387
* Clean up required_torch_version code and references by loadams in https://github.com/microsoft/DeepSpeed/pull/5370
* Update README.md for Intel XPU support by Liangliang-Ma in https://github.com/microsoft/DeepSpeed/pull/5389
* Optimize the fp-dequantizer to get high memory-BW utilization by RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/5373
* Removal of cuda hardcoded string with get_device function by raza-sikander in https://github.com/microsoft/DeepSpeed/pull/5351
* Add custom reshaping for universal checkpoint by tohtana in https://github.com/microsoft/DeepSpeed/pull/5390
* Fix pageable h2d memcpy by GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/5301
* stage3: efficient compute of scaled_global_grad_norm by nelyahu in https://github.com/microsoft/DeepSpeed/pull/5256
* Fix the FP6 kernels compilation problem on non-Ampere GPUs by JamesTheZ in https://github.com/microsoft/DeepSpeed/pull/5333
New Contributors
* vshekhawat-hlab made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5270
* wkaisertexas made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5314
* igeni made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5306
* Gr0g0 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5322
* Lzhang-hub made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5318
* YangQun1 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5328
* raza-sikander made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5360
* rogerxfeng8 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5346
* JamesTheZ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/5333
**Full Changelog**: https://github.com/microsoft/DeepSpeed/compare/v0.14.0...v0.14.1