Combined release notes since the November 12th v0.3.1 release
* Various updates to torch.distributed initialization
  * New `deepspeed.init_distributed` API (#608, #645, #644); see the usage sketch after this list
  * Improved AzureML support for patching the torch.distributed backend (#542)
  * Simplified distributed initialization, which now runs only if needed (#553)
* Transformer kernel updates
  * Support for different hidden dimensions (#559)
  * Support for arbitrary sequence lengths (#587)
* Elastic training support (#602)
  * NOTE: More details to come; this feature is still in its initial pilot phase.
* Module replacement support (#586)
  * NOTE: This will be used and documented more in the short term to help automatically inject/replace DeepSpeed ops into client models.
* Removed the psutil and cpufeature dependencies (#528)
* Various ZeRO stage 1 and 2 bug fixes and updates (#531, #532, #545, #548)
* Backwards compatibility with checkpoints from the older DeepSpeed v0.2 release (#543)
* Added `static_loss_scale` support to the unfused optimizer (#546); see the FP16 config sketch after this list
* Bug fix for norm calculation in the absence of a model-parallel group (#551)
* Switched CI from Azure Pipelines to GitHub Actions
* Deprecated the client's ability to disable gradient reduction (#552)
* Bug fix for tracking the optimizer step in CPU-Adam when loading a checkpoint (#564)
* Improved support for the Ampere architecture (#572, #570, #577, #578, #591, #642)
* Fixed potential random layout inconsistency issues in sparse attention modules (#534)
* Support for customizing kwargs for the lr_scheduler (#584); see the scheduler config sketch after this list
* Support for `deepspeed.initialize` with a dict configuration instead of a config-file path argument (#632); see the initialization sketch after this list
* Allow DeepSpeed models to be initialized with optimizer=None (#469), also shown in the initialization sketch
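A minimal usage sketch for the new distributed-initialization API follows. It assumes a script launched with the usual rank/world-size environment variables set (e.g. via the deepspeed launcher); the NCCL backend choice is illustrative.

```python
import torch
import deepspeed

# Replaces a manual torch.distributed.init_process_group() call. DeepSpeed
# reads rank and world size from the launcher environment and, per the
# simplified init above, skips this step if torch.distributed is already up.
deepspeed.init_distributed(dist_backend="nccl")  # backend choice is illustrative

print(f"distributed initialized, rank {torch.distributed.get_rank()}")
```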
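For the `static_loss_scale` support in the unfused optimizer, below is a sketch of the relevant FP16 configuration, assuming DeepSpeed's convention that a nonzero `loss_scale` selects a static scale while `0` selects dynamic loss scaling; the other values are placeholders.

```python
# Placeholder config; the "fp16" section is the relevant part.
ds_config = {
    "train_batch_size": 8,
    "fp16": {
        "enabled": True,
        "loss_scale": 128,  # nonzero => static loss scale; 0 => dynamic scaling
    },
}
```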
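For the customizable lr_scheduler kwargs, the sketch below assumes the `WarmupLR` scheduler; the `params` block is forwarded as keyword arguments to the scheduler constructor, and the specific values are placeholders.

```python
# The "params" dict is passed through as kwargs to the named scheduler.
ds_config = {
    "train_batch_size": 8,
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 1e-3,
            "warmup_num_steps": 1000,
        },
    },
}
```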
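Finally, a combined sketch of dict-based configuration and optimizer-free initialization. The toy model and the keyword used to pass the dict (`config_params` here) are assumptions for illustration; with no client optimizer supplied, DeepSpeed constructs the optimizer declared in the config.

```python
import torch
import deepspeed

model = torch.nn.Linear(10, 10)  # toy stand-in for a real client model

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# No JSON config path and no client optimizer: the dict supplies the config,
# and DeepSpeed builds the Adam optimizer declared above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,  # assumed keyword for passing a dict config
)
```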
Special thanks to our contributors in this release:
stas00, gcooper-isi, g-karthik, sxjscience, brettkoonce, carefree0910, Justin1904, harrydrippin