Added
- Added support for PyTorch conv3d with the channels_last_3d memory format.
Changed
- Refined the batch memory copy kernel and added padding support to align with the public logic; updated the corresponding cases.
- Rebased the code onto the public v0.28.1 release.
- Aligned the installation method with public HVD.
- Refined BroadcastInplaceOp for TF.
- Enabled the public Horovod TensorFlow examples for IOH.
- Temporarily skipped the accuracy check for bf16/fp16 when running with more than 2 ranks, because it is not yet clear how the threshold should change as the number of ranks increases.
Fixed
- Fixed SDL warning.
- Fixed hvd.join with allreduce.
- Fixed a scale-factor-related accuracy issue for bf16/fp16.
- Fixed cpu_operation to use MPI instead of CCL when Intel GPU is enabled.