Kubeflow-training

Latest version: v1.9.1

1.7.0rc.0

**Breaking Changes**
- Make scheduler-plugins the default gang scheduler. [\1747](https://github.com/kubeflow/training-operator/pull/1747) ([Syulin7](https://github.com/Syulin7))
- Upgrade the kubernetes dependencies to v1.27 [\#1834](https://github.com/kubeflow/training-operator/pull/1834) ([tenzen-y](https://github.com/tenzen-y))

**New features**
- Make scheduler-plugins the default gang scheduler. [\1747](https://github.com/kubeflow/training-operator/pull/1747) ([Syulin7](https://github.com/Syulin7))
- Merge kubeflow/common to training-operator [\1813](https://github.com/kubeflow/training-operator/pull/1813) ([johnugeorge](https://github.com/johnugeorge))
- Auto-generate RBAC manifests by the controller-gen [\1815](https://github.com/kubeflow/training-operator/pull/1815) ([Syulin7](https://github.com/Syulin7))
- Implement suspend semantics [\1859](https://github.com/kubeflow/training-operator/pull/1859) ([tenzen-y](https://github.com/tenzen-y))
- Set up controllers using goroutines to start the manager quickly [\1869](https://github.com/kubeflow/training-operator/pull/1869) ([tenzen-y](https://github.com/tenzen-y))
- Set correct ENV for PytorchJob to support torchrun [\1840](https://github.com/kubeflow/training-operator/pull/1840) ([kuizhiqing](https://github.com/kuizhiqing))

**Bug fixes**
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed [\1866](https://github.com/kubeflow/training-operator/pull/1866) ([tenzen-y](https://github.com/tenzen-y))
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition [\1789](https://github.com/kubeflow/training-operator/pull/1789) ([tenzen-y](https://github.com/tenzen-y))
- Avoid depending on the local env when installing the code-generators [\1810](https://github.com/kubeflow/training-operator/pull/1810) ([tenzen-y](https://github.com/tenzen-y))

**Misc**
- Removing reconciler code [\1879](https://github.com/kubeflow/training-operator/pull/1879) ([johnugeorge](https://github.com/johnugeorge))
- Make Condition and ReplicaStatus optional [\1862](https://github.com/kubeflow/training-operator/pull/1862) ([tenzen-y](https://github.com/tenzen-y))
- Use the same reasons for Condition and Event [\1854](https://github.com/kubeflow/training-operator/pull/1854) ([tenzen-y](https://github.com/tenzen-y))
- Fully consolidate tfjob-operator to training-operator [\1850](https://github.com/kubeflow/training-operator/pull/1850) ([tenzen-y](https://github.com/tenzen-y))
- Clean up /pkg/common/util/v1 [\1845](https://github.com/kubeflow/training-operator/pull/1845) ([tenzen-y](https://github.com/tenzen-y))
- Refactoring tests in common/controller.v1 [\1843](https://github.com/kubeflow/training-operator/pull/1843) ([tenzen-y](https://github.com/tenzen-y))
- remove duplicate code of add task spec annotation [\1839](https://github.com/kubeflow/training-operator/pull/1839) ([lowang-bh](https://github.com/lowang-bh))
- fetch volcano log when e2e failed [\1837](https://github.com/kubeflow/training-operator/pull/1837) ([lowang-bh](https://github.com/lowang-bh))
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e [\1835](https://github.com/kubeflow/training-operator/pull/1835) ([tenzen-y](https://github.com/tenzen-y))
- Replace dummy client with fake client [\1818](https://github.com/kubeflow/training-operator/pull/1818) ([tenzen-y](https://github.com/tenzen-y))
- Add default Intel MPI env variables to MPIJob [\1804](https://github.com/kubeflow/training-operator/pull/1804) ([tkatila](https://github.com/tkatila))
- Improve E2E tests for the gang-scheduling [\1801](https://github.com/kubeflow/training-operator/pull/1801) ([tenzen-y](https://github.com/tenzen-y))
- xgb yaml container name should be consistent with xgb job default container name [\1794](https://github.com/kubeflow/training-operator/pull/1794) ([Crisescode](https://github.com/Crisescode))
- make timeout configurable from e2e tests [\1787](https://github.com/kubeflow/training-operator/pull/1787) ([nagar-ajay](https://github.com/nagar-ajay))

1.6.0

Note: Because scheduler-plugins changed its API from `sigs.k8s.io` to `x-k8s.io`, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: [\1773](https://github.com/kubeflow/training-operator/pull/1773)

Note: The latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training/1.6.0/) does not support earlier training operator versions; the minimum required training operator version is v1.6.0. Related: [\#1702](https://github.com/kubeflow/training-operator/pull/1702)
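
The two compatibility notes above can be turned into a small pre-upgrade check. The following is a minimal sketch, not part of the SDK: the function names are hypothetical, and the parser is assumed to ignore pre-release suffixes such as `rc.0`, so a release candidate is treated like its final release.

```python
import re


def parse_version(text):
    """Turn 'v1.6.0' or '1.7.0rc.0' into a comparable (major, minor, patch) tuple.

    Assumption: any pre-release suffix (e.g. 'rc.0') is ignored.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def sdk_1_6_supports_operator(operator_version):
    """Python SDK 1.6 needs training operator v1.6.0 or newer (per the note)."""
    return parse_version(operator_version) >= (1, 6, 0)


def is_unsupported_pairing(operator_version, scheduler_plugins_version):
    """True only for the pairing the note rules out explicitly:
    training operator v1.7+ together with scheduler-plugins v0.24.x or lower.
    Other pairings are not covered by the note, so this returns False for them.
    """
    operator_is_1_7_plus = parse_version(operator_version)[:2] >= (1, 7)
    plugins_is_0_24_or_lower = parse_version(scheduler_plugins_version)[:2] <= (0, 24)
    return operator_is_1_7_plus and plugins_is_0_24_or_lower
```

A `False` result from `is_unsupported_pairing` only means the note does not exclude the pairing; it is not a statement that the pairing has been tested.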

**New Features**
- Support for k8s v1.25 in CI [\1684](https://github.com/kubeflow/training-operator/pull/1684) ([johnugeorge](https://github.com/johnugeorge))
- HPA support for PyTorch Elastic [\1701](https://github.com/kubeflow/training-operator/pull/1701) ([johnugeorge](https://github.com/johnugeorge))
- Adopting coscheduling plugin [\1724](https://github.com/kubeflow/training-operator/pull/1724) ([tenzen-y](https://github.com/tenzen-y))
- Support for Paddlepaddle [\1675](https://github.com/kubeflow/training-operator/pull/1675) ([kuizhiqing](https://github.com/kuizhiqing))
- Create TFJob and PyTorchJob from Function APIs in the Training SDK [\1659](https://github.com/kubeflow/training-operator/pull/1659) ([andreyvelich](https://github.com/andreyvelich))
- \[SDK\] Use Training Client without Kube Config [\1740](https://github.com/kubeflow/training-operator/pull/1740) ([andreyvelich](https://github.com/andreyvelich))
- \[SDK\] Create Unify Training Client [\1719](https://github.com/kubeflow/training-operator/pull/1719) ([andreyvelich](https://github.com/andreyvelich))

**Bug fixes**
- [SDK] pod has no metadata attr anymore in the get\_job\_logs\(\) … [\1760](https://github.com/kubeflow/training-operator/pull/1760) ([yaobaiwei](https://github.com/yaobaiwei))
- Add PodGroup as controller watch source [\1666](https://github.com/kubeflow/training-operator/pull/1666) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- fix infinite loop in init-pytorch container [\1756](https://github.com/kubeflow/training-operator/pull/1756) ([kidddddddddddddddddddddd](https://github.com/kidddddddddddddddddddddd))
- Fix the success condition of the job in PyTorchJob's Elastic mode. [\1752](https://github.com/kubeflow/training-operator/pull/1752) ([Syulin7](https://github.com/Syulin7))
- Fix XGBoost conditions bug [\1737](https://github.com/kubeflow/training-operator/pull/1737) ([tenzen-y](https://github.com/tenzen-y))
- To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example [\1733](https://github.com/kubeflow/training-operator/pull/1733) ([tenzen-y](https://github.com/tenzen-y))
- fix: support MxNet single host training when update mxJob status [\1644](https://github.com/kubeflow/training-operator/pull/1644) ([PeterChg](https://github.com/PeterChg))
- fix: fix mxnet failed to update StartTime and CompletionTime [\1643](https://github.com/kubeflow/training-operator/pull/1643) ([PeterChg](https://github.com/PeterChg))
- Fix the default LeaderElectionID and make it an argument [\1639](https://github.com/kubeflow/training-operator/pull/1639) ([goyalankit](https://github.com/goyalankit))
- fix: fix wrong parameter for resolveControllerRef [\1583](https://github.com/kubeflow/training-operator/pull/1583) ([fighterhit](https://github.com/fighterhit))
- fix: tfjob with restartPolicy=ExitCode does not work [\1562](https://github.com/kubeflow/training-operator/pull/1562) ([cheimu](https://github.com/cheimu))
- fix: Mac M1 compatible Dockerfile and bump TF version [\1700](https://github.com/kubeflow/training-operator/pull/1700) ([terrytangyuan](https://github.com/terrytangyuan))
- Fix status lost [\1697](https://github.com/kubeflow/training-operator/pull/1697) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- handle all restart policies [\1649](https://github.com/kubeflow/training-operator/pull/1649) ([abin-thomas-by](https://github.com/abin-thomas-by))
- \[chore\] fix typo [\1648](https://github.com/kubeflow/training-operator/pull/1648) ([tenzen-y](https://github.com/tenzen-y))
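
For readers unfamiliar with the `restartPolicy=ExitCode` mode mentioned in the fixes above, here is a minimal sketch of how container exit codes are conventionally classified under that policy. It assumes the ranges described in the TFJob documentation (1-127 permanent, 128-255 retryable); the function name and return labels are hypothetical, and this is not the operator's own code.

```python
def classify_exit_code(exit_code):
    """Classify a container exit code under restartPolicy=ExitCode.

    Assumed ranges (from the TFJob documentation, not from this changelog):
      0        -> the replica succeeded
      1..127   -> permanent error, the replica is not restarted
      128..255 -> retryable error, the replica is restarted
    """
    if exit_code == 0:
        return "succeeded"
    if 1 <= exit_code <= 127:
        return "permanent-error"
    if 128 <= exit_code <= 255:
        return "retryable-error"
    raise ValueError(f"exit code out of range: {exit_code}")
```

Under this convention, a process killed by a signal (exit code 128 plus the signal number, for example 137 for SIGKILL) counts as retryable, while an ordinary application error such as exit code 1 does not.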

**Misc**
- Add validation for verifying that the CustomJob \(e.g., TFJob\) name meets DNS1035 [\1748](https://github.com/kubeflow/training-operator/pull/1748) ([tenzen-y](https://github.com/tenzen-y))
- Configure controller worker threads [\1707](https://github.com/kubeflow/training-operator/pull/1707) ([HeGaoYuan](https://github.com/HeGaoYuan))
- Validation Spec consistency [\1705](https://github.com/kubeflow/training-operator/pull/1705) ([HeGaoYuan](https://github.com/HeGaoYuan))
- \[SDK\] Remove Final Keyword from constants [\1676](https://github.com/kubeflow/training-operator/pull/1676) ([andreyvelich](https://github.com/andreyvelich))
- Fix Python installation in CI [\1759](https://github.com/kubeflow/training-operator/pull/1759) ([tenzen-y](https://github.com/tenzen-y))
- Update mpijob\_controller.go [\1755](https://github.com/kubeflow/training-operator/pull/1755) ([yshalabi](https://github.com/yshalabi))
- Set the default value of CleanPodPolicy to None [\1754](https://github.com/kubeflow/training-operator/pull/1754) ([Syulin7](https://github.com/Syulin7))
- Update join Slack link [\1750](https://github.com/kubeflow/training-operator/pull/1750) ([Syulin7](https://github.com/Syulin7))
- Update latest operator image [\1742](https://github.com/kubeflow/training-operator/pull/1742) ([johnugeorge](https://github.com/johnugeorge))
- Run E2E with various Python versions to verify Python SDK [\1741](https://github.com/kubeflow/training-operator/pull/1741) ([tenzen-y](https://github.com/tenzen-y))
- Add Yuki to reviewer group [\1739](https://github.com/kubeflow/training-operator/pull/1739) ([johnugeorge](https://github.com/johnugeorge))
- Trim down CRD descriptions [\1735](https://github.com/kubeflow/training-operator/pull/1735) ([tenzen-y](https://github.com/tenzen-y))
- Add CI to build example images [\1731](https://github.com/kubeflow/training-operator/pull/1731) ([tenzen-y](https://github.com/tenzen-y))
- Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup [\1730](https://github.com/kubeflow/training-operator/pull/1730) ([tenzen-y](https://github.com/tenzen-y))
- Fix indents on examples for tensorflow [\1726](https://github.com/kubeflow/training-operator/pull/1726) ([tenzen-y](https://github.com/tenzen-y))
- docs: Update Kubernetes requirement and version matrix [\1721](https://github.com/kubeflow/training-operator/pull/1721) ([terrytangyuan](https://github.com/terrytangyuan))
- chore: Update the use of MultiWorkerMirroredStrategy in TF [\1715](https://github.com/kubeflow/training-operator/pull/1715) ([terrytangyuan](https://github.com/terrytangyuan))
- Removing deprecated Job Labels [\1702](https://github.com/kubeflow/training-operator/pull/1702) ([johnugeorge](https://github.com/johnugeorge))
- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf\_operator [\1699](https://github.com/kubeflow/training-operator/pull/1699) ([dependabot[bot]](https://github.com/apps/dependabot))
- Add myself to reviewer. [\1689](https://github.com/kubeflow/training-operator/pull/1689) ([kuizhiqing](https://github.com/kuizhiqing))
- Upgrade the envtest version [\1687](https://github.com/kubeflow/training-operator/pull/1687) ([tenzen-y](https://github.com/tenzen-y))
- \[chore\] Upgrade some actions version [\1686](https://github.com/kubeflow/training-operator/pull/1686) ([tenzen-y](https://github.com/tenzen-y))
- Upgrade Golangci-lint [\1685](https://github.com/kubeflow/training-operator/pull/1685) ([johnugeorge](https://github.com/johnugeorge))
- Make a generic logger instead of the nil logger on dependent update [\1680](https://github.com/kubeflow/training-operator/pull/1680) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf\_operator [\1669](https://github.com/kubeflow/training-operator/pull/1669) ([dependabot[bot]](https://github.com/apps/dependabot))
- Removed GOARCH dependency for multiarch support [\1674](https://github.com/kubeflow/training-operator/pull/1674) ([pranavpandit1](https://github.com/pranavpandit1))
- Update deployment.yaml [\1668](https://github.com/kubeflow/training-operator/pull/1668) ([OmriShiv](https://github.com/OmriShiv))
- Upgrade Go version to v1.19 [\1663](https://github.com/kubeflow/training-operator/pull/1663) ([tenzen-y](https://github.com/tenzen-y))
- Upgrade kubernetes version for test [\1667](https://github.com/kubeflow/training-operator/pull/1667) ([tenzen-y](https://github.com/tenzen-y))
- Adding support for linux/ppc64le in github actions for training-operator [\1692](https://github.com/kubeflow/training-operator/pull/1692) ([amitmukati-2604](https://github.com/amitmukati-2604))
- style: Refine name and signature of 2 replicaName functions [\1660](https://github.com/kubeflow/training-operator/pull/1660) ([houz42](https://github.com/houz42))
- Update training operator sdk version to 1.5.0 [\1651](https://github.com/kubeflow/training-operator/pull/1651) ([johnugeorge](https://github.com/johnugeorge))
- Add finalizers to cluster-role [\1646](https://github.com/kubeflow/training-operator/pull/1646) ([ArangoGutierrez](https://github.com/ArangoGutierrez))
- Update the cmd to support MPI operator in README [\1656](https://github.com/kubeflow/training-operator/pull/1656) ([denkensk](https://github.com/denkensk))
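
The DNS1035 job-name validation listed above can be approximated on the client side before a job is submitted. This is a minimal sketch of the RFC 1035 label rule as Kubernetes applies it (lowercase alphanumerics or `-`, starting with a letter, ending with an alphanumeric, at most 63 characters); the helper name is hypothetical and the code is not taken from the operator.

```python
import re

# RFC 1035 label as used by Kubernetes: starts with a lowercase letter,
# ends with a lowercase letter or digit, with '-' allowed in between.
DNS1035_LABEL_MAX_LENGTH = 63
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def is_dns1035_label(name):
    """Return True if `name` is a valid DNS-1035 label."""
    if len(name) > DNS1035_LABEL_MAX_LENGTH:
        return False
    return _DNS1035_LABEL.fullmatch(name) is not None
```

For example, `is_dns1035_label("pytorch-dist-mnist")` is true, while names that start with a digit, contain uppercase letters or underscores, or end with `-` are rejected.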

**Closed issues:**

- The default value for CleanPodPolicy is inconsistent. [\1753](https://github.com/kubeflow/training-operator/issues/1753)
- HPA support for PyTorch Elastic [\1751](https://github.com/kubeflow/training-operator/issues/1751)
- Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [\1745](https://github.com/kubeflow/training-operator/issues/1745)
- paddle-operator can not get podgroup status\(inqueue\) with volcano when enable gang [\1729](https://github.com/kubeflow/training-operator/issues/1729)
- \*job API\(master\) cannot compatible with old job [\1725](https://github.com/kubeflow/training-operator/issues/1725)
- Support coscheduling plugin [\1722](https://github.com/kubeflow/training-operator/issues/1722)
- Number of worker threads used by the controller can't be configured [\1706](https://github.com/kubeflow/training-operator/issues/1706)
- Conformance: Training tests [\1698](https://github.com/kubeflow/training-operator/issues/1698)
- PyTorch and MPI Operator pulls hardcoded initContainer [\1696](https://github.com/kubeflow/training-operator/issues/1696)
- PaddlePaddle Training: why can't find pods [\1694](https://github.com/kubeflow/training-operator/issues/1694)
- Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 [\1693](https://github.com/kubeflow/training-operator/issues/1693)
- \[SDK\] Create unify client for all Training Job types [\1691](https://github.com/kubeflow/training-operator/issues/1691)
- Support Kubernetes v1.25 [\1682](https://github.com/kubeflow/training-operator/issues/1682)
- panic happened when add podgroup watch [\1679](https://github.com/kubeflow/training-operator/issues/1679)
- OnDependentUpdateFunc for Job will panic when enable volcano scheduler [\1678](https://github.com/kubeflow/training-operator/issues/1678)
- There is no clusterrole of "MPI Jobs" in kubeflow 1.5. [\1670](https://github.com/kubeflow/training-operator/issues/1670)
- Change Kubernetes version for test [\1665](https://github.com/kubeflow/training-operator/issues/1665)
- Support for multiplatform container image \(amd64 and arm64\) [\1664](https://github.com/kubeflow/training-operator/issues/1664)
- Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" [\1661](https://github.com/kubeflow/training-operator/issues/1661)
- After setting hostNetwork to true, mpi does not work [\1657](https://github.com/kubeflow/training-operator/issues/1657)
- What is the purpose of /examples/pytorch/elastic/etcd.yaml [\1655](https://github.com/kubeflow/training-operator/issues/1655)
- When will MPIJob support v2beta1 version? [\1653](https://github.com/kubeflow/training-operator/issues/1653)
- Kubernetes HPA doesn't work with elastic PytorchJob [\1645](https://github.com/kubeflow/training-operator/issues/1645)
- training-operator can not get podgroup status\(inqueue\) with volcano when enable gang [\1630](https://github.com/kubeflow/training-operator/issues/1630)
- Training operator fails to create HPA for TorchElastic jobs [\1626](https://github.com/kubeflow/training-operator/issues/1626)

1.6.0rc.1

Note: Because scheduler-plugins changed its API from `sigs.k8s.io` to `x-k8s.io`, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

**Merged pull requests:**
- [SDK] pod has no metadata attr anymore in the get\_job\_logs\(\) … [\1760](https://github.com/kubeflow/training-operator/pull/1760) ([yaobaiwei](https://github.com/yaobaiwei))
- Fix Python installation in CI [\1759](https://github.com/kubeflow/training-operator/pull/1759) ([tenzen-y](https://github.com/tenzen-y))
- fix infinite loop in init-pytorch container [\1756](https://github.com/kubeflow/training-operator/pull/1756) ([kidddddddddddddddddddddd](https://github.com/kidddddddddddddddddddddd))
- Update mpijob\_controller.go [\1755](https://github.com/kubeflow/training-operator/pull/1755) ([yshalabi](https://github.com/yshalabi))
- Set the default value of CleanPodPolicy to None [\1754](https://github.com/kubeflow/training-operator/pull/1754) ([Syulin7](https://github.com/Syulin7))
- Fix the success condition of the job in PyTorchJob's Elastic mode. [\1752](https://github.com/kubeflow/training-operator/pull/1752) ([Syulin7](https://github.com/Syulin7))
- Update join Slack link [\1750](https://github.com/kubeflow/training-operator/pull/1750) ([Syulin7](https://github.com/Syulin7))
- Add validation for verifying that the CustomJob \(e.g., TFJob\) name meets DNS1035 [\1748](https://github.com/kubeflow/training-operator/pull/1748) ([tenzen-y](https://github.com/tenzen-y))
- Update latest operator image [\1742](https://github.com/kubeflow/training-operator/pull/1742) ([johnugeorge](https://github.com/johnugeorge))
- Run E2E with various Python versions to verify Python SDK [\1741](https://github.com/kubeflow/training-operator/pull/1741) ([tenzen-y](https://github.com/tenzen-y))
- \[SDK\] Use Training Client without Kube Config [\1740](https://github.com/kubeflow/training-operator/pull/1740) ([andreyvelich](https://github.com/andreyvelich))
- Add Yuki to reviewer group [\1739](https://github.com/kubeflow/training-operator/pull/1739) ([johnugeorge](https://github.com/johnugeorge))
- Fix XGBoost conditions bug [\1737](https://github.com/kubeflow/training-operator/pull/1737) ([tenzen-y](https://github.com/tenzen-y))
- Add E2E test for gang-scheduling [\1736](https://github.com/kubeflow/training-operator/pull/1736) ([tenzen-y](https://github.com/tenzen-y))
- Trim down CRD descriptions [\1735](https://github.com/kubeflow/training-operator/pull/1735) ([tenzen-y](https://github.com/tenzen-y))
- To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example [\1733](https://github.com/kubeflow/training-operator/pull/1733) ([tenzen-y](https://github.com/tenzen-y))
- Add CI to build example images [\1731](https://github.com/kubeflow/training-operator/pull/1731) ([tenzen-y](https://github.com/tenzen-y))
- Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup [\1730](https://github.com/kubeflow/training-operator/pull/1730) ([tenzen-y](https://github.com/tenzen-y))
- Fix indents on examples for tensorflow [\1726](https://github.com/kubeflow/training-operator/pull/1726) ([tenzen-y](https://github.com/tenzen-y))
- Adopting coscheduling plugin [\1724](https://github.com/kubeflow/training-operator/pull/1724) ([tenzen-y](https://github.com/tenzen-y))
- docs: Update Kubernetes requirement and version matrix [\1721](https://github.com/kubeflow/training-operator/pull/1721) ([terrytangyuan](https://github.com/terrytangyuan))
- \[SDK\] Create Unify Training Client [\1719](https://github.com/kubeflow/training-operator/pull/1719) ([andreyvelich](https://github.com/andreyvelich))
- chore: Update the use of MultiWorkerMirroredStrategy in TF [\1715](https://github.com/kubeflow/training-operator/pull/1715) ([terrytangyuan](https://github.com/terrytangyuan))
- Configure controller worker threads [\1707](https://github.com/kubeflow/training-operator/pull/1707) ([HeGaoYuan](https://github.com/HeGaoYuan))
- Validation Spec consistency [\1705](https://github.com/kubeflow/training-operator/pull/1705) ([HeGaoYuan](https://github.com/HeGaoYuan))
- Removing deprecated Job Labels [\1702](https://github.com/kubeflow/training-operator/pull/1702) ([johnugeorge](https://github.com/johnugeorge))
- HPA support for PyTorch Elastic [\1701](https://github.com/kubeflow/training-operator/pull/1701) ([johnugeorge](https://github.com/johnugeorge))
- fix: Mac M1 compatible Dockerfile and bump TF version [\1700](https://github.com/kubeflow/training-operator/pull/1700) ([terrytangyuan](https://github.com/terrytangyuan))
- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf\_operator [\1699](https://github.com/kubeflow/training-operator/pull/1699) ([dependabot[bot]](https://github.com/apps/dependabot))
- Fix status lost [\1697](https://github.com/kubeflow/training-operator/pull/1697) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- Adding support for linux/ppc64le in github actions for training-operator [\1692](https://github.com/kubeflow/training-operator/pull/1692) ([amitmukati-2604](https://github.com/amitmukati-2604))
- Add myself to reviewer. [\1689](https://github.com/kubeflow/training-operator/pull/1689) ([kuizhiqing](https://github.com/kuizhiqing))
- Upgrade the envtest version [\1687](https://github.com/kubeflow/training-operator/pull/1687) ([tenzen-y](https://github.com/tenzen-y))
- \[chore\] Upgrade some actions version [\1686](https://github.com/kubeflow/training-operator/pull/1686) ([tenzen-y](https://github.com/tenzen-y))
- Upgrade Golangci-lint [\1685](https://github.com/kubeflow/training-operator/pull/1685) ([johnugeorge](https://github.com/johnugeorge))
- Support for k8s v1.25 in CI [\1684](https://github.com/kubeflow/training-operator/pull/1684) ([johnugeorge](https://github.com/johnugeorge))
- Make a generic logger instead of the nil logger on dependent update [\1680](https://github.com/kubeflow/training-operator/pull/1680) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- \[SDK\] Remove Final Keyword from constants [\1676](https://github.com/kubeflow/training-operator/pull/1676) ([andreyvelich](https://github.com/andreyvelich))
- \[PaddlePaddle\] support paddlejob [\1675](https://github.com/kubeflow/training-operator/pull/1675) ([kuizhiqing](https://github.com/kuizhiqing))
- Removed GOARCH dependency for multiarch support [\1674](https://github.com/kubeflow/training-operator/pull/1674) ([pranavpandit1](https://github.com/pranavpandit1))
- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf\_operator [\1669](https://github.com/kubeflow/training-operator/pull/1669) ([dependabot[bot]](https://github.com/apps/dependabot))
- Update deployment.yaml [\1668](https://github.com/kubeflow/training-operator/pull/1668) ([OmriShiv](https://github.com/OmriShiv))
- Upgrade kubernetes version for test [\1667](https://github.com/kubeflow/training-operator/pull/1667) ([tenzen-y](https://github.com/tenzen-y))
- Add PodGroup as controller watch source [\1666](https://github.com/kubeflow/training-operator/pull/1666) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- Upgrade Go version to v1.19 [\1663](https://github.com/kubeflow/training-operator/pull/1663) ([tenzen-y](https://github.com/tenzen-y))
- style: Refine name and signature of 2 replicaName functions [\1660](https://github.com/kubeflow/training-operator/pull/1660) ([houz42](https://github.com/houz42))
- Create TFJob and PyTorchJob from Function APIs in the Training SDK [\1659](https://github.com/kubeflow/training-operator/pull/1659) ([andreyvelich](https://github.com/andreyvelich))
- Update the cmd to support MPI operator in README [\1656](https://github.com/kubeflow/training-operator/pull/1656) ([denkensk](https://github.com/denkensk))
- Update training operator sdk version to 1.5.0 [\1651](https://github.com/kubeflow/training-operator/pull/1651) ([johnugeorge](https://github.com/johnugeorge))
- handle all restart policies [\1649](https://github.com/kubeflow/training-operator/pull/1649) ([abin-thomas-by](https://github.com/abin-thomas-by))
- \[chore\] fix typo [\1648](https://github.com/kubeflow/training-operator/pull/1648) ([tenzen-y](https://github.com/tenzen-y))
- Add finalizers to cluster-role [\1646](https://github.com/kubeflow/training-operator/pull/1646) ([ArangoGutierrez](https://github.com/ArangoGutierrez))
- fix: support MxNet single host training when update mxJob status [\1644](https://github.com/kubeflow/training-operator/pull/1644) ([PeterChg](https://github.com/PeterChg))
- fix: fix mxnet failed to update StartTime and CompletionTime [\1643](https://github.com/kubeflow/training-operator/pull/1643) ([PeterChg](https://github.com/PeterChg))
- Fix the default LeaderElectionID and make it an argument [\1639](https://github.com/kubeflow/training-operator/pull/1639) ([goyalankit](https://github.com/goyalankit))
- fix: fix wrong parameter for resolveControllerRef [\1583](https://github.com/kubeflow/training-operator/pull/1583) ([fighterhit](https://github.com/fighterhit))
- fix: tfjob with restartPolicy=ExitCode does not work [\1562](https://github.com/kubeflow/training-operator/pull/1562) ([cheimu](https://github.com/cheimu))

**Closed issues:**

- The default value for CleanPodPolicy is inconsistent. [\1753](https://github.com/kubeflow/training-operator/issues/1753)
- HPA support for PyTorch Elastic [\1751](https://github.com/kubeflow/training-operator/issues/1751)
- Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [\1745](https://github.com/kubeflow/training-operator/issues/1745)
- paddle-operator can not get podgroup status\(inqueue\) with volcano when enable gang [\1729](https://github.com/kubeflow/training-operator/issues/1729)
- \*job API\(master\) cannot compatible with old job [\1725](https://github.com/kubeflow/training-operator/issues/1725)
- Support coscheduling plugin [\1722](https://github.com/kubeflow/training-operator/issues/1722)
- Number of worker threads used by the controller can't be configured [\1706](https://github.com/kubeflow/training-operator/issues/1706)
- Conformance: Training tests [\1698](https://github.com/kubeflow/training-operator/issues/1698)
- PyTorch and MPI Operator pulls hardcoded initContainer [\1696](https://github.com/kubeflow/training-operator/issues/1696)
- PaddlePaddle Training: why can't find pods [\1694](https://github.com/kubeflow/training-operator/issues/1694)
- Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 [\1693](https://github.com/kubeflow/training-operator/issues/1693)
- \[SDK\] Create unify client for all Training Job types [\1691](https://github.com/kubeflow/training-operator/issues/1691)
- Support Kubernetes v1.25 [\1682](https://github.com/kubeflow/training-operator/issues/1682)
- panic happened when add podgroup watch [\1679](https://github.com/kubeflow/training-operator/issues/1679)
- OnDependentUpdateFunc for Job will panic when enable volcano scheduler [\1678](https://github.com/kubeflow/training-operator/issues/1678)
- There is no clusterrole of "MPI Jobs" in kubeflow 1.5. [\1670](https://github.com/kubeflow/training-operator/issues/1670)
- Change Kubernetes version for test [\1665](https://github.com/kubeflow/training-operator/issues/1665)
- Support for multiplatform container image \(amd64 and arm64\) [\1664](https://github.com/kubeflow/training-operator/issues/1664)
- Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" [\1661](https://github.com/kubeflow/training-operator/issues/1661)
- After setting hostNetwork to true, mpi does not work [\1657](https://github.com/kubeflow/training-operator/issues/1657)
- What is the purpose of /examples/pytorch/elastic/etcd.yaml [\1655](https://github.com/kubeflow/training-operator/issues/1655)
- When will MPIJob support v2beta1 version? [\1653](https://github.com/kubeflow/training-operator/issues/1653)
- Kubernetes HPA doesn't work with elastic PytorchJob [\1645](https://github.com/kubeflow/training-operator/issues/1645)
- training-operator can not get podgroup status\(inqueue\) with volcano when enable gang [\1630](https://github.com/kubeflow/training-operator/issues/1630)
- Training operator fails to create HPA for TorchElastic jobs [\1626](https://github.com/kubeflow/training-operator/issues/1626)

1.6.0rc.0

1.5.0

[Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.4.0...v1.5.0)

**New Features**
- Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob [\1610](https://github.com/kubeflow/training-operator/pull/1610) ([tenzen-y](https://github.com/tenzen-y))
- Add all generation tools to Makefile [\1609](https://github.com/kubeflow/training-operator/pull/1609) ([johnugeorge](https://github.com/johnugeorge))
- Adding MPI python sdk [\1608](https://github.com/kubeflow/training-operator/pull/1608) ([johnugeorge](https://github.com/johnugeorge))
- Adding XGBoost Python sdk [\1607](https://github.com/kubeflow/training-operator/pull/1607) ([johnugeorge](https://github.com/johnugeorge))
- Generating MPI python sdk [\1606](https://github.com/kubeflow/training-operator/pull/1606) ([johnugeorge](https://github.com/johnugeorge))
- Update k8s dependencies to v0.24.1 [\1604](https://github.com/kubeflow/training-operator/pull/1604) ([johnugeorge](https://github.com/johnugeorge))
- Migrate test framework to GHA [\1603](https://github.com/kubeflow/training-operator/pull/1603) ([johnugeorge](https://github.com/johnugeorge))
- Add mpi in update-codegen.sh [\1600](https://github.com/kubeflow/training-operator/pull/1600) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- MXNet SDK with Status check fix [\1618](https://github.com/kubeflow/training-operator/pull/1618) ([johnugeorge](https://github.com/johnugeorge))

**Bug Fixes**
- fix: MPIJob worker still running when NotEnoughResources [\1621](https://github.com/kubeflow/training-operator/pull/1621) ([hackerboy01](https://github.com/hackerboy01))
- fix comments for pytorch-controller [\1620](https://github.com/kubeflow/training-operator/pull/1620) ([hackerboy01](https://github.com/hackerboy01))
- fix: requeue when expire time is not up yet [\1614](https://github.com/kubeflow/training-operator/pull/1614) ([Garrybest](https://github.com/Garrybest))
- Look for fully-qualified job role label in Python sdk [\1588](https://github.com/kubeflow/training-operator/pull/1588) ([person142](https://github.com/person142))
- fix torch env typo [\1573](https://github.com/kubeflow/training-operator/pull/1573) ([kuizhiqing](https://github.com/kuizhiqing))
- Restart job on failure for Always,OnFailure Policy [\1572](https://github.com/kubeflow/training-operator/pull/1572) ([georgkaleido](https://github.com/georgkaleido))
- Increase success threshold [\1568](https://github.com/kubeflow/training-operator/pull/1568) ([haoxins](https://github.com/haoxins))
- update status.startTime for pytorchjob and xgboostjob [\1567](https://github.com/kubeflow/training-operator/pull/1567) ([cheimu](https://github.com/cheimu))
- fix: add mpijobs to kubeflow training role [\1565](https://github.com/kubeflow/training-operator/pull/1565) ([henrysecond1](https://github.com/henrysecond1))
- fix PyTorchJob status inaccuracy when task replica scale down [\1593](https://github.com/kubeflow/training-operator/pull/1593) ([PeterChg](https://github.com/PeterChg))
- fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set [\1557](https://github.com/kubeflow/training-operator/pull/1557) ([cheimu](https://github.com/cheimu))
- fix api reader issue [\1551](https://github.com/kubeflow/training-operator/pull/1551) ([zw0610](https://github.com/zw0610))
- fix label and CleanPodPolicy for mpi-controller [\1550](https://github.com/kubeflow/training-operator/pull/1550) ([zw0610](https://github.com/zw0610))
- fix UpdateJobStatusInApiServer when gang-scheduling is enabled [\1549](https://github.com/kubeflow/training-operator/pull/1549) ([zw0610](https://github.com/zw0610))
- fix: add namespace filtering when getting pods/services for jobs [\1545](https://github.com/kubeflow/training-operator/pull/1545) ([henrysecond1](https://github.com/henrysecond1))
- fix: set mpijob runPolicy.cleanPodPolicy to default none [\1554](https://github.com/kubeflow/training-operator/pull/1554) ([cheimu](https://github.com/cheimu))

Misc

- Update training controller image to latest [\1625](https://github.com/kubeflow/training-operator/pull/1625) ([johnugeorge](https://github.com/johnugeorge))
- Update SDK version to 1.5.0 [\1624](https://github.com/kubeflow/training-operator/pull/1624) ([johnugeorge](https://github.com/johnugeorge))
- Upgrade common to v0.4.3 [\1623](https://github.com/kubeflow/training-operator/pull/1623) ([johnugeorge](https://github.com/johnugeorge))
- Adding GHA for automatic image build and push [\1615](https://github.com/kubeflow/training-operator/pull/1615) ([johnugeorge](https://github.com/johnugeorge))
- Remove presubmit test depending on optional-test-infra [\1596](https://github.com/kubeflow/training-operator/pull/1596) ([aws-kf-ci-bot](https://github.com/aws-kf-ci-bot))
- chore: stop action on first fail [\1595](https://github.com/kubeflow/training-operator/pull/1595) ([jasonliu747](https://github.com/jasonliu747))
- update img url in design doc [\1591](https://github.com/kubeflow/training-operator/pull/1591) ([zw0610](https://github.com/zw0610))
- Remove uncalled mpi-controller DeletePodsAndServices\(\) [\1558](https://github.com/kubeflow/training-operator/pull/1558) ([cheimu](https://github.com/cheimu))
- Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy [\1556](https://github.com/kubeflow/training-operator/pull/1556) ([cheimu](https://github.com/cheimu))
- Remove `table-logger` dependency [\1544](https://github.com/kubeflow/training-operator/pull/1544) ([person142](https://github.com/person142))
- Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf\_operator [\1542](https://github.com/kubeflow/training-operator/pull/1542) ([dependabot[bot]](https://github.com/apps/dependabot))

1.5.0rc.0

[Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.4.0...v1.5.0-rc.0)

**Closed issues:**

- MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? [\1617](https://github.com/kubeflow/training-operator/issues/1617)
- unable to fetch TFJob when I use client.go run tfjob [\1612](https://github.com/kubeflow/training-operator/issues/1612)
- Pytorchjob dist-mnist no training logs [\1601](https://github.com/kubeflow/training-operator/issues/1601)
- kubectl get tfjob -o yaml, but not status output [\1598](https://github.com/kubeflow/training-operator/issues/1598)
- missing image in tf\_job\_design\_doc.md [\1590](https://github.com/kubeflow/training-operator/issues/1590)
- Labels in Python client are out of date [\1587](https://github.com/kubeflow/training-operator/issues/1587)
- PyTorchJob Pods "Not Ready" After Completing Training [\1577](https://github.com/kubeflow/training-operator/issues/1577)
- cannot use "github.com/go-openapi/spec".Schema{...} \(type "github.com/go-openapi/spec".Schema\) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value [\1576](https://github.com/kubeflow/training-operator/issues/1576)
- PyTorchJob: OnFailure Policy won't handle pod failure gracefully [\1570](https://github.com/kubeflow/training-operator/issues/1570)
- pytorchjob doesn't have status.startTime. [\1566](https://github.com/kubeflow/training-operator/issues/1566)
- Optional-test-infra Deprecation Notice - Training [\1561](https://github.com/kubeflow/training-operator/issues/1561)
- Should we update MPIJob unit test CleanPodPolicy field? [\1555](https://github.com/kubeflow/training-operator/issues/1555)
- --enable-gang-scheduling=true doesn't work for MPIJob [\1548](https://github.com/kubeflow/training-operator/issues/1548)
- PyTorchJob fails when creating a task with a different namespace but the same name [\1543](https://github.com/kubeflow/training-operator/issues/1543)
- Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: \"null\" after enable-gang-scheduling [\1538](https://github.com/kubeflow/training-operator/issues/1538)
- Job TTLs not working [\1533](https://github.com/kubeflow/training-operator/issues/1533)
- Support PodGroup in scheduler-plugins/coscheduling [\1518](https://github.com/kubeflow/training-operator/issues/1518)
- support elastic training [\1515](https://github.com/kubeflow/training-operator/issues/1515)
- Modified the configuration of RootLogger [\1514](https://github.com/kubeflow/training-operator/issues/1514)
- Add checking import order in CI [\1510](https://github.com/kubeflow/training-operator/issues/1510)
- Scale down of pytorchJob cause workers pod to restart [\1509](https://github.com/kubeflow/training-operator/issues/1509)
- Support label selector based success/failure conditions [\1507](https://github.com/kubeflow/training-operator/issues/1507)
- \[feat\] Support SuccessPolicy in PyTorchJob [\1505](https://github.com/kubeflow/training-operator/issues/1505)
- pytorch elastic scheduler error [\1504](https://github.com/kubeflow/training-operator/issues/1504)
- Could you add the example of MPIJob in this repository [\1502](https://github.com/kubeflow/training-operator/issues/1502)
- \[Feature\] Create a Informer/ClientSet for PyTorch Jobs [\1499](https://github.com/kubeflow/training-operator/issues/1499)
- \[feature\] Make init container injection logic available to all jobs [\1498](https://github.com/kubeflow/training-operator/issues/1498)
- Roadmaps for 1.4 release [\1496](https://github.com/kubeflow/training-operator/issues/1496)
- \[bug\] \(MpiJob\)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. [\1494](https://github.com/kubeflow/training-operator/issues/1494)
- Reconcile PyTorch Job error Operation cannot be fulfilled on pytrochjobs.kubeflow.org [\1492](https://github.com/kubeflow/training-operator/issues/1492)
- Python PytorchJob: no attribute openapi\_types for example code [\1481](https://github.com/kubeflow/training-operator/issues/1481)
- PyTorch DistributedDataParallel training with multi nodes [\1475](https://github.com/kubeflow/training-operator/issues/1475)
- Installing kubeflow-training breaks import for other kubeflow packages \(katib, fairing, etc.\) [\1471](https://github.com/kubeflow/training-operator/issues/1471)
- Deprecate ksonnet and use python/golang to submit jobs [\1468](https://github.com/kubeflow/training-operator/issues/1468)
- Help Wanted in ParameterServerStrategy Example. [\1459](https://github.com/kubeflow/training-operator/issues/1459)
- Bug: SomeTimes Coredumped using tfjob [\1456](https://github.com/kubeflow/training-operator/issues/1456)
- \[question\] PyTorchJob MNIST example training speed [\1454](https://github.com/kubeflow/training-operator/issues/1454)
- tfjob status not match when EnableDynamicWorker set true [\1452](https://github.com/kubeflow/training-operator/issues/1452)
- training-operator set scheduler error [\1447](https://github.com/kubeflow/training-operator/issues/1447)
- \[sdk\]: Replace `TableLogger` component in the SDK for better support with `ipykernel>=6.x` [\1446](https://github.com/kubeflow/training-operator/issues/1446)
- SDK: wait\_for\_job reports typeError [\1445](https://github.com/kubeflow/training-operator/issues/1445)
- Update prometheus monitoring doc [\1443](https://github.com/kubeflow/training-operator/issues/1443)
- Master branch should provide a nightly image [\1433](https://github.com/kubeflow/training-operator/issues/1433)
- Clean up test folder before testing [\1429](https://github.com/kubeflow/training-operator/issues/1429)
- Clean up TF specific docs [\1424](https://github.com/kubeflow/training-operator/issues/1424)
- \[feature\] Support SchedulingPolicy in PyTorchJob [\1414](https://github.com/kubeflow/training-operator/issues/1414)
- Hyperlinks in the "Overview" section are incorrect/not found [\1411](https://github.com/kubeflow/training-operator/issues/1411)
- add workqueue metric [\1407](https://github.com/kubeflow/training-operator/issues/1407)
- Validation fails for MXJob Tune example [\1402](https://github.com/kubeflow/training-operator/issues/1402)
- Rate exceeded for aws ecr image [\1400](https://github.com/kubeflow/training-operator/issues/1400)
- change layout to follow the standard of kubebuilder? [\1397](https://github.com/kubeflow/training-operator/issues/1397)
- \[example\] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist [\1393](https://github.com/kubeflow/training-operator/issues/1393)
- Update kubeflow/website for 1.4 release [\1392](https://github.com/kubeflow/training-operator/issues/1392)
- Cut beta release of tf-operator for 1.4 release [\1385](https://github.com/kubeflow/training-operator/issues/1385)
- "invalid memory address or nil pointer dereference" [\1382](https://github.com/kubeflow/training-operator/issues/1382)
- some questions about job sync [\1379](https://github.com/kubeflow/training-operator/issues/1379)
- Provides a default Grafana dashboard [\1376](https://github.com/kubeflow/training-operator/issues/1376)
- \[feature\] Support different PS/worker types [\1369](https://github.com/kubeflow/training-operator/issues/1369)
- Need to copy all \(mainly pytorch\) framework's example dir to tf-operator/examples [\1366](https://github.com/kubeflow/training-operator/issues/1366)
- Add more CRD validations markers to block invalid job on client apply [\1363](https://github.com/kubeflow/training-operator/issues/1363)
- Update presubmit and post submit job triggers [\1354](https://github.com/kubeflow/training-operator/issues/1354)
- Optimize post submit jobs flow [\1353](https://github.com/kubeflow/training-operator/issues/1353)
- Enable leader election in controller manager using controllermanagerconfig [\1350](https://github.com/kubeflow/training-operator/issues/1350)
- Support mpi jobs in universal operator [\1345](https://github.com/kubeflow/training-operator/issues/1345)
- post-submit job failure in master branch [\1343](https://github.com/kubeflow/training-operator/issues/1343)
- Improve observability of universal operator [\1340](https://github.com/kubeflow/training-operator/issues/1340)
- Best practice to organize main.go and Dockerfile? [\1333](https://github.com/kubeflow/training-operator/issues/1333)
- Should training operator keep clientset in the same repository? [\1332](https://github.com/kubeflow/training-operator/issues/1332)
- Test image has incorrect tag? [\1329](https://github.com/kubeflow/training-operator/issues/1329)
- Prepare e2e tests for all frameworks [\1323](https://github.com/kubeflow/training-operator/issues/1323)
- Reduce e2e replica-restart-policy-tests running time [\1319](https://github.com/kubeflow/training-operator/issues/1319)
- Improve logs structure by consolidating libs from controller runtime and controllers [\1313](https://github.com/kubeflow/training-operator/issues/1313)
- Enable tests for all frameworks [\1311](https://github.com/kubeflow/training-operator/issues/1311)
- \[bug\] The pod will be recreated until the expectation expires [\1306](https://github.com/kubeflow/training-operator/issues/1306)
- Upgrade CRDs to apiextensions.k8s.io/v1 [\1304](https://github.com/kubeflow/training-operator/issues/1304)
- Add role details as new columns to `kubectl get jobs` output for CRD. [\1301](https://github.com/kubeflow/training-operator/issues/1301)
- How to handle long pending pods in a TF-job? [\1282](https://github.com/kubeflow/training-operator/issues/1282)
- Could you release a new version of Python SDK [\1279](https://github.com/kubeflow/training-operator/issues/1279)
- Update swagger.json schema for TFJobSpec to include RunPolicy [\1278](https://github.com/kubeflow/training-operator/issues/1278)
- Not able to pass environment variable from tfjob to pod [\1273](https://github.com/kubeflow/training-operator/issues/1273)
- v1\_time.py is not generated by hack/python-sdk/gen-sdk.sh [\1271](https://github.com/kubeflow/training-operator/issues/1271)
- Add a step to upload artifact [\1258](https://github.com/kubeflow/training-operator/issues/1258)
- \[feature\] Support multi port in TFJob [\1251](https://github.com/kubeflow/training-operator/issues/1251)
- \[feat\] Add scale subresource [\1220](https://github.com/kubeflow/training-operator/issues/1220)
- Pod get re-created after it exited and get garbage collected [\1186](https://github.com/kubeflow/training-operator/issues/1186)
- Clean up vendor dependencies [\1162](https://github.com/kubeflow/training-operator/issues/1162)

**Merged pull requests:**

- Update training controller image to latest [\1625](https://github.com/kubeflow/training-operator/pull/1625) ([johnugeorge](https://github.com/johnugeorge))
- Update SDK version to 1.5.0 [\1624](https://github.com/kubeflow/training-operator/pull/1624) ([johnugeorge](https://github.com/johnugeorge))
- Upgrade common to v0.4.3 [\1623](https://github.com/kubeflow/training-operator/pull/1623) ([johnugeorge](https://github.com/johnugeorge))
- fix: MPIJob worker still running when NotEnoughResources [\1621](https://github.com/kubeflow/training-operator/pull/1621) ([hackerboy01](https://github.com/hackerboy01))
- fix comments for pytorch-controller [\1620](https://github.com/kubeflow/training-operator/pull/1620) ([hackerboy01](https://github.com/hackerboy01))
- MXNet SDK with Status check fix [\1618](https://github.com/kubeflow/training-operator/pull/1618) ([johnugeorge](https://github.com/johnugeorge))
- Adding GHA for automatic image build and push [\1615](https://github.com/kubeflow/training-operator/pull/1615) ([johnugeorge](https://github.com/johnugeorge))
- fix: requeue when expire time is not up yet [\1614](https://github.com/kubeflow/training-operator/pull/1614) ([Garrybest](https://github.com/Garrybest))
- Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob [\1610](https://github.com/kubeflow/training-operator/pull/1610) ([tenzen-y](https://github.com/tenzen-y))
- Add all generation tools to Makefile [\1609](https://github.com/kubeflow/training-operator/pull/1609) ([johnugeorge](https://github.com/johnugeorge))
- Adding MPI python sdk [\1608](https://github.com/kubeflow/training-operator/pull/1608) ([johnugeorge](https://github.com/johnugeorge))
- Adding XGboost Python sdk [\1607](https://github.com/kubeflow/training-operator/pull/1607) ([johnugeorge](https://github.com/johnugeorge))
- Generating MPI python sdk [\1606](https://github.com/kubeflow/training-operator/pull/1606) ([johnugeorge](https://github.com/johnugeorge))
- Update k8s dependencies to v0.24.1 [\1604](https://github.com/kubeflow/training-operator/pull/1604) ([johnugeorge](https://github.com/johnugeorge))
- Migrate test framework to GHA [\1603](https://github.com/kubeflow/training-operator/pull/1603) ([johnugeorge](https://github.com/johnugeorge))
- Add mpi in update-codegen.sh [\1600](https://github.com/kubeflow/training-operator/pull/1600) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- Remove presubmit test depending on optional-test-infra [\1596](https://github.com/kubeflow/training-operator/pull/1596) ([aws-kf-ci-bot](https://github.com/aws-kf-ci-bot))
- chore: stop action on first fail [\1595](https://github.com/kubeflow/training-operator/pull/1595) ([jasonliu747](https://github.com/jasonliu747))
- fix PyTorchJob status inaccuracy when task replica scale down [\1593](https://github.com/kubeflow/training-operator/pull/1593) ([PeterChg](https://github.com/PeterChg))
- update img url in design doc [\1591](https://github.com/kubeflow/training-operator/pull/1591) ([zw0610](https://github.com/zw0610))
- Look for fully-qualified job role label in Python sdk [\1588](https://github.com/kubeflow/training-operator/pull/1588) ([person142](https://github.com/person142))
- fix torch env typo [\1573](https://github.com/kubeflow/training-operator/pull/1573) ([kuizhiqing](https://github.com/kuizhiqing))
- Restart job on failure for Always,OnFailure Policy [\1572](https://github.com/kubeflow/training-operator/pull/1572) ([georgkaleido](https://github.com/georgkaleido))
- Increase success threshold [\1568](https://github.com/kubeflow/training-operator/pull/1568) ([haoxins](https://github.com/haoxins))
- update status.startTime for pytorchjob and xgboostjob [\1567](https://github.com/kubeflow/training-operator/pull/1567) ([cheimu](https://github.com/cheimu))
- fix: add mpijobs to kubeflow training role [\1565](https://github.com/kubeflow/training-operator/pull/1565) ([henrysecond1](https://github.com/henrysecond1))
- Remove uncalled mpi-controller DeletePodsAndServices\(\) [\1558](https://github.com/kubeflow/training-operator/pull/1558) ([cheimu](https://github.com/cheimu))
- fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set [\1557](https://github.com/kubeflow/training-operator/pull/1557) ([cheimu](https://github.com/cheimu))
- Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy [\1556](https://github.com/kubeflow/training-operator/pull/1556) ([cheimu](https://github.com/cheimu))
- fix: set mpijob runPolicy.cleanPodPolicy to default none [\1554](https://github.com/kubeflow/training-operator/pull/1554) ([cheimu](https://github.com/cheimu))
- fix api reader issue [\1551](https://github.com/kubeflow/training-operator/pull/1551) ([zw0610](https://github.com/zw0610))
- fix label and CleanPodPolicy for mpi-controller [\1550](https://github.com/kubeflow/training-operator/pull/1550) ([zw0610](https://github.com/zw0610))
- fix UpdateJobStatusInApiServer when gang-scheduling is enabled [\1549](https://github.com/kubeflow/training-operator/pull/1549) ([zw0610](https://github.com/zw0610))
- fix: add namespace filtering when getting pods/services for jobs [\1545](https://github.com/kubeflow/training-operator/pull/1545) ([henrysecond1](https://github.com/henrysecond1))
- Remove `table-logger` dependency [\1544](https://github.com/kubeflow/training-operator/pull/1544) ([person142](https://github.com/person142))
- Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf\_operator [\1542](https://github.com/kubeflow/training-operator/pull/1542) ([dependabot[bot]](https://github.com/apps/dependabot))
- Release Python SDK 1.4.0 [\1541](https://github.com/kubeflow/training-operator/pull/1541) ([alembiewski](https://github.com/alembiewski))
- mod: Upgrade ginkgo to v2 [\1537](https://github.com/kubeflow/training-operator/pull/1537) ([haoxins](https://github.com/haoxins))
- docs: Fix broken links in quick-start-v1.md [\1536](https://github.com/kubeflow/training-operator/pull/1536) ([nakamasato](https://github.com/nakamasato))
- extends path in \_\_init\_\_.py for SDK correctly [\1531](https://github.com/kubeflow/training-operator/pull/1531) ([cakeislife100](https://github.com/cakeislife100))
- chore: Update changelog for v1.4.0-rc.0 release [\1528](https://github.com/kubeflow/training-operator/pull/1528) ([terrytangyuan](https://github.com/terrytangyuan))
