[Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.4.0...v1.5.0-rc.0)

**Closed issues:**

- MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? [\#1617](https://github.com/kubeflow/training-operator/issues/1617)
- unable to fetch TFJob when I use client.go run tfjob [\#1612](https://github.com/kubeflow/training-operator/issues/1612)
- Pytorchjob dist-mnist no training logs [\#1601](https://github.com/kubeflow/training-operator/issues/1601)
- kubectl get tfjob -o yaml, but not status output [\#1598](https://github.com/kubeflow/training-operator/issues/1598)
- missing image in tf\_job\_design\_doc.md [\#1590](https://github.com/kubeflow/training-operator/issues/1590)
- Labels in Python client are out of date [\#1587](https://github.com/kubeflow/training-operator/issues/1587)
- PyTorchJob Pods "Not Ready" After Completing Training [\#1577](https://github.com/kubeflow/training-operator/issues/1577)
- cannot use "github.com/go-openapi/spec".Schema{...} \(type "github.com/go-openapi/spec".Schema\) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value [\#1576](https://github.com/kubeflow/training-operator/issues/1576)
- PyTorchJob: OnFailure Policy won't handle pod failure gracefully [\#1570](https://github.com/kubeflow/training-operator/issues/1570)
- pytorchjob doesn't have status.startTime. [\#1566](https://github.com/kubeflow/training-operator/issues/1566)
- Optional-test-infra Deprecation Notice - Training [\#1561](https://github.com/kubeflow/training-operator/issues/1561)
- Should we update MPIJob unit test CleanPodPolicy field? [\#1555](https://github.com/kubeflow/training-operator/issues/1555)
- --enable-gang-scheduling=true doesn't work for MPIJob [\#1548](https://github.com/kubeflow/training-operator/issues/1548)
- PyTorchJob fails when creating a task with a different namespace but the same name [\#1543](https://github.com/kubeflow/training-operator/issues/1543)
- Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: \"null\" after enable-gang-scheduling [\#1538](https://github.com/kubeflow/training-operator/issues/1538)
- Job TTLs not working [\#1533](https://github.com/kubeflow/training-operator/issues/1533)
- Support PodGroup in scheduler-plugins/coscheduling [\#1518](https://github.com/kubeflow/training-operator/issues/1518)
- support elastic training [\#1515](https://github.com/kubeflow/training-operator/issues/1515)
- Modified the configuration of RootLogger [\#1514](https://github.com/kubeflow/training-operator/issues/1514)
- Add checking import order in CI [\#1510](https://github.com/kubeflow/training-operator/issues/1510)
- Scale down of pytorchJob cause workers pod to restart [\#1509](https://github.com/kubeflow/training-operator/issues/1509)
- Support label selector based success/failure conditions [\#1507](https://github.com/kubeflow/training-operator/issues/1507)
- \[feat\] Support SuccessPolicy in PyTorchJob [\#1505](https://github.com/kubeflow/training-operator/issues/1505)
- pytorch elastic scheduler error [\#1504](https://github.com/kubeflow/training-operator/issues/1504)
- Could you add the example of MPIJob in this repository [\#1502](https://github.com/kubeflow/training-operator/issues/1502)
- \[Feature\] Create a Informer/ClientSet for PyTorch Jobs [\#1499](https://github.com/kubeflow/training-operator/issues/1499)
- \[feature\] Make init container injection logic available to all jobs [\#1498](https://github.com/kubeflow/training-operator/issues/1498)
- Roadmaps for 1.4 release [\#1496](https://github.com/kubeflow/training-operator/issues/1496)
- \[bug\] \(MpiJob\)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. [\#1494](https://github.com/kubeflow/training-operator/issues/1494)
- Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org [\#1492](https://github.com/kubeflow/training-operator/issues/1492)
- Python PytorchJob: no attribute openapi\_types for example code [\#1481](https://github.com/kubeflow/training-operator/issues/1481)
- PyTorch DistributedDataParallel training with multi nodes [\#1475](https://github.com/kubeflow/training-operator/issues/1475)
- Installing kubeflow-training breaks import for other kubeflow packages \(katib, fairing, etc.\) [\#1471](https://github.com/kubeflow/training-operator/issues/1471)
- Deprecate ksonnet and use python/golang to submit jobs [\#1468](https://github.com/kubeflow/training-operator/issues/1468)
- Help Wanted in ParameterServerStrategy Example. [\#1459](https://github.com/kubeflow/training-operator/issues/1459)
- Bug: SomeTimes Coredumped using tfjob [\#1456](https://github.com/kubeflow/training-operator/issues/1456)
- \[question\] PyTorchJob MNIST example training speed [\#1454](https://github.com/kubeflow/training-operator/issues/1454)
- tfjob status not match when EnableDynamicWorker set true [\#1452](https://github.com/kubeflow/training-operator/issues/1452)
- training-operator set scheduler error [\#1447](https://github.com/kubeflow/training-operator/issues/1447)
- \[sdk\]: Replace `TableLogger` component in the SDK for better support with `ipykernel>=6.x` [\#1446](https://github.com/kubeflow/training-operator/issues/1446)
- SDK: wait\_for\_job reports typeError [\#1445](https://github.com/kubeflow/training-operator/issues/1445)
- Update prometheus monitoring doc [\#1443](https://github.com/kubeflow/training-operator/issues/1443)
- Master branch should provide a nightly image [\#1433](https://github.com/kubeflow/training-operator/issues/1433)
- Clean up test folder before testing [\#1429](https://github.com/kubeflow/training-operator/issues/1429)
- Clean up TF specific docs [\#1424](https://github.com/kubeflow/training-operator/issues/1424)
- \[feature\] Support SchedulingPolicy in PyTorchJob [\#1414](https://github.com/kubeflow/training-operator/issues/1414)
- Hyperlinks in the "Overview" section are incorrect/not found [\#1411](https://github.com/kubeflow/training-operator/issues/1411)
- add workqueue metric [\#1407](https://github.com/kubeflow/training-operator/issues/1407)
- Validation fails for MXJob Tune example [\#1402](https://github.com/kubeflow/training-operator/issues/1402)
- Rate exceeded for aws ecr image [\#1400](https://github.com/kubeflow/training-operator/issues/1400)
- change layout to follow the standard of kubebuilder? [\#1397](https://github.com/kubeflow/training-operator/issues/1397)
- \[example\] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist [\#1393](https://github.com/kubeflow/training-operator/issues/1393)
- Update kubeflow/website for 1.4 release [\#1392](https://github.com/kubeflow/training-operator/issues/1392)
- Cut beta release of tf-operator for 1.4 release [\#1385](https://github.com/kubeflow/training-operator/issues/1385)
- "invalid memory address or nil pointer dereference" [\#1382](https://github.com/kubeflow/training-operator/issues/1382)
- some questions about job sync [\#1379](https://github.com/kubeflow/training-operator/issues/1379)
- Provides a default Grafana dashboard [\#1376](https://github.com/kubeflow/training-operator/issues/1376)
- \[feature\] Support different PS/worker types [\#1369](https://github.com/kubeflow/training-operator/issues/1369)
- Need to copy all \(mainly pytorch\) framework's example dir to tf-operator/examples [\#1366](https://github.com/kubeflow/training-operator/issues/1366)
- Add more CRD validations markers to block invalid job on client apply [\#1363](https://github.com/kubeflow/training-operator/issues/1363)
- Update presubmit and post submit job triggers [\#1354](https://github.com/kubeflow/training-operator/issues/1354)
- Optimize post submit jobs flow [\#1353](https://github.com/kubeflow/training-operator/issues/1353)
- Enable leader election in controller manager using controllermanagerconfig [\#1350](https://github.com/kubeflow/training-operator/issues/1350)
- Support mpi jobs in universal operator [\#1345](https://github.com/kubeflow/training-operator/issues/1345)
- post-submit job failure in master branch [\#1343](https://github.com/kubeflow/training-operator/issues/1343)
- Improve observability of universal operator [\#1340](https://github.com/kubeflow/training-operator/issues/1340)
- Best practice to organize main.go and Dockerfile? [\#1333](https://github.com/kubeflow/training-operator/issues/1333)
- Should training operator keep clientset in the same repository? [\#1332](https://github.com/kubeflow/training-operator/issues/1332)
- Test image has incorrect tag? [\#1329](https://github.com/kubeflow/training-operator/issues/1329)
- Prepare e2e tests for all frameworks [\#1323](https://github.com/kubeflow/training-operator/issues/1323)
- Reduce e2e replica-restart-policy-tests running time [\#1319](https://github.com/kubeflow/training-operator/issues/1319)
- Improve logs structure by consolidating libs from controller runtime and controllers [\#1313](https://github.com/kubeflow/training-operator/issues/1313)
- Enable tests for all frameworks [\#1311](https://github.com/kubeflow/training-operator/issues/1311)
- \[bug\] The pod will be recreated until the expectation expires [\#1306](https://github.com/kubeflow/training-operator/issues/1306)
- Upgrade CRDs to apiextensions.k8s.io/v1 [\#1304](https://github.com/kubeflow/training-operator/issues/1304)
- Add role details as new columns to `kubectl get jobs` output for CRD. [\#1301](https://github.com/kubeflow/training-operator/issues/1301)
- How to handle long pending pods in a TF-job? [\#1282](https://github.com/kubeflow/training-operator/issues/1282)
- Could you release a new version of Python SDK [\#1279](https://github.com/kubeflow/training-operator/issues/1279)
- Update swagger.json schema for TFJobSpec to include RunPolicy [\#1278](https://github.com/kubeflow/training-operator/issues/1278)
- Not able to pass environment variable from tfjob to pod [\#1273](https://github.com/kubeflow/training-operator/issues/1273)
- v1\_time.py is not generated by hack/python-sdk/gen-sdk.sh [\#1271](https://github.com/kubeflow/training-operator/issues/1271)
- Add a step to upload artifact [\#1258](https://github.com/kubeflow/training-operator/issues/1258)
- \[feature\] Support multi port in TFJob [\#1251](https://github.com/kubeflow/training-operator/issues/1251)
- \[feat\] Add scale subresource [\#1220](https://github.com/kubeflow/training-operator/issues/1220)
- Pod get re-created after it exited and get garbage collected [\#1186](https://github.com/kubeflow/training-operator/issues/1186)
- Clean up vendor dependencies [\#1162](https://github.com/kubeflow/training-operator/issues/1162)

**Merged pull requests:**

- Update training controller image to latest [\#1625](https://github.com/kubeflow/training-operator/pull/1625) ([johnugeorge](https://github.com/johnugeorge))
- Update SDK version to 1.5.0 [\#1624](https://github.com/kubeflow/training-operator/pull/1624) ([johnugeorge](https://github.com/johnugeorge))
- Upgrade common to v0.4.3 [\#1623](https://github.com/kubeflow/training-operator/pull/1623) ([johnugeorge](https://github.com/johnugeorge))
- fix: MPIJob worker still running when NotEnoughResources [\#1621](https://github.com/kubeflow/training-operator/pull/1621) ([hackerboy01](https://github.com/hackerboy01))
- fix comments for pytorch-controller [\#1620](https://github.com/kubeflow/training-operator/pull/1620) ([hackerboy01](https://github.com/hackerboy01))
- MXNet SDK with Status check fix [\#1618](https://github.com/kubeflow/training-operator/pull/1618) ([johnugeorge](https://github.com/johnugeorge))
- Adding GHA for automatic image build and push [\#1615](https://github.com/kubeflow/training-operator/pull/1615) ([johnugeorge](https://github.com/johnugeorge))
- fix: requeue when expire time is not up yet [\#1614](https://github.com/kubeflow/training-operator/pull/1614) ([Garrybest](https://github.com/Garrybest))
- Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob [\#1610](https://github.com/kubeflow/training-operator/pull/1610) ([tenzen-y](https://github.com/tenzen-y))
- Add all generation tools to Makefile [\#1609](https://github.com/kubeflow/training-operator/pull/1609) ([johnugeorge](https://github.com/johnugeorge))
- Adding MPI python sdk [\#1608](https://github.com/kubeflow/training-operator/pull/1608) ([johnugeorge](https://github.com/johnugeorge))
- Adding XGBoost Python sdk [\#1607](https://github.com/kubeflow/training-operator/pull/1607) ([johnugeorge](https://github.com/johnugeorge))
- Generating MPI python sdk [\#1606](https://github.com/kubeflow/training-operator/pull/1606) ([johnugeorge](https://github.com/johnugeorge))
- Update k8s dependencies to v0.24.1 [\#1604](https://github.com/kubeflow/training-operator/pull/1604) ([johnugeorge](https://github.com/johnugeorge))
- Migrate test framework to GHA [\#1603](https://github.com/kubeflow/training-operator/pull/1603) ([johnugeorge](https://github.com/johnugeorge))
- Add mpi in update-codegen.sh [\#1600](https://github.com/kubeflow/training-operator/pull/1600) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg))
- Remove presubmit test depending on optional-test-infra [\#1596](https://github.com/kubeflow/training-operator/pull/1596) ([aws-kf-ci-bot](https://github.com/aws-kf-ci-bot))
- chore: stop action on first fail [\#1595](https://github.com/kubeflow/training-operator/pull/1595) ([jasonliu747](https://github.com/jasonliu747))
- fix PyTorchJob status inaccuracy when task replica scale down [\#1593](https://github.com/kubeflow/training-operator/pull/1593) ([PeterChg](https://github.com/PeterChg))
- update img url in design doc [\#1591](https://github.com/kubeflow/training-operator/pull/1591) ([zw0610](https://github.com/zw0610))
- Look for fully-qualified job role label in Python sdk [\#1588](https://github.com/kubeflow/training-operator/pull/1588) ([person142](https://github.com/person142))
- fix torch env typo [\#1573](https://github.com/kubeflow/training-operator/pull/1573) ([kuizhiqing](https://github.com/kuizhiqing))
- Restart job on failure for Always,OnFailure Policy [\#1572](https://github.com/kubeflow/training-operator/pull/1572) ([georgkaleido](https://github.com/georgkaleido))
- Increase success threshold [\#1568](https://github.com/kubeflow/training-operator/pull/1568) ([haoxins](https://github.com/haoxins))
- update status.startTime for pytorchjob and xgboostjob [\#1567](https://github.com/kubeflow/training-operator/pull/1567) ([cheimu](https://github.com/cheimu))
- fix: add mpijobs to kubeflow training role [\#1565](https://github.com/kubeflow/training-operator/pull/1565) ([henrysecond1](https://github.com/henrysecond1))
- Remove uncalled mpi-controller DeletePodsAndServices\(\) [\#1558](https://github.com/kubeflow/training-operator/pull/1558) ([cheimu](https://github.com/cheimu))
- fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set [\#1557](https://github.com/kubeflow/training-operator/pull/1557) ([cheimu](https://github.com/cheimu))
- Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy [\#1556](https://github.com/kubeflow/training-operator/pull/1556) ([cheimu](https://github.com/cheimu))
- fix: set mpijob runPolicy.cleanPodPolicy to default none [\#1554](https://github.com/kubeflow/training-operator/pull/1554) ([cheimu](https://github.com/cheimu))
- fix api reader issue [\#1551](https://github.com/kubeflow/training-operator/pull/1551) ([zw0610](https://github.com/zw0610))
- fix label and CleanPodPolicy for mpi-controller [\#1550](https://github.com/kubeflow/training-operator/pull/1550) ([zw0610](https://github.com/zw0610))
- fix UpdateJobStatusInApiServer when gang-scheduling is enabled [\#1549](https://github.com/kubeflow/training-operator/pull/1549) ([zw0610](https://github.com/zw0610))
- fix: add namespace filtering when getting pods/services for jobs [\#1545](https://github.com/kubeflow/training-operator/pull/1545) ([henrysecond1](https://github.com/henrysecond1))
- Remove `table-logger` dependency [\#1544](https://github.com/kubeflow/training-operator/pull/1544) ([person142](https://github.com/person142))
- Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf\_operator [\#1542](https://github.com/kubeflow/training-operator/pull/1542) ([dependabot[bot]](https://github.com/apps/dependabot))
- Release Python SDK 1.4.0 [\#1541](https://github.com/kubeflow/training-operator/pull/1541) ([alembiewski](https://github.com/alembiewski))
- mod: Upgrade ginkgo to v2 [\#1537](https://github.com/kubeflow/training-operator/pull/1537) ([haoxins](https://github.com/haoxins))
- docs: Fix broken links in quick-start-v1.md [\#1536](https://github.com/kubeflow/training-operator/pull/1536) ([nakamasato](https://github.com/nakamasato))
- extends path in \_\_init\_\_.py for SDK correctly [\#1531](https://github.com/kubeflow/training-operator/pull/1531) ([cakeislife100](https://github.com/cakeislife100))
- chore: Update changelog for v1.4.0-rc.0 release [\#1528](https://github.com/kubeflow/training-operator/pull/1528) ([terrytangyuan](https://github.com/terrytangyuan))