Kubeflow-training

Latest version: v1.9.1

1.9.0

This is the Training Operator v1.9.0 release.

This release introduces [a new JAXJob](https://www.kubeflow.org/docs/components/training/user-guides/jax/), enabling seamless distributed training with [JAX](https://github.com/google/jax).

Additionally, it adds the `managedBy` API to streamline the orchestration of training Jobs in multi-cluster environments using [MultiKueue](https://kueue.sigs.k8s.io/docs/concepts/multikueue/).
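
As a rough illustration of the `managedBy` field, the sketch below patches a job manifest held as a plain Python dict. It assumes the field lives under `spec.runPolicy.managedBy` and that MultiKueue expects the value `kueue.x-k8s.io/multikueue`; the helper `with_managed_by`, the job name, the queue name, and the image are hypothetical and not part of the SDK. Check the Kueue and Training Operator docs before relying on it.

```python
from copy import deepcopy

# Assumption: manager name that Kueue documents for MultiKueue-managed jobs.
MULTIKUEUE_MANAGER = "kueue.x-k8s.io/multikueue"


def with_managed_by(job: dict, manager: str = MULTIKUEUE_MANAGER) -> dict:
    """Return a copy of a v1 training job manifest with runPolicy.managedBy set.

    Hypothetical helper for illustration; the input manifest is not modified.
    """
    patched = deepcopy(job)
    run_policy = patched.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["managedBy"] = manager
    return patched


# Placeholder manifest: names, queue label value, and image are made up.
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {
        "name": "demo-job",
        "labels": {"kueue.x-k8s.io/queue-name": "user-queue"},
    },
    "spec": {
        "pytorchReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "pytorch", "image": "example.com/train:latest"}
                        ]
                    }
                },
            }
        }
    },
}

patched_job = with_managed_by(pytorch_job)
```

The patched dict could then be serialized to YAML or submitted with a Kubernetes client; that step is left out here because it depends on the cluster setup.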

Breaking Changes

- Upgrade Kubernetes to v1.31.3 ([2330](https://github.com/kubeflow/training-operator/pull/2330) by astefanutti)
- Upgrade Kubernetes to v1.30.7 ([2332](https://github.com/kubeflow/training-operator/pull/2332) by astefanutti)
- Update the name of PVC in `train` API ([2187](https://github.com/kubeflow/training-operator/pull/2187) by helenxie-bit)
- Remove support for MXJob ([2150](https://github.com/kubeflow/training-operator/pull/2150) by tariq-hasan)
- Support Python 3.11 and Drop Python 3.7 ([2105](https://github.com/kubeflow/training-operator/pull/2105) by tenzen-y)

New Features

Distributed JAX

- Add JAX controller ([2194](https://github.com/kubeflow/training-operator/pull/2194) by sandipanpanda)
- Add JAX API ([2163](https://github.com/kubeflow/training-operator/pull/2163) by sandipanpanda)
- JAX Integration Enhancement Proposal ([2125](https://github.com/kubeflow/training-operator/pull/2125) by sandipanpanda)
- JAX example for MNIST SPMD and add CI testing ([2390](https://github.com/kubeflow/training-operator/pull/2390) by saileshd1402)
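
To give a feel for what a JAXJob looks like, here is a minimal sketch that builds a manifest as a Python dict. The field names (`jaxReplicaSpecs`, a `Worker` replica type, a container named `jax`) reflect one reading of the JAXJob CRD and should be treated as assumptions; `build_jaxjob`, the job name, and the image are hypothetical.

```python
def build_jaxjob(name: str, image: str, workers: int = 2) -> dict:
    """Build a minimal JAXJob manifest (illustrative sketch, not an SDK API)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "JAXJob",
        "metadata": {"name": name},
        "spec": {
            # Assumption: JAXJob follows the same <framework>ReplicaSpecs
            # layout as the other v1 job kinds, with Worker replicas only.
            "jaxReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [{"name": "jax", "image": image}]
                        }
                    },
                }
            }
        },
    }


# Placeholder name and image, for illustration only.
jax_job = build_jaxjob("jax-demo", "example.com/jax-train:latest", workers=3)
```

The resulting dict can be dumped to YAML and applied with `kubectl`, assuming the Training Operator v1.9.0 CRDs are installed in the cluster.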

New Examples

- FSDP Example for T5 Fine-Tuning and PyTorchJob ([2286](https://github.com/kubeflow/training-operator/pull/2286) by andreyvelich)
- Add DeepSpeed Example with Pytorch Operator ([2235](https://github.com/kubeflow/training-operator/pull/2235) by Syulin7)

Control Plane Updates

- Validate pytorchjob workers are configured when elasticpolicy is configured ([2320](https://github.com/kubeflow/training-operator/pull/2320) by tarat44)
- [Feature] Support managed by external controller ([2203](https://github.com/kubeflow/training-operator/pull/2203) by mszadkow)
- Update trainer to ensure type consistency for `train_args` and `lora_config` ([2181](https://github.com/kubeflow/training-operator/pull/2181) by helenxie-bit)
- Support ARM64 platform in TensorFlow examples ([2119](https://github.com/kubeflow/training-operator/pull/2119) by akhilsaivenkata)
- Feat: Support ARM64 platform in XGBoost examples ([2114](https://github.com/kubeflow/training-operator/pull/2114) by tico88612)
- ARM64 supported in PyTorch examples ([2116](https://github.com/kubeflow/training-operator/pull/2116) by danielsuh05)

SDK Updates

- [SDK] Adding env vars ([2285](https://github.com/kubeflow/training-operator/pull/2285) by tarekabouzeid)
- [SDK] Use torchrun to create PyTorchJob from function ([2276](https://github.com/kubeflow/training-operator/pull/2276) by andreyvelich)
- [SDK] move env var to constants.py ([2268](https://github.com/kubeflow/training-operator/pull/2268) by varshaprasad96)
- [SDK] Allow customising base trainer and storage images in Train API ([2261](https://github.com/kubeflow/training-operator/pull/2261) by varshaprasad96)
- [SDK] Read namespace from the current context ([2255](https://github.com/kubeflow/training-operator/pull/2255) by andreyvelich)
- [SDK] Sync Transformers version for train API ([2146](https://github.com/kubeflow/training-operator/pull/2146) by andreyvelich)
- [SDK] Explain Python version support cycle ([2144](https://github.com/kubeflow/training-operator/pull/2144) by andreyvelich)

Kubeflow Training V2

- KEP-2170: Kubeflow Training V2 API ([2171](https://github.com/kubeflow/training-operator/pull/2171) by andreyvelich)
- KEP-2170: Update V2 KEP with MPI Runtime info ([2345](https://github.com/kubeflow/training-operator/pull/2345) by andreyvelich)
- Always update TrainJob status on errors ([2352](https://github.com/kubeflow/training-operator/pull/2352) by astefanutti)
- Fix TrainJob status comparison and update ([2353](https://github.com/kubeflow/training-operator/pull/2353) by astefanutti)
- Add required RBAC on TrainJob finalizer sub-resources ([2350](https://github.com/kubeflow/training-operator/pull/2350) by astefanutti)
- KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK ([2324](https://github.com/kubeflow/training-operator/pull/2324) by andreyvelich)
- KEP-2170: Add Torch Distributed Runtime ([2328](https://github.com/kubeflow/training-operator/pull/2328) by andreyvelich)
- KEP-2170: Add TrainJob conditions ([2322](https://github.com/kubeflow/training-operator/pull/2322) by tenzen-y)
- KEP-2170: Add the TrainJob state transition design ([2298](https://github.com/kubeflow/training-operator/pull/2298) by tenzen-y)
- KEP-2170: Implement Initializer builders in the JobSet plugin ([2316](https://github.com/kubeflow/training-operator/pull/2316) by andreyvelich)
- KEP-2170: Implement JobSet, PlainML, and Torch Plugins ([2308](https://github.com/kubeflow/training-operator/pull/2308) by andreyvelich)
- KEP-2170: Create model and dataset initializers ([2303](https://github.com/kubeflow/training-operator/pull/2303) by andreyvelich)
- KEP-2170: Generate Python SDK for Kubeflow Training V2 ([2310](https://github.com/kubeflow/training-operator/pull/2310) by andreyvelich)
- KEP-2170: Initialize runtimes before the manager starts ([2306](https://github.com/kubeflow/training-operator/pull/2306) by tenzen-y)
- KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings ([2304](https://github.com/kubeflow/training-operator/pull/2304) by tenzen-y)
- KEP-2170: Decouple JobSet from TrainJob ([2296](https://github.com/kubeflow/training-operator/pull/2296) by tenzen-y)
- KEP-2170: Implement TrainJob Reconciler to manage objects ([2295](https://github.com/kubeflow/training-operator/pull/2295) by tenzen-y)
- KEP-2170: Add manifests for Kubeflow Training V2 ([2289](https://github.com/kubeflow/training-operator/pull/2289) by andreyvelich)
- KEP-2170: Adding CEL validations on v2 TrainJob CRD ([2260](https://github.com/kubeflow/training-operator/pull/2260) by akshaychitneni)
- KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API ([2283](https://github.com/kubeflow/training-operator/pull/2283) by andreyvelich)
- KEP-2170: Implement runtime framework ([2248](https://github.com/kubeflow/training-operator/pull/2248) by tenzen-y)
- [v2alpha] Move GV related codebase ([2281](https://github.com/kubeflow/training-operator/pull/2281) by varshaprasad96)
- KEP-2170: Generate clientset, openapi spec for the V2 APIs ([2273](https://github.com/kubeflow/training-operator/pull/2273) by varshaprasad96)
- KEP-2170: Implement skeleton webhook servers ([2251](https://github.com/kubeflow/training-operator/pull/2251) by tenzen-y)
- KEP-2170: Initial Implementations for v2 Manager ([2236](https://github.com/kubeflow/training-operator/pull/2236) by tenzen-y)
- KEP-2170: Generate CRD manifests for v2 CustomResources ([2237](https://github.com/kubeflow/training-operator/pull/2237) by tenzen-y)
- KEP-2170: Update Training V2 APIs in the KEP ([2240](https://github.com/kubeflow/training-operator/pull/2240) by andreyvelich)
- KEP-2170: Add TrainJob and TrainingRuntime APIs ([2223](https://github.com/kubeflow/training-operator/pull/2223) by andreyvelich)
- KEP-2170: Bind repository into the build environment instead of filecopy ([2222](https://github.com/kubeflow/training-operator/pull/2222) by tenzen-y)
- KEP-2170: Add directories for the V2 APIs ([2221](https://github.com/kubeflow/training-operator/pull/2221) by andreyvelich)
- KEP-2170: Add the apiGroup to the TrainingRuntimeRef ([2201](https://github.com/kubeflow/training-operator/pull/2201) by tenzen-y)
- KEP-2170: Make API specification more restricting ([2198](https://github.com/kubeflow/training-operator/pull/2198) by tenzen-y)

Bug Fixes

- [release-1.9] V1: Fix versions in HuggingFace dataset initializer ([2370](https://github.com/kubeflow/training-operator/pull/2370) by andreyvelich)
- Pin accelerate package version in trainer ([2340](https://github.com/kubeflow/training-operator/pull/2340) by gavrissh)
- [fix] Resolve v2alpha API exceptions ([2317](https://github.com/kubeflow/training-operator/pull/2317) by varshaprasad96)
- [SDK] Minor fix in wait_for_job_conditions with job_kind python training API ([2265](https://github.com/kubeflow/training-operator/pull/2265) by saileshd1402)
- [SDK] Fix typo of "get_pvc_spec" ([2250](https://github.com/kubeflow/training-operator/pull/2250) by helenxie-bit)
- [Bug] Finish CleanupJob early if the job is suspended. ([2243](https://github.com/kubeflow/training-operator/pull/2243) by mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models ([2230](https://github.com/kubeflow/training-operator/pull/2230) by helenxie-bit)
- Update `huggingface_hub` Version in the storage initializer to fix ImportError ([2180](https://github.com/kubeflow/training-operator/pull/2180) by helenxie-bit)
- [SDK] Fix Failed condition in wait Job API ([2160](https://github.com/kubeflow/training-operator/pull/2160) by andreyvelich)
- fix volcano podgroup update issue ([2079](https://github.com/kubeflow/training-operator/pull/2079) by ckyuto)
- [SDK] Fix Incorrect Events in get_job_logs API ([2122](https://github.com/kubeflow/training-operator/pull/2122) by andreyvelich)

Misc

- [release-1.9] Add release branch to the image push trigger ([2377](https://github.com/kubeflow/training-operator/pull/2377) by andreyvelich)
- Add e2e test for train API ([2199](https://github.com/kubeflow/training-operator/pull/2199) by helenxie-bit)
- buildx link was broken ([2356](https://github.com/kubeflow/training-operator/pull/2356) by Veer0x1)
- Upgrade helm/kind-action to v1.11.0 ([2357](https://github.com/kubeflow/training-operator/pull/2357) by astefanutti)
- Upgrade Go version to v1.23 ([2302](https://github.com/kubeflow/training-operator/pull/2302) by tenzen-y)
- Ensure code generation dependencies are downloaded ([2339](https://github.com/kubeflow/training-operator/pull/2339) by astefanutti)
- Added test for create-pytorchjob.ipynb python notebook ([2274](https://github.com/kubeflow/training-operator/pull/2274) by saileshd1402)
- Remove zw0610 from approvers ([2343](https://github.com/kubeflow/training-operator/pull/2343) by zw0610)
- Upgrade kustomization files to Kustomize v5 ([2326](https://github.com/kubeflow/training-operator/pull/2326) by oksanabaza)
- Add openapi-generator CLI option to skip SDK v2 test generation ([2338](https://github.com/kubeflow/training-operator/pull/2338) by astefanutti)
- Refine the server-side apply installation args ([2337](https://github.com/kubeflow/training-operator/pull/2337) by tenzen-y)
- Ignore cache exporting errors in the image building workflows ([2336](https://github.com/kubeflow/training-operator/pull/2336) by tenzen-y)
- Pin Gloo repository in JAX Dockerfile to a specific commit ([2329](https://github.com/kubeflow/training-operator/pull/2329) by sandipanpanda)
- Update tf job examples to tf v2 ([2270](https://github.com/kubeflow/training-operator/pull/2270) by YosiElias)
- Remove Prometheus Monitoring doc ([2301](https://github.com/kubeflow/training-operator/pull/2301) by sophie0730)
- Upgrade Deepspeed demo dependencies ([2294](https://github.com/kubeflow/training-operator/pull/2294) by Syulin7)
- [SDK] test: add unit test for list_jobs method of the training_client ([2267](https://github.com/kubeflow/training-operator/pull/2267) by seanlaii)
- [SDK] Training Client Conditions related unit tests ([2253](https://github.com/kubeflow/training-operator/pull/2253) by Bobbins228)
- [SDK] test: add unit test for get_job_logs method of the training_client ([2275](https://github.com/kubeflow/training-operator/pull/2275) by seanlaii)
- [SDK] test: add unit test for get_job method of the training_client ([2205](https://github.com/kubeflow/training-operator/pull/2205) by Bobbins228)
- [SDK] test: add unit tests for delete_job() method ([2232](https://github.com/kubeflow/training-operator/pull/2232) by Bobbins228)
- [SDK] Add UTs for `wait_for_job_conditions` ([2196](https://github.com/kubeflow/training-operator/pull/2196) by Electronic-Waste)
- [SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job ([2192](https://github.com/kubeflow/training-operator/pull/2192) by YosiElias)
- [SDK] Add more unit tests for TrainingClient APIs - get_job_pods ([2175](https://github.com/kubeflow/training-operator/pull/2175) by YosiElias)
- Update JAX image to use image published by Kubeflow ([2264](https://github.com/kubeflow/training-operator/pull/2264) by sandipanpanda)
- Update README and out-of-date docs ([2252](https://github.com/kubeflow/training-operator/pull/2252) by andreyvelich)
- Clean up Go modules ([2238](https://github.com/kubeflow/training-operator/pull/2238) by tenzen-y)
- Change isort profile to black for full compatibility ([2234](https://github.com/kubeflow/training-operator/pull/2234) by Ygnas)
- Enhance pre-commit hooks with flake8 linting ([2195](https://github.com/kubeflow/training-operator/pull/2195) by Ygnas)
- Implement pre-commit hooks ([2184](https://github.com/kubeflow/training-operator/pull/2184) by droctothorpe)
- Add command to re-run GitHub Actions tests ([2167](https://github.com/kubeflow/training-operator/pull/2167) by andreyvelich)
- Update JAX integration proposal ([2165](https://github.com/kubeflow/training-operator/pull/2165) by sandipanpanda)
- Update release document ([2153](https://github.com/kubeflow/training-operator/pull/2153) by andreyvelich)
- update volcano to v1.9.0 ([2148](https://github.com/kubeflow/training-operator/pull/2148) by lowang-bh)
- Update Slack Invitation ([2142](https://github.com/kubeflow/training-operator/pull/2142) by andreyvelich)
- Refine the integration tests for the immutable PyTorchJob queueName ([2130](https://github.com/kubeflow/training-operator/pull/2130) by tenzen-y)
- Add GitHub Issue Template ([2129](https://github.com/kubeflow/training-operator/pull/2129) by andreyvelich)
- Update the images to the latest tag in master branch ([2128](https://github.com/kubeflow/training-operator/pull/2128) by johnugeorge)
- Updated Github Action Workflows as per issue 2117 ([2123](https://github.com/kubeflow/training-operator/pull/2123) by hkiiita)
- changed package name to flake8 to fix pytests pip install ([2109](https://github.com/kubeflow/training-operator/pull/2109) by ChristopheBrown)
- chore(fix): isort xgboost ([2098](https://github.com/kubeflow/training-operator/pull/2098) by harshithbelagur)
- Fix isort on examples/pytorch ([2094](https://github.com/kubeflow/training-operator/pull/2094) by marcmaliar)

1.9.0rc.0

This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

- Upgrade Kubernetes to v1.31.3 ([2330](https://github.com/kubeflow/training-operator/pull/2330) by astefanutti)
- Upgrade Kubernetes to v1.30.7 ([2332](https://github.com/kubeflow/training-operator/pull/2332) by astefanutti)
- Update the name of PVC in `train` API ([2187](https://github.com/kubeflow/training-operator/pull/2187) by helenxie-bit)
- Remove support for MXJob ([2150](https://github.com/kubeflow/training-operator/pull/2150) by tariq-hasan)
- Support Python 3.11 and Drop Python 3.7 ([2105](https://github.com/kubeflow/training-operator/pull/2105) by tenzen-y)

New Features

Distributed JAX

- Add JAX controller ([2194](https://github.com/kubeflow/training-operator/pull/2194) by sandipanpanda)
- Add JAX API ([2163](https://github.com/kubeflow/training-operator/pull/2163) by sandipanpanda)
- JAX Integration Enhancement Proposal ([2125](https://github.com/kubeflow/training-operator/pull/2125) by sandipanpanda)

New Examples

- FSDP Example for T5 Fine-Tuning and PyTorchJob ([2286](https://github.com/kubeflow/training-operator/pull/2286) by andreyvelich)
- Add DeepSpeed Example with Pytorch Operator ([2235](https://github.com/kubeflow/training-operator/pull/2235) by Syulin7)

Control Plane Updates

- Validate pytorchjob workers are configured when elasticpolicy is configured ([2320](https://github.com/kubeflow/training-operator/pull/2320) by tarat44)
- [Feature] Support managed by external controller ([2203](https://github.com/kubeflow/training-operator/pull/2203) by mszadkow)
- Update trainer to ensure type consistency for `train_args` and `lora_config` ([2181](https://github.com/kubeflow/training-operator/pull/2181) by helenxie-bit)
- Support ARM64 platform in TensorFlow examples ([2119](https://github.com/kubeflow/training-operator/pull/2119) by akhilsaivenkata)
- Feat: Support ARM64 platform in XGBoost examples ([2114](https://github.com/kubeflow/training-operator/pull/2114) by tico88612)
- ARM64 supported in PyTorch examples ([2116](https://github.com/kubeflow/training-operator/pull/2116) by danielsuh05)

SDK Updates

- [SDK] Adding env vars ([2285](https://github.com/kubeflow/training-operator/pull/2285) by tarekabouzeid)
- [SDK] Use torchrun to create PyTorchJob from function ([2276](https://github.com/kubeflow/training-operator/pull/2276) by andreyvelich)
- [SDK] move env var to constants.py ([2268](https://github.com/kubeflow/training-operator/pull/2268) by varshaprasad96)
- [SDK] Allow customising base trainer and storage images in Train API ([2261](https://github.com/kubeflow/training-operator/pull/2261) by varshaprasad96)
- [SDK] Read namespace from the current context ([2255](https://github.com/kubeflow/training-operator/pull/2255) by andreyvelich)
- [SDK] Sync Transformers version for train API ([2146](https://github.com/kubeflow/training-operator/pull/2146) by andreyvelich)
- [SDK] Explain Python version support cycle ([2144](https://github.com/kubeflow/training-operator/pull/2144) by andreyvelich)

Kubeflow Training V2

- KEP-2170: Kubeflow Training V2 API ([2171](https://github.com/kubeflow/training-operator/pull/2171) by andreyvelich)
- KEP-2170: Update V2 KEP with MPI Runtime info ([2345](https://github.com/kubeflow/training-operator/pull/2345) by andreyvelich)
- Always update TrainJob status on errors ([2352](https://github.com/kubeflow/training-operator/pull/2352) by astefanutti)
- Fix TrainJob status comparison and update ([2353](https://github.com/kubeflow/training-operator/pull/2353) by astefanutti)
- Add required RBAC on TrainJob finalizer sub-resources ([2350](https://github.com/kubeflow/training-operator/pull/2350) by astefanutti)
- KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK ([2324](https://github.com/kubeflow/training-operator/pull/2324) by andreyvelich)
- KEP-2170: Add Torch Distributed Runtime ([2328](https://github.com/kubeflow/training-operator/pull/2328) by andreyvelich)
- KEP-2170: Add TrainJob conditions ([2322](https://github.com/kubeflow/training-operator/pull/2322) by tenzen-y)
- KEP-2170: Add the TrainJob state transition design ([2298](https://github.com/kubeflow/training-operator/pull/2298) by tenzen-y)
- KEP-2170: Implement Initializer builders in the JobSet plugin ([2316](https://github.com/kubeflow/training-operator/pull/2316) by andreyvelich)
- KEP-2170: Implement JobSet, PlainML, and Torch Plugins ([2308](https://github.com/kubeflow/training-operator/pull/2308) by andreyvelich)
- KEP-2170: Create model and dataset initializers ([2303](https://github.com/kubeflow/training-operator/pull/2303) by andreyvelich)
- KEP-2170: Generate Python SDK for Kubeflow Training V2 ([2310](https://github.com/kubeflow/training-operator/pull/2310) by andreyvelich)
- KEP-2170: Initialize runtimes before the manager starts ([2306](https://github.com/kubeflow/training-operator/pull/2306) by tenzen-y)
- KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings ([2304](https://github.com/kubeflow/training-operator/pull/2304) by tenzen-y)
- KEP-2170: Decouple JobSet from TrainJob ([2296](https://github.com/kubeflow/training-operator/pull/2296) by tenzen-y)
- KEP-2170: Implement TrainJob Reconciler to manage objects ([2295](https://github.com/kubeflow/training-operator/pull/2295) by tenzen-y)
- KEP-2170: Add manifests for Kubeflow Training V2 ([2289](https://github.com/kubeflow/training-operator/pull/2289) by andreyvelich)
- KEP-2170: Adding CEL validations on v2 TrainJob CRD ([2260](https://github.com/kubeflow/training-operator/pull/2260) by akshaychitneni)
- KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API ([2283](https://github.com/kubeflow/training-operator/pull/2283) by andreyvelich)
- KEP-2170: Implement runtime framework ([2248](https://github.com/kubeflow/training-operator/pull/2248) by tenzen-y)
- [v2alpha] Move GV related codebase ([2281](https://github.com/kubeflow/training-operator/pull/2281) by varshaprasad96)
- KEP-2170: Generate clientset, openapi spec for the V2 APIs ([2273](https://github.com/kubeflow/training-operator/pull/2273) by varshaprasad96)
- KEP-2170: Implement skeleton webhook servers ([2251](https://github.com/kubeflow/training-operator/pull/2251) by tenzen-y)
- KEP-2170: Initial Implementations for v2 Manager ([2236](https://github.com/kubeflow/training-operator/pull/2236) by tenzen-y)
- KEP-2170: Generate CRD manifests for v2 CustomResources ([2237](https://github.com/kubeflow/training-operator/pull/2237) by tenzen-y)
- KEP-2170: Update Training V2 APIs in the KEP ([2240](https://github.com/kubeflow/training-operator/pull/2240) by andreyvelich)
- KEP-2170: Add TrainJob and TrainingRuntime APIs ([2223](https://github.com/kubeflow/training-operator/pull/2223) by andreyvelich)
- KEP-2170: Bind repository into the build environment instead of filecopy ([2222](https://github.com/kubeflow/training-operator/pull/2222) by tenzen-y)
- KEP-2170: Add directories for the V2 APIs ([2221](https://github.com/kubeflow/training-operator/pull/2221) by andreyvelich)
- KEP-2170: Add the apiGroup to the TrainingRuntimeRef ([2201](https://github.com/kubeflow/training-operator/pull/2201) by tenzen-y)
- KEP-2170: Make API specification more restricting ([2198](https://github.com/kubeflow/training-operator/pull/2198) by tenzen-y)

Bug Fixes

- [release-1.9] V1: Fix versions in HuggingFace dataset initializer ([2370](https://github.com/kubeflow/training-operator/pull/2370) by andreyvelich)
- Pin accelerate package version in trainer ([2340](https://github.com/kubeflow/training-operator/pull/2340) by gavrissh)
- [fix] Resolve v2alpha API exceptions ([2317](https://github.com/kubeflow/training-operator/pull/2317) by varshaprasad96)
- [SDK] Minor fix in wait_for_job_conditions with job_kind python training API ([2265](https://github.com/kubeflow/training-operator/pull/2265) by saileshd1402)
- [SDK] Fix typo of "get_pvc_spec" ([2250](https://github.com/kubeflow/training-operator/pull/2250) by helenxie-bit)
- [Bug] Finish CleanupJob early if the job is suspended. ([2243](https://github.com/kubeflow/training-operator/pull/2243) by mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models ([2230](https://github.com/kubeflow/training-operator/pull/2230) by helenxie-bit)
- Update `huggingface_hub` Version in the storage initializer to fix ImportError ([2180](https://github.com/kubeflow/training-operator/pull/2180) by helenxie-bit)
- [SDK] Fix Failed condition in wait Job API ([2160](https://github.com/kubeflow/training-operator/pull/2160) by andreyvelich)
- fix volcano podgroup update issue ([2079](https://github.com/kubeflow/training-operator/pull/2079) by ckyuto)
- [SDK] Fix Incorrect Events in get_job_logs API ([2122](https://github.com/kubeflow/training-operator/pull/2122) by andreyvelich)

Misc

- [release-1.9] Add release branch to the image push trigger ([2377](https://github.com/kubeflow/training-operator/pull/2377) by andreyvelich)
- Add e2e test for train API ([2199](https://github.com/kubeflow/training-operator/pull/2199) by helenxie-bit)
- buildx link was broken ([2356](https://github.com/kubeflow/training-operator/pull/2356) by Veer0x1)
- Upgrade helm/kind-action to v1.11.0 ([2357](https://github.com/kubeflow/training-operator/pull/2357) by astefanutti)
- Upgrade Go version to v1.23 ([2302](https://github.com/kubeflow/training-operator/pull/2302) by tenzen-y)
- Ensure code generation dependencies are downloaded ([2339](https://github.com/kubeflow/training-operator/pull/2339) by astefanutti)
- Added test for create-pytorchjob.ipynb python notebook ([2274](https://github.com/kubeflow/training-operator/pull/2274) by saileshd1402)
- Remove zw0610 from approvers ([2343](https://github.com/kubeflow/training-operator/pull/2343) by zw0610)
- Upgrade kustomization files to Kustomize v5 ([2326](https://github.com/kubeflow/training-operator/pull/2326) by oksanabaza)
- Add openapi-generator CLI option to skip SDK v2 test generation ([2338](https://github.com/kubeflow/training-operator/pull/2338) by astefanutti)
- Refine the server-side apply installation args ([2337](https://github.com/kubeflow/training-operator/pull/2337) by tenzen-y)
- Ignore cache exporting errors in the image building workflows ([2336](https://github.com/kubeflow/training-operator/pull/2336) by tenzen-y)
- Pin Gloo repository in JAX Dockerfile to a specific commit ([2329](https://github.com/kubeflow/training-operator/pull/2329) by sandipanpanda)
- Update tf job examples to tf v2 ([2270](https://github.com/kubeflow/training-operator/pull/2270) by YosiElias)
- Remove Prometheus Monitoring doc ([2301](https://github.com/kubeflow/training-operator/pull/2301) by sophie0730)
- Upgrade Deepspeed demo dependencies ([2294](https://github.com/kubeflow/training-operator/pull/2294) by Syulin7)
- [SDK] test: add unit test for list_jobs method of the training_client ([2267](https://github.com/kubeflow/training-operator/pull/2267) by seanlaii)
- [SDK] Training Client Conditions related unit tests ([2253](https://github.com/kubeflow/training-operator/pull/2253) by Bobbins228)
- [SDK] test: add unit test for get_job_logs method of the training_client ([2275](https://github.com/kubeflow/training-operator/pull/2275) by seanlaii)
- [SDK] test: add unit test for get_job method of the training_client ([2205](https://github.com/kubeflow/training-operator/pull/2205) by Bobbins228)
- [SDK] test: add unit tests for delete_job() method ([2232](https://github.com/kubeflow/training-operator/pull/2232) by Bobbins228)
- [SDK] Add UTs for `wait_for_job_conditions` ([2196](https://github.com/kubeflow/training-operator/pull/2196) by Electronic-Waste)
- [SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job ([2192](https://github.com/kubeflow/training-operator/pull/2192) by YosiElias)
- [SDK] Add more unit tests for TrainingClient APIs - get_job_pods ([2175](https://github.com/kubeflow/training-operator/pull/2175) by YosiElias)
- Update JAX image to use image published by Kubeflow ([2264](https://github.com/kubeflow/training-operator/pull/2264) by sandipanpanda)
- Update README and out-of-date docs ([2252](https://github.com/kubeflow/training-operator/pull/2252) by andreyvelich)
- Clean up Go modules ([2238](https://github.com/kubeflow/training-operator/pull/2238) by tenzen-y)
- Change isort profile to black for full compatibility ([2234](https://github.com/kubeflow/training-operator/pull/2234) by Ygnas)
- Enhance pre-commit hooks with flake8 linting ([2195](https://github.com/kubeflow/training-operator/pull/2195) by Ygnas)
- Implement pre-commit hooks ([2184](https://github.com/kubeflow/training-operator/pull/2184) by droctothorpe)
- Add command to re-run GitHub Actions tests ([2167](https://github.com/kubeflow/training-operator/pull/2167) by andreyvelich)
- Update JAX integration proposal ([2165](https://github.com/kubeflow/training-operator/pull/2165) by sandipanpanda)
- Update release document ([2153](https://github.com/kubeflow/training-operator/pull/2153) by andreyvelich)
- update volcano to v1.9.0 ([2148](https://github.com/kubeflow/training-operator/pull/2148) by lowang-bh)
- Update Slack Invitation ([2142](https://github.com/kubeflow/training-operator/pull/2142) by andreyvelich)
- Refine the integration tests for the immutable PyTorchJob queueName ([2130](https://github.com/kubeflow/training-operator/pull/2130) by tenzen-y)
- Add GitHub Issue Template ([2129](https://github.com/kubeflow/training-operator/pull/2129) by andreyvelich)
- Update the images to the latest tag in master branch ([2128](https://github.com/kubeflow/training-operator/pull/2128) by johnugeorge)
- Updated Github Action Workflows as per issue 2117 ([2123](https://github.com/kubeflow/training-operator/pull/2123) by hkiiita)
- changed package name to flake8 to fix pytests pip install ([2109](https://github.com/kubeflow/training-operator/pull/2109) by ChristopheBrown)
- chore(fix): isort xgboost ([2098](https://github.com/kubeflow/training-operator/pull/2098) by harshithbelagur)
- Fix isort on examples/pytorch ([2094](https://github.com/kubeflow/training-operator/pull/2094) by marcmaliar)

1.8.1

This is the Training Operator v1.8.1 release.

Bug Fixes

- [Bug] Finish CleanupJob early if the job is suspended ([2243](https://github.com/kubeflow/training-operator/pull/2243) by mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models ([2230](https://github.com/kubeflow/training-operator/pull/2230) by helenxie-bit)
- Update `huggingface_hub` Version in the storage initializer to fix ImportError ([2180](https://github.com/kubeflow/training-operator/pull/2180) by helenxie-bit)

New Contributors

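- mszadkow made their first contribution in [2243](https://github.com/kubeflow/training-operator/pull/2243)
- helenxie-bit made their first contribution in [2180](https://github.com/kubeflow/training-operator/pull/2180)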

1.8.0

This is the Training Operator v1.8.0 release.

This release introduces [a new Python API for LLMs Fine-Tuning](https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) that simplifies the ability to fine-tune foundational models using distributed PyTorch nodes.

Install the Kubeflow Training SDK as follows to try it:

`pip install -U "kubeflow-training[huggingface]"`

LLMs Fine-Tuning API

- Train/Fine-tune API Proposal for LLMs ([1945](https://github.com/kubeflow/training-operator/pull/1945) by deepanker13)
- [SDK] Train API for LLM Fine-Tuning ([1962](https://github.com/kubeflow/training-operator/pull/1962) by deepanker13)
- Modify LLM Trainer to support BERT and Tiny LLaMA ([2031](https://github.com/kubeflow/training-operator/pull/2031) by andreyvelich)
- Support arm64 for Hugging Face trainer ([2028](https://github.com/kubeflow/training-operator/pull/2028) by tariq-hasan)
- Add Fine-Tune BERT LLM Example ([2021](https://github.com/kubeflow/training-operator/pull/2021) by andreyvelich)
- Train api dataset download changes ([1959](https://github.com/kubeflow/training-operator/pull/1959) by deepanker13)
- Train api init container creation ([1958](https://github.com/kubeflow/training-operator/pull/1958) by deepanker13)
- [SDK] Add docstring for Train API ([2075](https://github.com/kubeflow/training-operator/pull/2075) by andreyvelich)

Breaking Changes

- [SDK] Support Python 3.11 and Drop Python 3.7 ([2105](https://github.com/kubeflow/training-operator/pull/2105) by tenzen-y)
- Support K8s v1.29 and Drop K8s v1.26 ([2039](https://github.com/kubeflow/training-operator/pull/2039) by tenzen-y)
- Support K8s v1.28 and Drop K8s v1.25 ([2038](https://github.com/kubeflow/training-operator/pull/2038) by tenzen-y)
- Deprecation Notice for MXJob ([2058](https://github.com/kubeflow/training-operator/pull/2058) by tenzen-y)
- ⚠️ Breaking Changes: Rename `monitoring-port` flag to `webhook-server-port` ([1925](https://github.com/kubeflow/training-operator/pull/1925) by afritzler)

New Features

Control Plane Updates

- Upgrade scheduler-plugins to v0.28.9 ([2065](https://github.com/kubeflow/training-operator/pull/2065) by tenzen-y)
- Implement webhook validations for the PaddleJob ([2057](https://github.com/kubeflow/training-operator/pull/2057) by tenzen-y)
- Implement webhook validations for the XGBoostJob ([2052](https://github.com/kubeflow/training-operator/pull/2052) by tenzen-y)
- Implement webhook validation for the TFJob ([2051](https://github.com/kubeflow/training-operator/pull/2051) by tenzen-y)
- Implement webhook validations for the PyTorchJob ([2035](https://github.com/kubeflow/training-operator/pull/2035) by tenzen-y)
- Upgrade PyTorchJob examples to PyTorch v2 ([2024](https://github.com/kubeflow/training-operator/pull/2024) by champon1020)
- Upgrade Go version to v1.22 ([2046](https://github.com/kubeflow/training-operator/pull/2046) by tenzen-y)

SDK Improvements

- [SDK] Add resources per worker for Create Job API ([1990](https://github.com/kubeflow/training-operator/pull/1990) by andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob ([1988](https://github.com/kubeflow/training-operator/pull/1988) by andreyvelich)
- [SDK] Get Kubernetes Events for Job ([1975](https://github.com/kubeflow/training-operator/pull/1975) by andreyvelich)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 ([2066](https://github.com/kubeflow/training-operator/pull/2066) by tenzen-y)
- [SDK] Add information about TrainingClient logging ([1973](https://github.com/kubeflow/training-operator/pull/1973) by andreyvelich)
- Training operator SDK unit test ([1938](https://github.com/kubeflow/training-operator/pull/1938) by deepanker13)
- [SDK] Consolidate Naming for CRUD APIs ([1907](https://github.com/kubeflow/training-operator/pull/1907) by andreyvelich)

Bug Fixes

- [SDK] Fix Failed condition in wait Job API ([2160](https://github.com/kubeflow/training-operator/pull/2160) by andreyvelich)
- [SDK] Sync Transformers version for train API ([2147](https://github.com/kubeflow/training-operator/pull/2147) by andreyvelich)
- [SDK] Changed package name to flake8 to fix pip install ([2140](https://github.com/kubeflow/training-operator/pull/2140) by tenzen-y)
- [SDK] Fix Incorrect Events in get_job_logs API ([2138](https://github.com/kubeflow/training-operator/pull/2138) by tenzen-y)
- Fix volcano podgroup update issue ([2079](https://github.com/kubeflow/training-operator/pull/2079) by ckyuto)
- Fix import for HuggingFace Dataset Provider ([2085](https://github.com/kubeflow/training-operator/pull/2085) by andreyvelich)
- Updated examples for train API ([2077](https://github.com/kubeflow/training-operator/pull/2077) by shruti2522)
- Fail job for non-retryable exit codes ([2071](https://github.com/kubeflow/training-operator/pull/2071) by kellyaa)
- E2E: Replace outdated images with latest ones ([2083](https://github.com/kubeflow/training-operator/pull/2083) by tenzen-y)
- fix wrong filepath in the simple example command ([2062](https://github.com/kubeflow/training-operator/pull/2062) by qzoscar)
- fix(example): add installation of python-etcd in Pytorch example ([2064](https://github.com/kubeflow/training-operator/pull/2064) by champon1020)
- fix: Upgrade controller-gen to v0.14.0 ([2026](https://github.com/kubeflow/training-operator/pull/2026) by champon1020)
- Fix build workflow config for pytorch-torchrun-example ([2020](https://github.com/kubeflow/training-operator/pull/2020) by PeterWrighten)
- Fix Distributed Data Samplers in PyTorch Examples ([2012](https://github.com/kubeflow/training-operator/pull/2012) by andreyvelich)
- Fix URL in python SDK setup.py ([2011](https://github.com/kubeflow/training-operator/pull/2011) by garymm)
- Fix for Github CI to publish HF trainer image ([1987](https://github.com/kubeflow/training-operator/pull/1987) by johnugeorge)
- train api jupyternotebook fix ([1984](https://github.com/kubeflow/training-operator/pull/1984) by deepanker13)
- fix: volcano podgroup should has a non-empty queue name ([1977](https://github.com/kubeflow/training-operator/pull/1977) by lowang-bh)
- Fix Master Label for PyTorchJob ([1974](https://github.com/kubeflow/training-operator/pull/1974) by andreyvelich)
- IsMasterRole fix in pytorchjob controller ([1969](https://github.com/kubeflow/training-operator/pull/1969) by deepanker13)
- [fix] replace `${go env GOPATH}` with `$(go env GOPATH)` ([1952](https://github.com/kubeflow/training-operator/pull/1952) by double12gzh)
- Fixing issues with providing existing service account ([1918](https://github.com/kubeflow/training-operator/pull/1918) by rpemsel)
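
The "Fail job for non-retryable exit codes" entry above concerns how a container's exit code decides whether a Job is restarted or marked failed. The sketch below illustrates that kind of decision only; it is not the operator's implementation. The function names are hypothetical, and the threshold of 128 is an assumption based on the common convention that exit codes of 128 and above indicate termination by a signal (and so may be transient), while codes 1 to 127 indicate an application error.

```python
# Assumption: codes >= 128 mean the process was killed by a signal and
# may succeed on retry; codes 1-127 are treated as permanent errors.
RETRYABLE_THRESHOLD = 128


def is_retryable_exit_code(exit_code: int) -> bool:
    """Return True if the exit code is assumed to be worth retrying."""
    return exit_code >= RETRYABLE_THRESHOLD


def next_action(exit_code: int) -> str:
    """Map a container exit code to a hypothetical Job-level action."""
    if exit_code == 0:
        return "succeeded"
    if is_retryable_exit_code(exit_code):
        return "restart"
    # Non-retryable: fail the Job instead of restarting the pod.
    return "fail"


if __name__ == "__main__":
    for code in (0, 1, 127, 137):
        print(code, next_action(code))
```

Under these assumptions, exit code 137 (SIGKILL, for example from an out-of-memory kill) leads to a restart, while exit code 1 fails the Job immediately.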

Misc

- Refine the integration tests for the immutable PyTorchJob ([2130](https://github.com/kubeflow/training-operator/pull/2130) by tenzen-y)
- Update training operator image to latest ([2089](https://github.com/kubeflow/training-operator/pull/2089) by johnugeorge)
- Update sdk to v1.8.0rc0 ([2087](https://github.com/kubeflow/training-operator/pull/2087) by johnugeorge)
- Test: Simplify and Identify pod-controller envtest ([2084](https://github.com/kubeflow/training-operator/pull/2084) by tenzen-y)
- Remove deadcode related to PodDisruptionBudget ([2073](https://github.com/kubeflow/training-operator/pull/2073) by tenzen-y)
- docs: updating docs for local development ([2074](https://github.com/kubeflow/training-operator/pull/2074) by franciscojavierarceo)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode ([2067](https://github.com/kubeflow/training-operator/pull/2067) by tenzen-y)
- Updated developer docs to include Kind ([2061](https://github.com/kubeflow/training-operator/pull/2061) by franciscojavierarceo)
- adding fine tune example with s3 as the dataset store ([2006](https://github.com/kubeflow/training-operator/pull/2006) by deepanker13)
- CI: Use a mode=min in the builder cache ([2053](https://github.com/kubeflow/training-operator/pull/2053) by tenzen-y)
- Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 ([2043](https://github.com/kubeflow/training-operator/pull/2043) by jdcfd)
- Remove Dockerfile.ppc64le of pytorch example ([2042](https://github.com/kubeflow/training-operator/pull/2042) by champon1020)
- publish torchrun example via Dockerfile ([2018](https://github.com/kubeflow/training-operator/pull/2018) by PeterWrighten)
- Updated examples/pytorch to disable istio sidecar injection ([2004](https://github.com/kubeflow/training-operator/pull/2004) by jdcfd)
- [docs] development guide update ([1995](https://github.com/kubeflow/training-operator/pull/1995) by shashank-iitbhu)
- Add Kubeflow Website links to README ([1983](https://github.com/kubeflow/training-operator/pull/1983) by andreyvelich)
- publish trainer hugging face image ([1985](https://github.com/kubeflow/training-operator/pull/1985) by deepanker13)
- Adding Training image needed for train api ([1963](https://github.com/kubeflow/training-operator/pull/1963) by deepanker13)
- Add test to create PyTorchJob from func ([1979](https://github.com/kubeflow/training-operator/pull/1979) by andreyvelich)
- Corrected Some Spelling And Grammatical Errors ([1980](https://github.com/kubeflow/training-operator/pull/1980) by daniel-hutao)
- torchrun example with cpu version pytorch ([1965](https://github.com/kubeflow/training-operator/pull/1965) by kuizhiqing)
- utils changes needed to add train api ([1954](https://github.com/kubeflow/training-operator/pull/1954) by deepanker13)
- Adding parallel support for coveralls ([1956](https://github.com/kubeflow/training-operator/pull/1956) by johnugeorge)
- chore: pkg import only once ([1950](https://github.com/kubeflow/training-operator/pull/1950) by testwill)
- fix nproc env in elastic mode for pytorchjob ([1948](https://github.com/kubeflow/training-operator/pull/1948) by kuizhiqing)
- Avoid modifying log level globally ([1944](https://github.com/kubeflow/training-operator/pull/1944) by droctothorpe)
- Add andreyvelich to Approvers ([1941](https://github.com/kubeflow/training-operator/pull/1941) by andreyvelich)
- Merge v1.7 branch changes to Main ([1940](https://github.com/kubeflow/training-operator/pull/1940) by johnugeorge)
- Increase the root volume size on the github runner when building container images ([1931](https://github.com/kubeflow/training-operator/pull/1931) by tenzen-y)
- Check podGroup CRD for the volcano and the scheduler-plugins as default. ([1929](https://github.com/kubeflow/training-operator/pull/1929) by Syulin7)
- Use a community hosted image in MXJob E2E ([1928](https://github.com/kubeflow/training-operator/pull/1928) by tenzen-y)
- Build MXJob examples in CI ([1927](https://github.com/kubeflow/training-operator/pull/1927) by tenzen-y)
- Bump `k8s.io/*` deps to 1.28 ([1920](https://github.com/kubeflow/training-operator/pull/1920) by afritzler)
- Replace XGBoost image for E2E with community hosted ([1922](https://github.com/kubeflow/training-operator/pull/1922) by tenzen-y)
- Creating service account where appropriate for MPI Job ([1917](https://github.com/kubeflow/training-operator/pull/1917) by rpemsel)
- Build XGBoostJob example images in CI ([1913](https://github.com/kubeflow/training-operator/pull/1913) by tenzen-y)
- Manage kube-delivery image from training-operator and update it ([1909](https://github.com/kubeflow/training-operator/pull/1909) by rpemsel)
- Adding Yuki to Approvers ([1901](https://github.com/kubeflow/training-operator/pull/1901) by johnugeorge)
- docs: Remove reference to tf-operator specific design doc ([1903](https://github.com/kubeflow/training-operator/pull/1903) by terrytangyuan)
- Add Training WG Community Call ([1900](https://github.com/kubeflow/training-operator/pull/1900) by andreyvelich)
- update full change list in changelog ([1895](https://github.com/kubeflow/training-operator/pull/1895) by lowang-bh)
- update volcano scheduler to 1.8.0 ([1894](https://github.com/kubeflow/training-operator/pull/1894) by lowang-bh)
- Changelog updated for 1.7.0 rc0 release ([1892](https://github.com/kubeflow/training-operator/pull/1892) by johnugeorge)
- Add Stale GitHub Action ([1893](https://github.com/kubeflow/training-operator/pull/1893) by andreyvelich)
- Refactor core/pod tests ([1890](https://github.com/kubeflow/training-operator/pull/1890) by tenzen-y)
- Remove klog v1 ([1886](https://github.com/kubeflow/training-operator/pull/1886) by tenzen-y)

New Contributors

- ckyuto made their first contribution in 2079
- shruti2522 made their first contribution in 2077
- kellyaa made their first contribution in 2071
- qzoscar made their first contribution in 2062
- franciscojavierarceo made their first contribution in 2061
- tariq-hasan made their first contribution in 2028
- champon1020 made their first contribution in 2024
- garymm made their first contribution in 2011
- PeterWrighten made their first contribution in 2018
- jdcfd made their first contribution in 2004
- daniel-hutao made their first contribution in 1980
- shashank-iitbhu made their first contribution in 1995
- double12gzh made their first contribution in 1952
- testwill made their first contribution in 1950
- deepanker13 made their first contribution in 1938
- droctothorpe made their first contribution in 1944
- afritzler made their first contribution in 1920
- rpemsel made their first contribution in 1909

1.8.0rc.0

**New features**

- Train/Fine-tune API Proposal for LLMs [1945](https://github.com/kubeflow/training-operator/pull/1945) ([deepanker13](https://github.com/deepanker13))
- Adding Training image needed for train api [1963](https://github.com/kubeflow/training-operator/pull/1963) ([deepanker13](https://github.com/deepanker13))
- [SDK] Train API [1962](https://github.com/kubeflow/training-operator/pull/1962) ([deepanker13](https://github.com/deepanker13))
- Train api dataset download changes [1959](https://github.com/kubeflow/training-operator/pull/1959) ([deepanker13](https://github.com/deepanker13))
- Train api init container creation [1958](https://github.com/kubeflow/training-operator/pull/1958) ([deepanker13](https://github.com/deepanker13))
- Publish trainer hugging face image [1985](https://github.com/kubeflow/training-operator/pull/1985) ([deepanker13](https://github.com/deepanker13))
- Support arm64 for Hugging Face trainer [2028](https://github.com/kubeflow/training-operator/pull/2028) ([tariq-hasan](https://github.com/tariq-hasan))
- Modify LLM Trainer to support BERT and Tiny LLaMA [2031](https://github.com/kubeflow/training-operator/pull/2031) ([andreyvelich](https://github.com/andreyvelich))
- Implement webhook validations for the PyTorchJob [2035](https://github.com/kubeflow/training-operator/pull/2035) ([tenzen-y](https://github.com/tenzen-y))
- Implement webhook validations for the XGBoostJob [2052](https://github.com/kubeflow/training-operator/pull/2052) ([tenzen-y](https://github.com/tenzen-y))
- Implement webhook validation for the TFJob [2051](https://github.com/kubeflow/training-operator/pull/2051) ([tenzen-y](https://github.com/tenzen-y))
- Implement webhook warnings for the MXJob [2058](https://github.com/kubeflow/training-operator/pull/2058) ([tenzen-y](https://github.com/tenzen-y))
- Implement webhook validations for the PaddleJob [2057](https://github.com/kubeflow/training-operator/pull/2057) ([tenzen-y](https://github.com/tenzen-y))
- Fail job for non-retryable exit codes [2071](https://github.com/kubeflow/training-operator/pull/2071) ([kellyaa](https://github.com/kellyaa))
- Adding fine tune example with s3 as the dataset store [2006](https://github.com/kubeflow/training-operator/pull/2006) ([deepanker13](https://github.com/deepanker13))

**Bug fixes**
- fix nproc env in elastic mode for pytorchjob [1948](https://github.com/kubeflow/training-operator/pull/1948) ([kuizhiqing](https://github.com/kuizhiqing))
- IsMasterRole fix in pytorchjob controller [1969](https://github.com/kubeflow/training-operator/pull/1969) ([deepanker13](https://github.com/deepanker13))
- fix: volcano podgroup should has a non-empty queue name [1977](https://github.com/kubeflow/training-operator/pull/1977) ([lowang-bh](https://github.com/lowang-bh))
- Fix Master Label for PyTorchJob [1974](https://github.com/kubeflow/training-operator/pull/1974) ([andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix Worker and Master templates for PyTorchJob [1988](https://github.com/kubeflow/training-operator/pull/1988) ([andreyvelich](https://github.com/andreyvelich))
- Fix import for HuggingFace Dataset Provider [2085](https://github.com/kubeflow/training-operator/pull/2085) ([andreyvelich](https://github.com/andreyvelich))
- Upgrade controller-gen to v0.14.0 [2026](https://github.com/kubeflow/training-operator/pull/2026) ([champon1020](https://github.com/champon1020))
- Fix Distributed Data Samplers in PyTorch Examples [2012](https://github.com/kubeflow/training-operator/pull/2012) ([andreyvelich](https://github.com/andreyvelich))
- Fix URL in python SDK setup.py [2011](https://github.com/kubeflow/training-operator/pull/2011) ([garymm](https://github.com/garymm))

**Misc**
- Adding parallel support for coveralls [1956](https://github.com/kubeflow/training-operator/pull/1956) ([johnugeorge](https://github.com/johnugeorge))
- torchrun example with cpu version pytorch [1965](https://github.com/kubeflow/training-operator/pull/1965) ([kuizhiqing](https://github.com/kuizhiqing))
- [SDK] Get Kubernetes Events for Job [1975](https://github.com/kubeflow/training-operator/pull/1975) ([andreyvelich](https://github.com/andreyvelich))
- Fix Master Label for PyTorchJob [1974](https://github.com/kubeflow/training-operator/pull/1974) ([andreyvelich](https://github.com/andreyvelich))
- [SDK] Add information about TrainingClient logging [1973](https://github.com/kubeflow/training-operator/pull/1973) ([andreyvelich](https://github.com/andreyvelich))
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode [2067](https://github.com/kubeflow/training-operator/pull/2067) ([tenzen-y](https://github.com/tenzen-y))
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 [2066](https://github.com/kubeflow/training-operator/pull/2066) ([tenzen-y](https://github.com/tenzen-y))
- Test: Simplify and Identify pod-controller envtest [2084](https://github.com/kubeflow/training-operator/pull/2084) ([tenzen-y](https://github.com/tenzen-y))
- E2E: Replace outdated images with latest ones [2083](https://github.com/kubeflow/training-operator/pull/2083) ([tenzen-y](https://github.com/tenzen-y))
- Upgrade scheduler-plugins to v0.28.9 [2065](https://github.com/kubeflow/training-operator/pull/2065) ([tenzen-y](https://github.com/tenzen-y))

1.7.0

**Breaking Changes**
- Make scheduler-plugins the default gang scheduler. [1747](https://github.com/kubeflow/training-operator/pull/1747) ([Syulin7](https://github.com/Syulin7))
- Upgrade the kubernetes dependencies to v1.27 [1834](https://github.com/kubeflow/training-operator/pull/1834) ([tenzen-y](https://github.com/tenzen-y))

**New features**
- Make scheduler-plugins the default gang scheduler. [1747](https://github.com/kubeflow/training-operator/pull/1747) ([Syulin7](https://github.com/Syulin7))
- Merge kubeflow/common to training-operator [1813](https://github.com/kubeflow/training-operator/pull/1813) ([johnugeorge](https://github.com/johnugeorge))
- Auto-generate RBAC manifests by the controller-gen [1815](https://github.com/kubeflow/training-operator/pull/1815) ([Syulin7](https://github.com/Syulin7))
- Implement suspend semantics [1859](https://github.com/kubeflow/training-operator/pull/1859) ([tenzen-y](https://github.com/tenzen-y))
- Set up controllers using goroutines to start the manager quickly [1869](https://github.com/kubeflow/training-operator/pull/1869) ([tenzen-y](https://github.com/tenzen-y))
- Set correct ENV for PytorchJob to support torchrun [1840](https://github.com/kubeflow/training-operator/pull/1840) ([kuizhiqing](https://github.com/kuizhiqing))

**Bug fixes**
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed [1866](https://github.com/kubeflow/training-operator/pull/1866) ([tenzen-y](https://github.com/tenzen-y))
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition [1789](https://github.com/kubeflow/training-operator/pull/1789) ([tenzen-y](https://github.com/tenzen-y))
- Avoid to depend on local env when installing the code-generators [1810](https://github.com/kubeflow/training-operator/pull/1810) ([tenzen-y](https://github.com/tenzen-y))

**Misc**
- Removing reconciler code [1879](https://github.com/kubeflow/training-operator/pull/1879) ([johnugeorge](https://github.com/johnugeorge))
- Make Condition and ReplicaStatus optional [1862](https://github.com/kubeflow/training-operator/pull/1862) ([tenzen-y](https://github.com/tenzen-y))
- Use the same reasons for Condition and Event [1854](https://github.com/kubeflow/training-operator/pull/1854) ([tenzen-y](https://github.com/tenzen-y))
- Fully consolidate tfjob-operator to training-operator [1850](https://github.com/kubeflow/training-operator/pull/1850) ([tenzen-y](https://github.com/tenzen-y))
- Clean up /pkg/common/util/v1 [1845](https://github.com/kubeflow/training-operator/pull/1845) ([tenzen-y](https://github.com/tenzen-y))
- Refactoring tests in common/controller.v1 [1843](https://github.com/kubeflow/training-operator/pull/1843) ([tenzen-y](https://github.com/tenzen-y))
- remove duplicate code of add task spec annotation [1839](https://github.com/kubeflow/training-operator/pull/1839) ([lowang-bh](https://github.com/lowang-bh))
- fetch volcano log when e2e failed [1837](https://github.com/kubeflow/training-operator/pull/1837) ([lowang-bh](https://github.com/lowang-bh))
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e [1835](https://github.com/kubeflow/training-operator/pull/1835) ([tenzen-y](https://github.com/tenzen-y))
- Replace dummy client with fake client [1818](https://github.com/kubeflow/training-operator/pull/1818) ([tenzen-y](https://github.com/tenzen-y))
- Add default Intel MPI env variables to MPIJob [1804](https://github.com/kubeflow/training-operator/pull/1804) ([tkatila](https://github.com/tkatila))
- Improve E2E tests for the gang-scheduling [1801](https://github.com/kubeflow/training-operator/pull/1801) ([tenzen-y](https://github.com/tenzen-y))
- xgb yaml container name should be consistent with xgb job default container name [1794](https://github.com/kubeflow/training-operator/pull/1794) ([Crisescode](https://github.com/Crisescode))
- make timeout configurable from e2e tests [1787](https://github.com/kubeflow/training-operator/pull/1787) ([nagar-ajay](https://github.com/nagar-ajay))
