Skypilot

Latest version: v0.8.0


0.8.0

We’re thrilled to release SkyPilot v0.8.0! This update makes SkyPilot faster and more robust, with major improvements to Managed Jobs, Kubernetes support, and new cloud integrations.

Highlights

* **Faster Managed Jobs:** 3x faster job submission, controller uses 37% less memory, and support for 2000+ concurrent jobs
* **Faster Provisioning:** Kubernetes provisioning is 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds. `sky launch` on existing clusters is 5x faster when using the `--fast` flag.
* **Intermediate buckets for managed jobs:** [bring your own buckets](https://docs.skypilot.co/en/latest/examples/managed-jobs.html#intermediate-storage-for-files) to be used as intermediate storage for managed jobs.

      # ~/.sky/config.yaml
      jobs:
        bucket: s3://my-bucket

* **Exciting new features in SkyServe:**
  * SkyServe load balancer now supports TLS via HTTPS
  * New `load_balancing_policy` field to choose from multiple policies (round_robin, least_load)
  * Replica can now expose multiple ports
* **New clouds:** Digital Ocean and [Vast](https://docs.skypilot.co/en/latest/getting-started/installation.html#vast)
* **New LLM Recipes:** [DeepSeek R1](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1), [Janus](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-janus), and [minGPT with PyTorch Distributed](https://github.com/skypilot-org/skypilot/tree/master/examples/distributed-pytorch)

Managed Jobs

* Managed jobs scheduler has been reworked: 3x faster, uses 37% less memory and can support up to 2000 jobs running simultaneously (4318, 4485, 4341)
* Brand new look for the managed jobs dashboard, with new filters, log download, and failover history (4253, 4644, 4638)
<img width="750" src="https://github.com/user-attachments/assets/1797b73c-c30a-48c2-be71-5056b1e3af32" alt="Managed jobs dashboard"/>
* You can now bring your own bucket to act as the [intermediate storage](https://docs.skypilot.co/en/latest/examples/managed-jobs.html#intermediate-storage-for-files) for managed jobs (4257)
  * If no intermediate bucket is specified, we now create one bucket per job instead of one per `file_mount`/`workdir`.
* `sky jobs logs` has a new flag `--sync-down` to download logs to local machine (4527)
* When fetching managed jobs logs, SkyPilot will autostart the jobs controller if it is not running (4380)
* Robustness of managed jobs is greatly improved (4247, 4283, 4562, 4602, 4615)

Backend

* `sky launch` on existing clusters is 5x faster when using the `--fast` flag. We have reworked the provisioning logic to be more efficient when reusing clusters (4328, 4289)
* We now use `uv` under the hood for 3x faster setup phase (4414)
* Beefed up resource leak protection (4443, 4267)
* Skylet scheduler is 2x faster (4264)
* New `remote_identity: NO_UPLOAD` option to skip uploading credentials to the remote VM (4307)
* Other robustness improvements (4227, 4290, 4310, 4390, 4488)


Kubernetes

* Multi-node setup is now up to 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds (4297, 4240, 4393)
* TPUs (Single-host) on GKE are now supported on fixed and autoscaling node pools (3947)
* `sky check` now shows enabled contexts (4587)
<img width="400" alt="image" src="https://github.com/user-attachments/assets/1ce7155f-cedd-4e74-8be3-8d0a46ce029b" />
* SkyPilot no longer has a dependency on `lsof` in k8s environments (4304)
* `sky show-gpus --cloud kubernetes` now handles limited permissions gracefully (4208)
* Both in-cluster (service account based) and kubeconfig auth are now supported concurrently (4188)
* Custom GPU resource names are supported with `CUSTOM_GPU_RESOURCE_NAME` environment variable (4337)
* Fixed a bug with SSH on IPv6 dual stack clusters (4497)
* Fixed a bug with L40 detection when using `nvidia.com/product` labels (4511)
* `pod_config` specified in `config.yaml` is now validated before launching clusters (4466)
* Other performance and robustness improvements (4398, 4415, 4420, 4425, 4429, 4452, 4469, 4514, 4505, 4558, 4561, 4437)


CLI & Core interfaces

* `sky logs` has a new `--tail` parameter to stream job logs (4241)
* `sky.jobs.launch` from the Python API now returns the job id (4620)

SkyServe

* SkyServe now supports choosing a load balancing policy to be used by the service (4439)

      service:
        load_balancing_policy: round_robin  # round_robin, least_load


| Policy | Description |
|--------|-------------|
| `least_load` | (New default) Routes requests to replicas with the lowest current load, optimizing for latency and throughput |
| `round_robin` | Distributes requests evenly across all replicas in a circular order |
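
As a rough illustration of how the two policies in the table differ, here is a minimal Python sketch. It is not SkyServe's implementation: the function names and replica names are hypothetical, and "load" is assumed to mean the number of in-flight requests per replica.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(replicas: List[str]) -> Iterator[str]:
    """Yield replicas forever in a fixed circular order."""
    return itertools.cycle(replicas)


def least_load(in_flight: Dict[str, int]) -> str:
    """Return the replica with the fewest in-flight requests.

    Ties go to the replica that appears first in the mapping.
    """
    return min(in_flight, key=in_flight.get)


# Hypothetical replica names, for illustration only.
picker = round_robin(["replica-1", "replica-2", "replica-3"])
first_four = [next(picker) for _ in range(4)]
least_busy = least_load({"replica-1": 5, "replica-2": 1, "replica-3": 3})
```

Round robin ignores how busy each replica is, while least load needs a per-replica counter; that bookkeeping is the price for better behavior when request durations vary widely.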

* Improved security with TLS support on the load balancer (3380)
* You can now expose multiple ports on replicas: useful for running monitoring, UI or other services on the replicas (4356)

New LLM recipes

* DeepSeek R1 (4603) and DeepSeek Janus (4611)
* minGPT with PyTorch Distributed (4464)


Cloud-specific enhancements

* AWS:
  * Additional auto-update services are now disabled for Ubuntu images with cloud-init (4252)
  * Added an AWS assume-role option and environment variable detection (4550)
  * Credentials are no longer uploaded when using service account auth (4395)
  * Custom process-based auth is now supported (4547)
  * SkyPilot now uses only the specified VPC or the default VPC; no other VPCs are used unless specified (4546)
* GCP: Fixed an issue where the service account was not activated for accessing Google Cloud Storage on the controller; robustness improvements (4529, 4593)
* Azure: Support image ids tagged with `latest` and robustness improvements (4581, 4411, 4457)
* Fluidstack: H100 SXM5 support (4359)
* Lambda: Added support for GH200 and new regions (us-east-2, us-south-2, us-south-3) (4291, 4377)
* RunPod: support spot pods (4447) and private container registries (4287)
* OCI: Faster and new provisioner, support for SkyServe, default image has been upgraded to 22.04 LTS (4119, 4517)


Storage

* OCI object storage is now supported (4501)
* Fixed a bug where object stores were not being mounted when only object stores were specified in file_mounts (4317)

Docs

* Docs have been revamped: brand new Overview page explaining core concepts (4342), improved structuring (4664), docs for multi-k8s (4586), and more!

⚠️ Deprecation notice
* LocalDockerBackend is deprecated. To run [locally](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-deployment.html#kubernetes-setup-kind), use `sky local up` to set up a local k8s cluster.
* `sky spot` CLI is now removed. Use `sky jobs launch --use-spot` to launch spot instances.

Thanks to all contributors!

New contributors: weih1121, clayrosenthal, manbeardave, bend, nkwangleiGIT, kristopolous, sachiniyer, KeplerC, aylei, Yisaer, cbrownstein, chesterli29, sfrolich, AlexCuadron

Many thanks to all contributors who contributed to this release!

Contributors: romilbhardwaj, cg505, Michaelvll, zpoint, HysunHe, cblmemo, andylizf, concretevitamin, KeplerC, yika-luo, cbrownstein, weih1121, nkwangleiGIT, aylei, clayrosenthal, sethkimmel3, landscapepainter, Conless, sfrolich, AlexCuadron, shashank2000, mjibril, asaiacai, chesterli29, Yisaer, sachiniyer, manbeardave, bend, kristopolous


**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.7.0...v0.8.0

0.7.0

We are excited to announce the release of SkyPilot v0.7.0! This release brings significant performance improvements and many new features:

* Up to 3x faster provisioning
* Reservation support: AWS Capacity Reservations, AWS Capacity Blocks, GCP reservations, GCP Dynamic Workload Scheduler (DWS), and more
* Observability features
* Admin policy enforcement
* Support for H100 Mega, TPU v6, TPU v5, gVNIC, Azure Blob Storage, faster disks, and more
* New UX for `sky` CLI

and many bug fixes and enhancements!

Release Highlights

Performance
We have made 2-3x performance improvements across cloud providers through optimizations in our provisioning stack and the images we use.

| Cloud | Provisioning Time | Speedup |
|----------------|-------------------|---------|
| AWS | 1 min 10s | 3x |
| GCP | 1 min 15s | 3x |
| Azure | 2 min 16s | 2x |
| Kubernetes | 52s | 2.5x |


Reservations
SkyPilot now supports [short-term and long-term reservations](https://skypilot.readthedocs.io/en/latest/reservations/reservations.html) across clouds:

* AWS Capacity Reservations
* AWS Capacity Blocks
* GCP reservations
* GCP Dynamic Workload Scheduler (DWS)
* Bring your own [VMs](https://skypilot.readthedocs.io/en/latest/reservations/existing-machines.html) or [Kubernetes clusters](https://skypilot.readthedocs.io/en/latest/reference/kubernetes/index.html)

SkyPilot's failover includes these reservations, so they can be combined with spot instances or any other resources/clouds to create a resilient and cost-effective infrastructure.

Observability on Kubernetes

SkyPilot now has two new observability features on Kubernetes:
* `sky status --kubernetes` shows all SkyPilot resources on the cluster. (4040, 4079)

      $ sky status --cloud kubernetes
      Kubernetes cluster state (context: mycluster)
      SkyPilot clusters
      USER   NAME                           LAUNCHED    RESOURCES                                  STATUS
      alice  infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
      alice  sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
      alice  sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
      bob    dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
      bob    multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
      bob    sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP

      Managed jobs
      In progress tasks: 1 STARTING
      USER   ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  RECOVERIES  STATUS
      alice  1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0           SUCCEEDED
      bob    4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0           SUCCEEDED
      bob    3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0           STARTING
      bob    2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0           FAILED
      bob    1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0           SUCCEEDED

* `sky show-gpus --cloud kubernetes` shows detailed GPU availability information on the cluster. (3816, 4085)

      $ sky show-gpus --cloud kubernetes
      Kubernetes GPUs
      GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
      L4    1, 2, 4                   8           8
      H100  1, 2, 4, 8                16          16

      Kubernetes per node GPU availability
      NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
      my-cluster-0  L4        4           4
      my-cluster-1  L4        4           4
      my-cluster-2  H100      8           8
      my-cluster-3  H100      8           8


Admin policy enforcement

SkyPilot has a new [admin policy mechanism](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html) (#3966) that admins can use to enforce policies on users’ SkyPilot usage. These policies apply custom validation and mutation logic to a user’s tasks and SkyPilot config.

Example policies:
* [Add Labels for all Tasks on Kubernetes](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#kubernetes-labels-policy)
* [Always Disable Public IP for AWS Tasks](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#disable-public-ip-policy)
* [Use Spot for all GPU Tasks](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#use-spot-for-gpu-policy)
* [Enforce Autostop for all Tasks](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#enforce-autostop-policy)
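
To make the "validation and mutation" idea concrete, here is a minimal, self-contained sketch in plain Python. It deliberately does not use SkyPilot's admin policy interface (see the linked docs for that): `TaskRequest`, `enforce_policy`, and the 60-minute limit are hypothetical stand-ins that combine an autostop check with a use-spot-for-GPU rewrite.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TaskRequest:
    """Hypothetical stand-in for a user's task; not a SkyPilot class."""
    name: str
    use_spot: bool
    accelerators: Optional[str]
    autostop_minutes: Optional[int]


def enforce_policy(task: TaskRequest, max_idle_minutes: int = 60) -> TaskRequest:
    """Validate a task, then return a (possibly) mutated copy of it."""
    # Validation: reject tasks with no autostop or one that is too long.
    if task.autostop_minutes is None or task.autostop_minutes > max_idle_minutes:
        raise ValueError(
            f"Task {task.name!r} must set autostop to at most "
            f"{max_idle_minutes} minutes.")
    # Mutation: GPU tasks are rewritten to use spot instances.
    if task.accelerators is not None and not task.use_spot:
        task = replace(task, use_spot=True)
    return task


# A GPU task is accepted and switched to spot; a CPU task is left unchanged.
gpu_task = enforce_policy(TaskRequest("train", False, "A100:1", 30))
cpu_task = enforce_policy(TaskRequest("etl", False, None, 10))
```

Returning a new object instead of editing the request in place keeps the user's original request available for error messages and auditing.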


Azure Blob Storage support
In addition to S3, GCS and R2, you can now use Azure Blob Storage as a storage backend for storing and accessing data. (3032)

New AI hardware support
* New accelerators: TPU v6 (4115), TPU v5 (3814), H100 Mega (4099)
* Faster networking on GCP with gVNIC (4095)
* Faster disks: new disk tier `ultra` (3860) for GCP and AWS.

UX revamp
SkyPilot CLI is cleaner, simpler and even easier to parse now (4023)

<img src="https://i.imgur.com/fg8tOYq.gif" width="600"/>

New LLM Recipes

* Llama 3.1 and Llama 3.2 recipes (3990, 3779, 3780)
* llm.c training for GPT 2 (3611)
* Pixtral (3938, 3940)
* Qwen2-VL and Qwen 2.5 support (3961, 3959)
* Yi model family support (3958)
* Nemo GPT (3743)
* Other examples: Airflow (3982), AWS Neuron Accelerator (4020), and Deepspeed with k8s support (4124)

Deprecation Notice

* All `SKY_*` environment variables are deprecated in favor of `SKYPILOT_*` variables.
* All `SKY_*` variables will be removed in v0.9.0.
* See [docs](https://skypilot.readthedocs.io/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) for list of currently supported variables.
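
For user code that has to run on both older and newer SkyPilot versions during the deprecation window, one option is to prefer the new name and fall back to the old one. The sketch below is an assumption-laden illustration: `read_skypilot_env` is a hypothetical helper, and the `NODE_RANK` suffix is only an example, so check the linked docs for the variables that actually exist.

```python
import os
import warnings
from typing import Mapping, Optional


def read_skypilot_env(suffix: str,
                      env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Read SKYPILOT_<suffix>, falling back to the deprecated SKY_<suffix>."""
    new_name = f"SKYPILOT_{suffix}"
    old_name = f"SKY_{suffix}"
    if new_name in env:
        return env[new_name]
    if old_name in env:
        # Warn so the fallback does not silently outlive the removal.
        warnings.warn(f"{old_name} is deprecated; use {new_name} instead.",
                      DeprecationWarning, stacklevel=2)
        return env[old_name]
    return None


# Example with an explicit mapping instead of the real environment.
rank = read_skypilot_env("NODE_RANK", {"SKYPILOT_NODE_RANK": "1"})
```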


Backend

New Features

* Managed jobs can now recover from job-level failures (e.g., GPU errors, non-zero exit codes, etc.) (3919)
  * Set `max_restarts_on_errors` to specify the number of times SkyPilot should try to restart the job.

      resources:
        job_recovery:
          max_restarts_on_errors: 3  # Retry 3 times before marking the job as failed

* Nvidia GPUs can now disable ECC (3676)
* New environment variable `SKYPILOT_NUM_NODES` to fetch the number of nodes in the cluster. (3656)
* SkyPilot [config can now be overridden](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html#experimental-configurations) in the task definition with `experimental.config_override` (3689)

      experimental:
        config_override:
          docker:
            run_options: ...
          kubernetes:
            pod_config: ...
            provision_timeout: ...
          gcp:
            managed_instance_group: ...
          nvidia_gpus:
            disable_ecc: ...
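
The `max_restarts_on_errors` behavior described above can be pictured as a bounded retry loop. This is a sketch, not the managed jobs controller's code: `run_with_restarts` is a hypothetical helper, and it assumes that a value of 3 means one initial run plus up to three restarts.

```python
from typing import Callable


def run_with_restarts(job: Callable[[], int],
                      max_restarts_on_errors: int) -> bool:
    """Run `job` until it exits with 0 or the restart budget is spent.

    `job` returns a process-style exit code. Returns True on success and
    False once the job should be marked as failed.
    """
    # Assumption: one initial attempt plus `max_restarts_on_errors` restarts.
    attempts_allowed = 1 + max_restarts_on_errors
    for _ in range(attempts_allowed):
        if job() == 0:
            return True
    return False


# A job that fails twice (e.g. on a GPU error) and then succeeds.
exit_codes = iter([1, 1, 0])
recovered = run_with_restarts(lambda: next(exit_codes), max_restarts_on_errors=3)
```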



Enhancements

* `AddKeysToAgent` is now set for SSH keys in both the SSH config file and the `ssh` command (https://github.com/skypilot-org/skypilot/pull/3985)
* SkyPilot runtime is now installed in a separate conda environment, reducing interference with user's environment. (3639)
* Similarly, the environment pre-configured in your docker image is no longer shadowed by SkyPilot's runtime environment (3874, 3867)
* `docker.run_options` now allows users to pass additional options when running docker containers. (3682)

Fixes
* Fix `sky cancel` not terminating all child processes (3919)
* Fix provisioning failures when multiple versions of SkyPilot are installed (3866)
* Shell autocomplete installation is now more robust (3892, 3893)


Kubernetes

New Features

* Observability improvements:
  * `sky status --cloud kubernetes` shows all SkyPilot resources on the Kubernetes cluster. (4040, 4079)
  * `sky show-gpus --cloud kubernetes` shows detailed GPU availability information on the cluster. (3816, 4085)
* SkyPilot now helps you set up your clusters for running SkyPilot jobs.
  * If you already have a list of IPs and their SSH keys, `sky local up` can now [automatically set it up as a cluster](https://skypilot.readthedocs.io/en/latest/reservations/existing-machines.html) to be used for running jobs. (#3926)
  * If you don't have a cluster yet, we provide a simple [one-click setup script](https://github.com/skypilot-org/skypilot/tree/master/examples/k8s_cloud_deploy) to deploy VMs with Kubernetes on the cloud of your choice (#3929).
* SkyPilot job output is now piped to the container logs (3758)
  * Use your existing logging tooling (`kubectl logs`, filebeat, etc.) to view SkyPilot job outputs.
* Support for Nvidia GPU operator labels (`nvidia.com/gpu.product`) for detecting GPU types. (3493)
  * You no longer need to label GPUs if you have the Nvidia GPU operator installed.
* Spot instances are now supported on GKE clusters (3675)
* [Experimental] Multi-context support (3913, 3968, 3897, 3772, 4013)

Performance improvements:
* New command runner: 3x faster command submission for Kubernetes pods. (3157)
* `sky local up` for GPUs is now ~5x faster, provisioning in 2min 30s instead of 12min (3664)
* Our GPU images are now 3x smaller (1.5 GB), reducing the time to pull the image (3665)
* SSH jump pod is no longer required for `port-forward` mode (3657)
* SSH setup is now parallelized to speed up multi-node provisioning (4158)

Enhancements and fixes
* H100 Mega support on GKE (3891, 3627)
* Better handling for context names with special characters (4147)
* `--k8s` is now a valid alias for `--cloud kubernetes` (4151)
* Init containers are now supported on Kubernetes (3762)
* Auth: robust service account support and updated docs on minimal permissions (3632)
* Custom metadata annotations are now propagated to services, allowing configuration of internal load balancer services on cloud hosted Kubernetes clusters (3767)
* Provisioning errors are now surfaced clearly (3590, 3795, 3821)
* Cluster attributes (autodown, idle-minutes-to-autostop) are now added as annotations to the pod (3870)
* SkyServe controller is now automatically terminated when all replicas are terminated. (3984)
* Create namespace permission is no longer required in cluster launch flow (3714)
* If your cluster does not support `apparmor`, SkyPilot will now retry without requesting it. (4176)


Cloud: GCP

New Features
* New accelerators supported:
  * H100 Mega (4099)
  * TPU v5 (3814)
* [Dynamic Workload Scheduler (DWS)](https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler) support (#3574, 3835)
  * DWS helps get better availability on GCE through queuing and reservations.
* Faster `pd-extreme` disks with `disk_tier: ultra` (3860)
* New config `gcp.force_enable_external_ips` to force enable external IPs (3699)
  * This is useful when communication within a VPC is desired and the VM needs to make calls to the public internet.
* TPU VMs can now run docker containers (4115)

Enhancements

* Provisioning is now 3x faster on GCP (4027)
* Faster networking support with gVNIC (4095)
  * Up to ~2x faster in [pytorch distributed benchmarks](https://gist.github.com/romilbhardwaj/89f8399d8a5307df5d880cf1495ce957)


Cloud: AWS

New Features

* Capacity blocks and capacity reservations are now supported. (3852, 3853)
  * You don’t have to wake up at 4:30am PDT to launch your job on a newly available capacity block: SkyPilot will wait for you until the start time of the capacity block.
* Faster `io2` disks with `disk_tier: ultra` (3860)
* Security groups: you can now specify security groups for your resources at a finer granularity. (3501)
* SkyPilot can now use encrypted EBS volumes (3765)

Enhancements

* Performance: provisioning now 3x faster on AWS (4091)
* Buckets created by SkyPilot are now tagged with labels specified in ~/.sky/config.yaml (3922)
* Label validation now handles `:` and other special characters. (3734)


Cloud: Azure

New Features
* You can now use any Azure community image with `--image-id` (4145)
* Azure Blob Storage is now supported (3032, 3796, 3807)
* Fractional A10 instance types are now supported (3877)
* You can now specify resource group for Azure instance provisioning (3764)
* Faster `Premium_LRS` disks with `disk_tier: high` (3921)

Enhancements

* Performance: provisioning is now 2x faster on Azure with our new provisioner and custom images (3697, 3704, 3696, 3700, 4139, 4167, 4205)
* Improved support for A10 GPUs (3707)
* SkyPilot now waits for the Azure resource group to be deleted instead of erroring out (3712)

SkyServe
* Readiness probe timeout can now be set in the service spec (3472)
* You can now tear down a specific replica with `sky serve down --replica-id` (4032)
* SkyServe controller region is now chosen from the replica resources (4053)


Storage
* Azure Blob Storage is now supported. (3032, 3796, 3807)
* [`.skyignore` support](https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files) (4038)
  * You can now add files to a `.skyignore` file to skip uploading them to cloud storage.
* GCSFuse is updated to 2.2.0, bringing better performance and reliability. (3619)
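
As a rough mental model for the `.skyignore` behavior above, the sketch below filters a file list against ignore patterns. It is only an approximation under stated assumptions: `filter_uploads` is a hypothetical helper, and it uses Python's shell-style `fnmatch` wildcards, which may not match the exact pattern rules SkyPilot applies (see the linked docs).

```python
import fnmatch
from typing import Iterable, List


def filter_uploads(paths: Iterable[str],
                   ignore_lines: Iterable[str]) -> List[str]:
    """Return the paths that no ignore pattern matches.

    Blank lines and lines starting with '#' are skipped. Note that with
    fnmatch, '*' also matches '/' characters.
    """
    patterns = [
        line.strip() for line in ignore_lines
        if line.strip() and not line.strip().startswith("#")
    ]
    return [
        path for path in paths
        if not any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Hypothetical workdir contents and ignore file lines.
kept = filter_uploads(
    ["train.py", "models/a.ckpt", "data/raw.bin", "README.md"],
    ["# large artifacts", "*.ckpt", "data/*", ""],
)
```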


Other clouds

* Lambda Cloud support has been migrated to our new and more reliable provisioner (3865, 3889)
* Lambda Cloud now supports docker images (4115)
* CUDO now supports opening ports (3717)
* RunPod now supports opening ports (3748) and custom docker images (3728).
* FluidStack provisioning has been updated to their new API (3799)
* Paperspace now supports A4000 and P4000 GPUs (3991)
* OCI: bug fixes and improvements (4074, 4080)

Thanks to all contributors!

New contributors: winglian, Ultramann, jucor, BitPhinix, sethkimmel3, hyoxt121, BabyChouSr, wizenheimer, gurcangercek, shashank2000, ckgresla, bernardwin, kmushegi, Conless, JayThomason, colinjc, mtaran, Haijian06, KrishivPiduri, zpoint

Many thanks to all contributors who contributed to this release!

Contributors: Michaelvll, romilbhardwaj, cblmemo, landscapepainter, asaiacai, andylizf, yika, concretevitamin, colinjc, fozziethebeat, MaoZiming, JGSweets, Ultramann, Conless, jucor, wizenheimer, Haijian06, HysunHe, gurcangercek, bernardwin, JungleCatSW, BabyChouSr, hyoxt121, winglian, sethkimmel3, mjibril, shashank2000, ckgresla, zpoint, mtaran, KrishivPiduri, JayThomason, BitPhinix, kmushegi

**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.6.0...v0.7.0

0.6.1

This patch release brings many improvements and fixes to SkyPilot, including major performance improvements for Kubernetes and Azure and new features for AWS and GCP.

Stay tuned for a detailed changelog coming up in v0.7.0!

0.6.0

We are excited to release SkyPilot v0.6.0! This release includes a number of new features:
* [Managed Jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html) for job execution and recovery
* SkyServe and Jobs on Kubernetes
* Mix on-demand and spot instances in SkyServe
* New cloud: [Paperspace](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#paperspace)


Release Highlights

Managed Jobs

* The spot controller has been enhanced to support any job on on-demand or spot instances.
* To use, run `sky jobs launch` instead of `sky spot launch`.
* The new job controller can automatically recover jobs from any spot preemptions or hardware failures, and also execute pipelines of jobs.
* The `sky jobs` API is identical to the `sky spot` API, but also supports on-demand instances.

SkyServe and Jobs on Kubernetes

* SkyPilot can now run SkyServe and Managed Job controllers on Kubernetes
* This means you can now run your SkyServe and Managed Jobs on your Kubernetes cluster!
* Simply run `sky jobs launch` or `sky serve up`, and SkyPilot will automatically deploy the controller on your Kubernetes cluster if available and run jobs on the cheapest available location.


Mix on-demand and spot instances in SkyServe

* SkyServe now supports a new intelligent policy for mixing spot and on-demand instances. [Example](https://github.com/skypilot-org/skypilot/blob/master/examples/serve/spot_policy/base_on_demand_fallback_replicas.yaml).
* Uses on-demand instances to ensure availability and spot instances to save costs.
* Dynamically falls back to on-demand replicas when spot replicas are not available. [Example](https://github.com/skypilot-org/skypilot/blob/master/examples/serve/spot_policy/dynamic_on_demand_fallback.yaml).
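
The two ideas above (a fixed on-demand base plus dynamic fallback) reduce to a small calculation. The sketch below is illustrative only: the function and parameter names are hypothetical and this is not SkyServe's autoscaler logic. It assumes the service wants `target` replicas in total and counts only ready spot replicas.

```python
def on_demand_replicas_needed(target: int, ready_spot: int,
                              base_on_demand: int,
                              dynamic_fallback: bool) -> int:
    """Number of on-demand replicas to run alongside the spot replicas."""
    base = min(base_on_demand, target)
    # Spot replicas are expected to cover everything above the base.
    spot_target = target - base
    # With dynamic fallback, on-demand replicas cover missing spot capacity.
    shortfall = max(spot_target - ready_spot, 0) if dynamic_fallback else 0
    return base + shortfall


# 4 replicas wanted, 1 on-demand base, only 1 of 3 spot replicas is ready.
needed = on_demand_replicas_needed(target=4, ready_spot=1,
                                   base_on_demand=1, dynamic_fallback=True)
```

When the spot replicas come back, the shortfall drops to zero and the extra on-demand replicas can be scaled down again, which is where the cost savings come from.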

Paperspace support
* Newest cloud to join the Sky: Paperspace!
* Paperspace offers the latest GPUs including H100 and A100-80GB for AI training and inference.
* Simply add your Paperspace API key to `~/.paperspace/config.json` and run `sky check paperspace` to [get started](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#paperspace).
* Big thanks to asaiacai for contributing Paperspace support!


More LLMs and Recipes

* New LLM Recipes: [Llama-3](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html), [Qwen](https://skypilot.readthedocs.io/en/latest/gallery/llms/qwen.html), [Ollama](https://skypilot.readthedocs.io/en/latest/gallery/frameworks/ollama.html), [DBRX](https://skypilot.readthedocs.io/en/latest/gallery/llms/dbrx.html), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Cog](https://github.com/skypilot-org/skypilot/pull/3219)


Deprecation Notes

The following features have been deprecated and will be removed in the next minor release:

* `sky spot` CLI: use `sky jobs` CLI instead.
* `core.spot_xxx` APIs: refactored to `jobs.xxx`.
* `qps_lower_threshold` and `auto_restart` in `service`: use `target_qps_per_replica` instead.

Changelog

Managed Jobs

* Changes made to the local catalog at ~/.sky/catalog are now reflected on the controller (3289)
* The name of the spot job is now included in the `SKYPILOT_TASK_ID` environment variable (3424)
* Legacy spot job APIs have been refactored from `core.spot_xxx` to `jobs.xxx` (3417)
* Cloud for the controller is now chosen based on the resources of the replicas (3363)
* Bug fixes (3302, 3397, 3459, 3468, 3480)

SkyServe

New Features

* New intelligent policy for mixing spot and on-demand instances in SkyServe (3194)
* **SkyServe now uses proxy** instead of HTTP redirect responses for better performance (3395)
* **Readiness probe now supports headers**: this is useful for authentication or other headers required for readiness checks (3552)

Enhancements

* Optimizations - replicas are reused when only service section is changed (3214)
* Rolling updates are now the default behavior for SkyServe (3249)
* Controller cloud is now chosen from replica resources if it is not already up (3231)
* Bug fixes and API improvements (3257, 3299, 3303, 3411, 3546)


Kubernetes

* Kubernetes clusters can now run SkyServe and Managed Jobs (3377, 3524, 3521)
* `sky show-gpus` now shows realtime availability of GPUs in the cluster (3499)
* Autoscaling Kubernetes clusters are now supported: SkyPilot can now wait for GKE node pools, Karpenter and other autoscalers to provision nodes (3513, 3415)
* Use Kubernetes service accounts by specifying `remote_identity` in ~/.sky/config.yaml (3377, 3527)
* `sky local up` now also automatically installs the Nginx Ingress Controller (3223)
* Support for specifying custom pod configurations with `pod_config` (3244)
  * Use this to modify the pod configuration for your environment, e.g., attaching volumes, specifying imagePullSecrets, increasing /dev/shm size limit, setting `HTTP_PROXY` and more! See [example `pod_config` here](https://skypilot.readthedocs.io/en/latest/reference/config.html).
* Support for specifying custom metadata to all Kubernetes resources created by SkyPilot (3333)
  * Useful for tracking resources created by SkyPilot in your Kubernetes cluster.
* Support for PodIP mode for exposing ports (3445)

Enhancements

* **GPU Isolation**: SkyPilot no longer uses privileged containers and pods can no longer use GPUs not allocated to them (3443)
* Ingress creation requests are now batched to minimize nginx reloads and ingress paths are namespaced (3263, 3373)
* All SkyPilot pods are now labelled with `skypilot-user` to identify the owner of the pod (3576)
* Special characters in environment variables are now correctly parsed (3322)
* GPU labelling is now more robust (3274)
* Bug fixes and quality of life improvements (3266, 3392, 3439, 3509, 3524, 3525, 3532, 3563, 3578, 3374)

CLI & Core interfaces

New Features

* `resources` now supports a `labels` field to set labels (instance tags on AWS, labels on GCP and Kubernetes) on cloud resources (3464, 3505)
* `sky check` now supports checking credentials for specific clouds, e.g. `sky check aws gcp` (3229)
  * You can also restrict which clouds are checked by setting `allowed_clouds` in `~/.sky/config.yaml`. (3556)
* `any_of` or `ordered` fields in `resources` can now have clouds that are not enabled (3567)
* A [new environment variable `SKYPILOT_CLUSTER_INFO`](https://skypilot.readthedocs.io/en/latest/running-jobs/environment-variables.html), containing cluster name, cloud, region and zone is now available in all tasks (#3424)
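
A task could read `SKYPILOT_CLUSTER_INFO` roughly as follows. This is a sketch under assumptions: it treats the variable as a JSON object with the keys `cluster_name`, `cloud`, `region`, and `zone`; the release note only lists which pieces of information it contains, so check the linked docs for the exact encoding. `cluster_info` is a hypothetical helper.

```python
import json
import os
from typing import Dict, Mapping, Optional

# Assumed key names, inferred from the fields listed in the release note.
FIELDS = ("cluster_name", "cloud", "region", "zone")


def cluster_info(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Parse SKYPILOT_CLUSTER_INFO; missing fields come back as None."""
    raw = env.get("SKYPILOT_CLUSTER_INFO", "{}")
    data = json.loads(raw)
    return {field: data.get(field) for field in FIELDS}


# Example with a made-up value instead of the real environment.
sample_env = {
    "SKYPILOT_CLUSTER_INFO":
        '{"cluster_name": "dev", "cloud": "GCP", '
        '"region": "us-central1", "zone": "us-central1-a"}'
}
info = cluster_info(sample_env)
```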

Enhancements

* Optimizer is up to 10x faster when multiple resources are specified (3567)
* Autostop timer is now reset at the start of a new sky launch to avoid unexpected autostops (3205)
* GCP GPUs now include `DEVICE_MEM` in `sky show-gpus` (3375)
* Better sorting for `sky show-gpus` (3492)
* Handling for usernames containing invalid characters (3528)
* Null environment variables now raise an error (3557)

Runtime & Backend

* SkyPilot now supports Python 3.11 (3248)
* SkyPilot runtime is now isolated from any environment changes made by user code (3575, 3326, 3339)
* Fix for jobs and services running longer than 12 days (3460)
* Docker runtime fixes and enhancements, including fix for storage mounting in container (3450, 3436, 3481, 3343)
* Bug fixes and optimizations (3280, 3292, 3178, 3386, 3407, 3423, 3368, 3457, 3469, 3482, 3495, 3512, 3536, 3568)

Optimizations

* Lazy imports for 2x faster import times (3394, 3463)
* Faster setup and job submission (3523, 3484)

Cloud: GCP

* H100 GPUs are now supported on GCP (3279)
* Support for fine-grained GCP IAM permissions (3284)

Cloud: Azure

* Custom images are now supported on Azure. Simply specify `image_id` in the `resources` field. (3362)
* 8x faster autostop for Azure (3519)
* Fix GPUs not being detected in Azure (3313)
* Provisioning fixes (3483)

Cloud: AWS

* Fine-grained IAM roles: you can now specify IAM roles on a per-resource basis (3488, 3514)
* SkyPilot can now be run in ECS containers by assuming `container-role` IAM roles (3503)
* SkyPilot will not delete user-specified security groups (3402)

Cloud: Fluidstack

* H100 and A100 Nvlink support for Fluidstack (3467)
* Opening ports is now supported for Fluidstack (3294)
* Bug fixes (3254, 3265)

Other Clouds

* Bug fixes for Lambda provisioning and termination (3409, 3410)
* Multi-gpu fixes for RunPod (3291)
* Cudo: handle missing project errors (3438)

Thanks to all contributors!

New contributors: MysteryManav, JGSweets, Harthgar, mjkanji

Many thanks to all contributors who contributed to this release!

Contributors: Michaelvll, romilbhardwaj, concretevitamin, cblmemo, MaoZiming, shethhriday29, asaiacai, JGSweets, mjkanji, MysteryManav, landscapepainter, Harthgar, mjibril, dtran24, fozziethebeat, JungleCatSW


**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.5.0...v0.6.0

0.5

0.5.0

We are excited to release SkyPilot v0.5.0, where we introduce a significant number of new features and enhancements, including:
* SkyPilot Serving
* New provisioner
* LLM recipes for the latest open models and engines
* Kubernetes support improvements
* 4 new clouds (contributed by the cloud providers!)

and more!

Release Highlights

**New Features**

* [**Multiple candidate resources**](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html#multiple-candidate-resources): SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, `any_of` or `ordered` in `resources`), allowing users to significantly enlarge the resource pool and get higher availability.
* [**New Provisioner**](https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?usp=sharing): The provisioner has a new implementation that is **2x faster and more reliable** for supported clouds, and it supports launching clusters with more than **100 nodes**. Dependency requirements for clouds are also significantly reduced.
* **Disk Tier**: Introducing `best` disk tier for the best performance and cost, so you can choose the best disk for any cloud. (2434)
* Allow **2x spot jobs** to be run concurrently
* Mount storage back after cluster restart
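
To illustrate the difference between `any_of` and `ordered` candidate resources mentioned above, here is a small selection sketch. It is not SkyPilot's optimizer: `Candidate`, `pick`, the availability callback, and the prices are all hypothetical, and it assumes `ordered` means "try in the user's order" while `any_of` means "take the cheapest candidate that is available".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical resource candidate with a made-up hourly price."""
    accelerator: str
    hourly_cost: float


def pick(candidates: List[Candidate],
         is_available: Callable[[Candidate], bool],
         ordered: bool) -> Optional[Candidate]:
    """Return the first available candidate, or None if none is available.

    ordered=True keeps the user's priority order; ordered=False (any_of)
    tries candidates from cheapest to most expensive.
    """
    pool = candidates if ordered else sorted(
        candidates, key=lambda c: c.hourly_cost)
    for candidate in pool:
        if is_available(candidate):
            return candidate
    return None


candidates = [Candidate("A100", 3.0), Candidate("L4", 0.8), Candidate("V100", 2.0)]
l4_sold_out = lambda c: c.accelerator != "L4"
by_priority = pick(candidates, l4_sold_out, ordered=True)
by_cost = pick(candidates, l4_sold_out, ordered=False)
```

Either way, listing more candidates enlarges the pool that failover can draw from, which is the availability benefit the release note describes.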

SkyServe
[SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) is a serving system on top of SkyPilot that deploys and scales any HTTP services across one or more regions or clouds, with autoscaling, load balancing, and more.

* Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (2458)
* Autoscaler: Request rate based autoscaling policy. (2868, 2878)
* Autoscaler: Support scaling to 0 when no requests (2938)
* Rolling update: Support rolling update for existing services (2935, 3057)
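
A request-rate-based autoscaling policy with scale-to-zero can be summarized in one formula: divide the observed request rate by the per-replica target and clamp the result. The sketch below is a simplified illustration, not SkyServe's autoscaler (which may add smoothing or delays); `target_replicas` is a hypothetical function, and the parameter names are assumptions modeled on common autoscaler settings.

```python
import math


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replica count needed to serve `qps`, clamped to [min, max]."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    wanted = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(wanted, max_replicas))


# 25 requests/s at 10 requests/s per replica needs 3 replicas.
busy = target_replicas(25, 10, min_replicas=0, max_replicas=5)
# No traffic and min_replicas=0 scales the service down to zero.
idle = target_replicas(0, 10, min_replicas=0, max_replicas=5)
```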


**Other Enhancements**

* Environment variable support in services field (3078)
* Override task configurations with CLI arguments (2979)
* Logging improvement for replicas (2924, 2949)
* Smoke tests for SkyServe (2911)
* [Documents](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) for SkyServe (#3022, 2794, 2864, 2894, 2922, 2989, 3182)
* UX improvements for SkyServe (2895, 2940, 2961, 3054, 3176, 3094)
* Bug fixes and robustness improvement (2811, 2822, 2860, 2995, 2983, 3058, 3075, 3226)

New LLM Recipes

* [Gemma](https://github.com/skypilot-org/skypilot/tree/master/llm/gemma): Serve your Gemma on any cloud (#3207, 3220)
* [SGLang](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang): Speed up your LLM deployments with [SGLang](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe (#3126, 3140, 3170, 3145)
* [Mixtral 8x7B](https://github.com/skypilot-org/skypilot/tree/master/llm/mixtral): Serving and scaling Mixtral 8x7B model on any regions/clouds (#2857, 2888, 3017, 3067, 2882)
* [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/): Official docs for hosting Mistral 7B from mistral.ai (#2615, 2856)
* [CodeLlama](https://github.com/skypilot-org/skypilot/tree/master/llm/codellama): Hosting CodeLlama model with SkyServe and accessing it with API, chat or VSCode (#3050, 3143)
* [LoRAX](https://github.com/skypilot-org/skypilot/tree/master/llm/lorax): efficient multi-lora LLM inference (#2883)
* [axolotl](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl): a recent LLM fine-tuning tool, running on SkyPilot (#2784, 2789)
* [Tabby](https://github.com/skypilot-org/skypilot/tree/master/llm/tabby): Self-host coding assistant Tabby on SkyPilot (#2597, 3068)
* [vLLM](https://github.com/skypilot-org/skypilot/tree/master/llm/vllm): Serve with vLLM to expose OpenAI API for Vicuna and Mixtral (#2614, 2643, 2616, 2786, 2791, 2948,3118)
* [TGI](https://github.com/skypilot-org/skypilot/tree/master/examples/serve/huggingface-tgi.yaml): Scale the inference engine TGI with SkyServe (#3121)

Kubernetes
Kubernetes support received a number of **New Features** and **Enhancements**.

* Multi-node support for Kubernetes (2609, 3019)
* Open ports support for Kubernetes (2588, 2713, 2997, 3200)
* Support Coreweave label for GPUs in Kubernetes (Coreweave support under development) (2650)
* Starting a Kubernetes GPU cluster locally with `sky local up` (2890)
* Custom Image Support for Kubernetes Instances (2729, 3019, 3210)
* New provisioner for Kubernetes for better performance and robustness (3019)
* Supporting Kubernetes clusters launched with k3s and Rancher (3148)
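
A minimal sketch of a task that combines the multi-node and open-ports support listed above might look like this. The node count, port, and command are illustrative assumptions.

```yaml
# k8s-task.yaml: illustrative sketch; node count, port and command are assumptions.
num_nodes: 2            # multi-node cluster on Kubernetes

resources:
  cloud: kubernetes
  ports: 8080           # port to expose from the cluster

run: python -m http.server 8080
```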

**Other Enhancements**

* Support H100 80GB in Kubernetes (2840)
* Share SSH jump pod across users to reduce resource consumption (2826)
* Allow `KUBECONFIG` env var for config file specification (3169)
* Make Kubernetes cluster removal more robust (3043)
* Fix GPU labeller (2636, 2653)
* UX and robustness improvements (2638, 2712, 2589, 2785, 2551, 2795, 2884, 2913, 2795)
* Documentation improvements (2595, 2705, 2957, 2991, 2997, 3119)

More Clouds
SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: **VMware vSphere**, **RunPod**, **Fluidstack** and **Cudo Compute**.
* [RunPod](https://www.runpod.io/): a specialized AI cloud with additional capacity for high-end GPUs. (#2980, 3018)
* [Fluidstack](https://www.fluidstack.io/): accessible GPUs for AI at low cost. (#3086, 3224)
* [Cudo Compute](https://www.cudocompute.com/): a GPU cloud providing low-cost GPUs powered by green energy. (#2975, 3224)
* [VMware vSphere](https://www.vmware.com/products/vsphere.html): you can now bring your own vSphere cluster to SkyPilot. ([docs](https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/vsphere.html)) (#3000)



Clouds

AWS

**New Features**

* New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (1702, 2719, 2792)
* Support for AWS Trainium accelerator (2690)
* Support null for proxy command to filter regions (2756)
* Support CUDA 12.1 with default image updates (2788)
* Job scheduling on Inferentia and Trainium (2969, 2798)
* Allow specifying security_group (3133)
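
The security-group and proxy-command options above are set in the user config file. The following is only a sketch: the key names reflect our understanding of the SkyPilot config schema, the group name and jump host are placeholders, and the stated meaning of `null` is an assumption based on the item description.

```yaml
# ~/.sky/config.yaml: illustrative sketch; names and hosts are placeholders.
aws:
  security_group_name: my-security-group   # hypothetical group name
  ssh_proxy_command:
    # Only the regions listed here are considered (assumed semantics).
    us-east-1: ssh -W %h:%p user@jump.example.com
    us-west-2: null                        # region allowed, no proxy command
```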

**Enhancements**

* Make public / private subnet selection robust (2867)
* Avoid hanging for restarting an instance in STOPPING state (2998)
* Remove sunset instance types (2610)
* Add docs for custom VPC support (2776)

**Fixes**

* Fix conda installation on AWS default image (3206)
* Robustify the custom image support (3216)
* Fix subnet selection for AWS and autodown for spot instances (2921)
* Fix minimal permission for AWS (2978)
* Improve opening ports for AWS (2716)
* Autostop with new provisioner (2719)



GCP

**New Features**

* Security: Custom VPC support for GCP. (2764, 2772, 2854, 2944)
* Security: Support private IP with proxy jump on GCP. (2819)
* New provisioner: Adopted new provisioner for GCP with >2x faster and more robust provisioning (2681, 2719, 2943)
* Automatically use reserved instances from multiple reserved pools (2836, 2681)
* Support L4 accelerator for GCP (2724)
* Allow stopping spot clusters on GCP (2877)
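
A hedged sketch of how the custom VPC and private-IP options above might be expressed in the user config file follows. The key names reflect our understanding of the config schema, and the VPC name and jump host are placeholders.

```yaml
# ~/.sky/config.yaml: illustrative sketch; VPC name and host are placeholders.
gcp:
  vpc_name: my-vpc                 # hypothetical custom VPC
  use_internal_ips: true           # reach instances over private IPs
  ssh_proxy_command: ssh -W %h:%p user@jump.example.com
```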

**Enhancements**

* Allow stopping VM with local SSD (2587)
* Update default runtime version for TPU node (2601, 2602)
* Handling transient error during launching GCP clusters (2669)
* Update GCSFuse version to 1.3.0 for GCS storage mount (2887)
* Set TPU VM the default option for TPU accelerators (1758)
* Ignore missing gcp credentials for latest gcloud and avoid duplicating credentials (3028, 3172, 3234)

**Fixes**

* Fix custom docker image support (3218)
* Fix minimal roles required for GCP (2704)
* Robustify the catalog fetching (3141)
* Fix ports on TPU VM and cluster launched before 0.4.0 (2641)
* Fix backward compatibility issue with GCP clusters (2604)
* Fix `--disk-size` for Custom Machine Images (2718)
* Update catalog fetcher with more options (2562)
* Assign GCP VMs with service account (2972)
* Fix machine image support (3030, 3236)
* Fix error handling for failed provisioning (2852)
* Leave out TPU v5 in catalog as it is not supported (2656)
* Fix GCP minimal permission (2947, 2770, 2761)


Azure

**Enhancements**

* Make port opening more robust (2649, 2891, 3084)
* Additional arguments for Azure catalog fetcher and support H100 (2561, 2844, 2847)
* Support CUDA 12.1 with default image updates (2468)
* Support spot instances on Azure (2871)

**Fixes**

* Fix custom docker image support (3218)
* UX: Fix Azure disk tier explicitly shown in resources str (3064)
* Fix status query for Azure (3015)


SCP

* Fix SCP error raised in `sky check` (3038)


CLI & Core interfaces

**New Features**

* Multi-node jobs fail fast on a single-node failure (3081)
* Add configurations for not uploading credentials (2904)
* Adding `sky status --endpoints` CLI (3199)
* Support more characters in cluster name (3130)
* Show all regions and more accurate price in `sky show-gpus` (2583, 2892, 2933, 2946, 3083, 3149, 3113)
* Allow inferring cloud from region or zone (2632)
* Add `--commit` and `--version` for `sky` CLI (2720, 2731, 2733)

**Enhancements**

* Robustify runtime initialization on remote cluster (3132)
* Better error message for YAML parsing (3040)
* Smarter GPU name completion (3014)
* Speed up retry until up by not doing exponential backoff (2821)
* Add schema validation for config (2645)
* Allow `--disk-tier none` override (2906)
* `sky check` improvement (3174, 3212, 3160)
* Better logging for CLIs (2535, 2691, 2728, 3139, 3175)

**Fixes**

* Fix permission issues for SSH config file on specific linux distributions (3151)
* Fix `sky_logs` and mounting directory (2667, 2845)
* Fix job related commands (2662, 2767)
* Fix `sky logs` with `--sync-down` (2660)

**Deprecations**

* Deprecate `cpunode/gpunode/tpunode`, hide `admin` (2800)
* Remove deprecated `Local` cloud which is now replaced by Kubernetes support (3037, 3186)


Backend/Provisioner

**New Features**

* Support multiple candidate resources (2498, 2803, 2833, 2886, 3107)
* Support launching 100-node cluster for AWS, GCP, Kubernetes, and RunPod (3004, 3005)
* Support spaces in paths (2762)
* Support long local username with special characters (3105, 3130)


**Enhancements**

* Robustify termination of failed clusters during failover (2990)
* Improve the ssh check for clusters just provisioned (2797)
* Robustify failover to avoid terminating clusters that have user data (2977)
* Move ssh config to `~/.ssh/generated/ssh` instead of directly editing `~/.ssh/config` (2706, 3069)
* Code refactoring and cleanup (2541, 2736, 3046, 2633, 2870, 2925, 3087, 3088, 3153)
* Improve usage collection (2654, 2672)
* Better explanation of failover in docs (2850, 2834)

**Fixes**

* Avoid backward compatibility issue with provisioner (2682)
* Fix cloud provisioning internal file mount cache (2715)
* Fix optimization for DAG when some resources provided are not feasible (2657)
* Fix runtime installation on remote VM (2909, 2912)
* Fix cluster termination when the cluster is not fully UP (3025)
* Fixes for tests (2651, 2976, 3023, 3166, 3167, 3202)
* Improve logging (2594, 2678, 2696, 3003)


Managed spot

**New Features**

* Allow 2x spot jobs to be run concurrently (3191, 3208)

**Enhancements**

* Better logging and UX (2630)
* Add docs for customizing spot controller (2753)
* Add spot pipeline docs (2936)

**Fixes**

* Fix private VPC support for spot jobs (2874)
* Fix `~/.sky/config.yaml` for spot jobs (2876)
* Fix OOM for long running spot jobs (2675)
* Fix AWS NoCredentialError caused by credential rotation (2695)
* Fix Azure dependency on spot controller (2875)


Storage

**New Features**

* Mount storage back to clusters after restart (2322, 2804)

**Enhancements**

* Clarify the syntax for external and managed storage (3162, 2804)
* Confirmation prompt for `sky storage delete`, and `--yes` flag to skip it (2726)
* Refactor and clean up storage code (2774, 2986)
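
The distinction between external and managed storage can be sketched as follows. This assumes that a `source` bucket URI refers to an existing, externally owned bucket, while `name` requests a bucket managed by SkyPilot. The bucket names and mount paths are placeholders.

```yaml
# task.yaml: illustrative sketch; bucket names and paths are placeholders.
file_mounts:
  /data:
    source: s3://my-existing-bucket   # external bucket, assumed to already exist
    mode: MOUNT
  /outputs:
    name: my-sky-outputs              # bucket created and managed by SkyPilot
    store: s3
    mode: MOUNT
```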

**Fixes**

* Fix permission issue for S3 mounting on specific images (3215)
* Fix spaces in source path for storages (2835)


Dependencies

* Recommend nightly build in docs for better performance and robustness (2984)
* Automatic build for nightly Docker image (2229)
* Avoid ray dependency locally for AWS, GCP, and Kubernetes (2625, 2943, 3019)
* Remove AWS dependency by default for faster setup and fewer conflicts (2841, 2942)
* Fix GCP dependency by updating google-api-python-client (2577, 2759)
* Pin remote dependency for ray job (2659)
* Robustify dependencies (2642, 2679, 3024)

Examples

* NeMo distributed training for BERT and GPT3 (2533)
* Add docker compose example to run multiple containers (2745)
* Distributed ray train example (2828)
* Benchmark Torch DDP (2987)
* Example updates for supported models (2637, 2825)



**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.4.0...v0.5.0


Thanks to all contributors!
New contributors: rtalaricw, jackyk02, Vaibhav2001, rohanvaidya45, Shrinandan, manishiitg, amitkumarj441, tgaddair, aseriesof-tubes, changxiaohui, thams, kishb87, PratikKumar125, mmcclean, dtran24, davidwagnerkc, mjibril, kbrgl, msehsah1, JungleCatSW, Ying1123

Many thanks to all contributors who contributed to this release!

Contributors: Michaelvll, concretevitamin, cblmemo, romilbhardwaj, MaoZiming, landscapepainter, sunny0826, suquark, Vaibhav2001, infwinston, hemildesai, asaiacai, Shrinandan, kishb87, rtalaricw, iojw, aseriesof-tubes, manishiitg, jackyk02, mmcclean, thams, amitkumarj441, rohanvaidya45, saihtaungkham, tgaddair, davidwagnerkc, PratikKumar125, dtran24, changxiaohui, mjibril, kbrgl, msehsah1, JungleCatSW, Ying1123

