We are excited to release SkyPilot v0.4.0, which brings a host of new features and improvements, including Kubernetes support, native container support, ability to open ports, and more.
Release Highlights
New Features
* **[Kubernetes support](https://skypilot.readthedocs.io/en/v0.4.0/reference/kubernetes/index.html)**: SkyPilot tasks and clusters can now run on Kubernetes clusters, including on-prem and cloud-hosted deployments (GKE, EKS).
  * If you have a working kubeconfig, simply run `sky check` and `sky launch --cloud kubernetes` to run your task on Kubernetes.
  * If desired, tasks can also fail over to the cloud when the Kubernetes cluster does not have enough resources. The same SkyPilot YAMLs and CLI work seamlessly across Kubernetes and clouds.
* **[Opening ports on clusters](https://skypilot.readthedocs.io/en/v0.4.0/examples/ports.html)**: Open ports on your clusters with the `ports` field. These ports are publicly accessible and can be used for hosting LLM inference endpoints, Jupyter notebooks, web servers, TensorBoard, and other services.
* **[Native container support](https://skypilot.readthedocs.io/en/v0.4.0/examples/docker-containers.html#using-docker-containers-as-runtime-environment)**: If your task uses Docker containers, SkyPilot's `setup` and `run` commands can now be executed directly in that container. This allows you to wrap your environment in a container and run it on any cloud with SkyPilot.
* **[Reservation support](https://skypilot.readthedocs.io/en/v0.4.0/reference/config.html)**: This release adds support for [GCP reservations](https://cloud.google.com/compute/docs/instances/reservations-overview). SkyPilot will now prioritize using your reservations on the cloud to save costs and get higher availability.
* **New Managed Spot Features**
  * **[Spot pipeline support](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/spot_pipeline/multi_jobs.yaml)**: automatically execute a pipeline of sequential tasks.
  * **[Spot dashboard](https://skypilot.readthedocs.io/en/v0.4.0/examples/spot-jobs.html#dashboard)**: track all your spot jobs in your browser.
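
As a rough illustration of the `ports` field and native container support described above, a task YAML might look like the sketch below. This is an assumption-laden example rather than a file from the release: the image name, port number, and commands are placeholders, and the linked docs remain the reference for the supported syntax.

```yaml
# Hypothetical task YAML (placeholder values throughout).
resources:
  # Run `setup` and `run` inside this Docker image.
  # The image name is an illustrative assumption.
  image_id: docker:ubuntu:20.04
  # Open this port on the cluster; it will be publicly accessible.
  ports: 8000

setup: |
  # Runs inside the container.
  echo "installing dependencies"

run: |
  # Runs inside the container; a real task would start a service
  # listening on the port opened above.
  echo "hello from the container"
```

Assuming the file is saved as `task.yaml` (a hypothetical name), `sky launch task.yaml` would launch it, and adding `--cloud kubernetes` would target a Kubernetes cluster.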
New LLM Recipes
* **vLLM** on any cloud - [blog](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/), [example](https://github.com/skypilot-org/skypilot/tree/master/llm/vllm).
* **Llama 2** - Train Vicuna on Llama-2 and serve chatbots - [blog](https://blog.skypilot.co/finetuning-llama2-operational-guide/), [fine-tuning example](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/vicuna-llama-2), [self-hosted chatbot](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/llama-2).
* **LocalGPT**: chat with your PDFs - [example](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/localgpt).
* **Falcon-40B** fine-tuning guide - [example](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/falcon).
More Clouds
SkyPilot now supports 8 clouds, including community-contributed support for two new clouds:
* [Oracle Cloud Infrastructure (OCI)](https://skypilot.readthedocs.io/en/v0.4.0/getting-started/installation.html#oracle-cloud-infrastructure-oci)
* [Samsung Cloud Platform (SCP)](https://skypilot.readthedocs.io/en/v0.4.0/getting-started/installation.html#samsung-cloud-platform-scp)
SkyPilot now also supports IBM COS buckets (1966).
Core and UX Improvements
* **Faster failover**: 30x faster failover with our new quota optimization, which checks whether quotas are available before launching a cluster (supported on GCP and AWS).
* **Easily get VM IPs**: The new `--ip` flag for `sky status` returns the public IP address of the cluster (e.g., `sky status --ip mycluster`). Use this to access services such as LLM inference endpoints, Jupyter notebooks, and more.
* **Improved scriptability**: SkyPilot YAMLs and CLI are more scriptable than ever: `file_mounts` can be dynamically defined with environment variables ([docs](https://skypilot.readthedocs.io/en/v0.4.0/running-jobs/environment-variables.html#using-in-file-mounts), [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/using_file_mounts_with_env_vars.yaml)), and environment variables can be set through a dotenv file with the new `--env-file` flag (#2296).
* **Core optimizations**: Multi-node clusters stop 4x faster (2199), `sky status` updates for stopped clusters are 10x faster (2288), and the job queue is more memory efficient (1636).
* **Nightly releases**: We now release nightly versions of SkyPilot. To get the cutting edge of SkyPilot without installing from source, run `pip install skypilot-nightly` (1446).
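
To make the scriptability item above concrete, here is a minimal sketch of `file_mounts` defined with environment variables, loosely modeled on the linked example. The variable names, bucket name, and paths are illustrative assumptions, not values from the release.

```yaml
# Hypothetical task YAML fragment (placeholder names and paths).
envs:
  MY_BUCKET: my-example-bucket
  MY_LOCAL_PATH: tmp-workdir
  MODEL_SIZE: 13b

file_mounts:
  # Bucket name taken from an environment variable.
  /mydir:
    name: ${MY_BUCKET}
    mode: MOUNT

  # Both the mount path and the local source use environment variables.
  /checkpoint/${MODEL_SIZE}: ~/${MY_LOCAL_PATH}
```

The values under `envs` act as defaults in this sketch; a dotenv file passed with the new `--env-file` flag is one way to supply them from outside the YAML.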
Deprecation
* [SkyPilot On-prem](https://skypilot.readthedocs.io/en/v0.3.3/reference/local/index.html) is now deprecated. Kubernetes is the recommended way to run SkyPilot on on-prem clusters.
Below is a detailed list of changes.
Managed Spot
New Features
* Spot pipeline support: automatically handles a pipeline of spot jobs. (1982)
* Spot dashboard is now available with `sky spot dashboard`: you can now see all your spot jobs in a GUI (2103, 2136)
* Spot callback - users can now run custom code when spot job status changes (2106, 2364)
* Resource configuration of the spot controller can now be customized ([docs](https://skypilot.readthedocs.io/en/v0.4.0/reference/config.html), #2040)
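
For the spot controller customization mentioned above, a config fragment along the following lines is one plausible shape. It is a sketch under assumptions: the cloud, region, and sizes are illustrative values, and the linked config docs are the reference for the fields actually supported.

```yaml
# ~/.sky/config.yaml (illustrative values only)
spot:
  controller:
    resources:
      # Same style of spec as `resources` in a task YAML.
      cloud: gcp
      region: us-central1
      cpus: 4+
      disk_size: 100
```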
Enhancements
* SkyPilot now shows the spot job's resources and estimated cost before confirmation (2524)
* Switch to eager failover recovery policy for better spot lifetime (2234)
* Reduce the logging for launching spot controller (2056)
Fixes
* PENDING spot jobs are now shown in the spot queue before they start (2044)
* Robustness fixes (2102, 2153, 2119, 2004, 2330, 1998)
CLI & YAML interfaces
New Features
* Users can now use environment variables to dynamically define file_mounts ([docs](https://skypilot.readthedocs.io/en/v0.4.0/running-jobs/environment-variables.html#using-in-file-mounts), 2146)
* `sky status` can now show the head IP of the cluster with `-a` or `--ip` flags (2305, 2563)
* `sky down/stop/start` defaults to the only cluster if exactly one exists, and `sky cancel` without a cluster name cancels the latest task (2325)
Enhancements
* `sky check` output is now friendlier with more hints for disabled clouds (2002, 2017, 2196, 2114, 2221, 2377)
* `sky down` progress bar now reflects clusters that failed to terminate (1595, 2005)
* We now fail early if rsync is not installed locally (2168)
* Better messages and hints for CLI (2027, 2028, 2077, 2083, 2085)
Fixes
* Fixed the order of VMs in optimizer table when `--cpus` is provided (2037)
* Better handling when `sky launch` is interrupted (2206, 2252)
Backend
New Features
* Users can now open ports for their clusters with the `ports` field ([docs](https://skypilot.readthedocs.io/en/v0.4.0/examples/ports.html), #2210, 2477)
* Docker support in `image_id` - tasks can now be run inside Docker containers ([docs](https://skypilot.readthedocs.io/en/v0.4.0/examples/docker-containers.html#using-docker-containers-as-runtime-environment), 1910)
* Users can now clone a cluster from an existing cluster's disk with the `--clone-disk-from` flag (2098)
* Users can now launch their own Ray cluster on a SkyPilot cluster (2020)
Enhancements
* 30x faster failover for AWS and GCP when quotas are not available (1953, 2187, 2313)
* Faster `sky launch` by caching cluster IP address (2400)
* Job queue is now more resource efficient, with significantly lower memory consumption on the remote cluster (1636)
* Cluster names no longer map directly to cloud cluster names. Instead, they are mapped to a unique cluster name on the cloud. This helps with isolation across users sharing cloud accounts. (2403)
* More efficient and robust stopping/termination for AWS (2121)
* `sky status --refresh` for STOPPED clusters is 10x faster (2079)
* Empty YAML fields are now allowed (1890)
Fixes
* Manually started/stopped clusters are now better handled (2130, 2203, 2389)
* Fix edge case where existing clusters were terminated when resources are not available (2170)
* Fixes for disk_tier UX (2156, 2215)
* Robustness fixes (2033, 2061, 2009, 2491, 2290, 1259, 2074, 2023, 2042)
Storage
New Features
* IBM COS is now supported (1966)
* `sky spot launch` will now exclude files from .gitignore (2018)
Enhancements
* Storage deletion is now parallelized, making it faster (2058)
* UX improvements for `sky storage` CLI (2063, 2177)
* GCS bucket mounting now uses gcsfuse v1.0.1 (2470)
Fixes
* Fix transient failures when uploading to GCS from macOS due to a multiprocessing bug (2125)
* Robustness fixes (2049, 2117, 2165, 2259, 2326, 2250)
Dependencies
* Avoid buggy grpcio versions (2055)
* Pydantic is pinned to `<2.0` (2157)
* PyYAML is pinned to `>3.13, != 5.4.*` to avoid issues with Cython 3 (2256, 2514)
* Ray `<= 2.6.3` is supported on local machines (2401)
* `pycryptodome`, `oauth2client` are no longer required (2515)
Clouds
AWS
* H100 GPUs are now supported (2323)
* New [docs](https://skypilot.readthedocs.io/en/v0.4.0/cloud-setup/cloud-auth.html) for AWS cloud administrators on advanced login options (SSO and account switching) (#1888)
* Insufficient permission is now handled gracefully (2415, 2456)
* Fixed a bug where an existing AWS cluster would end up in INIT state after changing identity (2442)
* Fix fetching AZs when the describe zones permission does not exist in all regions (2463)
GCP
* Nvidia L4 GPUs are now supported (2212)
* [Machine Images](https://cloud.google.com/compute/docs/machine-images) are now supported (#2280)
* GCP reservations are now supported (2352)
* SkyPilot optimizer is 4x faster for GCP instances (2410)
* GCP pricing is now dynamically fetched and is more robust (2118, 2076, 2131)
* Default image has been updated to Debian 11 (2279)
* New [docs](https://skypilot.readthedocs.io/en/v0.4.0/cloud-setup/cloud-permissions/gcp.html) for administrators on the minimal permissions a GCP account needs to use SkyPilot (#2100, 2112)
* Robustness fixes (2135, 2199, 2124, 1879, 2116)
* TPU support is now more robust (2310, 2471, 2350, 2540)
Azure
* westus3 region is now supported (2149)
* Fix status refresh for Azure (2120)
* Fix Azure disk tier interruption for optimize progress (2111)
* Azure catalog fetching is more robust (2115, 2553)
Lambda
* Add H100 support for Lambda Cloud (2010, 2323)
* API rate limit is now handled with backoff and retry (2265)
* Errors are now more detailed (2371)
Oracle Cloud Infrastructure (OCI)
* OCI is now supported (1909, 2047, 2057, 2068, 2034, 2070, 2069, 2062, 2067, 2092, 2095, 2099)
Samsung Cloud Platform (SCP)
* Samsung Cloud Platform (SCP) is now supported for single-node clusters (1941, 2001, 2014)
Examples
* New DeepSpeed [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/deepspeed-multinode/sky.yaml) (#2208)
* New Distributed TensorFlow [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/tensorflow_distributed/tf_distributed.yaml) (#1721)
* New DVC [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/dvc/dvc_pipeline.yaml) (#2444)
* Examples dependencies are now up to date (2145, 2223, 2359)
[Full changelog](https://github.com/skypilot-org/skypilot/compare/v0.3.0...v0.4.0)
Thanks to all contributors!
New contributors: JGoo1, tobi, HysunHe, blucz, shethhriday29, MaoZiming, ksasi, pushmatrix, hzeng-0, saihtaungkham, fozziethebeat, n10dollar, asaiacai, mtaku3, gbmarc1, alex000kim, steve-marmalade, xzrderek, sunny0826.
Many thanks to everyone who contributed to this release:
Michaelvll, concretevitamin, romilbhardwaj, cblmemo, HysunHe, landscapepainter, shethhriday29, infwinston, alex000kim, suquark, sunny0826, gbmarc1, MaoZiming, xzrderek, tobi, steve-marmalade, saihtaungkham, pushmatrix, n10dollar, mtaku3, ksasi, hzeng-0, fozziethebeat, blucz, asaiacai, WoosukKwon, JGoo1, mraheja, iojw, hemildesai, ewzeng, aviweit, Saikrishna-Achalla, Cohen-J-Omer