Skypilot

Latest version: v0.7.0


0.4

0.4.0

We are excited to release SkyPilot v0.4.0, which brings a host of new features and improvements, including Kubernetes support, native container support, ability to open ports, and more.

Release Highlights

New Features
* **[Kubernetes support](https://skypilot.readthedocs.io/en/v0.4.0/reference/kubernetes/index.html)**: SkyPilot tasks and clusters can now run on Kubernetes clusters, including on-prem and cloud-hosted deployments (GKE, EKS).
  * If you have a working kubeconfig, simply run `sky check` and `sky launch --cloud kubernetes` to run your task on Kubernetes.
  * If desired, tasks can also fail over to the cloud when the Kubernetes cluster does not have enough resources. The same SkyPilot YAMLs and CLI work seamlessly across Kubernetes and clouds.
* **[Opening ports on clusters](https://skypilot.readthedocs.io/en/v0.4.0/examples/ports.html)**: Open ports on your clusters with the `ports` field. These ports are publicly accessible and can be used for hosting LLM inference endpoints, Jupyter notebooks, web servers, TensorBoard, and other services.
* **[Native container support](https://skypilot.readthedocs.io/en/v0.4.0/examples/docker-containers.html#using-docker-containers-as-runtime-environment)**: If your task uses Docker containers, SkyPilot's `setup` and `run` commands can now be executed directly in that container. This allows you to wrap your environment in a container and run it on any cloud with SkyPilot.
* **[Reservation support](https://skypilot.readthedocs.io/en/v0.4.0/reference/config.html)**: This release adds support for [GCP reservations](https://cloud.google.com/compute/docs/instances/reservations-overview). SkyPilot will now prioritize using your reservations on the cloud to save costs and get higher availability.
* **New Managed Spot Features**
  * **[Spot pipeline support](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/spot_pipeline/multi_jobs.yaml)**: automatically execute a pipeline of sequential tasks.
  * **[Spot dashboard](https://skypilot.readthedocs.io/en/v0.4.0/examples/spot-jobs.html#dashboard)**: track all your spot jobs in your browser.
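
As a minimal sketch of the `ports` field mentioned above, a task YAML along these lines would expose a Jupyter server. The port number, the `setup` and `run` commands, and the file name `jupyter_lab.yaml` are illustrative assumptions rather than part of these release notes.

```yaml
# jupyter_lab.yaml (hypothetical file name)
resources:
  ports: 8888  # port to open on the cluster; the value is only an example

setup: pip install jupyter

# Bind to all interfaces so the server is reachable through the opened port.
run: jupyter lab --port 8888 --no-browser --ip=0.0.0.0
```

Launching it with `sky launch jupyter_lab.yaml` should then make the server reachable on the cluster's public IP at port 8888.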

New LLM Recipes
* **vLLM** on any cloud - [blog](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/), [example](https://github.com/skypilot-org/skypilot/tree/master/llm/vllm).
* **Llama 2** - Train Vicuna on Llama-2 and serve chatbots - [blog](https://blog.skypilot.co/finetuning-llama2-operational-guide/), [fine-tuning example](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/vicuna-llama-2), [self-hosted chatbot](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/llama-2).
* **LocalGPT**: chat with your pdfs - [example](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/localgpt).
* **Falcon-40B** fine-tuning guide - [example](https://github.com/skypilot-org/skypilot/tree/releases/0.4.0/llm/falcon).

More Clouds
SkyPilot now supports 8 clouds, including community-contributed support for two new clouds:
* [Oracle Cloud Infrastructure (OCI)](https://skypilot.readthedocs.io/en/v0.4.0/getting-started/installation.html#oracle-cloud-infrastructure-oci)
* [Samsung Cloud Platform (SCP)](https://skypilot.readthedocs.io/en/v0.4.0/getting-started/installation.html#samsung-cloud-platform-scp)

SkyPilot now also supports IBM COS buckets (1966).

Core and UX Improvements
* **Faster failover**: 30x faster failover with our new quota optimization which checks if quotas are available before launching a cluster (Supported on GCP, AWS).
* **Easily get VM IPs**: The new `--ip` flag for `sky status` returns the public IP address of the cluster (e.g., `sky status --ip mycluster`). Use this to access services such as LLM inference endpoints, Jupyter notebooks, and more.
* **Improved scriptability**: SkyPilot YAMLs and CLI are more scriptable than ever - `file_mounts` can be dynamically defined with environment variables ([docs](https://skypilot.readthedocs.io/en/v0.4.0/running-jobs/environment-variables.html#using-in-file-mounts), [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/using_file_mounts_with_env_vars.yaml)), environment variables can be set through a dotenv file with the new `--env-file` flag (#2296).
* **Core optimizations**: Multi-node clusters stop 4x faster (2199), `sky status` updates for stopped clusters are 10x faster (2288), and the job queue is more memory efficient (1636).
* **Nightly releases**: We now release nightly versions of SkyPilot. To get the cutting edge of SkyPilot without installing from source, run `pip install skypilot-nightly` (1446)
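
To illustrate the scriptability point above, here is a small sketch of `file_mounts` driven by environment variables, modeled on the linked example. The variable names, the bucket name, and the mount path `/data` are placeholders chosen for illustration.

```yaml
envs:
  DATA_BUCKET_NAME: my-demo-bucket  # hypothetical bucket name
  DATA_BUCKET_STORE_TYPE: s3        # assumed store type

file_mounts:
  /data:
    # Both values are filled in from the envs section above.
    name: ${DATA_BUCKET_NAME}
    store: ${DATA_BUCKET_STORE_TYPE}

run: ls /data
```

The defaults under `envs` can be overridden at launch time, for example with `--env DATA_BUCKET_NAME=another-bucket`, so one YAML can point at different buckets without being edited.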

Deprecation
* [SkyPilot On-prem](https://skypilot.readthedocs.io/en/v0.3.3/reference/local/index.html) is now deprecated; Kubernetes is the recommended way to run SkyPilot on on-prem clusters.


Below is a detailed list of changes.

Managed Spot

New Features
* Spot pipeline support: automatically handles a pipeline of spot jobs. (1982)
* Spot dashboard is now available with `sky spot dashboard`: you can now see all your spot jobs in a GUI (2103, 2136)
* Spot callback - users can now run custom code when spot job status changes (2106, 2364)
* Resource configuration of the spot controller can now be customized ([docs](https://skypilot.readthedocs.io/en/v0.4.0/reference/config.html), #2040)
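
A spot pipeline can be sketched as a multi-document YAML, assuming the layout used by the linked `multi_jobs.yaml` example: a first document that names the pipeline, followed by one document per task. The file name, task names, resource requests, and commands below are hypothetical.

```yaml
# pipeline.yaml (hypothetical file name)
name: pipeline  # first document: names the pipeline as a whole

---

name: train  # first task
resources:
  cpus: 2+
run: echo "training"

---

name: eval  # second task, run after the first one finishes
resources:
  cpus: 2+
run: echo "evaluating"
```

Under this assumption, `sky spot launch pipeline.yaml` would submit both tasks as one managed job and run them in order.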

Enhancements
* SkyPilot now shows the spot job's resources and estimated cost before confirmation (2524)
* Switch to eager failover recovery policy for better spot lifetime (2234)
* Reduce the logging for launching spot controller (2056)

Fixes
* We now show PENDING spot job in the spot queue before it starts (2044)
* Robustness fixes (2102, 2153, 2119, 2004, 2330, 1998)


CLI & YAML interfaces

New Features
* Users can now use environment variables to dynamically define file_mounts ([docs](https://skypilot.readthedocs.io/en/v0.4.0/running-jobs/environment-variables.html#using-in-file-mounts), 2146)
* `sky status` can now show the head IP of the cluster with `-a` or `--ip` flags (2305, 2563)
* `sky down/stop/start` default to the existing cluster when exactly one exists, and `sky cancel` without a cluster name cancels the latest task (2325)

Enhancements
* `sky check` output is now friendlier with more hints for disabled clouds (2002, 2017, 2196, 2114, 2221, 2377)
* `sky down` progress bar now reflects clusters that failed to terminate (1595, 2005)
* We now fail early if rsync is not installed locally (2168)
* Better messages and hints for CLI (2027, 2028, 2077, 2083, 2085)

Fixes
* Fixed the order of VMs in optimizer table when `--cpus` is provided (2037)
* Better handling when `sky launch` is interrupted (2206, 2252)

Backend

New Features
* Users can now open ports for their clusters with the `ports` field ([docs](https://skypilot.readthedocs.io/en/v0.4.0/examples/ports.html), #2210, 2477)
* Docker support in `image_id` - tasks can now be run inside docker containers ([docs](https://skypilot.readthedocs.io/en/v0.4.0/examples/docker-containers.html#using-docker-containers-as-runtime-environment), 1910)
* Users can now clone a cluster from an existing cluster's disk with the `--clone-disk-from` flag (2098)
* Users can now launch their own ray cluster on a SkyPilot cluster (2020)
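
As a rough sketch of the Docker support in `image_id` listed above, a task can name a container image with a `docker:` prefix so that `setup` and `run` execute inside it. The image and the commands here are examples only.

```yaml
resources:
  image_id: docker:ubuntu:20.04  # example image; the docker: prefix selects a container as the runtime

setup: |
  # Runs inside the container.
  echo "setting up"

run: |
  # Also runs inside the container; prints the container's OS details.
  cat /etc/os-release
```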

Enhancements
* 30x faster failover for AWS and GCP when quotas are not available (1953, 2187, 2313)
* Faster `sky launch` by caching cluster IP address (2400)
* Job queue is now more resource efficient, with significant memory consumption reduction on remote cluster (1636)
* Cluster names no longer map directly to cloud cluster names. Instead, they are mapped to a unique cluster name on the cloud. This helps with isolation across users sharing cloud accounts. (2403)
* More efficient and robust stopping/termination for AWS (2121)
* `sky status --refresh` for STOPPED cluster is 10x faster (2079)
* Empty YAML fields are now allowed (1890)

Fixes
* Manually started/stopped clusters are now better handled (2130, 2203, 2389)
* Fix edge case where existing clusters were terminated when resources are not available (2170)
* Fixes for disk_tier UX (2156, 2215)
* Robustness fixes (2033, 2061, 2009, 2491, 2290, 1259, 2074, 2023, 2042)


Storage

New Features
* IBM COS is now supported (1966)
* `sky spot launch` will now exclude files listed in .gitignore (2018)

Enhancements
* Storage deletion is now parallelized and therefore faster (2058)
* UX improvements for `sky storage` CLI (2063, 2177)
* GCS bucket mounting now uses gcsfuse v1.0.1 (2470)

Fixes
* Fix transient failures when uploading to GCS from macOS due to a multiprocessing bug (2125)
* Robustness fixes (2049, 2117, 2165, 2259, 2326, 2250)


Dependencies

* Avoid buggy grpcio versions (2055)
* Pydantic is pinned to `<2.0` (2157)
* PyYAML is pinned to `>3.13, != 5.4.*` to avoid issues with Cython 3 (2256, 2514)
* Ray `<= 2.6.3` is supported on local machines (2401)
* `pycryptodome`, `oauth2client` are no longer required (2515)


Clouds

AWS
* H100 GPUs are now supported (2323)
* New [docs](https://skypilot.readthedocs.io/en/v0.4.0/cloud-setup/cloud-auth.html) for AWS cloud administrators on advanced login options (SSO and account switching) (#1888)
* Insufficient permission is now handled gracefully (2415, 2456)
* Fixed a bug where existing AWS cluster would end up in INIT state after changing identity (2442)
* Fix fetching AZ when describe zones permission does not exist in all regions (2463)


GCP
* Nvidia L4 GPUs are now supported (2212)
* [Machine Images](https://cloud.google.com/compute/docs/machine-images) are now supported (#2280)
* GCP reservations are now supported (2352)
* SkyPilot optimizer is 4x faster for GCP instances (2410)
* GCP pricing is now dynamically fetched and is more robust (2118, 2076, 2131)
* Default image has been updated to Debian 11 (2279)
* New [docs](https://skypilot.readthedocs.io/en/v0.4.0/cloud-setup/cloud-permissions/gcp.html) for administrators on the minimal permissions a GCP account needs to use SkyPilot (#2100, 2112)
* Robustness fixes (2135, 2199, 2124, 1879, 2116)
* TPU support is now more robust (2310, 2471, 2350, 2540)
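
For the GCP reservation support listed above, a config entry might look like the following sketch. It assumes the `specific_reservations` key described on the linked config reference page; the project and reservation names are hypothetical.

```yaml
# ~/.sky/config.yaml
gcp:
  specific_reservations:
    # Hypothetical project and reservation names.
    - projects/my-project/reservations/my-reservation
```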

Azure

* westus3 region is now supported (2149)
* Fix status refresh for Azure (2120)
* Fix Azure disk tier interruption for optimize progress (2111)
* Azure catalog fetching is more robust (2115, 2553)


Lambda
* Add H100 support for Lambda Cloud (2010, 2323)
* API rate limit is now handled with backoff and retry (2265)
* Errors are now more detailed (2371)

Oracle Cloud Infrastructure (OCI)
* OCI is now supported (1909, 2047, 2057, 2068, 2034, 2070, 2069, 2062, 2067, 2092, 2095, 2099)

Samsung Cloud Platform (SCP)
* Samsung Cloud Platform (SCP) is now supported for single-node clusters (1941, 2001, 2014)

Examples
* New DeepSpeed [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/deepspeed-multinode/sky.yaml) (#2208)
* New Distributed Tensorflow [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/tensorflow_distributed/tf_distributed.yaml) (#1721)
* New DVC [example](https://github.com/skypilot-org/skypilot/blob/releases/0.4.0/examples/dvc/dvc_pipeline.yaml) (#2444)
* Examples dependencies are now up to date (2145, 2223, 2359)

[Full changelog](https://github.com/skypilot-org/skypilot/compare/v0.3.0...v0.4.0)

Thanks to all contributors!
New contributors: JGoo1, tobi, HysunHe, blucz, shethhriday29, MaoZiming, ksasi, pushmatrix, hzeng-0, saihtaungkham, fozziethebeat, n10dollar, asaiacai, mtaku3, gbmarc1, alex000kim, steve-marmalade, xzrderek, sunny0826.

Many thanks to everyone who contributed to this release!

Michaelvll, concretevitamin, romilbhardwaj, cblmemo, HysunHe, landscapepainter, shethhriday29, infwinston, alex000kim, suquark, sunny0826, gbmarc1, MaoZiming, xzrderek, tobi, steve-marmalade, saihtaungkham, pushmatrix, n10dollar, mtaku3, ksasi, hzeng-0, fozziethebeat, blucz, asaiacai, WoosukKwon, JGoo1, mraheja, iojw, hemildesai, ewzeng, aviweit, Saikrishna-Achalla, Cohen-J-Omer

0.3.3

This patch release brings many bug fixes and features, including new mechanics for stop/down, callbacks for spot jobs, and a critical dependency fix for PyYAML after the release of Cython 3.

0.3.2

This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the Pydantic dependency issue, disk cloning, file mounts, and cloud-specific improvements.

0.3.1

This is a patch release to ship **several important enhancements and bug fixes**:

Enhancements
- **On-demand H100 GPU from Lambda is supported!** `sky launch --gpus h100`
  - To use it, remove any previous Lambda catalog: `rm -rf ~/.sky/catalogs/v5/lambda`
- Managed spot: make job cancellation during failover more robust to mitigate a rare `FAILED_SETUP` error (1998)

Fixes
- Provisioner / Backend
  - Fix provision failover encountering FileNotFoundError (2005)
  - Fix user-level ray cluster causing SkyPilot cluster to be in INIT state (2020)
- Logging
  - Fix certain logs of multi-node jobs not being streamed due to Ray 2.4 log dedup (2026)
  - Fix logs being created in current pwd `$PWD/~/sky_logs` in some cases (2009)
- Managed spot
  - Fix `sky spot launch --retry-until-up` to make it actually retry until up (2004)
- Storage
  - Fix a rare storage cloud check error if `sky check` has never been called (2017)
- On-prem
  - Fix detecting A5000 and A6000 GPUs (2023)

**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.3.0...v0.3.1

0.3
