We are excited to release SkyPilot v0.5.0, which introduces a significant number of new features and enhancements, including:
* SkyPilot Serving
* New provisioner
* LLM recipes for the latest open models and engines
* Kubernetes support improvements
* 4 new clouds (contributed by the cloud providers!)
and more!
Release Highlights
**New Features**
* [**Multiple candidate resources**](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html#multiple-candidate-resources): SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, `any_of` or `ordered` in `resources`), allowing users to significantly enlarge the resource pool and get higher availability.
* [**New Provisioner**](https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?usp=sharing): The provisioner has a new implementation that is **2x faster and more reliable** on supported clouds and supports launching clusters with more than **100 nodes**. Dependency requirements for clouds are also significantly reduced.
* **Disk Tier**: Introducing the `best` disk tier, which picks the best disk tier available on any cloud. (2434)
* Allow **2x as many spot jobs** to run concurrently
* Mount storage back after cluster restart
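
To make the difference between `any_of` and `ordered` concrete, here is a minimal Python sketch of the selection semantics as we understand them from the linked docs: `ordered` tries candidates in the listed order, while `any_of` treats them as an unordered set and picks the cheapest one that is available. The `Candidate` class, the prices, and the availability flags are hypothetical illustration values, not SkyPilot APIs or real catalog data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One hypothetical candidate resource (not a SkyPilot class)."""
    accelerator: str
    hourly_price: float  # made-up price, for illustration only
    available: bool      # made-up availability flag


def pick_ordered(candidates: List[Candidate]) -> Optional[Candidate]:
    """`ordered`: take the first available candidate in the user's order."""
    for candidate in candidates:
        if candidate.available:
            return candidate
    return None


def pick_any_of(candidates: List[Candidate]) -> Optional[Candidate]:
    """`any_of`: take the cheapest available candidate, ignoring order."""
    available = [c for c in candidates if c.available]
    return min(available, key=lambda c: c.hourly_price, default=None)


candidates = [
    Candidate("A100", 3.50, available=False),
    Candidate("A10G", 1.20, available=True),
    Candidate("L4", 0.80, available=True),
]

print(pick_ordered(candidates).accelerator)  # A10G
print(pick_any_of(candidates).accelerator)   # L4
```

Either way, listing more candidates enlarges the pool that a launch can fall back to, which is where the higher availability comes from.
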
SkyServe
[SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) is a serving system on top of SkyPilot that deploys and scales any HTTP services across one or more regions or clouds, with autoscaling, load balancing, and more.
* Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (2458)
* Autoscaler: Request-rate-based autoscaling policy. (2868, 2878)
* Autoscaler: Support scaling to 0 when there are no requests (2938)
* Rolling update: Support rolling update for existing services (2935, 3057)
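
A request-rate-based policy can be sketched as follows, assuming the autoscaler aims for a fixed queries-per-second load per replica and clamps the result between a minimum and a maximum replica count. The function and its parameter names are illustrative, not SkyServe's internal API; passing `min_replicas=0` models scaling to zero when no requests arrive.

```python
import math


def target_replicas(request_rate_qps: float,
                    target_qps_per_replica: float,
                    min_replicas: int,
                    max_replicas: int) -> int:
    """Return how many replicas are needed for the observed request rate."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Round up so that no replica is asked to serve more than its target.
    needed = math.ceil(request_rate_qps / target_qps_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


print(target_replicas(25.0, 10.0, min_replicas=0, max_replicas=5))   # 3
print(target_replicas(0.0, 10.0, min_replicas=0, max_replicas=5))    # 0
print(target_replicas(500.0, 10.0, min_replicas=0, max_replicas=5))  # 5
```
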
**Other Enhancements**
* Environment variable support in services field (3078)
* Override task configurations with CLI arguments (2979)
* Logging improvements for replicas (2924, 2949)
* Smoke tests for SkyServe (2911)
* [Documentation](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) for SkyServe (#3022, 2794, 2864, 2894, 2922, 2989, 3182)
* UX improvements for SkyServe (2895, 2940, 2961, 3054, 3176, 3094)
* Bug fixes and robustness improvements (2811, 2822, 2860, 2995, 2983, 3058, 3075, 3226)
New LLM Recipes
* [Gemma](https://github.com/skypilot-org/skypilot/tree/master/llm/gemma): Serve your Gemma on any cloud (#3207, 3220)
* [SGLang](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang): Speed up your LLM deployments with [SGLang](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe (#3126, 3140, 3170, 3145)
* [Mixtral 8x7B](https://github.com/skypilot-org/skypilot/tree/master/llm/mixtral): Serve and scale the Mixtral 8x7B model on any region or cloud (#2857, 2888, 3017, 3067, 2882)
* [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/): Official docs for hosting Mistral 7B from mistral.ai (#2615, 2856)
* [CodeLlama](https://github.com/skypilot-org/skypilot/tree/master/llm/codellama): Hosting CodeLlama model with SkyServe and accessing it with API, chat or VSCode (#3050, 3143)
* [LoRAX](https://github.com/skypilot-org/skypilot/tree/master/llm/lorax): Efficient multi-LoRA LLM inference (#2883)
* [axolotl](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl): A recent LLM tool for finetuning AI models, running on SkyPilot (#2784, 2789)
* [Tabby](https://github.com/skypilot-org/skypilot/tree/master/llm/tabby): Self-host coding assistant Tabby on SkyPilot (#2597, 3068)
* [vLLM](https://github.com/skypilot-org/skypilot/tree/master/llm/vllm): Serve with vLLM to expose OpenAI API for Vicuna and Mixtral (#2614, 2643, 2616, 2786, 2791, 2948, 3118)
* [TGI](https://github.com/skypilot-org/skypilot/tree/master/examples/serve/huggingface-tgi.yaml): Scale the inference engine TGI with SkyServe (#3121)
Kubernetes
Kubernetes support received a number of **New Features** and **Enhancements**.
* Multi-node support for Kubernetes (2609, 3019)
* Open ports support for Kubernetes (2588, 2713, 2997, 3200)
* Support CoreWeave label for GPUs in Kubernetes (CoreWeave support under development) (2650)
* Starting a Kubernetes GPU cluster locally with `sky local up` (2890)
* Custom image support for Kubernetes instances (2729, 3019, 3210)
* New provisioner for Kubernetes for better performance and robustness (3019)
* Supporting Kubernetes cluster launched with k3s and Rancher (3148)
**Other Enhancements**
* Support H100 80GB in Kubernetes (2840)
* Share SSH jump pod across users to reduce resource consumption (2826)
* Allow `KUBECONFIG` env var for config file specification (3169)
* Robustify Kubernetes cluster removal (3043)
* Fix GPU labeller (2636, 2653)
* UX and robustness improvements (2638, 2712, 2589, 2785, 2551, 2795, 2884, 2913)
* Documentation improvements (2595, 2705, 2957, 2991, 2997, 3119)
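
For readers unfamiliar with `KUBECONFIG`, the sketch below shows the usual kubectl convention for resolving it: the variable holds one or more config file paths separated by the platform's path separator, and `~/.kube/config` is used when it is unset. This illustrates the convention only; how SkyPilot reads the variable internally is not described in these notes, and the helper name `kubeconfig_paths` is hypothetical.

```python
import os
from typing import List, Mapping

# Conventional default location used by kubectl when KUBECONFIG is unset.
DEFAULT_KUBECONFIG = "~/.kube/config"


def kubeconfig_paths(env: Mapping[str, str]) -> List[str]:
    """Return the kubeconfig files named by KUBECONFIG, or the default."""
    raw = env.get("KUBECONFIG", "")
    # KUBECONFIG may list several files, separated like PATH entries.
    paths = [p for p in raw.split(os.pathsep) if p]
    if not paths:
        paths = [DEFAULT_KUBECONFIG]
    return [os.path.expanduser(p) for p in paths]


print(kubeconfig_paths(os.environ))
```
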
More Clouds
SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: **VMware vSphere**, **RunPod**, **Fluidstack** and **Cudo Compute**.
* [RunPod](https://www.runpod.io/): RunPod is a specialized AI cloud with additional capacity for high-end GPUs. (#2980, 3018)
* [Fluidstack](https://www.fluidstack.io/): Fluidstack offers accessible GPUs for AI at low cost. (#3086, 3224)
* [Cudo Compute](https://www.cudocompute.com/): A GPU cloud that provides low-cost GPUs powered by green energy. (#2975, 3224)
* [VMware vSphere](https://www.vmware.com/products/vsphere.html): You can now bring your own vSphere cluster to SkyPilot. ([docs](https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/vsphere.html)) (#3000)
Clouds
AWS
**New Features**
* New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (1702, 2719, 2792)
* Support for AWS Trainium accelerator (2690)
* Support null for proxy command to filter regions (2756)
* Support CUDA 12.1 with default image updates (2788)
* Job scheduling on Inferentia and Trainium (2969, 2798)
* Allow specifying security_group (3133)
**Enhancements**
* Make public / private subnet selection robust (2867)
* Avoid hanging when restarting an instance in the STOPPING state (2998)
* Remove sunset instance types (2610)
* Add docs for custom VPC support (2776)
**Fixes**
* Fix conda installation on AWS default image (3206)
* Robustify the custom image support (3216)
* Fix subnet selection for AWS and autodown for spot instances (2921)
* Fix minimal permission for AWS (2978)
* Improve opening ports for AWS (2716)
* Autostop with new provisioner (2719)
GCP
**New Features**
* Security: Custom VPC support for GCP. (2764, 2772, 2854, 2944)
* Security: Support private IP with proxy jump on GCP. (2819)
* New provisioner: Adopted the new provisioner for GCP, with >2x faster and more robust provisioning (2681, 2719, 2943)
* Automatically use reserved instances from multiple reserved pools (2836, 2681)
* Support L4 accelerator for GCP (2724)
* Allow stopping spot clusters on GCP (2877)
**Enhancements**
* Allow stopping VM with local SSD (2587)
* Update default runtime version for TPU node (2601, 2602)
* Handle transient errors when launching GCP clusters (2669)
* Update GCSFuse version to 1.3.0 for GCS storage mount (2887)
* Make TPU VM the default option for TPU accelerators (1758)
* Ignore missing GCP credentials for latest gcloud and avoid duplicating credentials (3028, 3172, 3234)
**Fixes**
* Fix custom docker image support (3218)
* Fix minimal roles required for GCP (2704)
* Robustify the catalog fetching (3141)
* Fix ports on TPU VM and cluster launched before 0.4.0 (2641)
* Fix backward compatibility issue with GCP clusters (2604)
* Fix `--disk-size` for Custom Machine Images (2718)
* Update catalog fetcher with more options (2562)
* Assign GCP VMs with service account (2972)
* Fix machine image support (3030, 3236)
* Fix error handling for failed provisioning (2852)
* Leave out TPU v5 in catalog as it is not supported (2656)
* Fix GCP minimal permission (2947, 2770, 2761)
Azure
**Enhancements**
* Make port opening more robust (2649, 2891, 3084)
* Additional arguments for Azure catalog fetcher and support H100 (2561, 2844, 2847)
* Support CUDA 12.1 with default image updates (2468)
* Support spot instances on Azure (2871)
**Fixes**
* Fix custom docker image support (3218)
* UX: Fix Azure disk tier explicitly shown in resources str (3064)
* Fix status query for Azure (3015)
SCP
* Fix SCP error raised in `sky check` (3038)
CLI & Core interfaces
**New Features**
* Multi-node jobs fail fast on a single node failure (3081)
* Add configurations for not uploading credentials (2904)
* Adding `sky status --endpoints` CLI (3199)
* Support more characters in cluster name (3130)
* Show all regions and more accurate price in `sky show-gpus` (2583, 2892, 2933, 2946, 3083, 3149, 3113)
* Allow inferring cloud from region or zone (2632)
* Add `--commit` and `--version` for `sky` CLI (2720, 2731, 2733)
**Enhancements**
* Robustify runtime initialization on remote cluster (3132)
* Better error message for YAML parsing (3040)
* Smarter GPU name completion (3014)
* Speed up retry until up by not doing exponential backoff (2821)
* Add schema validation for config (2645)
* Allow `--disk-tier none` override (2906)
* `sky check` improvement (3174, 3212, 3160)
* Better logging for CLIs (2535, 2691, 2728, 3139, 3175)
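
To see why dropping exponential backoff speeds up retry-until-up, the sketch below compares the total time spent waiting under the two strategies. The initial gap, growth factor, and attempt count are hypothetical values chosen for illustration and are not SkyPilot's actual retry settings.

```python
from typing import List


def exponential_waits(initial: float, factor: float, attempts: int) -> List[float]:
    """Wait times (seconds) that grow by `factor` after every failed attempt."""
    return [initial * factor ** i for i in range(attempts)]


def constant_waits(gap: float, attempts: int) -> List[float]:
    """Wait times (seconds) that stay fixed across attempts."""
    return [gap] * attempts


attempts = 6
print(sum(exponential_waits(5.0, 2.0, attempts)))  # 315.0
print(sum(constant_waits(5.0, attempts)))          # 30.0
```

With a fixed gap, a cluster that becomes available late in the retry loop is picked up after seconds rather than minutes of idle waiting.
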
**Fixes**
* Fix permission issues for SSH config file on specific Linux distributions (3151)
* Fix `sky_logs` and mounting directory (2667, 2845)
* Fix job related commands (2662, 2767)
* Fix `sky logs` with `--sync-down` (2660)
**Deprecations**
* Deprecate `cpunode/gpunode/tpunode`, hide `admin` (2800)
* Remove deprecated `Local` cloud which is now replaced by Kubernetes support (3037, 3186)
Backend/Provisioner
**New Features**
* Support multiple candidate resources (2498, 2803, 2833, 2886, 3107)
* Support launching 100-node cluster for AWS, GCP, Kubernetes, and RunPod (3004, 3005)
* Support spaces in paths (2762)
* Support long local username with special characters (3105, 3130)
**Enhancements**
* Robustify termination of failed clusters during failover (2990)
* Improve the ssh check for clusters just provisioned (2797)
* Robustify failover to avoid terminating clusters that have user data (2977)
* Move ssh config to `~/.ssh/generated/ssh` instead of directly editing `~/.ssh/config` (2706, 3069)
* Code refactoring and cleanup (2541, 2736, 3046, 2633, 2870, 2925, 3087, 3088, 3153)
* Improve usage collection (2654, 2672)
* Better explanation of failover in docs (2850, 2834)
**Fixes**
* Avoid backward compatibility issue with provisioner (2682)
* Fix cloud provisioning internal file mount cache (2715)
* Fix optimization for DAG when some resources provided are not feasible (2657)
* Fix runtime installation on remote VM (2909, 2912)
* Fix cluster termination when the cluster is not fully UP (3025)
* Fixes for tests (2651, 2976, 3023, 3166, 3167, 3202)
* Improve logging (2594, 2678, 2696, 3003)
Managed spot
**New Features**
* Allow 2x as many spot jobs to run concurrently (3191, 3208)
**Enhancements**
* Better logging and UX (2630)
* Add docs for customizing spot controller (2753)
* Add spot pipeline docs (2936)
**Fixes**
* Fix private VPC support for spot jobs (2874)
* Fix `~/.sky/config.yaml` for spot jobs (2876)
* Fix OOM for long running spot jobs (2675)
* Fix AWS NoCredentialError caused by credential rotation (2695)
* Fix Azure dependency on spot controller (2875)
Storage
**New Features**
* Mount storage back to clusters after restart (2322, 2804)
**Enhancements**
* Clarify the syntax for external and managed storage (3162, 2804)
* Confirmation prompt for `sky storage delete`, and `--yes` flag to skip it (2726)
* Refactor and clean up storage code (2774, 2986)
**Fixes**
* Fix permission issue for S3 mounting on specific images (3215)
* Fix spaces in source path for storages (2835)
Dependencies
* Recommend nightly build in docs for better performance and robustness (2984)
* Automatic build for nightly Docker image (2229)
* Avoid ray dependency locally for AWS, GCP, and Kubernetes (2625, 2943, 3019)
* Remove AWS dependency by default for faster setup and fewer conflicts (2841, 2942)
* Fix GCP dependency by updating google-api-python-client (2577, 2759)
* Pin remote dependency for ray job (2659)
* Robustify dependencies (2642, 2679, 3024)
Examples
* NeMo distributed training for BERT and GPT3 (2533)
* Add docker compose example to run multiple containers (2745)
* Distributed ray train example (2828)
* Benchmark Torch DDP (2987)
* Example updates for supported models (2637, 2825)
**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.4.0...v0.5.0
Thanks to all contributors!
New contributors: rtalaricw, jackyk02, Vaibhav2001, rohanvaidya45, Shrinandan, manishiitg, amitkumarj441, tgaddair, aseriesof-tubes, changxiaohui, thams, kishb87, PratikKumar125, mmcclean, dtran24, davidwagnerkc, mjibril, kbrgl, msehsah1, JungleCatSW, Ying1123
Many thanks to everyone who contributed to this release!
Contributors: Michaelvll, concretevitamin, cblmemo, romilbhardwaj, MaoZiming, landscapepainter, sunny0826, suquark, Vaibhav2001, infwinston, hemildesai, asaiacai, Shrinandan, kishb87, rtalaricw, iojw, aseriesof-tubes, manishiitg, jackyk02, mmcclean, thams, amitkumarj441, rohanvaidya45, saihtaungkham, tgaddair, davidwagnerkc, PratikKumar125, dtran24, changxiaohui, mjibril, kbrgl, msehsah1, JungleCatSW, Ying1123