Skypilot

Latest version: v0.7.0

0.7.0

We are excited to announce the release of SkyPilot v0.7.0! This release brings significant performance improvements and many new features:

* Up to 3x faster provisioning
* Reservation support: AWS Capacity Reservations, AWS Capacity Blocks, GCP reservations, GCP Dynamic Workload Scheduler (DWS), and more
* Observability features
* Admin policy enforcement
* Support for H100 Mega, TPU v6, TPU v5, gVNIC, Azure Blob Storage, faster disks, and more
* New UX for `sky` CLI

and many bug fixes and enhancements!

Release Highlights

Performance
We have made 2-3x performance improvements across cloud providers through optimizations in our provisioning stack and the images we use.

| Cloud | Provisioning Time | Speedup |
|----------------|-------------------|---------|
| AWS | 1 min 10s | 3x |
| GCP | 1 min 15s | 3x |
| Azure | 2 min 16s | 2x |
| Kubernetes | 52s | 2.5x |


Reservations
SkyPilot now supports [short-term and long-term reservations](https://skypilot.readthedocs.io/en/latest/reservations/reservations.html) across clouds:

* AWS Capacity Reservations
* AWS Capacity Blocks
* GCP reservations
* GCP Dynamic Workload Scheduler (DWS)
* Bring your own [VMs](https://skypilot.readthedocs.io/en/latest/reservations/existing-machines.html) or [Kubernetes clusters](https://skypilot.readthedocs.io/en/latest/reference/kubernetes/index.html)

SkyPilot's failover includes these reservations, so they can be combined with spot instances or any other resources/clouds to create a resilient and cost-effective infrastructure.

Observability on Kubernetes

SkyPilot now has two new observability features on Kubernetes:
* `sky status --kubernetes` shows all SkyPilot resources on the cluster. (4040, 4079)

    $ sky status --cloud kubernetes
    Kubernetes cluster state (context: mycluster)
    SkyPilot clusters
    USER   NAME                           LAUNCHED    RESOURCES                                  STATUS
    alice  infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
    alice  sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
    alice  sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
    bob    dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
    bob    multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
    bob    sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP

    Managed jobs
    In progress tasks: 1 STARTING
    USER   ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  RECOVERIES  STATUS
    alice  1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0           SUCCEEDED
    bob    4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0           SUCCEEDED
    bob    3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0           STARTING
    bob    2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0           FAILED
    bob    1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0           SUCCEEDED

* `sky show-gpus --cloud kubernetes` shows detailed GPU availability information on the cluster. (3816, 4085)

    $ sky show-gpus --cloud kubernetes
    Kubernetes GPUs
    GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
    L4    1, 2, 4                   8           8
    H100  1, 2, 4, 8                16          16

    Kubernetes per node GPU availability
    NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
    my-cluster-0  L4        4           4
    my-cluster-1  L4        4           4
    my-cluster-2  H100      8           8
    my-cluster-3  H100      8           8


Admin policy enforcement

SkyPilot has a new [admin policy mechanism](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html) (#3966) that admins can use to enforce policies on users’ SkyPilot usage. These policies apply custom validation and mutation logic to a user’s tasks and SkyPilot config.

Example policies:
* [Add Labels for all Tasks on Kubernetes](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#kubernetes-labels-policy)
* [Always Disable Public IP for AWS Tasks](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#disable-public-ip-policy)
* [Use Spot for all GPU Tasks](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#use-spot-for-gpu-policy)
* [Enforce Autostop for all Tasks](https://skypilot.readthedocs.io/en/latest/cloud-setup/policy.html#enforce-autostop-policy)
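
To make the "validation and mutation" idea concrete, here is a minimal, self-contained sketch of two of the example policies. It does not use SkyPilot's actual policy interface (see the linked admin policy docs for that); `TaskRequest`, `PolicyError`, and the function names are hypothetical stand-ins chosen only for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class TaskRequest:
    """Hypothetical stand-in for a user's task; not a SkyPilot class."""
    name: str
    cloud: str
    use_spot: bool = False
    accelerators: dict = field(default_factory=dict)
    autostop_minutes: Optional[int] = None


class PolicyError(ValueError):
    """Raised when a request violates a policy and cannot be fixed up."""


def use_spot_for_gpu(req: TaskRequest) -> TaskRequest:
    """Mutation: switch any task that requests accelerators to spot."""
    if req.accelerators and not req.use_spot:
        return replace(req, use_spot=True)
    return req


def enforce_autostop(req: TaskRequest, max_idle_minutes: int = 60) -> TaskRequest:
    """Validation: reject tasks without autostop or with too long an idle time."""
    if req.autostop_minutes is None:
        raise PolicyError(f"task {req.name!r} must set autostop")
    if req.autostop_minutes > max_idle_minutes:
        raise PolicyError(
            f"task {req.name!r} autostop exceeds {max_idle_minutes} minutes")
    return req


def apply_policies(req: TaskRequest) -> TaskRequest:
    """Run every policy in order; each returns the (possibly mutated) request."""
    for policy in (use_spot_for_gpu, enforce_autostop):
        req = policy(req)
    return req
```

A request that passes comes back possibly modified (here, with `use_spot=True`), while a request that cannot be repaired is rejected with an error before anything is launched.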


Azure Blob Storage support
In addition to S3, GCS and R2, you can now use Azure Blob Storage as a storage backend for storing and accessing data. (3032)

New AI hardware support
* New accelerators: TPU v6 (4115), TPU v5 (3814), H100 Mega (4099)
* Faster networking on GCP with gVNIC (4095)
* Faster disks: new disk tier `ultra` (3860) for GCP and AWS.

UX revamp
SkyPilot CLI is cleaner, simpler and even easier to parse now (4023)

<img src="https://i.imgur.com/fg8tOYq.gif" width="600"/>

New LLM Recipes

* Llama 3.1 and Llama 3.2 recipes (3990, 3779, 3780)
* llm.c training for GPT 2 (3611)
* Pixtral (3938, 3940)
* Qwen2-VL and Qwen 2.5 support (3961, 3959)
* Yi model family support (3958)
* Nemo GPT (3743)
* Other examples: Airflow (3982), AWS Neuron Accelerator (4020), and Deepspeed with k8s support (4124)

Deprecation Notice

* All `SKY_*` environment variables are deprecated in favor of `SKYPILOT_*` variables.
* All `SKY_*` variables will be removed in v0.9.0.
* See [docs](https://skypilot.readthedocs.io/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) for list of currently supported variables.
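
Scripts that must run on both older and newer SkyPilot versions during the deprecation window could read the new name first and fall back to the old one. The helper below is a sketch under that assumption; `get_skypilot_env` is a hypothetical name, and the `NODE_RANK` suffix in the usage note is only an example, so check the linked docs for the variables that actually exist.

```python
import os
import warnings
from typing import Mapping, Optional


def get_skypilot_env(suffix: str,
                     env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return SKYPILOT_<suffix>, falling back to the deprecated SKY_<suffix>.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    new_name = f"SKYPILOT_{suffix}"
    old_name = f"SKY_{suffix}"
    if new_name in env:
        return env[new_name]
    if old_name in env:
        # The old prefix still works for now, but is scheduled for removal.
        warnings.warn(f"{old_name} is deprecated; use {new_name}",
                      DeprecationWarning, stacklevel=2)
        return env[old_name]
    return None
```

For example, `get_skypilot_env("NODE_RANK")` prefers `SKYPILOT_NODE_RANK` and only consults `SKY_NODE_RANK` when the new variable is absent.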


Backend

New Features

* Managed jobs can now recover from job-level failures (e.g., GPU errors, non-zero exit codes, etc.) (3919)
  * Set `max_restarts_on_errors` to specify the number of times SkyPilot should try to restart the job.

    resources:
      job_recovery:
        max_restarts_on_errors: 3  # Retry 3 times before marking the job as failed

* Nvidia GPUs can now disable ECC (3676)
* New environment variable `SKYPILOT_NUM_NODES` to fetch the number of nodes in the cluster. (3656)
* SkyPilot [config can now be overridden](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html#experimental-configurations) in the task definition with `experimental.config_override` (3689)

    experimental:
      config_override:
        docker:
          run_options: ...
        kubernetes:
          pod_config: ...
          provision_timeout: ...
        gcp:
          managed_instance_group: ...
        nvidia_gpus:
          disable_ecc: ...
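
The `max_restarts_on_errors` setting described in this section can be pictured as a bounded retry loop: the job runs once, and each failure consumes one restart until the budget is exhausted. The sketch below only illustrates that counting; it is not SkyPilot's implementation, and `run_with_restarts` and `run_job` are hypothetical names.

```python
from typing import Callable, Tuple


def run_with_restarts(run_job: Callable[[], int],
                      max_restarts_on_errors: int) -> Tuple[bool, int]:
    """Run `run_job` until it exits 0 or the restart budget is used up.

    Returns (succeeded, restarts_used). With a budget of 3, the job is
    attempted at most 4 times: the initial run plus 3 restarts.
    """
    restarts = 0
    while True:
        exit_code = run_job()
        if exit_code == 0:
            return True, restarts
        if restarts >= max_restarts_on_errors:
            # Budget exhausted: the job is marked as failed.
            return False, restarts
        restarts += 1
```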



Enhancements

* SSH keys: `AddKeysToAgent` support in the SSH config file and SSH commands (3985)
* SkyPilot runtime is now installed in a separate conda environment, reducing interference with user's environment. (3639)
* Similarly, the environment pre-configured in your docker image is no longer shadowed by SkyPilot's runtime environment (3874, 3867)
* `docker.run_options` now allows users to pass additional options when running docker containers. (3682)

Fixes
* Fix `sky cancel` not terminating all child processes (3919)
* Fix provisioning failures when multiple versions of SkyPilot are installed (3866)
* Shell autocomplete installation is now more robust (3892, 3893)


Kubernetes

New Features

* Observability improvements:
  * `sky status --cloud kubernetes` shows all SkyPilot resources on the Kubernetes cluster. (4040, 4079)
  * `sky show-gpus --cloud kubernetes` shows detailed GPU availability information on the cluster. (3816, 4085)
* SkyPilot now helps you set up your clusters for running SkyPilot jobs.
  * If you already have a list of IPs and their SSH keys, `sky local up` can now [automatically set them up as a cluster](https://skypilot.readthedocs.io/en/latest/reservations/existing-machines.html) to be used for running jobs. (#3926)
  * If you don't have a cluster yet, we provide a simple [one-click setup script](https://github.com/skypilot-org/skypilot/tree/master/examples/k8s_cloud_deploy) to deploy VMs with Kubernetes on the cloud of your choice (#3929).
* SkyPilot job output is now piped to the container logs (3758)
  * Use your existing logging tooling (`kubectl logs`, filebeat, etc.) to view SkyPilot job outputs.
* Support for Nvidia GPU operator labels (`nvidia.com/gpu.product`) for detecting GPU types. (3493)
  * You no longer need to label GPUs if you have the Nvidia GPU operator installed.
* Spot instances are now supported on GKE clusters (3675)
* [Experimental] Multi-context support (3913, 3968, 3897, 3772, 4013)

Performance improvements:
* New command runner: 3x faster command submission for Kubernetes pods. (3157)
* `sky local up` for GPUs is now ~5x faster, provisioning in 2min 30s instead of 12min (3664)
* Our GPU images are now 3x smaller (1.5 GB), reducing the time to pull the image (3665)
* SSH jump pod is no longer required for `port-forward` mode (3657)
* SSH setup is now parallelized to speed up multi-node provisioning (4158)

Enhancements and fixes
* H100 Mega support on GKE (3891, 3627)
* Better handling for context names with special characters (4147)
* `--k8s` is now a valid alias for `--cloud kubernetes` (4151)
* Init containers are now supported on Kubernetes (3762)
* Auth: robust service account support and updated docs on minimal permissions (3632)
* Custom metadata annotations are now propagated to services, allowing configuration of internal load balancer services on cloud hosted Kubernetes clusters (3767)
* Provisioning errors are now surfaced clearly (3590, 3795, 3821)
* Cluster attributes (autodown, idle-minutes-to-autostop) are now added as annotations to the pod (3870)
* SkyServe controller is now automatically terminated when all replicas are terminated. (3984)
* Create namespace permission is no longer required in cluster launch flow (3714)
* If your cluster does not support `apparmor`, SkyPilot will now retry without requesting it. (4176)


Cloud: GCP

New Features
* New accelerators supported:
  * H100 Mega (4099)
  * TPU v5 (3814)
* [Dynamic Workload Scheduler (DWS)](https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler) support (#3574, 3835)
  * DWS helps get better availability on GCE through queuing and reservations.
* Faster `pd-extreme` disks with `disk_tier: ultra` (3860)
* New config `gcp.force_enable_external_ips` to force enable external IPs (3699)
  * This is useful when communication within a VPC is desired and the VM needs to make calls to the public internet.
* TPU VMs can now run docker containers (4115)

Enhancements

* Provisioning is now 3x faster on GCP (4027)
* Faster networking support with gVNIC (4095)
  * Up to ~2x faster in [PyTorch distributed benchmarks](https://gist.github.com/romilbhardwaj/89f8399d8a5307df5d880cf1495ce957)


Cloud: AWS

New Features

* Capacity blocks and capacity reservations are now supported. (3852, 3853)
  * You don’t have to wake up at 4:30am PDT to launch your job on a newly available capacity block: SkyPilot will wait for you until the start time of the capacity block.
* Faster `io2` disks with `disk_tier: ultra` (3860)
* Security groups: you can now specify security groups for your resources at a finer granularity. (3501)
* SkyPilot can now use encrypted EBS volumes (3765)

Enhancements

* Performance: provisioning now 3x faster on AWS (4091)
* Buckets created by SkyPilot are now tagged with labels specified in ~/.sky/config.yaml (3922)
* Label validation now handles `:` and other special characters. (3734)


Cloud: Azure

New Features
* You can now use any Azure community image with `--image-id` (4145)
* Azure Blob Storage is now supported (3032, 3796, 3807)
* Fractional A10 instance types are now supported (3877)
* You can now specify resource group for Azure instance provisioning (3764)
* Faster `Premium_LRS` disks with `disk_tier: high` (3921)

Enhancements

* Performance: provisioning is now 2x faster on Azure with our new provisioner and custom images (3697, 3704, 3696, 3700, 4139, 4167, 4205)
* Improved support for A10 GPUs (3707)
* SkyPilot now waits for an Azure resource group to be deleted instead of erroring out (3712)

SkyServe
* Readiness probe timeout can now be set in the service spec (3472)
* You can now tear down a specific replica with `sky serve down --replica-id` (4032)
* SkyServe controller region is now chosen from the replica resources (4053)


Storage
* Azure Blob Storage is now supported. (3032, 3796, 3807)
* [`.skyignore` support](https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files) (4038)
  * You can now add files to a `.skyignore` file to skip uploading them to cloud storage.
* GCSFuse is updated to 2.2.0, bringing better performance and reliability. (3619)


Other clouds

* Lambda Cloud support has been migrated to our new and more reliable provisioner (3865, 3889)
* Lambda Cloud now supports docker images (4115)
* CUDO now supports opening ports (3717)
* RunPod now supports opening ports (3748) and custom docker images (3728).
* FluidStack provisioning has been updated to their new API (3799)
* Paperspace now supports A4000 and P4000 GPUs (3991)
* OCI: bug fixes and improvements (4074, 4080)

Thanks to all contributors!

New contributors: winglian, Ultramann, jucor, BitPhinix, sethkimmel3, hyoxt121, BabyChouSr, wizenheimer, gurcangercek, shashank2000, ckgresla, bernardwin, kmushegi, Conless, JayThomason, colinjc, mtaran, Haijian06, KrishivPiduri, zpoint

Many thanks to all contributors who contributed to this release!

Contributors: Michaelvll, romilbhardwaj, cblmemo, landscapepainter, asaiacai, andylizf, yika, concretevitamin, colinjc, fozziethebeat, MaoZiming, JGSweets, Ultramann, Conless, jucor, wizenheimer, Haijian06, HysunHe, gurcangercek, bernardwin, JungleCatSW, BabyChouSr, hyoxt121, winglian, sethkimmel3, mjibril, shashank2000, ckgresla, zpoint, mtaran, KrishivPiduri, JayThomason, BitPhinix, kmushegi

**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.6.0...v0.7.0

0.6.1

This patch release brings many improvements and fixes to SkyPilot, including major performance improvements for Kubernetes and Azure and new features for AWS and GCP.

Stay tuned for a detailed changelog coming up in v0.7.0!

0.6.0

We are excited to release SkyPilot v0.6.0! This release includes a number of new features:
* [Managed Jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html) for job execution and recovery
* SkyServe and Jobs on Kubernetes
* Mix on-demand and spot instances in SkyServe
* New cloud: [Paperspace](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#paperspace)


Release Highlights

Managed Jobs

* The spot controller has been enhanced to support any job on on-demand or spot instances.
* To use, run `sky jobs launch` instead of `sky spot launch`.
* The new job controller can automatically recover jobs from any spot preemptions or hardware failures, and also execute pipelines of jobs.
* The `sky jobs` API is identical to the `sky spot` API, but also supports on-demand instances.

SkyServe and Jobs on Kubernetes

* SkyPilot can now run SkyServe and Managed Job controllers on Kubernetes
* This means you can now run your SkyServe and Managed Jobs on your Kubernetes cluster!
* Simply run `sky jobs launch` or `sky serve up`, and SkyPilot will automatically deploy the controller on your Kubernetes cluster if available and run jobs on the cheapest available location.


Mix on-demand and spot instances in SkyServe

* SkyServe now supports a new intelligent policy for mixing spot and on-demand instances. [Example](https://github.com/skypilot-org/skypilot/blob/master/examples/serve/spot_policy/base_on_demand_fallback_replicas.yaml).
* Uses on-demand instances to ensure availability and spot instances to save costs.
* Dynamically falls back to on-demand replicas when spot replicas are not available. [Example](https://github.com/skypilot-org/skypilot/blob/master/examples/serve/spot_policy/dynamic_on_demand_fallback.yaml).
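
The mixing policy can be thought of as a small planning function: a fixed base of on-demand replicas guarantees availability, the rest are requested as spot, and any spot shortfall is temporarily covered by extra on-demand replicas. This is a sketch of that idea only, not SkyServe's actual algorithm; `plan_replicas` and its parameter names are hypothetical.

```python
from typing import NamedTuple


class ReplicaPlan(NamedTuple):
    spot: int        # spot replicas to keep requesting
    on_demand: int   # on-demand replicas to run right now


def plan_replicas(target: int, base_on_demand: int, ready_spot: int,
                  dynamic_fallback: bool) -> ReplicaPlan:
    """Split `target` replicas between spot and on-demand instances."""
    base = min(base_on_demand, target)
    spot_wanted = target - base
    fallback = 0
    if dynamic_fallback:
        # Cover spot replicas that are not ready (e.g. preempted) with
        # on-demand replicas until spot capacity comes back.
        fallback = max(0, spot_wanted - ready_spot)
    return ReplicaPlan(spot=spot_wanted, on_demand=base + fallback)
```

With a target of 4 replicas and a base of 1 on-demand replica, losing two of the three spot replicas raises the on-demand count from 1 to 3 while the policy keeps asking for 3 spot replicas.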

Paperspace support
* Newest cloud to join the Sky: Paperspace!
* Paperspace offers the latest GPUs including H100 and A100-80GB for AI training and inference.
* Simply add your Paperspace API key to `~/.paperspace/config.json` and run `sky check paperspace` to [get started](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#paperspace).
* Big thanks to asaiacai for contributing Paperspace support!


More LLMs and Recipes

* New LLM Recipes: [Llama-3](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html), [Qwen](https://skypilot.readthedocs.io/en/latest/gallery/llms/qwen.html), [Ollama](https://skypilot.readthedocs.io/en/latest/gallery/frameworks/ollama.html), [DBRX](https://skypilot.readthedocs.io/en/latest/gallery/llms/dbrx.html), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Cog](https://github.com/skypilot-org/skypilot/pull/3219)


Deprecation Notes

The following features have been deprecated and will be removed in the next minor release:

* `sky spot` CLI: use `sky jobs` CLI instead.
* `core.spot_xxx` APIs: refactored to `jobs.xxx`.
* `qps_lower_threshold` and `auto_restart` in `service`: use `target_qps_per_replica` instead.

Changelog

Managed Jobs

* Changes made to the local catalog at ~/.sky/catalog are now reflected on the controller (3289)
* The name of the spot job is now included in the `SKYPILOT_TASK_ID` environment variable (3424)
* Legacy spot job APIs have been refactored from `core.spot_xxx` to `jobs.xxx` (3417)
* Cloud for the controller is now chosen based on the resources of the replicas (3363)
* Bug fixes (3302, 3397, 3459, 3468, 3480)

SkyServe

New Features

* New intelligent policy for mixing spot and on-demand instances in SkyServe (3194)
* **SkyServe now uses proxy** instead of HTTP redirect responses for better performance (3395)
* **Readiness probe now supports headers**: this is useful for authentication or other headers required for readiness checks (3552)

Enhancements

* Optimizations - replicas are reused when only service section is changed (3214)
* Rolling updates are now the default behavior for SkyServe (3249)
* Controller cloud is now chosen from replica resources if it is not already up (3231)
* Bug fixes and API improvements (3257, 3299, 3303, 3411, 3546)


Kubernetes

* Kubernetes clusters can now run SkyServe and Managed Jobs (3377, 3524, 3521)
* `sky show-gpus` now shows realtime availability of GPUs in the cluster (3499)
* Autoscaling Kubernetes clusters are now supported: SkyPilot can now wait for GKE node pools, Karpenter and other autoscalers to provision nodes (3513, 3415)
* Use Kubernetes service accounts by specifying `remote_identity` in ~/.sky/config.yaml (3377, 3527)
* `sky local up` now also automatically installs the Nginx Ingress Controller (3223)
* Support for specifying custom pod configurations with `pod_config` (3244)
  * Use this to modify the pod configuration for your environment, e.g., attaching volumes, specifying imagePullSecrets, increasing /dev/shm size limit, setting `HTTP_PROXY` and more! See [example `pod_config` here](https://skypilot.readthedocs.io/en/latest/reference/config.html).
* Support for specifying custom metadata to all Kubernetes resources created by SkyPilot (3333)
  * Useful for tracking resources created by SkyPilot in your Kubernetes cluster.
* Support for PodIP mode for exposing ports (3445)

Enhancements

* **GPU Isolation**: SkyPilot no longer uses privileged containers and pods can no longer use GPUs not allocated to them (3443)
* Ingress creation requests are now batched to minimize nginx reloads and ingress paths are namespaced (3263, 3373)
* All SkyPilot pods are now labelled with `skypilot-user` to identify the owner of the pod (3576)
* Special characters in environment variables are now correctly parsed (3322)
* GPU labelling is now more robust (3274)
* Bug fixes and quality of life improvements (3266, 3392, 3439, 3509, 3524, 3525, 3532, 3563, 3578, 3374)

CLI & Core interfaces

New Features

* `resources` now supports `labels` field to set labels (instance tags on aws, labels on gcp and k8s) on cloud resources (3464, 3505)
* `sky check` now supports checking credentials for specific clouds, e.g. `sky check aws gcp` (3229)
  * You can also restrict which clouds are checked by setting `allowed_clouds` in `~/.sky/config.yaml`. (3556)
* `any_of` or `ordered` fields in `resources` can now have clouds that are not enabled (3567)
* A [new environment variable `SKYPILOT_CLUSTER_INFO`](https://skypilot.readthedocs.io/en/latest/running-jobs/environment-variables.html), containing cluster name, cloud, region and zone is now available in all tasks (#3424)
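
A task can read `SKYPILOT_CLUSTER_INFO` at runtime, for example to tag logs or metrics with the region it landed in. The sketch below assumes the variable holds a JSON object with the keys `cluster_name`, `cloud`, `region`, and `zone`; the encoding and exact key names are assumptions here, so confirm them against the linked environment variable docs. `read_cluster_info` is a hypothetical helper name.

```python
import json
import os
from typing import Dict, Mapping, Optional


def read_cluster_info(
        env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Parse SKYPILOT_CLUSTER_INFO into a dict; missing fields become None."""
    env = os.environ if env is None else env
    raw = env.get("SKYPILOT_CLUSTER_INFO")
    info = json.loads(raw) if raw else {}
    # Assumed key names; adjust if the documented schema differs.
    keys = ("cluster_name", "cloud", "region", "zone")
    return {key: info.get(key) for key in keys}


if __name__ == "__main__":
    print(read_cluster_info())
```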

Enhancements

* Optimizer is up to 10x faster when multiple resources are specified (3567)
* Autostop timer is now reset at the start of a new sky launch to avoid unexpected autostops (3205)
* GCP GPUs now include `DEVICE_MEM` in `sky show-gpus` (3375)
* Better sorting for `sky show-gpus` (3492)
* Handling for usernames containing invalid characters (3528)
* Null environment variables now raise an error (3557)

Runtime & Backend

* SkyPilot now supports Python 3.11 (3248)
* SkyPilot runtime is now isolated from any environment changes made by user code (3575, 3326, 3339)
* Fix for jobs and services running longer than 12 days (3460)
* Docker runtime fixes and enhancements, including fix for storage mounting in container (3450, 3436, 3481, 3343)
* Bug fixes and optimizations (3280, 3292, 3178, 3386, 3407, 3423, 3368, 3457, 3469, 3482, 3495, 3512, 3536, 3568)

Optimizations

* Lazy imports for 2x faster import times (3394, 3463)
* Faster setup and job submission (3523, 3484)

Cloud: GCP

* H100 GPUs are now supported on GCP (3279)
* Support for fine-grained GCP IAM permissions (3284)

Cloud: Azure

* Custom images are now supported on Azure. Simply specify `image_id` in the `resources` field. (3362)
* 8x faster autostop for Azure (3519)
* Fix GPUs not being detected in Azure (3313)
* Provisioning fixes (3483)

Cloud: AWS

* Fine-grained IAM roles: you can now specify IAM roles on a per-resource basis (3488, 3514)
* SkyPilot can now be run in ECS containers by assuming `container-role` IAM roles (3503)
* SkyPilot will not delete user-specified security groups (3402)

Cloud: Fluidstack

* H100 and A100 Nvlink support for Fluidstack (3467)
* Opening ports is now supported for Fluidstack (3294)
* Bug fixes (3254, 3265)

Other Clouds

* Bug fixes for Lambda provisioning and termination (3409, 3410)
* Multi-gpu fixes for RunPod (3291)
* Cudo: handle missing project errors (3438)

Thanks to all contributors!

New contributors: MysteryManav, JGSweets, Harthgar, mjkanji

Many thanks to all contributors who contributed to this release!

Contributors: Michaelvll, romilbhardwaj, concretevitamin, cblmemo, MaoZiming, shethhriday29, asaiacai, JGSweets, mjkanji, MysteryManav, landscapepainter, Harthgar, mjibril, dtran24, fozziethebeat, JungleCatSW


**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.5.0...v0.6.0

0.5

0.5.0

We are excited to release SkyPilot v0.5.0, where we introduce a significant amount of new features and enhancements, including:
* SkyPilot Serving
* New provisioner
* LLM recipes for the latest open models and engines
* Kubernetes support improvements
* 4 new clouds (contributed by the cloud providers!)

and more!

Release Highlights

**New Features**

* [**Multiple candidate resources**](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html#multiple-candidate-resources): SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, `any_of` or `ordered` in `resources`), allowing users to significantly enlarge the resource pool and get higher availability.
* [**New Provisioner**](https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?usp=sharing): The provisioner has a new implementation that is **2x faster and more reliable** for supported clouds and supports launching clusters with more than **100 nodes**. Dependency requirements for clouds are also significantly reduced.
* **Disk Tier**: Introducing `best` disk tier for the best performance and cost, so you can choose the best disk for any cloud. (2434)
* Allow **2x spot jobs** to be run concurrently
* Mount storage back after cluster restart
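
The difference between `ordered` and `any_of` candidate resources can be sketched as two selection rules: `ordered` takes the first candidate that is available, while `any_of` is free to pick among all available candidates (modeled here as picking the cheapest). This is an illustrative model rather than SkyPilot's optimizer; the `Candidate` type, the function names, and the prices in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class Candidate:
    accelerator: str
    hourly_price: float  # hypothetical price, for illustration only


def pick_ordered(candidates: Sequence[Candidate],
                 available: Set[str]) -> Optional[Candidate]:
    """`ordered`: honor the user's preference order; first available wins."""
    for candidate in candidates:
        if candidate.accelerator in available:
            return candidate
    return None


def pick_any_of(candidates: Sequence[Candidate],
                available: Set[str]) -> Optional[Candidate]:
    """`any_of`: no preference order; choose the cheapest available option."""
    feasible = [c for c in candidates if c.accelerator in available]
    return min(feasible, key=lambda c: c.hourly_price, default=None)
```

Given candidates H100 (4.0), A100 (3.0), and L4 (1.0) in that order, with only A100 and L4 available, `pick_ordered` returns the A100 and `pick_any_of` returns the L4. Either way, listing more candidates enlarges the pool that failover can draw from.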

SkyServe
[SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) is a serving system on top of SkyPilot that deploys and scales any HTTP services across one or more regions or clouds, with autoscaling, load balancing, and more.

* Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (2458)
* Autoscaler: Request rate based autoscaling policy. (2868, 2878)
* Autoscaler: Support scaling to 0 when no requests (2938)
* Rolling update: Support rolling update for existing services (2935, 3057)
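
A request-rate-based autoscaling policy boils down to dividing the observed request rate by the load one replica should handle, then clamping to the allowed range. The function below is a sketch of that arithmetic, not SkyServe's autoscaler; `desired_replicas` is a hypothetical name, and the `min_replicas`/`max_replicas` bounds are assumed parameters.

```python
import math


def desired_replicas(qps: float, target_qps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Return the replica count needed to serve `qps` requests per second."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(qps / target_qps_per_replica)
    # Clamp to the configured bounds; min_replicas=0 allows scale-to-zero.
    return max(min_replicas, min(max_replicas, needed))
```

For example, 25 QPS at a target of 10 QPS per replica needs 3 replicas, and with `min_replicas=0` an idle service scales down to 0.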


**Other Enhancements**

* Environment variable support in services field (3078)
* Override task configurations with CLI arguments (2979)
* Logging improvement for replicas (2924, 2949)
* Smoke tests for SkyServe (2911)
* [Documentation](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) for SkyServe (#3022, 2794, 2864, 2894, 2922, 2989, 3182)
* UX improvements for SkyServe (2895, 2940, 2961, 3054, 3176, 3094)
* Bug fixes and robustness improvements (2811, 2822, 2860, 2995, 2983, 3058, 3075, 3226)

New LLM Recipes

* [Gemma](https://github.com/skypilot-org/skypilot/tree/master/llm/gemma): Serve your Gemma on any cloud (#3207, 3220)
* [SGLang](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang): Speed up your LLM deployments with [SGLang](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe (#3126, 3140, 3170, 3145)
* [Mixtral 8x7B](https://github.com/skypilot-org/skypilot/tree/master/llm/mixtral): Serve and scale the Mixtral 8x7B model in any region or cloud (#2857, 2888, 3017, 3067, 2882)
* [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/): Official docs for hosting Mistral 7B from mistral.ai (#2615, 2856)
* [CodeLlama](https://github.com/skypilot-org/skypilot/tree/master/llm/codellama): Hosting CodeLlama model with SkyServe and accessing it with API, chat or VSCode (#3050, 3143)
* [LoRAX](https://github.com/skypilot-org/skypilot/tree/master/llm/lorax): efficient multi-lora LLM inference (#2883)
* [axolotl](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl): a recent LLM tool for finetuning AI models, running on SkyPilot (#2784, 2789)
* [Tabby](https://github.com/skypilot-org/skypilot/tree/master/llm/tabby): Self-host coding assistant Tabby on SkyPilot (#2597, 3068)
* [vLLM](https://github.com/skypilot-org/skypilot/tree/master/llm/vllm): Serve with vLLM to expose OpenAI API for Vicuna and Mixtral (#2614, 2643, 2616, 2786, 2791, 2948, 3118)
* [TGI](https://github.com/skypilot-org/skypilot/tree/master/examples/serve/huggingface-tgi.yaml): Scale the inference engine TGI with SkyServe (#3121)

Kubernetes
Kubernetes support received a number of **New Features** and **Enhancements**.

* Multi-node support for Kubernetes (2609, 3019)
* Open ports support for Kubernetes (2588, 2713, 2997, 3200)
* Support Coreweave label for GPUs in Kubernetes (Coreweave support under development) (2650)
* Starting a Kubernetes GPU cluster locally with `sky local up` (2890)
* Custom Image Support for Kubernetes Instances (2729, 3019, 3210)
* New provisioner for Kubernetes for better performance and robustness (3019)
* Supporting Kubernetes cluster launched with k3s and Rancher (3148)

**Other Enhancements**

* Support H100 80GB in Kubernetes (2840)
* Share SSH jump pod across users to reduce resources consumption (2826)
* Allow `KUBECONFIG` env var for config file specification (3169)
* More robust Kubernetes cluster removal (3043)
* Fixes for the GPU labeller (2636, 2653)
* UX and robustness improvements (2638, 2712, 2589, 2785, 2551, 2795, 2884, 2913)
* Documentation improvements (2595, 2705, 2957, 2991, 2997, 3119)

More Clouds
SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: **VMWare vSphere**, **RunPod**, **Fluidstack** and **Cudo Compute**.
* [RunPod](https://www.runpod.io/): RunPod is a specialized AI cloud, with additional capacities for high-end GPUs. (#2980, 3018)
* [Fluidstack](https://www.fluidstack.io/): Fluidstack offers accessible GPUs for AI at low cost. (#3086, 3224)
* [Cudo Compute](https://www.cudocompute.com/): A GPU cloud that provides low-cost GPUs powered by green energy. (#2975, 3224)
* [VMWare vSphere](https://www.vmware.com/products/vsphere.html): you can now bring your own vSphere cluster to SkyPilot. ([docs](https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/vsphere.html)) (#3000)



Clouds

AWS

**New Features**

* New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (1702, 2719, 2792)
* Support for AWS Trainium accelerator (2690)
* Support null for proxy command to filter regions (2756)
* Support CUDA 12.1 with default image updates (2788)
* Job scheduling on Inferentia and Trainium (2969, 2798)
* Allow specifying security_group (3133)

**Enhancements**

* Make public / private subnet selection robust (2867)
* Avoid hanging for restarting an instance in STOPPING state (2998)
* Remove sunset instance types (2610)
* Add docs for custom VPC support (2776)

**Fixes**

* Fix conda installation on AWS default image (3206)
* Robustify the custom image support (3216)
* Fix subnet selection for AWS and autodown for spot instances (2921)
* Fix minimal permission for AWS (2978)
* Improve opening ports for AWS (2716)
* Autostop with new provisioner (2719)



GCP

**New Features**

* Security: Custom VPC support for GCP. (2764, 2772, 2854, 2944)
* Security: Support private IP with proxy jump on GCP. (2819)
* New provisioner: Adopted new provisioner for GCP with >2x faster and more robust provisioning (2681, 2719, 2943)
* Automatically use reserved instances from multiple reserved pools (2836, 2681)
* Support L4 accelerator for GCP (2724)
* Allow stopping spot clusters on GCP (2877)

**Enhancements**

* Allow stopping VM with local SSD (2587)
* Update default runtime version for TPU node (2601, 2602)
* Handling transient error during launching GCP clusters (2669)
* Update GCSFuse version to 1.3.0 for GCS storage mount (2887)
* Set TPU VM the default option for TPU accelerators (1758)
* Ignore missing gcp credentials for latest gcloud and avoid duplicating credentials (3028, 3172, 3234)

**Fixes**

* Fix custom docker image support (3218)
* Fix minimal roles required for GCP (2704)
* Robustify the catalog fetching (3141)
* Fix ports on TPU VM and cluster launched before 0.4.0 (2641)
* Fix backward compatibility issue with GCP clusters (2604)
* Fix `--disk-size` for Custom Machine Images (2718)
* Update catalog fetcher with more options (2562)
* Assign GCP VMs with service account (2972)
* Fix machine image support (3030, 3236)
* Fix error handling for failed provisioning (2852)
* Leave out TPU v5 in catalog as it is not supported (2656)
* Fix GCP minimal permission (2947, 2770, 2761)


Azure

**Enhancements**

* Make port opening more robust (2649, 2891, 3084)
* Additional arguments for Azure catalog fetcher and support H100 (2561, 2844, 2847)
* Support CUDA 12.1 with default image updates (2468)
* Support spot instances on Azure (2871)

**Fixes**

* Fix custom docker image support (3218)
* UX: Fix Azure disk tier explicitly shown in resources str (3064)
* Fix status query for Azure (3015)


SCP

* Fix SCP error raised in `sky check` (3038)


CLI & Core interfaces

**New Features**

* Multi-node jobs fail fast on a single-node failure (3081)
* Add configurations for not uploading credentials (2904)
* Adding `sky status --endpoints` CLI (3199)
* Support more characters in cluster name (3130)
* Show all regions and more accurate price in `sky show-gpus` (2583, 2892, 2933, 2946, 3083, 3149, 3113)
* Allow inferring cloud from region or zone (2632)
* Add `--commit` and `--version` for `sky` CLI (2720, 2731, 2733)

**Enhancements**

* Robustify runtime initialization on remote cluster (3132)
* Better error message for YAML parsing (3040)
* Smarter GPU name completion (3014)
* Speed up retry until up by not doing exponential backoff (2821)
* Add schema validation for config (2645)
* Allow `--disk-tier none` override (2906)
* `sky check` improvement (3174, 3212, 3160)
* Better logging for CLIs (2535, 2691, 2728, 3139, 3175)
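
The retry item above (2821) drops exponential backoff when retrying until a cluster is up. As a rough illustration of why that shortens the wait, the sketch below compares the two delay schedules. It is not SkyPilot's code: the function names, the fixed-gap alternative, and all numbers are assumptions made for the example.

```python
from typing import List


def exponential_delays(initial: float, factor: float, cap: float,
                       attempts: int) -> List[float]:
    """Wait times that grow geometrically, capped at `cap` seconds."""
    return [min(cap, initial * factor**i) for i in range(attempts)]


def fixed_delays(gap: float, attempts: int) -> List[float]:
    """Wait times that stay the same between every retry."""
    return [gap] * attempts


# Hypothetical numbers, chosen only for illustration.
exp_waits = exponential_delays(initial=5.0, factor=2.0, cap=300.0, attempts=6)
fixed_waits = fixed_delays(gap=5.0, attempts=6)

print("exponential:", exp_waits, "total:", sum(exp_waits))
print("fixed:      ", fixed_waits, "total:", sum(fixed_waits))
```

With these made-up values, six retries spend 315 seconds waiting under exponential backoff versus 30 seconds with a fixed gap, so capacity that frees up late is noticed much sooner.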

**Fixes**

* Fix permission issues for SSH config file on specific Linux distributions (3151)
* Fix `sky_logs` and mounting directory (2667, 2845)
* Fix job related commands (2662, 2767)
* Fix `sky logs` with `--sync-down` (2660)

**Deprecations**

* Deprecate `cpunode/gpunode/tpunode`, hide `admin` (2800)
* Remove deprecated `Local` cloud which is now replaced by Kubernetes support (3037, 3186)


Backend/Provisioner

**New Features**

* Support multiple candidate resources (2498, 2803, 2833, 2886, 3107)
* Support launching 100-node cluster for AWS, GCP, Kubernetes, and RunPod (3004, 3005)
* Support spaces in paths (2762)
* Support long local username with special characters (3105, 3130)
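
To make "multiple candidate resources" concrete, here is a minimal sketch of the general idea: a task lists several acceptable resource options, and the launcher takes the first one that can actually be provisioned. This is not SkyPilot's implementation or API. The `Candidate` class, the `first_available` helper, the cheapest-first ordering, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One acceptable way to run the task (hypothetical structure)."""
    cloud: str
    accelerator: str
    hourly_price: float


def first_available(
    candidates: Iterable[Candidate],
    can_provision: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try candidates from cheapest to most expensive; return the first that works."""
    for candidate in sorted(candidates, key=lambda c: c.hourly_price):
        if can_provision(candidate):
            return candidate
    return None


# Made-up prices; "azure" is pretended to be out of capacity.
candidates = [
    Candidate("aws", "A100:8", 32.0),
    Candidate("gcp", "A100:8", 29.0),
    Candidate("azure", "A100:8", 27.0),
]
out_of_capacity = {"azure"}
chosen = first_available(candidates, lambda c: c.cloud not in out_of_capacity)
print(chosen)
```

In this toy run the cheapest option is unavailable, so the next cheapest one is chosen instead of failing the launch.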


**Enhancements**

* Robustify termination of failed clusters during failover (2990)
* Improve the SSH check for clusters just provisioned (2797)
* Robustify failover to avoid terminating clusters that have user data (2977)
* Move SSH config to `~/.sky/generated/ssh` instead of directly editing `~/.ssh/config` (2706, 3069)
* Code refactoring and cleanup (2541, 2736, 3046, 2633, 2870, 2925, 3087, 3088, 3153)
* Improve usage collection (2654, 2672)
* Better explanation of failover in docs (2850, 2834)

**Fixes**

* Avoid backward compatibility issue with provisioner (2682)
* Fix cloud provisioning internal file mount cache (2715)
* Fix optimization for DAG when some resources provided are not feasible (2657)
* Fix runtime installation on remote VM (2909, 2912)
* Fix cluster termination when the cluster is not fully UP (3025)
* Fixes for tests (2651, 2976, 3023, 3166, 3167, 3202)
* Improve logging (2594, 2678, 2696, 3003)


Managed spot

**New Features**

* Allow twice as many spot jobs to run concurrently (3191, 3208)

**Enhancements**

* Better logging and UX (2630)
* Add docs for customizing spot controller (2753)
* Add spot pipeline docs (2936)

**Fixes**

* Fix private VPC support for spot jobs (2874)
* Fix `~/.sky/config.yaml` for spot jobs (2876)
* Fix OOM for long running spot jobs (2675)
* Fix AWS NoCredentialError caused by credential rotation (2695)
* Fix Azure dependency on spot controller (2875)


Storage

**New Features**

* Remount storage on clusters after they are restarted (2322, 2804)

**Enhancements**

* Clarify the syntax for external and managed storage (3162, 2804)
* Confirmation prompt for `sky storage delete`, and `--yes` flag to skip it (2726)
* Refactor and clean up storage code (2774, 2986)

**Fixes**

* Fix permission issue for S3 mounting on specific images (3215)
* Fix spaces in source path for storages (2835)
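
Bugs with spaces in paths, like the one fixed above (2835), usually come from building a shell command by plain string formatting. The sketch below shows the general failure mode and the standard-library fix with `shlex.quote`. The `rsync` command and both paths are illustrative assumptions, not the command SkyPilot actually runs.

```python
import shlex

# Hypothetical paths that contain spaces.
source = "my data/run 1"
target = "/remote/work dir"

# Naive formatting: the shell would split each path at its spaces.
naive = f"rsync -a {source} {target}"

# Quoting each path keeps it as a single argument.
quoted = f"rsync -a {shlex.quote(source)} {shlex.quote(target)}"

# shlex.split mimics how a POSIX shell would tokenize each command line.
print(shlex.split(naive))
print(shlex.split(quoted))
```

The naive command splits into seven tokens, while the quoted one keeps the expected four: the program, the flag, and the two paths.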


Dependencies

* Recommend nightly build in docs for better performance and robustness (2984)
* Automatic build for nightly Docker image (2229)
* Avoid ray dependency locally for AWS, GCP, and Kubernetes (2625, 2943, 3019)
* Remove AWS dependency by default for faster setup and fewer conflicts (2841, 2942)
* Fix GCP dependency by updating google-api-python-client (2577, 2759)
* Pin remote dependency for ray job (2659)
* Robustify dependencies (2642, 2679, 3024)

Examples

* NeMo distributed training for BERT and GPT3 (2533)
* Add docker compose example to run multiple containers (2745)
* Distributed ray train example (2828)
* Benchmark Torch DDP (2987)
* Example updates for supported models (2637, 2825)



**Full Changelog**: https://github.com/skypilot-org/skypilot/compare/v0.4.0...v0.5.0


Thanks to all contributors!
New contributors: rtalaricw, jackyk02, Vaibhav2001, rohanvaidya45, Shrinandan, manishiitg, amitkumarj441, tgaddair, aseriesof-tubes, changxiaohui, thams, kishb87, PratikKumar125, mmcclean, dtran24, davidwagnerkc, mjibril, kbrgl, msehsah1, JungleCatSW, Ying1123

Many thanks to everyone who contributed to this release!

Contributors: Michaelvll, concretevitamin, cblmemo, romilbhardwaj, MaoZiming, landscapepainter, sunny0826, suquark, Vaibhav2001, infwinston, hemildesai, asaiacai, Shrinandan, kishb87, rtalaricw, iojw, aseriesof-tubes, manishiitg, jackyk02, mmcclean, thams, amitkumarj441, rohanvaidya45, saihtaungkham, tgaddair, davidwagnerkc, PratikKumar125, dtran24, changxiaohui, mjibril, kbrgl, msehsah1, JungleCatSW, Ying1123

0.4.1

This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the new provisioner for AWS, fixing OOM and credential issues for long-running spot jobs, and some additional improvements.
