Highlights
- Ray Train is now in beta! If you are using Ray Train, we’d love to hear your feedback [here](https://docs.google.com/forms/d/e/1FAIpQLSfI3asn-m1cQSIbdrk_cd6qYenZvt-eNTVfTwba3SVhmHcHIg/viewform)!
- Ray Docker images for multiple CUDA versions are now provided (19505)! You can specify a `-cuXXX` suffix to pick a specific version.
- `ray-ml:cpu` images are now deprecated. The `ray-ml` images are only built for GPU.
- Ray Datasets now supports groupby and aggregations! See the [groupby API](https://docs.ray.io/en/master/data/package-ref.html#ray.data.Dataset.groupby) and [GroupedDataset](https://docs.ray.io/en/master/data/package-ref.html#groupeddataset-api) docs for usage.
- We continue to improve Ray stability and usability on Windows. We encourage you to try it out and report feedback or issues at https://github.com/ray-project/ray/issues.
- We are launching a Ray Job Submission server + CLI & SDK clients to make it easier to submit and monitor Ray applications when you don’t want an active connection using Ray Client. This is currently in alpha, so the APIs are subject to change, but please test it out and file issues / leave feedback on GitHub & discuss.ray.io!
Ray Autoscaler
💫Enhancements:
- Graceful termination of Ray nodes prior to autoscaler scale down (20013)
- Ray Clusters on AWS are colocated in one Availability Zone to reduce costs & latency (19051)
Ray Client
🔨 Fixes:
- ray.put on a list of objects now returns a single object ref (19737)
Ray Core
🎉 New Features:
- Support remote file storage for runtime_env (20280, 19315)
- Added Ray job submission client, CLI, and REST API (19567, 19657, 19765, 19845, 19851, 19843, 19860, 19995, 20094, 20164, 20170, 20192, 20204)
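The remote file storage support for runtime_env can be sketched as follows. This is a minimal illustration rather than definitive usage: it assumes `working_dir` accepts a URI that points at a remotely stored zip archive, and the bucket name, archive name, and the `make_runtime_env` and `connect` helpers are hypothetical names introduced only for this example.

```python
# Hypothetical bucket and archive name, used only for illustration.
REMOTE_WORKING_DIR = "s3://example-bucket/my_project.zip"


def make_runtime_env(working_dir_uri: str) -> dict:
    """Build a runtime_env dict whose working_dir is a remote archive URI."""
    return {"working_dir": working_dir_uri}


runtime_env = make_runtime_env(REMOTE_WORKING_DIR)


def connect(address: str = "auto"):
    """Connect to a running cluster using the runtime_env built above.

    Ray is imported lazily so that this sketch can be loaded without a
    Ray installation; calling connect() does require Ray and a cluster.
    """
    import ray

    return ray.init(address=address, runtime_env=runtime_env)
```

Calling `connect()` on a machine with access to a cluster would ask the workers to fetch the archive themselves, so the project does not have to be uploaded from the local machine.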
💫Enhancements:
- Garbage collection for runtime_env (20009, 20072)
- Improved logging and error messages for runtime_env (19897, 19888, 18893)
🔨 Fixes:
- Fix runtime_env hanging issues (19823)
- Fix specifying runtime env in ray.remote decorator with Ray Client (19626)
- Threaded actor / core worker / named actor race condition fixes (19751, 19598, 20178, 20126)
📖Documentation:
- New page “Handling Dependencies”
- New page “Ray Job Submission: Going from your laptop to production”
Ray Java
API Changes:
- Fully supported namespace APIs. See the [namespaces documentation](https://docs.ray.io/en/latest/namespaces.html) for more information. (19468, 19986, 20057)
- Removed global named actor APIs and global placement group APIs. (20219, 20135)
- Added timeout parameter for `Ray.get()` API. (20282)
Note:
- Use `Ray.getActor(name, namespace)` API to get a named actor between jobs instead of `Ray.getGlobalActor(name)`.
- Use `PlacementGroup.getPlacementGroup(name, namespace)` API to get a placement group between jobs instead of `PlacementGroup.getGlobalPlacementGroup(name)`.
Ray Datasets
🎉 New Features:
- Added groupby and aggregations (19435, 19673, 20010, 20035, 20044, 20074)
- Support custom write paths (19347)
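The new groupby and aggregation support can be sketched as below. This assumes the API shape at the time of this release, where `groupby` on a simple dataset takes a key function and the grouped dataset exposes `count()`; the helper names `expected_counts` and `ray_counts` are hypothetical. The pure-Python function describes the counts the grouped result should correspond to, and the Ray version is only defined, not run, so the block loads without a Ray installation.

```python
from collections import Counter


def expected_counts(n: int, k: int) -> dict:
    """Pure-Python reference: how many of 0..n-1 fall in each class mod k."""
    return dict(Counter(i % k for i in range(n)))


def ray_counts(n: int, k: int):
    """Group the integers 0..n-1 by remainder mod k and count each group.

    Requires Ray; the import is lazy so the sketch loads without it.
    """
    import ray

    ds = ray.data.range(n)
    # Group records by a key function, then count the records per group.
    return ds.groupby(lambda x: x % k).count().take()
```

With Ray available, `ray_counts(100, 3)` should report the same per-key counts as `expected_counts(100, 3)`; see the GroupedDataset docs for the exact record format.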
🔨 Fixes:
- Support custom CSV write options (19378)
🏗 Architecture refactoring:
- Optimized block compaction (19681)
Ray Workflow
🎉 New Features:
- Workflow now supports events (19239)
- Allow users to specify metadata for workflows and steps (19372)
- Allow a step to run in place if the resources match (19928)
🔨 Fixes:
- Fix the s3 path issue (20115)
RLlib
🏗 Architecture refactoring:
- “framework=tf2” + “eager_tracing=True” is now (almost) as fast as “framework=tf”. A check for tf2.x eager re-traces has been added making sure re-tracing does not happen outside the initial function calls. All CI learning tests (CartPole, Pendulum, FrozenLake) are now also run as framework=tf2. (19273, 19981, 20109)
- Prepare deprecation of `build_trainer`/`build_(tf_)?policy` utility functions. Instead, use sub-classing of `Trainer` or `Torch|TFPolicy`. POCs done for `PGTrainer`, `PPO[TF|Torch]Policy`. (20055, 20061)
- V-trace (APPO & IMPALA): Not dropping the last timestep can now optionally be switched on. The default is still to drop it, but this may change in a future release. (19601)
- Upgrade to gym 0.21. (19535)
🔨 Fixes:
- Minor bugs/issues fixes and enhancements: 19069, 19276, 19306, 19408, 19544, 19623, 19627, 19652, 19693, 19805, 19807, 19809, 19881, 19934, 19945, 20095, 20128, 20134, 20144, 20217, 20283, 20366, 20387
📖Documentation:
- RLlib main page (“RLlib in 60sec”) overhaul. (20215, 20248, 20225, 19932, 19982)
- Major docstring cleanups in preparation for complete overhaul of API reference pages. (19784, 19783, 19808, 19759, 19829, 19758, 19830)
- Other documentation enhancements. (19908, 19672, 20390)
Tune
💫Enhancements:
- Refactored and improved experiment analysis (20197, 20181)
- Refactored cloud checkpointing API/SyncConfig (20155, 20418, 19632, 19641, 19638, 19880, 19589, 19553, 20045, 20283)
- Remove magic results (e.g. config) before calculating trial result metrics (19583)
- Removal of tech debt (19773, 19960, 19472, 17654)
- Improve testing (20016, 20031, 20263, 20210, 19730)
- Various enhancements (19496, 20211)
🔨Fixes:
- Documentation fixes (20130, 19791)
- Tutorial fixes (20065, 19999)
- Drop 0 value keys from PGF (20279)
- Fix shim error message for scheduler (19642)
- Avoid looping through _live_trials twice in _get_next_trial. (19596)
- Clean up legacy branch in update_avail_resources. (20071)
- Fix Train/Tune integration on Client (20351)
Train
Ray Train is now in Beta! The beta version includes various usability improvements for distributed PyTorch training and checkpoint management, support for [Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html), and an [integration with Ray Datasets](https://docs.ray.io/en/master/train/user_guide.html#distributed-data-ingest-ray-datasets) for distributed data ingest.
Check out the docs [here](https://docs.ray.io/en/latest/train/train.html), and the migration guide from Ray SGD to Ray Train [here](https://docs.ray.io/en/latest/train/migration-guide.html). If you are using Ray Train, we’d love to hear your feedback [here](https://docs.google.com/forms/d/e/1FAIpQLSfI3asn-m1cQSIbdrk_cd6qYenZvt-eNTVfTwba3SVhmHcHIg/viewform)!
🎉 New Features:
- New `train.torch.prepare_model(...)` and `train.torch.prepare_data_loader(...)` [API](https://docs.ray.io/en/master/train/user_guide.html#update-training-function) to automatically handle preparing your PyTorch model and DataLoader for distributed training (20254).
- Checkpoint management and support for custom checkpoint strategies (19111).
  - Easily [configure](https://docs.ray.io/en/master/train/user_guide.html#configuring-checkpoints) what and how many checkpoints to save to disk.
- Support for [Ray Client](https://docs.ray.io/en/master/cluster/ray-client.html) (20123, 20351).
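The `prepare_model` and `prepare_data_loader` utilities listed above might be used roughly as follows. This is a sketch under assumptions, not the reference usage: it assumes the beta `Trainer` API of this release with the `"torch"` backend, and the model, synthetic data, `TRAIN_CONFIG`, and `run_training` helper are hypothetical. Imports are done lazily so the block loads without Ray or PyTorch installed; running it requires both.

```python
# Hypothetical configuration passed to every training worker.
TRAIN_CONFIG = {"epochs": 2}


def train_func(config=None):
    """Per-worker training loop using the Ray Train PyTorch utilities."""
    import torch
    import ray.train.torch  # noqa: F401  (makes train.torch available)
    from ray import train

    epochs = (config or {}).get("epochs", 1)

    # Wrap the model for distributed training and move it to the right device.
    model = train.torch.prepare_model(torch.nn.Linear(4, 1))

    # Synthetic regression data, for illustration only.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(64, 4), torch.randn(64, 1)
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=16)
    # Let Ray Train adapt the DataLoader for distributed training.
    loader = train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for features, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()
            optimizer.step()
    return loss.item()


def run_training(num_workers: int = 2):
    """Start a Trainer, run train_func on each worker, then shut down."""
    from ray.train import Trainer

    trainer = Trainer(backend="torch", num_workers=num_workers)
    trainer.start()
    try:
        return trainer.run(train_func, config=TRAIN_CONFIG)
    finally:
        trainer.shutdown()
```

The training function contains no explicit DDP wrapping, sampler setup, or device placement; in this sketch those steps are left to the two `prepare_*` calls.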
💫Enhancements:
- Simplify workflow for training with a single worker (19814).
- [Ray Placement Groups](https://docs.ray.io/en/master/placement-group.html) are used for scheduling the training workers (20091).
  - `PACK` strategy is used by default but can be changed by setting the `TRAIN_ENABLE_WORKER_SPREAD` environment variable.
- Automatically unwrap Torch DDP model and convert to CPU when saving a model as checkpoint (20333).
🔨Fixes:
- Fix `HorovodBackend` to automatically detect NICs. Thanks tgaddair! (19533)
📖Documentation:
- Denote public facing APIs with beta stability (20378)
- Doc updates (20271)
Serve
We would love to hear from you! Fill out the [Ray Serve survey here](https://forms.gle/zg4gDS84z8wTpKBLA).
🎉 New Features:
- New `checkpoint_path` configuration allows Serve to save its internal state to external storage (disk, S3, and GCS) and [recover upon failure](https://docs.ray.io/en/master/serve/deployment.html#failure-recovery). (19166, 19998, 20104)
- [Replica autoscaling](https://docs.ray.io/en/master/serve/core-apis.html#autoscaling) is ready for testing! (19559, 19520)
- Native [Pipeline API for model composition](https://docs.ray.io/en/master/serve/pipeline.html) is ready for testing as well!
🔨Fixes:
- Serve deployment functions or classes can take no parameters (19708)
- Replica slow start message is improved. You can now see whether it is slow to allocate resources or slow to run constructor. (19431)
- `pip install ray[serve]` will now install `ray[default]` as well. (19570)
🏗 Architecture refactoring:
- The terms “backend” and “endpoint” are officially deprecated in favor of “deployment”. (20229, 20085, 20040, 20020, 19997, 19947, 19923, 19798)
- Progress towards Java API compatibility (19463).
Dashboard
- Ray Dashboard is now enabled on Windows! (19575)
Thanks
Many thanks to all those who contributed to this release!
krfricke, stefanbschneider, ericl, nikitavemuri, qicosmos, worldveil, triciasfu, AmeerHajAli, javi-redondo, architkulkarni, pdames, clay4444, mGalarnyk, liuyang-my, matthewdeng, suquark, rkooo567, mwtian, chenk008, dependabot[bot], iycheng, jiaodong, scv119, oscarknagg, Rohan138, stephanie-wang, Zyiqin-Miranda, ijrsvt, roireshef, tkaymak, simon-mo, ashione, jovany-wang, zenoengine, tgaddair, 11rohans, amogkam, zhisbug, lchu-ibm, shrekris-anyscale, pcmoritz, yiranwang52, mattip, sven1977, Yard1, DmitriGekhtman, ckw017, WangTaoTheTonic, wuisawesome, kcpevey, kfstorm, rhamnett, renos, TeoZosa, SongGuyang, clarkzinzow, avnishn, iasoon, gjoliver, jjyao, xwjiang2010, dmatrix, edoakes, czgdp1807, heng2j, sungho-joo, lixin-wei