Release Highlights<a id="release-highlights"></a>
The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).
- [Data] Ray Data becomes generally available, with stability improvements in streaming execution, reading and writing data, better task concurrency control, and improved debuggability via dashboard, logging, and metrics visualizations.
- [RLlib] **"New API Stack"** officially announced as alpha for PPO and SAC.
- [Serve] Added a default autoscaling policy set via `num_replicas="auto"` ([42613](https://github.com/ray-project/ray/issues/42613)).
- [Serve] Added support for active load shedding via `max_queued_requests` ([42950](https://github.com/ray-project/ray/issues/42950)).
- [Serve] Added replica queue length caching to the DeploymentHandle scheduler ([42943](https://github.com/ray-project/ray/pull/42943)).
- This should reduce overhead in the Serve proxy and handles.
- `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced ([42947](https://github.com/ray-project/ray/issues/42947)).
- If you see any issues, please report them on GitHub. You can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
- `max_concurrent_queries` -> `max_ongoing_requests`
- `target_num_ongoing_requests_per_replica` -> `target_ongoing_requests`
- `downscale_smoothing_factor` -> `downscaling_factor`
- `upscale_smoothing_factor` -> `upscaling_factor`
- [Core] [Autoscaler v2](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#kuberay-autoscaler-v2) is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
- [Train] Added support for accelerator types via `ScalingConfig(accelerator_type)`.
- [Train] Revamped the `XGBoostTrainer` and `LightGBMTrainer` so they no longer depend on `xgboost_ray` and `lightgbm_ray`. A new, more flexible API will be introduced in a future release.
- [Train/Tune] Refactored local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`.
Ray Libraries<a id="ray-libraries"></a>
Ray Data<a id="ray-data"></a>
🎉 New Features:
- Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (43026, 43171, 43298, 43299, 42930, 42504)
- Metadata read stability improvements to avoid transient AWS errors, including retrying on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (42044, 43216, 42922, 42759)
- Allow task concurrency control for read, map, and write APIs (42849, 43113, 43177, 42637); a sketch after this list shows this together with `num_rows_per_file`
- Data dashboard and statistics improvements, with more runtime metrics for each component (43790, 43628, 43241, 43477, 43110, 43112)
- Allow specifying application-level errors to retry for actor tasks (42492)
- Add `num_rows_per_file` parameter to file-based writes (42694)
- Add `DataIterator.materialize` (43210)
- Skip schema call in `DataIterator.to_tf` if `tf.TypeSpec` is provided (42917)
- Add an append option to `Dataset.write_bigquery` (42584)
- Deprecate legacy components and classes (43575, 43178, 43347, 43349, 43342, 43341, 42936, 43144, 43022, 43023)
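A minimal sketch of the new task concurrency and file-size controls; the paths are placeholders and a running Ray cluster is assumed:

```python
import ray

# Cap the number of concurrent read tasks.
ds = ray.data.read_parquet("s3://example-bucket/input/", concurrency=8)

# Cap the number of concurrent map tasks.
ds = ds.map_batches(lambda batch: batch, concurrency=4)

# Bound how many rows land in each output file.
ds.write_parquet("/tmp/output/", num_rows_per_file=100_000)
```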
💫 Enhancements:
- Restructure stdout logging for better readability (43360)
- Add a more performant way to read large TFRecord datasets (42277)
- Modify `ImageDatasource` to use `Image.BILINEAR` as the default image resampling filter (43484)
- Reduce internal stack trace output by default (43251)
- Perform incremental writes to Parquet files (43563)
- Warn on excessive driver memory usage during shuffle ops (42574)
- Distributed reads for `ray.data.from_huggingface` (42599); see the sketch after this list
- Remove `Stage` class and related usages (42685)
- Improve stability of reading JSON files to avoid PyArrow errors (42558, 42357)
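A minimal sketch of the Hugging Face read path, assuming the `datasets` package is installed (the dataset name is only an example):

```python
import ray
from datasets import load_dataset

# Load a Hugging Face dataset, then convert it to a Ray Dataset;
# reads can now be distributed across the cluster.
hf_ds = load_dataset("imdb", split="train")
ds = ray.data.from_huggingface(hf_ds)
print(ds.schema())
```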
🔨 Fixes:
- Turn off actor locality by default (44124)
- Normalize block types before internal multi-block operations (43764)
- Fix memory metrics for `OutputSplitter` (43740)
- Fix race condition issue in `OpBufferQueue` (43015)
- Fix early stop for multiple `Limit` operators (42958)
- Fix deadlocks in `Dataset.streaming_split` that caused jobs to hang (42601)
📖 Documentation:
- Revamp Ray Data documentation for GA (44006, 44007, 44008, 44098, 44168, 44093, 44105)
Ray Train<a id="ray-train"></a>
🎉 New Features:
- Add support for accelerator types via `ScalingConfig(accelerator_type)` for improved worker scheduling (43090); see the sketch below
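A minimal sketch of requesting an accelerator type, assuming a cluster with A100 nodes and a user-defined `train_func`:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # training logic runs on each worker

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,
        use_gpu=True,
        accelerator_type="A100",  # schedule workers only on nodes with A100 GPUs
    ),
)
result = trainer.fit()
```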
💫 Enhancements:
- Add a backend-specific context manager for `train_func` for setup/teardown logic (43209)
- Remove `DEFAULT_NCCL_SOCKET_IFNAME` to simplify network configuration (42808)
- Colocate the Trainer with the rank 0 worker to improve scheduling behavior (43115)
🔨 Fixes:
- Enable scheduling workers with `memory` resource requirements (42999)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (42037)
- [Lightning] Fix resuming from checkpoint when using `RayFSDPStrategy` (43594)
- [Lightning] Fix deadlock in `RayTrainReportCallback` (42751)
- [Transformers] Fix checkpoint reporting behavior when `get_latest_checkpoint` returns None (42953)
📖 Documentation:
- Enhance docstring and user guides for `train_loop_config` (43691)
- Clarify in `ray.train.report` docstring that it is not a barrier (42422)
- Improve documentation for `prepare_data_loader` shuffle behavior and `set_epoch` (41807)
🏗 Architecture refactoring:
- Simplify XGBoost and LightGBM Trainer integrations: `XGBoostTrainer` and `LightGBMTrainer` are now implemented as `DataParallelTrainer`s, removing the dependency on `xgboost_ray` and `lightgbm_ray`. (42111, 42767, 43244, 43424)
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the userβs home directory (`~/ray_results`). (43369, 43403, 43689)
- Split the overloaded `ray.train.torch.get_device` into a separate `get_devices` API for multi-GPU worker setups (42314); see the sketch after this list
- Refactor restoration configuration to be centered around `storage_path` (42853, 43179)
- Deprecations related to `SyncConfig` (42909)
- Remove deprecated `preprocessor` argument from Trainers (43146, 43234)
- Hard-deprecate `MosaicTrainer` and remove `SklearnTrainer` (42814)
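A minimal sketch of the split device APIs; both calls must run inside a `train_func` executing on a worker:

```python
from ray.train.torch import get_device, get_devices

def train_func():
    device = get_device()    # the primary device for this worker
    devices = get_devices()  # all devices assigned to this worker (multi-GPU setups)
```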
Ray Tune<a id="ray-tune"></a>
💫 Enhancements:
- Increase the minimum number of allowed pending trials for faster cluster scale-up (43455)
- Add support to `TBXLogger` for logging images (37822)
- Improve validation of `Experiment(config)` to handle RLlib `AlgorithmConfig` (42816, 42116)
🔨 Fixes:
- Fix `reuse_actors` error on actor cleanup for function trainables (42951)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (42037)
📖 Documentation:
- Minor documentation fixes (42118, 41982)
🏗 Architecture refactoring:
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the userβs home directory (`~/ray_results`). (43369, 43403, 43689)
- Deprecations related to `SyncConfig` and `chdir_to_trial_dir` (42909)
- Refactor restoration configuration to be centered around `storage_path` (42853, 43179)
- Add back `NevergradSearch` (42305)
- Clean up invalid `checkpoint_dir` and `reporter` deprecation notices (42698)
Ray Serve<a id="ray-serve"></a>
🎉 New Features:
- Added support for active load shedding via `max_queued_requests` ([42950](https://github.com/ray-project/ray/issues/42950)).
- Added a default autoscaling policy set via `num_replicas="auto"` ([42613](https://github.com/ray-project/ray/issues/42613)). See the sketch below.
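A minimal sketch combining both features (the deployment body is a placeholder):

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",      # opt into the default autoscaling policy
    max_queued_requests=100,  # reject requests once 100 are queued (load shedding)
)
class MyModel:
    def __call__(self, request) -> str:
        return "ok"

app = MyModel.bind()
```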
API Changes:
- Renamed the following parameters. Each of the old names will be supported for another release before removal. See the sketch after this list.
- `max_concurrent_queries` to `max_ongoing_requests`
- `target_num_ongoing_requests_per_replica` to `target_ongoing_requests`
- `downscale_smoothing_factor` to `downscaling_factor`
- `upscale_smoothing_factor` to `upscaling_factor`
- **WARNING**: the following default values will change in Ray 2.11:
- Default for `max_ongoing_requests` will change from 100 to 5.
- Default for `target_ongoing_requests` will change from 1 to 2.
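A minimal sketch using the new parameter names; the values shown mirror the upcoming Ray 2.11 defaults noted above, and the autoscaling bounds are arbitrary examples:

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # formerly max_concurrent_queries
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,  # formerly target_num_ongoing_requests_per_replica
        "upscaling_factor": 1.0,       # formerly upscale_smoothing_factor
        "downscaling_factor": 0.5,     # formerly downscale_smoothing_factor
    },
)
class Echo:
    def __call__(self, request) -> str:
        return "hello"
```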
💫 Enhancements:
- Add the `RAY_SERVE_LOG_ENCODING` environment variable to set the global logging behavior for Serve ([42781](https://github.com/ray-project/ray/pull/42781)).
- Configure Serve's gRPC proxy to allow large payloads ([43114](https://github.com/ray-project/ray/pull/43114)).
- Add a `blocking` flag to `serve.run()` ([43227](https://github.com/ray-project/ray/pull/43227)); see the sketch after this list.
- Add actor ID and worker ID to Serve structured logs ([43725](https://github.com/ray-project/ray/pull/43725)).
- Added replica queue length caching to the DeploymentHandle scheduler ([42943](https://github.com/ray-project/ray/pull/42943)).
- This should reduce overhead in the Serve proxy and handles.
- `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced ([42947](https://github.com/ray-project/ray/issues/42947)).
- If you see any issues, please report them on GitHub. You can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- Autoscaling metrics (tracking ongoing and queued metrics) are now collected at deployment handles by default instead of at the Serve replicas ([42578](https://github.com/ray-project/ray/pull/42578)).
- This means you can now set `max_ongoing_requests=1` for autoscaling deployments and still upscale properly, because requests queued at handles are properly taken into account for autoscaling.
- You should expect deployments to upscale more aggressively during bursty traffic, because requests will likely queue up at handles during bursts of traffic.
- If you see any issues, please report them on GitHub. You can switch back to the old method of collecting metrics by setting the environment variable `RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0`.
- Improved the downscaling behavior of `smoothing_factor` for low numbers of replicas ([42612](https://github.com/ray-project/ray/issues/42612)).
- Various logging improvements ([43707](https://github.com/ray-project/ray/pull/43707), [#43708](https://github.com/ray-project/ray/pull/43708), [#43629](https://github.com/ray-project/ray/pull/43629), [#43557](https://github.com/ray-project/ray/pull/43557)).
- During in-place upgrades or when replicas become unhealthy, Serve will no longer wait for old replicas to gracefully terminate before starting new ones ([43187](https://github.com/ray-project/ray/pull/43187)). New replicas will be eagerly started to satisfy the target number of healthy replicas.
- This new behavior is on by default and can be turned off by setting `RAY_SERVE_EAGERLY_START_REPLACEMENT_REPLICAS=0`.
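A minimal sketch of the new `blocking` flag together with `RAY_SERVE_LOG_ENCODING`; the `JSON` value and the trivial deployment are assumptions for illustration:

```python
import os
from ray import serve

# Must be set before Serve starts; "JSON" requests structured JSON logs.
os.environ["RAY_SERVE_LOG_ENCODING"] = "JSON"

@serve.deployment
class Hello:
    def __call__(self, request) -> str:
        return "hello"

# blocking=True keeps the call running until interrupted instead of returning.
serve.run(Hello.bind(), blocking=True)
```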
🔨 Fixes:
- Fix the deployment route prefix being overridden by the default route prefix from the `serve run` CLI ([43805](https://github.com/ray-project/ray/pull/43805)).
- Fixed a bug causing batch methods to hang upon cancellation ([42593](https://github.com/ray-project/ray/issues/42593)).
- Unpinned FastAPI dependency version ([42711](https://github.com/ray-project/ray/issues/42711)).
- Delay proxy marking itself as healthy until it has routes from the controller ([43076](https://github.com/ray-project/ray/issues/43076)).
- Fixed an issue where multiplexed deployments could go into infinite backoff ([43965](https://github.com/ray-project/ray/issues/43965)).
- Silence noisy `KeyError` on disconnects ([43713](https://github.com/ray-project/ray/pull/43713)).
- Fixed a bug where Prometheus counter metrics were emitted as gauges ([43795](https://github.com/ray-project/ray/pull/43795), [#43901](https://github.com/ray-project/ray/pull/43901)).
- All Serve counter metrics are now emitted as counters with a `_total` suffix. The old gauge metrics are still emitted for compatibility.
📖 Documentation:
- Update Serve logging config docs ([43483](https://github.com/ray-project/ray/pull/43483)).
- Added documentation for `max_replicas_per_node` ([42743](https://github.com/ray-project/ray/pull/42743)).
RLlib<a id="rllib"></a>
🎉 New Features:
- The **"new API stack"** is now in alpha stage and available for **single-agent PPO** (42272), **multi-agent PPO**, and **single-agent SAC** ([42571](https://github.com/ray-project/ray/pull/42571), [#42570](https://github.com/ray-project/ray/pull/42570), [#42568](https://github.com/ray-project/ray/pull/42568)); see the opt-in sketch after this list
- **ConnectorV2 API** ([43669](https://github.com/ray-project/ray/pull/43669), [#43680](https://github.com/ray-project/ray/pull/43680), [#43040](https://github.com/ray-project/ray/pull/43040), [#41074](https://github.com/ray-project/ray/pull/41074), [#41212](https://github.com/ray-project/ray/pull/41212))
- **Episode APIs** (SingleAgentEpisode and MultiAgentEpisode) ([42009](https://github.com/ray-project/ray/pull/42009), [#43275](https://github.com/ray-project/ray/pull/43275), [#42296](https://github.com/ray-project/ray/pull/42296), [#43818](https://github.com/ray-project/ray/pull/43818), [#41631](https://github.com/ray-project/ray/pull/41631))
- **EnvRunner APIs** (SingleAgentEnvRunner and MultiAgentEnvRunner) ([41558](https://github.com/ray-project/ray/pull/41558), [#41825](https://github.com/ray-project/ray/pull/41825), [#42296](https://github.com/ray-project/ray/pull/42296), [#43779](https://github.com/ray-project/ray/pull/43779))
- In preparation for **DQN** on the new API stack: `PrioritizedEpisodeReplayBuffer` ([43258](https://github.com/ray-project/ray/pull/43258), [#42832](https://github.com/ray-project/ray/pull/42832))
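A minimal sketch of opting a PPO config into the alpha stack, assuming the `_enable_new_api_stack` experimental flag as the opt-in mechanism:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .experimental(_enable_new_api_stack=True)  # alpha opt-in flag
    .environment("CartPole-v1")
)
algo = config.build()
print(algo.train())
```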
💫 Enhancements:
- **Old API Stack cleanups:**
- Move `SampleBatch` column names (e.g. `SampleBatch.OBS`) into a new class (`Columns`). ([43665](https://github.com/ray-project/ray/pull/43665))
- Remove old `exec_plan` API code. ([41585](https://github.com/ray-project/ray/pull/41585))
- Introduce `OldAPIStack` decorator ([43657](https://github.com/ray-project/ray/pull/43657))
- **RLModule API:** Add functionality to define kernel and bias initializers via config. ([42137](https://github.com/ray-project/ray/pull/42137))
- **Learner/LearnerGroup APIs**:
- Replace Learner/LearnerGroup specific config classes (e.g. `LearnerHyperparameters`) with `AlgorithmConfig`. ([41296](https://github.com/ray-project/ray/pull/41296))
- Learner/LearnerGroup: Allow updating from Episodes. ([41235](https://github.com/ray-project/ray/pull/41235))
- In preparation for **DQN** on the new API stack: ([43199](https://github.com/ray-project/ray/pull/43199), [#43196](https://github.com/ray-project/ray/pull/43196))
🔨 Fixes:
- New API stack bug fixes: fix `policy_to_train` logic ([41529](https://github.com/ray-project/ray/pull/41529)), fix multi-GPU for PPO on the new API stack ([#44001](https://github.com/ray-project/ray/pull/44001)), issue 40347 ([#42090](https://github.com/ray-project/ray/pull/42090))
- Other fixes: `MultiAgentEnv` did not call `env.close()` on a failed sub-env ([43664](https://github.com/ray-project/ray/pull/43664)), issue 42152 ([#43317](https://github.com/ray-project/ray/pull/43317)), issue 42396 ([#43316](https://github.com/ray-project/ray/pull/43316)), issue 41518 ([#42011](https://github.com/ray-project/ray/pull/42011)), issue 42385 ([#43313](https://github.com/ray-project/ray/pull/43313))
📖 Documentation:
- New API Stack examples: Self-play and league-based self-play ([43276](https://github.com/ray-project/ray/pull/43276)), MeanStdFilter (for both single-agent and multi-agent) ([#43274](https://github.com/ray-project/ray/pull/43274)), Prev-actions/prev-rewards for multi-agent ([#43491](https://github.com/ray-project/ray/pull/43491))
- Other docs fixes and enhancements: ([43438](https://github.com/ray-project/ray/pull/43438), [#41472](https://github.com/ray-project/ray/pull/41472), [#42117](https://github.com/ray-project/ray/pull/42177), [#43458](https://github.com/ray-project/ray/pull/43458))
Ray Core and Ray Clusters<a id="ray-core-and-ray-clusters"></a>
Ray Core<a id="ray-core"></a>
🎉 New Features:
- [Autoscaler v2](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#kuberay-autoscaler-v2) is in alpha and can be tried out with KubeRay.
- Introduced [subreaper](https://docs.ray.io/en/master/ray-core/user-spawn-processes.html) to prevent leaks of sub-processes created by user code. (#42992)
💫 Enhancements:
- The Ray state API's `get_task()` now accepts an `ObjectRef` (43507); see the sketch after this list
- Add an option to disable task tracing for tasks and actors (42431)
- Improved object transfer throughput (43434)
- Ray Client now compares the Ray and Python versions for compatibility with the remote Ray cluster (42760)
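A minimal sketch of the `ObjectRef` overload; `get_task` previously required a task ID string:

```python
import ray
from ray.util.state import get_task

@ray.remote
def f() -> int:
    return 1

ref = f.remote()
ray.get(ref)
print(get_task(ref))  # an ObjectRef is now accepted in place of a task ID
```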
🔨 Fixes:
- Fixed several bugs in the streaming generator (43775, 43772, 43413)
- Fixed a bug where Ray counter metrics were emitted as gauges (43795)
- Fixed a bug where tasks with empty resources did not work with placement groups (43448)
- Fixed a bug where CPU resources were not released for a blocked worker inside a placement group (43270)
- Fixed GCS crashes when the placement group commit phase failed due to node failure (43405)
- Fixed a bug where the Ray memory monitor prematurely killed tasks (43071)
- Fixed a placement group resource leak (42942)
- Upgraded cloudpickle to 3.0, which fixes an incompatibility with dataclasses (42730)
📖 Documentation:
- Updated the docs for Ray accelerator support (41849)
Ray Clusters<a id="ray-clusters"></a>
💫 Enhancements:
- [Spark] Add a `heap_memory` param to the `setup_ray_cluster` API, and change the default per-worker-node and head-node configs for global Ray clusters (42604)
- [Spark] Add global mode for Ray-on-Spark clusters (41153)
🔨 Fixes:
- [vSphere] Only deploy the OVF to the first host of the cluster (42258)
Thanks
Many thanks to all those who contributed to this release!
ronyw7, xsqian, justinvyu, matthewdeng, sven1977, thomasdesr, veryhannibal, klebster2, can-anyscale, simran-2797, stephanie-wang, simonsays1980, kouroshHakha, Zandew, akshay-anyscale, matschaffer-roblox, WeichenXu123, matthew29tang, vitsai, Hank0626, anmyachev, kira-lin, ericl, zcin, sihanwang41, peytondmurray, raulchen, aslonnie, ruisearch42, vszal, pcmoritz, rickyyx, chrislevn, brycehuang30, alexeykudinkin, vonsago, shrekris-anyscale, andrewsykim, c21, mattip, hongchaodeng, dabauxi, fishbone, scottjlee, justina777, surenyufuz, robertnishihara, nikitavemuri, Yard1, huchen2021, shomilj, architkulkarni, liuxsh9, Jocn2020, liuyang-my, rkooo567, alanwguo, KPostOffice, woshiyyya, n30111, edoakes, y-abe, martinbomio, jiwq, arunppsg, ArturNiederfahrenhorst, kevin85421, khluu, JingChen23, masariello, angelinalg, jjyao, omatthew98, jonathan-anyscale, sjoshi6, gaborgsomogyi, rynewang, ratnopamc, chris-ray-zhang, ijrsvt, scottsun94, raychen911, franklsf95, GeneDer, madhuri-rai07, scv119, bveeramani, anyscalesam, zen-xu, npuichigo