* [Ray Jobs API](https://docs.ray.io/en/releases-2.2.0/cluster/running-applications/job-submission/index.html#ray-jobs-api) is now GA. The Ray Jobs API allows you to submit locally developed applications to a remote Ray Cluster for execution. It simplifies the experience of packaging, deploying, and managing a Ray application (see the sketch after this list).
* [Ray Dashboard](https://docs.ray.io/en/releases-2.2.0/ray-core/ray-dashboard.html#ray-dashboard) has received a number of improvements, such as the ability to view CPU flame graphs of your Ray workers and new metrics for memory usage.
* The [Out-Of-Memory (OOM) Monitor](https://docs.ray.io/en/releases-2.2.0/ray-core/scheduling/ray-oom-prevention.html) is now enabled by default. This will increase the stability of memory-intensive applications on top of Ray.
* [Ray Data] We’ve heard from numerous users that Ray Data can run into out-of-memory or performance issues when files are too large. In this release, we’re enabling [dynamic block splitting](https://docs.ray.io/en/releases-2.2.0/data/dataset-internals.html#execution-memory) by default, which addresses these issues by avoiding holding too much data in memory.
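A minimal sketch of the GA Jobs API workflow; the dashboard address, script name, and `./my_app` directory are illustrative, not part of the release:

```python
# Minimal sketch: submit a locally developed script to a remote Ray Cluster
# through the Ray Jobs API. The address and paths below are illustrative.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # head node's dashboard address
job_id = client.submit_job(
    entrypoint="python script.py",
    runtime_env={"working_dir": "./my_app"},  # packages local code for the cluster
)
print(client.get_job_status(job_id))
```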
Ray Libraries
Ray AIR
🎉 New Features:
* Add a NumPy-first path for Torch and TensorFlow Predictors (28917)
💫Enhancements:
* Suppress "NumPy array is not writable" error in torch conversion (29808)
* Add node rank and local world size info to session (29919)
🔨 Fixes:
* Fix MLflow database integrity error (29794)
* Fix ResourceChangingScheduler dropping PlacementGroupFactory args (30304)
* Fix bug passing 'raise' to FailureConfig (30814)
* Fix reserved CPU warning if no CPUs are used (30598)
📖Documentation:
* Fix examples and docs to specify batch_format in BatchMapper (30438)
🏗 Architecture refactoring:
* Deprecate Wandb mixin (29828)
* Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (30365)
Ray Data Processing
🎉 New Features:
* Support all PyArrow versions released by Apache Arrow (29993, 29999)
* Add `select_columns()` to select a subset of columns (29081); see the sketch after this list
* Add `write_tfrecords()` to write TFRecord files (29448)
* Support MongoDB data source (28550)
* Enable dynamic block splitting by default (30284)
* Add `from_torch()` to create dataset from Torch dataset (29588)
* Add `from_tf()` to create dataset from TensorFlow dataset (29591)
* Allow setting `batch_size` in `BatchMapper` (29193)
* Support read/write from/to local node file system (29565)
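A short sketch of a few of the new Dataset APIs above; the column names, data, and output path are illustrative, and the `BatchMapper` keyword arguments assume the 2.2 signature `BatchMapper(fn, batch_format=..., batch_size=...)`:

```python
import ray
from ray.data.preprocessors import BatchMapper

# Build a small in-memory dataset and keep only a subset of its columns.
ds = ray.data.from_items([{"a": i, "b": str(i)} for i in range(100)])
ds = ds.select_columns(["a"])

# Write the dataset out as TFRecord files.
ds.write_tfrecords("/tmp/my_tfrecords")

# BatchMapper now accepts an explicit batch_size (and requires batch_format;
# see the fix notes below).
def add_one(batch):
    batch["a"] = batch["a"] + 1
    return batch

mapper = BatchMapper(add_one, batch_format="pandas", batch_size=32)
```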
💫Enhancements:
* Add `include_paths` in `read_images()` to return the image file path (30007); see the sketch after this list
* Print out Dataset statistics automatically after execution (29876)
* Cast tensor extension type to opaque object dtype in `to_pandas()` and `to_dask()` (29417)
* Encode number of dimensions in variable-shaped tensor extension type (29281)
* Fuse AllToAllStage and OneToOneStage with compatible remote args (29561)
* Change `read_tfrecords()` output from Pandas to Arrow format (30390)
* Handle all Ray errors in task compute strategy (30696)
* Allow nested Chain preprocessors (29706)
* Warn user if missing columns and support `str` exclude in `Concatenator` (29443)
* Raise ValueError if preprocessor column doesn't exist (29643)
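A sketch of `include_paths` in `read_images()`; the directory is illustrative, and the name of the extra column is our assumption:

```python
import ray

# Read images together with their source file paths.
ds = ray.data.read_images("/tmp/images", include_paths=True)
# Each record now carries the originating file path (assumed to surface as a
# "path" column) alongside the decoded image data.
```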
🔨 Fixes:
* Support custom resource with remote args for `random_shuffle()` (29276)
* Support custom resource with remote args for `random_shuffle_each_window()` (29482)
* Add PublicAPI annotation to preprocessors (29434)
* Tensor extension column concatenation fixes (29479)
* Fix `iter_batches()` to not return empty batches (29638)
* Change `map_batches()` to fetch input blocks on-demand (29289)
* Change `take_all()` to not accept a limit argument (29746)
* Convert between block and batch correctly for `map_groups()` (30172)
* Fix `stats()` call causing Dataset schema to be unset (29635)
* Raise error when `batch_format` is not specified for `BatchMapper` (30366)
* Fix ndarray representation of single-element ragged tensor slices (30514)
📖Documentation:
* Improve `map_batches()` documentation about execution model and UDF pickle-ability requirement (29233)
* Improve `to_tf()` docstring (29464)
Ray Train
🎉 New Features:
* Added MosaicTrainer (29237, 29620, 29919)
💫Enhancements:
* Fail fast upon single worker failure (29927)
* Optimize checkpoint conversion logic (29785)
🔨 Fixes:
* Propagate DatasetContext to training workers (29192)
* Show correct error message on training failure (29908)
* Fix prepare_data_loader with enable_reproducibility (30266)
* Fix usage of NCCL_BLOCKING_WAIT (29562)
📖Documentation:
* Deduplicate Train examples (29667)
🏗 Architecture refactoring:
* Hard deprecate train.report (29613)
* Remove deprecated Train modules (29960)
* Deprecate old prepare_model DDP args (30364)
Ray Tune
🎉 New Features:
* Make `Tuner.restore` work with relative experiment paths (30363)
* Allow `Tuner.restore` from a local directory that has been moved (29920); see the sketch after this list
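A minimal restore sketch; the experiment path is illustrative and may now be relative or point to a directory that has been moved:

```python
from ray import tune

# Restore a previously run experiment from its local directory and resume it.
tuner = tune.Tuner.restore("./ray_results/my_experiment")
results = tuner.fit()
```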
💫Enhancements:
* `with_resources` takes in a `ScalingConfig` (30259); see the sketch after this list
* Keep resource specifications when nesting `with_resources` in `with_parameters` (29740)
* Add `trial_name_creator` and `trial_dirname_creator` to `TuneConfig` (30123)
* Add option to not override the working directory (29258)
* Only convert a `BaseTrainer` to `Trainable` once in the Tuner (30355)
* Dynamically identify PyTorch Lightning Callback hooks (30045)
* Make `remote_checkpoint_dir` work with query strings (30125)
* Make cloud checkpointing retry configurable (30111)
* Sync experiment-checkpoints more often (30187)
* Update generate_id algorithm (29900)
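A sketch of passing a `ScalingConfig` to `with_resources`; `train_fn` is an illustrative trainable, not part of the release:

```python
from ray import tune
from ray.air import ScalingConfig

def train_fn(config):
    ...  # illustrative trainable

# with_resources now accepts a ScalingConfig in addition to resource dicts
# and PlacementGroupFactory objects.
trainable = tune.with_resources(
    train_fn, resources=ScalingConfig(num_workers=2, use_gpu=False)
)
tuner = tune.Tuner(trainable)
```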
🔨 Fixes:
* Catch SyncerCallback failure with dead node (29438)
* Do not warn in BayesOpt w/ Uniform sampler (30350)
* Fix `ResourceChangingScheduler` dropping `PlacementGroupFactory` args (30304)
* Fix Jupyter output with Ray Client and `Tuner` (29956)
* Fix tests related to `TUNE_ORIG_WORKING_DIR` env variable (30134)
📖Documentation:
* Add user guide for analyzing results (using `ResultGrid` and `Result`) (29072)
* Tune checkpointing and Tuner restore docfix (29411)
* Fix and clean up PBT examples (29060)
* Fix TrialTerminationReporter in docs (29254)
🏗 Architecture refactoring:
* Remove hard deprecated SyncClient/Syncer (30253)
* Deprecate Wandb mixin, move to `setup_wandb()` function (29828)
Ray Serve
🎉 New Features:
* Guard for high latency requests (29534)
* Java API Support ([blog](http://anyscale-staging.herokuapp.com/blog/flexible-cross-language-distributed-model-inference-framework-ray-serve-with))
💫Enhancements:
* Serve K8s HA benchmarking (30278)
* Add method info for http metrics (29918)
🔨 Fixes:
* Fix log format error (28760)
* Inherit previous deployment num_replicas (29686)
* Polish serve run deploy message (29897)
* Remove usage of `get_event_loop`, which is deprecated in Python 3.10
RLlib
🎉 New Features:
* Fault-tolerant, elastic WorkerSets: An asynchronous Ray Actor manager class is now used inside all of RLlib’s Algorithms, adding fully flexible fault tolerance to rollout workers and workers used for evaluation. If one or more workers (which are Ray actors) fail, e.g. due to a spot instance going down, the RLlib Algorithm will now flexibly wait it out and periodically try to recreate the failed workers. In the meantime, only the remaining healthy workers are used for sampling and evaluation. (29938, 30118, 30334, 30252, 29703, 30183, 30327, 29953)
💫Enhancements:
* RLlib CLI: A new and enhanced RLlib command line interface (CLI) has been added, allowing for automatic downloading of example configuration files, Python-based config files (defining an AlgorithmConfig object to use), better interoperability between training and evaluation runs, and more. For a detailed overview of what has changed, check out the [new CLI documentation](https://docs.ray.io/en/releases-2.2.0/rllib/rllib-cli.html). (29204, 29459, 30526, 29661, 29972)
* Checkpoint overhaul: Algorithm checkpoints and Policy checkpoints are now more cohesive and transparent. All checkpoints are now characterized by a directory (with files and maybe sub-directories), rather than a single pickle file. Both Algorithm and Policy classes now have a utility static method (`from_checkpoint()`) for directly instantiating instances from a checkpoint directory without knowing the original configuration used or any other information (having the checkpoint is sufficient); see the sketch after this list. [For a detailed overview, see here](https://docs.ray.io/en/releases-2.2.0/rllib/rllib-saving-and-loading-algos-and-policies.html). (28812, 29772, 29370, 29520, 29328)
* A new metric for APPO/IMPALA/PPO has been added that measures off-policy-ness: the difference between the number of gradient updates the sampling policy has received so far and the number the trained policy has received so far. (29983)
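A minimal sketch of the new checkpoint workflow; the checkpoint path is illustrative:

```python
from ray.rllib.algorithms.algorithm import Algorithm

# Re-instantiate an Algorithm straight from a checkpoint directory, without
# needing the original AlgorithmConfig or any other information.
algo = Algorithm.from_checkpoint("/tmp/ppo_checkpoint_dir")
print(algo.train())
```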
🏗 Architecture refactoring:
* AlgorithmConfig classes: All of RLlib’s Algorithms, RolloutWorkers, and other important classes now use AlgorithmConfig objects under the hood, instead of Python config dicts. It is no longer recommended (though still supported) to create a new algorithm (or a Tune+RLlib experiment) using a Python dict as configuration; see the sketch after this list. For more details on how to convert your scripts to the [new AlgorithmConfig design, see here](https://docs.ray.io/en/releases-2.2.0/rllib/rllib-training.html#configuring-rllib-algorithms). (29796, 30020, 29700, 29799, 30096, 29395, 29755, 30053, 29974, 29854, 29546, 30042, 29544, 30079, 30486, 30361)
* Major progress was made on the new Connector API, which can be enabled (tentatively) via the `config.rollouts(enable_connectors=True)` flag. It will be fully supported, across all of RLlib’s algorithms, in Ray 2.3. (30307, 30434, 30459, 30308, 30332, 30320, 30383, 30457, 30446, 30024, 29064, 30398, 29385, 30481, 30241, 30285, 30423, 30288, 30313, 30220, 30159)
* Progress was made on the upcoming RLModule/RLTrainer/RLOptimizer APIs. (30135, 29600, 29599, 29449, 29642)
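A sketch of the AlgorithmConfig pattern that replaces config dicts, using PPO as an example; the environment and hyperparameter values are illustrative:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Build an Algorithm from a typed, chainable config object instead of a dict.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .training(lr=5e-5)
)
algo = config.build()
print(algo.train())
```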
🔨 Fixes:
* Various bug fixes: 25925, 30279, 30478, 30461, 29867, 30099, 30185, 29222, 29227, 29494, 30257, 29798, 30176, 29648, 30331
📖Documentation:
* [RLlib CLI](https://docs.ray.io/en/releases-2.2.0/rllib/rllib-cli.html), [Checkpoint overhaul](https://docs.ray.io/en/releases-2.2.0/rllib/rllib-saving-and-loading-algos-and-policies.html), [AlgorithmConfigs](https://docs.ray.io/en/releases-2.2.0/rllib/rllib-training.html#configuring-rllib-algorithms)
* Minor fixes: 29261, 29752
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
* [Out-of-memory monitor](https://docs.ray.io/en/releases-2.2.0/ray-core/scheduling/ray-oom-prevention.html) is now Beta and is enabled by default.
💫Enhancements:
* The [Ray Jobs API](https://docs.ray.io/en/releases-2.2.0/cluster/running-applications/job-submission/index.html#ray-jobs-api) has graduated from Beta to GA. This means Ray Jobs will maintain API backward compatibility.
* Run Ray job entrypoint commands (“driver scripts”) on worker nodes by specifying `entrypoint_num_cpus`, `entrypoint_num_gpus`, or `entrypoint_resources` (28564, 28203); see the sketch after this list
* (Beta) OpenAPI spec for Ray Jobs REST API (30417)
* Improved the Ray health-checking mechanism to reduce how often the GCS mistakenly marks raylets as failed when it is overloaded. (29346, 29442, 29389, 29924)
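A sketch of reserving entrypoint resources so the driver script can be scheduled onto a worker node; the address and values are illustrative:

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
client.submit_job(
    entrypoint="python train.py",
    entrypoint_num_cpus=1,  # schedule the driver on a node with 1 free CPU
)
```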
🔨 Fixes:
* Various fixes for hanging / deadlocking (29491, 29763, 30371, 30425)
* Set OMP_NUM_THREADS to the `num_cpus` required by the task/actor by default (30496)
* Do not reuse workers by default if a GPU is involved (30061)
📖Documentation:
* General improvements of Ray Core docs, including design patterns and tasks.
Ray Clusters
💫Enhancements:
* Stability improvements for Ray Autoscaler / KubeRay Operator integration. (29933, 30281, 30502)
Dashboard
🎉 New Features:
* Additional improvements to the default metrics dashboard: there are now actor, placement group, and per-component memory usage breakdowns. [See the docs for details](https://docs.ray.io/en/releases-2.2.0/ray-observability/ray-metrics.html#system-metrics).
* New profiling feature using py-spy under the hood. You can click buttons to see stack traces or CPU flame graphs of your workers.
* Autoscaler and job events are available from the dashboard. You can also access the same data using `ray list cluster-events`.
🔨 Fixes:
* Stability improvements to the dashboard
* The dashboard now works on large-scale clusters! It has been tested with 250 nodes and 10K+ actors (which matches the [Ray scalability envelope](https://github.com/ray-project/ray/tree/master/release/benchmarks)).
* Smarter API-fetching logic. When polling for new data, we now wait for the previous request to finish before sending a new one.
* Fix [agent memory leak](https://github.com/ray-project/ray/issues/29199) and high CPU usage.
💫Enhancements:
* General improvements to the progress bar. You can now see progress bars for each task name if you drill into the job details.
* More metadata is available in the jobs and actors tables.
* There is now a feedback button embedded into the dashboard. Please submit any bug reports or suggestions!
Many thanks to all those who contributed to this release!
shrekris-anyscale, rickyyx, scottjlee, shogohida, liuyang-my, matthewdeng, wjrforcyber, linusbiostat, clarkzinzow, justinvyu, zygi, christy, amogkam, cool-RR, jiaodong, EvgeniiTitov, jjyao, ilee300a, jianoaix, rkooo567, mattip, maxpumperla, ericl, cadedaniel, bveeramani, rueian, stephanie-wang, lcipolina, bparaj, JoonHong-Kim, avnishn, tomsunelite, larrylian, alanwguo, VishDev12, c21, dmatrix, xwjiang2010, thomasdesr, tiangolo, sokratisvas, heyitsmui, scv119, pcmoritz, bhavika, yzs981130, andraxin, Chong-Li, clarng, acxz, ckw017, krfricke, kouroshHakha, sijieamoy, iycheng, gjoliver, peytondmurray, xcharleslin, DmitriGekhtman, andreichalapco, vitrioil, architkulkarni, simon-mo, ArturNiederfahrenhorst, sihanwang41, pabloem, sven1977, avivhaber, wuisawesome, jovany-wang, Yard1