Cloud TPUs now support the PyTorch 2.5 release via PyTorch/XLA integration. On top of the underlying improvements and bug fixes in PyTorch 2.5, this release introduces several new features and PyTorch/XLA-specific bug fixes.
Highlights
We are excited to announce the release of PyTorch/XLA 2.5! PyTorch/XLA 2.5 adds the `torch_xla.compile` function, which improves the debugging experience during development, and aligns the distributed APIs with upstream PyTorch through traceable collective support for both Dynamo and non-Dynamo cases. Starting with PyTorch/XLA 2.5, we have proposed a [clarified vision](https://github.com/pytorch/xla/issues/8000) for deprecating the older torch_xla API in favor of the existing PyTorch API, providing a simplified developer experience.
If you’ve used [vLLM](https://docs.vllm.ai/en/latest/index.html) for serving models on GPUs, you’ll now be able to seamlessly switch to its TPU backend. vLLM is a widely adopted inference framework that also serves as an excellent way to drive accelerator interoperability. With vLLM on TPU, you’ll retain the same vLLM interface you’ve grown to love, with direct integration with [Hugging Face Models](https://huggingface.co/models) to make model experimentation easy.
STABLE FEATURES
Eager
- Increase max in-flight operations to accommodate eager mode [[7263](https://github.com/pytorch/xla/pull/7263)]
- Unify the logic to check eager mode [[7709](https://github.com/pytorch/xla/pull/7709)]
- Update `eager.md` [[7710](https://github.com/pytorch/xla/pull/7710)]
- Optimize execution for ops that have multiple outputs in eager mode [[7680](https://github.com/pytorch/xla/pull/7680)]
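A minimal sketch of enabling eager mode, which the items above refine; this assumes the `torch_xla.experimental.eager_mode` toggle described in `eager.md`:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

# Enable eager mode so each op dispatches immediately instead of being
# accumulated into a lazily executed graph.
torch_xla.experimental.eager_mode(True)

device = xm.xla_device()
a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)
c = a @ b          # executed eagerly on the XLA device
print(c.cpu())
```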
Quantization / Low Precision
- Asymmetric quantized `matmul` support [[7626](https://github.com/pytorch/xla/pull/7626)]
- Add blockwise quantized dot support [[7605](https://github.com/pytorch/xla/pull/7605)]
- Support `int4` weight in quantized matmul / linear [[7235](https://github.com/pytorch/xla/pull/7235)]
- Support the `fp8e5m2` dtype [[7740](https://github.com/pytorch/xla/pull/7740)]
- Add `fp8e4m3fn` support [[7842](https://github.com/pytorch/xla/pull/7842)]
- Support dynamic activation quant for per-channel quantized matmul [[7867](https://github.com/pytorch/xla/pull/7867)]
- Enable cross entropy loss for xla autocast with FP32 precision [[8094]](https://github.com/pytorch/xla/pull/8094)
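As a rough illustration of the new `fp8` dtypes ([7740], [7842]), the sketch below casts a weight tensor to fp8 on the XLA device and upcasts it for the matmul; it treats fp8 purely as a storage dtype and is not tied to any particular quantization recipe:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
w = torch.randn(16, 16)

# Cast weights to the newly supported fp8 dtypes on the XLA device.
w_e5m2 = w.to(torch.float8_e5m2).to(device)
w_e4m3 = w.to(torch.float8_e4m3fn).to(device)

# Upcast before the matmul; fp8 is used here only for storage.
x = torch.randn(4, 16, device=device, dtype=torch.bfloat16)
y = x @ w_e5m2.to(torch.bfloat16).t()
xm.mark_step()
```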
Pallas Kernels
- Support the `ab` (attention bias) input for `flash_attention` [[7840](https://github.com/pytorch/xla/pull/7840)]; the actual kernel is implemented in [JAX](https://github.com/jax-ml/jax/blob/3e634d95304afae56e01de0145d9cb068351df3c/jax/experimental/pallas/ops/tpu/flash_attention.py#L144)
- Support the `logits_soft_cap` parameter in `paged_attention` [[7704](https://github.com/pytorch/xla/pull/7704)]; the actual kernel is implemented in [JAX](https://github.com/jax-ml/jax/blob/3e634d95304afae56e01de0145d9cb068351df3c/jax/experimental/pallas/ops/tpu/paged_attention/paged_attention_kernel.py#L136)
- Support `trace_pallas` caching for `gmm` and `tgmm` [[7921](https://github.com/pytorch/xla/pull/7921)]
- Cache flash attention tracing [[8026](https://github.com/pytorch/xla/pull/8026)]
- Improve the user guide [[7625](https://github.com/pytorch/xla/pull/7625)]
- Update pallas doc with `paged_attention` [[7591](https://github.com/pytorch/xla/pull/7591)]
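A minimal sketch of calling the Pallas `flash_attention` wrapper with an attention-bias (`ab`) input on a TPU; argument names and shape constraints should be verified against `torch_xla.experimental.custom_kernel`, and the shapes below are only illustrative:

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.custom_kernel import flash_attention

device = xm.xla_device()
batch, heads, seq, head_dim = 1, 2, 128, 128

q = torch.randn(batch, heads, seq, head_dim, device=device)
k = torch.randn(batch, heads, seq, head_dim, device=device)
v = torch.randn(batch, heads, seq, head_dim, device=device)
ab = torch.zeros(batch, heads, seq, seq, device=device)  # additive attention bias

out = flash_attention(q, k, v, ab=ab)  # the kernel itself lives in JAX/Pallas
xm.mark_step()
```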
StableHLO
- Add user guide for stablehlo composite op [[7826](https://github.com/pytorch/xla/pull/7826)]
gSPMD
- Handle the parameter wrapping for SPMD [[7604](https://github.com/pytorch/xla/pull/7604)]
- Add helper function to get 1d mesh [[7577](https://github.com/pytorch/xla/pull/7577)]
- Support manual `all-reduce` [[7576](https://github.com/pytorch/xla/pull/7576)]
- Expose `apply_backward_optimization_barrier` [[7477](https://github.com/pytorch/xla/pull/7477)]
- Support reduce-scatter in manual sharding [[7231](https://github.com/pytorch/xla/pull/7231)]
- Allow `MpDeviceLoader` to shard dictionaries of tensors [[8202](https://github.com/pytorch/xla/pull/8202)]
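To illustrate the SPMD items above, here is a minimal sketch of sharding host input through `MpDeviceLoader`; the dictionary form of `input_sharding` is the behavior described in [8202], and the toy loader is only an assumption for the example:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
import torch_xla.distributed.parallel_loader as pl

xr.use_spmd()

# One-dimensional device mesh; the 1d-mesh helper from [7577] can likely
# replace these two lines.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(list(range(num_devices)), (num_devices,), ('data',))

# Shard the batch dimension of each input across the 'data' mesh axis.
sharding = xs.ShardingSpec(mesh, ('data', None))

# A toy loader that yields dictionaries of tensors.
dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64, 128)))
loader = DataLoader(
    dataset, batch_size=16,
    collate_fn=lambda b: {'input_ids': torch.stack([x for x, _ in b]),
                          'labels': torch.stack([y for _, y in b])})

# Per [8202], input_sharding can be a dictionary keyed by the batch entries.
device_loader = pl.MpDeviceLoader(
    loader, xm.xla_device(),
    input_sharding={'input_ids': sharding, 'labels': sharding})

for batch in device_loader:
    print(batch['input_ids'].shape)
    break
```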
Dynamo
- Optimize dynamo dynamic shape caching [[7726](https://github.com/pytorch/xla/pull/7726)]
- Add support for dynamic shape in dynamo [[7676](https://github.com/pytorch/xla/pull/7676)]
- Avoid unnecessary `set_attr` in dynamo `optim_mode` [[7915](https://github.com/pytorch/xla/pull/7915)]
- Fix the crash with copy op in dynamo [[7902](https://github.com/pytorch/xla/pull/7902)]
- Optimize `_split_xla_args_tensor_sym_constant` [[7900](https://github.com/pytorch/xla/pull/7900)]
- Optimize the Dynamo RNG seed update [[7884](https://github.com/pytorch/xla/pull/7884)]
- Support `mark_dynamic` [[7812](https://github.com/pytorch/xla/pull/7812)]
- Support `gmm` as a custom op for dynamo [[7672](https://github.com/pytorch/xla/pull/7672)]
- Fix dynamo inplace copy [[7933](https://github.com/pytorch/xla/pull/7933)]
- CPU time optimization for `GraphInputMatcher` [[7895](https://github.com/pytorch/xla/pull/7895)]
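A minimal sketch of the Dynamo path these changes target, combining `torch.compile` with the `openxla` backend and the upstream `mark_dynamic` hint from [7812]; shapes are illustrative:

```python
import torch
import torch._dynamo
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(128, 64).to(device)

# Compile through Dynamo with the XLA backend.
compiled = torch.compile(model, backend='openxla')

x = torch.randn(8, 128, device=device)
torch._dynamo.mark_dynamic(x, 0)   # mark the batch dimension as dynamic
out = compiled(x)
```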
PJRT
- Improve device auto-detection [[7787](https://github.com/pytorch/xla/pull/7787)]
- Move `_xla_register_custom_call_target` implementation into `PjRtComputationClient` [[7801](https://github.com/pytorch/xla/pull/7801)]
- Handle the SPMD case inside `ComputationClient::WaitDeviceOps` [[7796](https://github.com/pytorch/xla/pull/7796)]
GKE
- Add TPU example for torchrun on GKE [[7620](https://github.com/pytorch/xla/pull/7620)]
- Add an example of using GKE with torchrun [[7589](https://github.com/pytorch/xla/pull/7589)]
Functionalization
- Add 1-layer gradient accumulation test to check aliasing [[7692](https://github.com/pytorch/xla/pull/7692)]
AMP
- Fix norm data-type when using AMP [[7878](https://github.com/pytorch/xla/pull/7878)]
BETA FEATURES
Op Lowering
- Lower `aten::_linalg_eigh` [[7674](https://github.com/pytorch/xla/pull/7674)]
- Fall back `_embedding_bag_backward` and force `sparse=False` [[7584](https://github.com/pytorch/xla/pull/7584)]
- Support trilinear by using the upstream decomposition [[7586](https://github.com/pytorch/xla/pull/7586)]
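As an example of the newly lowered op in [7674], `torch.linalg.eigh` can now run on the XLA device without falling back to CPU; a minimal sketch:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, device=device)
sym = a @ a.mT + 4 * torch.eye(4, device=device)    # symmetric positive definite

eigenvalues, eigenvectors = torch.linalg.eigh(sym)  # lowered via aten::_linalg_eigh
xm.mark_step()
```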
Higher order ops
- [Fori_loop] Update randint max range to support `bool` dtype [[7632](https://github.com/pytorch/xla/pull/7632)]
TorchBench Integration
- [benchmarks] API alignment with PyTorch profiler events [[7930](https://github.com/pytorch/xla/pull/7930)]
- [benchmarks] Add IR dump option when running torchbench [[7927](https://github.com/pytorch/xla/pull/7927)]
- [benchmarks] Use the same `matmul` precision between PyTorch and PyTorch/XLA [[7748](https://github.com/pytorch/xla/pull/7748)]
- [benchmarks] Introduce a verifier to check model output correctness against native PyTorch [[7724](https://github.com/pytorch/xla/pull/7724), [#7777](https://github.com/pytorch/xla/pull/7777)]
- [benchmarks] Fix moco model issue on XLA [[7257](https://github.com/pytorch/xla/pull/7257), [#7598](https://github.com/pytorch/xla/pull/7598)]
- Type annotation for `benchmarks/` [[7289](https://github.com/pytorch/xla/pull/7289)]
- Default to `CUDAGraphs` on for Inductor [[7749](https://github.com/pytorch/xla/pull/7749)]
GPU
- Deprecate `XRT` for `XLA:CUDA` [[8006](https://github.com/pytorch/xla/pull/8006)]
EXPERIMENTAL FEATURES
[Backward Compatibility](https://github.com/pytorch/xla/issues/8000) & APIs that will be removed in the 2.7 release:
- Deprecate APIs (deprecated → new):
| Deprecated | New | PRs |
| -------- | ------- | ------- |
| `xla_model.xrt_world_size()` | `runtime.world_size()` | [[7679](https://github.com/pytorch/xla/pull/7679)][[#7743](https://github.com/pytorch/xla/pull/7743)] |
| `xla_model.get_ordinal()` | `runtime.global_ordinal()` | [[7679](https://github.com/pytorch/xla/pull/7679)] |
| `xla_model.get_local_ordinal()` | `runtime.local_ordinal()` | [[7679](https://github.com/pytorch/xla/pull/7679)] |
- Internalize APIs
- `xla_model.parse_xla_device()` [[7675](https://github.com/pytorch/xla/pull/7675)]
- Improvements
- Automatic PJRT device detection when importing `torch_xla` [[7787](https://github.com/pytorch/xla/pull/7787)]
- Add deprecated decorator [[7703](https://github.com/pytorch/xla/pull/7703)]
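A short migration sketch for the deprecated runtime queries above, with the old calls kept as comments:

```python
import torch_xla.runtime as xr

# Before (deprecated, removal planned for the 2.7 release):
#   import torch_xla.core.xla_model as xm
#   world_size = xm.xrt_world_size()
#   rank       = xm.get_ordinal()
#   local_rank = xm.get_local_ordinal()

# After:
world_size = xr.world_size()
rank = xr.global_ordinal()
local_rank = xr.local_ordinal()
```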
Distributed
- Enable bucketized all-reduce for gradients [[7216](https://github.com/pytorch/xla/pull/7216)]
- Use reduce-scatter coalescing for FSDP [[6024](https://github.com/pytorch/xla/pull/6024)]
Distributed API
We have aligned our distributed APIs with upstream PyTorch. Previously, we implemented custom distributed APIs, such as `torch_xla.core.xla_model.all_reduce`. With traceable collective support, we now enable `torch.distributed.all_reduce` and similar functions for both Dynamo and non-Dynamo cases in `torch_xla`; a minimal sketch follows the list below.
- Support upstream distributed APIs (`torch.distributed.*`) such as `all_reduce`, `all_gather`, `reduce_scatter_tensor`, and `all_to_all`; previously we used XLA-specific distributed APIs in `xla_model` [[7860](https://github.com/pytorch/xla/pull/7860), [#7950](https://github.com/pytorch/xla/pull/7950/), [#8064](https://github.com/pytorch/xla/pull/8064)].
- Introduce `torch_xla.launch()` to launch multiple processes, unifying torchrun and `torch_xla.distributed.xla_multiprocessing.spawn()` [[7764](https://github.com/pytorch/xla/pull/7764), [#7648](https://github.com/pytorch/xla/pull/7648), [#7695](https://github.com/pytorch/xla/pull/7695)].
- `torch.distributed.reduce_scatter_tensor()`: [[7950]](https://github.com/pytorch/xla/pull/7950/)
- Register SDP lower-precision autocast [[7299](https://github.com/pytorch/xla/pull/7299)]
- Add Python binding for `xla::DotGeneral` [[7863](https://github.com/pytorch/xla/pull/7863)]
- Fix input output alias for custom inplace ops [[7822](https://github.com/pytorch/xla/pull/7822)]
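As referenced above, here is a minimal sketch of using the upstream collectives with the XLA process-group backend and `torch_xla.launch()`; the shapes and printed output are illustrative only:

```python
import torch
import torch.distributed as dist
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' process group backend

def _mp_fn(index):
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()

    t = torch.full((2, 2), float(dist.get_rank()), device=device)
    dist.all_reduce(t)      # upstream collective; traceable for Dynamo and non-Dynamo
    xm.mark_step()
    print(index, t.cpu())

if __name__ == '__main__':
    # Unifies the torchrun and xla_multiprocessing.spawn() entry points.
    torch_xla.launch(_mp_fn)
```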
`torch_xla.compile`
- Support `full_graph`, which errors out if more than one graph would be executed in the compiled region. [[7776](https://github.com/pytorch/xla/pull/7776)][[#7789](https://github.com/pytorch/xla/pull/7789)]
- Support dynamic shape detection, which prints a useful error message when the number of different graphs executed across runs exceeds a predefined limit. [[7918](https://github.com/pytorch/xla/pull/7918)]
- Support naming each compiled program, which makes debug messages more informative; a minimal usage sketch follows this list. [[7802](https://github.com/pytorch/xla/pull/7802)]
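A minimal sketch of `torch_xla.compile` using the `full_graph` and `name` options named above; other options (such as the dynamic-shape detection limit) should be checked against the API docs:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(16, 4).to(device)

def step(x):
    return model(x).sum()

# full_graph=True errors out if the region would execute more than one graph;
# name= labels the compiled program in debug output.
compiled_step = torch_xla.compile(step, full_graph=True, name='step')

loss = compiled_step(torch.randn(8, 16, device=device))
```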
Usability & Debuggability
- Wheel name change to support pip>=24.1: [[issue7697](https://github.com/pytorch/xla/issues/7697)]
- Add `tpu-info` as a dependency of `torch_xla[tpu]` and test: [[7938](https://github.com/pytorch/xla/pull/7938)][[#7337](https://github.com/pytorch/xla/pull/7337)]
- Support `torch_xla.manual_seed`: [[7340](https://github.com/pytorch/xla/pull/7340)]
- Support callbacks on tensors when async execution finishes [[7984](https://github.com/pytorch/xla/pull/7984)]
- Implement `torch.ops._c10d_functional.broadcast`: [[7770](https://github.com/pytorch/xla/pull/7770)]
- The `XLA_USE_BF16` and `XLA_DOWNCAST_BF16` flags will be removed in the 2.6 release [[7582](https://github.com/pytorch/xla/pull/7582)][[#7945](https://github.com/pytorch/xla/pull/7945)]
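A minimal sketch of the new seeding helper from [7340]; `torch_xla.manual_seed` seeds the RNG used for random ops on the XLA device:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

torch_xla.manual_seed(42)                      # seed the XLA device RNG
x = torch.randn(2, 2, device=xm.xla_device())
print(x.cpu())
```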
AWS Neuron
- Update Neuron initializations [[7952](https://github.com/pytorch/xla/pull/7952)]
- Pass `local_world_size` into `neuron.initialize_env` [[7852](https://github.com/pytorch/xla/pull/7852)]
- Update and short circuit the Neuron initialization [[8041](https://github.com/pytorch/xla/pull/8041)]
- Introduce multi-node SPMD support for Neuron [[8224](https://github.com/pytorch/xla/pull/8224)]