The 0.5 release of ExecuTorch accompanies the release of PyTorch 2.6, and includes various updates and improvements to ExecuTorch’s backend delegates, as well as slight improvements to the Python and C++ APIs. Most notably, dim order has been enabled in ExecuTorch export by default.
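Because dim order is now on by default, models that previously relied on the contiguous-only behavior may want to opt out at export time. The snippet below is a minimal sketch; it assumes the `_skip_dim_order` flag on `exir.EdgeCompileConfig`, so treat the exact flag name as illustrative and check it against the ExecuTorch version you are using.

```python
import torch
from executorch.exir import EdgeCompileConfig, to_edge

# Any exportable module works here; a single linear layer keeps the example small.
model = torch.nn.Linear(4, 4).eval()
exported = torch.export.export(model, (torch.randn(1, 4),))

# Dim order is now enabled by default. To fall back to the previous behavior,
# the (assumed) _skip_dim_order flag on EdgeCompileConfig can be set to True.
edge_program = to_edge(exported, compile_config=EdgeCompileConfig(_skip_dim_order=True))
```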
On the Llama model support front, an [eager runner](https://github.com/pytorch/executorch/blob/main/examples/models/llama/runner/eager.py) has been added to the Llama example to allow running inference in eager mode; additionally, support for [AttentionSink](https://arxiv.org/abs/2309.17453) has been added for eager mode execution.
API Changes
* Introduced a C++ `TensorAccessor` class for ExecuTorch tensors based on PyTorch’s [`TensorAccessor`](https://github.com/pytorch/pytorch/blob/release/2.6/aten/src/ATen/core/TensorAccessor.h) class
* Introduced a Python `save(path: str)` method on `ExecutorchProgramManager` to reduce the boilerplate code required to serialize a program to a `.pte` file (see the sketch after this list)
* Introduced the C++ `PlatformMemoryAllocator` class to allow kernel authors to provide their own memory allocation implementation
* Introduced a `num_instructions()` function on the C++ `Method` class
* Enabled direct serialization of `uint16` types in ExecuTorch programs
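As a quick illustration of the new `save()` method, here is a minimal export flow sketch; the model and output file name are placeholders, and `save()` simply replaces manually writing `executorch_program.buffer` to a file.

```python
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Standard flow: torch.export -> Edge dialect -> ExecuTorch program manager.
exported = torch.export.export(TinyModel(), (torch.randn(2, 2),))
executorch_program = to_edge(exported).to_executorch()

# New in 0.5: serialize straight to a .pte file instead of opening a file
# handle and writing executorch_program.buffer by hand.
executorch_program.save("tiny_model.pte")
```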
Build
* ExecuTorch nightly binaries are now built only for Python `3.10`, `3.11`, and `3.12`
* Introduced nightly builds for Apple platforms, which can be found listed [here](https://ossci-ios.s3.amazonaws.com/list.html)
* Added support for NumPy 2
Backends
Arm
* Added support for the following operators:
1D convolution, Tanh activation, `select`, 2D max pooling, `upsample_nearest2d`, `cat`/`stack`, `rshift`, `concat`, `log_softmax`, `var`, `layer_norm`
* Improved support for reduction operators
* Extended `softmax` to handle `dim < 0`
* Added support for `keep_dims == True` for `mean` and `var` operators
* Enabled reporting of Ethos-U PMU hardware counters in the Arm delegate executor
* Added support for multiple TOSA specification versions
* Added model evaluation functionality to the AOT compiler
Cadence
* Migrated most of the graph-level compiler from its internal Meta location to the OSS location
* The Cadence OSS flow now uses ~50 graph-level optimization passes
* Various improvements to the export workflow for Cadence chips
* Expanded operator support to include 33 ATen operators and 11 quantized operators
* Integrated multiple optimized kernels for HiFi and Fusion chips, resulting in large performance gains (double digit percent to orders of magnitude)
* Enabled `mobilenet_v2` and `resnet50` as e2e tests
CoreML
* Added the option to specify which CoreML compute unit to use in the Llama model export script
* Fixed a compilation crash on iOS <16
* Added support for dim order
Qualcomm
* Enabled batch prefill for Llama with the weight-sharing feature
* Various improvements to Llama model support for both prefill and decode, including `sha`, `static_llama` (KV cache as I/O), graph break reduction, and more
* Added example for the `wav2letter` model
* Added support for the `retinanet_fpn` model
* Added support for the SA8295 SoC
* Added support for QAT
* Added support for dim order
* Added `DrawGraph` utility for graph visualization
MediaTek
* Integrated the MediaTek backend in the Android Llama application
* Added support for dim order
MPS
* Added support for dim order
Vulkan
* Improved support for Llama model architectures in the Vulkan backend:
    * Added an implementation of a fused SDPA + KV cache update operator
    * Added an implementation of rotary embeddings
* Various improvements to compute shader latency and memory footprint, such as:
    * Introduced support for push constants in compute shaders, used to pass in tensor metadata (e.g. sizes)
    * Switched the default texture tiling setting from `VK_IMAGE_TILING_OPTIMAL` to `VK_IMAGE_TILING_LINEAR`, which greatly reduces the memory footprint of the image textures used to store tensors
    * Reduced register pressure in compute shaders by using lower-precision integer types to store texture positions and tensor indices
* Added an export pass that automatically inserts transition ops to switch between optimal/required storage types or memory layouts between operators in the export graph
XNNPACK
* Updated the XNNPACK version to commit `1ed874e65`, which includes the newest KleidiAI blockwise kernels, giving around a 20% performance improvement on Llama prefill
* Support for delegating models quantized via `torchao`’s `quantize_` API
* New XNNPACK partitioner with configurable settings that allow users greater control over how ops are partitioned
* Support for `to_edge_transform_and_lower`; leveraging this API with the partitioner provides more stable lowerings (see the sketch after this list)
* Allowed `addmm` and `mm` to call dynamic fp32 kernels
* Fixes to partitioning of unsupported operators
* Updated the `cpuinfo` dependency to resolve intermittent faults on UNISOC-based phones
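Below is a minimal sketch of lowering a small fp32 model through `to_edge_transform_and_lower` with the XNNPACK partitioner; the import path and default partitioner settings are assumptions based on the current XNNPACK backend layout, so adjust them for your ExecuTorch version.

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# A small fp32 model; XNNPACK covers common MLP/CNN ops such as linear and relu.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
exported = torch.export.export(model, (torch.randn(1, 8),))

# Lower with the configurable XNNPACK partitioner; the release notes describe
# this path as providing more stable lowerings.
executorch_program = to_edge_transform_and_lower(
    exported, partitioner=[XnnpackPartitioner()]
).to_executorch()
executorch_program.save("model_xnnpack.pte")
```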
Devtools
* Added a [public benchmark dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fexecutorch), offering insights into ExecuTorch model performance trends, commit-to-commit comparisons, and anomaly detection. Onboarded Llama3.2-1B to track performance with SpinQuant, QLoRA, and CoreML ANE.
* Added support for `uint16` in the devtools inspector
Llama Model Support
* Swapped TorchTune attention with custom export-friendly ExecuTorch attention
* Added `llama3_2_vision` text decoder as a TorchTune exportable model
* Added a [React Native LLaMA](https://github.com/pytorch/executorch/tree/release/0.5/examples/demo-apps/react-native/rnllama) app for iOS devices
* Added support for the `bfloat16` dtype in the LLM runner binary and the `export_llama` script
* Added support for [AttentionSink](https://arxiv.org/abs/2309.17453) in the Llama example
* Added TorchAO MPS low bit operators to the Llama runner
* Added support for KV cache quantization; currently only 8-bit per-token quantization is supported, with FP32 as the dequantized dtype. This can be enabled in the `export_llama` script using the `--quantize_kv_cache` option.
* Added support for quantized versions of Llama 3.2 1B/3B
Kernel Libraries
* Implemented several portable operators: `pixel_unshuffle`, `gather`, `topk`, `convolution_backward`, `narrow_copy`, `masked_select`, `max.unary_out`, `min.unary_out`, `scatter.src_out`, `scatter.value_out`, `repeat_interleave.Tensor_out`
* Implemented `tile_crop` custom operator
* Implemented scalar `trunc` primitive operator
* Implemented BFloat16 support, focusing on LLM operator coverage (`op_to_copy`, `op_mul`, `op_mm`, `op_copy`, `op_slice_scatter`, `op_scalar_tensor`, `op_where`, `op_add`, CPUBLAS gemm).
* Fixed handling of rank 0 tensors in optimized `add`/`sub`/`div`/`mul`
* Fixed `_native_batch_norm_legit_no_stats_out`
First Time Contributors
Thanks to the following contributors for making their first commit for this release!
navsud, meyering, tugsbayasgalan, Abhishek8394, RahulK4102, RdoubleA, varunchariArm, laithsakka, limintang, veselinp, MaggieMoss, azad-meta, anyj0527, jainapurva, suchir1, ru-m8, wdvr, anijain2305, tianxf99, sxu, f-meloni, Vysarat, georgehong, lg-zhang, h-friederich, AIWintermuteAI, itisgrisha, ykhrustalev, hietalajulius, Nick-Wei, Abhi-hpp, KapJI, YIWENX14, clee2000, Michiel-Olieslagers, karthik-manju, jakmro, Aleksei-grovety,
**Full Changelog**: https://github.com/pytorch/executorch/compare/v0.4.0...v0.5.0