We're excited to announce the Beta release of ExecuTorch! This release includes many new features, improvements, and bug fixes.
## API Stability and Runtime Compatibility Guarantees
Starting with this release, ExecuTorch's Python and C++ APIs will follow the [API Lifecycle and Deprecation Policy](https://pytorch.org/executorch/0.4/api-life-cycle.html), and the `.pte` file format will comply with the [Runtime Compatibility Policy](https://github.com/pytorch/executorch/blob/release/0.4/runtime/COMPATIBILITY.md).
## New Features
- Introduced the `exir.to_edge_transform_and_lower` API, which combines the functionality of `to_edge`, `transform`, and `to_backend` in a single call (see the sketch after this list)
  - Allows users to prevent specific op decompositions when lowering to backends that implement those ops
- Increased operator coverage for ExecuTorch’s portable library
- Added new experimental APIs:
  - LLM runner C++ APIs such as `prefill_image()`, `prefill_prompt()`, and `generate_from_pos()`, with multimodal support
  - `executorch.runtime` Python module for loading `.pte` files and running them with the underlying C++ runtime (see the runtime sketch after this list)
- Added a new [Tensor API](https://pytorch.org/executorch/0.4/extension-tensor.html) to bundle a tensor's dynamic data and metadata within a single Tensor object
- Improved the [Module API](https://pytorch.org/executorch/0.4/extension-module.html) to allow sharing an ExecuTorch Program across several Modules, and added APIs to set inputs/outputs before execution
- Added `find_package(executorch)` for projects to easily link to ExecuTorch’s prebuilt library in CMake
- Introduced reproducible [benchmarking infrastructure](https://github.com/pytorch/executorch/blob/release/0.4/extension/benchmark/README.md?plain=1) to measure, debug, and track performance, enabling on-demand and automated nightly benchmarking of models and backend delegates on modern smartphones
- Added new benchmarking apps to measure model performance on [iOS/macOS](https://github.com/pytorch/executorch/blob/release/0.4/extension/apple/Benchmark/README.md) and [Android](https://github.com/pytorch/executorch/blob/release/0.4/extension/android/benchmark/README.md)
- Added support for TikToken v5 vision tokenizer
- Improved parallelization for LLM prefill
- Added experimental capabilities for on-device training, along with an [example prototype](https://github.com/pytorch/executorch/pull/5233/) for LLM finetuning
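As a quick illustration of the new export path, here is a minimal sketch of `exir.to_edge_transform_and_lower`. The toy model, example inputs, and the choice of the XNNPACK partitioner are illustrative assumptions, not prescribed by this release:

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


# A toy model standing in for a real network (illustrative assumption).
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Export, then transform and lower to a backend in one call. Passing a
# partitioner lets the API avoid decomposing ops that the backend
# implements natively.
exported = torch.export.export(model, example_inputs)
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```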
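And a sketch of the experimental `executorch.runtime` module running that `.pte` from Python; the file name and input shape carry over from the sketch above, and the API may change since it is experimental:

```python
import torch

from executorch.runtime import Runtime

# Load the .pte produced above and execute it with the underlying C++
# runtime through the experimental Python bindings.
runtime = Runtime.get()
program = runtime.load_program("tiny_model.pte")
method = program.load_method("forward")
outputs = method.execute([torch.randn(1, 16)])
print(outputs)
```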
## Supported Models
- Added support for the following models:
  - Llama 3 models, including Llama 3 8B, 3.1 8B, and 3.2 1B/3B
  - [Multimodal] LLaVA (Large Language and Vision Assistant)
  - Phi-3-mini
  - Gemma 2B
- Added Llama 3, 3.1, and 3.2 to the [Android Llama Demo app](https://github.com/pytorch/executorch/blob/release/0.4/examples/demo-apps/android/LlamaDemo/README.md?plain=1)
- Added LLaVA multimodal support to the iOS iLLaMA and Android Llama Demo apps
## Hardware Acceleration
- Delegate framework
  - Allowed delegates to [consume buffer mutations](https://github.com/pytorch/executorch/pull/4830)
- **[New]** MediaTek
  - Added support for a new MediaTek backend
  - Enabled Llama 3 acceleration on MediaTek's NPU
  - Added export scripts and runners for OSS models
- Core ML
  - Added Llama support with in-place KV cache, a fused SDPA kernel, and 4-bit per-block quantization
  - Added primitive support for dynamic shapes that works without `torch._check`
  - Expanded operator coverage to over 100 ops
  - Enabled stateful runtime execution
- MPS
  - Added support for 4-bit linear kernels (iOS 18 only)
  - Enabled Llama 2 7B and Llama 3 8B
- Qualcomm (Qualcomm Neural Network, QNN)
  - Enabled Llama 3 8B with a 4-bit linear kernel, SpinQuant, fused RMSNorm from QNN 2.25, and model sharding
  - Added support for the AI Hub model format
- ARM
  - Added new operators:
    - `addmm`, `avg_pool2d`, `batch_norm`, `bmm`, `clone`/`cat`, `div`, `exp`, `full`, `hardtanh`, `log`, `mean_dim`, `mul`, `permute`, `relu`, `sigmoid`, `slice`, `softmax`, `sub`, `unsqueeze`, `view`, plus `conv2d` improvements
  - Added/enabled lowering passes to improve network compatibility
  - Improved quantization support
    - Made quantization accuracy improvements for all models
    - Added quantization coverage for all available ops
  - Improved channels-last support by reducing overhead and the number of conversions
  - Added performance measurements on the Corstone-300 FVP for Ethos-U55
  - Moved to a new compilation flow in Vela for better performance and compatibility
  - Improved code documentation for third-party contributors
- XNNPACK
  - Enhanced XNNPACK backend performance
  - Added support for new Llama models and other quantized LLMs on Android/iOS devices, including Llama 3 8B, 3.1 8B, and 3.2 1B/3B
  - Introduced a major partitioner refactor to improve UX and stability
  - Improved model [coverage](https://github.com/pytorch/executorch/blob/release/0.4/examples/xnnpack/__init__.py#L16) to ensure better stability
- Vulkan
  - Optimized latency of the Vulkan convolution and matrix multiplication compute shaders through various algorithmic improvements
  - Added a quantizer for 8-bit weight-only quantization
  - Expanded operator coverage to 63 ops
  - Added 4-bit and 8-bit weight-quantized linear kernels
  - Added support for view tensors in the Vulkan graph runtime, allowing no-copy permutes, squeezes/unsqueezes, etc.
  - Added support for symbolic integers in the Vulkan graph runtime
  - Integrated with the ExecuTorch SDK to track compute shader latencies
- Cadence
  - Added an x86 executor to sanity-check and numerically verify models locally
  - Added multiple supported end-to-end models, such as wav2vec2
  - Integrated low-level optimizations, resulting in 10x+ performance improvements
  - Migrated more graph-level optimizations to the open source repository
  - Enabled more types in the CadenceQuantizer and moved to an int8 default for better performance
## Developer Experience
- Introduced an API to enable intermediate output logging in delegates (see the sketch after this list)
- Improved CMake build system and reduced reliance on Buck2
- Added override options for fallback PAL implementations via the CMake flag `-DEXECUTORCH_PAL_DEFAULT`
- Changes to DimOrder (please see [this issue](https://github.com/pytorch/executorch/issues/6330) for current progress and next steps)
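As a rough illustration of consuming the intermediate outputs logged by a delegate, here is a sketch using the devtools `Inspector`. The ETDump/ETRecord file names are assumptions, and producing them (a profiled run plus an export-time ETRecord) is not shown here:

```python
from executorch.devtools import Inspector

# Assumes an ETDump collected from a run with intermediate output
# logging enabled in the delegate, plus an ETRecord generated at
# export time (both paths are illustrative).
inspector = Inspector(
    etdump_path="etdump.etdp",
    etrecord="etrecord.bin",
)

# Print per-event data in tabular form, including logged
# intermediate outputs from delegates.
inspector.print_data_tabular()
```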
## Bug Fixes
- Fixed various issues related to quantization, tensor operations, and backend integrations
- Resolved memory allocation and management issues
- Fixed compatibility issues with different Python and dependency versions
- Fixed [bundled program and plan_execute in pybindings](https://github.com/pytorch/executorch/pull/4595)
## Breaking Changes
- Updated the minimum C++ version to C++17 for the core runtime
- Removed all C++ headers under `//executorch/util` (see `extension/runner_util/inputs.h` for a `PrepareInputTensors` replacement)
  - Users are now expected to provide their own replacement for the `read_file.h` functionality
- Renamed instances of `sdk` to `devtools` for file names, function names, and CMake options
## Deprecation
- Added new annotations and decorators for API lifecycle and deprecation management
  - The new `ET_EXPERIMENTAL` annotation marks C++ APIs that may change without notice
  - The new `deprecated` and `experimental` Python decorators mark non-stable APIs (see the sketch after this list)
- Names under the `torch::` namespace are deprecated in favor of names under the `executorch::` namespace. Please migrate code to the new namespace and avoid adding new references to the `torch::` namespace
- Constant buffers are no longer stored inside the `.pte` flatbuffer; going forward, they are stored in a separate segment attached to the `.pte` file
- All C++ macros beginning with underscores, such as `__ET_UNUSED`, are deprecated in favor of unprefixed names such as `ET_UNUSED`
- `capture_pre_autograd_graph()` is deprecated in favor of the new `torch.export.export_for_training()` API (see the migration sketch after this list)
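A minimal sketch of the new Python decorators. The import path points at an internal module and should be treated as an assumption; the warning messages are illustrative:

```python
# The module path below is an assumption based on the source tree;
# these decorators are internal-facing and may move.
from executorch.exir._warnings import deprecated, experimental


@deprecated("Use new_api() instead.")
def old_api() -> None:
    """Calling this emits a deprecation warning."""


@experimental("This API may change without notice.")
def unstable_api() -> None:
    """Calling this emits an experimental-API warning."""
```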
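And a before/after sketch of the export migration; the toy model and example inputs are assumptions:

```python
import torch

model = torch.nn.Linear(8, 2).eval()
example_inputs = (torch.randn(1, 8),)

# Before (deprecated):
# gm = torch._export.capture_pre_autograd_graph(model, example_inputs)

# After: export_for_training returns an ExportedProgram whose
# .module() fills the same role, e.g. in the PT2E quantization flow.
ep = torch.export.export_for_training(model, example_inputs)
gm = ep.module()
```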
Thanks to the following open source contributors for their work on this release!
[denisVieriu97](https://github.com/denisVieriu97), [Erik-Lundell](https://github.com/Erik-Lundell), [Esteb37](https://github.com/Esteb37), [SaoirseARM](https://github.com/SaoirseARM), [benkli01](https://github.com/benkli01), [bigfootjon](https://github.com/bigfootjon), [chuntl](https://github.com/chuntl), [cymbalrush](https://github.com/cymbalrush), [derekxu](https://github.com/derekxu), [dulinriley](https://github.com/dulinriley), [freddan80](https://github.com/freddan80), [haowhsu-quic](https://github.com/haowhsu-quic), [namanahuja](https://github.com/namanahuja), [neuropilot-captain](https://github.com/neuropilot-captain), [oscarandersson8218](https://github.com/oscarandersson8218), [per](https://github.com/per), [python3kgae](https://github.com/python3kgae), [r-barnes](https://github.com/r-barnes), [robell](https://github.com/robell), [salykova](https://github.com/salykova), [shewu-quic](https://github.com/shewu-quic), [tom-arm](https://github.com/tom-arm), [winskuo-quic](https://github.com/winskuo-quic), [zingo](https://github.com/zingo)
**Full Changelog**: https://github.com/pytorch/executorch/compare/v0.3.0...v0.4.0