onednn Changelog

3.6

Performance Optimizations

Intel Architecture Processors

* Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
* Improved performance of group normalization primitive.
* Improved `bf16` matmul performance with `int4` compressed weights on processors with Intel AMX instruction set support.
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
* Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
* Improved performance of the following subgraphs with Graph API:
- `convolution` and `binary` operation fusions with better layout selection in Graph API.
- `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX instruction set support.
- Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.

Intel Graphics Products

* Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
* Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
* Introduced support for Intel Arc Graphics for future Intel Core Ultra processor (code name Arrow Lake-H).
* Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
* Improved `int8` convolution performance with weight zero-points.
* Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
* Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
- `fp8`, `convolution`, and `unary` or `binary` on the Intel Data Center GPU Max Series.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.

AArch64-based Processors

* Improved `fp32` convolution backpropagation performance on processors with SVE support.
* Improved reorder performance for blocked format on processors with SVE support.
* Improved `bf16` softmax performance on processors with SVE support.
* Improved batch normalization performance on processors with SVE support.
* Improved matmul performance on processors with SVE support.
* Improved `fp16` convolution with Arm Compute Library (ACL).
* Improved matmul performance with ACL.
* Switched matmul and convolution implementation with ACL to stateless API significantly improving primitive creation time and increasing caching efficiency and performance for these operators.

Functionality

* Introduced [generic GPU] support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based implementations.
* Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
* Introduces support for stochastic rounding for `fp8` data type functionality.
* **[experimental]** Extended [microkernel API]:
- Introduced `int8` quantization support.
- Extended transform microkernel with transposition support and support for arbitrary strides.
- Introduced verbose diagnostics support.
* **[experimental]** Extended [sparse API]:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
* Introduced `int8` support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs.
* Graph API:
- Introduced `GroupNorm` operation and fusions in Graph API.
- Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.

[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse

Usability

* Added [examples][Graph API examples] for SDPA, MQA, and GQA patterns implementation with Graph API.
* Added [an example][deconvolution example] for deconvolution primitive.
* Added examples for [Vanilla RNN][Vanilla RNN example] and [LBR GRU][LBR GRU example] RNN cells.
* Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
* Introduced interoperability with [SYCL Graph] record/replay mode.
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
* **[experimental]** Introduced [logging mechanism][spdlog] based on spdlog library.
* Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
* Improved performance of `get_partitions()` function in Graph API.

[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging

Validation

* Introduced protection from out-of-memory scenarios in benchdnn Graph API driver.

Deprecated Functionality

* Experimental [Graph Compiler] is deprecated and will be removed in future releases.

[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_graph_compiler.html

Breaking Changes

* Experimental [microkernel API] in this release is not compatible with [the version available][microkernel API v3.5] in oneDNN v3.5.
* Updated minimal supported ACL version to 24.08.1 (was 24.04).

[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html

Thanks to these Contributors

This release contains contributions from the [project core team] as well as Abdel quickwritereader, Adam Jackson nwnk, Aleksandr Voron alvoron, Alexey Makarevich amakarev, Annop Wongwathanarat annop-w, Daniel Kuts apach301, deepeshfujitsu, Fadi Arafeh fadara01, Fritz Heckel fwph, Gorokhov Dmitriy dmitry-gorokhov, Deeksha Kasture kasturedeeksha, Kentaro Kawakami kawakami-k, Marek Michalowski michalowski-arm, matthias-bonne, Menooker, Michael Froelich MichaelFroelich,
Nicolas Miller npmiller, Nikhil Sharma nikhilfujitsu, nishith-fujitsu, Permanence AI Coder Permanence-AI-Coder, Radu Salavat Radu2k, Renato Barros Arantes renato-arantes, Robert Cohn rscohn2, Robert Hardwick robert-hardwick, Ryo Suzuki Ryo-not-rio, Shreyas-fuj Shreyas-fuj, Shu Chen shu1chen, Siddhartha Menon Sqvid, Song Jiaming Litchilitchy, Vladimir Paramuzov vladimir-paramuzov, Yifei Zhang yifeizh2. We would also like to thank everyone who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md

3.6rc

Performance Optimizations
Intel Architecture Processors
* Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
* Improved performance of group normalization primitive.
* Improved bf16 matmul performance with int4 compressed weights on processors with Intel AMX instruction set support.
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
* Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
* Improved performance of the following subgraphs with Graph API:
- `convolution` and `binary` operation fusions with better layout selection in Graph API.
- `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX instruction set.
- Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.

Intel Graphics Products
* Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
* Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
* Introduced support for Intel Arc Graphics for future Intel Core Ultra Processor (code name Arrow Lake-H).
* Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
* Improved `int8` convolution performance with weight zero points.
* Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
* Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
- `fp8` `convolution` and `unary` or `binary` on Intel Data Center GPU Max Series.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.

AArch64-based Processors
* Improved `fp32` convolution backpropagation performance on processors with SVE support.
* Improved reorder performance for blocked format on processors with SVE support.
* Improved `bf16` softmax performance on processors with SVE support.
* Improved batch normalization performance on processors with SVE support.
* Improved matmul performance on processors with SVE support.
* Improved `fp16` convolution with Arm Compute Library (ACL).
* Improved matmul performance with ACL.
* Switched matmul and convolution implementation with ACL to stateless API significantly improving primitive creation time and increasing caching efficiency and performance for these operators.

Functionality
* Introduced [generic GPU](https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md) support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL based implementations.
* Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
* Introduces support for stochastic rounding for `fp8` data type functionality.
* **[experimental]** Extended [microkernel API](https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html):
- Introduced `int8` quantization support.
- Extended transform microkernel with transposition support and support for arbitrary strides.
- Introduced verbose diagnostics support.
* **[experimental]** Extended [sparse API](https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse):
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
* Introduced `int8` support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs.
* Graph API:
- Introduced `GroupNorm` operation and fusions in Graph API.
- Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.

Usability
* Added [examples](https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph) for SDPA, MQA, and GQA patterns implementation with Graph API.
* Added [an example](https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp) for deconvolution primitive.
* Added examples for [Vanilla RNN](https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp) and [LBR GRU](https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp) RNN cells.
* Introduced support for Intel DPC++/C++ Compiler 2025.0.
* Introduced interoperability with [SYCL Graph](https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs) record/replay mode.
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
* **[experimental]** Introduced [logging mechanism](https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging) based on spdlog library.
* Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
* Improved performance of `get_partitions()` function in Graph API.

Validation
* Introduced protection from out of memory scenarios in benchdnn Graph API driver.

Breaking Changes
* Experimental [microkernel API](https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html) in this release is not compatible with [the version available](https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html) in oneDNN v3.5.
* Updated minimal supported ACL version to 24.08.1 (was 24.04).

Thanks to these Contributors
This release contains contributions from the [project core team](https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md) as well as Abdel quickwritereader, Adam Jackson nwnk, Aleksandr Voron alvoron, Alexey Makarevich amakarev, Annop Wongwathanarat annop-w, Daniel Kuts apach301, deepeshfujitsu, Fadi Arafeh fadara01, Fritz Heckel fwph, Gorokhov Dmitriy dmitry-gorokhov, Deeksha Kasture kasturedeeksha, Kentaro Kawakami kawakami-k, Marek Michalowski michalowski-arm, matthias-bonne, Menooker, Michael Froelich MichaelFroelich, Nicolas Miller npmiller, Nikhil Sharma nikhilfujitsu, nishith-fujitsu, Permanence AI Coder Permanence-AI-Coder, Radu Salavat Radu2k, Renato Barros Arantes renato-arantes, Robert Cohn rscohn2, Robert Hardwick robert-hardwick, Ryo Suzuki Ryo-not-rio, Shreyas-fuj Shreyas-fuj, Shu Chen shu1chen, Siddhartha Menon Sqvid, Song Jiaming Litchilitchy, Vladimir Paramuzov vladimir-paramuzov, Yifei Zhang yifeizh2. We would also like to thank everyone who asked questions and reported issues.

3.5.3

This is a patch release containing the following changes to v3.5.2:
* Fixed correctness issue in convolution weight gradient for small shapes on Intel GPUs (49eee6a145467d133af80bb3429a1153fcae2545, 281dd3bd38049f70da38d8b9d485c39ae80be78a)
* Extended MLP patterns supported by experimental Graph Compiler to cover cases relevant to ChatGLM model (ff680fc68bb633290531ec2f6c13abd39c072d50)
* Fixed performance regression in bf16 depthwise convolution on Intel CPUs (d6c216a7b59359790b9e572b46ec992adb873f95)

3.5.2

This is a patch release containing the following changes to v3.5.1:
* Fixed performance regression for some Graph API subgraphs with LayerNorm operation (82f629c1afa4ae2d50396c4e0e25cd26631daf2a)
* Fixed runtime error for Graph API subgraphs including 6D LayerNorm operation (f704f0910fcbf618a7c2ca41f8239c1c02057ec7)
* Fixed an issue with host compiler version detection in SYCL configurations (730b9766cf9a304dddf40a84575f2d93fdec76be)
* Fixed an issue with missing `DNNL_TARGET_ARCH` define for builds not relying on CMake (87848b9c953c9c57b5fd9bb78b505ab486e684b1)
* Fixed a test issue for matmul with low-precision scales and/or zero-points (91c35d8f5bdd7b58a8f30f1f11cb91dcb78a1dd9)
* Fixed segfault issue in bfloat16 shuffle on AArch64 processors (91166816ce10dd241cacffccc971e6e6f3b546f6)
* Fixed runtime issue in quantized layer normalization pattern with Graph API (0013e8ce633a8cac5edd01034d4d24c12dcb2ff8)

3.5.1

This is a patch release containing the following changes to v3.5:
* Fixed potential page fault in matmul on Intel Datacenter Max Series GPUs (a9c525d5af0919f26f62eeba8973ab5bc3468e21)
* Fixed potential stack overflow issue in convolution implementation for Intel GPUs (0fb7e6ed4f32e5d89832b2bd742bbf834cd296ed)
* Added test cases for matmul with compressed weights (015ccb1067eb1fd470025c08517a23a6971db9b9)
* Extended Graph API `LayerNorm` operation with zero points support (dc2701ae41345e4939eb328f1c1182d40eafd035)
* Fixed primitive creation error for depthwise convolution backpropagation on Intel GPUs (4a045e43509987517bfdf1e9e778f9b429510858, b529d2241001ba77e8a2eff78cba71121da09627)

3.5

Performance Optimizations

Intel Architecture Processors
* Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
* Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
* Improved performance of group normalization primitive.
* Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
* Improved performance of the following subgraphs with Graph API:
* Multi-Query Attention (MQA).
* Scaled Dot Product Attention (SDPA), including the variant with `select` operation.
* `LayerNorm` + `Multiply` + `Quantize` produced by SmoothQuant algorithm.
* `Convolution` + `Sigmoid` + `Multiply` with mixed precisions.

Intel Graphics Products
* Improved performance for Processor Graphics based on Xe2 architecture.
* Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
* Improved RNN primitive performance for LSTM cell case.
* Improved performance of `f8_e4m3` data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).

AArch64-based Processors
* Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
* Improved `bf16` matmul, convolution, and reorder primitives performance with Arm Compute Library (ACL).
* Improved eltwise primitive performance with `gelu_erf` algorithm with ACL.

Functionality
* Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
* Introduced support for `int4` data type and extended quantization model with support for grouped scales and zero points.
* Introduced `fp64` matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.
* Extended [floating point math mode API](https://oneapi-src.github.io/oneDNN/v3.5/dev_guide_attributes_fpmath_mode.html) to support weight decompression scenarios. See [matmul weights decompression example](https://github.com/oneapi-src/oneDNN/blob/rls-v3.5/examples/tutorials/matmul/weights_decompression_matmul.cpp) to get started. New floating mode is supported in the following configurations:
* `bfloat16` matmul with `int8` weights on Intel CPUs.
* `float16` and `bfloat16` matmul with `int8` or `int4` weights on Intel GPUs.
* **[experimental]** Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.

Usability
* Extended error messages for engine and memory objects creation errors.
* Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
* Introduced support for `clang++` host compiler in SYCL builds.
* Introduced API for tensor serialization and deserialization.
* Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
* Introduced OpenCL runtime support for Graph API.
* Added support for building oneDNN with installed Arm Compute Library (ACL).

Validation
* Extended benchdnn with support for tensor tags in RNN primitive validation.

Breaking Changes
* Updated minimal supported ACL version to 24.04 (was 23.11).

Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel quickwritereader, AngryLoki, Crefeda Rodrigues cfRod, Daniel Richard G. iskunk, David Svantesson davsva01, deepeshfujitsu, Dylan Angus dylan-angus-codeplay, Emanuele Rocca ema, Fadi Arafeh fadara01, Hernan Martinez hmartinez82, John Osorio kala855, Jonathan Deakin jondea, kasturedeeksha, Kentaro Kawakami kawakami-k, Nikita Shulga malfet, Radu Salavat Radu2k, Renato Barros Arantes renato-arantes, Roman Zhukov rozhukov, Ryo Suzuki Ryo-not-rio, Shreyas-fuj, Sunita Nadampalli snadampal, Tadej Ciglarič t4c1, Vineel Abhinav vineelabhinav, vishwascm. We would also like to thank everyone who asked questions and reported issues.

Onednn

Page 2 of 27

3.6

3.6rc

3.5.3

3.5.2

3.5.1

3.5

Page 2 of 27

Links

Releases