## Performance Optimizations
### Intel Architecture Processors
* Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
* Improved performance of group normalization primitive.
* Improved `bf16` matmul performance with `int4` compressed weights on processors with Intel AMX instruction set support.
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
* Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
* Improved performance of the following subgraphs with Graph API:
- `convolution` and `binary` operation fusions with improved layout selection.
- Fusions of `fp8` `convolution` with `unary` or `binary` operations on processors with Intel AMX instruction set support.
- Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
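The SDPA, MQA, and GQA patterns listed above can be sketched in plain NumPy. This is an illustration of the math the fused kernels compute, not oneDNN Graph API code; all function names and shapes here are our own:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)
    return softmax(scores) @ v

def gqa(q, k, v):
    """Grouped-query attention: several query heads share one KV head.
    MQA is the special case with a single KV head."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    return sdpa(q, k, v)

# batch=2, 8 query heads, 2 KV heads, seq=4, head_dim=16
q = np.random.rand(2, 8, 4, 16)
k = np.random.rand(2, 2, 4, 16)
v = np.random.rand(2, 2, 4, 16)
out = gqa(q, k, v)   # -> shape (2, 8, 4, 16)
```

The fused implementations compute the same result in a single pass instead of materializing the intermediate score matrix.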
### Intel Graphics Products
* Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
* Introduced broad production quality optimizations for a future discrete GPU based on the Xe2 architecture (code name Battlemage).
* Introduced support for Intel Arc Graphics for a future Intel Core Ultra processor (code name Arrow Lake-H).
* Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
* Improved `int8` convolution performance with weight zero-points.
* Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
* Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
- Fusions of `fp8` `convolution` with `unary` or `binary` operations on the Intel Data Center GPU Max Series.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
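For reference, the `fp8_e5m2` format mentioned above packs one sign bit, 5 exponent bits, and 2 mantissa bits into a single byte. A small decoder sketch (our own illustration of the encoding, not oneDNN code):

```python
def decode_fp8_e5m2(byte: int) -> float:
    """Decode one fp8_e5m2 byte (1 sign, 5 exponent, 2 mantissa bits, bias 15)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    mant = byte & 0x03
    if exp == 0x1F:                       # all-ones exponent: inf / NaN
        return sign * float("inf") if mant == 0 else float("nan")
    if exp == 0:                          # subnormal: no implicit leading 1
        return sign * (mant / 4.0) * 2.0 ** -14
    return sign * (1.0 + mant / 4.0) * 2.0 ** (exp - 15)

print(decode_fp8_e5m2(0x3C))  # 1.0
print(decode_fp8_e5m2(0x40))  # 2.0
```

The wide exponent range (and correspondingly coarse mantissa) is what makes `e5m2` attractive for gradients and activations with large dynamic range.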
### AArch64-based Processors
* Improved `fp32` convolution backpropagation performance on processors with SVE support.
* Improved reorder performance for blocked format on processors with SVE support.
* Improved `bf16` softmax performance on processors with SVE support.
* Improved batch normalization performance on processors with SVE support.
* Improved matmul performance on processors with SVE support.
* Improved `fp16` convolution performance with Arm Compute Library (ACL).
* Improved matmul performance with ACL.
* Switched matmul and convolution implementations with ACL to the stateless API, significantly reducing primitive creation time and improving caching efficiency and performance for these operators.
## Functionality
* Introduced [generic GPU] support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based implementations.
* Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
* Introduced stochastic rounding support for `fp8` data types.
* **[experimental]** Extended [microkernel API]:
- Introduced `int8` quantization support.
- Extended transform microkernel with transposition support and support for arbitrary strides.
- Introduced verbose diagnostics support.
* **[experimental]** Extended [sparse API]:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
* Introduced `int8` support in the eltwise primitive with the `clip` algorithm. This functionality is implemented on CPUs.
* Graph API:
- Introduced `GroupNorm` operation and fusions in Graph API.
- Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.
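Stochastic rounding, as introduced for `fp8` above, rounds a value up or down with probability proportional to its distance to each neighbor, so the rounded result is unbiased in expectation. A minimal NumPy sketch of the idea (our own illustration; oneDNN implements this inside the library):

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to multiples of `step`, rounding up with probability equal to
    the fractional part, so the result is unbiased in expectation."""
    scaled = x / step
    low = np.floor(scaled)
    frac = scaled - low                   # in [0, 1)
    up = rng.random(x.shape) < frac       # round up with prob = frac
    return (low + up) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
r = stochastic_round(x, step=1.0, rng=rng)
print(r.mean())   # close to 0.3: ~30% of values round to 1, the rest to 0
```

Deterministic round-to-nearest would map every 0.3 to 0.0, losing the signal entirely; stochastic rounding preserves it on average, which matters at `fp8` precision.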
[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse
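The COO storage format mentioned under the sparse API stores each nonzero as a (row, column, value) triple. A pure-Python sketch of a COO-times-dense matmul, the operation the extended matmul primitive supports (illustrative only; names and shapes are ours):

```python
def coo_matmul(rows, cols, vals, dense, m):
    """Multiply a sparse (m x k) matrix stored in COO form by a dense (k x n)
    matrix, visiting only the stored nonzeros."""
    n = len(dense[0])
    out = [[0.0] * n for _ in range(m)]
    for r, c, v in zip(rows, cols, vals):
        for j in range(n):
            out[r][j] += v * dense[c][j]
    return out

# Sparse 2x3 matrix [[1, 0, 2], [0, 3, 0]] in COO form:
rows, cols, vals = [0, 0, 1], [0, 2, 1], [1.0, 2.0, 3.0]
dense = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2
print(coo_matmul(rows, cols, vals, dense, m=2))  # [[3.0, 2.0], [0.0, 3.0]]
```

The work is proportional to the number of nonzeros rather than to m × k, which is the point of the sparse encoding.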
## Usability
* Added [examples][Graph API examples] for SDPA, MQA, and GQA patterns implementation with Graph API.
* Added [an example][deconvolution example] for the deconvolution primitive.
* Added examples for [Vanilla RNN][Vanilla RNN example] and [LBR GRU][LBR GRU example] RNN cells.
* Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
* Introduced interoperability with [SYCL Graph] record/replay mode.
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
* **[experimental]** Introduced [logging mechanism][spdlog] based on spdlog library.
* Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
* Improved performance of `get_partitions()` function in Graph API.
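`ONEDNN_ENABLE_WORKLOAD` is a CMake build knob; to our understanding it accepts `TRAINING` (the default) or `INFERENCE` to trim functionality and binary size, and with this release the setting applies to Graph API builds as well. A hedged example invocation:

```shell
# Build oneDNN with inference-only functionality (smaller binary);
# with this release the knob also covers Graph API code paths.
cmake -DONEDNN_ENABLE_WORKLOAD=INFERENCE -DCMAKE_BUILD_TYPE=Release ..
```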
[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging
## Validation
* Introduced protection from out-of-memory scenarios in benchdnn Graph API driver.
## Deprecated Functionality
* The experimental [Graph Compiler] is deprecated and will be removed in future releases.
[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_graph_compiler.html
## Breaking Changes
* The experimental [microkernel API] in this release is not compatible with [the version available][microkernel API v3.5] in oneDNN v3.5.
* Updated minimal supported ACL version to 24.08.1 (was 24.04).
[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html
## Thanks to these Contributors
This release contains contributions from the [project core team] as well as Abdel quickwritereader, Adam Jackson nwnk, Aleksandr Voron alvoron, Alexey Makarevich amakarev, Annop Wongwathanarat annop-w, Daniel Kuts apach301, deepeshfujitsu, Fadi Arafeh fadara01, Fritz Heckel fwph, Gorokhov Dmitriy dmitry-gorokhov, Deeksha Kasture kasturedeeksha, Kentaro Kawakami kawakami-k, Marek Michalowski michalowski-arm, matthias-bonne, Menooker, Michael Froelich MichaelFroelich,
Nicolas Miller npmiller, Nikhil Sharma nikhilfujitsu, nishith-fujitsu, Permanence AI Coder Permanence-AI-Coder, Radu Salavat Radu2k, Renato Barros Arantes renato-arantes, Robert Cohn rscohn2, Robert Hardwick robert-hardwick, Ryo Suzuki Ryo-not-rio, Shreyas-fuj Shreyas-fuj, Shu Chen shu1chen, Siddhartha Menon Sqvid, Song Jiaming Litchilitchy, Vladimir Paramuzov vladimir-paramuzov, Yifei Zhang yifeizh2. We would also like to thank everyone who asked questions and reported issues.
[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md