Onednn

Latest version: v2025.1.0

Safety actively analyzes 723650 Python packages for vulnerabilities to keep your Python projects secure.

Page 1 of 27

3.7.2

This is a patch release containing the following changes to v3.7.1:
* Fixed hang in matmul with odd shapes on Intel Arc GPUs (46e7499d8576ac036346f23c9f6d746823dd56b0)
* Fixed out-of-registers error in matmul on Intel Arc GPUs (599c8390610bbd9e8c8afdfeba582726ad3af0e1)
* Fixed incorrect results in SDPA pattern on Intel GPUs (6343c73143a6185d440bbd61e4089ed07196e9ec)
* Fixed integer overflow in convolution with large shapes on x64 CPUs (c541100d5b7678cf042c698aca5cddcbac1426a2)
* Fixed access violation issue in experimental Graph Compiler (8b0e6265648a95f3fdb8f3e734b8eb3075538de2)
* Fixed access violation in pooling on Intel GPUs (cd2cd5d654078608b9f0de3a1a847708dc3d8e15)
* Improved performance of int8 matmul with int4 weights on Intel GPUs (d6c98ec835f449c71fe359f9da68cf48ed68fad1)

3.7.1

This is a patch release containing the following changes to v3.7:
* Fixed correctness issue in `int8` matmul primitive with `int4` weights on on Intel Arc graphics (b16184d155b578036c94e44fbc960e25b3c522f7)
* Fixed matmul performance regression on Intel Arc graphics (41e406bfb448a0600a50ef213c5237d4a3ce3155)
* Fixed potential integer overflow in `bf16` convolution for processors with Intel AVX-512 instruction set support (f882861fe0d0f2ed61b42377408ad62e5a665bc2)
* Fixed functional issue in matmul with dropout attribute on generic GPUs (83033303c072c3b18e070d1abfedfd1f50248eac)
* Fixed functional issues in matmul with scales on NVIDIA GPUs (e8d8594956a626be1debf359618024d1cb7c702a)
* Fixed integer overflows for large shapes in convolution for x64 processors (fc3f17ad469b8a6da7192ae12d32625faa509f1e, 31b079f482a80836cd4192f85e703d424876febc)
* Worked around an MSVC 19.29.30158.0 bug that results in a crash at binary primitive creation on x64 processors (50dd6cc832732ff0ea07e993bb07f61343ca1375)

3.7

Performance Optimizations

Intel Architecture Processors
* Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
* Improved performance of `int8` and `fp32` forward convolution primitive on processors with Intel AVX2 instruction set support.
* Improved performance of `fp8` matmul primitives with `bf16` and `fp16` bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
* Improved performance of `int8` RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
* Improved performance of `int8` depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
* Improved `fp16` and `bf16` softmax performance with relaxed [accumulation mode].
* Improved performance of `int8` matmul primitive with `fp16` output data type.
* Improved performance of the following subgraphs with Graph API:
* [Gated Multi-Layer Perceptron (Gated MLP)].

[accumulation mode]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode

Intel Graphics Products
* Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
* Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
* Improved performance of convolution with source zero points by pre-packing compenstation.
* Improved performance of backward by data convolution with strides for large filter.
* Improved performance of the following subgraphs with Graph API:
* Scaled Dot-Product Attention (SDPA) with [implicit causal mask].
* SDPA with [`int8` or `int4` compressed key and value].
* Gated MLP.

[implicit causal mask]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa
[`int8` or `int4` compressed key and value]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv
[Gated Multi-Layer Perceptron (Gated MLP)]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_graph_gated_mlp.html#doxid-dev-guide-graph-gated-mlp

AArch64-based Processors
* Improved `bf16` matmul performance with `fp32` destination with Arm Compute Library (ACL).
* Improved `bf16` to `fp32` reorder performance.
* Improved `bf16` reorder performance.
* Improved `bf16` convolution with ACL.

NVIDIA GPUs
* Improved matmul performance using cuBLASLt-based implementation.

Functionality

Common
* Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
* Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs.
* Introduced initial support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
* Introduced [`GenIndex`], and [`GreaterEqual`] operations in Graph API.

[`GenIndex`]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_op_genindex.html
[`GreaterEqual`]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_op_greaterequal.html

Intel Architecture Processors
* Introduced support for `fp32` matmul with `fp16` and `bf16` weights.

Intel Graphics Products
* Introduced stochastic rounding support for convolution, matmul and reorder based on Philox counter-based random number generator.
* Introduced support for strided memory formats in convolution.

Generic GPU vendor
* Introduced support for reduction primitive.
* Introduced support for inner product primitive forward propagation.

Usability

Common
* With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
* Added Graph API examples for [Gated MLP] and [`int4` Gated MLP] patterns.

[Gated MLP]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/examples/graph/gated_mlp.cpp
[`int4` Gated MLP]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/examples/graph/gated_mlp_int4.cpp

Intel Architecture Processors
* Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for Intel CPU and Intel GPU implementations.
* Enabled frame pointers support on Intel64 platforms to improve integration with profilers.

Intel Processor Graphics
* Improved verbose diagnostics for Intel GPU driver compatibility issues.
* Improved support of large size tensors in convolution, matmul and reduction primitives on Intel GPUs.
* Reduced scratchpad usage for NCHW convolution on Intel GPUs.

AArch64-based Processors
* Added support for the Arm Compute Library (ACL) thread_local scheduler via ThreadpoolScheduler.
* Improved memory efficiency in ACL matmuls by fixing a bug where scratchpad memory was not being used.
* Made the ACL matmul primitive thread-safe which allows concurrent execution.

Validation
* Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
* Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
* Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.

Deprecated Functionality
* Experimental [Graph Compiler] is deprecated and will be removed in future releases.

[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_graph_compiler.html

Breaking Changes
* Updated minimal supported CMake version to 3.13 (was 2.8.12).
* Updated minimal supported GCC version to 8.0 (was 4.8).
* Updated minimal supported Clang version to 11.0 (was 3.0).
* Updated minimal supported ACL version to 24.11.1 (was 24.09).
* Removed support for SYCL standards preceding SYCL 2020.
* Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavir can be enabled with relaxed [accumulation mode].

Thanks to our Contributors

This release contains contributions from the [project core team] as well as Aditya Tewari aditew01, Alexandra Sidorova a-sidorova, Atharva Dubey AD2605, Deb Taylor deb-intel, Dmitriy Ovchinnikov inteldimitrius, Fadi Arafeh fadara01, Hengyu Meng airMeng, hmaciak, John Karasev karasjoh000, John Osorio kala855, Keola Wierschem kwiersch, Marek Michalowski michalowski-arm, Michael Froelich MichaelFroelich, Michał Górny mgorny, Nicolò Scipione s-Nick, Nikhil Sharma nikhilfujitsu, Permanence AI Coder Permanence-AI-Coder, raistefintel, Ravi Pushkar rpushkarr, Renato Barros Arantes renato-arantes, Romain Biessy Rbiessy, Ryo Suzuki Ryo-not-rio, Shreyas-fuj, Tadej Ciglarič t4c1, Varad Ahirwadkar varad-ahirwadkar, Viktoriia Gvozdeva vgvozdeva, vishwascm, yair-obodovsky, Ye Tao taoye9. We would also like to thank everyone who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/MAINTAINERS.md

3.7rc

Performance Optimizations
Intel Architecture Processors
* Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
* Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
* Improved performance of convolution and matmul primitives on processors with Intel AMX support.
* Improved performance of fp8 matmul primitives with bf16 and fp16 bias datatype on processors with Intel AMX instruction set support.
* Improved performance of int8 matmul primitive with fp16 output datatype.
* Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.

Intel Graphics Products
* Introduced initial optimizations for GPUs based on Xe3 architecture.
* Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
* Improved performance of the following subgraphs with Graph API
* Scaled dot-product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
* Scaled dot-product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)

Functionality
* Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
* Enabled support for matmul primitive with grouped quantization on weight along N dimension.
* Graph API: new [`Select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations.
* Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
* Introduced support for grouped scales and zero points in reorder primitive.
* Enabled support for 4d weight scale in matmul primitive.
* Graph API: added support for Quantized and non-quantized Gated MLP pattern.
* Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder.

Usability
* With SYCL runtime, memory objects on CPU engine are now reference-counted and no more need to be explicitly kept alive by user for the duration of the primitive execution. This align memory object lifetime behavior on CPU and GPU engines.
* Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
* Improve verbose diagnostic to simplify debugging of nGEN fallbacks.
* Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
* Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP.

Validation
* Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
* Extended benchdnn with support for rewriting data types in the test JSON files in graph driver.
* Extended benchdnn with support and validation for the number of partition returned from the test JSON files.

Breaking Changes
* Updated minimal supported CMake version to 3.13 (was 2.8.12).
* Updated minimal supported GCC version to 8.0 (was 4.8).
* Updated minimal supported Clang version to 11.0 (was 3.0).
* Removed support for SYCL older than 2020.

Thanks to these Contributors
This release contains contributions from the [project core team] as well as Aditya Tewari aditew01, Alexandra Sidorova a-sidorova, Atharva Dubey AD2605, Deb Taylor deb-intel, Dmitriy Ovchinnikov inteldimitrius, Fadi Arafeh fadara01, Hengyu Meng airMeng, hmaciak, John Osorio kala855, Marek Michalowski michalowski-arm, Michael Froelich MichaelFroelich, Michał Górny mgorny, Nikhil Sharma nikhilfujitsu, Permanence AI Coder Permanence-AI-Coder, raistefintel, Ravi Pushkar rpushkarr, Renato Barros Arantes renato-arantes, Romain Biessy Rbiessy, Ryo Suzuki Ryo-not-rio, Shreyas-fuj, Varad Ahirwadkar varad-ahirwadkar, vishwascm, and Ye Tao taoye9. We would also like to thank everyone who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/MAINTAINERS.md

3.6.2

This is a patch release containing the following changes to v3.6.1:
* Fixed segmentation fault issue in convolution primitive on processors with Intel AVX2 instruction set support (2eb3dd1082db767fab171e934c551c609008289a)
* Added a workaround for build issue with GCC 8.2 and GNU binutils 2.27 (19ef223ec2095e3293fc672ae598588b7a85304b, 262fb02aff10ce0220dc1d116c5d0ef5e027d573, e3782e8c1355c176efae877af0f05831158f28f8)
* Fixed a thread safety issue in matmul primitive for builds relying on Arm Compute Library (ACL) and bumped minimal supported ACL version to [24.11.1](https://github.com/ARM-software/ComputeLibrary/releases/tag/v24.11.1) (4d962e7442e25792920e37a0b1b31d6618581c2a)
* Suppressed spurious warnings for GCC (7d3164d14ce57a1e2e3e8e9ef021fe3965c09f32, c805a5033c49382151a50d87d78e38248eafe363, e526172f57725230d41d76e83ec9a622a58829bc, dc780cbb0f6e124b78ff57a939757876cd2d6b60)
* Fixed segfaults in BRGEMM-based matmul, convolution, and deconvolution implementations on AArch64-based processors (a873a1c0e29e3e9e2dc482b3ce344b5c1dee2d42, 9a1dc92e8085f630b3d6f1764b5b0dcdeebd68ae)
* Fixed performance regression in `bf16` convolution with ACL on AArch64-based processors (4793296821c61c75b4af2f24fb542921ee931aaf)
* Fixed an issue with convolution primitive creation with `PREFER_YMM` CPU ISA hint on AArch64-based processors (e34d9921c3e8420f924cd6c9837e63aa9945aff3)
* Improved `bf16` matmul performance with fp32 destination with ACL on AArch64-based processors (548d5d67cec2589bf04aed804ab8c7c37ac8d13c)
* Improved `bf16` to `fp32` reorder performance on AArch64-based Processors (917dd13f1da78c643604bec142e274be5631b421)
* Fixed issue in matmul primitive with 4D tensors on AArch64-based processors (d13c966b7bca8d8bad5efe95d369cc3f448a3c59)
* Suppressed spurious GCC warnings in deconvolution primitive on AArch64-based processors (f90f60ea066d40d48b57f77d2421bf92c14e8c88)
* Fixed warnings in BRGEMM implementation on AArch64-based processors (866b196ab86558743c4ac09ecfcd0bd26f02af98)
* Fixed correctness issue in reorder primitive with zero points for 4D shapes on AArch64-based Processors (836ea10e3afc642a6de0018186c023efc09559cc)
* Improved `bf16` reorder performance on AArch64-based Processors (12bafbe1346f2c8f82c9c14a9b4e12d259f76135)
* Fixed performance regression for backward convolution primitive descriptor creation time on Intel processors (2b3389fe52e0557be0c16df6dfa433f59d485118)
* Improved performance of `fp16` matmul with `int4` weights on Intel GPUs based on Xe2 architecture (4c8fb2c2e0d4e54f799fc76329642bc97e60635f, 3dd4f43c27fc5e1552b064ab6a7cc3dca003c51c, 280bd28fd8ba33aa99df1abb1de5ec782dab2159)
* Fixed performance regression for `int8` convolution with large spatial sizes on processors with Intel AMX support (05d68df233bb67046697758962cd32bd6d23a956)
* Restricted check for microkernel fusion support to cases when fusion functionality is actually used on Intel GPUs (48f6bd93fba1bf01a678c29d9be28b111b595a57)

3.6.1

This is a patch release containing the following changes to v3.6:
* Fixed convolution correctness issue in some scenarios involving persistent cache on Intel GPUs (e595e595a7aeecc74f1b34e194f787d7639519c8)
* Fixed potential page faults in reduction primitive implementation for Intel GPUs (7740c75ad347bfc4491d7dfc2ffb269a24c56490, a4fcef9ed1ebd5190cb7c5d5f998ce33c8d120e3, 32d86600146a2970724f4d6a4dbaa8c937afef6e)
* Implemented a workaround for GCC 13 bug that resulted in matmul hangs on some Intel Arc graphics SKUs (a30d5267c2baaa916a92761fb3e187eb2bfd1ecd)
* Updated execution units (EU) number detection logic for Intel GPUs based on Xe2 architecture to accommodate for behavior changes in Linux driver (04e7eaccf5039db5369ccf87957873989f59f01f, 97b04bdd8536ee3f13ff5038691d4ef5f6f00d6e)
* Fixed build issue for static library with ONEDNN_VERBOSE=OFF (7f476cbbdbb171e4a394f66fd83eb7a86755ab04)
* Fixed correctness issue in SYCL deconvolution implementation with post-ops (8f600a3314374306e3370ed37b8f8aef539cf79a)
* Fixed memory formats checks in SYCL softmax implementation (6ae73e4f1039d2eb1530bf863cefdf7145b856aa)
* Fixed correctness issue in SYCL resampling implementation with post-ops (984505720edddde0ae7f9e17ed60853721e17091)
* Aligned accessor types in SYCL kernels with SYCL specification (0d9b3bd68405b9fb3606c498c46df74c4170cead)
* Improved scales argument checks in generic SYCL kernels (9f73bf19ca594dbe52086294fad5492dbd0fbdd7, 7d85c7546b98589435677d624e52387f51ef9421)
* Fixed correctness issue in int8 convolution with sum post-op on NVIDIA GPUs (7486ed83f72fa60ad84f5cbf549c4128ea8de8be)
* Relaxed accuracy test threshold for bf16 softmax on NVIDIA GPUs (e9d0fdbfa757c9e3738a5721a737505e75857598)
* Added support for bf16 and fp16 bias for fp8 matmul on Intel CPUs (188ae7f3e3410a76a81d10b2b2bed3be7afdc307)
* Fixed a bug that prevented dispatching Intel AVX-512 with Intel DL Boost implementation in int8 RNN primitive (bf58e72e5776a831bc65882e97e38f64626d5dec)
* Fixed a runtime fail with `CL_OUT_OF_RESOURCES` error in fp16 convolution on Intel Arc graphics (39a5f6753ae376cebb2ed4e16379ca1d78d1459b, 7e1663fea5f9e0db643cce3d993d7f34145dcda6)

Page 1 of 27

Releases

Has known vulnerabilities

Onednn

Page 1 of 27

3.7.2

3.7.1

3.7

3.7rc

3.6.2

3.6.1

Page 1 of 27

Links

Releases