Onednn

Latest version: v2025.1.0

Safety actively analyzes 723650 Python packages for vulnerabilities to keep your Python projects secure.

Page 5 of 27

3.3.1

This is a patch release containing the following changes to v3.3:
* Fixed int8 convolution accuracy issue on Intel GPUs (09c87c79bccbad8fa451b224a0f07f87095e3907)
* Switched internal stream to in-order mode for NVIDIA and AMD GPUs to avoid synchronization issues (db01d62b3fc80897d88dc42f4dcdfcb0d90c131a)
* Fixed runtime error for `avgpool_bwd` operation in Graph API (d025ef6620b131f3487bb748866ddd9d7225c09f, 9e0602ad37afa18d46f407cb52577f1afead238b, e0dc1b3d070313052f5fd6ac739778d45b57859c)
* Fixed benchdnn error reporting for some Graph API cases (98dc9dbecb3f36234474c9d6e96ab6571497633b)
* Fixed accuracy issue in experimental Graph Compiler for int8 MHA variant from StarCoder model (5476ef7c165d943fbce94ca0f44a13d6868e65f3)
* Fixed incorrect results for layer normalization with trivial dimensions on Intel GPUs (a2ec0a0c5805314220db925e1323e4675e3ca379)
* Removed redundant synchronization for out-of-order SYCL queues (a96e9b1a6769171e74b0b8e031489303438906e5)
* Fixed runtime error in experimental Graph Compiler for int8 MLP subgraph from LLAMA model (595543dd093df3e92621c253d6da3f9092ec7ff8)
* Fixed `SEGFAULT` in experimental Graph Compiler for fp32 MLP subgraph (42071057abb2fcbbca6ed67117bdb1a5ee3dc0cd)
* Fixed incorrect results in experimental Graph Compiler for MLP subgraph (57e14b56d4e6fab2ab49dbd47fd579482d79535a)
* Fixed the issue with f16 inner product primitive with s8 output returning `unimplemented` on Intel GPUs (bf12207b0312c0174f0c47ae0d3abd70edc31957, 800b5e9613bd0994af82706ef024ad2b453be2b6, ec7054a2c79ae33d3db4ff04ce11360c2c896d56)
* Fixed incorrect results for int8 deconvolution with zero-points on processors with Intel AMX instructions support (55d2cecd698f865efac2e1dbf2f701b4b8095df1)

3.3

Performance Optimizations
* Intel Architecture Processors:
* Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
* Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
* Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via [CPU dispatcher control](https://oneapi-src.github.io/oneDNN/dev_guide_cpu_dispatcher_control.html).
* Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
* Improved s32 binary primitive performance.
* Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
* Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
* Improved performance of convolution for depthwise cases with Graph API.
* **[experimental]** Improved performance of LLAMA2 MLP block with Graph Compiler.
* Intel Graphics Products:
* Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
* Reduced RNN primitive initialization time on Intel GPUs.
* AArch64-based Processors:
* Improved fp32 to bf16 reorder performance.
* Improved max pooling performance with Arm Compute Library (ACL).
* Improved dilated convolution performance for depthwise cases with ACL.

Functionality
* Introduced group normalization primitive support. The functionality is currently available on CPUs.
* Intel CPUs:
* Introduced support for zero points in int8 convolution with groups and 3D spatial.
Usability
* Extended [verbose mode](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html) output:
* Improved diagnostics on engine creation errors.
* Added information on Graph API calls.
* Added information on strides for non-dense memory objects.
* Added values of runtime dimension.
* Added indication that primitive descriptor was created with `any` memory format tag.
* Introduced [examples for Graph API](https://github.com/oneapi-src/oneDNN/tree/master/examples/graph).
* Graph API constant tensor cache is now disabled by default and requires opt-in with [`dnnl::graph::set_constant_tensor_cache()`](https://oneapi-src.github.io/oneDNN/group_dnnl_graph_api_constant_tensor_cache.html#doxid-group-dnnl-graph-api-constant-tensor-cache-1ga9e37974d35ff5aafe1cbae2f69a2ab00) call.
* Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation
* Extended benchdnn performance reporting with primitive creation time.
* Introduced [cold cache mode](https://github.com/oneapi-src/oneDNN/blob/master/tests/benchdnn/doc/cold_cache.md) in benchdnn.

Known Limitations
* Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
* Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
* Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on Intel Datacenter GPU Max Series.
* Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
* Int8 deconvolution with signed weights and activations may produce incorrect results of processors with Intel AMX support.
* Int8 softmax may fail crash on Windows in SYCL debug configuration.

Thanks to these Contributors
This release contains contributions from the project core team as well as Amy Wignall AmyWignall-arm, baibeta, Benjamin Taylor bentaylorhk-arm, Ilya Lavrenov ilya-lavrenov, Kentaro Kawakami kawakami-k, Milos Puzovic milpuz01, Renato Barros Arantes renato-arantes, snadampal, sparkyrider, and Thomas Köppe tkoeppe. We would also like to thank everyone who asked questions and reported issues.

3.3rc

Performance Optimizations
* Intel Architecture Processors:
* Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
* Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
* Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via [CPU dispatcher control](https://oneapi-src.github.io/oneDNN/dev_guide_cpu_dispatcher_control.html).
* Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
* Improved s32 binary primitive performance.
* Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
* Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
* Improved performance of convolution for depthwise cases with Graph API.
* **[experimental]** Improved performance of LLAMA2 MLP block with Graph Compiler.
* Intel Graphics Products:
* Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
* Reduced RNN primitive initialization time on Intel GPUs.
* AArch64-based Processors:
* Improved fp32 to bf16 reorder performance.
* Improved max pooling performance with Arm Compute Library (ACL).
* Improved dilated convolution performance for depthwise cases with ACL.

Functionality
* Introduced group normalization primitive support. The functionality is currently available on CPUs.
* Intel CPUs:
* Introduced support for zero points in int8 convolution with groups and 3D spatial.
Usability
* Extended [verbose mode](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html) output:
* Improved diagnostics on engine creation errors.
* Added information on Graph API calls.
* Added information on strides for non-dense memory objects.
* Added values of runtime dimension.
* Added indication that primitive descriptor was created with `any` memory format tag.
* Introduced [examples for Graph API](https://github.com/oneapi-src/oneDNN/tree/master/examples/graph).
* Graph API constant tensor cache is now disabled by default and requires opt-in with [`dnnl::graph::set_constant_tensor_cache()`](https://oneapi-src.github.io/oneDNN/group_dnnl_graph_api_constant_tensor_cache.html#doxid-group-dnnl-graph-api-constant-tensor-cache-1ga9e37974d35ff5aafe1cbae2f69a2ab00) call.
* Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation
* Extended benchdnn performance reporting with primitive creation time.
* Introduced [cold cache mode](https://github.com/oneapi-src/oneDNN/blob/master/tests/benchdnn/doc/cold_cache.md) in benchdnn.

Thanks to these Contributors
This release contains contributions from the project core team as well as Amy Wignall AmyWignall-arm, baibeta, Benjamin Taylor bentaylorhk-arm, Ilya Lavrenov ilya-lavrenov, Kentaro Kawakami kawakami-k, Milos Puzovic milpuz01, Renato Barros Arantes renato-arantes, snadampal, sparkyrider, and Thomas Köppe tkoeppe. We would also like to thank everyone who asked questions and reported issues.

3.2.1

This is a patch release containing the following changes to v3.2:
* Fixed a potential issue `SEGFAULT` when oneDNN primitives created in parallel (0a6202f5000cf347995ab744c25aa26cabf2482d)
* Replaced deprecated SYCL API `get_pointer` with `get_multi_ptr` (fdbff4591f952d02a0c934f854a9b225a7097a21, 51ed43bb5cb08f38b0b652255a13bb4072b2ee57)
* Fixed an error in device indices detection for persistent cache (25575c2d20a9885640c89771c99a0d27b5444b4d)
* Improved benchdnn performance results accuracy for Graph API (9dfe343992209ecc6eb1265a140b6f0db228d90a)
* Fixed an issue with profiling API not respecting `ONEDNN_EXPERIMENTAL_PROFILING` build knob. This behavior manifests in apparent memory leak when oneDNN primitives are executed on a queue with enabled profiling (8d796efb609c33ecdd31e3e7b26d94d959dd83b9, 51a8f7ad892b1174d32cba8358804fad09b58f76, 2ca29381eeb5dde64d90468e440f87b6f9ad01d9)
* Fixed a correctness issue in resampling primitive with binary and/or sum post-op on Intel CPUs (65ccd2506eeafb44822c682acfef97ef18bea09f, 4a0e087b405f4ebc682cf82c4a5bb96e9b9976d4, f333bb8c191fbfab368645aeac1c3a0d1892eda4)
* Fixed a correctness issue in int8 matmul with zero-points for processors with Intel AVX2 and Intel DL Boost instructions support (ec0b2ee85fc2a2dbdeec10035c5ef5813d8fb5ea, 6d2e567c9361992adf235545c9fc2047184ed6e6)
* Fixed a correctness issue in fp32 batched matmul with transposed source tensor on processors with Intel AVX-512 instruction set support (36f355e0773f79cca5a639a5a3558f45da57c35d)
* Fixed a correctness issue in matmul and inner product with post-ops on processors with Intel AVX2 and Intel DL Boost with fp16 and bfloat16 instruction set support (b76d4cae333fc4e015d47eb737e10551daf30334)
* Fixed a potential out of bounds issue during GPU kernel creation (190a9b28170f5326241c9c4ab6bc7964877e953d)
* Updated build system to use TBB-provided CMake config file when available (40112196287e8866a7259df35f817229454d0c96)

3.2

Performance Optimizations
* Intel Architecture Processors:
* Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
* Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via [CPU dispatcher control](https://oneapi-src.github.io/oneDNN/dev_guide_cpu_dispatcher_control.html).
* Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
* Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.

* Intel Graphics Products:
* Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
* Reduced creation time for matmul, inner product, and RNN primitives.

* AArch64-based Processors:
* Improved convolution performance with post-ops on processors with SVE support.
* Improved fp32 and fp16 depth-wise convolution performance with Arm Compute Library (ACL).
* Improved fp32 deconvolution performance for math mode `bf16` or `any` with ACL.

* IBM Z Platform:
* Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.

Functionality
* **[experimental]** Introduced [Graph Compiler backend](https://oneapi-src.github.io/oneDNN/dev_guide_graph_compiler.html) for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-level perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.
* Extended Graph API with boolean data type, [select](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), and [pow](https://oneapi-src.github.io/oneDNN/dev_guide_op_pow.html) operations.
* Introduced support for binary and eltwise post-ops in softmax primitives.
* Introduced reference SYCL implementations of batch normalization, layer normalization, linear response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.

* Intel Graphics Products:
* Introduced mixed precision support for binary primitives.

* NVIDIA GPUs:
* Introduced bfloat16 support for deconvolution and softmax primitives.

* AMD GPUs:
* Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives support.

Usability
* Extended [verbose mode](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html) with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors.
* Reduced stack consumption to less than 20 KB across implementations.
* **[experimental]** Introduced [profiling API](https://oneapi-src.github.io/oneDNN/dev_guide_experimental.html#onednn-experimental-profiling) for SYCL and OpenCL applications.

Validation
* Introduced fast performance validation mode (`--mode=F`) in [benchdnn](https://github.com/oneapi-src/oneDNN/tree/master/tests/benchdnn#readme). Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
* Reduced benchdnn memory consumption in performance validation mode.
* Introduced [smoke test set](https://github.com/oneapi-src/oneDNN/blob/rls-v3.2/cmake/options.cmake#L75) for benchdnn. This test set provides basic validation for all primitives.

Known Limitations
* fp32 matmul with bfloat16 binary post-op may produce incorrect results on processors with Intel AVX2 and Intel DL Boost support.
* fp32 convolution forward propagation with strides has performance regression on processors with Intel AVX-512 instructions support.
* Resampling primitive with binary post-op may produce incorrect results on CPUs.
* Extensive use of the RNN primitive on Intel GPUs with default primitive cache settings may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
* Convolution and deconvolution primitives on Intel Arc GPUs on Windows may cause memory corruption under heavy repeated use.
* bfloat16 matmul primitive may crash on Intel Arc GPUs on Windows.
* Pooling, resampling, PRelu, batch normalization, and layer normalization may sporadically produce incorrect results on Intel Arc GPUs on Windows.
* oneDNN Graph partitions containing `ConvTransposeBackwardWeights` or int8 `matmul` operations may produce incorrect results on Intel Processor Graphics on Windows.
* bfloat16 matmul primitive has performance regression with shapes 14x128:128x200:14x200 and 200x128:128x200:200x200 on the Intel Data Center GPU MAX Series.
* oneDNN primitives may crash or produce incorrect results with tensors exceeding 4 Gb in size on Intel GPUs.
* Softmax primitive with a NHWC memory format may produce incorrect results on the Intel Data Center GPU Max Series.
* Inner product weight gradient may produce incorrect results on Intel Processor Graphics on Windows.

Thanks to the Contributors
This release contains contributions from the project core team as well as Abdelrauf quickwritereader, Alexey Vishnyakov SweetVishnya, Annop Wongwathanarat annop-w, Anthony Roberts anthony-linaro, Crefeda Rodrigues cfRod, David Svantesson davsva01, Fadi Arafeh fadara01, Ilya Lavrenov ilya-lavrenov, Jonathan Deakin jondea, Kentaro Kawakami kawakami-k, Milos Puzovic milpuz01, RambabuSwargam RambabuSwargam, Sai Teja saiteja13427, Taiju Tsuiki tzik. We would also like to thank everyone who asked questions and reported issues.

3.2rc

Performance Optimizations
* Intel Architecture Processors:
* Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
* Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via [CPU dispatcher control](https://oneapi-src.github.io/oneDNN/dev_guide_cpu_dispatcher_control.html).
* Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
* Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.

* Intel Graphics Products:
* Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
* Reduced creation time for matmul, inner product, and RNN primitives.

* AArch64-based Processors:
* Improved convolution performance with post-ops on processors with SVE support.
* Improved fp32 and fp16 depth-wise convolution performance with Arm Compute Library (ACL).
* Improved fp32 deconvolution performance for math mode `bf16` or `any` with ACL.

* IBM Z Platform:
* Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.

Functionality
* *[experimental]* Introduced [Graph Compiler backend](https://oneapi-src.github.io/oneDNN/dev_guide_graph_compiler.html) for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-level perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.
* Extended Graph API with boolean data type, [select](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), and [pow](https://oneapi-src.github.io/oneDNN/dev_guide_op_pow.html) operations.
* Introduced support for binary and eltwise post-ops in softmax primitives.
* Introduced reference SYCL implementations of batch normalization, layer normalization, linear response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.

* Intel Graphics Products:
* Introduced mixed precision support for binary primitives.

* NVIDIA GPUs:
* Introduced bfloat16 support for deconvolution and softmax primitives.

* AMD GPUs:
* Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives support.

Usability
* Extended [verbose mode](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html) with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors.
* Reduced stack consumption to less than 20 KB across implementations.
* *[experimental]* Introduced [profiling API](https://oneapi-src.github.io/oneDNN/dev_guide_experimental.html#onednn-experimental-profiling) for SYCL and OpenCL applications.

Validation
* Introduced fast performance validation mode (`--mode=F`) in [benchdnn](https://github.com/oneapi-src/oneDNN/tree/master/tests/benchdnn#readme). Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
* Reduced benchdnn memory consumption in performance validation mode.
* Introduced [smoke test set](https://github.com/oneapi-src/oneDNN/blob/rls-v3.2/cmake/options.cmake#L75) for benchdnn. This test set provides basic validation for all primitives.

Thanks to the Contributors
This release contains contributions from the project core team as well as Abdelrauf quickwritereader, Alexey Vishnyakov SweetVishnya, Annop Wongwathanarat annop-w, Anthony Roberts anthony-linaro, Crefeda Rodrigues cfRod, David Svantesson davsva01, Fadi Arafeh fadara01, Ilya Lavrenov ilya-lavrenov, Jonathan Deakin jondea, Kentaro Kawakami kawakami-k, Milos Puzovic milpuz01, RambabuSwargam RambabuSwargam, Sai Teja saiteja13427, Taiju Tsuiki tzik. We would also like to thank everyone who asked questions and reported issues.

Page 5 of 27

Releases

Has known vulnerabilities

Previous Next

Onednn

Page 5 of 27

3.3.1

3.3

3.3rc

3.2.1

3.2

3.2rc

Page 5 of 27

Links

Releases