Onednn

Latest version: v2025.1.0


Page 4 of 27

3.4rc

Performance Optimizations

* Intel Architecture Processors:
    * Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    * Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    * Improved RNN primitive performance with LBR_GRU cell.
    * Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    * Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    * Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    * Improved int8 matmul performance with transposed A tensor.
    * Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    * Improved performance of int8 convolution with post-ops.
    * Optimized batch matmul with binary post-op and broadcast mask `1` and `14`.
    * Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    * Improved performance of subgraphs including `matmul` and `add` operations and mixed int8 and bfloat16 data types with Graph API.
    * **[experimental]** Improved performance of `reduction`, `softmax` and `layernorm` operations with experimental Graph Compiler backend.
    * **[experimental]** Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.

* Intel Graphics Products:
    * Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    * Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    * Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    * Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    * Improved convolution performance for cases relevant to the Stable Diffusion model.
    * Improved RNN primitive performance.
    * Improved pooling forward propagation performance.
    * Improved batched matmul performance for cases with 5 dimensions or more.

* AArch64-based Processors:
    * Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    * Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    * Improved bf16 inner product primitive performance with ACL.
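
The batch matmul item above refers to broadcast masks `1` and `14` for the binary post-op. As a rough illustration of what such a mask encodes, the sketch below decodes a mask into the shape of the second binary operand. It assumes that bit `i` of the mask is set when the operand has full extent along dimension `i` and is cleared when that dimension is broadcast; the helper name `broadcast_operand_dims` and the 4D example shape are hypothetical and not part of oneDNN.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a oneDNN API.
// Assumption: bit i of `mask` is set when the second binary operand has
// the same extent as the destination along dimension i, and cleared when
// that dimension is broadcast (extent 1).
std::vector<std::int64_t> broadcast_operand_dims(
        const std::vector<std::int64_t> &dst_dims, unsigned mask) {
    std::vector<std::int64_t> dims(dst_dims.size(), 1);
    for (std::size_t i = 0; i < dst_dims.size(); ++i) {
        if (mask & (1u << i)) dims[i] = dst_dims[i];
    }
    return dims;
}
```

Under this assumption, for a 4D destination of shape `{8, 12, 128, 64}`, mask `14` (binary `1110`) describes an operand of shape `{1, 12, 128, 64}` that is broadcast over the first dimension, and mask `1` (binary `0001`) describes an operand of shape `{8, 1, 1, 1}` that varies only along the first dimension.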

Functionality
* Introduced [GPT-Q support](https://github.com/igorsafo/oneDNN/tree/rfcs-gpt-quantization/rfcs/20231108-gpt-quantization) to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and supports [matmul with int8 weight compression](https://oneapi-src.github.io/oneDNN/page_weights_decompression_matmul_cpp.html#doxid-weights-decompression-matmul-cpp).
* Introduced [fp8 data type](https://oneapi-src.github.io/oneDNN/dev_guide_data_types.html) support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
* Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
* **[experimental]** Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
* Intel Graphics Products:
    * Introduced PReLU post-op support for inner product and matmul primitives.
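
To make the idea of matmul with int8 weight compression concrete, here is a minimal reference sketch in plain C++. It is not the oneDNN implementation and does not use the oneDNN API: the function name `matmul_compressed_weights` is hypothetical, and the per-output-channel scale and zero point layout is an assumption chosen for simplicity (the linked RFC and example describe the schemes oneDNN actually supports).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine, not a oneDNN API.
// Computes dst[M x N] = src[M x K] * dequant(wei[K x N]), all row-major.
// Assumption: one scale and one zero point per output channel n, so that
//   w_f32[k][n] = scale[n] * (w_s8[k][n] - zero_point[n]).
std::vector<float> matmul_compressed_weights(const std::vector<float> &src,
        const std::vector<std::int8_t> &wei, const std::vector<float> &scale,
        const std::vector<std::int32_t> &zero_point, std::size_t M,
        std::size_t K, std::size_t N) {
    std::vector<float> dst(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                // Decompress one weight on the fly, then multiply-accumulate.
                const float w = scale[n]
                        * static_cast<float>(wei[k * N + n] - zero_point[n]);
                acc += src[m * K + k] * w;
            }
            dst[m * N + n] = acc;
        }
    }
    return dst;
}
```

The weights stay in memory as int8 and are expanded only inside the inner loop, which is the source of the memory-bandwidth savings that make this approach attractive for LLM inference. An optimized kernel would decompress in blocks rather than element by element.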

Usability
* Added opt-in [deterministic mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_deterministic.html) support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
* Introduced [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html) control.
* Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
* Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
* Reduced RNN primitive memory consumption on GPUs.
* Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
* Extended tensor constructor in Graph API to support memory allocation and management by the library.
* Introduced new API and environment variable to manage [Graph API constant tensor cache capacity](https://oneapi-src.github.io/oneDNN/dev_guide_constant_tensor_cache.html).
* Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
* Changed default optimization flags for AArch64 builds to `-mcpu=generic` to improve portability.
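
The deterministic mode and accumulation mode items above both relate to how floating-point sums are formed. The standalone sketch below does not use oneDNN; it only illustrates, under the assumption of IEEE 754 fp32 arithmetic, why the order of accumulation and the accumulator width can change a result bit for bit. All function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sums left to right with a single fp32 accumulator.
float sum_sequential(const std::vector<float> &v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Sums even- and odd-indexed elements separately, then combines them.
// This stands in for work split across two threads or vector lanes.
float sum_two_partials(const std::vector<float> &v) {
    float even = 0.0f, odd = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i % 2 == 0) even += v[i];
        else odd += v[i];
    }
    return even + odd;
}

// Same order as sum_sequential, but with a wider (fp64) accumulator.
float sum_wide_accumulator(const std::vector<float> &v) {
    double acc = 0.0;
    for (float x : v) acc += x;
    return static_cast<float>(acc);
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}`, `sum_sequential` returns `1.0f` because the first `1.0f` is lost when added to `1e8f`, while `sum_two_partials` and `sum_wide_accumulator` both return `2.0f`. A library that may choose between such strategies from run to run needs an explicit switch to pin one of them, and a separate control over accumulator precision lets users trade accuracy for speed.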

Validation
* Improved benchdnn performance by optimizing bottlenecks in validation code.
* Introduced `--num-streams` knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes
* Updated minimum supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors
This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

3.3.6

This is a patch release containing the following changes to v3.3.5:
* Fixed crash during platform detection on some AArch64-based systems (3e0e69b21ba0694db95bd2af0877f936dcc86dd2)
* Improved inner product performance with Arm Compute Library (ACL) (e7abee2d883d41613cf243c135037fc68d2dacd0, 214fb9e14227880097729ffffac3b666a0febcd7, 8aacc8ff0dfefddfae30681d056757dba1fb0815)
* Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e04df62cf3042ebdc578a72883bde35079a)
* Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad7234741459bab6afc21f571ddb645bcae)

3.3.5

This is a patch release containing the following changes to v3.3.4:
* Fixed undefined behavior in 3D depthwise convolution on Intel CPUs (bbaec145f8c64818fd5c3ed2cb9e2ae69daef887)
* Added warning for ACL versions newer than maximum supported (7473012743ae3227dbfa208cad260d29d86d5080)
* Added citation file (fea9f88fa7f8056a5addedfdebdb2dda35ee7a9d)
* Fixed `SEGFAULT` in int8 convolution on processors with Intel AMX support (2a8e122b63b55f897c470d23f21003bb70f0e839)

3.3.4

This is a patch release containing the following changes to v3.3.3:
* Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c5aeb6be1ce992d799943fdc4f3123905f)
* Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38cdf1201caf8ffd2906077defdfe7f4aaa3, fa4364057891fdec528d9442c88d0715306bff2d)
* Fixed `SEGFAULT` in 3D convolutions with different `h` and `w` parameters on Intel CPUs (b5f916ec068f783dbba2cd4f04a673e996f9efba)
* Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d5388d7d749a120cf8522efd6f5aeecc09)
* Reduced benchdnn memory consumption on Intel GPUs (84a8f57d45f215cf89d0f80a57a66b78eaf9b440)

3.3.3

This is a patch release containing the following changes to v3.3.2:
* Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661ff735e5448ef3a80e4e2df7a1556f8a84f)
* Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd116e245e4a167a64bd39a24e957d2b939de)
* Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7efadeaf42d75f75e64d095635458836cd7)
* Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b23d57c38a439c50232783f654b96f575c)
* Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9312a9b76a1880e1aaac513188793ecaa7)
* Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84c4f09398858393035eafa2bd4a29ec0b0, 79bc6cc0477db1ce7e732f20d005ff2b9e88390e, c9c0b09c5e64114eada1b6beb7f6db36331e0fac)

3.3.2

This is a patch release containing the following changes to v3.3.1:
* Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrated GPUs (9025980286c506908f98819e068a047a1d268842, ed9de2afd1fede32a317cbc5df953dfe997e78ea, 0c6bda10b3ea760205d4707a554b76045ef6f964)
* Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f01ec5cf8b30ee0b474aa25417f0493897)
* Updated compiler optimization flags for AArch64 processors to make build portable (8829c249b713dddc87c2669120a9798e202ac633)
* Fixed segmentation fault during library initialization on AArch64 processors (3e15c6113ffeff3545775cbcca9bd84911856cb9)
