## Performance Optimizations
* Intel Architecture Processors
  * Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
  * Introduced performance optimizations for [bf16 floating point math mode](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.
* Intel Graphics Products
  * Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
  * Introduced performance optimizations for [tf32 floating point math mode](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) on future Xe Architecture graphics (code name Ponte Vecchio). The tf32 math mode allows oneDNN to use tf32 arithmetic in computations on fp32 data.
  * Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
* AArch64-based Processors
  * Improved convolution and binary primitive performance for processors with SVE 512 support.
  * Improved shuffle and eltwise primitive performance for processors with SVE 256 and SVE 128 support.
  * Improved PReLU, batch normalization, and pooling primitive performance via Compute Library for the Arm Architecture (ACL).
  * Improved performance of inner product, matmul, convolution, and batch normalization primitives with post-ops via ACL.
* PowerPC64-based Processors
  * Introduced performance optimizations for int8 and bfloat16 GEMM.
## Functionality
* Introduced runtime output scales support in all primitives.
* Introduced scales support in concat primitive.
* Extended [floating point math mode API](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) with tf32 data type option.
* Extended eltwise primitive with support for `hardsigmoid` algorithm.
* Extended layer normalization primitive with support for mixed source and destination data types.
* Extended depthwise post-op with support for arbitrary padding size. The implementation is available only on Intel processors.
* Added limited fp64 data type support in convolution primitive. Optimized implementation is available for future Xe Architecture graphics (code name Ponte Vecchio).
* Extended int8 convolution and deconvolution implementations on GPUs with arbitrary destination data type support.
* Extended batch normalization primitive with `dnnl_fuse_norm_add_relu` flag that allows fusing sum and ReLU operations. The implementation is available for Intel GPUs.
* Extended GPU deconvolution primitive implementation with support for output scales and zero points.
* Introduced threadpool threading support for AArch64-based processors.
* Introduced Unified Shared Memory (USM) support for SYCL backend on NVIDIA GPUs.
* Introduced initial support for AMD GPUs via MIOpen library. Supported primitives include Local Response Normalization (LRN), softmax, and eltwise.
## Usability
* Added `matmul_perf` example that benchmarks matmul primitive for all supported data types.
* Introduced annotations for JIT kernels to allow profilers like Linux perf to correctly label JIT code.
* Extended verbose logs converter with RNN primitive support.
* Added verbose output for `dnnl_*gemm*` calls.
* Removed Level Zero headers from the list of build time dependencies.
* Adjusted NVIDIA GPU implementation to comply with oneDNN numerical behavior. Implicit downconversion to fp16 and tf32 is now managed via the [math mode API](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html).
## Validation
* Added benchdnn driver for validation of internal BRGEMM implementation.
* Improved benchdnn reference implementation performance with threadpool threading model.
* Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (`mode=po`).
## Deprecated Functionality
* Support for SYCL 1.2.1 (aka the SYCL 2017 standard) is deprecated and will be removed in future releases.
* Static output scales are deprecated and will be removed in the next release.
* Convolution Winograd algorithm implementation for int8 data type is deprecated and will be removed in the next release.
## Breaking Changes
* Changed the formula for the AUGRU RNN cell to align with TensorFlow. See the [proposal](https://github.com/oneapi-src/oneDNN/blob/rfcs/rfcs/20211025-augru/augru-v2.md) for details.
## Thanks to the Contributors
This release contains contributions from the project core team as well as Aidan Belton AidanBeltonS, akshatasangelkar, Alex Bojan lb991, Crefeda Rodrigues cfRod, Damian Szwichtenberg dszwicht, Diana Bite diaena, Divakar Mariyanna bmdivakar, Emilio Cota cota, Gordon Fossum austinpagan, Hugh Delaney hdelan, Jacek Czaja jczaja, jakpiase, Jonathan Deakin jondea, Kentaro Kawakami kawakami-k, Kotha Sowmya Sowmyakotha1999, Louie Tsai louie-tsai, Mark Ryan markdryan, MITSUNARI Shigeo herumi, Mona Minakshi monaminakshi, NaNAGISaSA, Nathan John Sircombe nSircombe, Peter Caday petercad, pgorlani, Sreekanth Yalachigere sreekanth-yalachigere, Tadej Ciglarič t4c1, and Thiago Macieira thiagomacieira. We would also like to thank everyone who asked questions and reported issues.