## Performance optimizations
### Intel Architecture processors
* Introduced initial int8 optimizations for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and can be enabled via the [CPU dispatcher control](https://oneapi-src.github.io/oneDNN/dev_guide_cpu_dispatcher_control.html).
* Improved matmul and inner product performance with bfloat16 data type.
* Improved performance of the `tanh` algorithm in the eltwise primitive and in LSTM cells.
### Intel Processor Graphics and Xe architecture-based Graphics
* Improved performance of Convolution, RNN, Inner Product, and Matmul functionality for all supported GPUs.
* Improved performance of int8 convolutions with activations in NHWC format for Xe architecture-based Graphics (code names DG1 and Tiger Lake).
### AArch64-based processors
* Added support for the Arm Performance Libraries (ArmPL) to improve performance of functionality relying on GEMM (matmul, inner product, convolutions).
## New Functionality
* Introduced support for processors based on IBM POWER architecture.
* Introduced Linear-Before-Reset GRU support on GPU.
* Extended [eltwise primitive](https://oneapi-src.github.io/oneDNN/group__dnnl__api__eltwise.html) with support for `round` operation.
## Usability
* Reduced primitive creation time by relying on the OpenCL pre-compiled headers feature available in recent versions of the OpenCL driver.
* Reduced the entitlement required on macOS with hardened runtime to `allow-jit`.
* Extended documentation on runtime and build-time controls for JIT profiler support, primitive cache, CPU dispatcher control, and verbose mode.
## Validation
* Introduced a validation mode for out-of-memory situations.
## Thanks to the contributors
This release contains contributions from the project core team as well as Alberto Gonzalez Palomo @AlbertoGP, Arthur Mitrano @aaraujom, Ilia Taraban @itaraban, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.