Performance optimizations
* Added SGEMM copy-based kernels for Intel SSE 4.1, Intel AVX, Intel AVX2, and Intel AVX-512 architectures. With this optimization, Intel MKL-DNN's JIT SGEMM implementation achieves performance comparable to Intel MKL.
* Improved GEMM performance for n=1.
* Improved performance of s8s8s32 GEMM.
New functionality
* Introduced [Intel Processor Graphics support](https://github.com/intel/mkl-dnn#gpu-support) covering fp16 and fp32 inference, and fp32 training. Intel MKL-DNN relies on the OpenCL\* runtime to execute computations on Intel Processor Graphics and provides [interoperability with the user's OpenCL code](http://intel.github.io/mkl-dnn/dev_guide_opencl_interoperability.html).
* Added post-ops support in Inner Product and GEMM-based convolution.
* Introduced [bfloat16 training and inference support](http://intel.github.io/mkl-dnn/dev_guide_training_bf16.html) in reorders, (de-)convolution, pooling, batch normalization, local response normalization, eltwise, inner product, shuffle, sum, and concat. The implementation relies on new instructions targeting a future Intel Xeon Scalable processor (codename Cooper Lake). On Intel Xeon processors with Intel AVX-512 support, bfloat16 arithmetic is emulated.
* Added GELU activation support.
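
As a rough illustration of the bfloat16 and GELU items above, here is a minimal reference sketch in plain Python. It is not the library's implementation: the function names are hypothetical, the round-to-nearest-even conversion is one common way to emulate bfloat16 on hardware without native support, and the notes do not say which GELU formulation is used, so both widely used forms are shown.

```python
import math
import struct


def float_to_bf16_bits(x):
    """Convert a Python float to 16 bfloat16 bits (hypothetical helper).

    bfloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa
    bits of an IEEE float32. The low 16 bits are dropped with
    round-to-nearest-even (an assumption, not confirmed by the notes).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: truncate and force a mantissa bit so the result stays a NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return (bits >> 16) | 0x0040
    # Add 0x7FFF plus the lowest kept bit, so exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Widen 16 bfloat16 bits back to a float by zero-filling the low half."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def round_to_bf16(x):
    """Round a value to the nearest bfloat16 and return it as a float."""
    return bf16_bits_to_float(float_to_bf16_bits(x))


def gelu_erf(x):
    """GELU in its exact form: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """GELU via the common tanh approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

For example, `round_to_bf16(3.14159)` returns `3.140625`, which shows how little mantissa precision bfloat16 retains. An emulated bfloat16 computation would typically widen inputs to fp32, compute in fp32, and round the result back in this way.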
Usability improvements
* Introduced a new [developer guide](http://intel.github.io/mkl-dnn/index.html) and new examples.
* Removed the dependency on Intel MKL (or Intel MKL small libraries), as the JIT implementation delivers comparable performance.
* Introduced explicit [scratchpad management](http://intel.github.io/mkl-dnn/dev_guide_attributes_scratchpad.html).
* Lowered requirements for Intel SSE4 optimizations to Intel SSE 4.1.
* Added out-of-the-box [Intel VTune profiling](http://intel.github.io/mkl-dnn/dev_guide_vtune.html) support.
* Introduced binary distribution.
Breaking changes to the API
This is a major release that introduces several breaking changes. See the [v1.0 transition guide](http://intel.github.io/mkl-dnn/dev_guide_transition_to_v1.html) for the full list of changes and replacement functions.
* Removed previously deprecated APIs.
* Removed experimental s16 data type support.
* Removed unused parameters `rounding_mode` and `padding_kind`.
* Removed the view primitive. The functionality is supported directly by the memory descriptor.
* Split the RNN primitive into separate primitives for each cell type.
* Separated cell states and hidden states in the LSTM cell.
* Changed matrix layout in GEMM to row-major and calling convention to C-style.
* Changed the offset handling in integer GEMM (now the offsets are subtracted from matrices A and B).
* Changed the execution API to accept memory buffers at primitive execution.
* Simplified the memory descriptor and removed the memory primitive descriptor entity.
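
To make the two GEMM changes above concrete, here is a minimal reference sketch of a row-major integer GEMM in which the offsets are subtracted from A and B before multiplication. It assumes scalar offsets, no transposition, and no scaling or accumulation into an existing C; the function name and parameter order are hypothetical and do not mirror the library's signature.

```python
def gemm_int_ref(m, n, k, a, lda, ao, b, ldb, bo, c, ldc):
    """Reference integer GEMM on flat row-major buffers (hypothetical helper).

    Computes C[i][j] = sum over p of (A[i][p] - ao) * (B[p][j] - bo), where
      a holds an m x k matrix, element (i, p) stored at a[i * lda + p]
      b holds a k x n matrix, element (p, j) stored at b[p * ldb + j]
      c holds an m x n matrix, element (i, j) stored at c[i * ldc + j]
    lda, ldb, and ldc are row strides (leading dimensions) in elements.
    """
    for i in range(m):
        for j in range(n):
            acc = 0  # Python integers stand in for the 32-bit accumulator
            for p in range(k):
                acc += (a[i * lda + p] - ao) * (b[p * ldb + j] - bo)
            c[i * ldc + j] = acc
    return c


# 2 x 2 example: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], ao = 1, bo = 5
result = gemm_int_ref(2, 2, 2, [1, 2, 3, 4], 2, 1, [5, 6, 7, 8], 2, 5, [0] * 4, 2)
print(result)  # [2, 3, 6, 11]
```

In row-major layout the leading dimension is the distance between consecutive rows, so code written against a column-major, Fortran-style GEMM generally needs its matrix arguments and leading dimensions revisited when moving to v1.0.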
Thanks to the contributors
This release contains contributions from many Intel Performance Libraries developers as well as Andrew Senkevich, Benjamin Fitch, Nathan Greeneltch @nathan-greeneltch-intel, Ilia Taraban, Shigeo Mitsunari @herumi, Nikolay Tyukaev, Ivan Samsonov, Kalina Tomasz, @basargin, and Louie Tsai @louie-tsai. We would also like to thank everyone who asked questions and reported issues.
\*Other names and brands may be claimed as the property of others.