New Features:
* None
Changes:
* Dense convolutions on AVX2 systems were optimized, improving performance for many non-pruned networks. In particular, this results in a speed improvement for batch size 64 ResNet-50 of up to 28% on Intel AVX2 systems and up to 39% on AMD AVX2 systems.
* Operations that shuffle activations in the engine were optimized, resulting in up to a 14% speed improvement for batch size 64 pruned quantized MobileNetV1.
* Performance improvements were made for networks with large output arrays.
Resolved Issues:
* Engine no longer fails with an assert when running some quantized networks.
* Resize operators with an ROI input are now optimized; previously, some were not.
* A memory leak on multi-socket systems when batch size > 1 was addressed.
* Docs and README corrections were made for minor issues and broken links.
* Makefile no longer deletes files for docs compilation and cleaning.
Known Issues:
* In rare cases where a tensor, used as the input or output to an operation, is larger than 2GB, the engine can segfault. Users should decrease the batch size as a workaround.
* In some cases, models running complicated pre- or post-processing steps can reduce DeepSparse Engine performance by up to a factor of 10 due to hyperthreading, as two engine threads can run on the same physical core. Address the performance issue by trying the following recommended solutions in order of preference:
1. [Enable thread binding](https://docs.neuralmagic.com/deepsparse/debugging-optimizing/diagnostics-debugging.html#performance-tuning)
If that does not improve performance, or you want to try additional options:
2. [Use the **numactl** utility](https://docs.neuralmagic.com/deepsparse/debugging-optimizing/numactl-utility.html) to prevent the process from running on hyperthreads.
3. Manually set the thread affinity in Python as follows:

       import os
       from deepsparse.cpu import cpu_architecture
       ARCH = cpu_architecture()

       if ARCH.vendor == "GenuineIntel":
           os.sched_setaffinity(0, range(ARCH.num_physical_cores()))
       elif ARCH.vendor == "AuthenticAMD":
           os.sched_setaffinity(0, range(0, 2*ARCH.num_physical_cores(), 2))
       else:
           raise RuntimeError(f"Unknown CPU vendor {ARCH.vendor}")
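
The affinity snippet above can also be split into a pure helper and a guarded apply step, which makes the chosen CPU IDs easy to inspect before pinning. This is a minimal sketch, not part of the DeepSparse API: `physical_core_cpu_ids` and `pin_to_physical_cores` are hypothetical names, and the CPU numbering it assumes (Intel lists hyperthread siblings after all physical cores, AMD interleaves them) is taken from the snippet above and may not hold on every system. `os.sched_getaffinity` and `os.sched_setaffinity` are available on Linux only.

```python
import os
from typing import List, Set


def physical_core_cpu_ids(vendor: str, num_physical_cores: int) -> List[int]:
    """Return one logical CPU ID per physical core (hypothetical helper).

    Assumes the same numbering as the snippet above: on Intel, IDs
    0..N-1 are distinct physical cores; on AMD, even IDs are.
    """
    if vendor == "GenuineIntel":
        return list(range(num_physical_cores))
    if vendor == "AuthenticAMD":
        return list(range(0, 2 * num_physical_cores, 2))
    raise RuntimeError(f"Unknown CPU vendor {vendor}")


def pin_to_physical_cores(vendor: str, num_physical_cores: int) -> Set[int]:
    """Pin the current process to one logical CPU per physical core (Linux only)."""
    wanted = set(physical_core_cpu_ids(vendor, num_physical_cores))
    # Respect any restriction already in place (e.g. from numactl or a container).
    allowed = os.sched_getaffinity(0)
    target = wanted & allowed
    if not target:
        raise RuntimeError("None of the selected CPUs are available to this process")
    os.sched_setaffinity(0, target)
    # Read the mask back so the caller can confirm what was applied.
    return os.sched_getaffinity(0)
```

With the `ARCH` object from the snippet above, this would be called as `pin_to_physical_cores(ARCH.vendor, ARCH.num_physical_cores())`. Intersecting with the current affinity mask avoids an error when the process is already limited to a subset of CPUs.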