New Features:
**Documentation:**
* [SparseServer.UI](https://github.com/neuralmagic/deepsparse/tree/main/examples/sparseserver-ui): a Streamlit app that deploys the DeepSparse Server to explore the inference performance of BERT on the question answering task.
* [DeepSparse Server README](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server): `deepsparse.server` capabilities, including single-model and multi-model inferencing.
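As a minimal sketch of a multi-model deployment, the server can be driven by a config file. The model stubs below are placeholders, not real SparseZoo paths; see the README above for the actual schema:

```yaml
# Hypothetical config.yaml for deepsparse.server -- substitute real
# SparseZoo stubs or local ONNX file paths for the placeholders below.
models:
  - task: question_answering
    model_path: "zoo:placeholder/sparse-bert-qa-stub"
    batch_size: 1
  - task: text_classification
    model_path: "zoo:placeholder/sparse-classifier-stub"
    batch_size: 1
```

The server would then be launched with this config file (consult the README for the exact CLI flags).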
* [Twitter NLP Inference Examples](https://github.com/neuralmagic/deepsparse/tree/main/examples/twitter-nlp) added.
Changes:
**Performance:**
* Speedup for large batch sizes when using sync mode on AMD EPYC processors.
* AVX2 improvements:
    * Up to 40% speedup out of the box for dense quantized models.
    * Up to 20% speedup for pruned quantized BERT, ResNet-50, and MobileNet.
* Speedup from sparsity realized for ConvInteger operators.
* Model compilation time decreased on systems with many cores.
* Multi-stream Scheduler: certain computations that were previously executed at runtime are now precomputed.
* Hugging Face Transformers integration updated to latest state from upstream main branch.
**Documentation:**
* [DeepSparse README](https://github.com/neuralmagic/deepsparse): references to `deepsparse.server`, `deepsparse.benchmark`, and Transformer pipelines.
* [DeepSparse Benchmark README](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark_model): highlights of the `deepsparse.benchmark` CLI command.
* [Transformers 🤗 Inference Pipelines](https://github.com/neuralmagic/deepsparse/tree/main/examples/huggingface-transformers): examples of how to run inference via Python for several NLP tasks.
Resolved Issues:
* When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine no longer disables optimizations, which previously caused very poor performance.
* Users executing `arch.bin` now receive a correct architecture profile of their system.
Known Issues:
* When running the DeepSparse engine on a system with a nonuniform system topology, for example, an AMD EPYC processor where some cores per core-complex (CCX) have been disabled, model compilation will never terminate. A workaround is to set the environment variable `NM_SERIAL_UNIT_GENERATION=1`.
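As a sketch, the workaround can be applied by exporting the variable in the shell before launching the DeepSparse workload (the commented-out command is a placeholder, not an actual script):

```shell
# Work around non-terminating model compilation on nonuniform system
# topologies by forcing serial unit generation, per the known issue above.
export NM_SERIAL_UNIT_GENERATION=1

# Then launch the DeepSparse workload as usual, e.g.:
#   python run_model.py   (placeholder for your own script)
echo "NM_SERIAL_UNIT_GENERATION=$NM_SERIAL_UNIT_GENERATION"
```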