DeepSparse

Latest version: v1.8.0


1.7.1

This is a patch release for 1.7.0 that contains the following changes:
- Detokenization has been fixed for streaming outputs with models that use `sentencepiece`-based tokenizers. (1635)

1.7.0

New Features:
* DeepSparse Pipelines v2 was introduced, enabling more complex pipelines to be represented. Text Generation (compatible with Hugging Face Transformers) and Image Classification pipelines have been refactored to the v2 format. (1324, 1385, 1460, 1502, 1596, 1626)
* OpenAI Server compatibility added on top of Pipelines v2. (1445, 1477)
* `deepsparse.evaluate` APIs and CLIs added with plugins for perplexity and lm-eval-harness for LLM evaluations. (1596)
* An example was added demonstrating how to use LLMPerf for benchmarking DeepSparse LLM servers. (1502)
* Continuous batching support has been added for text generation pipelines and inference server pathways, enabling inference over multiple text streams at once. (1569, 1571)
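Conceptually, continuous batching keeps the engine busy by admitting new requests into the batch as soon as running ones finish, rather than waiting for a full batch to drain. A toy pure-Python sketch of the scheduling idea (all names are illustrative; this is not DeepSparse's implementation):

```python
from collections import deque

def continuous_batching(requests, max_batch, step_fn):
    """Toy scheduler: each request is a number of decode steps remaining.

    step_fn is called once per engine step with the request ids currently
    batched; new requests are admitted as soon as a batch slot frees up.
    """
    waiting = deque(enumerate(requests))  # (request_id, steps_remaining)
    running = {}                          # request_id -> steps_remaining
    finished = []
    while waiting or running:
        # Admit waiting requests into free batch slots.
        while waiting and len(running) < max_batch:
            rid, steps = waiting.popleft()
            running[rid] = steps
        step_fn(sorted(running))          # one fused engine step for the batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
    return finished

steps_log = []
done = continuous_batching([2, 3, 1], max_batch=2, step_fn=steps_log.append)
```

Here request 2 joins the batch on the step after request 0 completes, so the engine never idles while work is still waiting.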

Changes:
* Exposed `sequence_length` for greater control over text generation pipelines. (1518)
* `deepsparse.analyze` functionality has been updated to work properly with LLMs. (1324)
* The logging and timing infrastructure for Pipelines has been expanded to enable more thorough tracking and logging, and to further support integrations with Prometheus and other standard logging platforms. (1614)
* UX improved for text generation pipelines to more closely match Hugging Face Transformers pipelines. (1583, 1584, 1590, 1592, 1598)

Resolved Issues:
* The previously reported slow compile times for dense LLMs have been addressed.
* Text generation pipeline bug fixes: corrected sampling logic errors and inappropriate in-place logits mutation resulting in incorrect answers for LLMs when using sampling. (1406, 1414)
* Improper handling of the `kv_cache` input while using external KV cache management has been fixed; it previously resulted in inaccurate model inference for ONNX Runtime comparison pathways. (1337)
* Benchmarking runs for LLMs with internal KV cache no longer crash or report inaccurate numbers. (1512, 1514)
* SciPy dependencies were removed, addressing issues where CV pipelines would crash on import of `scipy`. (1604, 1602)

Known Issues:
* OPT models produce incorrect outputs and are no longer supported.
* Streaming support is limited within the DeepSparse Pipeline v2 framework for tasks other than text generation.

1.6.1

This is a patch release for 1.6.0 that contains the following changes:
- The filename of the [Neural Magic DeepSparse Community License](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE) has been renamed from `LICENSE-NEURALMAGIC` to `LICENSE` for higher visibility in the DeepSparse GitHub repository and the C++ engine package tarball, deepsparse_api_demo.tar.gz. (#1485)

Known Issues:
* The compile time for dense LLMs can be very slow. This will be addressed in a forthcoming release.
* Docker images are not currently being pushed. A resolution is forthcoming for functional Docker builds. [RESOLVED]

1.6.0

New Features:
* Version support added:
- Python 3.11 (1323, 1432)
- ONNX 1.14 and Opset 14 (1072, 1097)
- NumPy 1.21.6 (1094)

* Decoder-only text generation LLMs are optimized in DeepSparse and offer state-of-the-art performance with sparsity! Run `pip install deepsparse[llm]` and then use the [TextGeneration Pipeline](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md). For performance details, check out our [Sparse Fine-Tuning paper](https://github.com/neuralmagic/deepsparse/blob/ce60541134246975c3ba762a74258b1e7dcc21af/research/mpt/README.md).
(1022, 1035, 1061, 1081, 1132, 1122, 1137, 1121, 1139, 1126, 1151, 1140, 1173, 1166, 1176, 1172, 1190, 1142, 1205, 1204, 1212, 1214, 1194, 1218, 1196, 1217, 1216, 1225, 1240, 1254, 1246, 1250, 1266, 1270, 1276, 1274, 1235, 1284, 1285, 1304, 1308, 1310, 1313, 1272)

* OpenAI-compatible DeepSparse Server has been added, enabling standard OpenAI requests for performant LLMs. (1171, 1221, 1228, 1317)

* MLServer-compatible pathways have been added for DeepSparse Server, enabling standard MLServer requests. (1237)
* CLIP model support for deployments and performance functionality is now enabled. ([Documentation](https://github.com/neuralmagic/deepsparse/tree/bc4ffd305ac52718cc755430a1c44eb48739cfb4/src/deepsparse/clip)) (#1098, 1145, 1203)
* Several encoder-decoder networks have been optimized for performance: Donut, Whisper, and T5.
* Support for ARM processors is now generally available. ARMv8.2 or above is required for quantized performance. (1307)
* Support for macOS is now in Beta. macOS Ventura (version 13) or above and Apple silicon are required. (1088, 1096, 1290, 1307)
* DeepSparse Server updated to support generic pipeline Python implementations for easy extensibility. (1033)
* YOLOv8 deployment pipelines and model support have been added. (1044, 1052, 1040, 1138, 1261)

* AWS and GCP marketplace documentation added: [AWS](https://github.com/neuralmagic/deepsparse/tree/main/examples/aws-marketplace) | [GCP](https://github.com/neuralmagic/deepsparse/tree/main/examples/gcp-marketplace) (#1056, 1057)
* DigitalOcean marketplace integration added. ([Documentation](https://github.com/neuralmagic/deepsparse/tree/main/examples/do-marketplace)) (#1109)
* DeepSparse Azure marketplace integration added. ([Documentation](https://github.com/neuralmagic/deepsparse/tree/main/examples/azure-vm)) (#1066)
* DeepSparse Pipeline timing has been added. To access it, use `pipeline.timer_manager` or the `deepsparse.benchmark_pipeline` CLI. (1062, 1150, 1268, 1259, 1294)

* `TorchScriptEngine` class added to enable benchmarking and evaluation comparisons to DeepSparse. (1015)
* `debug_analysis` API now supports exporting CSVs, enabling easier analysis. (1253)

* SentenceTransformers deployment and performance support have been added. (1301)
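With the OpenAI-compatible server listed above, clients can send standard OpenAI-style chat-completions requests. A sketch of such a request body (the port, endpoint path, and model name below are placeholders following the OpenAI API shape, not values prescribed by DeepSparse):

```python
import json

# Standard OpenAI-style chat-completions payload; the URL and model name
# are placeholder assumptions for illustration.
url = "http://localhost:5543/v1/chat/completions"
payload = {
    "model": "local-llm",
    "messages": [{"role": "user", "content": "Write a haiku about sparsity."}],
    "max_tokens": 64,
    "stream": False,
}
body = json.dumps(payload)
# The body would then be POSTed to `url` with a
# Content-Type: application/json header, e.g. via requests or urllib.
```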

Changes:

* DeepSparse upgraded for the SparseZoo V2 model file structure changes, which expands the number of supported files and reduces the number of bytes that need to be downloaded for model checkpoints, folders, and files. (1233, 1234, 1303, 1318)
* YOLOv5 deployment pipelines have been migrated to install from `nm-yolov5` on PyPI, removing the autoinstall from the `nm-yolov5` GitHub repository that previously happened on invocation of the relevant pathways and enabling more predictable environments. (1030, 1101, 1129, 1111, 1167)

* Docker builds are updated to consistently rebuild for new releases and nightlies. (1012, 1068, 1069, 1113, 1144)

* Torchvision deployment pipelines have been upgraded to support 0.14.x. (1034)

* README and documentation updated to include: the Slack Community name change, the Contact Us form introduction, and Python version changes; corrections for broken YOLOv5, torchvision, transformers, and SparseZoo links; and installation command updates. (1041, 1042, 1043, 1039, 1048, 931, 960, 1279, 1282, 1280, 1313)

* Python 3.7 is now deprecated. (1060, 1148)
* ONNX utilities are updated so that ONNX model arguments can be passed as either a model file path (past behavior) or an ONNX ModelProto Python object. (1089)
* Deployment directories containing a `model.onnx` will now load properly for all pipelines supported by DeepSparse Server. Before, specific paths needed to be supplied to the exact `model.onnx` file rather than a deployment directory. (1131)
* Flake8 updated to 6.1 to enable the latest standards for running make quality. (1156)
* Automatic link checking has been added to GitHub actions. (1226)
* DeepSparse Pipeline has been made printable: `__str__` and `__repr__` are implemented and show useful information when a pipeline is printed. (1298)
* The `nm-transformers` package has been fully removed and replaced with the native transformers package that works with DeepSparse. (1302)
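Making an object printable by implementing `__str__` and `__repr__`, as the pipeline change above does, follows a standard Python pattern. A generic sketch (the class, fields, and output format here are illustrative, not DeepSparse's actual ones):

```python
class Pipeline:
    """Minimal stand-in illustrating a printable pipeline object."""

    def __init__(self, task, model_path, batch_size=1):
        self.task = task
        self.model_path = model_path
        self.batch_size = batch_size

    def __repr__(self):
        # Surface the fields a user needs to identify the pipeline at a glance.
        return (f"{type(self).__name__}(task={self.task!r}, "
                f"model_path={self.model_path!r}, batch_size={self.batch_size})")

    __str__ = __repr__  # print() and repr() show the same useful summary

print(Pipeline("text-classification", "model.onnx"))
```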

Performance and Compression Improvements:

- The memory footprint used during model compilation for models with external weights has been greatly reduced.
- The memory footprint has been reduced by sharing weights between compiled engines, for example, when using bucketing.
- Matrix-Vector Multiplication (GEVM) with a sparse weight matrix is now supported for both performance and reduced memory footprint.
- Matrix-Matrix Multiplication (GEMM) with a sparse weight matrix is further optimized for performance and reduced memory footprint.
- AVX2-VNNI instructions are now used to improve the performance of DeepSparse.
- Grouped Query Attention (GQA) in transformers is now optimized.
- Improved performance of Gathers with constant data and dynamic indices, like the ones used for embeddings in transformers and recommendation models.
- The InstanceNormalization operator is now supported for performance.
- The Where operator has improved performance in some cases by fusing it onto other operators.
- The Clip operator is now supported for performance with operands of any data type.
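The sparse GEVM and GEMM improvements above rest on storing and multiplying only the nonzero weights. A plain-Python compressed-sparse-row (CSR) sketch of a sparse matrix-vector product (illustrative only; DeepSparse's kernels are far more sophisticated):

```python
def dense_to_csr(matrix):
    """Store only nonzero weights: values, their column indices, and row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_gevm(csr, x):
    """Matrix-vector product touching only the stored nonzeros."""
    values, col_idx, row_ptr = csr
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out

W = [[0, 2, 0], [1, 0, 0], [0, 0, 3]]   # 2/3 of the weights are zero
y = csr_gevm(dense_to_csr(W), [1.0, 2.0, 3.0])
```

Both the storage (three short arrays instead of a full dense matrix) and the inner loop (one multiply per nonzero) shrink in proportion to the sparsity, which is the source of the memory and performance gains.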

Resolved Issues:
* Assertion failures for GEMM operations with broadcast-stacked dimensions have been resolved.
* Unit and integration tests have been updated to remove temporary test files, which were not being properly deleted, and to limit test file creation. (1058)
* `deepsparse.benchmark` was failing with `AttributeError` when the `-shapes` argument was supplied, causing no benchmarks to be measured. (1071)
* DeepSparse Server with a `model.onnx` file in the model directory was causing the server to raise an exception for image classification pipelines. (1070)
* The `generate_random_inputs` function no longer creates random data with zero-sized shapes when ONNX files containing dynamic dimensions are given. (1086)
* The Pydantic version has been pinned to <2.0, preventing `NameError` exceptions from being raised whenever pipelines are constructed. (1104)
* AWS Lambda serverless examples and implementations updated to avoid exceptions being thrown while running inference in AWS Lambda. (1115)
* DeepSparse Pipelines: if `num_cores` was not supplied as an explicit kwarg for a bucketing pipeline, a key error was triggered. This has been fixed so the pipeline works correctly without `num_cores` being explicitly supplied as a kwarg. (1152)
* `eval_downstream` for Transformers pathways no longer fails due to PyTorch not being installed. The fix removes the PyTorch dependency, and the pathway now runs correctly. (1187)
* Reliability for unit test `test_pipeline_call_is_async` has been improved to produce consistent test results. (1251, 1264, 1267)
* Torchvision previously needed to be installed for any tests to pass, including transformers and other unrelated pipelines; if it was not installed, the tests would fail with an import error. (1251)

Known Issues:
* The compile time for dense LLMs can be very slow. This will be addressed in a forthcoming release.
* Docker images are not currently being pushed. A resolution is forthcoming for functional Docker builds. [RESOLVED]

1.5.3

This is a patch release for 1.5.0 that contains the following changes:
- A rare segmentation fault on AVX2 systems has been fixed. This could have occurred when an input to the network was quantized.

1.5.2

This is a patch release for 1.5.0 that contains the following changes:
- Pinned the dependency Pydantic, a data validation library for Python, to < v2.0 to prevent current workflows from breaking. A Pydantic upgrade is planned for a future release. (1107)
