# Initial release of 🪄 nm-vllm 🪄
[nm-vllm](https://pypi.org/project/nm-vllm/) is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.
This release is based on `vllm==0.3.2`.

## Key Features
This first release focuses on our initial LLM performance contributions: support for Marlin, an extremely optimized FP16xINT4 matmul kernel, and acceleration for weight-sparse models.
### Model Inference with Marlin (4-bit Quantization)
Marlin is enabled automatically if a quantized model has the `"is_marlin_format": true` flag present in its `quant_config.json`:
```python
from vllm import LLM

model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))
```
Optionally, you can request Marlin explicitly by setting `quantization="marlin"`:
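For example, the same model with Marlin selected explicitly:

```python
from vllm import LLM

# Select the Marlin kernels directly instead of relying on
# autodetection via the "is_marlin_format" flag in quant_config.json.
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))
```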
<p align="center">
<img alt="Marlin Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/6ac9f5b0-667a-41f3-8e6d-ca51c268bec5" width="60%" />
</p>
### Model Inference with Weight Sparsity
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and inference acceleration for sparse models.
Here is an example running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024,
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
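The sparse kernels also compose with vLLM's standard engine arguments; for instance, bfloat16 support for `sparse_w16a16` was enabled in #18 (see the changelog below). A minimal sketch, assuming a GPU with bfloat16 support:

```python
from vllm import LLM, SamplingParams

# Same 50% sparse model, but with bfloat16 compute.
# (bfloat16 for sparse_w16a16 was enabled in PR #18.)
model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    dtype="bfloat16",
    max_model_len=1024,
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```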
There is also support for semi-structured 2:4 sparsity using the `sparsity="semi_structured_sparse_w16a16"` argument:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/llama2.c-stories110M-pruned2.4",
    sparsity="semi_structured_sparse_w16a16",
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
<p align="center">
<img alt="Sparse Memory Compression" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/2fdd2212-3081-4b97-b492-a809ce23fdd3" width="40%" />
<img alt="Sparse Inference Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/3448e3ee-535f-4c50-ac9b-00645673cc8c" width="40%" />
</p>
## What's Changed
* Sparsity by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/1
* Sparse fused gemm integration by LucasWilkinson in https://github.com/neuralmagic/nm-vllm/pull/12
* Abf149/fix semi structured sparse by afeldman-nm in https://github.com/neuralmagic/nm-vllm/pull/16
* Enable bfloat16 for sparse_w16a16 by mgoin in https://github.com/neuralmagic/nm-vllm/pull/18
* seed workflow by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/19
* Add bias support for sparse layers by mgoin in https://github.com/neuralmagic/nm-vllm/pull/25
* Use naive decompress for SM<8.0 by mgoin in https://github.com/neuralmagic/nm-vllm/pull/32
* Varun/benchmark workflow by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/28
* initial GHA workflows for "build test" and "remote push" by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/27
* Only import magic_wand if sparsity is enabled by mgoin in https://github.com/neuralmagic/nm-vllm/pull/37
* Sparsity fix by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/40
* Add NM benchmarking scripts & utils by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/14
* Rs/marlin downstream v0.3.2 by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/43
* Update README.md by mgoin in https://github.com/neuralmagic/nm-vllm/pull/47
* additional updates to "bump-to-v0.3.2" by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/39
* Add empty tensor initialization to LazyCompressedParameter by alexm-nm in https://github.com/neuralmagic/nm-vllm/pull/53
* Update arg_utils.py with `semi_structured_sparse_w16a16` by mgoin in https://github.com/neuralmagic/nm-vllm/pull/45
* additions for bump to v0.3.2 by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/50
* formatting patch by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/54
* Rs/bump main to v0.3.2 by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/38
* Update setup.py naming by mgoin in https://github.com/neuralmagic/nm-vllm/pull/44
* Loudly reject compression when the tensor isn't sparse enough by mgoin in https://github.com/neuralmagic/nm-vllm/pull/55
* Benchmarking : Fix server response aggregation by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/51
* initial whl workflow by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/57
* GHA Benchmark : Automatic benchmarking on manual trigger by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/46
* delete NOTICE.txt by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/63
* pin GPU and use "--forked" for some tests by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/58
* obsfucate pypi server ip by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/64
* add HF cache by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/65
* Rs/sparse integration test clean 2 by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/67
* neuralmagic-vllm -> nm-vllm by mgoin in https://github.com/neuralmagic/nm-vllm/pull/69
* Mark files that have been modified by Neural Magic by tlrmchlsmth in https://github.com/neuralmagic/nm-vllm/pull/70
* Benchmarking - Add tensor_parallel_size arg for multi-gpu benchmarking by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/66
* Jfinks license by jeanniefinks in https://github.com/neuralmagic/nm-vllm/pull/72
* Add Nightly benchmark workflow by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/62
* Rs/licensing by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/68
* Rs/model integration tests logprobs by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/71
* fixes issue identified by derek by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/83
* Add `nm-vllm[sparse]`+`nm-vllm[sparsity]` extras, move version to `0.1` by mgoin in https://github.com/neuralmagic/nm-vllm/pull/76
* Update setup.py by mgoin in https://github.com/neuralmagic/nm-vllm/pull/82
* Fixes the multi-gpu tests by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/79
* various updates to "build whl" workflow by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/59
* Change magic_wand to nm-magic-wand by mgoin in https://github.com/neuralmagic/nm-vllm/pull/86
## New Contributors
* LucasWilkinson made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/12
* alexm-nm made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/53
* tlrmchlsmth made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/70
* jeanniefinks made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/72
**Full Changelog**: https://github.com/neuralmagic/nm-vllm/commits/0.1.0