# Initial release of 🪄 nm-vllm 🪄
[nm-vllm](https://pypi.org/project/nm-vllm/) is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.
This release is based on `vllm==0.3.2`.

## Key Features
This first release focuses on our initial LLM performance contributions: support for Marlin, an extremely optimized FP16xINT4 matmul kernel, and acceleration for weight-sparse models.
### Model Inference with Marlin (4-bit Quantization)
Marlin is enabled automatically if a quantized model has the `"is_marlin_format": true` flag present in its `quant_config.json`:
```python
from vllm import LLM

model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))
```
Optionally, you can request Marlin explicitly by setting `quantization="marlin"`:
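For example, the same model with Marlin selected explicitly:

```python
from vllm import LLM

# Select the Marlin kernels directly instead of relying on
# autodetection via the "is_marlin_format" flag in quant_config.json.
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))
```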
<p align="center">
<img alt="Marlin Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/6ac9f5b0-667a-41f3-8e6d-ca51c268bec5" width="60%" />
</p>
### Model Inference with Weight Sparsity
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and inference acceleration for sparse models.
Here is an example running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024,
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
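The sparse kernels also compose with vLLM's standard engine arguments; for instance, bfloat16 support for `sparse_w16a16` was enabled in #18 (see the changelog below). A minimal sketch, assuming a GPU with bfloat16 support:

```python
from vllm import LLM, SamplingParams

# Same 50% sparse model, but with bfloat16 compute.
# (bfloat16 for sparse_w16a16 was enabled in PR #18.)
model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    dtype="bfloat16",
    max_model_len=1024,
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```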
There is also support for semi-structured 2:4 sparsity using the `sparsity="semi_structured_sparse_w16a16"` argument:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/llama2.c-stories110M-pruned2.4",
    sparsity="semi_structured_sparse_w16a16",
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
<p align="center">
<img alt="Sparse Memory Compression" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/2fdd2212-3081-4b97-b492-a809ce23fdd3" width="40%" />
<img alt="Sparse Inference Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/3448e3ee-535f-4c50-ac9b-00645673cc8c" width="40%" />
</p>
## What's Changed
* Sparsity by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/1
* Sparse fused gemm integration by LucasWilkinson in https://github.com/neuralmagic/nm-vllm/pull/12
* Abf149/fix semi structured sparse by afeldman-nm in https://github.com/neuralmagic/nm-vllm/pull/16
* Enable bfloat16 for sparse_w16a16 by mgoin in https://github.com/neuralmagic/nm-vllm/pull/18
* seed workflow by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/19
* Add bias support for sparse layers by mgoin in https://github.com/neuralmagic/nm-vllm/pull/25
* Use naive decompress for SM<8.0 by mgoin in https://github.com/neuralmagic/nm-vllm/pull/32
* Varun/benchmark workflow by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/28
* initial GHA workflows for "build test" and "remote push" by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/27
* Only import magic_wand if sparsity is enabled by mgoin in https://github.com/neuralmagic/nm-vllm/pull/37
* Sparsity fix by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/40
* Add NM benchmarking scripts & utils by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/14
* Rs/marlin downstream v0.3.2 by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/43
* Update README.md by mgoin in https://github.com/neuralmagic/nm-vllm/pull/47
* additional updates to "bump-to-v0.3.2" by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/39
* Add empty tensor initialization to LazyCompressedParameter by alexm-nm in https://github.com/neuralmagic/nm-vllm/pull/53
* Update arg_utils.py with `semi_structured_sparse_w16a16` by mgoin in https://github.com/neuralmagic/nm-vllm/pull/45
* additions for bump to v0.3.2 by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/50
* formatting patch by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/54
* Rs/bump main to v0.3.2 by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/38
* Update setup.py naming by mgoin in https://github.com/neuralmagic/nm-vllm/pull/44
* Loudly reject compression when the tensor isn't sparse enough by mgoin in https://github.com/neuralmagic/nm-vllm/pull/55
* Benchmarking : Fix server response aggregation by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/51
* initial whl workflow by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/57
* GHA Benchmark : Automatic benchmarking on manual trigger by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/46
* delete NOTICE.txt by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/63
* pin GPU and use "--forked" for some tests by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/58
* obsfucate pypi server ip by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/64
* add HF cache by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/65
* Rs/sparse integration test clean 2 by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/67
* neuralmagic-vllm -> nm-vllm by mgoin in https://github.com/neuralmagic/nm-vllm/pull/69
* Mark files that have been modified by Neural Magic by tlrmchlsmth in https://github.com/neuralmagic/nm-vllm/pull/70
* Benchmarking - Add tensor_parallel_size arg for multi-gpu benchmarking by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/66
* Jfinks license by jeanniefinks in https://github.com/neuralmagic/nm-vllm/pull/72
* Add Nightly benchmark workflow by varun-sundar-rabindranath in https://github.com/neuralmagic/nm-vllm/pull/62
* Rs/licensing by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/68
* Rs/model integration tests logprobs by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/71
* fixes issue identified by derek by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/83
* Add `nm-vllm[sparse]`+`nm-vllm[sparsity]` extras, move version to `0.1` by mgoin in https://github.com/neuralmagic/nm-vllm/pull/76
* Update setup.py by mgoin in https://github.com/neuralmagic/nm-vllm/pull/82
* Fixes the multi-gpu tests by robertgshaw2-neuralmagic in https://github.com/neuralmagic/nm-vllm/pull/79
* various updates to "build whl" workflow by andy-neuma in https://github.com/neuralmagic/nm-vllm/pull/59
* Change magic_wand to nm-magic-wand by mgoin in https://github.com/neuralmagic/nm-vllm/pull/86
## New Contributors
* LucasWilkinson made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/12
* alexm-nm made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/53
* tlrmchlsmth made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/70
* jeanniefinks made their first contribution in https://github.com/neuralmagic/nm-vllm/pull/72
**Full Changelog**: https://github.com/neuralmagic/nm-vllm/commits/0.1.0