vLLM

Latest version: v0.6.4.post1

0.6.1.post1

Highlights
This release features important bug fixes and enhancements for:
- Pixtral models. (8415, 8425, 8399, 8431)
  - Chunked scheduling has been turned off for vision models. Please replace `--max_num_batched_tokens 16384` with `--max-model-len 16384` (see the sketch after this list).
- Multistep scheduling. (8417, 7928, 8427)
- Tool use. (8423, 8366)
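
For offline use, the replacement maps to the `max_model_len` engine argument. A minimal sketch, assuming the Pixtral checkpoint named above (which also uses the Mistral tokenizer mode); sampling settings are placeholders:

```python
# Hedged sketch: offline Pixtral with the recommended context cap.
# Model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    max_model_len=16384,  # replaces the old --max_num_batched_tokens 16384 setting
)
outputs = llm.generate(
    ["Describe a sunset over the ocean."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```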

Also:
* Support multiple images for Qwen-VL (8247)
* Remove `engine_use_ray` (8126)
* Add an engine option to return only deltas or the final output (7381)
* Add bitsandbytes support for Gemma2 (8338)


What's Changed
* [MISC] Dump model runner inputs when crashing by comaniac in https://github.com/vllm-project/vllm/pull/8305
* [misc] remove engine_use_ray by youkaichao in https://github.com/vllm-project/vllm/pull/8126
* [TPU] Use Ray for default distributed backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/8389
* Fix the AMD weight loading tests by mgoin in https://github.com/vllm-project/vllm/pull/8390
* [Bugfix]: Fix the logic for deciding if tool parsing is used by tomeras91 in https://github.com/vllm-project/vllm/pull/8366
* [Gemma2] add bitsandbytes support for Gemma2 by blueyo0 in https://github.com/vllm-project/vllm/pull/8338
* [Misc] Raise error when using encoder/decoder model with cpu backend by kevin314 in https://github.com/vllm-project/vllm/pull/8355
* [Misc] Use RoPE cache for MRoPE by WoosukKwon in https://github.com/vllm-project/vllm/pull/8396
* [torch.compile] hide slicing under custom op for inductor by youkaichao in https://github.com/vllm-project/vllm/pull/8384
* [Hotfix][VLM] Fixing max position embeddings for Pixtral by ywang96 in https://github.com/vllm-project/vllm/pull/8399
* [Bugfix] Fix InternVL2 inference with various num_patches by Isotr0py in https://github.com/vllm-project/vllm/pull/8375
* [Model] Support multiple images for qwen-vl by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8247
* [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by lnykww in https://github.com/vllm-project/vllm/pull/8403
* [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by vegaluisjose in https://github.com/vllm-project/vllm/pull/8423
* [Bugfix] Offline mode fix by joerunde in https://github.com/vllm-project/vllm/pull/8376
* [multi-step] add flashinfer backend by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7928
* [Core] Add engine option to return only deltas or final output by njhill in https://github.com/vllm-project/vllm/pull/7381
* [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8427
* [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by ywang96 in https://github.com/vllm-project/vllm/pull/8425
* [CI/Build] Disable multi-node test for InternVL2 by ywang96 in https://github.com/vllm-project/vllm/pull/8428
* [Hotfix][Pixtral] Fix multiple images bugs by patrickvonplaten in https://github.com/vllm-project/vllm/pull/8415
* [Bugfix] Fix weight loading issue by rename variable. by wenxcs in https://github.com/vllm-project/vllm/pull/8293
* [Misc] Update Pixtral example by ywang96 in https://github.com/vllm-project/vllm/pull/8431
* [BugFix] fix group_topk by dsikka in https://github.com/vllm-project/vllm/pull/8430
* [Core] Factor out input preprocessing to a separate class by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7329
* [Bugfix] Mapping physical device indices for e2e test utils by ShangmingCai in https://github.com/vllm-project/vllm/pull/8290
* [Bugfix] Bump fastapi and pydantic version by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8435
* [CI/Build] Update pixtral tests to use JSON by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8436
* [Bugfix] Fix async log stats by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8417
* [bugfix] torch profiler bug for single gpu with GPUExecutor by SolitaryThinker in https://github.com/vllm-project/vllm/pull/8354
* bump version to v0.6.1.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/8440

New Contributors
* blueyo0 made their first contribution in https://github.com/vllm-project/vllm/pull/8338
* lnykww made their first contribution in https://github.com/vllm-project/vllm/pull/8403
* vegaluisjose made their first contribution in https://github.com/vllm-project/vllm/pull/8423

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.1...v0.6.1.post1

0.6.1

Highlights

Model Support
* Added support for Pixtral (`mistralai/Pixtral-12B-2409`). (8377, 8168)
* Added support for Llava-Next-Video (7559), Qwen-VL (8029), Qwen2-VL (7905)
* Multi-input support for LLaVA (8238), InternVL2 models (8201)

Performance Enhancements
* Memory optimization for awq_gemm and awq_dequantize, 2x throughput (8248)

Production Engine
* Support loading and unloading LoRA adapters in the API server (6566); see the sketch after this list
* Add progress reporting to batch runner (8060)
* Add support for NVIDIA ModelOpt static scaling checkpoints. (6112)
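
A hedged sketch of loading an adapter at runtime against a running `vllm serve` instance. The endpoint path, payload fields, and the environment flag in the comment are assumptions based on this feature (6566), so check the docs for your server version; the adapter name and path are placeholders:

```python
# Hedged sketch: dynamically load a LoRA adapter on a running OpenAI-compatible server.
# Endpoint and payload names are assumptions; adapter name/path are placeholders.
import json
import urllib.request

def load_lora_adapter(name: str, path: str, base_url: str = "http://localhost:8000") -> None:
    payload = json.dumps({"lora_name": name, "lora_path": path}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/load_lora_adapter",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())

# The server is assumed to be started with LoRA enabled and runtime updates allowed, e.g.
#   VLLM_ALLOW_RUNTIME_LORA_UPDATING=True vllm serve <model> --enable-lora
load_lora_adapter("my-adapter", "/path/to/adapter")
```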

Others
* Updated the Docker image to use Python 3.12 for a small performance bump. (8133)
* Added CODE_OF_CONDUCT.md (8161)



What's Changed
* [Doc] [Misc] Create CODE_OF_CONDUCT.md by mmcelaney in https://github.com/vllm-project/vllm/pull/8161
* [bugfix] Upgrade minimum OpenAI version by SolitaryThinker in https://github.com/vllm-project/vllm/pull/8169
* [Misc] Clean up RoPE forward_native by WoosukKwon in https://github.com/vllm-project/vllm/pull/8076
* [ci] Mark LoRA test as soft-fail by khluu in https://github.com/vllm-project/vllm/pull/8160
* [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by elfiegg in https://github.com/vllm-project/vllm/pull/8173
* [Doc] Add multi-image input example and update supported models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8181
* Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by Manikandan-Thangaraj-ZS0321 in https://github.com/vllm-project/vllm/pull/7860
* [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8029
* Move verify_marlin_supported to GPTQMarlinLinearMethod by mgoin in https://github.com/vllm-project/vllm/pull/8165
* [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by sroy745 in https://github.com/vllm-project/vllm/pull/7962
* [Core] Support load and unload LoRA in api server by Jeffwan in https://github.com/vllm-project/vllm/pull/6566
* [BugFix] Fix Granite model configuration by njhill in https://github.com/vllm-project/vllm/pull/8216
* [Frontend] Add --logprobs argument to `benchmark_serving.py` by afeldman-nm in https://github.com/vllm-project/vllm/pull/8191
* [Misc] Use ray[adag] dependency instead of cuda by ruisearch42 in https://github.com/vllm-project/vllm/pull/7938
* [CI/Build] Increasing timeout for multiproc worker tests by alexeykondrat in https://github.com/vllm-project/vllm/pull/8203
* [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by rasmith in https://github.com/vllm-project/vllm/pull/8248
* [Misc] Remove `SqueezeLLM` by dsikka in https://github.com/vllm-project/vllm/pull/8220
* [Model] Allow loading from original Mistral format by patrickvonplaten in https://github.com/vllm-project/vllm/pull/8168
* [misc] [doc] [frontend] LLM torch profiler support by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7943
* [Bugfix] Fix Hermes tool call chat template bug by K-Mistele in https://github.com/vllm-project/vllm/pull/8256
* [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8238
* Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by wschin in https://github.com/vllm-project/vllm/pull/8241
* [tpu][misc] fix typo by youkaichao in https://github.com/vllm-project/vllm/pull/8260
* [Bugfix] Fix broken OpenAI tensorizer test by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8258
* [Model][VLM] Support multi-images inputs for InternVL2 models by Isotr0py in https://github.com/vllm-project/vllm/pull/8201
* [Model][VLM] Decouple weight loading logic for `Paligemma` by Isotr0py in https://github.com/vllm-project/vllm/pull/8269
* ppc64le: Dockerfile fixed, and a script for buildkite by sumitd2 in https://github.com/vllm-project/vllm/pull/8026
* [CI/Build] Use python 3.12 in cuda image by joerunde in https://github.com/vllm-project/vllm/pull/8133
* [Bugfix] Fix async postprocessor in case of preemption by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8267
* [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by K-Mistele in https://github.com/vllm-project/vllm/pull/8272
* [Frontend] Add progress reporting to run_batch.py by alugowski in https://github.com/vllm-project/vllm/pull/8060
* [Bugfix] Correct adapter usage for cohere and jamba by vladislavkruglikov in https://github.com/vllm-project/vllm/pull/8292
* [Misc] GPTQ Activation Ordering by kylesayrs in https://github.com/vllm-project/vllm/pull/8135
* [Misc] Fused MoE Marlin support for GPTQ by dsikka in https://github.com/vllm-project/vllm/pull/8217
* Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by simon-mo in https://github.com/vllm-project/vllm/pull/8319
* [Bugfix] Fix missing `post_layernorm` in CLIP by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8155
* [CI/Build] enable ccache/scccache for HIP builds by dtrifiro in https://github.com/vllm-project/vllm/pull/8327
* [Frontend] Clean up type annotations for mistral tokenizer by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8314
* [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail by alexeykondrat in https://github.com/vllm-project/vllm/pull/8130
* Fix ppc64le buildkite job by sumitd2 in https://github.com/vllm-project/vllm/pull/8309
* [Spec Decode] Move ops.advance_step to flash attn advance_step by kevin314 in https://github.com/vllm-project/vllm/pull/8224
* [Misc] remove peft as dependency for prompt models by prashantgupta24 in https://github.com/vllm-project/vllm/pull/8162
* [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by comaniac in https://github.com/vllm-project/vllm/pull/8342
* [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8340
* [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by SolitaryThinker in https://github.com/vllm-project/vllm/pull/8172
* [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8043
* [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by jeejeelee in https://github.com/vllm-project/vllm/pull/8329
* [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by Isotr0py in https://github.com/vllm-project/vllm/pull/8299
* [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by pavanimajety in https://github.com/vllm-project/vllm/pull/6112
* [model] Support for Llava-Next-Video model by TKONIY in https://github.com/vllm-project/vllm/pull/7559
* [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by pooyadavoodi in https://github.com/vllm-project/vllm/pull/8347
* [Model][VLM] Add Qwen2-VL model support by fyabc in https://github.com/vllm-project/vllm/pull/7905
* [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/7257
* [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by alexeykondrat in https://github.com/vllm-project/vllm/pull/8373
* [Bugfix] Add missing attributes in mistral tokenizer by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8364
* [Kernel][Misc] Add meta functions for ops to prevent graph breaks by bnellnm in https://github.com/vllm-project/vllm/pull/6917
* [Misc] Move device options to a single place by akx in https://github.com/vllm-project/vllm/pull/8322
* [Speculative Decoding] Test refactor by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/8317
* Pixtral by patrickvonplaten in https://github.com/vllm-project/vllm/pull/8377
* Bump version to v0.6.1 by simon-mo in https://github.com/vllm-project/vllm/pull/8379

New Contributors
* mmcelaney made their first contribution in https://github.com/vllm-project/vllm/pull/8161
* elfiegg made their first contribution in https://github.com/vllm-project/vllm/pull/8173
* Manikandan-Thangaraj-ZS0321 made their first contribution in https://github.com/vllm-project/vllm/pull/7860
* sumitd2 made their first contribution in https://github.com/vllm-project/vllm/pull/8026
* alugowski made their first contribution in https://github.com/vllm-project/vllm/pull/8060
* vladislavkruglikov made their first contribution in https://github.com/vllm-project/vllm/pull/8292
* kevin314 made their first contribution in https://github.com/vllm-project/vllm/pull/8224
* TKONIY made their first contribution in https://github.com/vllm-project/vllm/pull/7559
* akx made their first contribution in https://github.com/vllm-project/vllm/pull/8322

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.0...v0.6.1

0.6.0

Highlights

Performance Update
* We are excited to announce a faster vLLM delivering 2x more throughput compared to v0.5.3. The default parameters should achieve a great speedup, but we also recommend trying out multi-step scheduling. You can enable it by setting `--num-scheduler-steps 8` in the engine arguments (see the sketch after this list). Please note that it still has some limitations and is being actively hardened; see 7528 for known issues.
* Multi-step scheduler now supports LLMEngine and log_probs (7789, 7652)
* Asynchronous output processor overlaps output data structure construction with GPU work, delivering a 12% throughput increase. (7049, 7911, 7921, 8050)
* FlashInfer backend is now used for the FP8 KV cache (7798, 7985) and for rejection sampling in speculative decoding (7244)
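
The CLI flag maps to the `num_scheduler_steps` engine argument in the offline API as well; a minimal sketch, with a placeholder model name:

```python
# Minimal sketch: multi-step scheduling via the offline API.
# num_scheduler_steps=8 mirrors the `--num-scheduler-steps 8` server flag;
# the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", num_scheduler_steps=8)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```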

Model Support
* Support bitsandbytes 8-bit and FP4 quantized models (7445)
* New LLMs: Exaone (7819), Granite (7436), Phi-3.5-MoE (7729)
* A new tokenizer mode for Mistral models that uses the native mistral-common package (7739); see the sketch after this list
* Multi-modality:
  * Multi-image input support for LLaVA-Next (7230) and Phi-3-vision models (7783)
  * Ultravox support for multiple audio chunks (7963)
  * TP support for ViTs (7186)
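
A minimal sketch of the new tokenizer mode from the offline API; the `tokenizer_mode="mistral"` value follows the upstream docs for this feature, and the model name is a placeholder:

```python
# Hedged sketch: use the native Mistral tokenizer (mistral-common) instead of
# the Hugging Face tokenizer. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    tokenizer_mode="mistral",
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```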

Hardware Support
* NVIDIA GPU: extended CUDA graph size for H200 (7894)
* AMD: Triton implementations of awq_dequantize and awq_gemm to support AWQ (7386)
* Intel GPU: pipeline parallel support (7810)
* Neuron: configurable context-length and token-generation buckets (7885, 8062)
* TPU: support for single- and multi-host TPUs on GKE (7613) and async output processing (8011)

Production Features
* OpenAI-compatible Tools API + streaming for Hermes & Mistral models! (5649) See the sketch after this list.
* Added `json_schema` support from the OpenAI protocol (7654)
* Enabled chunked prefill and prefix caching together (7753, 8120)
* Multimodal support in offline chat (8098) and multiple multi-modal items in the OpenAI frontend (8049)
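
A hedged sketch of calling the new Tools API through the standard OpenAI client. The server-launch flags in the comment are assumptions drawn from the tool-use docs for this feature, and the model name and tool schema are placeholders:

```python
# Hedged sketch: OpenAI-compatible tool calling against a vLLM server.
# Assumed server launch (verify flag names for your version):
#   vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
#       --enable-auto-tool-choice --tool-call-parser hermes
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```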

Misc
* Support benchmarking async engine in benchmark_throughput.py (7964)
* Progress in `torch.compile` integration: avoiding Dynamo guard evaluation overhead (7898) and skipping compilation during profiling (7796)

What's Changed
* [Core] Add multi-step support to LLMEngine by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7789
* [Bugfix] Fix run_batch logger by pooyadavoodi in https://github.com/vllm-project/vllm/pull/7640
* [Frontend] Publish Prometheus metrics in run_batch API by pooyadavoodi in https://github.com/vllm-project/vllm/pull/7641
* [Frontend] add json_schema support from OpenAI protocol by rockwotj in https://github.com/vllm-project/vllm/pull/7654
* [misc][core] lazy import outlines by youkaichao in https://github.com/vllm-project/vllm/pull/7831
* [ci][test] exclude model download time in server start time by youkaichao in https://github.com/vllm-project/vllm/pull/7834
* [ci][test] fix RemoteOpenAIServer by youkaichao in https://github.com/vllm-project/vllm/pull/7838
* [Bugfix] Fix Phi-3v crash when input images are of certain sizes by zifeitong in https://github.com/vllm-project/vllm/pull/7840
* [Model][VLM] Support multi-images inputs for Phi-3-vision models by Isotr0py in https://github.com/vllm-project/vllm/pull/7783
* [Misc] Remove snapshot_download usage in InternVL2 test by Isotr0py in https://github.com/vllm-project/vllm/pull/7835
* [misc][cuda] improve pynvml warning by youkaichao in https://github.com/vllm-project/vllm/pull/7852
* [Spec Decoding] Streamline batch expansion tensor manipulation by njhill in https://github.com/vllm-project/vllm/pull/7851
* [Bugfix]: Use float32 for base64 embedding by HollowMan6 in https://github.com/vllm-project/vllm/pull/7855
* [CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7836
* [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule by comaniac in https://github.com/vllm-project/vllm/pull/7822
* [Misc] Update `qqq` to use vLLMParameters by dsikka in https://github.com/vllm-project/vllm/pull/7805
* [Misc] Update `gptq_marlin_24` to use vLLMParameters by dsikka in https://github.com/vllm-project/vllm/pull/7762
* [misc] fix custom allreduce p2p cache file generation by youkaichao in https://github.com/vllm-project/vllm/pull/7853
* [Bugfix] neuron: enable tensor parallelism by omrishiv in https://github.com/vllm-project/vllm/pull/7562
* [Misc] Update compressed tensors lifecycle to remove `prefix` from `create_weights` by dsikka in https://github.com/vllm-project/vllm/pull/7825
* [Core] Asynchronous Output Processor by megha95 in https://github.com/vllm-project/vllm/pull/7049
* [Tests] Disable retries and use context manager for openai client by njhill in https://github.com/vllm-project/vllm/pull/7565
* [core][torch.compile] not compile for profiling by youkaichao in https://github.com/vllm-project/vllm/pull/7796
* Revert 7509 by comaniac in https://github.com/vllm-project/vllm/pull/7887
* [Model] Add Mistral Tokenization to improve robustness and chat encoding by patrickvonplaten in https://github.com/vllm-project/vllm/pull/7739
* [CI/Build][VLM] Cleanup multiple images inputs model test by Isotr0py in https://github.com/vllm-project/vllm/pull/7897
* [Hardware][Intel GPU] Add intel GPU pipeline parallel support. by jikunshang in https://github.com/vllm-project/vllm/pull/7810
* [CI/Build][ROCm] Enabling tensorizer tests for ROCm by alexeykondrat in https://github.com/vllm-project/vllm/pull/7237
* [Bugfix] Fix phi3v incorrect image_idx when using async engine by Isotr0py in https://github.com/vllm-project/vllm/pull/7916
* [cuda][misc] error on empty CUDA_VISIBLE_DEVICES by youkaichao in https://github.com/vllm-project/vllm/pull/7924
* [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by dsikka in https://github.com/vllm-project/vllm/pull/7766
* [benchmark] Update TGI version by philschmid in https://github.com/vllm-project/vllm/pull/7917
* [Model] Add multi-image input support for LLaVA-Next offline inference by zifeitong in https://github.com/vllm-project/vllm/pull/7230
* [mypy] Enable mypy type checking for `vllm/core` by jberkhahn in https://github.com/vllm-project/vllm/pull/7229
* [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt by petersalas in https://github.com/vllm-project/vllm/pull/7902
* [hardware][rocm] allow rocm to override default env var by youkaichao in https://github.com/vllm-project/vllm/pull/7926
* [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. by bnellnm in https://github.com/vllm-project/vllm/pull/7886
* [mypy][CI/Build] Fix mypy errors by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7929
* [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7911
* [Performance] Enable chunked prefill and prefix caching together by comaniac in https://github.com/vllm-project/vllm/pull/7753
* [ci][test] fix pp test failure by youkaichao in https://github.com/vllm-project/vllm/pull/7945
* [Doc] fix the autoAWQ example by stas00 in https://github.com/vllm-project/vllm/pull/7937
* [Bugfix][VLM] Fix incompatibility between 7902 and 7230 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7948
* [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. by pavanimajety in https://github.com/vllm-project/vllm/pull/7798
* [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ by rasmith in https://github.com/vllm-project/vllm/pull/7386
* [TPU] Upgrade PyTorch XLA nightly by WoosukKwon in https://github.com/vllm-project/vllm/pull/7967
* [Doc] fix 404 link by stas00 in https://github.com/vllm-project/vllm/pull/7966
* [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM by mzusman in https://github.com/vllm-project/vllm/pull/7651
* [Bugfix] Make torch registration of punica ops optional by bnellnm in https://github.com/vllm-project/vllm/pull/7970
* [torch.compile] avoid Dynamo guard evaluation overhead by youkaichao in https://github.com/vllm-project/vllm/pull/7898
* Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test by mgoin in https://github.com/vllm-project/vllm/pull/7961
* [Frontend] Minor optimizations to zmq decoupled front-end by njhill in https://github.com/vllm-project/vllm/pull/7957
* [torch.compile] remove reset by youkaichao in https://github.com/vllm-project/vllm/pull/7975
* [VLM][Core] Fix exceptions on ragged NestedTensors by petersalas in https://github.com/vllm-project/vllm/pull/7974
* Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." by youkaichao in https://github.com/vllm-project/vllm/pull/7982
* [Bugfix] Unify rank computation across regular decoding and speculative decoding by jmkuebler in https://github.com/vllm-project/vllm/pull/7899
* [Core] Combine async postprocessor and multi-step by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7921
* [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto by pavanimajety in https://github.com/vllm-project/vllm/pull/7985
* extend cuda graph size for H200 by kushanam in https://github.com/vllm-project/vllm/pull/7894
* [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism by Isotr0py in https://github.com/vllm-project/vllm/pull/7954
* [misc] update tpu int8 to use new vLLM Parameters by dsikka in https://github.com/vllm-project/vllm/pull/7973
* [Neuron] Adding support for context-lenght, token-gen buckets. by hbikki in https://github.com/vllm-project/vllm/pull/7885
* support bitsandbytes 8-bit and FP4 quantized models by chenqianfzh in https://github.com/vllm-project/vllm/pull/7445
* Add more percentiles and latencies by wschin in https://github.com/vllm-project/vllm/pull/7759
* [VLM] Disallow overflowing `max_model_len` for multimodal models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7998
* [Core] Logprobs support in Multi-step by afeldman-nm in https://github.com/vllm-project/vllm/pull/7652
* [TPU] Async output processing for TPU by WoosukKwon in https://github.com/vllm-project/vllm/pull/8011
* [Kernel] changing fused moe kernel chunk size default to 32k by avshalomman in https://github.com/vllm-project/vllm/pull/7995
* [MODEL] add Exaone model support by nayohan in https://github.com/vllm-project/vllm/pull/7819
* Support vLLM single and multi-host TPUs on GKE by richardsliu in https://github.com/vllm-project/vllm/pull/7613
* [Bugfix] Fix import error in Exaone model by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8034
* [VLM][Model] TP support for ViTs by ChristopherCho in https://github.com/vllm-project/vllm/pull/7186
* [Core] Increase default `max_num_batched_tokens` for multimodal models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8028
* [Frontend]-config-cli-args by KaunilD in https://github.com/vllm-project/vllm/pull/7737
* [TPU][Bugfix] Fix tpu type api by WoosukKwon in https://github.com/vllm-project/vllm/pull/8035
* [Model] Adding support for MSFT Phi-3.5-MoE by wenxcs in https://github.com/vllm-project/vllm/pull/7729
* [Bugfix] Address 8009 and add model test for flashinfer fp8 kv cache. by pavanimajety in https://github.com/vllm-project/vllm/pull/8013
* [Bugfix] Fix import error in Phi-3.5-MoE by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8052
* [Bugfix] Fix ModelScope models in v0.5.5 by NickLucche in https://github.com/vllm-project/vllm/pull/8037
* [BugFix][Core] Multistep Fix Crash on Request Cancellation by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/8059
* [Frontend][VLM] Add support for multiple multi-modal items in the OpenAI frontend by ywang96 in https://github.com/vllm-project/vllm/pull/8049
* [Misc] Optional installation of audio related packages by ywang96 in https://github.com/vllm-project/vllm/pull/8063
* [Model] Adding Granite model. by shawntan in https://github.com/vllm-project/vllm/pull/7436
* [SpecDecode][Kernel] Use Flashinfer for Rejection Sampling in Speculative Decoding by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7244
* [TPU] Align worker index with node boundary by WoosukKwon in https://github.com/vllm-project/vllm/pull/7932
* [Core][Bugfix] Accept GGUF model without .gguf extension by Isotr0py in https://github.com/vllm-project/vllm/pull/8056
* [Bugfix] Fix internlm2 tensor parallel inference by Isotr0py in https://github.com/vllm-project/vllm/pull/8055
* [Bugfix] Fix 7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. by noooop in https://github.com/vllm-project/vllm/pull/7874
* [Bugfix] Fix single output condition in output processor by WoosukKwon in https://github.com/vllm-project/vllm/pull/7881
* [Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/8061
* [Performance] Enable chunked prefill and prefix caching together by comaniac in https://github.com/vllm-project/vllm/pull/8120
* [CI] Only PR reviewers/committers can trigger CI on PR by khluu in https://github.com/vllm-project/vllm/pull/8124
* [Core] Optimize Async + Multi-step by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8050
* [Misc] Raise a more informative exception in add/remove_logger by Yard1 in https://github.com/vllm-project/vllm/pull/7750
* [CI/Build] fix: Add the +empty tag to the version only when the VLLM_TARGET_DEVICE envvar was explicitly set to "empty" by tomeras91 in https://github.com/vllm-project/vllm/pull/8118
* [ci] Fix GHA workflow by khluu in https://github.com/vllm-project/vllm/pull/8129
* [TPU][Bugfix] Fix next_token_ids shape by WoosukKwon in https://github.com/vllm-project/vllm/pull/8128
* [CI] Change PR remainder to avoid at-mentions by simon-mo in https://github.com/vllm-project/vllm/pull/8134
* [Misc] Update `GPTQ` to use `vLLMParameters` by dsikka in https://github.com/vllm-project/vllm/pull/7976
* [Benchmark] Add `--async-engine` option to benchmark_throughput.py by njhill in https://github.com/vllm-project/vllm/pull/7964
* [TPU][Bugfix] Use XLA rank for persistent cache path by WoosukKwon in https://github.com/vllm-project/vllm/pull/8137
* [Misc] Update fbgemmfp8 to use `vLLMParameters` by dsikka in https://github.com/vllm-project/vllm/pull/7972
* [Model] Add Ultravox support for multiple audio chunks by petersalas in https://github.com/vllm-project/vllm/pull/7963
* [Frontend] Multimodal support in offline chat by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8098
* chore: Update check-wheel-size.py to read VLLM_MAX_SIZE_MB from env by haitwang-cloud in https://github.com/vllm-project/vllm/pull/8103
* [Bugfix] remove post_layernorm in siglip by wnma3mz in https://github.com/vllm-project/vllm/pull/8106
* [MISC] Consolidate FP8 kv-cache tests by comaniac in https://github.com/vllm-project/vllm/pull/8131
* [CI/Build][ROCm] Enabling LoRA tests on ROCm by alexeykondrat in https://github.com/vllm-project/vllm/pull/7369
* [CI] Change test input in Gemma LoRA test by WoosukKwon in https://github.com/vllm-project/vllm/pull/8163
* [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models by K-Mistele in https://github.com/vllm-project/vllm/pull/5649
* [MISC] Replace input token throughput with total token throughput by comaniac in https://github.com/vllm-project/vllm/pull/8164
* [Neuron] Adding support for adding/ overriding neuron configuration a… by hbikki in https://github.com/vllm-project/vllm/pull/8062
* Bump version to v0.6.0 by simon-mo in https://github.com/vllm-project/vllm/pull/8166

New Contributors
* rockwotj made their first contribution in https://github.com/vllm-project/vllm/pull/7654
* HollowMan6 made their first contribution in https://github.com/vllm-project/vllm/pull/7855
* patrickvonplaten made their first contribution in https://github.com/vllm-project/vllm/pull/7739
* philschmid made their first contribution in https://github.com/vllm-project/vllm/pull/7917
* jberkhahn made their first contribution in https://github.com/vllm-project/vllm/pull/7229
* pavanimajety made their first contribution in https://github.com/vllm-project/vllm/pull/7798
* rasmith made their first contribution in https://github.com/vllm-project/vllm/pull/7386
* jmkuebler made their first contribution in https://github.com/vllm-project/vllm/pull/7899
* kushanam made their first contribution in https://github.com/vllm-project/vllm/pull/7894
* hbikki made their first contribution in https://github.com/vllm-project/vllm/pull/7885
* wschin made their first contribution in https://github.com/vllm-project/vllm/pull/7759
* nayohan made their first contribution in https://github.com/vllm-project/vllm/pull/7819
* richardsliu made their first contribution in https://github.com/vllm-project/vllm/pull/7613
* KaunilD made their first contribution in https://github.com/vllm-project/vllm/pull/7737
* wenxcs made their first contribution in https://github.com/vllm-project/vllm/pull/7729
* NickLucche made their first contribution in https://github.com/vllm-project/vllm/pull/8037
* shawntan made their first contribution in https://github.com/vllm-project/vllm/pull/7436
* noooop made their first contribution in https://github.com/vllm-project/vllm/pull/7874
* haitwang-cloud made their first contribution in https://github.com/vllm-project/vllm/pull/8103
* wnma3mz made their first contribution in https://github.com/vllm-project/vllm/pull/8106
* K-Mistele made their first contribution in https://github.com/vllm-project/vllm/pull/5649

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.5...v0.6.0

0.5.5

Highlights

Performance Update
* We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (7000, 7387, 7452, 7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can pass `--num-scheduler-steps 8` to the API server (via `vllm serve`) or to `AsyncLLMEngine`. We are working on expanding coverage to the `LLM` class and aim to turn it on by default.
* Various enhancements:
  * Use the FlashInfer sampling kernel when available, leading to a 7% decoding throughput speedup (7137)
  * Reduce Python allocations, leading to a 24% throughput speedup (7162, 7364)
  * Improvements to the ZeroMQ-based decoupled frontend (7570, 7716, 7484)

Model Support
* Support Jamba 1.5 (7415, 7601, 6739)
* Support for the first audio model `UltravoxModel` (7615, 7446)
* Improvements to vision models:
  * Support image embeddings as input (6613)
  * Support SigLIP encoder and alternative decoders for LLaVA models (7153)
* Support loading GGUF models (5191) with tensor parallelism (7520); see the sketch after this list
* Progress on encoder-decoder models: support for serving encoder/decoder models (7258) and architecture for cross-attention (4942)
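
A hedged sketch of GGUF loading with tensor parallelism; the file path, tokenizer repo, and TP size are placeholders, and pointing `tokenizer` at the original Hugging Face repo is only one possible setup:

```python
# Hedged sketch: load a local GGUF checkpoint with TP=2. Paths/names are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # local GGUF file
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",       # original HF tokenizer
    tensor_parallel_size=2,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```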

Hardware Support
* AMD: added an FP8 linear layer for ROCm (7210)
* Enhancements to TPU support: load-time W8A16 quantization (7005), optimized RoPE (7635), and multi-host inference (7457)
* Intel: various refactoring of the worker, executor, and model runner (7686, 7712)

Others
* Optimize prefix caching performance (7193)
* Speculative decoding:
  * Use target model max length as default for draft model (7706)
  * EAGLE implementation with Top-1 proposer (6830)
* Entrypoints:
  * A new `chat` method in the `LLM` class (5049); see the sketch after this list
  * Support embeddings in the run_batch API (7132)
  * Support `prompt_logprobs` in Chat Completion (7453)
* Quantizations:
  * Expand MoE weight loading + add fused Marlin MoE kernel (7527)
  * Machete - Hopper-optimized mixed-precision linear kernel (7174)
* `torch.compile`: register custom ops for kernels (7591, 7594, 7536)
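
A minimal sketch of the new offline `chat` method; the model name, messages, and sampling settings are placeholders:

```python
# Minimal sketch: offline chat via LLM.chat(), which applies the model's chat
# template before generating. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Give three uses of speculative decoding."},
]
outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```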

What's Changed
* [ci][frontend] deduplicate tests by youkaichao in https://github.com/vllm-project/vllm/pull/7101
* [Doc] [SpecDecode] Update MLPSpeculator documentation by tdoublep in https://github.com/vllm-project/vllm/pull/7100
* [Bugfix] Specify device when loading LoRA and embedding tensors by jischein in https://github.com/vllm-project/vllm/pull/7129
* [MISC] Use non-blocking transfer in prepare_input by comaniac in https://github.com/vllm-project/vllm/pull/7172
* [Core] Support loading GGUF model by Isotr0py in https://github.com/vllm-project/vllm/pull/5191
* [Build] Add initial conditional testing spec by simon-mo in https://github.com/vllm-project/vllm/pull/6841
* [LoRA] Relax LoRA condition by jeejeelee in https://github.com/vllm-project/vllm/pull/7146
* [Model] Support SigLIP encoder and alternative decoders for LLaVA models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7153
* [BugFix] Fix DeepSeek remote code by dsikka in https://github.com/vllm-project/vllm/pull/7178
* [ BugFix ] Fix ZMQ when `VLLM_PORT` is set by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7205
* [Bugfix] add gguf dependency by kpapis in https://github.com/vllm-project/vllm/pull/7198
* [SpecDecode] [Minor] Fix spec decode sampler tests by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7183
* [Kernel] Add per-tensor and per-token AZP epilogues by ProExpertProg in https://github.com/vllm-project/vllm/pull/5941
* [Core] Optimize evictor-v2 performance by xiaobochen123 in https://github.com/vllm-project/vllm/pull/7193
* [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by afeldman-nm in https://github.com/vllm-project/vllm/pull/4942
* [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by mgoin in https://github.com/vllm-project/vllm/pull/7225
* [BugFix] Overhaul async request cancellation by njhill in https://github.com/vllm-project/vllm/pull/7111
* [Doc] Mock new dependencies for documentation by ywang96 in https://github.com/vllm-project/vllm/pull/7245
* [BUGFIX]: top_k is expected to be an integer. by Atllkks10 in https://github.com/vllm-project/vllm/pull/7227
* [Frontend] Gracefully handle missing chat template and fix CI failure by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7238
* [distributed][misc] add specialized method for cuda platform by youkaichao in https://github.com/vllm-project/vllm/pull/7249
* [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` by dsikka in https://github.com/vllm-project/vllm/pull/5874
* [ BugFix ] Move `zmq` frontend to IPC instead of TCP by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7222
* Fixes typo in function name by rafvasq in https://github.com/vllm-project/vllm/pull/7275
* [Bugfix] Fix input processor for InternVL2 model by Isotr0py in https://github.com/vllm-project/vllm/pull/7164
* [OpenVINO] migrate to latest dependencies versions by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7251
* [Doc] add online speculative decoding example by stas00 in https://github.com/vllm-project/vllm/pull/7243
* [BugFix] Fix frontend multiprocessing hang by maxdebayser in https://github.com/vllm-project/vllm/pull/7217
* [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by mgoin in https://github.com/vllm-project/vllm/pull/7219
* [ci] Make building wheels per commit optional by khluu in https://github.com/vllm-project/vllm/pull/7278
* [Bugfix] Fix gptq failure on T4s by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7264
* [FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional by njhill in https://github.com/vllm-project/vllm/pull/7282
* [Doc] Update supported_hardware.rst by mgoin in https://github.com/vllm-project/vllm/pull/7276
* [Kernel] Fix Flashinfer Correctness by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7284
* [Misc] Fix typos in scheduler.py by ruisearch42 in https://github.com/vllm-project/vllm/pull/7285
* [Frontend] remove max_num_batched_tokens limit for lora by NiuBlibing in https://github.com/vllm-project/vllm/pull/7288
* [Bugfix] Fix LoRA with PP by andoorve in https://github.com/vllm-project/vllm/pull/7292
* [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by jeejeelee in https://github.com/vllm-project/vllm/pull/7273
* [Bugfix][Kernel] Increased atol to fix failing tests by ProExpertProg in https://github.com/vllm-project/vllm/pull/7305
* [Frontend] Kill the server on engine death by joerunde in https://github.com/vllm-project/vllm/pull/6594
* [Bugfix][fast] Fix the get_num_blocks_touched logic by zachzzc in https://github.com/vllm-project/vllm/pull/6849
* [Doc] Put collect_env issue output in a <detail> block by mgoin in https://github.com/vllm-project/vllm/pull/7310
* [CI/Build] Dockerfile.cpu improvements by dtrifiro in https://github.com/vllm-project/vllm/pull/7298
* [Bugfix] Fix new Llama3.1 GGUF model loading by Isotr0py in https://github.com/vllm-project/vllm/pull/7269
* [Misc] Temporarily resolve the error of BitAndBytes by jeejeelee in https://github.com/vllm-project/vllm/pull/7308
* Add Skywork AI as Sponsor by simon-mo in https://github.com/vllm-project/vllm/pull/7314
* [TPU] Add Load-time W8A16 quantization for TPU Backend by lsy323 in https://github.com/vllm-project/vllm/pull/7005
* [Core] Support serving encoder/decoder models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7258
* [TPU] Fix dockerfile.tpu by WoosukKwon in https://github.com/vllm-project/vllm/pull/7331
* [Performance] Optimize e2e overheads: Reduce python allocations by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7162
* [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by tjohnson31415 in https://github.com/vllm-project/vllm/pull/7218
* [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by SolitaryThinker in https://github.com/vllm-project/vllm/pull/6971
* [Core] Streamline stream termination in `AsyncLLMEngine` by njhill in https://github.com/vllm-project/vllm/pull/7336
* [Model][Jamba] Mamba cache single buffer by mzusman in https://github.com/vllm-project/vllm/pull/6739
* [VLM][Doc] Add `stop_token_ids` to InternVL example by Isotr0py in https://github.com/vllm-project/vllm/pull/7354
* [Performance] e2e overheads reduction: Small followup diff by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7364
* [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7360
* [Frontend] Support embeddings in the run_batch API by pooyadavoodi in https://github.com/vllm-project/vllm/pull/7132
* [Bugfix] Fix ITL recording in serving benchmark by ywang96 in https://github.com/vllm-project/vllm/pull/7372
* [Core] Add span metrics for model_forward, scheduler and sampler time by sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7089
* [Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models by dsikka in https://github.com/vllm-project/vllm/pull/7376
* [Misc] Add numpy implementation of `compute_slot_mapping` by Yard1 in https://github.com/vllm-project/vllm/pull/7377
* [Core] Fix edge case in chunked prefill + block manager v2 by cadedaniel in https://github.com/vllm-project/vllm/pull/7380
* [Bugfix] Fix phi3v batch inference when images have different aspect ratio by Isotr0py in https://github.com/vllm-project/vllm/pull/7392
* [TPU] Use mark_dynamic to reduce compilation time by WoosukKwon in https://github.com/vllm-project/vllm/pull/7340
* Updating LM Format Enforcer version to v0.10.6 by noamgat in https://github.com/vllm-project/vllm/pull/7189
* [core] [2/N] refactor worker_base input preparation for multi-step by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7387
* [CI/Build] build on empty device for better dev experience by tomeras91 in https://github.com/vllm-project/vllm/pull/4773
* [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by tomeras91 in https://github.com/vllm-project/vllm/pull/7403
* [misc] add commit id in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/7405
* [Docs] Update readme by simon-mo in https://github.com/vllm-project/vllm/pull/7316
* [CI/Build] Minor refactoring for vLLM assets by ywang96 in https://github.com/vllm-project/vllm/pull/7407
* [Kernel] Flashinfer correctness fix for v0.1.3 by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7319
* [Core][VLM] Support image embeddings as input by ywang96 in https://github.com/vllm-project/vllm/pull/6613
* [Frontend] Disallow passing `model` as both argument and option by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7347
* [CI/Build] bump Dockerfile.neuron image base, use public ECR by dtrifiro in https://github.com/vllm-project/vllm/pull/6832
* [Bugfix] Fix logit soft cap in flash-attn backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/7425
* [ci] Entrypoints run upon changes in vllm/ by khluu in https://github.com/vllm-project/vllm/pull/7423
* [ci] Cancel fastcheck run when PR is marked ready by khluu in https://github.com/vllm-project/vllm/pull/7427
* [ci] Cancel fastcheck when PR is ready by khluu in https://github.com/vllm-project/vllm/pull/7433
* [Misc] Use scalar type to dispatch to different `gptq_marlin` kernels by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7323
* [Core] Consolidate `GB` constant and enable float GB arguments by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7416
* [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by jon-chuang in https://github.com/vllm-project/vllm/pull/7208
* [Bugfix] Handle PackageNotFoundError when checking for xpu version by sasha0552 in https://github.com/vllm-project/vllm/pull/7398
* [CI/Build] bump minimum cmake version by dtrifiro in https://github.com/vllm-project/vllm/pull/6999
* [Core] Shut down aDAG workers with clean async llm engine exit by ruisearch42 in https://github.com/vllm-project/vllm/pull/7224
* [mypy] Misc. typing improvements by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7417
* [Misc] improve logits processors logging message by aw632 in https://github.com/vllm-project/vllm/pull/7435
* [ci] Remove fast check cancel workflow by khluu in https://github.com/vllm-project/vllm/pull/7455
* [Bugfix] Fix weight loading for Chameleon when TP>1 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7410
* [hardware] unify usage of is_tpu to current_platform.is_tpu() by youkaichao in https://github.com/vllm-project/vllm/pull/7102
* [TPU] Suppress import custom_ops warning by WoosukKwon in https://github.com/vllm-project/vllm/pull/7458
* Revert "[Doc] Update supported_hardware.rst (7276)" by WoosukKwon in https://github.com/vllm-project/vllm/pull/7467
* [Frontend][Core] Add plumbing to support audio language models by petersalas in https://github.com/vllm-project/vllm/pull/7446
* [Misc] Update LM Eval Tolerance by dsikka in https://github.com/vllm-project/vllm/pull/7473
* [Misc] Update `gptq_marlin` to use new vLLMParameters by dsikka in https://github.com/vllm-project/vllm/pull/7281
* [Misc] Update Fused MoE weight loading by dsikka in https://github.com/vllm-project/vllm/pull/7334
* [Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` by dsikka in https://github.com/vllm-project/vllm/pull/7422
* Announce NVIDIA Meetup by simon-mo in https://github.com/vllm-project/vllm/pull/7483
* [frontend] spawn engine process from api server process by youkaichao in https://github.com/vllm-project/vllm/pull/7484
* [Misc] `compressed-tensors` code reuse by kylesayrs in https://github.com/vllm-project/vllm/pull/7277
* [misc][plugin] add plugin system implementation by youkaichao in https://github.com/vllm-project/vllm/pull/7426
* [TPU] Support multi-host inference by WoosukKwon in https://github.com/vllm-project/vllm/pull/7457
* [Bugfix][CI] Import ray under guard by WoosukKwon in https://github.com/vllm-project/vllm/pull/7486
* [CI/Build]Reduce the time consumption for LoRA tests by jeejeelee in https://github.com/vllm-project/vllm/pull/7396
* [misc][ci] fix cpu test with plugins by youkaichao in https://github.com/vllm-project/vllm/pull/7489
* [Bugfix][Docs] Update list of mock imports by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7493
* [doc] update test script to include cudagraph by youkaichao in https://github.com/vllm-project/vllm/pull/7501
* Fix empty output when temp is too low by CatherineSue in https://github.com/vllm-project/vllm/pull/2937
* [ci] fix model tests by youkaichao in https://github.com/vllm-project/vllm/pull/7507
* [Bugfix][Frontend] Disable embedding API for chat models by QwertyJack in https://github.com/vllm-project/vllm/pull/7504
* [Misc] Deprecation Warning when setting --engine-use-ray by wallashss in https://github.com/vllm-project/vllm/pull/7424
* [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7126
* [core] [3/N] multi-step args and sequence.py by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7452
* [TPU] Set per-rank XLA cache by WoosukKwon in https://github.com/vllm-project/vllm/pull/7533
* [Misc] Revert `compressed-tensors` code reuse by kylesayrs in https://github.com/vllm-project/vllm/pull/7521
* llama_index serving integration documentation by pavanjava in https://github.com/vllm-project/vllm/pull/6973
* [Bugfix][TPU] Correct env variable for XLA cache path by WoosukKwon in https://github.com/vllm-project/vllm/pull/7544
* [Bugfix] update neuron for version > 0.5.0 by omrishiv in https://github.com/vllm-project/vllm/pull/7175
* [Misc] Update dockerfile for CPU to cover protobuf installation by PHILO-HE in https://github.com/vllm-project/vllm/pull/7182
* [Bugfix] Fix default weight loading for scalars by mgoin in https://github.com/vllm-project/vllm/pull/7534
* [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by mgoin in https://github.com/vllm-project/vllm/pull/7566
* [Misc] Add quantization config support for speculative model. by ShangmingCai in https://github.com/vllm-project/vllm/pull/7343
* [Feature]: Add OpenAI server prompt_logprobs support 6508 by gnpinkert in https://github.com/vllm-project/vllm/pull/7453
* [ci/test] rearrange tests and make adag test soft fail by youkaichao in https://github.com/vllm-project/vllm/pull/7572
* Chat method for offline llm by nunjunj in https://github.com/vllm-project/vllm/pull/5049
* [CI] Move quantization cpu offload tests out of fastcheck by mgoin in https://github.com/vllm-project/vllm/pull/7574
* [Misc/Testing] Use `torch.testing.assert_close` by jon-chuang in https://github.com/vllm-project/vllm/pull/7324
* register custom op for flash attn and use from torch.ops by youkaichao in https://github.com/vllm-project/vllm/pull/7536
* [Core] Use uvloop with zmq-decoupled front-end by njhill in https://github.com/vllm-project/vllm/pull/7570
* [CI] Fix crashes of performance benchmark by KuntaiDu in https://github.com/vllm-project/vllm/pull/7500
* [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by gongdao123 in https://github.com/vllm-project/vllm/pull/7513
* support tqdm in notebooks by fzyzcjy in https://github.com/vllm-project/vllm/pull/7510
* [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by charlifu in https://github.com/vllm-project/vllm/pull/7210
* [Kernel] W8A16 Int8 inside FusedMoE by mzusman in https://github.com/vllm-project/vllm/pull/7415
* [Kernel] Add tuned triton configs for ExpertsInt8 by mgoin in https://github.com/vllm-project/vllm/pull/7601
* [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7571
* [Core] Fix tracking of model forward time to the span traces in case of PP>1 by sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7440
* [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by mgoin in https://github.com/vllm-project/vllm/pull/7444
* [Doc] Update quantization supported hardware table by mgoin in https://github.com/vllm-project/vllm/pull/7595
* [Kernel] register punica functions as torch ops by bnellnm in https://github.com/vllm-project/vllm/pull/7591
* [Kernel][Misc] dynamo support for ScalarType by bnellnm in https://github.com/vllm-project/vllm/pull/7594
* [Kernel] fix types used in aqlm and ggml kernels to support dynamo by bnellnm in https://github.com/vllm-project/vllm/pull/7596
* [Model] Align nemotron config with final HF state and fix lm-eval-small by mgoin in https://github.com/vllm-project/vllm/pull/7611
* [Bugfix] Fix custom_ar support check by bnellnm in https://github.com/vllm-project/vllm/pull/7617
* .[Build/CI] Enabling passing AMD tests. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/7610
* [Bugfix] Clear engine reference in AsyncEngineRPCServer by ruisearch42 in https://github.com/vllm-project/vllm/pull/7618
* [aDAG] Unflake aDAG + PP tests by rkooo567 in https://github.com/vllm-project/vllm/pull/7600
* [Bugfix] add >= 1.0 constraint for openai dependency by metasyn in https://github.com/vllm-project/vllm/pull/7612
* [misc] use nvml to get consistent device name by youkaichao in https://github.com/vllm-project/vllm/pull/7582
* [ci][test] fix engine/logger test by youkaichao in https://github.com/vllm-project/vllm/pull/7621
* [core][misc] update libcudart finding by youkaichao in https://github.com/vllm-project/vllm/pull/7620
* [Model] Pipeline parallel support for JAIS by mrbesher in https://github.com/vllm-project/vllm/pull/7603
* [ci][test] allow longer wait time for api server by youkaichao in https://github.com/vllm-project/vllm/pull/7629
* [Misc]Fix BitAndBytes exception messages by jeejeelee in https://github.com/vllm-project/vllm/pull/7626
* [VLM] Refactor `MultiModalConfig` initialization and profiling by ywang96 in https://github.com/vllm-project/vllm/pull/7530
* [TPU] Skip creating empty tensor by WoosukKwon in https://github.com/vllm-project/vllm/pull/7630
* [TPU] Use mark_dynamic only for dummy run by WoosukKwon in https://github.com/vllm-project/vllm/pull/7634
* [TPU] Optimize RoPE forward_native2 by WoosukKwon in https://github.com/vllm-project/vllm/pull/7636
* [ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7279
* [CI/Build] Add text-only test for Qwen models by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/7475
* [Misc] Refactor Llama3 RoPE initialization by WoosukKwon in https://github.com/vllm-project/vllm/pull/7637
* [Core] Optimize SPMD architecture with delta + serialization optimization by rkooo567 in https://github.com/vllm-project/vllm/pull/7109
* [Core] Use flashinfer sampling kernel when available by peng1999 in https://github.com/vllm-project/vllm/pull/7137
* fix xpu build by jikunshang in https://github.com/vllm-project/vllm/pull/7644
* [Misc] Remove Gemma RoPE by WoosukKwon in https://github.com/vllm-project/vllm/pull/7638
* [MISC] Add prefix cache hit rate to metrics by comaniac in https://github.com/vllm-project/vllm/pull/7606
* [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by c3-ali in https://github.com/vllm-project/vllm/pull/5428
* [core] Multi Step Scheduling by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7000
* [Core] Support tensor parallelism for GGUF quantization by Isotr0py in https://github.com/vllm-project/vllm/pull/7520
* [Bugfix] Don't disable existing loggers by a-ys in https://github.com/vllm-project/vllm/pull/7664
* [TPU] Fix redundant input tensor cloning by WoosukKwon in https://github.com/vllm-project/vllm/pull/7660
* [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by tjohnson31415 in https://github.com/vllm-project/vllm/pull/7665
* [doc] fix doc build error caused by msgspec by youkaichao in https://github.com/vllm-project/vllm/pull/7659
* [Speculative Decoding] Fixing hidden states handling in batch expansion by abhigoyal1997 in https://github.com/vllm-project/vllm/pull/7508
* [ci] Install Buildkite test suite analysis by khluu in https://github.com/vllm-project/vllm/pull/7667
* [Bugfix] support `tie_word_embeddings` for all models by zijian-hu in https://github.com/vllm-project/vllm/pull/5724
* [CI] Organizing performance benchmark files by KuntaiDu in https://github.com/vllm-project/vllm/pull/7616
* [misc] add nvidia related library in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/7674
* [XPU] fallback to native implementation for xpu custom op by jianyizh in https://github.com/vllm-project/vllm/pull/7670
* [misc][cuda] add warning for pynvml user by youkaichao in https://github.com/vllm-project/vllm/pull/7675
* [Core] Refactor executor classes to make it easier to inherit GPUExecutor by jikunshang in https://github.com/vllm-project/vllm/pull/7673
* [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7174
* [OpenVINO] Updated documentation by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7687
* [VLM][Model] Add test for InternViT vision encoder by Isotr0py in https://github.com/vllm-project/vllm/pull/7409
* [Hardware] [Intel GPU] refactor xpu worker/executor by jikunshang in https://github.com/vllm-project/vllm/pull/7686
* [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by ronensc in https://github.com/vllm-project/vllm/pull/7266
* [Misc] Add jinja2 as an explicit build requirement by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7695
* [Core] Add `AttentionState` abstraction by Yard1 in https://github.com/vllm-project/vllm/pull/7663
* [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by jikunshang in https://github.com/vllm-project/vllm/pull/7685
* [ci][test] adjust max wait time for cpu offloading test by youkaichao in https://github.com/vllm-project/vllm/pull/7709
* [Core] Pipe `worker_class_fn` argument in Executor by Yard1 in https://github.com/vllm-project/vllm/pull/7707
* [ci] try to log process using the port to debug the port usage by youkaichao in https://github.com/vllm-project/vllm/pull/7711
* [Model] Add AWQ quantization support for InternVL2 model by Isotr0py in https://github.com/vllm-project/vllm/pull/7187
* [Doc] Section for Multimodal Language Models by ywang96 in https://github.com/vllm-project/vllm/pull/7719
* [mypy] Enable following imports for entrypoints by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7248
* [Bugfix] Mirror jinja2 in pyproject.toml by sasha0552 in https://github.com/vllm-project/vllm/pull/7723
* [BugFix] Avoid premature async generator exit and raise all exception variations by njhill in https://github.com/vllm-project/vllm/pull/7698
* [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by learninmou in https://github.com/vllm-project/vllm/pull/7509
* [Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/7735
* [Spec Decoding] Use target model max length as default for draft model by njhill in https://github.com/vllm-project/vllm/pull/7706
* [Bugfix] chat method add_generation_prompt param by brian14708 in https://github.com/vllm-project/vllm/pull/7734
* [Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7394
* [Bugfix] Pass PYTHONPATH from setup.py to CMake by sasha0552 in https://github.com/vllm-project/vllm/pull/7730
* [multi-step] Raise error if not using async engine by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7703
* [Frontend] Improve Startup Failure UX by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7716
* [misc] Add Torch profiler support by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7451
* [Model] Add UltravoxModel and UltravoxConfig by petersalas in https://github.com/vllm-project/vllm/pull/7615
* [ci] [multi-step] narrow multi-step test dependency paths by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7760
* [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by dsikka in https://github.com/vllm-project/vllm/pull/7527
* [distributed][misc] error on same VLLM_HOST_IP setting by youkaichao in https://github.com/vllm-project/vllm/pull/7756
* [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by gshtras in https://github.com/vllm-project/vllm/pull/7477
* [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` by ProExpertProg in https://github.com/vllm-project/vllm/pull/7233
* [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by zifeitong in https://github.com/vllm-project/vllm/pull/7710
* [Bug][Frontend] Improve ZMQ client robustness by joerunde in https://github.com/vllm-project/vllm/pull/7443
* Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (7527)" by mgoin in https://github.com/vllm-project/vllm/pull/7764
* [TPU] Avoid initializing TPU runtime in is_tpu by WoosukKwon in https://github.com/vllm-project/vllm/pull/7763
* [ci] refine dependency for distributed tests by youkaichao in https://github.com/vllm-project/vllm/pull/7776
* [Misc] Use torch.compile for GemmaRMSNorm by WoosukKwon in https://github.com/vllm-project/vllm/pull/7642
* [Speculative Decoding] EAGLE Implementation with Top-1 proposer by abhigoyal1997 in https://github.com/vllm-project/vllm/pull/6830
* Fix ShardedStateLoader for vllm fp8 quantization by sfc-gh-zhwang in https://github.com/vllm-project/vllm/pull/7708
* [Bugfix] Don't build machete on cuda <12.0 by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7757
* [Misc] update fp8 to use `vLLMParameter` by dsikka in https://github.com/vllm-project/vllm/pull/7437
* [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by tjohnson31415 in https://github.com/vllm-project/vllm/pull/7232
* [Misc] Enhance prefix-caching benchmark tool by Jeffwan in https://github.com/vllm-project/vllm/pull/6568
* [Doc] Fix incorrect docs from 7615 by petersalas in https://github.com/vllm-project/vllm/pull/7788
* [Bugfix] Use LoadFormat values as choices for `vllm serve --load-format` by mgoin in https://github.com/vllm-project/vllm/pull/7784
* [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by khluu in https://github.com/vllm-project/vllm/pull/7705
* [Misc] fix typo in triton import warning by lsy323 in https://github.com/vllm-project/vllm/pull/7794
* [Frontend] error suppression cleanup by joerunde in https://github.com/vllm-project/vllm/pull/7786
* [Ray backend] Better error when pg topology is bad. by rkooo567 in https://github.com/vllm-project/vllm/pull/7584
* [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by jikunshang in https://github.com/vllm-project/vllm/pull/7712
* [misc] Add Torch profiler support for CPU-only devices by DamonFool in https://github.com/vllm-project/vllm/pull/7806
* [BugFix] Fix server crash on empty prompt by maxdebayser in https://github.com/vllm-project/vllm/pull/7746
* [github][misc] promote asking llm first by youkaichao in https://github.com/vllm-project/vllm/pull/7809
* [Misc] Update `marlin` to use vLLMParameters by dsikka in https://github.com/vllm-project/vllm/pull/7803
* Bump version to v0.5.5 by simon-mo in https://github.com/vllm-project/vllm/pull/7823

New Contributors
* jischein made their first contribution in https://github.com/vllm-project/vllm/pull/7129
* kpapis made their first contribution in https://github.com/vllm-project/vllm/pull/7198
* xiaobochen123 made their first contribution in https://github.com/vllm-project/vllm/pull/7193
* Atllkks10 made their first contribution in https://github.com/vllm-project/vllm/pull/7227
* stas00 made their first contribution in https://github.com/vllm-project/vllm/pull/7243
* maxdebayser made their first contribution in https://github.com/vllm-project/vllm/pull/7217
* NiuBlibing made their first contribution in https://github.com/vllm-project/vllm/pull/7288
* lsy323 made their first contribution in https://github.com/vllm-project/vllm/pull/7005
* pooyadavoodi made their first contribution in https://github.com/vllm-project/vllm/pull/7132
* sfc-gh-mkeralapura made their first contribution in https://github.com/vllm-project/vllm/pull/7089
* jon-chuang made their first contribution in https://github.com/vllm-project/vllm/pull/7208
* aw632 made their first contribution in https://github.com/vllm-project/vllm/pull/7435
* petersalas made their first contribution in https://github.com/vllm-project/vllm/pull/7446
* kylesayrs made their first contribution in https://github.com/vllm-project/vllm/pull/7277
* QwertyJack made their first contribution in https://github.com/vllm-project/vllm/pull/7504
* wallashss made their first contribution in https://github.com/vllm-project/vllm/pull/7424
* pavanjava made their first contribution in https://github.com/vllm-project/vllm/pull/6973
* PHILO-HE made their first contribution in https://github.com/vllm-project/vllm/pull/7182
* gnpinkert made their first contribution in https://github.com/vllm-project/vllm/pull/7453
* gongdao123 made their first contribution in https://github.com/vllm-project/vllm/pull/7513
* charlifu made their first contribution in https://github.com/vllm-project/vllm/pull/7210
* metasyn made their first contribution in https://github.com/vllm-project/vllm/pull/7612
* mrbesher made their first contribution in https://github.com/vllm-project/vllm/pull/7603
* alex-jw-brooks made their first contribution in https://github.com/vllm-project/vllm/pull/7475
* a-ys made their first contribution in https://github.com/vllm-project/vllm/pull/7664
* zijian-hu made their first contribution in https://github.com/vllm-project/vllm/pull/5724
* jianyizh made their first contribution in https://github.com/vllm-project/vllm/pull/7670
* learninmou made their first contribution in https://github.com/vllm-project/vllm/pull/7509
* brian14708 made their first contribution in https://github.com/vllm-project/vllm/pull/7734
* sfc-gh-zhwang made their first contribution in https://github.com/vllm-project/vllm/pull/7708

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5

0.5.4

Not secure
Highlights

Model Support
* Enhanced pipeline parallelism support for DeepSeek v2 (6519), Qwen (6974), Qwen2 (6924), and Nemotron (6863)
* Enhanced vision language model support for InternVL2 (6514, 7067), BLIP-2 (5920), and MiniCPM-V (4087, 7122)
* Added H2O Danube3-4b (6451)
* Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (6611)

Hardware Support
* TPU enhancements: collective communication, tensor parallelism for the async engine, and faster compilation (6891, 6933, 6856, 6813, 5871)
* Intel CPU: enabled multiprocessing and tensor parallelism (6125)

Performance
We are continuing our push to improve performance. Each of the following PRs contributed improvements, and we anticipate more in the next release.

* Separated the OpenAI server's HTTP request handling from the model inference loop using `zeromq`, bringing a 20% improvement in time to first token and a 2x improvement in inter-token latency. (6883) A minimal sketch of the decoupling pattern follows this list.
* Used Python's native `array` data structure to speed up padding, yielding a 15% throughput improvement in large-batch scenarios. (6779) A toy comparison also follows this list.
* Reduced unnecessary compute when `logprobs=None`, cutting the latency of retrieving log probabilities from ~30 ms to ~5 ms in large-batch scenarios. (6532)
* Optimized the `get_seqs` function, bringing a 2% throughput improvement. (7051)
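
To make the `zeromq` change above more concrete, here is a minimal sketch of the general frontend/backend decoupling pattern, not vLLM's actual implementation; the IPC address and message format are illustrative assumptions.

```python
# Minimal sketch of decoupling an HTTP frontend from a model inference loop
# with ZeroMQ (pyzmq). This is not vLLM's implementation; the IPC address and
# message format below are illustrative assumptions.
import zmq

IPC_ADDR = "ipc:///tmp/inference.sock"  # hypothetical socket address

def inference_loop() -> None:
    """Backend process: owns the model and serves requests one at a time."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(IPC_ADDR)
    while True:
        request = sock.recv_json()            # e.g. {"prompt": "..."}
        text = f"echo: {request['prompt']}"   # stand-in for model.generate(...)
        sock.send_json({"text": text})

def handle_http_request(prompt: str) -> str:
    """Frontend process: parses HTTP, then delegates work over the socket."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect(IPC_ADDR)
    sock.send_json({"prompt": prompt})
    return sock.recv_json()["text"]
```

Keeping request parsing and serialization out of the process that runs the GPU loop is the gist of the change credited with the latency improvements above.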
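
The padding change is easier to picture with a toy comparison; the snippet below only illustrates the idea of padding token-id sequences with the compact `array` type instead of plain Python lists, and is not the code from the PR.

```python
# Toy illustration of padding token-id sequences with array.array instead of
# Python lists; this shows the idea behind the change, not the vLLM code itself.
from array import array

def pad_with_list(token_ids: list[int], max_len: int, pad_id: int = 0) -> list[int]:
    # Plain-list padding: every element is a full Python int object.
    return token_ids + [pad_id] * (max_len - len(token_ids))

def pad_with_array(token_ids: list[int], max_len: int, pad_id: int = 0) -> array:
    # array.array stores the ids in a compact C buffer, which is cheaper to
    # extend and to hand off to downstream tensor code.
    padded = array("l", token_ids)
    padded.extend([pad_id] * (max_len - len(token_ids)))
    return padded

print(pad_with_array([1, 2, 3], 8))  # array('l', [1, 2, 3, 0, 0, 0, 0, 0])
```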

Production Features
* Enhancements to speculative decoding: FlashInfer in DraftModelRunner (6926), observability (6963), and benchmarks (6964); a configuration sketch follows this list
* Refactored the punica kernel based on Triton (5036)
* Support for guided decoding with the offline `LLM` API (6878)
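
As a rough illustration of how draft-model speculative decoding is configured offline, consider the sketch below; the model names are placeholders and the argument names are assumptions based on the options commonly available around this release, so they may differ in other versions.

```python
# Hedged sketch: draft-model speculative decoding via the offline LLM API.
# Model names are placeholders; argument names are assumptions that may differ
# across vLLM versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",             # target model (placeholder)
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # draft model (placeholder)
    num_speculative_tokens=5,     # tokens proposed by the draft model per step
    use_v2_block_manager=True,    # spec decode required the v2 block manager around this release
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```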

Quantization
* Support for W4A8 quantization (5218)
* Tuned FP8 and INT8 kernels for Ada Lovelace and SM75 (T4) (6677, 6996, 6848)
* Support for reading bitsandbytes pre-quantized models (5753); a loading sketch follows this list
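
The bitsandbytes path can be exercised from the offline API roughly as below; the checkpoint name is a placeholder, and the `quantization`/`load_format` values are assumptions about how the integration is enabled, so check the documentation for your version.

```python
# Hedged sketch: loading a bitsandbytes pre-quantized checkpoint offline.
# The checkpoint name is a placeholder; the quantization/load_format values are
# assumptions about how the bitsandbytes path is enabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/llama-3-8b-bnb-4bit",   # placeholder pre-quantized checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,                    # bnb temporarily requires eager mode (6846)
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```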


What's Changed
* [Docs] Announce llama3.1 support by WoosukKwon in https://github.com/vllm-project/vllm/pull/6688
* [doc][distributed] fix doc argument order by youkaichao in https://github.com/vllm-project/vllm/pull/6691
* [Bugfix] Fix a log error in chunked prefill by WoosukKwon in https://github.com/vllm-project/vllm/pull/6694
* [BugFix] Fix RoPE error in Llama 3.1 by WoosukKwon in https://github.com/vllm-project/vllm/pull/6693
* Bump version to 0.5.3.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/6696
* [Misc] Add ignored layers for `fp8` quantization by mgoin in https://github.com/vllm-project/vllm/pull/6657
* [Frontend] Add Usage data in each chunk for chat_serving. 6540 by yecohn in https://github.com/vllm-project/vllm/pull/6652
* [Model] Pipeline Parallel Support for DeepSeek v2 by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6519
* Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon by ywang96 in https://github.com/vllm-project/vllm/pull/6690
* [build] relax wheel size limit by youkaichao in https://github.com/vllm-project/vllm/pull/6704
* [CI] Add smoke test for non-uniform AutoFP8 quantization by mgoin in https://github.com/vllm-project/vllm/pull/6702
* [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by tdoublep in https://github.com/vllm-project/vllm/pull/6645
* [bitsandbytes]: support read bnb pre-quantized model by thesues in https://github.com/vllm-project/vllm/pull/5753
* [Bugfix] fix flashinfer cudagraph capture for PP by SolitaryThinker in https://github.com/vllm-project/vllm/pull/6708
* [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by njhill in https://github.com/vllm-project/vllm/pull/6714
* [Bugfix] Fix token padding for chameleon by ywang96 in https://github.com/vllm-project/vllm/pull/6724
* [Docs][ROCm] Detailed instructions to build from source by WoosukKwon in https://github.com/vllm-project/vllm/pull/6680
* [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/6711
* [Bugfix] Fix modelscope compatibility issue by liuyhwangyh in https://github.com/vllm-project/vllm/pull/6730
* Adding f-string to validation error which is missing by luizanao in https://github.com/vllm-project/vllm/pull/6748
* [Bugfix] Fix speculative decode seeded test by njhill in https://github.com/vllm-project/vllm/pull/6743
* [Bugfix] Miscalculated latency led to inaccurate time_to_first_token_seconds by AllenDou in https://github.com/vllm-project/vllm/pull/6686
* [Frontend] split run_server into build_server and run_server by dtrifiro in https://github.com/vllm-project/vllm/pull/6740
* [Kernels] Add fp8 support to `reshape_and_cache_flash` by Yard1 in https://github.com/vllm-project/vllm/pull/6667
* [Core] Tweaks to model runner/input builder developer APIs by Yard1 in https://github.com/vllm-project/vllm/pull/6712
* [Bugfix] Bump transformers to 4.43.2 by mgoin in https://github.com/vllm-project/vllm/pull/6752
* [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by hongxiayang in https://github.com/vllm-project/vllm/pull/6754
* [core][distributed] fix zmq hang by youkaichao in https://github.com/vllm-project/vllm/pull/6759
* [Frontend] Represent tokens with identifiable strings by ezliu in https://github.com/vllm-project/vllm/pull/6626
* [Model] Adding support for MiniCPM-V by HwwwwwwwH in https://github.com/vllm-project/vllm/pull/4087
* [Bugfix] Fix decode tokens w. CUDA graph by comaniac in https://github.com/vllm-project/vllm/pull/6757
* [Bugfix] Fix awq_marlin and gptq_marlin flags by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6745
* [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by CatherineSue in https://github.com/vllm-project/vllm/pull/6755
* [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by HwwwwwwwH in https://github.com/vllm-project/vllm/pull/6787
* [ Misc ] `fp8-marlin` channelwise via `compressed-tensors` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6524
* [Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints by mgoin in https://github.com/vllm-project/vllm/pull/6761
* [Bugfix] Add synchronize to prevent possible data race by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6788
* [Doc] Add documentations for nightly benchmarks by KuntaiDu in https://github.com/vllm-project/vllm/pull/6412
* [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by LucasWilkinson in https://github.com/vllm-project/vllm/pull/6798
* [doc][distributed] improve multinode serving doc by youkaichao in https://github.com/vllm-project/vllm/pull/6804
* [Docs] Publish 5th meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/6799
* [Core] Fix ray forward_dag error mssg by rkooo567 in https://github.com/vllm-project/vllm/pull/6792
* [ci][distributed] fix flaky tests by youkaichao in https://github.com/vllm-project/vllm/pull/6806
* [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by khluu in https://github.com/vllm-project/vllm/pull/6810
* Fix ReplicatedLinear weight loading by qingquansong in https://github.com/vllm-project/vllm/pull/6793
* [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by eaplatanios in https://github.com/vllm-project/vllm/pull/6770
* [Core] Use array to speedup padding by peng1999 in https://github.com/vllm-project/vllm/pull/6779
* [doc][debugging] add known issues for hangs by youkaichao in https://github.com/vllm-project/vllm/pull/6816
* [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by mgoin in https://github.com/vllm-project/vllm/pull/6611
* [Bugfix][Kernel] Promote another index to int64_t by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6838
* [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by WoosukKwon in https://github.com/vllm-project/vllm/pull/6811
* [Misc][TPU] Support TPU in initialize_ray_cluster by WoosukKwon in https://github.com/vllm-project/vllm/pull/6812
* [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/6125
* [Doc] Add Nemotron to supported model docs by mgoin in https://github.com/vllm-project/vllm/pull/6843
* [Doc] Update SkyPilot doc for wrong indents and instructions for update service by Michaelvll in https://github.com/vllm-project/vllm/pull/4283
* Update README.md by gurpreet-dhami in https://github.com/vllm-project/vllm/pull/6847
* enforce eager mode with bnb quantization temporarily by chenqianfzh in https://github.com/vllm-project/vllm/pull/6846
* [TPU] Support collective communications in XLA devices by WoosukKwon in https://github.com/vllm-project/vllm/pull/6813
* [Frontend] Factor out code for running uvicorn by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6828
* [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by LucasWilkinson in https://github.com/vllm-project/vllm/pull/6852
* [Bugfix]: Fix Tensorizer test failures by sangstar in https://github.com/vllm-project/vllm/pull/6835
* [ROCm] Upgrade PyTorch nightly version by WoosukKwon in https://github.com/vllm-project/vllm/pull/6845
* [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by omrishiv in https://github.com/vllm-project/vllm/pull/6844
* [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by tomeras91 in https://github.com/vllm-project/vllm/pull/6784
* [Model] H2O Danube3-4b by g-eoj in https://github.com/vllm-project/vllm/pull/6451
* [Hardware][TPU] Implement tensor parallelism with Ray by WoosukKwon in https://github.com/vllm-project/vllm/pull/5871
* [Doc] Add missing mock import to docs `conf.py` by hmellor in https://github.com/vllm-project/vllm/pull/6834
* [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6802
* [Misc][VLM][Doc] Consolidate offline examples for vision language models by ywang96 in https://github.com/vllm-project/vllm/pull/6858
* [Bugfix] Fix VLM example typo by ywang96 in https://github.com/vllm-project/vllm/pull/6859
* [bugfix] make args.stream work by WrRan in https://github.com/vllm-project/vllm/pull/6831
* [CI/Build][Doc] Update CI and Doc for VLM example changes by ywang96 in https://github.com/vllm-project/vllm/pull/6860
* [Model] Initial support for BLIP-2 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5920
* [Docs] Add RunLLM chat widget by cw75 in https://github.com/vllm-project/vllm/pull/6857
* [TPU] Reduce compilation time & Upgrade PyTorch XLA version by WoosukKwon in https://github.com/vllm-project/vllm/pull/6856
* [Kernel] Increase precision of GPTQ/AWQ Marlin kernel by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6795
* Add Nemotron to PP_SUPPORTED_MODELS by mgoin in https://github.com/vllm-project/vllm/pull/6863
* [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 by zeyugao in https://github.com/vllm-project/vllm/pull/6871
* [Model] Initialize support for InternVL2 series models by Isotr0py in https://github.com/vllm-project/vllm/pull/6514
* [Kernel] Tuned FP8 Kernels for Ada Lovelace by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6677
* [Core] Reduce unnecessary compute when logprobs=None by peng1999 in https://github.com/vllm-project/vllm/pull/6532
* [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6901
* [TPU] Add TPU tensor parallelism to async engine by etwk in https://github.com/vllm-project/vllm/pull/6891
* [Bugfix] Allow vllm to still work if triton is not installed. by tdoublep in https://github.com/vllm-project/vllm/pull/6786
* [Frontend] New `allowed_token_ids` decoding request parameter by njhill in https://github.com/vllm-project/vllm/pull/6753
* [Kernel] Remove unused variables in awq/gemm_kernels.cu by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6908
* [ci] GHA workflow to remove ready label upon "/notready" comment by khluu in https://github.com/vllm-project/vllm/pull/6921
* [Kernel] Fix marlin divide-by-zero warnings by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6904
* [Kernel] Tuned int8 kernels for Ada Lovelace by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6848
* [TPU] Fix greedy decoding by WoosukKwon in https://github.com/vllm-project/vllm/pull/6933
* [Bugfix] Fix PaliGemma MMP by ywang96 in https://github.com/vllm-project/vllm/pull/6930
* [Doc] Super tiny fix doc typo by fzyzcjy in https://github.com/vllm-project/vllm/pull/6949
* [BugFix] Fix use of per-request seed with pipeline parallel by njhill in https://github.com/vllm-project/vllm/pull/6698
* [Kernel] Squash a few more warnings by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6914
* [OpenVINO] Updated OpenVINO requirements and build docs by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/6948
* [Bugfix] Fix tensorizer memory profiling bug during testing by sangstar in https://github.com/vllm-project/vllm/pull/6881
* [Kernel] Remove scaled_fp8_quant kernel padding footgun by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6842
* [core][misc] improve free_finished_seq_groups by youkaichao in https://github.com/vllm-project/vllm/pull/6865
* [Build] Temporarily Disable Kernels and LoRA tests by simon-mo in https://github.com/vllm-project/vllm/pull/6961
* [Nightly benchmarking suite] Remove pkill python from run benchmark suite by cadedaniel in https://github.com/vllm-project/vllm/pull/6965
* [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists by cadedaniel in https://github.com/vllm-project/vllm/pull/6706
* [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding by cadedaniel in https://github.com/vllm-project/vllm/pull/6964
* [mypy] Enable following imports for some directories by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6681
* [Bugfix] Fix broadcasting logic for `multi_modal_kwargs` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6836
* [CI/Build] Fix mypy errors by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6968
* [Bugfix][TPU] Set readonly=True for non-root devices by WoosukKwon in https://github.com/vllm-project/vllm/pull/6980
* [Bugfix] fix logit processor exceed vocab size issue by FeiDeng in https://github.com/vllm-project/vllm/pull/6927
* Support W4A8 quantization for vllm by HandH1998 in https://github.com/vllm-project/vllm/pull/5218
* [Bugfix] Clean up MiniCPM-V by HwwwwwwwH in https://github.com/vllm-project/vllm/pull/6939
* [Bugfix] Fix feature size calculation for LLaVA-NeXT by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6982
* [Model] use FusedMoE layer in Jamba by avshalomman in https://github.com/vllm-project/vllm/pull/6935
* [MISC] Introduce pipeline parallelism partition strategies by comaniac in https://github.com/vllm-project/vllm/pull/6920
* [Bugfix] Support cpu offloading with quant_method.process_weights_after_loading by mgoin in https://github.com/vllm-project/vllm/pull/6960
* [Kernel] Enable FP8 Cutlass for Ada Lovelace by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6950
* [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6996
* [Misc] Add compressed-tensors to optimized quant list by mgoin in https://github.com/vllm-project/vllm/pull/7006
* Revert "[Frontend] Factor out code for running uvicorn" by simon-mo in https://github.com/vllm-project/vllm/pull/7012
* [Kernel][RFC] Refactor the punica kernel based on Triton by jeejeelee in https://github.com/vllm-project/vllm/pull/5036
* [Model] Pipeline parallel support for Qwen2 by xuyi in https://github.com/vllm-project/vllm/pull/6924
* [Bugfix][TPU] Do not use torch.Generator for TPUs by WoosukKwon in https://github.com/vllm-project/vllm/pull/6981
* [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6758
* PP comm optimization: replace send with partial send + allgather by aurickq in https://github.com/vllm-project/vllm/pull/6695
* [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user by zifeitong in https://github.com/vllm-project/vllm/pull/6954
* [core][scheduler] simplify and improve scheduler by youkaichao in https://github.com/vllm-project/vllm/pull/6867
* [Build/CI] Fixing Docker Hub quota issue. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/7043
* [CI/Build] Update torch to 2.4 by SageMoore in https://github.com/vllm-project/vllm/pull/6951
* [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm by Isotr0py in https://github.com/vllm-project/vllm/pull/6992
* [CI/Build] Remove sparseml requirement from testing by mgoin in https://github.com/vllm-project/vllm/pull/7037
* [Bugfix] Lower gemma's unloaded_params exception to warning by mgoin in https://github.com/vllm-project/vllm/pull/7002
* [Models] Support Qwen model with PP by andoorve in https://github.com/vllm-project/vllm/pull/6974
* Update run-amd-test.sh by okakarpa in https://github.com/vllm-project/vllm/pull/7044
* [Misc] Support attention logits soft-capping with flash-attn by WoosukKwon in https://github.com/vllm-project/vllm/pull/7022
* [CI/Build][Bugfix] Fix CUTLASS header-only line by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7034
* [Performance] Optimize `get_seqs` by WoosukKwon in https://github.com/vllm-project/vllm/pull/7051
* [Kernel] Fix input for flashinfer prefill wrapper. by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7008
* [mypy] Speed up mypy checking by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7056
* [ci][distributed] try to fix pp test by youkaichao in https://github.com/vllm-project/vllm/pull/7054
* Fix tracing.py by bong-furiosa in https://github.com/vllm-project/vllm/pull/7065
* [cuda][misc] remove error_on_invalid_device_count_status by youkaichao in https://github.com/vllm-project/vllm/pull/7069
* [Core] Comment out unused code in sampler by peng1999 in https://github.com/vllm-project/vllm/pull/7023
* [Hardware][Intel CPU] Update torch 2.4.0 for CPU backend by DamonFool in https://github.com/vllm-project/vllm/pull/6931
* [ci] set timeout for test_oot_registration.py by youkaichao in https://github.com/vllm-project/vllm/pull/7082
* [CI/Build] Add support for Python 3.12 by mgoin in https://github.com/vllm-project/vllm/pull/7035
* [Misc] Disambiguate quantized types via a new ScalarType by LucasWilkinson in https://github.com/vllm-project/vllm/pull/6396
* [Core] Pipeline parallel with Ray ADAG by ruisearch42 in https://github.com/vllm-project/vllm/pull/6837
* [Misc] Revive to use loopback address for driver IP by ruisearch42 in https://github.com/vllm-project/vllm/pull/7091
* [misc] add a flag to enable compile by youkaichao in https://github.com/vllm-project/vllm/pull/7092
* [ Frontend ] Multiprocessing for OpenAI Server with `zeromq` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6883
* [ci][distributed] shorten wait time if server hangs by youkaichao in https://github.com/vllm-project/vllm/pull/7098
* [Frontend] Factor out chat message parsing by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7055
* [ci][distributed] merge distributed test commands by youkaichao in https://github.com/vllm-project/vllm/pull/7097
* [ci][distributed] disable ray dag tests by youkaichao in https://github.com/vllm-project/vllm/pull/7099
* [Model] Refactor and decouple weight loading logic for InternVL2 model by Isotr0py in https://github.com/vllm-project/vllm/pull/7067
* [Bugfix] Fix block table for seqs that have prefix cache hits by zachzzc in https://github.com/vllm-project/vllm/pull/7018
* [LoRA] ReplicatedLinear support LoRA by jeejeelee in https://github.com/vllm-project/vllm/pull/7081
* [CI] Temporarily turn off H100 performance benchmark by KuntaiDu in https://github.com/vllm-project/vllm/pull/7104
* [ci][test] finalize fork_new_process_for_each_test by youkaichao in https://github.com/vllm-project/vllm/pull/7114
* [Frontend] Warn if user `max_model_len` is greater than derived `max_model_len` by fialhocoelho in https://github.com/vllm-project/vllm/pull/7080
* Support for guided decoding for offline LLM by kevinbu233 in https://github.com/vllm-project/vllm/pull/6878
* [misc] add zmq in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/7119
* [core][misc] simplify output processing with shortcut for non-parallel sampling and non-beam search use case by youkaichao in https://github.com/vllm-project/vllm/pull/7117
* [Model] Refactor MiniCPMV by jeejeelee in https://github.com/vllm-project/vllm/pull/7020
* [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator by tdoublep in https://github.com/vllm-project/vllm/pull/7105
* [misc][distributed] improve libcudart.so finding by youkaichao in https://github.com/vllm-project/vllm/pull/7127
* Clean up remaining Punica C information by jeejeelee in https://github.com/vllm-project/vllm/pull/7027
* [Model] Add multi-image support for minicpmv offline inference by HwwwwwwwH in https://github.com/vllm-project/vllm/pull/7122
* [Frontend] Reapply "Factor out code for running uvicorn" by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7095
* [Model] SiglipVisionModel ported from transformers by ChristopherCho in https://github.com/vllm-project/vllm/pull/6942
* [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification by cadedaniel in https://github.com/vllm-project/vllm/pull/6963
* [SpecDecode] Support FlashInfer in DraftModelRunner by bong-furiosa in https://github.com/vllm-project/vllm/pull/6926
* [BugFix] Use IP4 localhost form for zmq bind by njhill in https://github.com/vllm-project/vllm/pull/7163
* [BugFix] Use args.trust_remote_code by VastoLorde95 in https://github.com/vllm-project/vllm/pull/7121
* [Misc] Fix typo in GroupCoordinator.recv() by ruisearch42 in https://github.com/vllm-project/vllm/pull/7167
* [Kernel] Update CUTLASS to 3.5.1 by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7085
* [CI/Build] Suppress divide-by-zero and missing return statement warnings by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7001
* [Bugfix][CI/Build] Fix CUTLASS FetchContent by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7171
* bump version to v0.5.4 by simon-mo in https://github.com/vllm-project/vllm/pull/7139

New Contributors
* yecohn made their first contribution in https://github.com/vllm-project/vllm/pull/6652
* thesues made their first contribution in https://github.com/vllm-project/vllm/pull/5753
* luizanao made their first contribution in https://github.com/vllm-project/vllm/pull/6748
* ezliu made their first contribution in https://github.com/vllm-project/vllm/pull/6626
* HwwwwwwwH made their first contribution in https://github.com/vllm-project/vllm/pull/4087
* LucasWilkinson made their first contribution in https://github.com/vllm-project/vllm/pull/6798
* qingquansong made their first contribution in https://github.com/vllm-project/vllm/pull/6793
* eaplatanios made their first contribution in https://github.com/vllm-project/vllm/pull/6770
* gurpreet-dhami made their first contribution in https://github.com/vllm-project/vllm/pull/6847
* omrishiv made their first contribution in https://github.com/vllm-project/vllm/pull/6844
* cw75 made their first contribution in https://github.com/vllm-project/vllm/pull/6857
* zeyugao made their first contribution in https://github.com/vllm-project/vllm/pull/6871
* etwk made their first contribution in https://github.com/vllm-project/vllm/pull/6891
* fzyzcjy made their first contribution in https://github.com/vllm-project/vllm/pull/6949
* FeiDeng made their first contribution in https://github.com/vllm-project/vllm/pull/6927
* HandH1998 made their first contribution in https://github.com/vllm-project/vllm/pull/5218
* xuyi made their first contribution in https://github.com/vllm-project/vllm/pull/6924
* bong-furiosa made their first contribution in https://github.com/vllm-project/vllm/pull/7065
* zachzzc made their first contribution in https://github.com/vllm-project/vllm/pull/7018
* fialhocoelho made their first contribution in https://github.com/vllm-project/vllm/pull/7080
* ChristopherCho made their first contribution in https://github.com/vllm-project/vllm/pull/6942
* VastoLorde95 made their first contribution in https://github.com/vllm-project/vllm/pull/7121

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.3...v0.5.4

0.5.3.post1

Not secure
Highlights
* We fixed a configuration incompatibility between vLLM (which was tested against a pre-release version) and the published Meta Llama 3.1 weights (6693)

What's Changed
* [Docs] Announce llama3.1 support by WoosukKwon in https://github.com/vllm-project/vllm/pull/6688
* [doc][distributed] fix doc argument order by youkaichao in https://github.com/vllm-project/vllm/pull/6691
* [Bugfix] Fix a log error in chunked prefill by WoosukKwon in https://github.com/vllm-project/vllm/pull/6694
* [BugFix] Fix RoPE error in Llama 3.1 by WoosukKwon in https://github.com/vllm-project/vllm/pull/6693
* Bump version to 0.5.3.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/6696


**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.3...v0.5.3.post1
