## Highlights

### Performance Update
* We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (7000, 7387, 7452, 7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can set `--num-scheduler-steps 8` as a parameter to the API server (via `vllm serve`) or `AsyncLLMEngine`; see the sketch after this list. We are working on expanding coverage to the `LLM` class and aim to turn it on by default.
* Various enhancements:
    * Use the flashinfer sampling kernel when available, leading to a 7% decoding throughput speedup (7137)
    * Reduce Python allocations, leading to a 24% throughput speedup (7162, 7364)
    * Improvements to the zeromq-based decoupled frontend (7570, 7716, 7484)
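The sketch below shows one way to enable this mode programmatically. It assumes `AsyncEngineArgs` exposes a `num_scheduler_steps` field mirroring the `--num-scheduler-steps` CLI flag; the model name is a placeholder.

```python
# Sketch: enable multi-step scheduling on AsyncLLMEngine.
# Assumes AsyncEngineArgs accepts num_scheduler_steps, the programmatic
# counterpart of `vllm serve <model> --num-scheduler-steps 8`.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    num_scheduler_steps=8,  # schedule 8 GPU steps per scheduler call
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```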
### Model Support
* Support Jamba 1.5 (7415, 7601, 6739)
* Support for the first audio model, `UltravoxModel` (7615, 7446)
* Improvements to vision models:
    * Support image embeddings as input (6613)
    * Support SigLIP encoder and alternative decoders for LLaVA models (7153)
* Support loading GGUF models (5191) with tensor parallelism (7520); see the sketch after this list
* Progress on encoder-decoder models: support for serving encoder/decoder models (7258) and architecture for cross-attention (4942)
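A minimal sketch of GGUF loading with tensor parallelism follows; the `.gguf` path and tokenizer repo are placeholders, and passing the original HF tokenizer alongside the quantized file is an assumption for this example, not a stated requirement.

```python
# Sketch: load a local GGUF checkpoint and shard it across 2 GPUs.
# File path and tokenizer name are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model-q4_k_m.gguf",        # local GGUF file (placeholder)
    tokenizer="meta-llama/Meta-Llama-3.1-8B",  # original HF tokenizer (assumed)
    tensor_parallel_size=2,                    # tensor parallelism across 2 GPUs
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```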
### Hardware Support
* AMD: add an FP8 linear layer for ROCm (7210)
* Enhancements to TPU support: load-time W8A16 quantization (7005), optimized RoPE (7635), and support for multi-host inference (7457)
* Intel: various refactoring of the worker, executor, and model runner (7686, 7712)
### Others
* Optimize prefix caching performance (7193)
* Speculative decoding
    * Use the target model's max length as the default for the draft model (7706)
    * EAGLE implementation with a Top-1 proposer (6830)
* Entrypoints
    * A new `chat` method in the `LLM` class (5049); see the sketch after this list
    * Support embeddings in the run_batch API (7132)
    * Support `prompt_logprobs` in Chat Completion (7453)
* Quantizations
    * Expand MoE weight loading and add a fused Marlin MoE kernel (7527)
    * Machete, a Hopper-optimized mixed-precision linear kernel (7174)
* `torch.compile`: register custom ops for kernels (7591, 7594, 7536)
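A short sketch of the new offline `chat` entrypoint, assuming it accepts OpenAI-style role/content messages plus an optional `SamplingParams`; the model name and messages are placeholders.

```python
# Sketch: offline chat completion via the new LLM.chat method.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder model
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does multi-step scheduling do?"},
]
outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=64))
print(outputs[0].outputs[0].text)
```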
## What's Changed
* [ci][frontend] deduplicate tests by youkaichao in https://github.com/vllm-project/vllm/pull/7101
* [Doc] [SpecDecode] Update MLPSpeculator documentation by tdoublep in https://github.com/vllm-project/vllm/pull/7100
* [Bugfix] Specify device when loading LoRA and embedding tensors by jischein in https://github.com/vllm-project/vllm/pull/7129
* [MISC] Use non-blocking transfer in prepare_input by comaniac in https://github.com/vllm-project/vllm/pull/7172
* [Core] Support loading GGUF model by Isotr0py in https://github.com/vllm-project/vllm/pull/5191
* [Build] Add initial conditional testing spec by simon-mo in https://github.com/vllm-project/vllm/pull/6841
* [LoRA] Relax LoRA condition by jeejeelee in https://github.com/vllm-project/vllm/pull/7146
* [Model] Support SigLIP encoder and alternative decoders for LLaVA models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7153
* [BugFix] Fix DeepSeek remote code by dsikka in https://github.com/vllm-project/vllm/pull/7178
* [ BugFix ] Fix ZMQ when `VLLM_PORT` is set by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7205
* [Bugfix] add gguf dependency by kpapis in https://github.com/vllm-project/vllm/pull/7198
* [SpecDecode] [Minor] Fix spec decode sampler tests by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7183
* [Kernel] Add per-tensor and per-token AZP epilogues by ProExpertProg in https://github.com/vllm-project/vllm/pull/5941
* [Core] Optimize evictor-v2 performance by xiaobochen123 in https://github.com/vllm-project/vllm/pull/7193
* [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by afeldman-nm in https://github.com/vllm-project/vllm/pull/4942
* [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by mgoin in https://github.com/vllm-project/vllm/pull/7225
* [BugFix] Overhaul async request cancellation by njhill in https://github.com/vllm-project/vllm/pull/7111
* [Doc] Mock new dependencies for documentation by ywang96 in https://github.com/vllm-project/vllm/pull/7245
* [BUGFIX]: top_k is expected to be an integer. by Atllkks10 in https://github.com/vllm-project/vllm/pull/7227
* [Frontend] Gracefully handle missing chat template and fix CI failure by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7238
* [distributed][misc] add specialized method for cuda platform by youkaichao in https://github.com/vllm-project/vllm/pull/7249
* [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` by dsikka in https://github.com/vllm-project/vllm/pull/5874
* [ BugFix ] Move `zmq` frontend to IPC instead of TCP by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7222
* Fixes typo in function name by rafvasq in https://github.com/vllm-project/vllm/pull/7275
* [Bugfix] Fix input processor for InternVL2 model by Isotr0py in https://github.com/vllm-project/vllm/pull/7164
* [OpenVINO] migrate to latest dependencies versions by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7251
* [Doc] add online speculative decoding example by stas00 in https://github.com/vllm-project/vllm/pull/7243
* [BugFix] Fix frontend multiprocessing hang by maxdebayser in https://github.com/vllm-project/vllm/pull/7217
* [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by mgoin in https://github.com/vllm-project/vllm/pull/7219
* [ci] Make building wheels per commit optional by khluu in https://github.com/vllm-project/vllm/pull/7278
* [Bugfix] Fix gptq failure on T4s by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7264
* [FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional by njhill in https://github.com/vllm-project/vllm/pull/7282
* [Doc] Update supported_hardware.rst by mgoin in https://github.com/vllm-project/vllm/pull/7276
* [Kernel] Fix Flashinfer Correctness by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7284
* [Misc] Fix typos in scheduler.py by ruisearch42 in https://github.com/vllm-project/vllm/pull/7285
* [Frontend] remove max_num_batched_tokens limit for lora by NiuBlibing in https://github.com/vllm-project/vllm/pull/7288
* [Bugfix] Fix LoRA with PP by andoorve in https://github.com/vllm-project/vllm/pull/7292
* [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by jeejeelee in https://github.com/vllm-project/vllm/pull/7273
* [Bugfix][Kernel] Increased atol to fix failing tests by ProExpertProg in https://github.com/vllm-project/vllm/pull/7305
* [Frontend] Kill the server on engine death by joerunde in https://github.com/vllm-project/vllm/pull/6594
* [Bugfix][fast] Fix the get_num_blocks_touched logic by zachzzc in https://github.com/vllm-project/vllm/pull/6849
* [Doc] Put collect_env issue output in a <detail> block by mgoin in https://github.com/vllm-project/vllm/pull/7310
* [CI/Build] Dockerfile.cpu improvements by dtrifiro in https://github.com/vllm-project/vllm/pull/7298
* [Bugfix] Fix new Llama3.1 GGUF model loading by Isotr0py in https://github.com/vllm-project/vllm/pull/7269
* [Misc] Temporarily resolve the error of BitAndBytes by jeejeelee in https://github.com/vllm-project/vllm/pull/7308
* Add Skywork AI as Sponsor by simon-mo in https://github.com/vllm-project/vllm/pull/7314
* [TPU] Add Load-time W8A16 quantization for TPU Backend by lsy323 in https://github.com/vllm-project/vllm/pull/7005
* [Core] Support serving encoder/decoder models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7258
* [TPU] Fix dockerfile.tpu by WoosukKwon in https://github.com/vllm-project/vllm/pull/7331
* [Performance] Optimize e2e overheads: Reduce python allocations by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7162
* [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by tjohnson31415 in https://github.com/vllm-project/vllm/pull/7218
* [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by SolitaryThinker in https://github.com/vllm-project/vllm/pull/6971
* [Core] Streamline stream termination in `AsyncLLMEngine` by njhill in https://github.com/vllm-project/vllm/pull/7336
* [Model][Jamba] Mamba cache single buffer by mzusman in https://github.com/vllm-project/vllm/pull/6739
* [VLM][Doc] Add `stop_token_ids` to InternVL example by Isotr0py in https://github.com/vllm-project/vllm/pull/7354
* [Performance] e2e overheads reduction: Small followup diff by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7364
* [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7360
* [Frontend] Support embeddings in the run_batch API by pooyadavoodi in https://github.com/vllm-project/vllm/pull/7132
* [Bugfix] Fix ITL recording in serving benchmark by ywang96 in https://github.com/vllm-project/vllm/pull/7372
* [Core] Add span metrics for model_forward, scheduler and sampler time by sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7089
* [Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models by dsikka in https://github.com/vllm-project/vllm/pull/7376
* [Misc] Add numpy implementation of `compute_slot_mapping` by Yard1 in https://github.com/vllm-project/vllm/pull/7377
* [Core] Fix edge case in chunked prefill + block manager v2 by cadedaniel in https://github.com/vllm-project/vllm/pull/7380
* [Bugfix] Fix phi3v batch inference when images have different aspect ratio by Isotr0py in https://github.com/vllm-project/vllm/pull/7392
* [TPU] Use mark_dynamic to reduce compilation time by WoosukKwon in https://github.com/vllm-project/vllm/pull/7340
* Updating LM Format Enforcer version to v0.10.6 by noamgat in https://github.com/vllm-project/vllm/pull/7189
* [core] [2/N] refactor worker_base input preparation for multi-step by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7387
* [CI/Build] build on empty device for better dev experience by tomeras91 in https://github.com/vllm-project/vllm/pull/4773
* [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by tomeras91 in https://github.com/vllm-project/vllm/pull/7403
* [misc] add commit id in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/7405
* [Docs] Update readme by simon-mo in https://github.com/vllm-project/vllm/pull/7316
* [CI/Build] Minor refactoring for vLLM assets by ywang96 in https://github.com/vllm-project/vllm/pull/7407
* [Kernel] Flashinfer correctness fix for v0.1.3 by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7319
* [Core][VLM] Support image embeddings as input by ywang96 in https://github.com/vllm-project/vllm/pull/6613
* [Frontend] Disallow passing `model` as both argument and option by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7347
* [CI/Build] bump Dockerfile.neuron image base, use public ECR by dtrifiro in https://github.com/vllm-project/vllm/pull/6832
* [Bugfix] Fix logit soft cap in flash-attn backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/7425
* [ci] Entrypoints run upon changes in vllm/ by khluu in https://github.com/vllm-project/vllm/pull/7423
* [ci] Cancel fastcheck run when PR is marked ready by khluu in https://github.com/vllm-project/vllm/pull/7427
* [ci] Cancel fastcheck when PR is ready by khluu in https://github.com/vllm-project/vllm/pull/7433
* [Misc] Use scalar type to dispatch to different `gptq_marlin` kernels by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7323
* [Core] Consolidate `GB` constant and enable float GB arguments by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7416
* [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by jon-chuang in https://github.com/vllm-project/vllm/pull/7208
* [Bugfix] Handle PackageNotFoundError when checking for xpu version by sasha0552 in https://github.com/vllm-project/vllm/pull/7398
* [CI/Build] bump minimum cmake version by dtrifiro in https://github.com/vllm-project/vllm/pull/6999
* [Core] Shut down aDAG workers with clean async llm engine exit by ruisearch42 in https://github.com/vllm-project/vllm/pull/7224
* [mypy] Misc. typing improvements by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7417
* [Misc] improve logits processors logging message by aw632 in https://github.com/vllm-project/vllm/pull/7435
* [ci] Remove fast check cancel workflow by khluu in https://github.com/vllm-project/vllm/pull/7455
* [Bugfix] Fix weight loading for Chameleon when TP>1 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7410
* [hardware] unify usage of is_tpu to current_platform.is_tpu() by youkaichao in https://github.com/vllm-project/vllm/pull/7102
* [TPU] Suppress import custom_ops warning by WoosukKwon in https://github.com/vllm-project/vllm/pull/7458
* Revert "[Doc] Update supported_hardware.rst (7276)" by WoosukKwon in https://github.com/vllm-project/vllm/pull/7467
* [Frontend][Core] Add plumbing to support audio language models by petersalas in https://github.com/vllm-project/vllm/pull/7446
* [Misc] Update LM Eval Tolerance by dsikka in https://github.com/vllm-project/vllm/pull/7473
* [Misc] Update `gptq_marlin` to use new vLLMParameters by dsikka in https://github.com/vllm-project/vllm/pull/7281
* [Misc] Update Fused MoE weight loading by dsikka in https://github.com/vllm-project/vllm/pull/7334
* [Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` by dsikka in https://github.com/vllm-project/vllm/pull/7422
* Announce NVIDIA Meetup by simon-mo in https://github.com/vllm-project/vllm/pull/7483
* [frontend] spawn engine process from api server process by youkaichao in https://github.com/vllm-project/vllm/pull/7484
* [Misc] `compressed-tensors` code reuse by kylesayrs in https://github.com/vllm-project/vllm/pull/7277
* [misc][plugin] add plugin system implementation by youkaichao in https://github.com/vllm-project/vllm/pull/7426
* [TPU] Support multi-host inference by WoosukKwon in https://github.com/vllm-project/vllm/pull/7457
* [Bugfix][CI] Import ray under guard by WoosukKwon in https://github.com/vllm-project/vllm/pull/7486
* [CI/Build]Reduce the time consumption for LoRA tests by jeejeelee in https://github.com/vllm-project/vllm/pull/7396
* [misc][ci] fix cpu test with plugins by youkaichao in https://github.com/vllm-project/vllm/pull/7489
* [Bugfix][Docs] Update list of mock imports by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7493
* [doc] update test script to include cudagraph by youkaichao in https://github.com/vllm-project/vllm/pull/7501
* Fix empty output when temp is too low by CatherineSue in https://github.com/vllm-project/vllm/pull/2937
* [ci] fix model tests by youkaichao in https://github.com/vllm-project/vllm/pull/7507
* [Bugfix][Frontend] Disable embedding API for chat models by QwertyJack in https://github.com/vllm-project/vllm/pull/7504
* [Misc] Deprecation Warning when setting --engine-use-ray by wallashss in https://github.com/vllm-project/vllm/pull/7424
* [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7126
* [core] [3/N] multi-step args and sequence.py by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7452
* [TPU] Set per-rank XLA cache by WoosukKwon in https://github.com/vllm-project/vllm/pull/7533
* [Misc] Revert `compressed-tensors` code reuse by kylesayrs in https://github.com/vllm-project/vllm/pull/7521
* llama_index serving integration documentation by pavanjava in https://github.com/vllm-project/vllm/pull/6973
* [Bugfix][TPU] Correct env variable for XLA cache path by WoosukKwon in https://github.com/vllm-project/vllm/pull/7544
* [Bugfix] update neuron for version > 0.5.0 by omrishiv in https://github.com/vllm-project/vllm/pull/7175
* [Misc] Update dockerfile for CPU to cover protobuf installation by PHILO-HE in https://github.com/vllm-project/vllm/pull/7182
* [Bugfix] Fix default weight loading for scalars by mgoin in https://github.com/vllm-project/vllm/pull/7534
* [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by mgoin in https://github.com/vllm-project/vllm/pull/7566
* [Misc] Add quantization config support for speculative model. by ShangmingCai in https://github.com/vllm-project/vllm/pull/7343
* [Feature]: Add OpenAI server prompt_logprobs support 6508 by gnpinkert in https://github.com/vllm-project/vllm/pull/7453
* [ci/test] rearrange tests and make adag test soft fail by youkaichao in https://github.com/vllm-project/vllm/pull/7572
* Chat method for offline llm by nunjunj in https://github.com/vllm-project/vllm/pull/5049
* [CI] Move quantization cpu offload tests out of fastcheck by mgoin in https://github.com/vllm-project/vllm/pull/7574
* [Misc/Testing] Use `torch.testing.assert_close` by jon-chuang in https://github.com/vllm-project/vllm/pull/7324
* register custom op for flash attn and use from torch.ops by youkaichao in https://github.com/vllm-project/vllm/pull/7536
* [Core] Use uvloop with zmq-decoupled front-end by njhill in https://github.com/vllm-project/vllm/pull/7570
* [CI] Fix crashes of performance benchmark by KuntaiDu in https://github.com/vllm-project/vllm/pull/7500
* [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by gongdao123 in https://github.com/vllm-project/vllm/pull/7513
* support tqdm in notebooks by fzyzcjy in https://github.com/vllm-project/vllm/pull/7510
* [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by charlifu in https://github.com/vllm-project/vllm/pull/7210
* [Kernel] W8A16 Int8 inside FusedMoE by mzusman in https://github.com/vllm-project/vllm/pull/7415
* [Kernel] Add tuned triton configs for ExpertsInt8 by mgoin in https://github.com/vllm-project/vllm/pull/7601
* [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7571
* [Core] Fix tracking of model forward time to the span traces in case of PP>1 by sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7440
* [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by mgoin in https://github.com/vllm-project/vllm/pull/7444
* [Doc] Update quantization supported hardware table by mgoin in https://github.com/vllm-project/vllm/pull/7595
* [Kernel] register punica functions as torch ops by bnellnm in https://github.com/vllm-project/vllm/pull/7591
* [Kernel][Misc] dynamo support for ScalarType by bnellnm in https://github.com/vllm-project/vllm/pull/7594
* [Kernel] fix types used in aqlm and ggml kernels to support dynamo by bnellnm in https://github.com/vllm-project/vllm/pull/7596
* [Model] Align nemotron config with final HF state and fix lm-eval-small by mgoin in https://github.com/vllm-project/vllm/pull/7611
* [Bugfix] Fix custom_ar support check by bnellnm in https://github.com/vllm-project/vllm/pull/7617
* .[Build/CI] Enabling passing AMD tests. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/7610
* [Bugfix] Clear engine reference in AsyncEngineRPCServer by ruisearch42 in https://github.com/vllm-project/vllm/pull/7618
* [aDAG] Unflake aDAG + PP tests by rkooo567 in https://github.com/vllm-project/vllm/pull/7600
* [Bugfix] add >= 1.0 constraint for openai dependency by metasyn in https://github.com/vllm-project/vllm/pull/7612
* [misc] use nvml to get consistent device name by youkaichao in https://github.com/vllm-project/vllm/pull/7582
* [ci][test] fix engine/logger test by youkaichao in https://github.com/vllm-project/vllm/pull/7621
* [core][misc] update libcudart finding by youkaichao in https://github.com/vllm-project/vllm/pull/7620
* [Model] Pipeline parallel support for JAIS by mrbesher in https://github.com/vllm-project/vllm/pull/7603
* [ci][test] allow longer wait time for api server by youkaichao in https://github.com/vllm-project/vllm/pull/7629
* [Misc]Fix BitAndBytes exception messages by jeejeelee in https://github.com/vllm-project/vllm/pull/7626
* [VLM] Refactor `MultiModalConfig` initialization and profiling by ywang96 in https://github.com/vllm-project/vllm/pull/7530
* [TPU] Skip creating empty tensor by WoosukKwon in https://github.com/vllm-project/vllm/pull/7630
* [TPU] Use mark_dynamic only for dummy run by WoosukKwon in https://github.com/vllm-project/vllm/pull/7634
* [TPU] Optimize RoPE forward_native2 by WoosukKwon in https://github.com/vllm-project/vllm/pull/7636
* [ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7279
* [CI/Build] Add text-only test for Qwen models by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/7475
* [Misc] Refactor Llama3 RoPE initialization by WoosukKwon in https://github.com/vllm-project/vllm/pull/7637
* [Core] Optimize SPMD architecture with delta + serialization optimization by rkooo567 in https://github.com/vllm-project/vllm/pull/7109
* [Core] Use flashinfer sampling kernel when available by peng1999 in https://github.com/vllm-project/vllm/pull/7137
* fix xpu build by jikunshang in https://github.com/vllm-project/vllm/pull/7644
* [Misc] Remove Gemma RoPE by WoosukKwon in https://github.com/vllm-project/vllm/pull/7638
* [MISC] Add prefix cache hit rate to metrics by comaniac in https://github.com/vllm-project/vllm/pull/7606
* [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by c3-ali in https://github.com/vllm-project/vllm/pull/5428
* [core] Multi Step Scheduling by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7000
* [Core] Support tensor parallelism for GGUF quantization by Isotr0py in https://github.com/vllm-project/vllm/pull/7520
* [Bugfix] Don't disable existing loggers by a-ys in https://github.com/vllm-project/vllm/pull/7664
* [TPU] Fix redundant input tensor cloning by WoosukKwon in https://github.com/vllm-project/vllm/pull/7660
* [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by tjohnson31415 in https://github.com/vllm-project/vllm/pull/7665
* [doc] fix doc build error caused by msgspec by youkaichao in https://github.com/vllm-project/vllm/pull/7659
* [Speculative Decoding] Fixing hidden states handling in batch expansion by abhigoyal1997 in https://github.com/vllm-project/vllm/pull/7508
* [ci] Install Buildkite test suite analysis by khluu in https://github.com/vllm-project/vllm/pull/7667
* [Bugfix] support `tie_word_embeddings` for all models by zijian-hu in https://github.com/vllm-project/vllm/pull/5724
* [CI] Organizing performance benchmark files by KuntaiDu in https://github.com/vllm-project/vllm/pull/7616
* [misc] add nvidia related library in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/7674
* [XPU] fallback to native implementation for xpu custom op by jianyizh in https://github.com/vllm-project/vllm/pull/7670
* [misc][cuda] add warning for pynvml user by youkaichao in https://github.com/vllm-project/vllm/pull/7675
* [Core] Refactor executor classes to make it easier to inherit GPUExecutor by jikunshang in https://github.com/vllm-project/vllm/pull/7673
* [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7174
* [OpenVINO] Updated documentation by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7687
* [VLM][Model] Add test for InternViT vision encoder by Isotr0py in https://github.com/vllm-project/vllm/pull/7409
* [Hardware] [Intel GPU] refactor xpu worker/executor by jikunshang in https://github.com/vllm-project/vllm/pull/7686
* [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by ronensc in https://github.com/vllm-project/vllm/pull/7266
* [Misc] Add jinja2 as an explicit build requirement by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7695
* [Core] Add `AttentionState` abstraction by Yard1 in https://github.com/vllm-project/vllm/pull/7663
* [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by jikunshang in https://github.com/vllm-project/vllm/pull/7685
* [ci][test] adjust max wait time for cpu offloading test by youkaichao in https://github.com/vllm-project/vllm/pull/7709
* [Core] Pipe `worker_class_fn` argument in Executor by Yard1 in https://github.com/vllm-project/vllm/pull/7707
* [ci] try to log process using the port to debug the port usage by youkaichao in https://github.com/vllm-project/vllm/pull/7711
* [Model] Add AWQ quantization support for InternVL2 model by Isotr0py in https://github.com/vllm-project/vllm/pull/7187
* [Doc] Section for Multimodal Language Models by ywang96 in https://github.com/vllm-project/vllm/pull/7719
* [mypy] Enable following imports for entrypoints by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7248
* [Bugfix] Mirror jinja2 in pyproject.toml by sasha0552 in https://github.com/vllm-project/vllm/pull/7723
* [BugFix] Avoid premature async generator exit and raise all exception variations by njhill in https://github.com/vllm-project/vllm/pull/7698
* [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by learninmou in https://github.com/vllm-project/vllm/pull/7509
* [Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/7735
* [Spec Decoding] Use target model max length as default for draft model by njhill in https://github.com/vllm-project/vllm/pull/7706
* [Bugfix] chat method add_generation_prompt param by brian14708 in https://github.com/vllm-project/vllm/pull/7734
* [Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7394
* [Bugfix] Pass PYTHONPATH from setup.py to CMake by sasha0552 in https://github.com/vllm-project/vllm/pull/7730
* [multi-step] Raise error if not using async engine by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7703
* [Frontend] Improve Startup Failure UX by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7716
* [misc] Add Torch profiler support by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7451
* [Model] Add UltravoxModel and UltravoxConfig by petersalas in https://github.com/vllm-project/vllm/pull/7615
* [ci] [multi-step] narrow multi-step test dependency paths by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7760
* [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by dsikka in https://github.com/vllm-project/vllm/pull/7527
* [distributed][misc] error on same VLLM_HOST_IP setting by youkaichao in https://github.com/vllm-project/vllm/pull/7756
* [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by gshtras in https://github.com/vllm-project/vllm/pull/7477
* [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` by ProExpertProg in https://github.com/vllm-project/vllm/pull/7233
* [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by zifeitong in https://github.com/vllm-project/vllm/pull/7710
* [Bug][Frontend] Improve ZMQ client robustness by joerunde in https://github.com/vllm-project/vllm/pull/7443
* Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (7527)" by mgoin in https://github.com/vllm-project/vllm/pull/7764
* [TPU] Avoid initializing TPU runtime in is_tpu by WoosukKwon in https://github.com/vllm-project/vllm/pull/7763
* [ci] refine dependency for distributed tests by youkaichao in https://github.com/vllm-project/vllm/pull/7776
* [Misc] Use torch.compile for GemmaRMSNorm by WoosukKwon in https://github.com/vllm-project/vllm/pull/7642
* [Speculative Decoding] EAGLE Implementation with Top-1 proposer by abhigoyal1997 in https://github.com/vllm-project/vllm/pull/6830
* Fix ShardedStateLoader for vllm fp8 quantization by sfc-gh-zhwang in https://github.com/vllm-project/vllm/pull/7708
* [Bugfix] Don't build machete on cuda <12.0 by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7757
* [Misc] update fp8 to use `vLLMParameter` by dsikka in https://github.com/vllm-project/vllm/pull/7437
* [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by tjohnson31415 in https://github.com/vllm-project/vllm/pull/7232
* [Misc] Enhance prefix-caching benchmark tool by Jeffwan in https://github.com/vllm-project/vllm/pull/6568
* [Doc] Fix incorrect docs from 7615 by petersalas in https://github.com/vllm-project/vllm/pull/7788
* [Bugfix] Use LoadFormat values as choices for `vllm serve --load-format` by mgoin in https://github.com/vllm-project/vllm/pull/7784
* [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by khluu in https://github.com/vllm-project/vllm/pull/7705
* [Misc] fix typo in triton import warning by lsy323 in https://github.com/vllm-project/vllm/pull/7794
* [Frontend] error suppression cleanup by joerunde in https://github.com/vllm-project/vllm/pull/7786
* [Ray backend] Better error when pg topology is bad. by rkooo567 in https://github.com/vllm-project/vllm/pull/7584
* [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by jikunshang in https://github.com/vllm-project/vllm/pull/7712
* [misc] Add Torch profiler support for CPU-only devices by DamonFool in https://github.com/vllm-project/vllm/pull/7806
* [BugFix] Fix server crash on empty prompt by maxdebayser in https://github.com/vllm-project/vllm/pull/7746
* [github][misc] promote asking llm first by youkaichao in https://github.com/vllm-project/vllm/pull/7809
* [Misc] Update `marlin` to use vLLMParameters by dsikka in https://github.com/vllm-project/vllm/pull/7803
* Bump version to v0.5.5 by simon-mo in https://github.com/vllm-project/vllm/pull/7823
## New Contributors
* jischein made their first contribution in https://github.com/vllm-project/vllm/pull/7129
* kpapis made their first contribution in https://github.com/vllm-project/vllm/pull/7198
* xiaobochen123 made their first contribution in https://github.com/vllm-project/vllm/pull/7193
* Atllkks10 made their first contribution in https://github.com/vllm-project/vllm/pull/7227
* stas00 made their first contribution in https://github.com/vllm-project/vllm/pull/7243
* maxdebayser made their first contribution in https://github.com/vllm-project/vllm/pull/7217
* NiuBlibing made their first contribution in https://github.com/vllm-project/vllm/pull/7288
* lsy323 made their first contribution in https://github.com/vllm-project/vllm/pull/7005
* pooyadavoodi made their first contribution in https://github.com/vllm-project/vllm/pull/7132
* sfc-gh-mkeralapura made their first contribution in https://github.com/vllm-project/vllm/pull/7089
* jon-chuang made their first contribution in https://github.com/vllm-project/vllm/pull/7208
* aw632 made their first contribution in https://github.com/vllm-project/vllm/pull/7435
* petersalas made their first contribution in https://github.com/vllm-project/vllm/pull/7446
* kylesayrs made their first contribution in https://github.com/vllm-project/vllm/pull/7277
* QwertyJack made their first contribution in https://github.com/vllm-project/vllm/pull/7504
* wallashss made their first contribution in https://github.com/vllm-project/vllm/pull/7424
* pavanjava made their first contribution in https://github.com/vllm-project/vllm/pull/6973
* PHILO-HE made their first contribution in https://github.com/vllm-project/vllm/pull/7182
* gnpinkert made their first contribution in https://github.com/vllm-project/vllm/pull/7453
* gongdao123 made their first contribution in https://github.com/vllm-project/vllm/pull/7513
* charlifu made their first contribution in https://github.com/vllm-project/vllm/pull/7210
* metasyn made their first contribution in https://github.com/vllm-project/vllm/pull/7612
* mrbesher made their first contribution in https://github.com/vllm-project/vllm/pull/7603
* alex-jw-brooks made their first contribution in https://github.com/vllm-project/vllm/pull/7475
* a-ys made their first contribution in https://github.com/vllm-project/vllm/pull/7664
* zijian-hu made their first contribution in https://github.com/vllm-project/vllm/pull/5724
* jianyizh made their first contribution in https://github.com/vllm-project/vllm/pull/7670
* learninmou made their first contribution in https://github.com/vllm-project/vllm/pull/7509
* brian14708 made their first contribution in https://github.com/vllm-project/vllm/pull/7734
* sfc-gh-zhwang made their first contribution in https://github.com/vllm-project/vllm/pull/7708
**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5