## Highlights
* vLLM now has pipeline parallelism! (4412, 5408, 6115, 6120) You can now run the API server with `--pipeline-parallel-size`. This feature is in an early stage; please let us know your feedback.
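As a minimal sketch (the model ID, port, and GPU count below are placeholders; only `--pipeline-parallel-size` comes from this release), a two-stage deployment queried through the standard OpenAI Python client might look like this:

```python
# Server (shell), assuming 2 GPUs are available:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct \
#       --pipeline-parallel-size 2
#
# Client: query the OpenAI-compatible endpoint with the standard openai package.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Pipeline parallelism lets vLLM",
    max_tokens=32,
)
print(completion.choices[0].text)
```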
### Model Support
* Support Gemma 2 (5908, 6051). Please note that for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap (see the Gemma 2 sketch after this list). The FlashInfer wheels can be downloaded [here](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.8).
* Support Jamba (4115). This is vLLM's first state space model!
* Support Deepseek-V2 (4650). Please note that MLA (Multi-head Latent Attention) is not implemented yet, and we are looking for contributions!
* Vision-language models: added support for Phi-3-Vision, dynamic image sizes, and a registry for processing model inputs (4986, 5276, 5214)
  * Notably, this release includes a **breaking change**: all VLM-specific arguments have been removed from the engine APIs, so you no longer need to set them globally via the CLI. Instead, you simply pass `<image>` in the prompt rather than use complicated prompt formatting (see the VLM sketch after this list). See more [here](https://docs.vllm.ai/en/latest/models/vlm.html#offline-batched-inference).
  * There is also a new [guide](https://docs.vllm.ai/en/latest/models/enabling_multimodal_inputs.html) on adding VLMs! We would love your contributions of new models!
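A minimal Gemma 2 sketch, assuming FlashInfer is installed and that the `VLLM_ATTENTION_BACKEND` environment variable selects the attention backend (the model ID is an example):

```python
import os

# Select the FlashInfer attention backend so Gemma 2's logits soft cap is applied.
# (Assumption: VLLM_ATTENTION_BACKEND controls backend selection, as in recent vLLM versions.)
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")  # example model ID
outputs = llm.generate(
    ["The three primary colors are"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```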
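And a hedged sketch of the simplified VLM prompt flow, using LLaVA as an example model; the exact `multi_modal_data` structure is described in the docs linked above and may differ slightly between versions:

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example VLM

# The <image> placeholder now goes directly in the prompt; no VLM-specific
# engine arguments are needed. The image itself is passed as multi-modal data.
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
image = Image.open("example.jpg")  # hypothetical local image file

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},  # format per the VLM docs linked above
})
print(outputs[0].outputs[0].text)
```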
### Hardware Support
* Enhancements to TPU support (5292, 5878, 5850, 5831, 5855)
* OpenVINO backend (5379)
### Production Service
* Support for sharded tensorized models (4990)
* Continuous streaming of OpenAI response token stats (5742)
### Performance
* Enhancements to distributed communication via shared memory (5399)
* Latency improvements in the block manager (5584)
* Enhancements to `compressed-tensors`, adding Marlin and W4A16 support (5435, 5385)
* Faster FP8 quantize kernel (5396), FP8 on Ampere (5975)
* Option to use FlashInfer for prefill and decode, with CUDA graph support for decode (4628)
* Speculative Decoding (see the sketch after this list)
  * MLPSpeculator (4947, 6050)
  * Typical Acceptance Sampler (5131, 5348)
  * Draft Model Runner (5799)
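A hedged sketch of MLPSpeculator-based speculative decoding; the parameter names follow vLLM's speculative-decoding interface, and both model IDs are illustrative examples only:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",              # example target model
    speculative_model="ibm-fms/llama-13b-accelerator",   # example MLPSpeculator draft head
    num_speculative_tokens=5,
    use_v2_block_manager=True,  # speculative decoding requires block manager v2 in this era
)
outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```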
### Development Productivity
* Post-merge benchmarks are now available at perf.vllm.ai!
* Addition of A100 in CI environment (5658)
* Step towards nightly wheel publication (5610)
## What's Changed
* [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by mgoin in https://github.com/vllm-project/vllm/pull/5253
* Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by simon-mo in https://github.com/vllm-project/vllm/pull/5463
* [CI] Upgrade codespell version. by rkooo567 in https://github.com/vllm-project/vllm/pull/5381
* [Hardware] Initial TPU integration by WoosukKwon in https://github.com/vllm-project/vllm/pull/5292
* [Bugfix] Add device assertion to TorchSDPA by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/5402
* [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by khluu in https://github.com/vllm-project/vllm/pull/5464
* [Kernel] Vectorized FP8 quantize kernel by comaniac in https://github.com/vllm-project/vllm/pull/5396
* [Bugfix] TYPE_CHECKING for MultiModalData by kimdwkimdw in https://github.com/vllm-project/vllm/pull/5444
* [Frontend] [Core] Support for sharded tensorized models by tjohnson31415 in https://github.com/vllm-project/vllm/pull/4990
* [misc] add hint for AttributeError by youkaichao in https://github.com/vllm-project/vllm/pull/5462
* [Doc] Update debug docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5438
* [Bugfix] Fix typo in scheduler.py (requeset -> request) by mgoin in https://github.com/vllm-project/vllm/pull/5470
* [Frontend] Add "input speed" to tqdm postfix alongside output speed by mgoin in https://github.com/vllm-project/vllm/pull/5425
* [Bugfix] Fix wrong multi_modal_input format for CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/5451
* [Core][Distributed] add coordinator to reduce code duplication in tp and pp by youkaichao in https://github.com/vllm-project/vllm/pull/5293
* [ci] Use sccache to build images by khluu in https://github.com/vllm-project/vllm/pull/5419
* [Bugfix]if the content is started with ":"(response of ping), client should i… by sywangyi in https://github.com/vllm-project/vllm/pull/5303
* [Kernel] `w4a16` support for `compressed-tensors` by dsikka in https://github.com/vllm-project/vllm/pull/5385
* [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by mgoin in https://github.com/vllm-project/vllm/pull/5466
* [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by wenyujin333 in https://github.com/vllm-project/vllm/pull/5497
* [Hardware][Intel] Optimize CPU backend and add more performance tips by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4971
* [Docs] Add 4th meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/5509
* [Misc] Add vLLM version getter to utils by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5098
* [CI/Build] Simplify OpenAI server setup in tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5100
* [Doc] Update LLaVA docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5437
* [Kernel] Factor out epilogues from cutlass kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5391
* [MISC] Remove FP8 warning by comaniac in https://github.com/vllm-project/vllm/pull/5472
* Seperate dev requirements into lint and test by Yard1 in https://github.com/vllm-project/vllm/pull/5474
* Revert "[Core] Remove unnecessary copies in flash attn backend" by Yard1 in https://github.com/vllm-project/vllm/pull/5478
* [misc] fix format.sh by youkaichao in https://github.com/vllm-project/vllm/pull/5511
* [CI/Build] Disable test_fp8.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5508
* [Kernel] Disable CUTLASS kernels for fp8 by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5505
* Add `cuda_device_count_stateless` by Yard1 in https://github.com/vllm-project/vllm/pull/5473
* [Hardware][Intel] Support CPU inference with AVX2 ISA by DamonFool in https://github.com/vllm-project/vllm/pull/5452
* [Bugfix]typofix by AllenDou in https://github.com/vllm-project/vllm/pull/5507
* bump version to v0.5.0.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/5522
* [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by KuntaiDu in https://github.com/vllm-project/vllm/pull/5073
* [CI/Build] Disable LLaVA-NeXT CPU test by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5529
* [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5516
* [Misc] Fix arg names by AllenDou in https://github.com/vllm-project/vllm/pull/5524
* [ Misc ] Rs/compressed tensors cleanup by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5432
* [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5401
* [mis] fix flaky test of test_cuda_device_count_stateless by youkaichao in https://github.com/vllm-project/vllm/pull/5546
* [Core] Remove duplicate processing in async engine by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5525
* [misc][distributed] fix benign error in `is_in_the_same_node` by youkaichao in https://github.com/vllm-project/vllm/pull/5512
* [Docs] Add ZhenFund as a Sponsor by simon-mo in https://github.com/vllm-project/vllm/pull/5548
* [Doc] Update documentation on Tensorizer by sangstar in https://github.com/vllm-project/vllm/pull/5471
* [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by tdoublep in https://github.com/vllm-project/vllm/pull/5460
* [Bugfix] Fix typo in Pallas backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/5558
* [Core][Distributed] improve p2p cache generation by youkaichao in https://github.com/vllm-project/vllm/pull/5528
* Add ccache to amd by simon-mo in https://github.com/vllm-project/vllm/pull/5555
* [Core][Bugfix]: fix prefix caching for blockv2 by leiwen83 in https://github.com/vllm-project/vllm/pull/5364
* [mypy] Enable type checking for test directory by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5017
* [CI/Build] Test both text and token IDs in batched OpenAI Completions API by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5568
* [misc] Do not allow to use lora with chunked prefill. by rkooo567 in https://github.com/vllm-project/vllm/pull/5538
* add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5145
* [BugFix] Don't start a Ray cluster when not using Ray by njhill in https://github.com/vllm-project/vllm/pull/5570
* [Fix] Correct OpenAI batch response format by zifeitong in https://github.com/vllm-project/vllm/pull/5554
* Add basic correctness 2 GPU tests to 4 GPU pipeline by Yard1 in https://github.com/vllm-project/vllm/pull/5518
* [CI][BugFix] Flip is_quant_method_supported condition by mgoin in https://github.com/vllm-project/vllm/pull/5577
* [build][misc] limit numpy version by youkaichao in https://github.com/vllm-project/vllm/pull/5582
* [Doc] add debugging tips for crash and multi-node debugging by youkaichao in https://github.com/vllm-project/vllm/pull/5581
* Fix w8a8 benchmark and add Llama-3-8B by comaniac in https://github.com/vllm-project/vllm/pull/5562
* [Model] Rename Phi3 rope scaling type by garg-amit in https://github.com/vllm-project/vllm/pull/5595
* Correct alignment in the seq_len diagram. by CharlesRiggins in https://github.com/vllm-project/vllm/pull/5592
* [Kernel] `compressed-tensors` marlin 24 support by dsikka in https://github.com/vllm-project/vllm/pull/5435
* [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by zhyncs in https://github.com/vllm-project/vllm/pull/5588
* [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by jikunshang in https://github.com/vllm-project/vllm/pull/3814
* [CI/BUILD] Support non-AVX512 vLLM building and testing by DamonFool in https://github.com/vllm-project/vllm/pull/5574
* [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by KuntaiDu in https://github.com/vllm-project/vllm/pull/5571
* [bugfix][distributed] fix 16 gpus local rank arrangement by youkaichao in https://github.com/vllm-project/vllm/pull/5604
* [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by youkaichao in https://github.com/vllm-project/vllm/pull/5584
* [Bugfix] Fix KV head calculation for MPT models when using GQA by bfontain in https://github.com/vllm-project/vllm/pull/5142
* [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by zifeitong in https://github.com/vllm-project/vllm/pull/5606
* [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by sroy745 in https://github.com/vllm-project/vllm/pull/5131
* [Model] Initialize Phi-3-vision support by Isotr0py in https://github.com/vllm-project/vllm/pull/4986
* [Kernel] Add punica dimensions for Granite 13b by joerunde in https://github.com/vllm-project/vllm/pull/5559
* [misc][typo] fix typo by youkaichao in https://github.com/vllm-project/vllm/pull/5620
* [Misc] Fix typo by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5618
* [CI] Avoid naming different metrics with the same name in performance benchmark by KuntaiDu in https://github.com/vllm-project/vllm/pull/5615
* [bugfix][distributed] do not error if two processes do not agree on p2p capability by youkaichao in https://github.com/vllm-project/vllm/pull/5612
* [Misc] Remove import from transformers logging by CatherineSue in https://github.com/vllm-project/vllm/pull/5625
* [CI/Build][Misc] Update Pytest Marker for VLMs by ywang96 in https://github.com/vllm-project/vllm/pull/5623
* [ci] Deprecate original CI template by khluu in https://github.com/vllm-project/vllm/pull/5624
* [Misc] Add OpenTelemetry support by ronensc in https://github.com/vllm-project/vllm/pull/4687
* [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by dsikka in https://github.com/vllm-project/vllm/pull/5542
* [ci] Setup Release pipeline and build release wheels with cache by khluu in https://github.com/vllm-project/vllm/pull/5610
* [Model] LoRA support added for command-r by sergey-tinkoff in https://github.com/vllm-project/vllm/pull/5178
* [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by tdoublep in https://github.com/vllm-project/vllm/pull/5639
* [Doc] Added cerebrium as Integration option by milo157 in https://github.com/vllm-project/vllm/pull/5553
* [Bugfix] Fix CUDA version check for mma warning suppression by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5642
* [Bugfix] Fix w8a8 benchmarks for int8 case by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5643
* [Bugfix] Fix Phi-3 Long RoPE scaling implementation by ShukantPal in https://github.com/vllm-project/vllm/pull/5628
* [Bugfix] Added test for sampling repetition penalty bug. by tdoublep in https://github.com/vllm-project/vllm/pull/5659
* [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by hongxiayang in https://github.com/vllm-project/vllm/pull/5641
* [misc][distributed] use localhost for single-node by youkaichao in https://github.com/vllm-project/vllm/pull/5619
* [Model] Add FP8 kv cache for Qwen2 by mgoin in https://github.com/vllm-project/vllm/pull/5656
* [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by Isotr0py in https://github.com/vllm-project/vllm/pull/5684
* [Misc]Add param max-model-len in benchmark_latency.py by DearPlanet in https://github.com/vllm-project/vllm/pull/5629
* [CI/Build] Add tqdm to dependencies by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5680
* [ci] Add A100 queue into AWS CI template by khluu in https://github.com/vllm-project/vllm/pull/5648
* [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by mgoin in https://github.com/vllm-project/vllm/pull/5688
* [ci][distributed] add tests for custom allreduce by youkaichao in https://github.com/vllm-project/vllm/pull/5689
* [Bugfix] AsyncLLMEngine hangs with asyncio.run by zifeitong in https://github.com/vllm-project/vllm/pull/5654
* [Doc] Update docker references by rafvasq in https://github.com/vllm-project/vllm/pull/5614
* [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by dsikka in https://github.com/vllm-project/vllm/pull/5650
* [ci] Limit num gpus if specified for A100 by khluu in https://github.com/vllm-project/vllm/pull/5694
* [Misc] Improve conftest by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5681
* [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by ywang96 in https://github.com/vllm-project/vllm/pull/5703
* [Kernel] Update Cutlass int8 kernel configs for SM90 by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5514
* [Model] Port over CLIPVisionModel for VLMs by ywang96 in https://github.com/vllm-project/vllm/pull/5591
* [Kernel] Update Cutlass int8 kernel configs for SM80 by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5275
* [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5715
* [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by mgoin in https://github.com/vllm-project/vllm/pull/5718
* [distributed][misc] use fork by default for mp by youkaichao in https://github.com/vllm-project/vllm/pull/5669
* [Model] MLPSpeculator speculative decoding support by JRosenkranz in https://github.com/vllm-project/vllm/pull/4947
* [Kernel] Add punica dimension for Qwen2 LoRA by jinzhen-lin in https://github.com/vllm-project/vllm/pull/5441
* [BugFix] Fix test_phi3v.py by CatherineSue in https://github.com/vllm-project/vllm/pull/5725
* [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by jeejeelee in https://github.com/vllm-project/vllm/pull/5665
* [Core][Distributed] add shm broadcast by youkaichao in https://github.com/vllm-project/vllm/pull/5399
* [Kernel][CPU] Add Quick `gelu` to CPU by ywang96 in https://github.com/vllm-project/vllm/pull/5717
* [Doc] Documentation on supported hardware for quantization methods by mgoin in https://github.com/vllm-project/vllm/pull/5745
* [BugFix] exclude version 1.15.0 for modelscope by zhyncs in https://github.com/vllm-project/vllm/pull/5668
* [ci][test] fix ca test in main by youkaichao in https://github.com/vllm-project/vllm/pull/5746
* [LoRA] Add support for pinning lora adapters in the LRU cache by rohithkrn in https://github.com/vllm-project/vllm/pull/5603
* [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by jikunshang in https://github.com/vllm-project/vllm/pull/5616
* [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by DamonFool in https://github.com/vllm-project/vllm/pull/5710
* [Misc] Remove 4789 workaround left in vllm/entrypoints/openai/run_batch.py by zifeitong in https://github.com/vllm-project/vllm/pull/5756
* [Bugfix] Fix pin_lora error in TPU executor by WoosukKwon in https://github.com/vllm-project/vllm/pull/5760
* [Docs][TPU] Add installation tip for TPU by WoosukKwon in https://github.com/vllm-project/vllm/pull/5761
* [core][distributed] improve shared memory broadcast by youkaichao in https://github.com/vllm-project/vllm/pull/5754
* [BugFix] [Kernel] Add Cutlass2x fallback kernels by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5744
* [Distributed] Add send and recv helpers by andoorve in https://github.com/vllm-project/vllm/pull/5719
* [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by Isotr0py in https://github.com/vllm-project/vllm/pull/5772
* [doc][faq] add warning to download models for every nodes by youkaichao in https://github.com/vllm-project/vllm/pull/5783
* [Doc] Add "Suggest edit" button to doc pages by mgoin in https://github.com/vllm-project/vllm/pull/5789
* [Doc] Add Phi-3-medium to list of supported models by mgoin in https://github.com/vllm-project/vllm/pull/5788
* [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by CatherineSue in https://github.com/vllm-project/vllm/pull/5795
* [ci] Remove aws template by khluu in https://github.com/vllm-project/vllm/pull/5757
* [Doc] Add notice about breaking changes to VLMs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5818
* [Speculative Decoding] Support draft model on different tensor-parallel size than target model by wooyeonlee0 in https://github.com/vllm-project/vllm/pull/5414
* [Misc] Remove useless code in cpu_worker by DamonFool in https://github.com/vllm-project/vllm/pull/5824
* [Core] Add fault tolerance for `RayTokenizerGroupPool` by Yard1 in https://github.com/vllm-project/vllm/pull/5748
* [doc][distributed] add both gloo and nccl tests by youkaichao in https://github.com/vllm-project/vllm/pull/5834
* [CI/Build] Add unit testing for FlexibleArgumentParser by mgoin in https://github.com/vllm-project/vllm/pull/5798
* [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by dsikka in https://github.com/vllm-project/vllm/pull/5794
* [Hardware][TPU] Refactor TPU backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/5831
* [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by mawong-amd in https://github.com/vllm-project/vllm/pull/5422
* [Hardware][TPU] Raise errors for unsupported sampling params by WoosukKwon in https://github.com/vllm-project/vllm/pull/5850
* [CI/Build] Add E2E tests for MLPSpeculator by tdoublep in https://github.com/vllm-project/vllm/pull/5791
* [Bugfix] Fix assertion in NeuronExecutor by aws-patlange in https://github.com/vllm-project/vllm/pull/5841
* [Core] Refactor Worker and ModelRunner to consolidate control plane communication by stephanie-wang in https://github.com/vllm-project/vllm/pull/5408
* [Misc][Doc] Add Example of using OpenAI Server with VLM by ywang96 in https://github.com/vllm-project/vllm/pull/5832
* [bugfix][distributed] fix shm broadcast when the queue size is full by youkaichao in https://github.com/vllm-project/vllm/pull/5801
* [Bugfix] Fix embedding to support 2D inputs by WoosukKwon in https://github.com/vllm-project/vllm/pull/5829
* [Bugfix][TPU] Fix KV cache size calculation by WoosukKwon in https://github.com/vllm-project/vllm/pull/5860
* [CI/Build] Refactor image test assets by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5821
* [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by ProExpertProg in https://github.com/vllm-project/vllm/pull/5560
* [Frontend] Add tokenize/detokenize endpoints by sasha0552 in https://github.com/vllm-project/vllm/pull/5054
* [Hardware][TPU] Support parallel sampling & Swapping by WoosukKwon in https://github.com/vllm-project/vllm/pull/5855
* [Bugfix][TPU] Fix CPU cache allocation by WoosukKwon in https://github.com/vllm-project/vllm/pull/5869
* Support CPU inference with VSX PowerPC ISA by ChipKerchner in https://github.com/vllm-project/vllm/pull/5652
* [doc] update usage of env var to avoid conflict by youkaichao in https://github.com/vllm-project/vllm/pull/5873
* [Misc] Add example for LLaVA-NeXT by ywang96 in https://github.com/vllm-project/vllm/pull/5879
* [BugFix] Fix cuda graph for MLPSpeculator by njhill in https://github.com/vllm-project/vllm/pull/5875
* [Doc] Add note about context length in Phi-3-Vision example by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5887
* [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5880
* [Model] Add base class for LoRA-supported models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5018
* [Bugfix] Fix img_sizes Parsing in Phi3-Vision by ywang96 in https://github.com/vllm-project/vllm/pull/5888
* [CI/Build] [1/3] Reorganize entrypoints tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5526
* [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5896
* [doc][misc] add note for Kubernetes users by youkaichao in https://github.com/vllm-project/vllm/pull/5916
* [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by njhill in https://github.com/vllm-project/vllm/pull/5876
* [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by njhill in https://github.com/vllm-project/vllm/pull/5849
* [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by ywang96 in https://github.com/vllm-project/vllm/pull/5922
* [Model] Add Gemma 2 by WoosukKwon in https://github.com/vllm-project/vllm/pull/5908
* [core][misc] remove logical block by youkaichao in https://github.com/vllm-project/vllm/pull/5882
* [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by divakar-amd in https://github.com/vllm-project/vllm/pull/5932
* [Hardware][TPU] Optimize KV cache swapping by WoosukKwon in https://github.com/vllm-project/vllm/pull/5878
* [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5905
* [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/5956
* [Core] Registry for processing model inputs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5214
* Unmark fused_moe config json file as executable by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5960
* [Hardware][Intel] OpenVINO vLLM backend by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/5379
* [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by tdoublep in https://github.com/vllm-project/vllm/pull/5894
* [CI/Build] [2/3] Reorganize entrypoints tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5904
* [Distributed] Make it clear that % should not be in tensor dict keys. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5927
* [Spec Decode] Introduce DraftModelRunner by comaniac in https://github.com/vllm-project/vllm/pull/5799
* [Bugfix] Fix compute datatype for cutlass 3.x epilogues by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5931
* [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5928
* [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5921
* Support Deepseek-V2 by zwd003 in https://github.com/vllm-project/vllm/pull/4650
* [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled by mgoin in https://github.com/vllm-project/vllm/pull/5936
* Unmark more files as executable by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5962
* [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5963
* [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4628
* [Bugfix][TPU] Fix TPU sampler output by WoosukKwon in https://github.com/vllm-project/vllm/pull/5978
* [Bugfix][TPU] Fix pad slot id by WoosukKwon in https://github.com/vllm-project/vllm/pull/5977
* [Bugfix] fix missing last itl in openai completions benchmark by mcalman in https://github.com/vllm-project/vllm/pull/5926
* [Misc] Extend vLLM Metrics logging API by SolitaryThinker in https://github.com/vllm-project/vllm/pull/5925
* [Kernel] Add punica dimensions for Granite 3b and 8b by joerunde in https://github.com/vllm-project/vllm/pull/5930
* [Bugfix] Fix precisions in Gemma 1 by WoosukKwon in https://github.com/vllm-project/vllm/pull/5913
* [Misc] Update Phi-3-Vision Example by ywang96 in https://github.com/vllm-project/vllm/pull/5981
* [Bugfix] Support `eos_token_id` from `config.json` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5954
* [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum by Yard1 in https://github.com/vllm-project/vllm/pull/5974
* [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k by comaniac in https://github.com/vllm-project/vllm/pull/5939
* [ CI/Build ] Added E2E Test For Compressed Tensors by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5839
* [CI/Build] Add TP test for vision models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5892
* [ CI/Build ] LM Eval Harness Based CI Testing by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5838
* [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests by mawong-amd in https://github.com/vllm-project/vllm/pull/5949
* [CI/Build] Temporarily Remove Phi3-Vision from TP Test by ywang96 in https://github.com/vllm-project/vllm/pull/5989
* [CI/Build] Reuse code for checking output consistency by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5988
* [CI/Build] [3/3] Reorganize entrypoints tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5966
* [ci][distributed] fix some cuda init that makes it necessary to use spawn by youkaichao in https://github.com/vllm-project/vllm/pull/5991
* [Frontend]: Support base64 embedding by llmpros in https://github.com/vllm-project/vllm/pull/5935
* [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. by rkooo567 in https://github.com/vllm-project/vllm/pull/5909
* [ CI ] Temporarily Disable Large LM-Eval Tests by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6005
* [Misc] Fix `get_min_capability` by dsikka in https://github.com/vllm-project/vllm/pull/5971
* [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5940
* [misc][cuda] use nvml query to avoid accidentally cuda initialization by youkaichao in https://github.com/vllm-project/vllm/pull/6007
* [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker by sroy745 in https://github.com/vllm-project/vllm/pull/5348
* [ CI ] Re-enable Large Model LM Eval by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6031
* [doc][misc] remove deprecated api server in doc by youkaichao in https://github.com/vllm-project/vllm/pull/6037
* [Misc] update benchmark backend for scalellm by zhyncs in https://github.com/vllm-project/vllm/pull/6018
* [doc][misc] further lower visibility of simple api server by youkaichao in https://github.com/vllm-project/vllm/pull/6041
* [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool by Yard1 in https://github.com/vllm-project/vllm/pull/6039
* [Bugfix] adding chunking mechanism to fused_moe to handle large inputs by avshalomman in https://github.com/vllm-project/vllm/pull/6029
* add FAQ doc under 'serving' by llmpros in https://github.com/vllm-project/vllm/pull/5946
* [Bugfix][Doc] Fix Doc Formatting by ywang96 in https://github.com/vllm-project/vllm/pull/6048
* [Bugfix] Add explicit `end_forward` calls to flashinfer by Yard1 in https://github.com/vllm-project/vllm/pull/6044
* [BugFix] Ensure worker model loop is always stopped at the right time by njhill in https://github.com/vllm-project/vllm/pull/5987
* [Frontend] Relax api url assertion for openai benchmarking by jamestwhedbee in https://github.com/vllm-project/vllm/pull/6046
* [Model] Changes to MLPSpeculator to support tie_weights and input_scale by tdoublep in https://github.com/vllm-project/vllm/pull/5965
* [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5602
* [Frontend] Add template related params to request by danieljannai21 in https://github.com/vllm-project/vllm/pull/5709
* [VLM] Remove `image_input_type` from VLM config by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5852
* [Doc] Reinstate doc dependencies by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6061
* [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) by sirejdua in https://github.com/vllm-project/vllm/pull/6050
* [Core] Pipeline Parallel Support by andoorve in https://github.com/vllm-project/vllm/pull/4412
* Update conftest.py by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6076
* [ Misc ] Refactor MoE to isolate Fp8 From Mixtral by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5970
* [CORE] Quantized lm-head Framework by Qubitium in https://github.com/vllm-project/vllm/pull/4442
* [Model] Jamba support by mzusman in https://github.com/vllm-project/vllm/pull/4115
* [hardware][misc] introduce platform abstraction by youkaichao in https://github.com/vllm-project/vllm/pull/6080
* [Core] Dynamic image size support for VLMs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5276
* [CI] Fix base url doesn't strip "/" by rkooo567 in https://github.com/vllm-project/vllm/pull/6087
* [BugFix] Avoid unnecessary Ray import warnings by njhill in https://github.com/vllm-project/vllm/pull/6079
* [misc][distributed] error on invalid state by youkaichao in https://github.com/vllm-project/vllm/pull/6092
* [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API by ywang96 in https://github.com/vllm-project/vllm/pull/6091
* [Doc] Fix Mock Import by ywang96 in https://github.com/vllm-project/vllm/pull/6094
* [Bugfix] Fix `compute_logits` in Jamba by ywang96 in https://github.com/vllm-project/vllm/pull/6093
* [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin by mgoin in https://github.com/vllm-project/vllm/pull/5975
* [core][distributed] allow custom allreduce when pipeline parallel size > 1 by youkaichao in https://github.com/vllm-project/vllm/pull/6117
* [vlm] Remove vision language config. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/6089
* [ Misc ] Clean Up `CompressedTensorsW8A8` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6113
* [doc][misc] bump up py version in installation doc by youkaichao in https://github.com/vllm-project/vllm/pull/6119
* [core][distributed] support layer size undividable by pp size in pipeline parallel inference by youkaichao in https://github.com/vllm-project/vllm/pull/6115
* [Bugfix] set OMP_NUM_THREADS to 1 by default when using the multiproc_gpu_executor by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6109
* [Distributed][Core] Support Py39 and Py38 for PP by andoorve in https://github.com/vllm-project/vllm/pull/6120
* [CI/Build] Cleanup VLM tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6107
* [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention by gshtras in https://github.com/vllm-project/vllm/pull/6043
* [misc][doc] try to add warning for latest html by youkaichao in https://github.com/vllm-project/vllm/pull/5979
* [Hardware][Intel CPU] Adding intel openmp tunings in Docker file by zhouyuan in https://github.com/vllm-project/vllm/pull/6008
* [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6051
* [VLM] Calculate maximum number of multi-modal tokens by model by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6121
* [VLM] Improve consistency between feature size calculation and dummy data for profiling by ywang96 in https://github.com/vllm-project/vllm/pull/6146
* [VLM] Cleanup validation and update docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6149
* [Bugfix] Use templated datasource in grafana.json to allow automatic imports by frittentheke in https://github.com/vllm-project/vllm/pull/6136
* [Frontend] Continuous usage stats in OpenAI completion API by jvlunteren in https://github.com/vllm-project/vllm/pull/5742
* [Bugfix] Add verbose error if scipy is missing for blocksparse attention by JGSweets in https://github.com/vllm-project/vllm/pull/5695
* bump version to v0.5.1 by simon-mo in https://github.com/vllm-project/vllm/pull/6157
* [Docs] Fix readthedocs for tag build by simon-mo in https://github.com/vllm-project/vllm/pull/6158
## New Contributors
* kimdwkimdw made their first contribution in https://github.com/vllm-project/vllm/pull/5444
* sywangyi made their first contribution in https://github.com/vllm-project/vllm/pull/5303
* garg-amit made their first contribution in https://github.com/vllm-project/vllm/pull/5595
* CharlesRiggins made their first contribution in https://github.com/vllm-project/vllm/pull/5592
* zhyncs made their first contribution in https://github.com/vllm-project/vllm/pull/5588
* bfontain made their first contribution in https://github.com/vllm-project/vllm/pull/5142
* sroy745 made their first contribution in https://github.com/vllm-project/vllm/pull/5131
* joerunde made their first contribution in https://github.com/vllm-project/vllm/pull/5559
* sergey-tinkoff made their first contribution in https://github.com/vllm-project/vllm/pull/5178
* milo157 made their first contribution in https://github.com/vllm-project/vllm/pull/5553
* ShukantPal made their first contribution in https://github.com/vllm-project/vllm/pull/5628
* rafvasq made their first contribution in https://github.com/vllm-project/vllm/pull/5614
* JRosenkranz made their first contribution in https://github.com/vllm-project/vllm/pull/4947
* rohithkrn made their first contribution in https://github.com/vllm-project/vllm/pull/5603
* wooyeonlee0 made their first contribution in https://github.com/vllm-project/vllm/pull/5414
* aws-patlange made their first contribution in https://github.com/vllm-project/vllm/pull/5841
* stephanie-wang made their first contribution in https://github.com/vllm-project/vllm/pull/5408
* ProExpertProg made their first contribution in https://github.com/vllm-project/vllm/pull/5560
* ChipKerchner made their first contribution in https://github.com/vllm-project/vllm/pull/5652
* ilya-lavrenov made their first contribution in https://github.com/vllm-project/vllm/pull/5379
* mcalman made their first contribution in https://github.com/vllm-project/vllm/pull/5926
* SolitaryThinker made their first contribution in https://github.com/vllm-project/vllm/pull/5925
* llmpros made their first contribution in https://github.com/vllm-project/vllm/pull/5935
* avshalomman made their first contribution in https://github.com/vllm-project/vllm/pull/6029
* danieljannai21 made their first contribution in https://github.com/vllm-project/vllm/pull/5709
* sirejdua made their first contribution in https://github.com/vllm-project/vllm/pull/6050
* gshtras made their first contribution in https://github.com/vllm-project/vllm/pull/6043
* frittentheke made their first contribution in https://github.com/vllm-project/vllm/pull/6136
* jvlunteren made their first contribution in https://github.com/vllm-project/vllm/pull/5742
* JGSweets made their first contribution in https://github.com/vllm-project/vllm/pull/5695
**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.0...v0.5.1