vLLM

Latest version: v0.6.4.post1


0.5.3

Highlights

Model Support
* vLLM now supports Meta Llama 3.1! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html) for initial details on running the model.
* Please check out [this thread](https://github.com/vllm-project/vllm/issues/6689) for any known issues related to the model.
* The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (6606, 6547, 6487, 6593, 6511, 6515, 6552)
* The BF16 version of the model should run on multiple nodes using pipeline parallelism ([docs](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)). If you have a fast network interconnect, you might want to consider full tensor parallelism as well; see the sketch after this list. (6599, 6598, 6529, 6569)
* To support long context, a new RoPE extension method has been added, and chunked prefill is now enabled by default for the Meta Llama 3.1 series of models. (6666, 6553, 6673)
* Support Mistral-Nemo (6548)
* Support Chameleon (6633, 5770)
* Pipeline parallel support for Mixtral (6516)
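
For reference, here is a minimal sketch of single-node, tensor-parallel offline inference with a Llama 3.1 checkpoint. The model ID and GPU count are illustrative assumptions (the FP8 405B variant mentioned above needs a full 8xH100 or 8xA100 node); multi-node pipeline parallelism is covered in the distributed serving docs linked above.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: single-node tensor-parallel inference with a Llama 3.1 model.
# The model ID and tensor_parallel_size are illustrative assumptions; use
# tensor_parallel_size=8 on an 8xH100/8xA100 node for the FP8 405B checkpoint.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # hypothetical smaller stand-in
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["The key advantage of tensor parallelism is"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```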

Hardware Support
* Many enhancements to TPU support. (6277, 6457, 6506, 6504)

Performance Enhancements
* Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) performance improvements to existing AWQ models; see the sketch after this list. (6612)
* Progress towards refactoring for SPMD worker execution. (6032)
* Progress on improving the prepare-inputs procedure. (6164, 6338, 6596)
* Memory optimization for pipeline parallelism. (6455)
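
Below is a minimal sketch of loading an existing AWQ checkpoint; whether the new Marlin-based AWQ path is used depends on the GPU and checkpoint, and the model ID is an illustrative assumption.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: load an AWQ-quantized checkpoint. The quantization method is
# read from the checkpoint config; on supported GPUs this release can route AWQ
# models through the Marlin kernel for the speedup noted above.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ")  # illustrative AWQ checkpoint

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```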


Production Engine
* Correctness testing for pipeline parallel and CPU offloading (6410, 6549)
* Support for dynamically loading LoRA adapters from Hugging Face; see the sketch after this list. (6234)
* Pipeline Parallel using stdlib multiprocessing module (6130)
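
Below is a minimal sketch of per-request LoRA with an adapter referenced by a Hugging Face repo ID; the base model and adapter IDs are illustrative assumptions, and with this release the adapter should be downloaded on demand rather than requiring a local path.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Minimal sketch: enable LoRA on the base model and attach an adapter per
# request. The adapter path below is a Hugging Face repo ID (an illustrative
# assumption), which this release can download dynamically.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Write a SQL query that lists all users:"],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("sql-lora", 1, "yard1/llama-2-7b-sql-lora-test"),
)
print(outputs[0].outputs[0].text)
```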


Others
* A CPU offloading implementation: you can now use `--cpu-offload-gb` to control how much CPU RAM is used to "extend" GPU memory; see the sketch after this list. (6496)
* The new `vllm` CLI is now ready for testing. It comes with three commands: `serve`, `complete`, and `chat`. Feedback and improvements are very welcome! (6431)
* The wheels now build on Ubuntu 20.04 instead of 22.04. (6517)
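
The `vllm` CLI commands (`serve`, `complete`, `chat`) are run from the shell; the sketch below shows the Python-API counterpart of CPU offloading, assuming the `cpu_offload_gb` engine argument mirrors the `--cpu-offload-gb` flag, with an illustrative model ID.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: treat 4 GiB of CPU RAM as extra room for model weights.
# `cpu_offload_gb=4` is assumed to mirror the `--cpu-offload-gb 4` CLI flag;
# offloading trades some throughput for the ability to fit a larger model.
llm = LLM(model="meta-llama/Llama-2-13b-hf", cpu_offload_gb=4)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```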



What's Changed
* [Docs] Add Google Cloud to sponsor list by WoosukKwon in https://github.com/vllm-project/vllm/pull/6450
* [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by WoosukKwon in https://github.com/vllm-project/vllm/pull/6289
* [CI/Build][TPU] Add TPU CI test by WoosukKwon in https://github.com/vllm-project/vllm/pull/6277
* Pin sphinx-argparse version by khluu in https://github.com/vllm-project/vllm/pull/6453
* [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by mzusman in https://github.com/vllm-project/vllm/pull/6425
* [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by g-eoj in https://github.com/vllm-project/vllm/pull/6419
* [Docs] Announce 5th meetup by WoosukKwon in https://github.com/vllm-project/vllm/pull/6458
* [CI/Build] vLLM cache directory for images by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6444
* [Frontend] Support for chat completions input in the tokenize endpoint by sasha0552 in https://github.com/vllm-project/vllm/pull/5923
* [Misc] Fix typos in spec. decode metrics logging. by tdoublep in https://github.com/vllm-project/vllm/pull/6470
* [Core] Use numpy to speed up padded token processing by peng1999 in https://github.com/vllm-project/vllm/pull/6442
* [CI/Build] Remove "boardwalk" image asset by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6460
* [doc][misc] remind users to cancel debugging environment variables after debugging by youkaichao in https://github.com/vllm-project/vllm/pull/6481
* [Hardware][TPU] Support MoE with Pallas GMM kernel by WoosukKwon in https://github.com/vllm-project/vllm/pull/6457
* [Doc] Fix the lora adapter path in server startup script by Jeffwan in https://github.com/vllm-project/vllm/pull/6230
* [Misc] Log spec decode metrics by comaniac in https://github.com/vllm-project/vllm/pull/6454
* [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` by mgoin in https://github.com/vllm-project/vllm/pull/6081
* [ci][distributed] add pipeline parallel correctness test by youkaichao in https://github.com/vllm-project/vllm/pull/6410
* [misc][distributed] improve tests by youkaichao in https://github.com/vllm-project/vllm/pull/6488
* [misc][distributed] add seed to dummy weights by youkaichao in https://github.com/vllm-project/vllm/pull/6491
* [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by wushidonguc in https://github.com/vllm-project/vllm/pull/6455
* [ROCm] Cleanup Dockerfile and remove outdated patch by hongxiayang in https://github.com/vllm-project/vllm/pull/6482
* [Misc][Speculative decoding] Typos and typing fixes by ShangmingCai in https://github.com/vllm-project/vllm/pull/6467
* [Doc][CI/Build] Update docs and tests to use `vllm serve` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6431
* [Bugfix] Fix for multinode crash on 4 PP by andoorve in https://github.com/vllm-project/vllm/pull/6495
* [TPU] Remove multi-modal args in TPU backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/6504
* [Misc] Use `torch.Tensor` for type annotation by WoosukKwon in https://github.com/vllm-project/vllm/pull/6505
* [Core] Refactor _prepare_model_input_tensors - take 2 by comaniac in https://github.com/vllm-project/vllm/pull/6164
* [DOC] - Add docker image to Cerebrium Integration by milo157 in https://github.com/vllm-project/vllm/pull/6510
* [Bugfix] Fix Ray Metrics API usage by Yard1 in https://github.com/vllm-project/vllm/pull/6354
* [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6338
* [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6511
* [Model] Pipeline parallel support for Mixtral by comaniac in https://github.com/vllm-project/vllm/pull/6516
* [ Kernel ] Fp8 Channelwise Weight Support by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6487
* [core][model] yet another cpu offload implementation by youkaichao in https://github.com/vllm-project/vllm/pull/6496
* [BugFix] Avoid secondary error in ShmRingBuffer destructor by njhill in https://github.com/vllm-project/vllm/pull/6530
* [Core] Introduce SPMD worker execution using Ray accelerated DAG by ruisearch42 in https://github.com/vllm-project/vllm/pull/6032
* [Misc] Minor patch for draft model runner by comaniac in https://github.com/vllm-project/vllm/pull/6523
* [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by njhill in https://github.com/vllm-project/vllm/pull/6227
* [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by noamgat in https://github.com/vllm-project/vllm/pull/6501
* [TPU] Refactor TPU worker & model runner by WoosukKwon in https://github.com/vllm-project/vllm/pull/6506
* [ Misc ] Improve Min Capability Checking in `compressed-tensors` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6522
* [ci] Reword Github bot comment by khluu in https://github.com/vllm-project/vllm/pull/6534
* [Model] Support Mistral-Nemo by mgoin in https://github.com/vllm-project/vllm/pull/6548
* Fix PR comment bot by khluu in https://github.com/vllm-project/vllm/pull/6554
* [ci][test] add correctness test for cpu offloading by youkaichao in https://github.com/vllm-project/vllm/pull/6549
* [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6552
* [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6517
* Add support for a rope extension method by simon-mo in https://github.com/vllm-project/vllm/pull/6553
* [Core] Multiprocessing Pipeline Parallel support by njhill in https://github.com/vllm-project/vllm/pull/6130
* [Bugfix] Make spec. decode respect per-request seed. by tdoublep in https://github.com/vllm-project/vllm/pull/6034
* [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6515
* [Bugfix][Frontend] Fix missing `/metrics` endpoint by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6463
* [BUGFIX] Raise an error for no draft token case when draft_tp>1 by wooyeonlee0 in https://github.com/vllm-project/vllm/pull/6369
* [Model] RowParallelLinear: pass bias to quant_method.apply by tdoublep in https://github.com/vllm-project/vllm/pull/6327
* [Bugfix][Frontend] remove duplicate init logger by dtrifiro in https://github.com/vllm-project/vllm/pull/6581
* [Misc] Small perf improvements by Yard1 in https://github.com/vllm-project/vllm/pull/6520
* [Docs] Update docs for wheel location by simon-mo in https://github.com/vllm-project/vllm/pull/6580
* [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by tdoublep in https://github.com/vllm-project/vllm/pull/6578
* [bugfix][distributed] fix multi-node bug for shared memory by youkaichao in https://github.com/vllm-project/vllm/pull/6597
* [ Kernel ] Enable Dynamic Per Token `fp8` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6547
* [Docs] Update PP docs by andoorve in https://github.com/vllm-project/vllm/pull/6598
* [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by youkaichao in https://github.com/vllm-project/vllm/pull/6599
* [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6593
* [Core] Allow specifying custom Executor by Yard1 in https://github.com/vllm-project/vllm/pull/6557
* [Bugfix][Core]: Guard for KeyErrors that can occur if a request is aborted with Pipeline Parallel by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6587
* [Misc] Consolidate and optimize logic for building padded tensors by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6541
* [ Misc ] `fbgemm` checkpoints by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6559
* [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes by mawong-amd in https://github.com/vllm-project/vllm/pull/6543
* [ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6606
* [Misc] Fix input_scale typing in w8a8_utils.py by mgoin in https://github.com/vllm-project/vllm/pull/6579
* [ Bugfix ] Fix AutoFP8 fp8 marlin by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6609
* [Frontend] Move chat utils by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6602
* [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. by sroy745 in https://github.com/vllm-project/vllm/pull/6485
* [Misc] Remove abused noqa by WoosukKwon in https://github.com/vllm-project/vllm/pull/6619
* [Model] Refactor and decouple phi3v image embedding by Isotr0py in https://github.com/vllm-project/vllm/pull/6621
* [Kernel][Core] Add AWQ support to the Marlin kernel by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6612
* [Model] Initial Support for Chameleon by ywang96 in https://github.com/vllm-project/vllm/pull/5770
* [Misc] Add a wrapper for torch.inference_mode by WoosukKwon in https://github.com/vllm-project/vllm/pull/6618
* [Bugfix] Fix `vocab_size` field access in LLaVA models by jaywonchung in https://github.com/vllm-project/vllm/pull/6624
* [Frontend] Refactor prompt processing by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4028
* [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6649
* [ci] Use different sccache bucket for CUDA 11.8 wheel build by khluu in https://github.com/vllm-project/vllm/pull/6656
* [Core] Support dynamically loading Lora adapter from HuggingFace by Jeffwan in https://github.com/vllm-project/vllm/pull/6234
* [ci][build] add back vim in docker by youkaichao in https://github.com/vllm-project/vllm/pull/6661
* [Misc] Remove deprecation warning for beam search by WoosukKwon in https://github.com/vllm-project/vllm/pull/6659
* [Core] Modulize prepare input and attention metadata builder by comaniac in https://github.com/vllm-project/vllm/pull/6596
* [Bugfix] Fix null `modules_to_not_convert` in FBGEMM Fp8 quantization by cli99 in https://github.com/vllm-project/vllm/pull/6665
* [Misc] Enable chunked prefill by default for long context models by WoosukKwon in https://github.com/vllm-project/vllm/pull/6666
* [misc] add start loading models for users information by youkaichao in https://github.com/vllm-project/vllm/pull/6670
* add tqdm when loading checkpoint shards by zhaotyer in https://github.com/vllm-project/vllm/pull/6569
* [Misc] Support FP8 kv cache scales from compressed-tensors by mgoin in https://github.com/vllm-project/vllm/pull/6528
* [doc][distributed] add more doc for setting up multi-node environment by youkaichao in https://github.com/vllm-project/vllm/pull/6529
* [Misc] Manage HTTP connections in one place by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6600
* [misc] only tqdm for first rank by youkaichao in https://github.com/vllm-project/vllm/pull/6672
* [VLM][Model] Support image input for Chameleon by ywang96 in https://github.com/vllm-project/vllm/pull/6633
* support ignore patterns in model loader by simon-mo in https://github.com/vllm-project/vllm/pull/6673
* Bump version to v0.5.3 by simon-mo in https://github.com/vllm-project/vllm/pull/6674

New Contributors
* g-eoj made their first contribution in https://github.com/vllm-project/vllm/pull/6419
* peng1999 made their first contribution in https://github.com/vllm-project/vllm/pull/6442
* Jeffwan made their first contribution in https://github.com/vllm-project/vllm/pull/6230
* wushidonguc made their first contribution in https://github.com/vllm-project/vllm/pull/6455
* ShangmingCai made their first contribution in https://github.com/vllm-project/vllm/pull/6467
* ruisearch42 made their first contribution in https://github.com/vllm-project/vllm/pull/6032

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.2...v0.5.3

0.5.2

Major Changes
* ❗Planned breaking change❗: we plan to remove beam search (see more in 6226) in the next few releases. This release comes with a warning when beam search is enabled for a request. Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
* The release has moved to a Python-version-agnostic wheel (6394). A single wheel can be installed across all Python versions that vLLM supports.

Highlights

Model Support
* Add PaliGemma (5189), Fuyu-8B (3924)
* Support for soft-tuned prompts (4645)
* A [new guide](https://docs.vllm.ai/en/latest/dev/multimodal/adding_multimodal_plugin.html) for adding multi-modal plugins (6205)

Hardware
* AMD: unify CUDA_VISIBLE_DEVICES usage (6352)

Performance
* ZeroMQ fallback for broadcasting large objects (6183)
* Simplify code to support pipeline parallel (6406)
* Turn off CUTLASS scaled_mm for Ada Lovelace (6384)
* Use CUTLASS kernels for the FP8 layers with Bias (6270)

Features
* Enable bonus tokens in speculative decoding for KV-cache-based models (5765)
* Medusa implementation with Top-1 proposer (4978); see the sketch after this list
* An experimental vLLM CLI for serving and querying an OpenAI-compatible server (5090)
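
Below is a minimal sketch of configuring speculative decoding with a separate draft model; the model IDs and speculative token count are illustrative assumptions, and Medusa- or MLPSpeculator-style checkpoints are wired up through the same `speculative_model` argument.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: speculative decoding with a small draft model proposing
# tokens for a larger target model. Model IDs and num_speculative_tokens are
# illustrative assumptions; spec decode currently requires block manager v2.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    num_speculative_tokens=4,
    use_v2_block_manager=True,
)

outputs = llm.generate(["Speculative decoding works by"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```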

Others
* Add support for multi-node on CI (5955)
* Benchmark: add H100 suite (6047)
* [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (5362)
* Build some nightly wheels (6380)



What's Changed
* Update wheel builds to strip debug by simon-mo in https://github.com/vllm-project/vllm/pull/6161
* Fix release wheel build env var by simon-mo in https://github.com/vllm-project/vllm/pull/6162
* Move release wheel env var to Dockerfile instead by simon-mo in https://github.com/vllm-project/vllm/pull/6163
* [Doc] Reorganize Supported Models by Type by ywang96 in https://github.com/vllm-project/vllm/pull/6167
* [Doc] Move guide for multimodal model and other improvements by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6168
* [Model] Add PaliGemma by ywang96 in https://github.com/vllm-project/vllm/pull/5189
* add benchmark for fix length input and output by haichuan1221 in https://github.com/vllm-project/vllm/pull/5857
* [ Misc ] Support Fp8 via `llm-compressor` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6110
* [misc][frontend] log all available endpoints by youkaichao in https://github.com/vllm-project/vllm/pull/6195
* do not exclude `object` field in CompletionStreamResponse by kczimm in https://github.com/vllm-project/vllm/pull/6196
* [Bugfix] FIx benchmark args for randomly sampled dataset by haichuan1221 in https://github.com/vllm-project/vllm/pull/5947
* [Kernel] reloading fused_moe config on the last chunk by avshalomman in https://github.com/vllm-project/vllm/pull/6210
* [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by afeldman-nm in https://github.com/vllm-project/vllm/pull/4888
* [Bugfix] use diskcache in outlines _get_guide 5436 by ericperfect in https://github.com/vllm-project/vllm/pull/6203
* [Bugfix] Mamba cache Cuda Graph padding by tomeras91 in https://github.com/vllm-project/vllm/pull/6214
* Add FlashInfer to default Dockerfile by simon-mo in https://github.com/vllm-project/vllm/pull/6172
* [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by youkaichao in https://github.com/vllm-project/vllm/pull/6216
* [core][distributed] fix ray worker rank assignment by youkaichao in https://github.com/vllm-project/vllm/pull/6235
* [Bugfix][TPU] Add missing None to model input by WoosukKwon in https://github.com/vllm-project/vllm/pull/6245
* [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by WoosukKwon in https://github.com/vllm-project/vllm/pull/6256
* Add support for multi-node on CI by khluu in https://github.com/vllm-project/vllm/pull/5955
* [CORE] Adding support for insertion of soft-tuned prompts by SwapnilDreams100 in https://github.com/vllm-project/vllm/pull/4645
* [Docs] Docs update for Pipeline Parallel by andoorve in https://github.com/vllm-project/vllm/pull/6222
* [Bugfix]fix and needs_scalar_to_array logic check by qibaoyuan in https://github.com/vllm-project/vllm/pull/6238
* [Speculative Decoding] Medusa Implementation with Top-1 proposer by abhigoyal1997 in https://github.com/vllm-project/vllm/pull/4978
* [core][distributed] add zmq fallback for broadcasting large objects by youkaichao in https://github.com/vllm-project/vllm/pull/6183
* [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by WoosukKwon in https://github.com/vllm-project/vllm/pull/6279
* [Doc] Guide for adding multi-modal plugins by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6205
* [Bugfix] Support 2D input shape in MoE layer by WoosukKwon in https://github.com/vllm-project/vllm/pull/6287
* [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by tdoublep in https://github.com/vllm-project/vllm/pull/6303
* [CI/Build] Enable mypy typing for remaining folders by bmuskalla in https://github.com/vllm-project/vllm/pull/6268
* [Bugfix] OpenVINOExecutor abstractmethod error by park12sj in https://github.com/vllm-project/vllm/pull/6296
* [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by sroy745 in https://github.com/vllm-project/vllm/pull/5765
* [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by WoosukKwon in https://github.com/vllm-project/vllm/pull/6313
* [Doc] Remove comments incorrectly copied from another project by daquexian in https://github.com/vllm-project/vllm/pull/6286
* [Doc] Update description of vLLM support for CPUs by DamonFool in https://github.com/vllm-project/vllm/pull/6003
* [BugFix]: set outlines pkg version by xiangyang-95 in https://github.com/vllm-project/vllm/pull/6262
* [Bugfix] Fix snapshot download in serving benchmark by ywang96 in https://github.com/vllm-project/vllm/pull/6318
* [Misc] refactor(config): clean up unused code by aniaan in https://github.com/vllm-project/vllm/pull/6320
* [BugFix]: fix engine timeout due to request abort by pushan01 in https://github.com/vllm-project/vllm/pull/6255
* [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by tdoublep in https://github.com/vllm-project/vllm/pull/6326
* [BugFix] get_and_reset only when scheduler outputs are not empty by mzusman in https://github.com/vllm-project/vllm/pull/6266
* [ Misc ] Refactor Marlin Python Utilities by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6082
* Benchmark: add H100 suite by simon-mo in https://github.com/vllm-project/vllm/pull/6047
* [bug fix] Fix llava next feature size calculation. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/6339
* [doc] update pipeline parallel in readme by youkaichao in https://github.com/vllm-project/vllm/pull/6347
* [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by KuntaiDu in https://github.com/vllm-project/vllm/pull/5362
* [ BugFix ] Prompt Logprobs Detokenization by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6223
* [Misc] Remove flashinfer warning, add flashinfer tests to CI by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6351
* [distributed][misc] keep consistent with how pytorch finds libcudart.so by youkaichao in https://github.com/vllm-project/vllm/pull/6346
* [Bugfix] Fix usage stats logging exception warning with OpenVINO by helena-intel in https://github.com/vllm-project/vllm/pull/6349
* [Model][Phi3-Small] Remove scipy from blocksparse_attention by mgoin in https://github.com/vllm-project/vllm/pull/6343
* [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by adityagoel14 in https://github.com/vllm-project/vllm/pull/6350
* [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by hongxiayang in https://github.com/vllm-project/vllm/pull/6352
* [ Misc ] Remove separate bias add by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6353
* [Misc][Bugfix] Update transformers for tokenizer issue by ywang96 in https://github.com/vllm-project/vllm/pull/6364
* [ Misc ] Support Models With Bias in `compressed-tensors` integration by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6356
* [Bugfix] Fix dtype mismatch in PaliGemma by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6367
* [Build/CI] Checking/Waiting for the GPU's clean state by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/6379
* [Misc] add fixture to guided processor tests by kevinbu233 in https://github.com/vllm-project/vllm/pull/6341
* [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by khluu in https://github.com/vllm-project/vllm/pull/6365
* [ci] Add GHA workflows to enable full CI run by khluu in https://github.com/vllm-project/vllm/pull/6381
* [MISC] Upgrade dependency to PyTorch 2.3.1 by comaniac in https://github.com/vllm-project/vllm/pull/5327
* Build some nightly wheels by default by simon-mo in https://github.com/vllm-project/vllm/pull/6380
* Fix release-pipeline.yaml by simon-mo in https://github.com/vllm-project/vllm/pull/6388
* Fix interpolation in release pipeline by simon-mo in https://github.com/vllm-project/vllm/pull/6389
* Fix release pipeline's -e flag by simon-mo in https://github.com/vllm-project/vllm/pull/6390
* [Bugfix] Fix illegal memory access in FP8 MoE kernel by comaniac in https://github.com/vllm-project/vllm/pull/6382
* [Misc] Add generated git commit hash as `vllm.__commit__` by mgoin in https://github.com/vllm-project/vllm/pull/6386
* Fix release pipeline's dir permission by simon-mo in https://github.com/vllm-project/vllm/pull/6391
* [Bugfix][TPU] Fix megacore setting for v5e-litepod by WoosukKwon in https://github.com/vllm-project/vllm/pull/6397
* [ci] Fix wording for GH bot by khluu in https://github.com/vllm-project/vllm/pull/6398
* [Doc] Fix Typo in Doc by esaliya in https://github.com/vllm-project/vllm/pull/6392
* [Bugfix] Fix hard-coded value of x in context_attention_fwd by tdoublep in https://github.com/vllm-project/vllm/pull/6373
* [Docs] Clean up latest news by WoosukKwon in https://github.com/vllm-project/vllm/pull/6401
* [ci] try to add multi-node tests by youkaichao in https://github.com/vllm-project/vllm/pull/6280
* Updating LM Format Enforcer version to v10.3 by noamgat in https://github.com/vllm-project/vllm/pull/6411
* [ Misc ] More Cleanup of Marlin by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6359
* [Misc] Add deprecation warning for beam search by WoosukKwon in https://github.com/vllm-project/vllm/pull/6402
* [ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6417
* [Model] Initialize Fuyu-8B support by Isotr0py in https://github.com/vllm-project/vllm/pull/3924
* Remove unnecessary trailing period in spec_decode.rst by terrytangyuan in https://github.com/vllm-project/vllm/pull/6405
* [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6384
* [ci][build] fix commit id by youkaichao in https://github.com/vllm-project/vllm/pull/6420
* [ Misc ] Enable Quantizing All Layers of DeekSeekv2 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6423
* [Feature] vLLM CLI for serving and querying OpenAI compatible server by EthanqX in https://github.com/vllm-project/vllm/pull/5090
* [Doc] xpu backend requires running setvars.sh by rscohn2 in https://github.com/vllm-project/vllm/pull/6393
* [CI/Build] Cross python wheel by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6394
* [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' by lxline in https://github.com/vllm-project/vllm/pull/6428
* Report usage for beam search by simon-mo in https://github.com/vllm-project/vllm/pull/6404
* Add FUNDING.yml by simon-mo in https://github.com/vllm-project/vllm/pull/6435
* [BugFix] BatchResponseData body should be optional by zifeitong in https://github.com/vllm-project/vllm/pull/6345
* [Doc] add env docs for flashinfer backend by DefTruth in https://github.com/vllm-project/vllm/pull/6437
* [core][distributed] simplify code to support pipeline parallel by youkaichao in https://github.com/vllm-project/vllm/pull/6406
* [Bugfix] Convert image to RGB by default by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6430
* [doc][misc] doc update by youkaichao in https://github.com/vllm-project/vllm/pull/6439
* [VLM] Minor space optimization for `ClipVisionModel` by ywang96 in https://github.com/vllm-project/vllm/pull/6436
* [doc][distributed] add suggestion for distributed inference by youkaichao in https://github.com/vllm-project/vllm/pull/6418
* [Kernel] Use CUTLASS kernels for the FP8 layers with Bias by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6270
* [Misc] Use 0.0.9 version for flashinfer by Pernekhan in https://github.com/vllm-project/vllm/pull/6447
* [Bugfix] Add custom Triton cache manager to resolve MoE MP issue by tdoublep in https://github.com/vllm-project/vllm/pull/6140
* [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF by tdoublep in https://github.com/vllm-project/vllm/pull/6409
* bump version to v0.5.2 by simon-mo in https://github.com/vllm-project/vllm/pull/6433
* [misc][distributed] fix pp missing layer condition by youkaichao in https://github.com/vllm-project/vllm/pull/6446

New Contributors
* haichuan1221 made their first contribution in https://github.com/vllm-project/vllm/pull/5857
* kczimm made their first contribution in https://github.com/vllm-project/vllm/pull/6196
* ericperfect made their first contribution in https://github.com/vllm-project/vllm/pull/6203
* qibaoyuan made their first contribution in https://github.com/vllm-project/vllm/pull/6238
* abhigoyal1997 made their first contribution in https://github.com/vllm-project/vllm/pull/4978
* bmuskalla made their first contribution in https://github.com/vllm-project/vllm/pull/6268
* park12sj made their first contribution in https://github.com/vllm-project/vllm/pull/6296
* daquexian made their first contribution in https://github.com/vllm-project/vllm/pull/6286
* xiangyang-95 made their first contribution in https://github.com/vllm-project/vllm/pull/6262
* aniaan made their first contribution in https://github.com/vllm-project/vllm/pull/6320
* pushan01 made their first contribution in https://github.com/vllm-project/vllm/pull/6255
* helena-intel made their first contribution in https://github.com/vllm-project/vllm/pull/6349
* adityagoel14 made their first contribution in https://github.com/vllm-project/vllm/pull/6350
* kevinbu233 made their first contribution in https://github.com/vllm-project/vllm/pull/6341
* esaliya made their first contribution in https://github.com/vllm-project/vllm/pull/6392
* EthanqX made their first contribution in https://github.com/vllm-project/vllm/pull/5090
* rscohn2 made their first contribution in https://github.com/vllm-project/vllm/pull/6393
* lxline made their first contribution in https://github.com/vllm-project/vllm/pull/6428

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.1...v0.5.2

0.5.1

Highlights
* vLLM now has pipeline parallelism! (4412, 5408, 6115, 6120) You can now run the API server with `--pipeline-parallel-size`. This feature is in an early stage; please let us know your feedback.


Model Support
* Support Gemma 2 (5908, 6051). Please note that for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap. The wheels for FlashInfer can be downloaded [here](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.8)
* Support Jamba (4115). This is vLLM's first state space model!
* Support Deepseek-V2 (4650). Please note that MLA (Multi-head Latent Attention) is not implemented and we are looking for contributions!
* Vision-language models: added support for Phi3-Vision, dynamic image sizes, and a registry for processing model inputs (4986, 5276, 5214)
* Notably, this comes with a **breaking change**: all VLM-specific arguments are now removed from the engine APIs, so you no longer need to set them globally via the CLI. Instead, you only need to pass `<image>` in the prompt rather than using complicated prompt formatting; see the sketch after this list. See more [here](https://docs.vllm.ai/en/latest/models/vlm.html#offline-batched-inference)
* There is also a new [guide](https://docs.vllm.ai/en/latest/models/enabling_multimodal_inputs.html) on adding VLMs! We would love your contribution for new models!
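
Below is a minimal sketch of the simplified VLM prompt format; the model ID, prompt template, and image path are illustrative assumptions, and the exact container for `multi_modal_data` has changed across releases, so follow the linked guide for the form matching your version.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Minimal sketch: the image placeholder goes directly in the prompt text and
# the raw image is passed as multi-modal data. Model ID, prompt template, and
# image path are illustrative assumptions.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
image = Image.open("example.jpg")  # any local RGB image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```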

Hardware Support
* Enhancement to TPU support (5292, 5878, 5850, 5831, 5855)
* OpenVINO backend (5379)


Production Service
* Support for sharded tensorized models (4990)
* Continuous streaming of OpenAI response token stats (5742)


Performance
* Enhancement in distributed communication via shared memory (5399)
* Latency enhancement in block manager (5584)
* Enhancements to `compressed-tensors` supporting Marlin, W4A16 (5435, 5385)
* Faster FP8 quantize kernel (5396), FP8 on Ampere (5975)
* Option to use FlashInfer for prefill, decode, and CUDA Graph for decode (4628); see the env-var sketch after this list
* Speculative Decoding
* MLPSpeculator (4947, 6050)
* Typical Acceptance Sampler (5131, 5348)
* Draft Model Runner (5799)
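
Below is a minimal sketch of opting into the FlashInfer attention backend via an environment variable (also relevant for Gemma 2's logits soft cap, noted above); the model ID is an illustrative assumption and the FlashInfer wheels must be installed separately.

```python
import os

# The backend must be selected before vLLM initializes its workers.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Minimal sketch: run a model with the FlashInfer attention backend.
# Model ID is an illustrative assumption; FlashInfer wheels are required.
llm = LLM(model="google/gemma-2-9b-it")

outputs = llm.generate(["FlashInfer provides"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```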


Development Productivity
* Post-merge benchmarks are now available at perf.vllm.ai!
* Addition of A100 in CI environment (5658)
* Step towards nightly wheel publication (5610)



What's Changed
* [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by mgoin in https://github.com/vllm-project/vllm/pull/5253
* Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by simon-mo in https://github.com/vllm-project/vllm/pull/5463
* [CI] Upgrade codespell version. by rkooo567 in https://github.com/vllm-project/vllm/pull/5381
* [Hardware] Initial TPU integration by WoosukKwon in https://github.com/vllm-project/vllm/pull/5292
* [Bugfix] Add device assertion to TorchSDPA by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/5402
* [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by khluu in https://github.com/vllm-project/vllm/pull/5464
* [Kernel] Vectorized FP8 quantize kernel by comaniac in https://github.com/vllm-project/vllm/pull/5396
* [Bugfix] TYPE_CHECKING for MultiModalData by kimdwkimdw in https://github.com/vllm-project/vllm/pull/5444
* [Frontend] [Core] Support for sharded tensorized models by tjohnson31415 in https://github.com/vllm-project/vllm/pull/4990
* [misc] add hint for AttributeError by youkaichao in https://github.com/vllm-project/vllm/pull/5462
* [Doc] Update debug docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5438
* [Bugfix] Fix typo in scheduler.py (requeset -> request) by mgoin in https://github.com/vllm-project/vllm/pull/5470
* [Frontend] Add "input speed" to tqdm postfix alongside output speed by mgoin in https://github.com/vllm-project/vllm/pull/5425
* [Bugfix] Fix wrong multi_modal_input format for CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/5451
* [Core][Distributed] add coordinator to reduce code duplication in tp and pp by youkaichao in https://github.com/vllm-project/vllm/pull/5293
* [ci] Use sccache to build images by khluu in https://github.com/vllm-project/vllm/pull/5419
* [Bugfix]if the content is started with ":"(response of ping), client should i… by sywangyi in https://github.com/vllm-project/vllm/pull/5303
* [Kernel] `w4a16` support for `compressed-tensors` by dsikka in https://github.com/vllm-project/vllm/pull/5385
* [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by mgoin in https://github.com/vllm-project/vllm/pull/5466
* [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by wenyujin333 in https://github.com/vllm-project/vllm/pull/5497
* [Hardware][Intel] Optimize CPU backend and add more performance tips by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4971
* [Docs] Add 4th meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/5509
* [Misc] Add vLLM version getter to utils by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5098
* [CI/Build] Simplify OpenAI server setup in tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5100
* [Doc] Update LLaVA docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5437
* [Kernel] Factor out epilogues from cutlass kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5391
* [MISC] Remove FP8 warning by comaniac in https://github.com/vllm-project/vllm/pull/5472
* Seperate dev requirements into lint and test by Yard1 in https://github.com/vllm-project/vllm/pull/5474
* Revert "[Core] Remove unnecessary copies in flash attn backend" by Yard1 in https://github.com/vllm-project/vllm/pull/5478
* [misc] fix format.sh by youkaichao in https://github.com/vllm-project/vllm/pull/5511
* [CI/Build] Disable test_fp8.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5508
* [Kernel] Disable CUTLASS kernels for fp8 by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5505
* Add `cuda_device_count_stateless` by Yard1 in https://github.com/vllm-project/vllm/pull/5473
* [Hardware][Intel] Support CPU inference with AVX2 ISA by DamonFool in https://github.com/vllm-project/vllm/pull/5452
* [Bugfix]typofix by AllenDou in https://github.com/vllm-project/vllm/pull/5507
* bump version to v0.5.0.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/5522
* [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by KuntaiDu in https://github.com/vllm-project/vllm/pull/5073
* [CI/Build] Disable LLaVA-NeXT CPU test by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5529
* [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5516
* [Misc] Fix arg names by AllenDou in https://github.com/vllm-project/vllm/pull/5524
* [ Misc ] Rs/compressed tensors cleanup by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5432
* [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5401
* [mis] fix flaky test of test_cuda_device_count_stateless by youkaichao in https://github.com/vllm-project/vllm/pull/5546
* [Core] Remove duplicate processing in async engine by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5525
* [misc][distributed] fix benign error in `is_in_the_same_node` by youkaichao in https://github.com/vllm-project/vllm/pull/5512
* [Docs] Add ZhenFund as a Sponsor by simon-mo in https://github.com/vllm-project/vllm/pull/5548
* [Doc] Update documentation on Tensorizer by sangstar in https://github.com/vllm-project/vllm/pull/5471
* [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by tdoublep in https://github.com/vllm-project/vllm/pull/5460
* [Bugfix] Fix typo in Pallas backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/5558
* [Core][Distributed] improve p2p cache generation by youkaichao in https://github.com/vllm-project/vllm/pull/5528
* Add ccache to amd by simon-mo in https://github.com/vllm-project/vllm/pull/5555
* [Core][Bugfix]: fix prefix caching for blockv2 by leiwen83 in https://github.com/vllm-project/vllm/pull/5364
* [mypy] Enable type checking for test directory by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5017
* [CI/Build] Test both text and token IDs in batched OpenAI Completions API by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5568
* [misc] Do not allow to use lora with chunked prefill. by rkooo567 in https://github.com/vllm-project/vllm/pull/5538
* add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5145
* [BugFix] Don't start a Ray cluster when not using Ray by njhill in https://github.com/vllm-project/vllm/pull/5570
* [Fix] Correct OpenAI batch response format by zifeitong in https://github.com/vllm-project/vllm/pull/5554
* Add basic correctness 2 GPU tests to 4 GPU pipeline by Yard1 in https://github.com/vllm-project/vllm/pull/5518
* [CI][BugFix] Flip is_quant_method_supported condition by mgoin in https://github.com/vllm-project/vllm/pull/5577
* [build][misc] limit numpy version by youkaichao in https://github.com/vllm-project/vllm/pull/5582
* [Doc] add debugging tips for crash and multi-node debugging by youkaichao in https://github.com/vllm-project/vllm/pull/5581
* Fix w8a8 benchmark and add Llama-3-8B by comaniac in https://github.com/vllm-project/vllm/pull/5562
* [Model] Rename Phi3 rope scaling type by garg-amit in https://github.com/vllm-project/vllm/pull/5595
* Correct alignment in the seq_len diagram. by CharlesRiggins in https://github.com/vllm-project/vllm/pull/5592
* [Kernel] `compressed-tensors` marlin 24 support by dsikka in https://github.com/vllm-project/vllm/pull/5435
* [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by zhyncs in https://github.com/vllm-project/vllm/pull/5588
* [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by jikunshang in https://github.com/vllm-project/vllm/pull/3814
* [CI/BUILD] Support non-AVX512 vLLM building and testing by DamonFool in https://github.com/vllm-project/vllm/pull/5574
* [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by KuntaiDu in https://github.com/vllm-project/vllm/pull/5571
* [bugfix][distributed] fix 16 gpus local rank arrangement by youkaichao in https://github.com/vllm-project/vllm/pull/5604
* [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by youkaichao in https://github.com/vllm-project/vllm/pull/5584
* [Bugfix] Fix KV head calculation for MPT models when using GQA by bfontain in https://github.com/vllm-project/vllm/pull/5142
* [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by zifeitong in https://github.com/vllm-project/vllm/pull/5606
* [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by sroy745 in https://github.com/vllm-project/vllm/pull/5131
* [Model] Initialize Phi-3-vision support by Isotr0py in https://github.com/vllm-project/vllm/pull/4986
* [Kernel] Add punica dimensions for Granite 13b by joerunde in https://github.com/vllm-project/vllm/pull/5559
* [misc][typo] fix typo by youkaichao in https://github.com/vllm-project/vllm/pull/5620
* [Misc] Fix typo by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5618
* [CI] Avoid naming different metrics with the same name in performance benchmark by KuntaiDu in https://github.com/vllm-project/vllm/pull/5615
* [bugfix][distributed] do not error if two processes do not agree on p2p capability by youkaichao in https://github.com/vllm-project/vllm/pull/5612
* [Misc] Remove import from transformers logging by CatherineSue in https://github.com/vllm-project/vllm/pull/5625
* [CI/Build][Misc] Update Pytest Marker for VLMs by ywang96 in https://github.com/vllm-project/vllm/pull/5623
* [ci] Deprecate original CI template by khluu in https://github.com/vllm-project/vllm/pull/5624
* [Misc] Add OpenTelemetry support by ronensc in https://github.com/vllm-project/vllm/pull/4687
* [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by dsikka in https://github.com/vllm-project/vllm/pull/5542
* [ci] Setup Release pipeline and build release wheels with cache by khluu in https://github.com/vllm-project/vllm/pull/5610
* [Model] LoRA support added for command-r by sergey-tinkoff in https://github.com/vllm-project/vllm/pull/5178
* [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by tdoublep in https://github.com/vllm-project/vllm/pull/5639
* [Doc] Added cerebrium as Integration option by milo157 in https://github.com/vllm-project/vllm/pull/5553
* [Bugfix] Fix CUDA version check for mma warning suppression by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5642
* [Bugfix] Fix w8a8 benchmarks for int8 case by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5643
* [Bugfix] Fix Phi-3 Long RoPE scaling implementation by ShukantPal in https://github.com/vllm-project/vllm/pull/5628
* [Bugfix] Added test for sampling repetition penalty bug. by tdoublep in https://github.com/vllm-project/vllm/pull/5659
* [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by hongxiayang in https://github.com/vllm-project/vllm/pull/5641
* [misc][distributed] use localhost for single-node by youkaichao in https://github.com/vllm-project/vllm/pull/5619
* [Model] Add FP8 kv cache for Qwen2 by mgoin in https://github.com/vllm-project/vllm/pull/5656
* [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by Isotr0py in https://github.com/vllm-project/vllm/pull/5684
* [Misc]Add param max-model-len in benchmark_latency.py by DearPlanet in https://github.com/vllm-project/vllm/pull/5629
* [CI/Build] Add tqdm to dependencies by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5680
* [ci] Add A100 queue into AWS CI template by khluu in https://github.com/vllm-project/vllm/pull/5648
* [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by mgoin in https://github.com/vllm-project/vllm/pull/5688
* [ci][distributed] add tests for custom allreduce by youkaichao in https://github.com/vllm-project/vllm/pull/5689
* [Bugfix] AsyncLLMEngine hangs with asyncio.run by zifeitong in https://github.com/vllm-project/vllm/pull/5654
* [Doc] Update docker references by rafvasq in https://github.com/vllm-project/vllm/pull/5614
* [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by dsikka in https://github.com/vllm-project/vllm/pull/5650
* [ci] Limit num gpus if specified for A100 by khluu in https://github.com/vllm-project/vllm/pull/5694
* [Misc] Improve conftest by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5681
* [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by ywang96 in https://github.com/vllm-project/vllm/pull/5703
* [Kernel] Update Cutlass int8 kernel configs for SM90 by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5514
* [Model] Port over CLIPVisionModel for VLMs by ywang96 in https://github.com/vllm-project/vllm/pull/5591
* [Kernel] Update Cutlass int8 kernel configs for SM80 by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5275
* [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5715
* [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by mgoin in https://github.com/vllm-project/vllm/pull/5718
* [distributed][misc] use fork by default for mp by youkaichao in https://github.com/vllm-project/vllm/pull/5669
* [Model] MLPSpeculator speculative decoding support by JRosenkranz in https://github.com/vllm-project/vllm/pull/4947
* [Kernel] Add punica dimension for Qwen2 LoRA by jinzhen-lin in https://github.com/vllm-project/vllm/pull/5441
* [BugFix] Fix test_phi3v.py by CatherineSue in https://github.com/vllm-project/vllm/pull/5725
* [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by jeejeelee in https://github.com/vllm-project/vllm/pull/5665
* [Core][Distributed] add shm broadcast by youkaichao in https://github.com/vllm-project/vllm/pull/5399
* [Kernel][CPU] Add Quick `gelu` to CPU by ywang96 in https://github.com/vllm-project/vllm/pull/5717
* [Doc] Documentation on supported hardware for quantization methods by mgoin in https://github.com/vllm-project/vllm/pull/5745
* [BugFix] exclude version 1.15.0 for modelscope by zhyncs in https://github.com/vllm-project/vllm/pull/5668
* [ci][test] fix ca test in main by youkaichao in https://github.com/vllm-project/vllm/pull/5746
* [LoRA] Add support for pinning lora adapters in the LRU cache by rohithkrn in https://github.com/vllm-project/vllm/pull/5603
* [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by jikunshang in https://github.com/vllm-project/vllm/pull/5616
* [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by DamonFool in https://github.com/vllm-project/vllm/pull/5710
* [Misc] Remove 4789 workaround left in vllm/entrypoints/openai/run_batch.py by zifeitong in https://github.com/vllm-project/vllm/pull/5756
* [Bugfix] Fix pin_lora error in TPU executor by WoosukKwon in https://github.com/vllm-project/vllm/pull/5760
* [Docs][TPU] Add installation tip for TPU by WoosukKwon in https://github.com/vllm-project/vllm/pull/5761
* [core][distributed] improve shared memory broadcast by youkaichao in https://github.com/vllm-project/vllm/pull/5754
* [BugFix] [Kernel] Add Cutlass2x fallback kernels by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5744
* [Distributed] Add send and recv helpers by andoorve in https://github.com/vllm-project/vllm/pull/5719
* [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by Isotr0py in https://github.com/vllm-project/vllm/pull/5772
* [doc][faq] add warning to download models for every nodes by youkaichao in https://github.com/vllm-project/vllm/pull/5783
* [Doc] Add "Suggest edit" button to doc pages by mgoin in https://github.com/vllm-project/vllm/pull/5789
* [Doc] Add Phi-3-medium to list of supported models by mgoin in https://github.com/vllm-project/vllm/pull/5788
* [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by CatherineSue in https://github.com/vllm-project/vllm/pull/5795
* [ci] Remove aws template by khluu in https://github.com/vllm-project/vllm/pull/5757
* [Doc] Add notice about breaking changes to VLMs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5818
* [Speculative Decoding] Support draft model on different tensor-parallel size than target model by wooyeonlee0 in https://github.com/vllm-project/vllm/pull/5414
* [Misc] Remove useless code in cpu_worker by DamonFool in https://github.com/vllm-project/vllm/pull/5824
* [Core] Add fault tolerance for `RayTokenizerGroupPool` by Yard1 in https://github.com/vllm-project/vllm/pull/5748
* [doc][distributed] add both gloo and nccl tests by youkaichao in https://github.com/vllm-project/vllm/pull/5834
* [CI/Build] Add unit testing for FlexibleArgumentParser by mgoin in https://github.com/vllm-project/vllm/pull/5798
* [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by dsikka in https://github.com/vllm-project/vllm/pull/5794
* [Hardware][TPU] Refactor TPU backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/5831
* [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by mawong-amd in https://github.com/vllm-project/vllm/pull/5422
* [Hardware][TPU] Raise errors for unsupported sampling params by WoosukKwon in https://github.com/vllm-project/vllm/pull/5850
* [CI/Build] Add E2E tests for MLPSpeculator by tdoublep in https://github.com/vllm-project/vllm/pull/5791
* [Bugfix] Fix assertion in NeuronExecutor by aws-patlange in https://github.com/vllm-project/vllm/pull/5841
* [Core] Refactor Worker and ModelRunner to consolidate control plane communication by stephanie-wang in https://github.com/vllm-project/vllm/pull/5408
* [Misc][Doc] Add Example of using OpenAI Server with VLM by ywang96 in https://github.com/vllm-project/vllm/pull/5832
* [bugfix][distributed] fix shm broadcast when the queue size is full by youkaichao in https://github.com/vllm-project/vllm/pull/5801
* [Bugfix] Fix embedding to support 2D inputs by WoosukKwon in https://github.com/vllm-project/vllm/pull/5829
* [Bugfix][TPU] Fix KV cache size calculation by WoosukKwon in https://github.com/vllm-project/vllm/pull/5860
* [CI/Build] Refactor image test assets by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5821
* [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by ProExpertProg in https://github.com/vllm-project/vllm/pull/5560
* [Frontend] Add tokenize/detokenize endpoints by sasha0552 in https://github.com/vllm-project/vllm/pull/5054
* [Hardware][TPU] Support parallel sampling & Swapping by WoosukKwon in https://github.com/vllm-project/vllm/pull/5855
* [Bugfix][TPU] Fix CPU cache allocation by WoosukKwon in https://github.com/vllm-project/vllm/pull/5869
* Support CPU inference with VSX PowerPC ISA by ChipKerchner in https://github.com/vllm-project/vllm/pull/5652
* [doc] update usage of env var to avoid conflict by youkaichao in https://github.com/vllm-project/vllm/pull/5873
* [Misc] Add example for LLaVA-NeXT by ywang96 in https://github.com/vllm-project/vllm/pull/5879
* [BugFix] Fix cuda graph for MLPSpeculator by njhill in https://github.com/vllm-project/vllm/pull/5875
* [Doc] Add note about context length in Phi-3-Vision example by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5887
* [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5880
* [Model] Add base class for LoRA-supported models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5018
* [Bugfix] Fix img_sizes Parsing in Phi3-Vision by ywang96 in https://github.com/vllm-project/vllm/pull/5888
* [CI/Build] [1/3] Reorganize entrypoints tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5526
* [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5896
* [doc][misc] add note for Kubernetes users by youkaichao in https://github.com/vllm-project/vllm/pull/5916
* [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by njhill in https://github.com/vllm-project/vllm/pull/5876
* [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by njhill in https://github.com/vllm-project/vllm/pull/5849
* [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by ywang96 in https://github.com/vllm-project/vllm/pull/5922
* [Model] Add Gemma 2 by WoosukKwon in https://github.com/vllm-project/vllm/pull/5908
* [core][misc] remove logical block by youkaichao in https://github.com/vllm-project/vllm/pull/5882
* [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by divakar-amd in https://github.com/vllm-project/vllm/pull/5932
* [Hardware][TPU] Optimize KV cache swapping by WoosukKwon in https://github.com/vllm-project/vllm/pull/5878
* [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5905
* [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/5956
* [Core] Registry for processing model inputs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5214
* Unmark fused_moe config json file as executable by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5960
* [Hardware][Intel] OpenVINO vLLM backend by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/5379
* [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by tdoublep in https://github.com/vllm-project/vllm/pull/5894
* [CI/Build] [2/3] Reorganize entrypoints tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5904
* [Distributed] Make it clear that % should not be in tensor dict keys. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5927
* [Spec Decode] Introduce DraftModelRunner by comaniac in https://github.com/vllm-project/vllm/pull/5799
* [Bugfix] Fix compute datatype for cutlass 3.x epilogues by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5931
* [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5928
* [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5921
* Support Deepseek-V2 by zwd003 in https://github.com/vllm-project/vllm/pull/4650
* [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled by mgoin in https://github.com/vllm-project/vllm/pull/5936
* Unmark more files as executable by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5962
* [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5963
* [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4628
* [Bugfix][TPU] Fix TPU sampler output by WoosukKwon in https://github.com/vllm-project/vllm/pull/5978
* [Bugfix][TPU] Fix pad slot id by WoosukKwon in https://github.com/vllm-project/vllm/pull/5977
* [Bugfix] fix missing last itl in openai completions benchmark by mcalman in https://github.com/vllm-project/vllm/pull/5926
* [Misc] Extend vLLM Metrics logging API by SolitaryThinker in https://github.com/vllm-project/vllm/pull/5925
* [Kernel] Add punica dimensions for Granite 3b and 8b by joerunde in https://github.com/vllm-project/vllm/pull/5930
* [Bugfix] Fix precisions in Gemma 1 by WoosukKwon in https://github.com/vllm-project/vllm/pull/5913
* [Misc] Update Phi-3-Vision Example by ywang96 in https://github.com/vllm-project/vllm/pull/5981
* [Bugfix] Support `eos_token_id` from `config.json` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5954
* [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum by Yard1 in https://github.com/vllm-project/vllm/pull/5974
* [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k by comaniac in https://github.com/vllm-project/vllm/pull/5939
* [ CI/Build ] Added E2E Test For Compressed Tensors by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5839
* [CI/Build] Add TP test for vision models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5892
* [ CI/Build ] LM Eval Harness Based CI Testing by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5838
* [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests by mawong-amd in https://github.com/vllm-project/vllm/pull/5949
* [CI/Build] Temporarily Remove Phi3-Vision from TP Test by ywang96 in https://github.com/vllm-project/vllm/pull/5989
* [CI/Build] Reuse code for checking output consistency by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5988
* [CI/Build] [3/3] Reorganize entrypoints tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5966
* [ci][distributed] fix some cuda init that makes it necessary to use spawn by youkaichao in https://github.com/vllm-project/vllm/pull/5991
* [Frontend]: Support base64 embedding by llmpros in https://github.com/vllm-project/vllm/pull/5935
* [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. by rkooo567 in https://github.com/vllm-project/vllm/pull/5909
* [ CI ] Temporarily Disable Large LM-Eval Tests by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6005
* [Misc] Fix `get_min_capability` by dsikka in https://github.com/vllm-project/vllm/pull/5971
* [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5940
* [misc][cuda] use nvml query to avoid accidentally cuda initialization by youkaichao in https://github.com/vllm-project/vllm/pull/6007
* [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker by sroy745 in https://github.com/vllm-project/vllm/pull/5348
* [ CI ] Re-enable Large Model LM Eval by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6031
* [doc][misc] remove deprecated api server in doc by youkaichao in https://github.com/vllm-project/vllm/pull/6037
* [Misc] update benchmark backend for scalellm by zhyncs in https://github.com/vllm-project/vllm/pull/6018
* [doc][misc] further lower visibility of simple api server by youkaichao in https://github.com/vllm-project/vllm/pull/6041
* [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool by Yard1 in https://github.com/vllm-project/vllm/pull/6039
* [Bugfix] adding chunking mechanism to fused_moe to handle large inputs by avshalomman in https://github.com/vllm-project/vllm/pull/6029
* add FAQ doc under 'serving' by llmpros in https://github.com/vllm-project/vllm/pull/5946
* [Bugfix][Doc] Fix Doc Formatting by ywang96 in https://github.com/vllm-project/vllm/pull/6048
* [Bugfix] Add explicit `end_forward` calls to flashinfer by Yard1 in https://github.com/vllm-project/vllm/pull/6044
* [BugFix] Ensure worker model loop is always stopped at the right time by njhill in https://github.com/vllm-project/vllm/pull/5987
* [Frontend] Relax api url assertion for openai benchmarking by jamestwhedbee in https://github.com/vllm-project/vllm/pull/6046
* [Model] Changes to MLPSpeculator to support tie_weights and input_scale by tdoublep in https://github.com/vllm-project/vllm/pull/5965
* [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5602
* [Frontend] Add template related params to request by danieljannai21 in https://github.com/vllm-project/vllm/pull/5709
* [VLM] Remove `image_input_type` from VLM config by xwjiang2010 in https://github.com/vllm-project/vllm/pull/5852
* [Doc] Reinstate doc dependencies by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6061
* [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) by sirejdua in https://github.com/vllm-project/vllm/pull/6050
* [Core] Pipeline Parallel Support by andoorve in https://github.com/vllm-project/vllm/pull/4412
* Update conftest.py by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6076
* [ Misc ] Refactor MoE to isolate Fp8 From Mixtral by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5970
* [CORE] Quantized lm-head Framework by Qubitium in https://github.com/vllm-project/vllm/pull/4442
* [Model] Jamba support by mzusman in https://github.com/vllm-project/vllm/pull/4115
* [hardware][misc] introduce platform abstraction by youkaichao in https://github.com/vllm-project/vllm/pull/6080
* [Core] Dynamic image size support for VLMs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5276
* [CI] Fix base url doesn't strip "/" by rkooo567 in https://github.com/vllm-project/vllm/pull/6087
* [BugFix] Avoid unnecessary Ray import warnings by njhill in https://github.com/vllm-project/vllm/pull/6079
* [misc][distributed] error on invalid state by youkaichao in https://github.com/vllm-project/vllm/pull/6092
* [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API by ywang96 in https://github.com/vllm-project/vllm/pull/6091
* [Doc] Fix Mock Import by ywang96 in https://github.com/vllm-project/vllm/pull/6094
* [Bugfix] Fix `compute_logits` in Jamba by ywang96 in https://github.com/vllm-project/vllm/pull/6093
* [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin by mgoin in https://github.com/vllm-project/vllm/pull/5975
* [core][distributed] allow custom allreduce when pipeline parallel size > 1 by youkaichao in https://github.com/vllm-project/vllm/pull/6117
* [vlm] Remove vision language config. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/6089
* [ Misc ] Clean Up `CompressedTensorsW8A8` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6113
* [doc][misc] bump up py version in installation doc by youkaichao in https://github.com/vllm-project/vllm/pull/6119
* [core][distributed] support layer size undividable by pp size in pipeline parallel inference by youkaichao in https://github.com/vllm-project/vllm/pull/6115
* [Bugfix] set OMP_NUM_THREADS to 1 by default when using the multiproc_gpu_executor by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6109
* [Distributed][Core] Support Py39 and Py38 for PP by andoorve in https://github.com/vllm-project/vllm/pull/6120
* [CI/Build] Cleanup VLM tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6107
* [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention by gshtras in https://github.com/vllm-project/vllm/pull/6043
* [misc][doc] try to add warning for latest html by youkaichao in https://github.com/vllm-project/vllm/pull/5979
* [Hardware][Intel CPU] Adding intel openmp tunings in Docker file by zhouyuan in https://github.com/vllm-project/vllm/pull/6008
* [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6051
* [VLM] Calculate maximum number of multi-modal tokens by model by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6121
* [VLM] Improve consistency between feature size calculation and dummy data for profiling by ywang96 in https://github.com/vllm-project/vllm/pull/6146
* [VLM] Cleanup validation and update docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/6149
* [Bugfix] Use templated datasource in grafana.json to allow automatic imports by frittentheke in https://github.com/vllm-project/vllm/pull/6136
* [Frontend] Continuous usage stats in OpenAI completion API by jvlunteren in https://github.com/vllm-project/vllm/pull/5742
* [Bugfix] Add verbose error if scipy is missing for blocksparse attention by JGSweets in https://github.com/vllm-project/vllm/pull/5695
* bump version to v0.5.1 by simon-mo in https://github.com/vllm-project/vllm/pull/6157
* [Docs] Fix readthedocs for tag build by simon-mo in https://github.com/vllm-project/vllm/pull/6158

New Contributors
* kimdwkimdw made their first contribution in https://github.com/vllm-project/vllm/pull/5444
* sywangyi made their first contribution in https://github.com/vllm-project/vllm/pull/5303
* garg-amit made their first contribution in https://github.com/vllm-project/vllm/pull/5595
* CharlesRiggins made their first contribution in https://github.com/vllm-project/vllm/pull/5592
* zhyncs made their first contribution in https://github.com/vllm-project/vllm/pull/5588
* bfontain made their first contribution in https://github.com/vllm-project/vllm/pull/5142
* sroy745 made their first contribution in https://github.com/vllm-project/vllm/pull/5131
* joerunde made their first contribution in https://github.com/vllm-project/vllm/pull/5559
* sergey-tinkoff made their first contribution in https://github.com/vllm-project/vllm/pull/5178
* milo157 made their first contribution in https://github.com/vllm-project/vllm/pull/5553
* ShukantPal made their first contribution in https://github.com/vllm-project/vllm/pull/5628
* rafvasq made their first contribution in https://github.com/vllm-project/vllm/pull/5614
* JRosenkranz made their first contribution in https://github.com/vllm-project/vllm/pull/4947
* rohithkrn made their first contribution in https://github.com/vllm-project/vllm/pull/5603
* wooyeonlee0 made their first contribution in https://github.com/vllm-project/vllm/pull/5414
* aws-patlange made their first contribution in https://github.com/vllm-project/vllm/pull/5841
* stephanie-wang made their first contribution in https://github.com/vllm-project/vllm/pull/5408
* ProExpertProg made their first contribution in https://github.com/vllm-project/vllm/pull/5560
* ChipKerchner made their first contribution in https://github.com/vllm-project/vllm/pull/5652
* ilya-lavrenov made their first contribution in https://github.com/vllm-project/vllm/pull/5379
* mcalman made their first contribution in https://github.com/vllm-project/vllm/pull/5926
* SolitaryThinker made their first contribution in https://github.com/vllm-project/vllm/pull/5925
* llmpros made their first contribution in https://github.com/vllm-project/vllm/pull/5935
* avshalomman made their first contribution in https://github.com/vllm-project/vllm/pull/6029
* danieljannai21 made their first contribution in https://github.com/vllm-project/vllm/pull/5709
* sirejdua made their first contribution in https://github.com/vllm-project/vllm/pull/6050
* gshtras made their first contribution in https://github.com/vllm-project/vllm/pull/6043
* frittentheke made their first contribution in https://github.com/vllm-project/vllm/pull/6136
* jvlunteren made their first contribution in https://github.com/vllm-project/vllm/pull/5742
* JGSweets made their first contribution in https://github.com/vllm-project/vllm/pull/5695

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.0...v0.5.1

0.5.0.post1

Not secure
Highlights

* Add initial TPU integration (5292)
* Fix crashes when using FlashAttention backend (5478)
* Fix issues when using num_devices < num_available_devices (5473)

What's Changed
* [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by mgoin in https://github.com/vllm-project/vllm/pull/5253
* Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by simon-mo in https://github.com/vllm-project/vllm/pull/5463
* [CI] Upgrade codespell version. by rkooo567 in https://github.com/vllm-project/vllm/pull/5381
* [Hardware] Initial TPU integration by WoosukKwon in https://github.com/vllm-project/vllm/pull/5292
* [Bugfix] Add device assertion to TorchSDPA by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/5402
* [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by khluu in https://github.com/vllm-project/vllm/pull/5464
* [Kernel] Vectorized FP8 quantize kernel by comaniac in https://github.com/vllm-project/vllm/pull/5396
* [Bugfix] TYPE_CHECKING for MultiModalData by kimdwkimdw in https://github.com/vllm-project/vllm/pull/5444
* [Frontend] [Core] Support for sharded tensorized models by tjohnson31415 in https://github.com/vllm-project/vllm/pull/4990
* [misc] add hint for AttributeError by youkaichao in https://github.com/vllm-project/vllm/pull/5462
* [Doc] Update debug docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5438
* [Bugfix] Fix typo in scheduler.py (requeset -> request) by mgoin in https://github.com/vllm-project/vllm/pull/5470
* [Frontend] Add "input speed" to tqdm postfix alongside output speed by mgoin in https://github.com/vllm-project/vllm/pull/5425
* [Bugfix] Fix wrong multi_modal_input format for CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/5451
* [Core][Distributed] add coordinator to reduce code duplication in tp and pp by youkaichao in https://github.com/vllm-project/vllm/pull/5293
* [ci] Use sccache to build images by khluu in https://github.com/vllm-project/vllm/pull/5419
* [Bugfix]if the content is started with ":"(response of ping), client should i… by sywangyi in https://github.com/vllm-project/vllm/pull/5303
* [Kernel] `w4a16` support for `compressed-tensors` by dsikka in https://github.com/vllm-project/vllm/pull/5385
* [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by mgoin in https://github.com/vllm-project/vllm/pull/5466
* [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by wenyujin333 in https://github.com/vllm-project/vllm/pull/5497
* [Hardware][Intel] Optimize CPU backend and add more performance tips by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4971
* [Docs] Add 4th meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/5509
* [Misc] Add vLLM version getter to utils by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5098
* [CI/Build] Simplify OpenAI server setup in tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5100
* [Doc] Update LLaVA docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5437
* [Kernel] Factor out epilogues from cutlass kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5391
* [MISC] Remove FP8 warning by comaniac in https://github.com/vllm-project/vllm/pull/5472
* Seperate dev requirements into lint and test by Yard1 in https://github.com/vllm-project/vllm/pull/5474
* Revert "[Core] Remove unnecessary copies in flash attn backend" by Yard1 in https://github.com/vllm-project/vllm/pull/5478
* [misc] fix format.sh by youkaichao in https://github.com/vllm-project/vllm/pull/5511
* [CI/Build] Disable test_fp8.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5508
* [Kernel] Disable CUTLASS kernels for fp8 by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5505
* Add `cuda_device_count_stateless` by Yard1 in https://github.com/vllm-project/vllm/pull/5473
* [Hardware][Intel] Support CPU inference with AVX2 ISA by DamonFool in https://github.com/vllm-project/vllm/pull/5452
* [Bugfix]typofix by AllenDou in https://github.com/vllm-project/vllm/pull/5507
* bump version to v0.5.0.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/5522

New Contributors
* kimdwkimdw made their first contribution in https://github.com/vllm-project/vllm/pull/5444
* sywangyi made their first contribution in https://github.com/vllm-project/vllm/pull/5303

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.5.0...v0.5.0.post1

0.5.0

Not secure
Highlights

Production Features
* [FP8 support](https://docs.vllm.ai/en/stable/quantization/fp8.html) is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference speed gets a 1.5x boost. Please try it out and let us know your thoughts! A short offline-inference example follows this list. (#5352, 5388, 5159, 5238, 5294, 5183, 5144, 5231)
* Add OpenAI [Vision API](https://docs.vllm.ai/en/stable/models/vlm.html) support. Currently only LLaVA and LLaVA-NeXT are supported. We are working on adding more models in the next release. (#5237, 5383, 4199, 5374, 4197)
* [Speculative Decoding](https://docs.vllm.ai/en/stable/models/spec_decode.html) and [Automatic Prefix Caching](https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html) are also ready for testing; we plan to turn them on by default in upcoming releases. (#5400, 5157, 5137, 5324)
* Default to multiprocessing backend for single-node distributed case (5230)
* Support bitsandbytes quantization and QLoRA (4776)
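
For anyone who wants to kick the tires on the FP8 path, here is a minimal offline-inference sketch. It assumes an FP8-capable GPU and uses a Llama 3 8B checkpoint purely as a placeholder model name; `quantization="fp8"` is the knob described in the FP8 docs linked above.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any dense HF checkpoint that fits on the GPU should work.
# quantization="fp8" quantizes (a portion of) the weights to 8-bit floating point.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
for output in outputs:
    print(output.outputs[0].text)
```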

Hardware Support
* Improvements to the Intel CPU CI (4113, 5241)
* Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (5047)

Others
* [Debugging tips](https://docs.vllm.ai/en/stable/getting_started/debugging.html) documentation (#5409, 5430)
* Dynamic Per-Token Activation Quantization (5037)
* Customizable RoPE theta (5197)
* Enable passing multiple LoRA adapters at once to generate() (5300)
* Support for named functions via OpenAI `tools` (5032)
* Support `stream_options` for the OpenAI protocol (5319, 5135); see the streaming sketch after this list
* Update Outlines Integration from `FSM` to `Guide` (4109)
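
As a taste of the new `stream_options` support, the sketch below hits a vLLM OpenAI-compatible server with the official `openai` client. It assumes a server is already running on localhost:8000 and that the model name matches whatever the server was launched with; both are placeholders, and a reasonably recent `openai` package is needed for `stream_options`.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # must match the served model
    messages=[{"role": "user", "content": "Say hello in five words."}],
    stream=True,
    stream_options={"include_usage": True},        # request a final usage chunk
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage is not None:                    # only set on the final chunk
        print("\nusage:", chunk.usage)
```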

What's Changed
* [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by dtrifiro in https://github.com/vllm-project/vllm/pull/5034
* [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5137
* [Kernel] Update Cutlass fp8 configs by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5144
* [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by dashanji in https://github.com/vllm-project/vllm/pull/5151
* [Bugfix] Fix call to init_logger in openai server by NadavShmayo in https://github.com/vllm-project/vllm/pull/4765
* [Feature][Kernel] Support bitsandbytes quantization and QLoRA by chenqianfzh in https://github.com/vllm-project/vllm/pull/4776
* [Bugfix] Remove deprecated abstractproperty by zhuohan123 in https://github.com/vllm-project/vllm/pull/5174
* [Bugfix]: Fix issues related to prefix caching example (5177) by Delviet in https://github.com/vllm-project/vllm/pull/5180
* [BugFix] Prevent `LLM.encode` for non-generation Models by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5184
* Update test_ignore_eos by simon-mo in https://github.com/vllm-project/vllm/pull/4898
* [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by Avinash-Raj in https://github.com/vllm-project/vllm/pull/4643
* [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by divakar-amd in https://github.com/vllm-project/vllm/pull/4927
* [Misc] Simplify code and fix type annotations in `conftest.py` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5118
* [Core] Support image processor by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4197
* [Core] Remove unnecessary copies in flash attn backend by Yard1 in https://github.com/vllm-project/vllm/pull/5138
* [Kernel] Pass a device pointer into the quantize kernel for the scales by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5159
* [CI/BUILD] enable intel queue for longer CPU tests by zhouyuan in https://github.com/vllm-project/vllm/pull/4113
* [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by Kaiyang-Chen in https://github.com/vllm-project/vllm/pull/3834
* New CI template on AWS stack by khluu in https://github.com/vllm-project/vllm/pull/5110
* [FRONTEND] OpenAI `tools` support named functions by br3no in https://github.com/vllm-project/vllm/pull/5032
* [Bugfix] Support `prompt_logprobs==0` by toslunar in https://github.com/vllm-project/vllm/pull/5217
* [Bugfix] Add warmup for prefix caching example by zhuohan123 in https://github.com/vllm-project/vllm/pull/5235
* [Kernel] Enhance MoE benchmarking & tuning script by WoosukKwon in https://github.com/vllm-project/vllm/pull/4921
* [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by afeldman-nm in https://github.com/vllm-project/vllm/pull/5210
* [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by zifeitong in https://github.com/vllm-project/vllm/pull/5229
* [CI/Build] Add inputs tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5215
* [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by DamonFool in https://github.com/vllm-project/vllm/pull/5249
* [Kernel] Add back batch size 1536 and 3072 to MoE tuning by WoosukKwon in https://github.com/vllm-project/vllm/pull/5242
* [CI/Build] Simplify model loading for `HfRunner` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5251
* [CI/Build] Reducing CPU CI execution time by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/5241
* [CI] mark AMD test as softfail to prevent blockage by simon-mo in https://github.com/vllm-project/vllm/pull/5256
* [Misc] Add transformers version to collect_env.py by mgoin in https://github.com/vllm-project/vllm/pull/5259
* [Misc] update collect env by youkaichao in https://github.com/vllm-project/vllm/pull/5261
* [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by zifeitong in https://github.com/vllm-project/vllm/pull/5226
* [Misc] Add CustomOp interface for device portability by WoosukKwon in https://github.com/vllm-project/vllm/pull/5255
* [Misc] Fix docstring of get_attn_backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/5271
* [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) by tomeras91 in https://github.com/vllm-project/vllm/pull/5278
* [CI] Add nightly benchmarks by simon-mo in https://github.com/vllm-project/vllm/pull/5260
* [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5263
* [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5157
* [Model] Correct Mixtral FP8 checkpoint loading by comaniac in https://github.com/vllm-project/vllm/pull/5231
* [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by DriverSong in https://github.com/vllm-project/vllm/pull/5207
* [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by pcmoritz in https://github.com/vllm-project/vllm/pull/5238
* [Docs] Add Sequoia as sponsors by simon-mo in https://github.com/vllm-project/vllm/pull/5287
* [Speculative Decoding] Add `ProposerWorkerBase` abstract class by njhill in https://github.com/vllm-project/vllm/pull/5252
* [BugFix] Fix log message about default max model length by njhill in https://github.com/vllm-project/vllm/pull/5284
* [Bugfix] Make EngineArgs use named arguments for config construction by mgoin in https://github.com/vllm-project/vllm/pull/5285
* [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by wuisawesome in https://github.com/vllm-project/vllm/pull/5290
* [Misc] Skip for logits_scale == 1.0 by WoosukKwon in https://github.com/vllm-project/vllm/pull/5291
* [Docs] Add Ray Summit CFP by simon-mo in https://github.com/vllm-project/vllm/pull/5295
* [CI] Disable flash_attn backend for spec decode by simon-mo in https://github.com/vllm-project/vllm/pull/5286
* [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` by br3no in https://github.com/vllm-project/vllm/pull/4109
* [CI/Build] Update vision tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5307
* Bugfix: fix broken of download models from modelscope by liuyhwangyh in https://github.com/vllm-project/vllm/pull/5233
* [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by pcmoritz in https://github.com/vllm-project/vllm/pull/5294
* [Frontend] enable passing multiple LoRA adapters at once to generate() by mgoldey in https://github.com/vllm-project/vllm/pull/5300
* [Core] Avoid copying prompt/output tokens if no penalties are used by Yard1 in https://github.com/vllm-project/vllm/pull/5289
* [Core] Change LoRA embedding sharding to support loading methods by Yard1 in https://github.com/vllm-project/vllm/pull/5038
* [Misc] Missing error message for custom ops import by DamonFool in https://github.com/vllm-project/vllm/pull/5282
* [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` by Etelis in https://github.com/vllm-project/vllm/pull/5135
* [Misc][Utils] allow get_open_port to be called for multiple times by youkaichao in https://github.com/vllm-project/vllm/pull/5333
* [Kernel] Switch fp8 layers to use the CUTLASS kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5183
* Remove Ray health check by Yard1 in https://github.com/vllm-project/vllm/pull/4693
* Addition of lacked ignored_seq_groups in _schedule_chunked_prefill by JamesLim-sy in https://github.com/vllm-project/vllm/pull/5296
* [Kernel] Dynamic Per-Token Activation Quantization by dsikka in https://github.com/vllm-project/vllm/pull/5037
* [Frontend] Add OpenAI Vision API Support by ywang96 in https://github.com/vllm-project/vllm/pull/5237
* [Misc] Remove unused cuda_utils.h in CPU backend by DamonFool in https://github.com/vllm-project/vllm/pull/5345
* fix DbrxFusedNormAttention missing cache_config by Calvinnncy97 in https://github.com/vllm-project/vllm/pull/5340
* [Bug Fix] Fix the support check for FP8 CUTLASS by cli99 in https://github.com/vllm-project/vllm/pull/5352
* [Misc] Add args for selecting distributed executor to benchmarks by BKitor in https://github.com/vllm-project/vllm/pull/5335
* [ROCm][AMD] Use pytorch sdpa math backend to do naive attention by hongxiayang in https://github.com/vllm-project/vllm/pull/4965
* [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) by youkaichao in https://github.com/vllm-project/vllm/pull/5347
* [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) by youkaichao in https://github.com/vllm-project/vllm/pull/5357
* [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale by mgoin in https://github.com/vllm-project/vllm/pull/5353
* [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint by youkaichao in https://github.com/vllm-project/vllm/pull/5074
* [mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py by youkaichao in https://github.com/vllm-project/vllm/pull/5361
* [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by bnellnm in https://github.com/vllm-project/vllm/pull/5047
* [Bugfix] Fix KeyError: 1 When Using LoRA adapters by BlackBird-Coding in https://github.com/vllm-project/vllm/pull/5164
* [Misc] Update to comply with the new `compressed-tensors` config by dsikka in https://github.com/vllm-project/vllm/pull/5350
* [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server by ywang96 in https://github.com/vllm-project/vllm/pull/5374
* [misc][typo] fix typo by youkaichao in https://github.com/vllm-project/vllm/pull/5372
* [Misc] Improve error message when LoRA parsing fails by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5194
* [Model] Initial support for LLaVA-NeXT by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4199
* [Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest by Etelis in https://github.com/vllm-project/vllm/pull/5319
* [Bugfix] Fix LLaVA-NeXT by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5380
* [ci] Use small_cpu_queue for doc build by khluu in https://github.com/vllm-project/vllm/pull/5331
* [ci] Mount buildkite agent on Docker container to upload benchmark results by khluu in https://github.com/vllm-project/vllm/pull/5330
* [Docs] Add Docs on Limitations of VLM Support by ywang96 in https://github.com/vllm-project/vllm/pull/5383
* [Docs] Alphabetically sort sponsors by WoosukKwon in https://github.com/vllm-project/vllm/pull/5386
* Bump version to v0.5.0 by simon-mo in https://github.com/vllm-project/vllm/pull/5384
* [Doc] Add documentation for FP8 W8A8 by mgoin in https://github.com/vllm-project/vllm/pull/5388
* [ci] Fix Buildkite agent path by khluu in https://github.com/vllm-project/vllm/pull/5392
* [Misc] Various simplifications and typing fixes by njhill in https://github.com/vllm-project/vllm/pull/5368
* [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs by maor-ps in https://github.com/vllm-project/vllm/pull/5312
* [Bugfix][Frontend] Cleanup "fix chat logprobs" by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5026
* [Doc] add debugging tips by youkaichao in https://github.com/vllm-project/vllm/pull/5409
* [Doc][Typo] Fixing Missing Comma by ywang96 in https://github.com/vllm-project/vllm/pull/5403
* [Misc] Remove VLLM_BUILD_WITH_NEURON env variable by WoosukKwon in https://github.com/vllm-project/vllm/pull/5389
* [CI] docfix by rkooo567 in https://github.com/vllm-project/vllm/pull/5410
* [Speculative decoding] Initial spec decode docs by cadedaniel in https://github.com/vllm-project/vllm/pull/5400
* [Doc] Add an automatic prefix caching section in vllm documentation by KuntaiDu in https://github.com/vllm-project/vllm/pull/5324
* [Docs] [Spec decode] Fix docs error in code example by cadedaniel in https://github.com/vllm-project/vllm/pull/5427
* [Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 by jsato8094 in https://github.com/vllm-project/vllm/pull/5254
* [Bugfix] fix lora_dtype value type in arg_utils.py by c3-ali in https://github.com/vllm-project/vllm/pull/5398
* [Frontend] Customizable RoPE theta by sasha0552 in https://github.com/vllm-project/vllm/pull/5197
* [Core][Distributed] add same-node detection by youkaichao in https://github.com/vllm-project/vllm/pull/5369
* [Core][Doc] Default to multiprocessing for single-node distributed case by njhill in https://github.com/vllm-project/vllm/pull/5230
* [Doc] add common case for long waiting time by youkaichao in https://github.com/vllm-project/vllm/pull/5430

New Contributors
* dtrifiro made their first contribution in https://github.com/vllm-project/vllm/pull/5034
* varun-sundar-rabindranath made their first contribution in https://github.com/vllm-project/vllm/pull/5144
* dashanji made their first contribution in https://github.com/vllm-project/vllm/pull/5151
* chenqianfzh made their first contribution in https://github.com/vllm-project/vllm/pull/4776
* Delviet made their first contribution in https://github.com/vllm-project/vllm/pull/5180
* Avinash-Raj made their first contribution in https://github.com/vllm-project/vllm/pull/4643
* zhouyuan made their first contribution in https://github.com/vllm-project/vllm/pull/4113
* Kaiyang-Chen made their first contribution in https://github.com/vllm-project/vllm/pull/3834
* khluu made their first contribution in https://github.com/vllm-project/vllm/pull/5110
* toslunar made their first contribution in https://github.com/vllm-project/vllm/pull/5217
* DamonFool made their first contribution in https://github.com/vllm-project/vllm/pull/5249
* tomeras91 made their first contribution in https://github.com/vllm-project/vllm/pull/5278
* DriverSong made their first contribution in https://github.com/vllm-project/vllm/pull/5207
* mgoldey made their first contribution in https://github.com/vllm-project/vllm/pull/5300
* JamesLim-sy made their first contribution in https://github.com/vllm-project/vllm/pull/5296
* Calvinnncy97 made their first contribution in https://github.com/vllm-project/vllm/pull/5340
* cli99 made their first contribution in https://github.com/vllm-project/vllm/pull/5352
* BKitor made their first contribution in https://github.com/vllm-project/vllm/pull/5335
* BlackBird-Coding made their first contribution in https://github.com/vllm-project/vllm/pull/5164
* maor-ps made their first contribution in https://github.com/vllm-project/vllm/pull/5312
* c3-ali made their first contribution in https://github.com/vllm-project/vllm/pull/5398

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.4.3...v0.5.0

0.4.3

Not secure
Highlights
Model Support
LLM
* Added support for Falcon-11B (5069)
* Added support for IBM Granite Code models (4636)
* Added blocksparse flash attention kernel and Phi-3-Small model (4799)
* Added Snowflake arctic model implementation (4652, 4889, 4690)
* Support for dynamic RoPE scaling (4638)
* Support for long-context LoRA (4787)

Embedding Models
* Initial support for the Embedding API with e5-mistral-7b-instruct (3734); see the sketch after this list
* Cross-attention KV caching and memory-management towards encoder-decoder model support (4837)
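
A minimal sketch of the new Embedding API in offline mode is shown below. It assumes the e5-mistral-7b-instruct weights fit on the available GPU; the prompts follow the E5 `query:`/`passage:` convention but are otherwise placeholders, and the attribute names follow the Embedding API as added in PR 3734.

```python
from vllm import LLM

# Embedding model from the highlight above.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)

outputs = llm.encode([
    "query: what is vLLM?",
    "passage: vLLM is a high-throughput LLM inference engine.",
])
for out in outputs:
    print(len(out.outputs.embedding))  # embedding dimensionality (4096 for this model)
```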

Vision Language Model
* Add base class for vision-language models (4809)
* Consolidate prompt arguments to LLM engines (4328)
* LLaVA model refactor (4910)

Hardware Support
AMD
* Add fused_moe Triton configs (4951)
* Add support for Punica kernels (3140)
* Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (4797)

Production Engine
Batch API
* Support OpenAI batch file format (4794); see the sketch below
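
The sketch below prepares an input file in the OpenAI batch format that the new batch runner consumes. The `custom_id` values and model name are placeholders, and the runner invocation in the trailing comment is taken from PR 4794 rather than from pinned docs, so treat it as an assumption.

```python
import json

# Two toy chat-completion requests in OpenAI batch-file (JSONL) form.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
    }
    for i, prompt in enumerate(["Hello!", "What is vLLM?"])
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Then (assumed invocation) run something like:
#   python -m vllm.entrypoints.openai.run_batch \
#       -i batch_input.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
```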

Making Ray Optional
* Add `MultiprocessingGPUExecutor` (4539); see the sketch after this list
* Eliminate parallel worker per-step task scheduling overhead (4894)
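
The sketch below opts a single-node tensor-parallel run into the new Ray-free multiprocessing executor. The model name and TP size are placeholders, and `distributed_executor_backend="mp"` is assumed to be the engine argument wired up by these PRs.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,
    distributed_executor_backend="mp",            # multiprocessing workers instead of Ray
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0]
print(out.outputs[0].text)
```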

Automatic Prefix Caching
* Accelerating the hashing function by avoiding deep copies (4696); see the usage sketch below
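
Enabling automatic prefix caching (whose hashing path this release speeds up) looks roughly like the sketch below; the long repeated system prompt is only there to make the shared prefix obvious, and the model name is a placeholder.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)

system = "You are a terse assistant. Answer in one sentence. " * 20  # shared prefix
params = SamplingParams(max_tokens=32)
for question in ["What is a KV cache?", "What is paged attention?"]:
    out = llm.generate([system + question], params)[0]
    print(out.outputs[0].text)  # the second request reuses cached prefix blocks
```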

Speculative Decoding
* CUDA graph support (4295)
* Enable TP>1 speculative decoding (4840)
* Improve n-gram efficiency (4724); see the sketch below
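
A minimal sketch of n-gram (prompt-lookup) speculative decoding, whose efficiency this release improves, is below. Parameter values are placeholders, and the argument names follow the speculative-decoding examples from the vLLM docs of this era; at the time, spec decode also required the v2 block manager.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder target model
    speculative_model="[ngram]",                  # draft tokens via prompt n-gram lookup
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,
)

prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
out = llm.generate([prompt], SamplingParams(max_tokens=32))[0]
print(out.outputs[0].text)
```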

Performance Optimization

Quantization
* Add GPTQ Marlin 2:4 sparse structured support (4790)
* Initial Activation Quantization Support (4525)
* Marlin 2:4 prefill performance improvement (about 25% better on average) (4983)
* Automatically Detect SparseML models (5119)

Better Attention Kernel
* Use flash-attn for decoding (3648)

FP8
* Improve FP8 linear layer performance (4691)
* Add w8a8 CUTLASS kernels (4749)
* Support for CUTLASS kernels in CUDA graphs (4954)
* Load FP8 kv-cache scaling factors from checkpoints (4893)
* Make static FP8 scaling more robust (4570)
* Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (4535); see the sketch after this list
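
Running with an FP8 KV cache after the float8_e4m3 refactor looks roughly like the sketch below. The model name is a placeholder, and calibrated per-layer scaling factors (PR 4893) are optional; without them a default scale is used.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",                         # store the KV cache in 8-bit float
)

out = llm.generate(["Summarize FP8 KV caching in one line."],
                   SamplingParams(max_tokens=32))[0]
print(out.outputs[0].text)
```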

Optimize Distributed Communication
* change python dict to pytorch tensor (4607)
* change python dict to pytorch tensor for blocks to swap (4659)
* improve p2p access check (4992)
* remove vllm-nccl (5091)
* support both cpu and device tensor in broadcast tensor dict (4660)

Extensible Architecture

Pipeline Parallelism
* refactor custom allreduce to support multiple tp groups (4754)
* refactor pynccl to hold multiple communicators (4591)
* Support PP PyNCCL Groups (4988)




What's Changed
* Disable cuda version check in vllm-openai image by zhaoyang-star in https://github.com/vllm-project/vllm/pull/4530
* [Bugfix] Fix `asyncio.Task` not being subscriptable by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4623
* [CI] use ccache actions properly in release workflow by simon-mo in https://github.com/vllm-project/vllm/pull/4629
* [CI] Add retry for agent lost by cadedaniel in https://github.com/vllm-project/vllm/pull/4633
* Update lm-format-enforcer to 0.10.1 by noamgat in https://github.com/vllm-project/vllm/pull/4631
* [Kernel] Make static FP8 scaling more robust by pcmoritz in https://github.com/vllm-project/vllm/pull/4570
* [Core][Optimization] change python dict to pytorch tensor by youkaichao in https://github.com/vllm-project/vllm/pull/4607
* [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4642
* [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora by FurtherAI in https://github.com/vllm-project/vllm/pull/4609
* [Core][Optimization] change copy-on-write from dict[int, list] to list by youkaichao in https://github.com/vllm-project/vllm/pull/4648
* [Bug fix][Core] fixup ngram not setup correctly by leiwen83 in https://github.com/vllm-project/vllm/pull/4551
* [Core][Distributed] support both cpu and device tensor in broadcast tensor dict by youkaichao in https://github.com/vllm-project/vllm/pull/4660
* [Core] Optimize sampler get_logprobs by rkooo567 in https://github.com/vllm-project/vllm/pull/4594
* [CI] Make mistral tests pass by rkooo567 in https://github.com/vllm-project/vllm/pull/4596
* [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi by DefTruth in https://github.com/vllm-project/vllm/pull/4573
* [Misc] Add `get_name` method to attention backends by WoosukKwon in https://github.com/vllm-project/vllm/pull/4685
* [Core] Faster startup for LoRA enabled models by Yard1 in https://github.com/vllm-project/vllm/pull/4634
* [Core][Optimization] change python dict to pytorch tensor for blocks to swap by youkaichao in https://github.com/vllm-project/vllm/pull/4659
* [CI/Test] fix swap test for multi gpu by youkaichao in https://github.com/vllm-project/vllm/pull/4689
* [Misc] Use vllm-flash-attn instead of flash-attn by WoosukKwon in https://github.com/vllm-project/vllm/pull/4686
* [Dynamic Spec Decoding] Auto-disable by the running queue size by comaniac in https://github.com/vllm-project/vllm/pull/4592
* [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs by cadedaniel in https://github.com/vllm-project/vllm/pull/4672
* [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/4626
* [Frontend] add tok/s speed metric to llm class when using tqdm by MahmoudAshraf97 in https://github.com/vllm-project/vllm/pull/4400
* [Frontend] Move async logic outside of constructor by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4674
* [Misc] Remove unnecessary ModelRunner imports by WoosukKwon in https://github.com/vllm-project/vllm/pull/4703
* [Misc] Set block size at initialization & Fix test_model_runner by WoosukKwon in https://github.com/vllm-project/vllm/pull/4705
* [ROCm] Add support for Punica kernels on AMD GPUs by kliuae in https://github.com/vllm-project/vllm/pull/3140
* [Bugfix] Fix CLI arguments in OpenAI server docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4709
* [Bugfix] Update grafana.json by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4711
* [Bugfix] Add logs for all model dtype casting by mgoin in https://github.com/vllm-project/vllm/pull/4717
* [Model] Snowflake arctic model implementation by sfc-gh-hazhang in https://github.com/vllm-project/vllm/pull/4652
* [Kernel] [FP8] Improve FP8 linear layer performance by pcmoritz in https://github.com/vllm-project/vllm/pull/4691
* [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support by comaniac in https://github.com/vllm-project/vllm/pull/4535
* [Core][Distributed] refactor pynccl to hold multiple communicators by youkaichao in https://github.com/vllm-project/vllm/pull/4591
* [Misc] Keep only one implementation of the create_dummy_prompt function. by AllenDou in https://github.com/vllm-project/vllm/pull/4716
* chunked-prefill-doc-syntax by simon-mo in https://github.com/vllm-project/vllm/pull/4603
* [Core]fix type annotation for `swap_blocks` by jikunshang in https://github.com/vllm-project/vllm/pull/4726
* [Misc] Apply a couple g++ cleanups by stevegrubb in https://github.com/vllm-project/vllm/pull/4719
* [Core] Fix circular reference which leaked llm instance in local dev env by rkooo567 in https://github.com/vllm-project/vllm/pull/4737
* [Bugfix] Fix CLI arguments in OpenAI server docs by AllenDou in https://github.com/vllm-project/vllm/pull/4729
* [Speculative decoding] CUDA graph support by heeju-kim2 in https://github.com/vllm-project/vllm/pull/4295
* [CI] Nits for bad initialization of SeqGroup in testing by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4748
* [Core][Test] fix function name typo in custom allreduce by youkaichao in https://github.com/vllm-project/vllm/pull/4750
* [Model][Misc] Add e5-mistral-7b-instruct and Embedding API by CatherineSue in https://github.com/vllm-project/vllm/pull/3734
* [Model] Add support for IBM Granite Code models by yikangshen in https://github.com/vllm-project/vllm/pull/4636
* [CI/Build] Tweak Marlin Nondeterminism Issues In CI by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4713
* [CORE] Improvement in ranks code by SwapnilDreams100 in https://github.com/vllm-project/vllm/pull/4718
* [Core][Distributed] refactor custom allreduce to support multiple tp groups by youkaichao in https://github.com/vllm-project/vllm/pull/4754
* [CI/Build] Move `test_utils.py` to `tests/utils.py` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4425
* [Scheduler] Warning upon preemption and Swapping by rkooo567 in https://github.com/vllm-project/vllm/pull/4647
* [Misc] Enhance attention selector by WoosukKwon in https://github.com/vllm-project/vllm/pull/4751
* [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 by sangstar in https://github.com/vllm-project/vllm/pull/4208
* [Speculative decoding] Improve n-gram efficiency by comaniac in https://github.com/vllm-project/vllm/pull/4724
* [Kernel] Use flash-attn for decoding by skrider in https://github.com/vllm-project/vllm/pull/3648
* [Bugfix] Fix dynamic FP8 quantization for Mixtral by pcmoritz in https://github.com/vllm-project/vllm/pull/4793
* [Doc] Shorten README by removing supported model list by zhuohan123 in https://github.com/vllm-project/vllm/pull/4796
* [Doc] Add API reference for offline inference by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4710
* [Doc] Add meetups to the doc by zhuohan123 in https://github.com/vllm-project/vllm/pull/4798
* [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies by KuntaiDu in https://github.com/vllm-project/vllm/pull/4696
* [Bugfix][Doc] Fix CI failure in docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4804
* [Core] Add MultiprocessingGPUExecutor by njhill in https://github.com/vllm-project/vllm/pull/4539
* Add 4th meetup announcement to readme by simon-mo in https://github.com/vllm-project/vllm/pull/4817
* Revert "[Kernel] Use flash-attn for decoding (3648)" by rkooo567 in https://github.com/vllm-project/vllm/pull/4820
* [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API by rkooo567 in https://github.com/vllm-project/vllm/pull/4681
* [CI/Build] Further decouple HuggingFace implementation from ours during tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4166
* [Bugfix] Properly set distributed_executor_backend in ParallelConfig by zifeitong in https://github.com/vllm-project/vllm/pull/4816
* [Doc] Highlight the fourth meetup in the README by zhuohan123 in https://github.com/vllm-project/vllm/pull/4842
* [Frontend] Re-enable custom roles in Chat Completions API by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4758
* [Frontend] Support OpenAI batch file format by wuisawesome in https://github.com/vllm-project/vllm/pull/4794
* [Core] Implement sharded state loader by aurickq in https://github.com/vllm-project/vllm/pull/4690
* [Speculative decoding][Re-take] Enable TP>1 speculative decoding by comaniac in https://github.com/vllm-project/vllm/pull/4840
* Add marlin unit tests and marlin benchmark script by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/4815
* [Kernel] add bfloat16 support for gptq marlin kernel by jinzhen-lin in https://github.com/vllm-project/vllm/pull/4788
* [docs] Fix typo in examples filename openi -> openai by wuisawesome in https://github.com/vllm-project/vllm/pull/4864
* [Frontend] Separate OpenAI Batch Runner usage from API Server by wuisawesome in https://github.com/vllm-project/vllm/pull/4851
* [Bugfix] Bypass authorization API token for preflight requests by dulacp in https://github.com/vllm-project/vllm/pull/4862
* Add GPTQ Marlin 2:4 sparse structured support by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/4790
* Add JSON output support for benchmark_latency and benchmark_throughput by simon-mo in https://github.com/vllm-project/vllm/pull/4848
* [ROCm][AMD][Bugfix] adding a missing triton autotune config by hongxiayang in https://github.com/vllm-project/vllm/pull/4845
* [Core][Distributed] remove graph mode function by youkaichao in https://github.com/vllm-project/vllm/pull/4818
* [Misc] remove old comments by youkaichao in https://github.com/vllm-project/vllm/pull/4866
* [Kernel] Add punica dimension for Qwen1.5-32B LoRA by Silencioo in https://github.com/vllm-project/vllm/pull/4850
* [Kernel] Add w8a8 CUTLASS kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/4749
* [Bugfix] Fix FP8 KV cache support by WoosukKwon in https://github.com/vllm-project/vllm/pull/4869
* Support to serve vLLM on Kubernetes with LWS by kerthcet in https://github.com/vllm-project/vllm/pull/4829
* [Frontend] OpenAI API server: Do not add bos token by default when encoding by bofenghuang in https://github.com/vllm-project/vllm/pull/4688
* [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4797
* [Bugfix] fix rope error when load models with different dtypes by jinzhen-lin in https://github.com/vllm-project/vllm/pull/4835
* Sync huggingface modifications of qwen Moe model by eigen2017 in https://github.com/vllm-project/vllm/pull/4774
* [Doc] Update Ray Data distributed offline inference example by Yard1 in https://github.com/vllm-project/vllm/pull/4871
* [Bugfix] Relax tiktoken to >= 0.6.0 by mgoin in https://github.com/vllm-project/vllm/pull/4890
* [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used by alexeykondrat in https://github.com/vllm-project/vllm/pull/4658
* [Lora] Support long context lora by rkooo567 in https://github.com/vllm-project/vllm/pull/4787
* [Bugfix][Model] Add base class for vision-language models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4809
* [Kernel] Add marlin_24 unit tests by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/4901
* [Kernel] Add flash-attn back by WoosukKwon in https://github.com/vllm-project/vllm/pull/4907
* [Model] LLaVA model refactor by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4910
* Remove marlin warning by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/4918
* [Bugfix]: Fix communication Timeout error in safety-constrained distributed System by ZwwWayne in https://github.com/vllm-project/vllm/pull/4914
* [Build/CI] Enabling AMD Entrypoints Test by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4834
* [Bugfix] Fix dummy weight for fp8 by mzusman in https://github.com/vllm-project/vllm/pull/4916
* [Core] Sharded State Loader download from HF by aurickq in https://github.com/vllm-project/vllm/pull/4889
* [Doc]Add documentation to benchmarking script when running TGI by KuntaiDu in https://github.com/vllm-project/vllm/pull/4920
* [Core] Fix scheduler considering "no LoRA" as "LoRA" by Yard1 in https://github.com/vllm-project/vllm/pull/4897
* [Model] add rope_scaling support for qwen2 by hzhwcmhf in https://github.com/vllm-project/vllm/pull/4930
* [Model] Add Phi-2 LoRA support by Isotr0py in https://github.com/vllm-project/vllm/pull/4886
* [Docs] Add acknowledgment for sponsors by simon-mo in https://github.com/vllm-project/vllm/pull/4925
* [CI/Build] Codespell ignore `build/` directory by mgoin in https://github.com/vllm-project/vllm/pull/4945
* [Bugfix] Fix flag name for `max_seq_len_to_capture` by kerthcet in https://github.com/vllm-project/vllm/pull/4935
* [Bugfix][Kernel] Add head size check for attention backend selection by Isotr0py in https://github.com/vllm-project/vllm/pull/4944
* [Frontend] Dynamic RoPE scaling by sasha0552 in https://github.com/vllm-project/vllm/pull/4638
* [CI/Build] Enforce style for C++ and CUDA code with `clang-format` by mgoin in https://github.com/vllm-project/vllm/pull/4722
* [misc] remove comments that were supposed to be removed by rkooo567 in https://github.com/vllm-project/vllm/pull/4977
* [Kernel] Fixup for CUTLASS kernels in CUDA graphs by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/4954
* [Misc] Load FP8 kv-cache scaling factors from checkpoints by comaniac in https://github.com/vllm-project/vllm/pull/4893
* [Model] LoRA gptbigcode implementation by raywanb in https://github.com/vllm-project/vllm/pull/3949
* [Core] Eliminate parallel worker per-step task scheduling overhead by njhill in https://github.com/vllm-project/vllm/pull/4894
* [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig by pcmoritz in https://github.com/vllm-project/vllm/pull/4991
* [Misc] Take user preference in attention selector by comaniac in https://github.com/vllm-project/vllm/pull/4960
* Marlin 24 prefill performance improvement (about 25% better on average) by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/4983
* [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined by LetianLee in https://github.com/vllm-project/vllm/pull/5009
* [Core][1/N] Support PP PyNCCL Groups by andoorve in https://github.com/vllm-project/vllm/pull/4988
* [Kernel] Initial Activation Quantization Support by dsikka in https://github.com/vllm-project/vllm/pull/4525
* [Core]: Option To Use Prompt Token Ids Inside Logits Processor by kezouke in https://github.com/vllm-project/vllm/pull/4985
* [Doc] add ccache guide in doc by youkaichao in https://github.com/vllm-project/vllm/pull/5012
* [Bugfix] Fix Mistral v0.3 Weight Loading by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5005
* [Core][Bugfix]: fix prefix caching for blockv2 by leiwen83 in https://github.com/vllm-project/vllm/pull/4764
* [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model by linxihui in https://github.com/vllm-project/vllm/pull/4799
* [Misc] add logging level env var by youkaichao in https://github.com/vllm-project/vllm/pull/5045
* [Dynamic Spec Decoding] Minor fix for disabling speculative decoding by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/5000
* [Misc] Make Serving Benchmark More User-friendly by ywang96 in https://github.com/vllm-project/vllm/pull/5044
* [Bugfix / Core] Prefix Caching Guards (merged with main) by zhuohan123 in https://github.com/vllm-project/vllm/pull/4846
* [Core] Allow AQLM on Pascal by sasha0552 in https://github.com/vllm-project/vllm/pull/5058
* [Model] Add support for falcon-11B by Isotr0py in https://github.com/vllm-project/vllm/pull/5069
* [Core] Sliding window for block manager v2 by mmoskal in https://github.com/vllm-project/vllm/pull/4545
* [BugFix] Fix Embedding Models with TP>1 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5075
* [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X by divakar-amd in https://github.com/vllm-project/vllm/pull/4951
* [Docs] Add Dropbox as sponsors by simon-mo in https://github.com/vllm-project/vllm/pull/5089
* [Core] Consolidate prompt arguments to LLM engines by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4328
* [Bugfix] Remove the last EOS token unless explicitly specified by jsato8094 in https://github.com/vllm-project/vllm/pull/5077
* [Misc] add gpu_memory_utilization arg by pandyamarut in https://github.com/vllm-project/vllm/pull/5079
* [Core][Optimization] remove vllm-nccl by youkaichao in https://github.com/vllm-project/vllm/pull/5091
* [Bugfix] Fix arguments passed to `Sequence` in stop checker test by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5092
* [Core][Distributed] improve p2p access check by youkaichao in https://github.com/vllm-project/vllm/pull/4992
* [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) by afeldman-nm in https://github.com/vllm-project/vllm/pull/4837
* [Doc]Replace deprecated flag in readme by ronensc in https://github.com/vllm-project/vllm/pull/4526
* [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5096
* [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5097
* [Core] Avoid the need to pass `None` values to `Sequence.inputs` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5099
* [Bugfix] logprobs is not compatible with the OpenAI spec 4795 by Etelis in https://github.com/vllm-project/vllm/pull/5031
* [Doc][Build] update after removing vllm-nccl by youkaichao in https://github.com/vllm-project/vllm/pull/5103
* [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5108
* [CI/Build] Docker cleanup functionality for amd servers by okakarpa in https://github.com/vllm-project/vllm/pull/5112
* [BUGFIX] [FRONTEND] Correct chat logprobs by br3no in https://github.com/vllm-project/vllm/pull/5029
* [Bugfix] Automatically Detect SparseML models by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5119
* [CI/Build] increase wheel size limit to 200 MB by youkaichao in https://github.com/vllm-project/vllm/pull/5130
* [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py by ita9naiwa in https://github.com/vllm-project/vllm/pull/5129
* [Doc] Use intersphinx and update entrypoints docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5125
* add doc about serving option on dstack by deep-diver in https://github.com/vllm-project/vllm/pull/3074
* Bump version to v0.4.3 by simon-mo in https://github.com/vllm-project/vllm/pull/5046
* [Build] Disable sm_90a in cu11 by simon-mo in https://github.com/vllm-project/vllm/pull/5141
* [Bugfix] Avoid Warnings in SparseML Activation Quantization by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5120
* [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5136
* [Model] Support MAP-NEO model by xingweiqu in https://github.com/vllm-project/vllm/pull/5081
* Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" by simon-mo in https://github.com/vllm-project/vllm/pull/5149
* [Misc]: optimize eager mode host time by functionxu123 in https://github.com/vllm-project/vllm/pull/4196
* [Model] Enable FP8 QKV in MoE and refine kernel tuning script by comaniac in https://github.com/vllm-project/vllm/pull/5039
* [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support by njhill in https://github.com/vllm-project/vllm/pull/5171
* [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5168

New Contributors
* MahmoudAshraf97 made their first contribution in https://github.com/vllm-project/vllm/pull/4400
* sfc-gh-hazhang made their first contribution in https://github.com/vllm-project/vllm/pull/4652
* stevegrubb made their first contribution in https://github.com/vllm-project/vllm/pull/4719
* heeju-kim2 made their first contribution in https://github.com/vllm-project/vllm/pull/4295
* yikangshen made their first contribution in https://github.com/vllm-project/vllm/pull/4636
* KuntaiDu made their first contribution in https://github.com/vllm-project/vllm/pull/4696
* wuisawesome made their first contribution in https://github.com/vllm-project/vllm/pull/4794
* aurickq made their first contribution in https://github.com/vllm-project/vllm/pull/4690
* jinzhen-lin made their first contribution in https://github.com/vllm-project/vllm/pull/4788
* dulacp made their first contribution in https://github.com/vllm-project/vllm/pull/4862
* Silencioo made their first contribution in https://github.com/vllm-project/vllm/pull/4850
* tlrmchlsmth made their first contribution in https://github.com/vllm-project/vllm/pull/4749
* kerthcet made their first contribution in https://github.com/vllm-project/vllm/pull/4829
* bofenghuang made their first contribution in https://github.com/vllm-project/vllm/pull/4688
* eigen2017 made their first contribution in https://github.com/vllm-project/vllm/pull/4774
* alexeykondrat made their first contribution in https://github.com/vllm-project/vllm/pull/4658
* ZwwWayne made their first contribution in https://github.com/vllm-project/vllm/pull/4914
* mzusman made their first contribution in https://github.com/vllm-project/vllm/pull/4916
* hzhwcmhf made their first contribution in https://github.com/vllm-project/vllm/pull/4930
* raywanb made their first contribution in https://github.com/vllm-project/vllm/pull/3949
* LetianLee made their first contribution in https://github.com/vllm-project/vllm/pull/5009
* dsikka made their first contribution in https://github.com/vllm-project/vllm/pull/4525
* kezouke made their first contribution in https://github.com/vllm-project/vllm/pull/4985
* linxihui made their first contribution in https://github.com/vllm-project/vllm/pull/4799
* divakar-amd made their first contribution in https://github.com/vllm-project/vllm/pull/4951
* pandyamarut made their first contribution in https://github.com/vllm-project/vllm/pull/5079
* afeldman-nm made their first contribution in https://github.com/vllm-project/vllm/pull/4837
* Etelis made their first contribution in https://github.com/vllm-project/vllm/pull/5031
* okakarpa made their first contribution in https://github.com/vllm-project/vllm/pull/5112
* deep-diver made their first contribution in https://github.com/vllm-project/vllm/pull/3074
* xingweiqu made their first contribution in https://github.com/vllm-project/vllm/pull/5081
* functionxu123 made their first contribution in https://github.com/vllm-project/vllm/pull/4196

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.4.2...v0.4.3
