Highlights
* Significant progress on the V1 engine core refactor (9826, 10135, 10288, 10211, 10225, 10228, 10268, 9954, 10272, 9971, 10224, 10166, 9289, 10058, 9888, 9972, 10059, 9945, 9679, 9871, 10227, 10245, 9629, 10097, 10203, 10148). You can check out more details on the design and the plan ahead in our recent [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_2_130)
* Significant progress on `torch.compile` support. Many models now support `torch.compile` with TorchInductor. You can check out our [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_0_443) for more details, and an illustrative sketch follows this list. (9775, 9614, 9639, 9641, 9876, 9946, 9589, 9896, 9637, 9300, 9947, 9138, 9715, 9866, 9632, 9858, 9889)
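The `torch.compile` integration lives inside vLLM's model definitions, but the mechanism underneath is standard PyTorch. The following is a minimal, purely illustrative sketch of compiling a toy module with the TorchInductor backend; it is not the vLLM user-facing API, and the module here is a made-up placeholder.

```python
import torch
from torch import nn


class TinyMLP(nn.Module):
    """A toy module standing in for a model layer."""

    def __init__(self, hidden_size: int = 256) -> None:
        super().__init__()
        self.up = nn.Linear(hidden_size, 4 * hidden_size)
        self.down = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.nn.functional.silu(self.up(x)))


model = TinyMLP()
# torch.compile captures the module as an FX graph and TorchInductor lowers it
# to fused Triton/C++ kernels; vLLM applies the same machinery to supported models.
compiled_model = torch.compile(model, backend="inductor")
print(compiled_model(torch.randn(8, 256)).shape)
```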
Model Support
* New LLMs and VLMs: Idefics3 (9767), H2OVL-Mississippi (9747), Qwen2-Audio (9248), FalconMamba (9325), E5-V (9576), math-shepherd-mistral-7b-prm (9697), Pixtral models in the HF Transformers format (9036), Florence-2 (9555)
* New support for encoder-only embedding models: `BERTModel` (9056) and `Roberta` (9387)
* Expanded task support: LlamaEmbeddingModel (9806), Qwen2ForSequenceClassification (9704), Qwen2 embeddings (10184)
* Add a user-configurable `--task` parameter for models that support both generation and embedding (9424); see the `--task` sketch after this list
* Tool calling parsers for Granite 3.0 (9027), Jamba (9154), and granite-20b-functioncalling (8339)
* LoRA support for Granite 3.0 MoE (9673), Idefics3 (10281), LlamaEmbeddingModel (10071), Qwen (9622), Qwen2VLForConditionalGeneration (10022)
* BNB quantization support for Idefics3 (10310), Mllama (9720), Qwen2 (9467, 9574), MiniCPMV (9891); see the bitsandbytes sketch after this list
* Unified multi-modal processor for VLMs (10040, 10044, 9933, 10237, 9938, 9958, 10007, 9978, 9983, 10205)
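For checkpoints whose architecture is registered for both generation and embedding, the new `--task` option (and the matching `task` argument of the `LLM` class) selects which mode to load. A minimal sketch, assuming an embedding-capable checkpoint; the model name below is only a placeholder:

```python
from vllm import LLM

# Placeholder model: substitute any checkpoint that supports both tasks.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")

# In embedding mode, encode() returns pooled embeddings instead of generated text.
outputs = llm.encode(["The quick brown fox jumps over the lazy dog"])
print(len(outputs[0].outputs.embedding))  # dimensionality of the pooled embedding
```

The same selection is exposed on the server side, e.g. `vllm serve <model> --task embedding`.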
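Bitsandbytes quantization for the newly supported models follows the existing in-flight quantization path. A minimal sketch, with the model name as a placeholder and the argument values assumed to match the documented `bitsandbytes` usage:

```python
from vllm import LLM

# Placeholder model; the newly supported architectures are loaded the same way.
# Both arguments are needed to take the bitsandbytes path.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```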
Hardware Support
* Gaudi: Add Intel Gaudi (HPU) inference backend (6143)
* CPU: Add embedding model support for the CPU backend (10193)
* TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (9438)
* Triton: Add a `scaled_mm_triton` kernel implementation to support FP8 and INT8 SmoothQuant, symmetric case (9857)
Performance
* Combine chunked prefill with speculative decoding (9291); a configuration sketch follows this list
* `fused_moe` performance improvement (9384)
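As referenced above, a minimal configuration sketch for enabling chunked prefill together with speculative decoding. The model pair and token counts are placeholders, and the argument names assume the v0.6-era engine arguments (`speculative_model`, `num_speculative_tokens`, `enable_chunked_prefill`):

```python
from vllm import LLM, SamplingParams

# Placeholder target/draft pair; pick models with compatible tokenizers.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=4,        # draft tokens proposed per step
    enable_chunked_prefill=True,     # previously exclusive with speculative decoding
    max_num_batched_tokens=2048,     # per-step token budget used for chunking
)
outputs = llm.generate(["vLLM is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```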
Engine Core
* Override HF `config.json` via CLI (5836); see the sketch after this list
* Add goodput metric support (9338)
* Move parallel sampling out of the vLLM core, paving the way for the V1 engine (9302)
* Add a stateless process group for easier integration with RLHF and disaggregated prefill (10216, 10072)
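For the `config.json` override above, a minimal sketch. The argument name `hf_overrides` (CLI: `--hf-overrides`, taking a JSON object) is assumed from the feature description, and the model and override key are placeholders:

```python
from vllm import LLM

# Assumed interface: keys in hf_overrides are merged over the values loaded
# from the checkpoint's config.json before the model is constructed.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",          # placeholder model
    hf_overrides={"rope_theta": 1000000.0},  # placeholder override key/value
)
```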
Others
* Improvements to the pull request experience with DCO, mergify, stale bot, etc. (9436, 9512, 9513, 9259, 10082, 10285, 9803)
* Dropped support for Python 3.8 (10038, 8464)
* Basic Integration Test For TPU (9968)
* Document the class hierarchy in vLLM (10240) and explain the integration with Hugging Face (10173)
* Benchmark throughput now supports image input (9851)
What's Changed
* [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by WoosukKwon in https://github.com/vllm-project/vllm/pull/9350
* [Frontend] merge beam search implementations by LunrEclipse in https://github.com/vllm-project/vllm/pull/9296
* [Model] Make llama3.2 support multiple and interleaved images by xiangxu-google in https://github.com/vllm-project/vllm/pull/9095
* [Bugfix] Clean up some cruft in mamba.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9343
* [Frontend] Clarify model_type error messages by stevegrubb in https://github.com/vllm-project/vllm/pull/9345
* [Doc] Fix code formatting in spec_decode.rst by mgoin in https://github.com/vllm-project/vllm/pull/9348
* [Bugfix] Update InternVL input mapper to support image embeds by hhzhang16 in https://github.com/vllm-project/vllm/pull/9351
* [BugFix] Fix chat API continuous usage stats by njhill in https://github.com/vllm-project/vllm/pull/9357
* pass ignore_eos parameter to all benchmark_serving calls by gracehonv in https://github.com/vllm-project/vllm/pull/9349
* [Misc] Directly use compressed-tensors for checkpoint definitions by mgoin in https://github.com/vllm-project/vllm/pull/8909
* [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by CatherineSue in https://github.com/vllm-project/vllm/pull/9034
* [Bugfix][CI/Build] Fix CUDA 11.8 Build by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9386
* [Bugfix] Molmo text-only input bug fix by mrsalehi in https://github.com/vllm-project/vllm/pull/9397
* [Misc] Standardize RoPE handling for Qwen2-VL by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9250
* [Model] VLM2Vec, the first multimodal embedding model in vLLM by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9303
* [CI/Build] Test VLM embeddings by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9406
* [Core] Rename input data types by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8688
* [Misc] Consolidate example usage of OpenAI client for multimodal models by ywang96 in https://github.com/vllm-project/vllm/pull/9412
* [Model] Support SDPA attention for Molmo vision backbone by Isotr0py in https://github.com/vllm-project/vllm/pull/9410
* Support mistral interleaved attn by patrickvonplaten in https://github.com/vllm-project/vllm/pull/9414
* [Kernel][Model] Improve continuous batching for Jamba and Mamba by mzusman in https://github.com/vllm-project/vllm/pull/9189
* [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by streaver91 in https://github.com/vllm-project/vllm/pull/9396
* [Performance][Spec Decode] Optimize ngram lookup performance by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9333
* [CI/Build] mypy: Resolve some errors from checking vllm/engine by russellb in https://github.com/vllm-project/vllm/pull/9267
* [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9425
* [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by rasmith in https://github.com/vllm-project/vllm/pull/9391
* Add notes on the use of Slack by terrytangyuan in https://github.com/vllm-project/vllm/pull/9442
* [Kernel] Add Exllama as a backend for compressed-tensors by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9395
* [Misc] Print stack trace using `logger.exception` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9461
* [misc] CUDA Time Layerwise Profiler by LucasWilkinson in https://github.com/vllm-project/vllm/pull/8337
* [Bugfix] Allow prefill of assistant response when using `mistral_common` by sasha0552 in https://github.com/vllm-project/vllm/pull/9446
* [TPU] Call torch._sync(param) during weight loading by WoosukKwon in https://github.com/vllm-project/vllm/pull/9437
* [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/9344
* [Core] Deprecating block manager v1 and make block manager v2 default by KuntaiDu in https://github.com/vllm-project/vllm/pull/8704
* [CI/Build] remove .github from .dockerignore, add dirty repo check by dtrifiro in https://github.com/vllm-project/vllm/pull/9375
* [Misc] Remove commit id file by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9470
* [torch.compile] Fine-grained CustomOp enabling mechanism by ProExpertProg in https://github.com/vllm-project/vllm/pull/9300
* [Bugfix] Fix support for dimension like integers and ScalarType by bnellnm in https://github.com/vllm-project/vllm/pull/9299
* [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by wukaixingxp in https://github.com/vllm-project/vllm/pull/9013
* [Bugfix] Print warnings related to `mistral_common` tokenizer only once by sasha0552 in https://github.com/vllm-project/vllm/pull/9468
* [Hardwware][Neuron] Simplify model load for transformers-neuronx library by sssrijan-amazon in https://github.com/vllm-project/vllm/pull/9380
* Support `BERTModel` (first `encoder-only` embedding model) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9056
* [BugFix] Stop silent failures on compressed-tensors parsing by dsikka in https://github.com/vllm-project/vllm/pull/9381
* [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by joerunde in https://github.com/vllm-project/vllm/pull/9352
* [Qwen2.5] Support bnb quant for Qwen2.5 by blueyo0 in https://github.com/vllm-project/vllm/pull/9467
* [CI/Build] Use commit hash references for github actions by russellb in https://github.com/vllm-project/vllm/pull/9430
* [BugFix] Typing fixes to RequestOutput.prompt and beam search by njhill in https://github.com/vllm-project/vllm/pull/9473
* [Frontend][Feature] Add jamba tool parser by tomeras91 in https://github.com/vllm-project/vllm/pull/9154
* [BugFix] Fix and simplify completion API usage streaming by njhill in https://github.com/vllm-project/vllm/pull/9475
* [CI/Build] Fix lint errors in mistral tokenizer by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9504
* [Bugfix] Fix offline_inference_with_prefix.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9505
* [Misc] benchmark: Add option to set max concurrency by russellb in https://github.com/vllm-project/vllm/pull/9390
* [Model] Add user-configurable task for models that support both generation and embedding by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9424
* [CI/Build] Add error matching config for mypy by russellb in https://github.com/vllm-project/vllm/pull/9512
* [Model] Support Pixtral models in the HF Transformers format by mgoin in https://github.com/vllm-project/vllm/pull/9036
* [MISC] Add lora requests to metrics by coolkp in https://github.com/vllm-project/vllm/pull/9477
* [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py by comaniac in https://github.com/vllm-project/vllm/pull/9510
* [Kernel] Add env variable to force flashinfer backend to enable tensor cores by tdoublep in https://github.com/vllm-project/vllm/pull/9497
* [Bugfix] Fix offline mode when using `mistral_common` by sasha0552 in https://github.com/vllm-project/vllm/pull/9457
* :bug: fix torch memory profiling by joerunde in https://github.com/vllm-project/vllm/pull/9516
* [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily by njhill in https://github.com/vllm-project/vllm/pull/9521
* [Doc] update gpu-memory-utilization flag docs by joerunde in https://github.com/vllm-project/vllm/pull/9507
* [CI/Build] Add error matching for ruff output by russellb in https://github.com/vllm-project/vllm/pull/9513
* [CI/Build] Configure matcher for actionlint workflow by russellb in https://github.com/vllm-project/vllm/pull/9511
* [Frontend] Support simpler image input format by yue-anyscale in https://github.com/vllm-project/vllm/pull/9478
* [Bugfix] Fix missing task for speculative decoding by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9524
* [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf by mgoin in https://github.com/vllm-project/vllm/pull/9514
* [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger by heheda12345 in https://github.com/vllm-project/vllm/pull/9530
* [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision by mgoin in https://github.com/vllm-project/vllm/pull/9520
* [Kernel] Support sliding window in flash attention backend by heheda12345 in https://github.com/vllm-project/vllm/pull/9403
* [Frontend][Misc] Goodput metric support by Imss27 in https://github.com/vllm-project/vllm/pull/9338
* [CI/Build] Split up decoder-only LM tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9488
* [Doc] Consistent naming of attention backends by tdoublep in https://github.com/vllm-project/vllm/pull/9498
* [Model] FalconMamba Support by dhiaEddineRhaiem in https://github.com/vllm-project/vllm/pull/9325
* [Bugfix][Misc]: fix graph capture for decoder by yudian0504 in https://github.com/vllm-project/vllm/pull/9549
* [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint by varad-ahirwadkar in https://github.com/vllm-project/vllm/pull/9492
* [Model][Bugfix] Fix batching with multi-image in PixtralHF by mgoin in https://github.com/vllm-project/vllm/pull/9518
* [Frontend] Reduce frequency of client cancellation checking by njhill in https://github.com/vllm-project/vllm/pull/7959
* [doc] fix format by youkaichao in https://github.com/vllm-project/vllm/pull/9562
* [BugFix] Update draft model TP size check to allow matching target TP size by njhill in https://github.com/vllm-project/vllm/pull/9394
* [Frontend] Don't log duplicate error stacktrace for every request in the batch by wallashss in https://github.com/vllm-project/vllm/pull/9023
* [CI] Make format checker error message more user-friendly by using emoji by KuntaiDu in https://github.com/vllm-project/vllm/pull/9564
* :bug: Fixup more test failures from memory profiling by joerunde in https://github.com/vllm-project/vllm/pull/9563
* [core] move parallel sampling out from vllm core by youkaichao in https://github.com/vllm-project/vllm/pull/9302
* [Bugfix]: serialize config instances by value when using --trust-remote-code by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6751
* [CI/Build] Remove unnecessary `fork_new_process` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9484
* [Bugfix][OpenVINO] fix_dockerfile_openvino by ngrozae in https://github.com/vllm-project/vllm/pull/9552
* [Bugfix]: phi.py get rope_theta from config file by Falko1 in https://github.com/vllm-project/vllm/pull/9503
* [CI/Build] Replaced some models on tests for smaller ones by wallashss in https://github.com/vllm-project/vllm/pull/9570
* [Core] Remove evictor_v1 by KuntaiDu in https://github.com/vllm-project/vllm/pull/9572
* [Doc] Use shell code-blocks and fix section headers by rafvasq in https://github.com/vllm-project/vllm/pull/9508
* support TP in qwen2 bnb by chenqianfzh in https://github.com/vllm-project/vllm/pull/9574
* [Hardware][CPU] using current_platform.is_cpu by wangshuai09 in https://github.com/vllm-project/vllm/pull/9536
* [V1] Implement vLLM V1 [1/N] by WoosukKwon in https://github.com/vllm-project/vllm/pull/9289
* [CI/Build][LoRA] Temporarily fix long context failure issue by jeejeelee in https://github.com/vllm-project/vllm/pull/9579
* [Neuron] [Bugfix] Fix neuron startup by xendo in https://github.com/vllm-project/vllm/pull/9374
* [Model][VLM] Initialize support for Mono-InternVL model by Isotr0py in https://github.com/vllm-project/vllm/pull/9528
* [Bugfix] Eagle: change config name for fc bias by gopalsarda in https://github.com/vllm-project/vllm/pull/9580
* [Hardware][Intel CPU][DOC] Update docs for CPU backend by zhouyuan in https://github.com/vllm-project/vllm/pull/6212
* [Frontend] Support custom request_id from request by guoyuhong in https://github.com/vllm-project/vllm/pull/9550
* [BugFix] Prevent exporting duplicate OpenTelemetry spans by ronensc in https://github.com/vllm-project/vllm/pull/9017
* [torch.compile] auto infer dynamic_arg_dims from type annotation by youkaichao in https://github.com/vllm-project/vllm/pull/9589
* [Bugfix] fix detokenizer shallow copy by aurickq in https://github.com/vllm-project/vllm/pull/5919
* [Misc] Make benchmarks use EngineArgs by JArnoldAMD in https://github.com/vllm-project/vllm/pull/9529
* [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9487
* [BugFix] Fix metrics error for --num-scheduler-steps > 1 by yuleil in https://github.com/vllm-project/vllm/pull/8234
* [Doc]: Update tensorizer docs to include vllm[tensorizer] by sethkimmel3 in https://github.com/vllm-project/vllm/pull/7889
* [Bugfix] Generate exactly input_len tokens in benchmark_throughput by heheda12345 in https://github.com/vllm-project/vllm/pull/9592
* [Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages by sfc-gh-zhwang in https://github.com/vllm-project/vllm/pull/9590
* [Model] Support E5-V by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9576
* [Build] Fix `FetchContent` multiple build issue by ProExpertProg in https://github.com/vllm-project/vllm/pull/9596
* [Hardware][XPU] using current_platform.is_xpu by MengqingCao in https://github.com/vllm-project/vllm/pull/9605
* [Model] Initialize Florence-2 language backbone support by Isotr0py in https://github.com/vllm-project/vllm/pull/9555
* [VLM] Post-layernorm override and quant config in vision encoder by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9217
* [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9612
* [Bugfix] Fix `_init_vision_model` in NVLM_D model by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9611
* [misc] comment to avoid future confusion about baichuan by youkaichao in https://github.com/vllm-project/vllm/pull/9620
* [Bugfix] Fix divide by zero when serving Mamba models by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9617
* [Misc] Separate total and output tokens in benchmark_throughput.py by mgoin in https://github.com/vllm-project/vllm/pull/8914
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9614
* [Frontend] Enable Online Multi-image Support for MLlama by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9393
* [Model] Add Qwen2-Audio model support by faychu in https://github.com/vllm-project/vllm/pull/9248
* [CI/Build] Add bot to close stale issues and PRs by russellb in https://github.com/vllm-project/vllm/pull/9436
* [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image by mgoin in https://github.com/vllm-project/vllm/pull/9626
* [Bugfix] Use "vision_model" prefix for MllamaVisionModel by mgoin in https://github.com/vllm-project/vllm/pull/9628
* [Bugfix]: Make chat content text allow type content by vrdn-23 in https://github.com/vllm-project/vllm/pull/9358
* [XPU] avoid triton import for xpu by yma11 in https://github.com/vllm-project/vllm/pull/9440
* [Bugfix] Fix PP for ChatGLM and Molmo, and weight loading for Qwen2.5-Math-RM by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9422
* [V1][Bugfix] Clean up requests when aborted by WoosukKwon in https://github.com/vllm-project/vllm/pull/9629
* [core] simplify seq group code by youkaichao in https://github.com/vllm-project/vllm/pull/9569
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9639
* [Kernel] add kernel for FATReLU by jeejeelee in https://github.com/vllm-project/vllm/pull/9610
* [torch.compile] expanding support and fix allgather compilation by CRZbulabula in https://github.com/vllm-project/vllm/pull/9637
* [Doc] Move additional tips/notes to the top by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9647
* [Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models by litianjian in https://github.com/vllm-project/vllm/pull/9653
* Increase operation per run limit for "Close inactive issues and PRs" workflow by hmellor in https://github.com/vllm-project/vllm/pull/9661
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9641
* [CI/Build] Fix VLM test failures when using transformers v4.46 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9666
* [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9650
* [Log][Bugfix] Fix default value check for `image_url.detail` by mgoin in https://github.com/vllm-project/vllm/pull/9663
* [Performance][Kernel] Fused_moe Performance Improvement by charlifu in https://github.com/vllm-project/vllm/pull/9384
* [Bugfix] Remove xformers requirement for Pixtral by mgoin in https://github.com/vllm-project/vllm/pull/9597
* [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test 9675 by khluu in https://github.com/vllm-project/vllm/pull/9676
* [Model] add a lora module for granite 3.0 MoE models by willmj in https://github.com/vllm-project/vllm/pull/9673
* [V1] Support sliding window attention by WoosukKwon in https://github.com/vllm-project/vllm/pull/9679
* [Bugfix] Fix compressed_tensors_moe bad config.strategy by mgoin in https://github.com/vllm-project/vllm/pull/9677
* [Doc] Improve quickstart documentation by rafvasq in https://github.com/vllm-project/vllm/pull/9256
* [Bugfix] Fix crash with llama 3.2 vision models and guided decoding by tjohnson31415 in https://github.com/vllm-project/vllm/pull/9631
* [Bugfix] Steaming continuous_usage_stats default to False by samos123 in https://github.com/vllm-project/vllm/pull/9709
* [Hardware][openvino] is_openvino --> current_platform.is_openvino by MengqingCao in https://github.com/vllm-project/vllm/pull/9716
* Fix: MI100 Support By Bypassing Custom Paged Attention by MErkinSag in https://github.com/vllm-project/vllm/pull/9560
* [Frontend] Bad words sampling parameter by Alvant in https://github.com/vllm-project/vllm/pull/9717
* [Model] Add classification Task with Qwen2ForSequenceClassification by kakao-kevin-us in https://github.com/vllm-project/vllm/pull/9704
* [Misc] SpecDecodeWorker supports profiling by Abatom in https://github.com/vllm-project/vllm/pull/9719
* [core] cudagraph output with tensor weak reference by youkaichao in https://github.com/vllm-project/vllm/pull/9724
* [Misc] Upgrade to pytorch 2.5 by bnellnm in https://github.com/vllm-project/vllm/pull/9588
* Fix cache management in "Close inactive issues and PRs" actions workflow by hmellor in https://github.com/vllm-project/vllm/pull/9734
* [Bugfix] Fix load config when using bools by madt2709 in https://github.com/vllm-project/vllm/pull/9533
* [Hardware][ROCM] using current_platform.is_rocm by wangshuai09 in https://github.com/vllm-project/vllm/pull/9642
* [torch.compile] support moe models by youkaichao in https://github.com/vllm-project/vllm/pull/9632
* Fix beam search eos by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9627
* [Bugfix] Fix ray instance detect issue by yma11 in https://github.com/vllm-project/vllm/pull/9439
* [CI/Build] Adopt Mergify for auto-labeling PRs by russellb in https://github.com/vllm-project/vllm/pull/9259
* [Model][VLM] Add multi-video support for LLaVA-Onevision by litianjian in https://github.com/vllm-project/vllm/pull/8905
* [torch.compile] Adding "torch compile" annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9758
* [misc] avoid circular import by youkaichao in https://github.com/vllm-project/vllm/pull/9765
* [torch.compile] add deepseek v2 compile by youkaichao in https://github.com/vllm-project/vllm/pull/9775
* [Doc] fix third-party model example by russellb in https://github.com/vllm-project/vllm/pull/9771
* [Model][LoRA]LoRA support added for Qwen by jeejeelee in https://github.com/vllm-project/vllm/pull/9622
* [Doc] Specify async engine args in docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9726
* [Bugfix] Use temporary directory in registry by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9721
* [Frontend] re-enable multi-modality input in the new beam search implementation by FerdinandZhong in https://github.com/vllm-project/vllm/pull/9427
* [Model] Add BNB quantization support for Mllama by Isotr0py in https://github.com/vllm-project/vllm/pull/9720
* [Hardware] using current_platform.seed_everything by wangshuai09 in https://github.com/vllm-project/vllm/pull/9785
* [Misc] Add metrics for request queue time, forward time, and execute time by Abatom in https://github.com/vllm-project/vllm/pull/9659
* Fix the log to correct guide user to install modelscope by tastelikefeet in https://github.com/vllm-project/vllm/pull/9793
* [Bugfix] Use host argument to bind to interface by svenseeberg in https://github.com/vllm-project/vllm/pull/9798
* [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) by yannicks1 in https://github.com/vllm-project/vllm/pull/9801
* [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel by jsato8094 in https://github.com/vllm-project/vllm/pull/9806
* [CI][Bugfix] Skip chameleon for transformers 4.46.1 by mgoin in https://github.com/vllm-project/vllm/pull/9808
* [CI/Build] mergify: fix rules for ci/build label by russellb in https://github.com/vllm-project/vllm/pull/9804
* [MISC] Set label value to timestamp over 0, to keep track of recent history by coolkp in https://github.com/vllm-project/vllm/pull/9777
* [Bugfix][Frontend] Guard against bad token ids by joerunde in https://github.com/vllm-project/vllm/pull/9634
* [Model] tool calling support for ibm-granite/granite-20b-functioncalling by wseaton in https://github.com/vllm-project/vllm/pull/8339
* [Docs] Add notes about Snowflake Meetup by simon-mo in https://github.com/vllm-project/vllm/pull/9814
* [Bugfix] Fix prefix strings for quantized VLMs by mgoin in https://github.com/vllm-project/vllm/pull/9772
* [core][distributed] fix custom allreduce in pytorch 2.5 by youkaichao in https://github.com/vllm-project/vllm/pull/9815
* Update README.md by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9819
* [Bugfix][VLM] Make apply_fp8_linear work with >2D input by mgoin in https://github.com/vllm-project/vllm/pull/9812
* [ci/build] Pin CI dependencies version with pip-compile by khluu in https://github.com/vllm-project/vllm/pull/9810
* [Bugfix] Fix multi nodes TP+PP for XPU by yma11 in https://github.com/vllm-project/vllm/pull/8884
* [Doc] Add the DCO to CONTRIBUTING.md by russellb in https://github.com/vllm-project/vllm/pull/9803
* [torch.compile] rework compile control with piecewise cudagraph by youkaichao in https://github.com/vllm-project/vllm/pull/9715
* [Misc] Specify minimum pynvml version by jeejeelee in https://github.com/vllm-project/vllm/pull/9827
* [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA by WoosukKwon in https://github.com/vllm-project/vllm/pull/9438
* [CI/Build] VLM Test Consolidation by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9372
* [Model] Support math-shepherd-mistral-7b-prm model by Went-Liang in https://github.com/vllm-project/vllm/pull/9697
* [Misc] Add chunked-prefill support on FlashInfer. by elfiegg in https://github.com/vllm-project/vllm/pull/9781
* [Bugfix][core] replace heartbeat with pid check by joerunde in https://github.com/vllm-project/vllm/pull/9818
* [Doc] link bug for multistep guided decoding by joerunde in https://github.com/vllm-project/vllm/pull/9843
* [Neuron] Update Dockerfile.neuron to fix build failure by hbikki in https://github.com/vllm-project/vllm/pull/9822
* [doc] update pp support by youkaichao in https://github.com/vllm-project/vllm/pull/9853
* [CI/Build] Simplify exception trace in api server tests by CRZbulabula in https://github.com/vllm-project/vllm/pull/9787
* [torch.compile] upgrade tests by youkaichao in https://github.com/vllm-project/vllm/pull/9858
* [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint by gcalmettes in https://github.com/vllm-project/vllm/pull/9837
* Revert "[Bugfix] Use host argument to bind to interface (9798)" by khluu in https://github.com/vllm-project/vllm/pull/9852
* [Model] Support quantization of Qwen2VisionTransformer for Qwen2-VL by mgoin in https://github.com/vllm-project/vllm/pull/9817
* [Misc] Remove deprecated arg for cuda graph capture by ywang96 in https://github.com/vllm-project/vllm/pull/9864
* [Doc] Update Qwen documentation by jeejeelee in https://github.com/vllm-project/vllm/pull/9869
* [CI/Build] Add Model Tests for Qwen2-VL by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9846
* [CI/Build] Adding a forced docker system prune to clean up space by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/9849
* [Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together by sasha0552 in https://github.com/vllm-project/vllm/pull/9532
* [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 by mzusman in https://github.com/vllm-project/vllm/pull/9838
* [ci/build] Configure dependabot to update pip dependencies by khluu in https://github.com/vllm-project/vllm/pull/9811
* [Bugfix][Frontend] Reject guided decoding in multistep mode by joerunde in https://github.com/vllm-project/vllm/pull/9892
* [torch.compile] directly register custom op by youkaichao in https://github.com/vllm-project/vllm/pull/9896
* [Bugfix] Fix layer skip logic with bitsandbytes by mgoin in https://github.com/vllm-project/vllm/pull/9887
* [torch.compile] rework test plans by youkaichao in https://github.com/vllm-project/vllm/pull/9866
* [Model] Support bitsandbytes for MiniCPMV by mgoin in https://github.com/vllm-project/vllm/pull/9891
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9876
* [Doc] Update multi-input support by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9906
* [Frontend] Chat-based Embeddings API by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9759
* [CI/Build] Add Model Tests for PixtralHF by mgoin in https://github.com/vllm-project/vllm/pull/9813
* [Frontend] Use a proper chat template for VLM2Vec by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9912
* [Bugfix] Fix edge cases for MistralTokenizer by tjohnson31415 in https://github.com/vllm-project/vllm/pull/9625
* [Core] Refactor: Clean up unused argument preemption_mode in Scheduler._preempt by andrejonasson in https://github.com/vllm-project/vllm/pull/9696
* [torch.compile] use interpreter with stable api from pytorch by youkaichao in https://github.com/vllm-project/vllm/pull/9889
* [Bugfix/Core] Remove assertion for Flashinfer k_scale and v_scale by pavanimajety in https://github.com/vllm-project/vllm/pull/9861
* [1/N] pass the complete config from engine to executor by youkaichao in https://github.com/vllm-project/vllm/pull/9933
* [Bugfix] PicklingError on RayTaskError by GeneDer in https://github.com/vllm-project/vllm/pull/9934
* Bump the patch-update group with 10 updates by dependabot in https://github.com/vllm-project/vllm/pull/9897
* [Core][VLM] Add precise multi-modal placeholder tracking by petersalas in https://github.com/vllm-project/vllm/pull/8346
* [ci/build] Have dependabot ignore pinned dependencies by khluu in https://github.com/vllm-project/vllm/pull/9935
* [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models by sroy745 in https://github.com/vllm-project/vllm/pull/9559
* [torch.compile] fix cpu broken code by youkaichao in https://github.com/vllm-project/vllm/pull/9947
* [Docs] Update Granite 3.0 models in supported models table by njhill in https://github.com/vllm-project/vllm/pull/9930
* [Doc] Updated tpu-installation.rst with more details by mikegre-google in https://github.com/vllm-project/vllm/pull/9926
* [2/N] executor pass the complete config to worker/modelrunner by youkaichao in https://github.com/vllm-project/vllm/pull/9938
* [V1] Fix `EngineArgs` refactor on V1 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9954
* [bugfix] fix chatglm dummy_data_for_glmv by youkaichao in https://github.com/vllm-project/vllm/pull/9955
* [3/N] model runner pass the whole config to model by youkaichao in https://github.com/vllm-project/vllm/pull/9958
* [CI/Build] Quoting around > by nokados in https://github.com/vllm-project/vllm/pull/9956
* [torch.compile] Adding torch compile annotations to vision-language models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9946
* [bugfix] fix tsts by youkaichao in https://github.com/vllm-project/vllm/pull/9959
* [V1] Support per-request seed by njhill in https://github.com/vllm-project/vllm/pull/9945
* [Model] Add support for H2OVL-Mississippi models by cooleel in https://github.com/vllm-project/vllm/pull/9747
* [V1] Fix Configs by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9971
* [Bugfix] Fix MiniCPMV and Mllama BNB bug by jeejeelee in https://github.com/vllm-project/vllm/pull/9917
* [Bugfix]Using the correct type hints by gshtras in https://github.com/vllm-project/vllm/pull/9885
* [Misc] Compute query_start_loc/seq_start_loc on CPU by zhengy001 in https://github.com/vllm-project/vllm/pull/9447
* [Bugfix] Fix E2EL mean and median stats by daitran2k1 in https://github.com/vllm-project/vllm/pull/9984
* [Bugfix][OpenVINO] Fix circular reference 9939 by MengqingCao in https://github.com/vllm-project/vllm/pull/9974
* [Frontend] Multi-Modality Support for Loading Local Image Files by chaunceyjiang in https://github.com/vllm-project/vllm/pull/9915
* [4/N] make quant config first-class citizen by youkaichao in https://github.com/vllm-project/vllm/pull/9978
* [Misc]Reduce BNB static variable by jeejeelee in https://github.com/vllm-project/vllm/pull/9987
* [Model] factoring out MambaMixer out of Jamba by mzusman in https://github.com/vllm-project/vllm/pull/8993
* [CI] Basic Integration Test For TPU by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9968
* [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs by hissu-hyvarinen in https://github.com/vllm-project/vllm/pull/9279
* [Doc] Update VLM doc about loading from local files by ywang96 in https://github.com/vllm-project/vllm/pull/9999
* [Bugfix] Fix `MQLLMEngine` hanging by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9973
* [Misc] Refactor benchmark_throughput.py by lk-chen in https://github.com/vllm-project/vllm/pull/9779
* [Frontend] Add max_tokens prometheus metric by tomeras91 in https://github.com/vllm-project/vllm/pull/9881
* [Bugfix] Upgrade to pytorch 2.5.1 by bnellnm in https://github.com/vllm-project/vllm/pull/10001
* [4.5/N] bugfix for quant config in speculative decode by youkaichao in https://github.com/vllm-project/vllm/pull/10007
* [Bugfix] Respect modules_to_not_convert within awq_marlin by mgoin in https://github.com/vllm-project/vllm/pull/9895
* [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9994
* [Core] Make encoder-decoder inputs a nested structure to be more composable by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9604
* [Bugfix] Fixup Mamba by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/10004
* [BugFix] Lazy import ray by GeneDer in https://github.com/vllm-project/vllm/pull/10021
* [Misc] vllm CLI flags should be ordered for better user readability by chaunceyjiang in https://github.com/vllm-project/vllm/pull/10017
* [Frontend] Fix tcp port reservation for api server by russellb in https://github.com/vllm-project/vllm/pull/10012
* Refactor TPU requirements file and pin build dependencies by richardsliu in https://github.com/vllm-project/vllm/pull/10010
* [Misc] Add logging for CUDA memory by yangalan123 in https://github.com/vllm-project/vllm/pull/10027
* [CI/Build] Limit github CI jobs based on files changed by russellb in https://github.com/vllm-project/vllm/pull/9928
* [Model] Support quantization of PixtralHFTransformer for PixtralHF by mgoin in https://github.com/vllm-project/vllm/pull/9921
* [Feature] Update benchmark_throughput.py to support image input by lk-chen in https://github.com/vllm-project/vllm/pull/9851
* [Misc] Modify BNB parameter name by jeejeelee in https://github.com/vllm-project/vllm/pull/9997
* [CI] Prune tests/models/decoder_only/language/* tests by mgoin in https://github.com/vllm-project/vllm/pull/9940
* [CI] Prune back the number of tests in tests/kernels/* by mgoin in https://github.com/vllm-project/vllm/pull/9932
* [bugfix] fix weak ref in piecewise cudagraph and tractable test by youkaichao in https://github.com/vllm-project/vllm/pull/10048
* [Bugfix] Properly propagate trust_remote_code settings by zifeitong in https://github.com/vllm-project/vllm/pull/10047
* [Bugfix] Fix pickle of input when async output processing is on by wallashss in https://github.com/vllm-project/vllm/pull/9931
* [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode by llsj14 in https://github.com/vllm-project/vllm/pull/9730
* [v1] reduce graph capture time for piecewise cudagraph by youkaichao in https://github.com/vllm-project/vllm/pull/10059
* [Misc] Sort the list of embedding models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10037
* [Model][OpenVINO] Fix regressions from 8346 by petersalas in https://github.com/vllm-project/vllm/pull/10045
* [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer by tjohnson31415 in https://github.com/vllm-project/vllm/pull/10051
* [Bugfix] Gpt-j-6B patch kv_scale to k_scale path by arakowsk-amd in https://github.com/vllm-project/vllm/pull/10063
* [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type by zifeitong in https://github.com/vllm-project/vllm/pull/10054
* [V1] Integrate Piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10058
* [distributed] add function to create ipc buffers directly by youkaichao in https://github.com/vllm-project/vllm/pull/10064
* [CI/Build] drop support for Python 3.8 EOL by aarnphm in https://github.com/vllm-project/vllm/pull/8464
* [CI/Build] Fix large_gpu_mark reason by Isotr0py in https://github.com/vllm-project/vllm/pull/10070
* [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend by kzawora-intel in https://github.com/vllm-project/vllm/pull/6143
* [Hotfix] Fix ruff errors by WoosukKwon in https://github.com/vllm-project/vllm/pull/10073
* [Model][LoRA]LoRA support added for LlamaEmbeddingModel by jeejeelee in https://github.com/vllm-project/vllm/pull/10071
* [Model] Add Idefics3 support by jeejeelee in https://github.com/vllm-project/vllm/pull/9767
* [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration by ericperfect in https://github.com/vllm-project/vllm/pull/10022
* Remove ScaledActivation for AWQ by mgoin in https://github.com/vllm-project/vllm/pull/10057
* [CI/Build] Drop Python 3.8 support by russellb in https://github.com/vllm-project/vllm/pull/10038
* [CI/Build] change conflict PR comment from mergify by russellb in https://github.com/vllm-project/vllm/pull/10080
* [V1] Make v1 more testable by joerunde in https://github.com/vllm-project/vllm/pull/9888
* [CI/Build] Always run the ruff workflow by russellb in https://github.com/vllm-project/vllm/pull/10092
* [core][distributed] add stateless_init_process_group by youkaichao in https://github.com/vllm-project/vllm/pull/10072
* [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 by mgoin in https://github.com/vllm-project/vllm/pull/10095
* [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend by yma11 in https://github.com/vllm-project/vllm/pull/9823
* [Frontend] Adjust try/except blocks in API impl by njhill in https://github.com/vllm-project/vllm/pull/10056
* [Hardware][CPU] Update torch 2.5 by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/9911
* [doc] add back Python 3.8 ABI by youkaichao in https://github.com/vllm-project/vllm/pull/10100
* [V1][BugFix] Fix Generator construction in greedy + seed case by njhill in https://github.com/vllm-project/vllm/pull/10097
* [Misc] Consolidate ModelConfig code related to HF config by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10104
* [CI/Build] re-add codespell to CI by russellb in https://github.com/vllm-project/vllm/pull/10083
* [Doc] Improve benchmark documentation by rafvasq in https://github.com/vllm-project/vllm/pull/9927
* [Core][Distributed] Refactor ipc buffer init in CustomAllreduce by hanzhi713 in https://github.com/vllm-project/vllm/pull/10030
* [CI/Build] Improve mypy + python version matrix by russellb in https://github.com/vllm-project/vllm/pull/10041
* Adds method to read the pooling types from model's files by flaviabeo in https://github.com/vllm-project/vllm/pull/9506
* [Frontend] Fix multiple values for keyword argument error (10075) by DIYer22 in https://github.com/vllm-project/vllm/pull/10076
* [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/10108
* [Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2-VL by li-plus in https://github.com/vllm-project/vllm/pull/10112
* [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. by spliii in https://github.com/vllm-project/vllm/pull/10105
* [Frontend] Tool calling parser for Granite 3.0 models by maxdebayser in https://github.com/vllm-project/vllm/pull/9027
* [Feature] [Spec decode]: Combine chunked prefill with speculative decoding by NickLucche in https://github.com/vllm-project/vllm/pull/9291
* [CI/Build] Always run mypy by russellb in https://github.com/vllm-project/vllm/pull/10122
* [CI/Build] Add shell script linting using shellcheck by russellb in https://github.com/vllm-project/vllm/pull/7925
* [CI/Build] Automate PR body text cleanup by russellb in https://github.com/vllm-project/vllm/pull/10082
* Bump actions/setup-python from 5.2.0 to 5.3.0 by dependabot in https://github.com/vllm-project/vllm/pull/9745
* Online video support for VLMs by litianjian in https://github.com/vllm-project/vllm/pull/10020
* Bump actions/checkout from 4.2.1 to 4.2.2 by dependabot in https://github.com/vllm-project/vllm/pull/9746
* [Misc] Add environment variables collection in collect_env.py tool by ycool in https://github.com/vllm-project/vllm/pull/9293
* [V1] Add all_token_ids attribute to Request by WoosukKwon in https://github.com/vllm-project/vllm/pull/10135
* [V1] Prefix caching (take 2) by comaniac in https://github.com/vllm-project/vllm/pull/9972
* [CI/Build] Give PR cleanup job PR write access by russellb in https://github.com/vllm-project/vllm/pull/10139
* [Doc] Update FAQ links in spec_decode.rst by whyiug in https://github.com/vllm-project/vllm/pull/9662
* [Bugfix] Add error handling when server cannot respond any valid tokens by DearPlanet in https://github.com/vllm-project/vllm/pull/5895
* [Misc] Fix ImportError causing by triton by MengqingCao in https://github.com/vllm-project/vllm/pull/9493
* [Doc] Move CONTRIBUTING to docs site by russellb in https://github.com/vllm-project/vllm/pull/9924
* Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. by sighingnow in https://github.com/vllm-project/vllm/pull/9285
* Add hf_transfer to testing image by mgoin in https://github.com/vllm-project/vllm/pull/10096
* [Misc] Fix typo in 5895 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10145
* [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator by yma11 in https://github.com/vllm-project/vllm/pull/10144
* [Model] Expose size to Idefics3 as mm_processor_kwargs by Isotr0py in https://github.com/vllm-project/vllm/pull/10146
* [V1]Enable APC by default only for text models by ywang96 in https://github.com/vllm-project/vllm/pull/10148
* [CI/Build] Update CPU tests to include all "standard" tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5481
* Fix edge case Mistral tokenizer by patrickvonplaten in https://github.com/vllm-project/vllm/pull/10152
* Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 by sroy745 in https://github.com/vllm-project/vllm/pull/10136
* [Misc] Improve Web UI by rafvasq in https://github.com/vllm-project/vllm/pull/10090
* [V1] Fix non-cudagraph op name by WoosukKwon in https://github.com/vllm-project/vllm/pull/10166
* [CI/Build] Ignore .gitignored files for shellcheck by ProExpertProg in https://github.com/vllm-project/vllm/pull/10162
* Rename vllm.logging to vllm.logging_utils by flozi00 in https://github.com/vllm-project/vllm/pull/10134
* [torch.compile] Fuse RMSNorm with quant by ProExpertProg in https://github.com/vllm-project/vllm/pull/9138
* [Bugfix] Fix SymIntArrayRef expected to contain only concrete integers by bnellnm in https://github.com/vllm-project/vllm/pull/10170
* [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case by rasmith in https://github.com/vllm-project/vllm/pull/9857
* [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/6892
* [0/N] Rename `MultiModalInputs` to `MultiModalKwargs` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10040
* [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module by mgoin in https://github.com/vllm-project/vllm/pull/10169
* [CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing by Isotr0py in https://github.com/vllm-project/vllm/pull/10161
* [Doc] Adjust RunLLM location by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10176
* [5/N] pass the whole config to model by youkaichao in https://github.com/vllm-project/vllm/pull/9983
* [CI/Build] Add run-hpu-test.sh script by xuechendi in https://github.com/vllm-project/vllm/pull/10167
* [Bugfix] Enable some fp8 and quantized fullgraph tests by bnellnm in https://github.com/vllm-project/vllm/pull/10171
* [bugfix] fix broken tests of mlp speculator by youkaichao in https://github.com/vllm-project/vllm/pull/10177
* [doc] explaining the integration with huggingface by youkaichao in https://github.com/vllm-project/vllm/pull/10173
* bugfix: fix the bug that stream generate not work by caijizhuo in https://github.com/vllm-project/vllm/pull/2756
* [Frontend] add `add_request_id` middleware by cjackal in https://github.com/vllm-project/vllm/pull/9594
* [Frontend][Core] Override HF `config.json` via CLI by KrishnaM251 in https://github.com/vllm-project/vllm/pull/5836
* [CI/Build] Split up models tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10069
* [ci][build] limit cmake version by youkaichao in https://github.com/vllm-project/vllm/pull/10188
* [Doc] Fix typo error in CONTRIBUTING.md by FuryMartin in https://github.com/vllm-project/vllm/pull/10190
* [doc] Polish the integration with huggingface doc by CRZbulabula in https://github.com/vllm-project/vllm/pull/10195
* [Misc] small fixes to function tracing file path by ShawnD200 in https://github.com/vllm-project/vllm/pull/9543
* [misc] improve cloudpickle registration and tests by youkaichao in https://github.com/vllm-project/vllm/pull/10202
* [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py by yansh97 in https://github.com/vllm-project/vllm/pull/10196
* [doc] improve debugging code by youkaichao in https://github.com/vllm-project/vllm/pull/10206
* [6/N] pass whole config to inner model by youkaichao in https://github.com/vllm-project/vllm/pull/10205
* Bump the patch-update group with 5 updates by dependabot in https://github.com/vllm-project/vllm/pull/10210
* [Hardware][CPU] Add embedding models support for CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/10193
* [LoRA][Kernel] Remove the unused libentry module by jeejeelee in https://github.com/vllm-project/vllm/pull/10214
* [V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer by ywang96 in https://github.com/vllm-project/vllm/pull/10211
* [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/10218
* [Metrics] add more metrics by HarryWu99 in https://github.com/vllm-project/vllm/pull/4464
* [Doc] fix doc string typo in block_manager `swap_out` function by yyccli in https://github.com/vllm-project/vllm/pull/10212
* [core][distributed] add stateless process group by youkaichao in https://github.com/vllm-project/vllm/pull/10216
* Bump actions/setup-python from 5.2.0 to 5.3.0 by dependabot in https://github.com/vllm-project/vllm/pull/10209
* [V1] Fix detokenizer ports by WoosukKwon in https://github.com/vllm-project/vllm/pull/10224
* [V1] Do not use inductor for piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10225
* [v1][torch.compile] support managing cudagraph buffer by youkaichao in https://github.com/vllm-project/vllm/pull/10203
* [V1] Use custom ops for piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10227
* Add docs on serving with Llama Stack by terrytangyuan in https://github.com/vllm-project/vllm/pull/10183
* [misc][distributed] auto port selection and disable tests by youkaichao in https://github.com/vllm-project/vllm/pull/10226
* [V1] Enable custom ops with piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10228
* Make shutil rename in python_only_dev by shcheglovnd in https://github.com/vllm-project/vllm/pull/10233
* [V1] `AsyncLLM` Implementation by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9826
* [doc] update debugging guide by youkaichao in https://github.com/vllm-project/vllm/pull/10236
* [Doc] Update help text for `--distributed-executor-backend` by russellb in https://github.com/vllm-project/vllm/pull/10231
* [1/N] torch.compile user interface design by youkaichao in https://github.com/vllm-project/vllm/pull/10237
* [Misc][LoRA] Replace hardcoded cuda device with configurable argument by jeejeelee in https://github.com/vllm-project/vllm/pull/10223
* Splitting attention kernel file by maleksan85 in https://github.com/vllm-project/vllm/pull/10091
* [doc] explain the class hierarchy in vLLM by youkaichao in https://github.com/vllm-project/vllm/pull/10240
* [CI][CPU]refactor CPU tests to allow to bind with different cores by zhouyuan in https://github.com/vllm-project/vllm/pull/10222
* [BugFix] Do not raise a `ValueError` when `tool_choice` is set to the supported `none` option and `tools` are not defined. by gcalmettes in https://github.com/vllm-project/vllm/pull/10000
* [Misc]Fix Idefics3Model argument by jeejeelee in https://github.com/vllm-project/vllm/pull/10255
* [Bugfix] Fix QwenModel argument by DamonFool in https://github.com/vllm-project/vllm/pull/10262
* [Frontend] Add per-request number of cached token stats by zifeitong in https://github.com/vllm-project/vllm/pull/10174
* [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest by WoosukKwon in https://github.com/vllm-project/vllm/pull/10245
* [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers by sroy745 in https://github.com/vllm-project/vllm/pull/9982
* [LoRA] Adds support for bias in LoRA by followumesh in https://github.com/vllm-project/vllm/pull/5733
* [V1] Enable Inductor when using piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10268
* [doc] fix location of runllm widget by youkaichao in https://github.com/vllm-project/vllm/pull/10266
* [doc] improve debugging doc by youkaichao in https://github.com/vllm-project/vllm/pull/10270
* Revert "[ci][build] limit cmake version" by youkaichao in https://github.com/vllm-project/vllm/pull/10271
* [V1] Fix CI tests on V1 engine by WoosukKwon in https://github.com/vllm-project/vllm/pull/10272
* [core][distributed] use tcp store directly by youkaichao in https://github.com/vllm-project/vllm/pull/10275
* [V1] Support VLMs with fine-grained scheduling by WoosukKwon in https://github.com/vllm-project/vllm/pull/9871
* Bump to compressed-tensors v0.8.0 by dsikka in https://github.com/vllm-project/vllm/pull/10279
* [Doc] Fix typo in arg_utils.py by xyang16 in https://github.com/vllm-project/vllm/pull/10264
* [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions by imkero in https://github.com/vllm-project/vllm/pull/10221
* [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by FurtherAI in https://github.com/vllm-project/vllm/pull/9944
* [Core] Flashinfer - Remove advance step size restriction by pavanimajety in https://github.com/vllm-project/vllm/pull/10282
* [Model][LoRA]LoRA support added for idefics3 by B-201 in https://github.com/vllm-project/vllm/pull/10281
* [V1] Add missing tokenizer options for `Detokenizer` by ywang96 in https://github.com/vllm-project/vllm/pull/10288
* [1/N] Initial prototype for multi-modal processor by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10044
* [Bugfix] bitsandbytes models fail to run pipeline parallel by HoangCongDuc in https://github.com/vllm-project/vllm/pull/10200
* [Bugfix] Fix tensor parallel for qwen2 classification model by Isotr0py in https://github.com/vllm-project/vllm/pull/10297
* [misc] error early for old-style class by youkaichao in https://github.com/vllm-project/vllm/pull/10304
* [Misc] format.sh: Simplify tool_version_check by russellb in https://github.com/vllm-project/vllm/pull/10305
* [Frontend] Pythonic tool parser by mdepinet in https://github.com/vllm-project/vllm/pull/9859
* [BugFix]: properly deserialize `tool_calls` iterator before processing by mistral-common when MistralTokenizer is used by gcalmettes in https://github.com/vllm-project/vllm/pull/9951
* [Model] Add BNB quantization support for Idefics3 by B-201 in https://github.com/vllm-project/vllm/pull/10310
* [ci][distributed] disable hanging tests by youkaichao in https://github.com/vllm-project/vllm/pull/10317
* [CI/Build] Fix CPU CI online inference timeout by Isotr0py in https://github.com/vllm-project/vllm/pull/10314
* [CI/Build] Make shellcheck happy by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10285
* [Docs] Publish meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/10331
* Support Roberta embedding models by maxdebayser in https://github.com/vllm-project/vllm/pull/9387
* [Perf] Reduce peak memory usage of llama by andoorve in https://github.com/vllm-project/vllm/pull/10339
* [Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server by jxpxxzj in https://github.com/vllm-project/vllm/pull/9583
* [Tool parsing] Improve / correct mistral tool parsing by patrickvonplaten in https://github.com/vllm-project/vllm/pull/10333
* [Bugfix] Fix unable to load some models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10312
* [bugfix] Fix static asymmetric quantization case by ProExpertProg in https://github.com/vllm-project/vllm/pull/10334
* [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/10308
* [Model] Support Qwen2 embeddings and use tags to select model tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10184
* [Bugfix] Qwen-vl output is inconsistent in speculative decoding by skylee-01 in https://github.com/vllm-project/vllm/pull/10350
* [Misc] Consolidate pooler config overrides by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10351
* [Build] skip renaming files for release wheels pipeline by simon-mo in https://github.com/vllm-project/vllm/pull/9671
New Contributors
* gracehonv made their first contribution in https://github.com/vllm-project/vllm/pull/9349
* streaver91 made their first contribution in https://github.com/vllm-project/vllm/pull/9396
* wukaixingxp made their first contribution in https://github.com/vllm-project/vllm/pull/9013
* sssrijan-amazon made their first contribution in https://github.com/vllm-project/vllm/pull/9380
* coolkp made their first contribution in https://github.com/vllm-project/vllm/pull/9477
* yue-anyscale made their first contribution in https://github.com/vllm-project/vllm/pull/9478
* dhiaEddineRhaiem made their first contribution in https://github.com/vllm-project/vllm/pull/9325
* yudian0504 made their first contribution in https://github.com/vllm-project/vllm/pull/9549
* ngrozae made their first contribution in https://github.com/vllm-project/vllm/pull/9552
* Falko1 made their first contribution in https://github.com/vllm-project/vllm/pull/9503
* wangshuai09 made their first contribution in https://github.com/vllm-project/vllm/pull/9536
* gopalsarda made their first contribution in https://github.com/vllm-project/vllm/pull/9580
* guoyuhong made their first contribution in https://github.com/vllm-project/vllm/pull/9550
* JArnoldAMD made their first contribution in https://github.com/vllm-project/vllm/pull/9529
* yuleil made their first contribution in https://github.com/vllm-project/vllm/pull/8234
* sethkimmel3 made their first contribution in https://github.com/vllm-project/vllm/pull/7889
* MengqingCao made their first contribution in https://github.com/vllm-project/vllm/pull/9605
* CRZbulabula made their first contribution in https://github.com/vllm-project/vllm/pull/9614
* faychu made their first contribution in https://github.com/vllm-project/vllm/pull/9248
* vrdn-23 made their first contribution in https://github.com/vllm-project/vllm/pull/9358
* willmj made their first contribution in https://github.com/vllm-project/vllm/pull/9673
* samos123 made their first contribution in https://github.com/vllm-project/vllm/pull/9709
* MErkinSag made their first contribution in https://github.com/vllm-project/vllm/pull/9560
* Alvant made their first contribution in https://github.com/vllm-project/vllm/pull/9717
* kakao-kevin-us made their first contribution in https://github.com/vllm-project/vllm/pull/9704
* madt2709 made their first contribution in https://github.com/vllm-project/vllm/pull/9533
* FerdinandZhong made their first contribution in https://github.com/vllm-project/vllm/pull/9427
* svenseeberg made their first contribution in https://github.com/vllm-project/vllm/pull/9798
* yannicks1 made their first contribution in https://github.com/vllm-project/vllm/pull/9801
* wseaton made their first contribution in https://github.com/vllm-project/vllm/pull/8339
* Went-Liang made their first contribution in https://github.com/vllm-project/vllm/pull/9697
* andrejonasson made their first contribution in https://github.com/vllm-project/vllm/pull/9696
* GeneDer made their first contribution in https://github.com/vllm-project/vllm/pull/9934
* mikegre-google made their first contribution in https://github.com/vllm-project/vllm/pull/9926
* nokados made their first contribution in https://github.com/vllm-project/vllm/pull/9956
* cooleel made their first contribution in https://github.com/vllm-project/vllm/pull/9747
* zhengy001 made their first contribution in https://github.com/vllm-project/vllm/pull/9447
* daitran2k1 made their first contribution in https://github.com/vllm-project/vllm/pull/9984
* chaunceyjiang made their first contribution in https://github.com/vllm-project/vllm/pull/9915
* hissu-hyvarinen made their first contribution in https://github.com/vllm-project/vllm/pull/9279
* lk-chen made their first contribution in https://github.com/vllm-project/vllm/pull/9779
* yangalan123 made their first contribution in https://github.com/vllm-project/vllm/pull/10027
* llsj14 made their first contribution in https://github.com/vllm-project/vllm/pull/9730
* arakowsk-amd made their first contribution in https://github.com/vllm-project/vllm/pull/10063
* kzawora-intel made their first contribution in https://github.com/vllm-project/vllm/pull/6143
* DIYer22 made their first contribution in https://github.com/vllm-project/vllm/pull/10076
* li-plus made their first contribution in https://github.com/vllm-project/vllm/pull/10112
* spliii made their first contribution in https://github.com/vllm-project/vllm/pull/10105
* flozi00 made their first contribution in https://github.com/vllm-project/vllm/pull/10134
* xuechendi made their first contribution in https://github.com/vllm-project/vllm/pull/10167
* caijizhuo made their first contribution in https://github.com/vllm-project/vllm/pull/2756
* cjackal made their first contribution in https://github.com/vllm-project/vllm/pull/9594
* KrishnaM251 made their first contribution in https://github.com/vllm-project/vllm/pull/5836
* FuryMartin made their first contribution in https://github.com/vllm-project/vllm/pull/10190
* ShawnD200 made their first contribution in https://github.com/vllm-project/vllm/pull/9543
* yansh97 made their first contribution in https://github.com/vllm-project/vllm/pull/10196
* yyccli made their first contribution in https://github.com/vllm-project/vllm/pull/10212
* shcheglovnd made their first contribution in https://github.com/vllm-project/vllm/pull/10233
* maleksan85 made their first contribution in https://github.com/vllm-project/vllm/pull/10091
* followumesh made their first contribution in https://github.com/vllm-project/vllm/pull/5733
* imkero made their first contribution in https://github.com/vllm-project/vllm/pull/10221
* B-201 made their first contribution in https://github.com/vllm-project/vllm/pull/10281
* HoangCongDuc made their first contribution in https://github.com/vllm-project/vllm/pull/10200
* mdepinet made their first contribution in https://github.com/vllm-project/vllm/pull/9859
* jxpxxzj made their first contribution in https://github.com/vllm-project/vllm/pull/9583
* skylee-01 made their first contribution in https://github.com/vllm-project/vllm/pull/10350
**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.3...v0.6.4