vLLM

Latest version: v0.6.4.post1

0.6.4.post1

This patch release covers bug fixes (10347, 10349, 10348, 10352, 10363) and keeps compatibility for `vLLMConfig` usage in out-of-tree models (10356).

What's Changed
* Add default value to avoid Falcon crash (5363) by wchen61 in https://github.com/vllm-project/vllm/pull/10347
* [Misc] Fix import error in tensorizer tests and cleanup some code by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10349
* [Doc] Remove float32 choice from --lora-dtype by xyang16 in https://github.com/vllm-project/vllm/pull/10348
* [Bugfix] Fix fully sharded LoRA bug by jeejeelee in https://github.com/vllm-project/vllm/pull/10352
* [Misc] Fix some help info of arg_utils to improve readability by ShangmingCai in https://github.com/vllm-project/vllm/pull/10362
* [core][misc] keep compatibility for old-style classes by youkaichao in https://github.com/vllm-project/vllm/pull/10356
* [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by gcalmettes in https://github.com/vllm-project/vllm/pull/10363
* [Misc] Bump up test_fused_moe tolerance by ElizaWszola in https://github.com/vllm-project/vllm/pull/10364
* [Misc] bump mistral common version by simon-mo in https://github.com/vllm-project/vllm/pull/10367

New Contributors
* wchen61 made their first contribution in https://github.com/vllm-project/vllm/pull/10347

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.4...v0.6.4.post1

0.6.4

Highlights
* Significant progress in the V1 engine core refactor (9826, 10135, 10288, 10211, 10225, 10228, 10268, 9954, 10272, 9971, 10224, 10166, 9289, 10058, 9888, 9972, 10059, 9945, 9679, 9871, 10227, 10245, 9629, 10097, 10203, 10148). You can check out more details on the design and the plan ahead in our recent [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_2_130)
* Significant progress in `torch.compile` support. Many models now support `torch.compile` with TorchInductor. You can check out our [meetup slides](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit#slide=id.g31455c8bc1e_0_443) for more details. (9775, 9614, 9639, 9641, 9876, 9946, 9589, 9896, 9637, 9300, 9947, 9138, 9715, 9866, 9632, 9858, 9889)

Model Support
* New LLMs and VLMs: Idefics3 (9767), H2OVL-Mississippi (9747), Qwen2-Audio (9248), FalconMamba (9325), E5-V (9576), math-shepherd-mistral-7b-prm (9697), Pixtral models in the HF Transformers format (9036), Florence-2 (9555)
* New support for encoder-only embedding models: `BERTModel` (9056) and `Roberta` (9387)
* Expanded task support: LlamaEmbeddingModel (9806), Qwen2ForSequenceClassification (9704), Qwen2 embeddings (10184)
* Add user-configurable `--task` parameter for models that support both generation and embedding (9424); see the usage sketch after this list
* Tool calling parsers for Granite 3.0 (9027), Jamba (9154), granite-20b-functioncalling (8339)
* LoRA support for granite 3.0 MoE (9673), idefics3 (10281), LlamaEmbeddingModel (10071), Qwen (9622), Qwen2VLForConditionalGeneration (10022)
* BNB quantization support for Idefics3 (10310), Mllama (9720), Qwen2 (9467, 9574), MiniCPMV (9891)
* Unified multi-modal processor for VLM (10040, 10044, 9933, 10237, 9938, 9958, 10007, 9978, 9983, 10205)
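
A minimal sketch of the new task selector (9424) used from the offline Python API. The model name and the `"embedding"` task value are illustrative assumptions based on that PR; the accepted values may differ in your installed version.

```python
# Hedged sketch: running a model that supports both generation and embedding
# as an embedding model via the task selector from PR 9424.
# The model name and the "embedding" value are illustrative assumptions.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")

# encode() returns embedding outputs instead of generated text
outputs = llm.encode(["vLLM is a fast inference engine."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```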

Hardware Support
* Gaudi: Add Intel Gaudi (HPU) inference backend (6143)
* CPU: Add embedding models support for CPU backend (10193)
* TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (9438)
* Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (9857)

Performance
* Combine chunked prefill with speculative decoding (9291); see the sketch after this list
* `fused_moe` Performance Improvement (9384)
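
A hedged sketch of combining chunked prefill with speculative decoding (9291) through the offline `LLM` API. The model names are placeholders, and the argument names (`enable_chunked_prefill`, `speculative_model`, `num_speculative_tokens`) are assumptions based on this release's engine arguments; verify them against your version.

```python
# Hedged sketch: enabling chunked prefill together with speculative decoding
# (PR 9291). Model names, the draft model, and the number of speculative
# tokens are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,                           # split long prefills into chunks
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # small draft model
    num_speculative_tokens=4,                              # draft tokens proposed per step
)

print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=16)))
```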

Engine Core
* Override HF `config.json` via CLI (5836); see the sketch after this list
* Add goodput metric support (9338)
* Move parallel sampling out of the vLLM core, paving the way for the V1 engine (9302)
* Add stateless process group for easier integration with RLHF and disaggregated prefill (10216, 10072)
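
A hedged sketch of the HF `config.json` override added in 5836, shown via the offline API. The `hf_overrides` argument name and the `--hf-overrides` CLI spelling are assumptions based on that PR, and the overridden field is purely illustrative.

```python
# Hedged sketch: overriding fields of a model's HF config.json at load time
# (PR 5836). The `hf_overrides` argument name is an assumption; on the CLI the
# equivalent would be something like:
#   vllm serve <model> --hf-overrides '{"rope_theta": 1000000}'
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # illustrative model
    hf_overrides={"rope_theta": 1_000_000},     # illustrative config.json override
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```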

Others
* Improvements to the pull request experience with DCO, mergify, stale bot, etc. (9436, 9512, 9513, 9259, 10082, 10285, 9803)
* Dropped support for Python 3.8 (10038, 8464)
* Basic Integration Test For TPU (9968)
* Documented the class hierarchy in vLLM (10240) and explained the integration with Hugging Face (10173)
* Benchmark throughput now supports image input (9851)



What's Changed
* [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by WoosukKwon in https://github.com/vllm-project/vllm/pull/9350
* [Frontend] merge beam search implementations by LunrEclipse in https://github.com/vllm-project/vllm/pull/9296
* [Model] Make llama3.2 support multiple and interleaved images by xiangxu-google in https://github.com/vllm-project/vllm/pull/9095
* [Bugfix] Clean up some cruft in mamba.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9343
* [Frontend] Clarify model_type error messages by stevegrubb in https://github.com/vllm-project/vllm/pull/9345
* [Doc] Fix code formatting in spec_decode.rst by mgoin in https://github.com/vllm-project/vllm/pull/9348
* [Bugfix] Update InternVL input mapper to support image embeds by hhzhang16 in https://github.com/vllm-project/vllm/pull/9351
* [BugFix] Fix chat API continuous usage stats by njhill in https://github.com/vllm-project/vllm/pull/9357
* pass ignore_eos parameter to all benchmark_serving calls by gracehonv in https://github.com/vllm-project/vllm/pull/9349
* [Misc] Directly use compressed-tensors for checkpoint definitions by mgoin in https://github.com/vllm-project/vllm/pull/8909
* [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by CatherineSue in https://github.com/vllm-project/vllm/pull/9034
* [Bugfix][CI/Build] Fix CUDA 11.8 Build by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9386
* [Bugfix] Molmo text-only input bug fix by mrsalehi in https://github.com/vllm-project/vllm/pull/9397
* [Misc] Standardize RoPE handling for Qwen2-VL by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9250
* [Model] VLM2Vec, the first multimodal embedding model in vLLM by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9303
* [CI/Build] Test VLM embeddings by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9406
* [Core] Rename input data types by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8688
* [Misc] Consolidate example usage of OpenAI client for multimodal models by ywang96 in https://github.com/vllm-project/vllm/pull/9412
* [Model] Support SDPA attention for Molmo vision backbone by Isotr0py in https://github.com/vllm-project/vllm/pull/9410
* Support mistral interleaved attn by patrickvonplaten in https://github.com/vllm-project/vllm/pull/9414
* [Kernel][Model] Improve continuous batching for Jamba and Mamba by mzusman in https://github.com/vllm-project/vllm/pull/9189
* [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by streaver91 in https://github.com/vllm-project/vllm/pull/9396
* [Performance][Spec Decode] Optimize ngram lookup performance by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9333
* [CI/Build] mypy: Resolve some errors from checking vllm/engine by russellb in https://github.com/vllm-project/vllm/pull/9267
* [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9425
* [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by rasmith in https://github.com/vllm-project/vllm/pull/9391
* Add notes on the use of Slack by terrytangyuan in https://github.com/vllm-project/vllm/pull/9442
* [Kernel] Add Exllama as a backend for compressed-tensors by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9395
* [Misc] Print stack trace using `logger.exception` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9461
* [misc] CUDA Time Layerwise Profiler by LucasWilkinson in https://github.com/vllm-project/vllm/pull/8337
* [Bugfix] Allow prefill of assistant response when using `mistral_common` by sasha0552 in https://github.com/vllm-project/vllm/pull/9446
* [TPU] Call torch._sync(param) during weight loading by WoosukKwon in https://github.com/vllm-project/vllm/pull/9437
* [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/9344
* [Core] Deprecating block manager v1 and make block manager v2 default by KuntaiDu in https://github.com/vllm-project/vllm/pull/8704
* [CI/Build] remove .github from .dockerignore, add dirty repo check by dtrifiro in https://github.com/vllm-project/vllm/pull/9375
* [Misc] Remove commit id file by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9470
* [torch.compile] Fine-grained CustomOp enabling mechanism by ProExpertProg in https://github.com/vllm-project/vllm/pull/9300
* [Bugfix] Fix support for dimension like integers and ScalarType by bnellnm in https://github.com/vllm-project/vllm/pull/9299
* [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by wukaixingxp in https://github.com/vllm-project/vllm/pull/9013
* [Bugfix] Print warnings related to `mistral_common` tokenizer only once by sasha0552 in https://github.com/vllm-project/vllm/pull/9468
* [Hardwware][Neuron] Simplify model load for transformers-neuronx library by sssrijan-amazon in https://github.com/vllm-project/vllm/pull/9380
* Support `BERTModel` (first `encoder-only` embedding model) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9056
* [BugFix] Stop silent failures on compressed-tensors parsing by dsikka in https://github.com/vllm-project/vllm/pull/9381
* [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by joerunde in https://github.com/vllm-project/vllm/pull/9352
* [Qwen2.5] Support bnb quant for Qwen2.5 by blueyo0 in https://github.com/vllm-project/vllm/pull/9467
* [CI/Build] Use commit hash references for github actions by russellb in https://github.com/vllm-project/vllm/pull/9430
* [BugFix] Typing fixes to RequestOutput.prompt and beam search by njhill in https://github.com/vllm-project/vllm/pull/9473
* [Frontend][Feature] Add jamba tool parser by tomeras91 in https://github.com/vllm-project/vllm/pull/9154
* [BugFix] Fix and simplify completion API usage streaming by njhill in https://github.com/vllm-project/vllm/pull/9475
* [CI/Build] Fix lint errors in mistral tokenizer by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9504
* [Bugfix] Fix offline_inference_with_prefix.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9505
* [Misc] benchmark: Add option to set max concurrency by russellb in https://github.com/vllm-project/vllm/pull/9390
* [Model] Add user-configurable task for models that support both generation and embedding by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9424
* [CI/Build] Add error matching config for mypy by russellb in https://github.com/vllm-project/vllm/pull/9512
* [Model] Support Pixtral models in the HF Transformers format by mgoin in https://github.com/vllm-project/vllm/pull/9036
* [MISC] Add lora requests to metrics by coolkp in https://github.com/vllm-project/vllm/pull/9477
* [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py by comaniac in https://github.com/vllm-project/vllm/pull/9510
* [Kernel] Add env variable to force flashinfer backend to enable tensor cores by tdoublep in https://github.com/vllm-project/vllm/pull/9497
* [Bugfix] Fix offline mode when using `mistral_common` by sasha0552 in https://github.com/vllm-project/vllm/pull/9457
* :bug: fix torch memory profiling by joerunde in https://github.com/vllm-project/vllm/pull/9516
* [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily by njhill in https://github.com/vllm-project/vllm/pull/9521
* [Doc] update gpu-memory-utilization flag docs by joerunde in https://github.com/vllm-project/vllm/pull/9507
* [CI/Build] Add error matching for ruff output by russellb in https://github.com/vllm-project/vllm/pull/9513
* [CI/Build] Configure matcher for actionlint workflow by russellb in https://github.com/vllm-project/vllm/pull/9511
* [Frontend] Support simpler image input format by yue-anyscale in https://github.com/vllm-project/vllm/pull/9478
* [Bugfix] Fix missing task for speculative decoding by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9524
* [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf by mgoin in https://github.com/vllm-project/vllm/pull/9514
* [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger by heheda12345 in https://github.com/vllm-project/vllm/pull/9530
* [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision by mgoin in https://github.com/vllm-project/vllm/pull/9520
* [Kernel] Support sliding window in flash attention backend by heheda12345 in https://github.com/vllm-project/vllm/pull/9403
* [Frontend][Misc] Goodput metric support by Imss27 in https://github.com/vllm-project/vllm/pull/9338
* [CI/Build] Split up decoder-only LM tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9488
* [Doc] Consistent naming of attention backends by tdoublep in https://github.com/vllm-project/vllm/pull/9498
* [Model] FalconMamba Support by dhiaEddineRhaiem in https://github.com/vllm-project/vllm/pull/9325
* [Bugfix][Misc]: fix graph capture for decoder by yudian0504 in https://github.com/vllm-project/vllm/pull/9549
* [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint by varad-ahirwadkar in https://github.com/vllm-project/vllm/pull/9492
* [Model][Bugfix] Fix batching with multi-image in PixtralHF by mgoin in https://github.com/vllm-project/vllm/pull/9518
* [Frontend] Reduce frequency of client cancellation checking by njhill in https://github.com/vllm-project/vllm/pull/7959
* [doc] fix format by youkaichao in https://github.com/vllm-project/vllm/pull/9562
* [BugFix] Update draft model TP size check to allow matching target TP size by njhill in https://github.com/vllm-project/vllm/pull/9394
* [Frontend] Don't log duplicate error stacktrace for every request in the batch by wallashss in https://github.com/vllm-project/vllm/pull/9023
* [CI] Make format checker error message more user-friendly by using emoji by KuntaiDu in https://github.com/vllm-project/vllm/pull/9564
* :bug: Fixup more test failures from memory profiling by joerunde in https://github.com/vllm-project/vllm/pull/9563
* [core] move parallel sampling out from vllm core by youkaichao in https://github.com/vllm-project/vllm/pull/9302
* [Bugfix]: serialize config instances by value when using --trust-remote-code by tjohnson31415 in https://github.com/vllm-project/vllm/pull/6751
* [CI/Build] Remove unnecessary `fork_new_process` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9484
* [Bugfix][OpenVINO] fix_dockerfile_openvino by ngrozae in https://github.com/vllm-project/vllm/pull/9552
* [Bugfix]: phi.py get rope_theta from config file by Falko1 in https://github.com/vllm-project/vllm/pull/9503
* [CI/Build] Replaced some models on tests for smaller ones by wallashss in https://github.com/vllm-project/vllm/pull/9570
* [Core] Remove evictor_v1 by KuntaiDu in https://github.com/vllm-project/vllm/pull/9572
* [Doc] Use shell code-blocks and fix section headers by rafvasq in https://github.com/vllm-project/vllm/pull/9508
* support TP in qwen2 bnb by chenqianfzh in https://github.com/vllm-project/vllm/pull/9574
* [Hardware][CPU] using current_platform.is_cpu by wangshuai09 in https://github.com/vllm-project/vllm/pull/9536
* [V1] Implement vLLM V1 [1/N] by WoosukKwon in https://github.com/vllm-project/vllm/pull/9289
* [CI/Build][LoRA] Temporarily fix long context failure issue by jeejeelee in https://github.com/vllm-project/vllm/pull/9579
* [Neuron] [Bugfix] Fix neuron startup by xendo in https://github.com/vllm-project/vllm/pull/9374
* [Model][VLM] Initialize support for Mono-InternVL model by Isotr0py in https://github.com/vllm-project/vllm/pull/9528
* [Bugfix] Eagle: change config name for fc bias by gopalsarda in https://github.com/vllm-project/vllm/pull/9580
* [Hardware][Intel CPU][DOC] Update docs for CPU backend by zhouyuan in https://github.com/vllm-project/vllm/pull/6212
* [Frontend] Support custom request_id from request by guoyuhong in https://github.com/vllm-project/vllm/pull/9550
* [BugFix] Prevent exporting duplicate OpenTelemetry spans by ronensc in https://github.com/vllm-project/vllm/pull/9017
* [torch.compile] auto infer dynamic_arg_dims from type annotation by youkaichao in https://github.com/vllm-project/vllm/pull/9589
* [Bugfix] fix detokenizer shallow copy by aurickq in https://github.com/vllm-project/vllm/pull/5919
* [Misc] Make benchmarks use EngineArgs by JArnoldAMD in https://github.com/vllm-project/vllm/pull/9529
* [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9487
* [BugFix] Fix metrics error for --num-scheduler-steps > 1 by yuleil in https://github.com/vllm-project/vllm/pull/8234
* [Doc]: Update tensorizer docs to include vllm[tensorizer] by sethkimmel3 in https://github.com/vllm-project/vllm/pull/7889
* [Bugfix] Generate exactly input_len tokens in benchmark_throughput by heheda12345 in https://github.com/vllm-project/vllm/pull/9592
* [Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages by sfc-gh-zhwang in https://github.com/vllm-project/vllm/pull/9590
* [Model] Support E5-V by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9576
* [Build] Fix `FetchContent` multiple build issue by ProExpertProg in https://github.com/vllm-project/vllm/pull/9596
* [Hardware][XPU] using current_platform.is_xpu by MengqingCao in https://github.com/vllm-project/vllm/pull/9605
* [Model] Initialize Florence-2 language backbone support by Isotr0py in https://github.com/vllm-project/vllm/pull/9555
* [VLM] Post-layernorm override and quant config in vision encoder by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9217
* [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9612
* [Bugfix] Fix `_init_vision_model` in NVLM_D model by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9611
* [misc] comment to avoid future confusion about baichuan by youkaichao in https://github.com/vllm-project/vllm/pull/9620
* [Bugfix] Fix divide by zero when serving Mamba models by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9617
* [Misc] Separate total and output tokens in benchmark_throughput.py by mgoin in https://github.com/vllm-project/vllm/pull/8914
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9614
* [Frontend] Enable Online Multi-image Support for MLlama by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9393
* [Model] Add Qwen2-Audio model support by faychu in https://github.com/vllm-project/vllm/pull/9248
* [CI/Build] Add bot to close stale issues and PRs by russellb in https://github.com/vllm-project/vllm/pull/9436
* [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image by mgoin in https://github.com/vllm-project/vllm/pull/9626
* [Bugfix] Use "vision_model" prefix for MllamaVisionModel by mgoin in https://github.com/vllm-project/vllm/pull/9628
* [Bugfix]: Make chat content text allow type content by vrdn-23 in https://github.com/vllm-project/vllm/pull/9358
* [XPU] avoid triton import for xpu by yma11 in https://github.com/vllm-project/vllm/pull/9440
* [Bugfix] Fix PP for ChatGLM and Molmo, and weight loading for Qwen2.5-Math-RM by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9422
* [V1][Bugfix] Clean up requests when aborted by WoosukKwon in https://github.com/vllm-project/vllm/pull/9629
* [core] simplify seq group code by youkaichao in https://github.com/vllm-project/vllm/pull/9569
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9639
* [Kernel] add kernel for FATReLU by jeejeelee in https://github.com/vllm-project/vllm/pull/9610
* [torch.compile] expanding support and fix allgather compilation by CRZbulabula in https://github.com/vllm-project/vllm/pull/9637
* [Doc] Move additional tips/notes to the top by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9647
* [Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models by litianjian in https://github.com/vllm-project/vllm/pull/9653
* Increase operation per run limit for "Close inactive issues and PRs" workflow by hmellor in https://github.com/vllm-project/vllm/pull/9661
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9641
* [CI/Build] Fix VLM test failures when using transformers v4.46 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9666
* [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9650
* [Log][Bugfix] Fix default value check for `image_url.detail` by mgoin in https://github.com/vllm-project/vllm/pull/9663
* [Performance][Kernel] Fused_moe Performance Improvement by charlifu in https://github.com/vllm-project/vllm/pull/9384
* [Bugfix] Remove xformers requirement for Pixtral by mgoin in https://github.com/vllm-project/vllm/pull/9597
* [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test 9675 by khluu in https://github.com/vllm-project/vllm/pull/9676
* [Model] add a lora module for granite 3.0 MoE models by willmj in https://github.com/vllm-project/vllm/pull/9673
* [V1] Support sliding window attention by WoosukKwon in https://github.com/vllm-project/vllm/pull/9679
* [Bugfix] Fix compressed_tensors_moe bad config.strategy by mgoin in https://github.com/vllm-project/vllm/pull/9677
* [Doc] Improve quickstart documentation by rafvasq in https://github.com/vllm-project/vllm/pull/9256
* [Bugfix] Fix crash with llama 3.2 vision models and guided decoding by tjohnson31415 in https://github.com/vllm-project/vllm/pull/9631
* [Bugfix] Steaming continuous_usage_stats default to False by samos123 in https://github.com/vllm-project/vllm/pull/9709
* [Hardware][openvino] is_openvino --> current_platform.is_openvino by MengqingCao in https://github.com/vllm-project/vllm/pull/9716
* Fix: MI100 Support By Bypassing Custom Paged Attention by MErkinSag in https://github.com/vllm-project/vllm/pull/9560
* [Frontend] Bad words sampling parameter by Alvant in https://github.com/vllm-project/vllm/pull/9717
* [Model] Add classification Task with Qwen2ForSequenceClassification by kakao-kevin-us in https://github.com/vllm-project/vllm/pull/9704
* [Misc] SpecDecodeWorker supports profiling by Abatom in https://github.com/vllm-project/vllm/pull/9719
* [core] cudagraph output with tensor weak reference by youkaichao in https://github.com/vllm-project/vllm/pull/9724
* [Misc] Upgrade to pytorch 2.5 by bnellnm in https://github.com/vllm-project/vllm/pull/9588
* Fix cache management in "Close inactive issues and PRs" actions workflow by hmellor in https://github.com/vllm-project/vllm/pull/9734
* [Bugfix] Fix load config when using bools by madt2709 in https://github.com/vllm-project/vllm/pull/9533
* [Hardware][ROCM] using current_platform.is_rocm by wangshuai09 in https://github.com/vllm-project/vllm/pull/9642
* [torch.compile] support moe models by youkaichao in https://github.com/vllm-project/vllm/pull/9632
* Fix beam search eos by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9627
* [Bugfix] Fix ray instance detect issue by yma11 in https://github.com/vllm-project/vllm/pull/9439
* [CI/Build] Adopt Mergify for auto-labeling PRs by russellb in https://github.com/vllm-project/vllm/pull/9259
* [Model][VLM] Add multi-video support for LLaVA-Onevision by litianjian in https://github.com/vllm-project/vllm/pull/8905
* [torch.compile] Adding "torch compile" annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9758
* [misc] avoid circular import by youkaichao in https://github.com/vllm-project/vllm/pull/9765
* [torch.compile] add deepseek v2 compile by youkaichao in https://github.com/vllm-project/vllm/pull/9775
* [Doc] fix third-party model example by russellb in https://github.com/vllm-project/vllm/pull/9771
* [Model][LoRA]LoRA support added for Qwen by jeejeelee in https://github.com/vllm-project/vllm/pull/9622
* [Doc] Specify async engine args in docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9726
* [Bugfix] Use temporary directory in registry by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9721
* [Frontend] re-enable multi-modality input in the new beam search implementation by FerdinandZhong in https://github.com/vllm-project/vllm/pull/9427
* [Model] Add BNB quantization support for Mllama by Isotr0py in https://github.com/vllm-project/vllm/pull/9720
* [Hardware] using current_platform.seed_everything by wangshuai09 in https://github.com/vllm-project/vllm/pull/9785
* [Misc] Add metrics for request queue time, forward time, and execute time by Abatom in https://github.com/vllm-project/vllm/pull/9659
* Fix the log to correct guide user to install modelscope by tastelikefeet in https://github.com/vllm-project/vllm/pull/9793
* [Bugfix] Use host argument to bind to interface by svenseeberg in https://github.com/vllm-project/vllm/pull/9798
* [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) by yannicks1 in https://github.com/vllm-project/vllm/pull/9801
* [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel by jsato8094 in https://github.com/vllm-project/vllm/pull/9806
* [CI][Bugfix] Skip chameleon for transformers 4.46.1 by mgoin in https://github.com/vllm-project/vllm/pull/9808
* [CI/Build] mergify: fix rules for ci/build label by russellb in https://github.com/vllm-project/vllm/pull/9804
* [MISC] Set label value to timestamp over 0, to keep track of recent history by coolkp in https://github.com/vllm-project/vllm/pull/9777
* [Bugfix][Frontend] Guard against bad token ids by joerunde in https://github.com/vllm-project/vllm/pull/9634
* [Model] tool calling support for ibm-granite/granite-20b-functioncalling by wseaton in https://github.com/vllm-project/vllm/pull/8339
* [Docs] Add notes about Snowflake Meetup by simon-mo in https://github.com/vllm-project/vllm/pull/9814
* [Bugfix] Fix prefix strings for quantized VLMs by mgoin in https://github.com/vllm-project/vllm/pull/9772
* [core][distributed] fix custom allreduce in pytorch 2.5 by youkaichao in https://github.com/vllm-project/vllm/pull/9815
* Update README.md by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9819
* [Bugfix][VLM] Make apply_fp8_linear work with >2D input by mgoin in https://github.com/vllm-project/vllm/pull/9812
* [ci/build] Pin CI dependencies version with pip-compile by khluu in https://github.com/vllm-project/vllm/pull/9810
* [Bugfix] Fix multi nodes TP+PP for XPU by yma11 in https://github.com/vllm-project/vllm/pull/8884
* [Doc] Add the DCO to CONTRIBUTING.md by russellb in https://github.com/vllm-project/vllm/pull/9803
* [torch.compile] rework compile control with piecewise cudagraph by youkaichao in https://github.com/vllm-project/vllm/pull/9715
* [Misc] Specify minimum pynvml version by jeejeelee in https://github.com/vllm-project/vllm/pull/9827
* [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA by WoosukKwon in https://github.com/vllm-project/vllm/pull/9438
* [CI/Build] VLM Test Consolidation by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9372
* [Model] Support math-shepherd-mistral-7b-prm model by Went-Liang in https://github.com/vllm-project/vllm/pull/9697
* [Misc] Add chunked-prefill support on FlashInfer. by elfiegg in https://github.com/vllm-project/vllm/pull/9781
* [Bugfix][core] replace heartbeat with pid check by joerunde in https://github.com/vllm-project/vllm/pull/9818
* [Doc] link bug for multistep guided decoding by joerunde in https://github.com/vllm-project/vllm/pull/9843
* [Neuron] Update Dockerfile.neuron to fix build failure by hbikki in https://github.com/vllm-project/vllm/pull/9822
* [doc] update pp support by youkaichao in https://github.com/vllm-project/vllm/pull/9853
* [CI/Build] Simplify exception trace in api server tests by CRZbulabula in https://github.com/vllm-project/vllm/pull/9787
* [torch.compile] upgrade tests by youkaichao in https://github.com/vllm-project/vllm/pull/9858
* [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint by gcalmettes in https://github.com/vllm-project/vllm/pull/9837
* Revert "[Bugfix] Use host argument to bind to interface (9798)" by khluu in https://github.com/vllm-project/vllm/pull/9852
* [Model] Support quantization of Qwen2VisionTransformer for Qwen2-VL by mgoin in https://github.com/vllm-project/vllm/pull/9817
* [Misc] Remove deprecated arg for cuda graph capture by ywang96 in https://github.com/vllm-project/vllm/pull/9864
* [Doc] Update Qwen documentation by jeejeelee in https://github.com/vllm-project/vllm/pull/9869
* [CI/Build] Add Model Tests for Qwen2-VL by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9846
* [CI/Build] Adding a forced docker system prune to clean up space by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/9849
* [Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together by sasha0552 in https://github.com/vllm-project/vllm/pull/9532
* [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 by mzusman in https://github.com/vllm-project/vllm/pull/9838
* [ci/build] Configure dependabot to update pip dependencies by khluu in https://github.com/vllm-project/vllm/pull/9811
* [Bugfix][Frontend] Reject guided decoding in multistep mode by joerunde in https://github.com/vllm-project/vllm/pull/9892
* [torch.compile] directly register custom op by youkaichao in https://github.com/vllm-project/vllm/pull/9896
* [Bugfix] Fix layer skip logic with bitsandbytes by mgoin in https://github.com/vllm-project/vllm/pull/9887
* [torch.compile] rework test plans by youkaichao in https://github.com/vllm-project/vllm/pull/9866
* [Model] Support bitsandbytes for MiniCPMV by mgoin in https://github.com/vllm-project/vllm/pull/9891
* [torch.compile] Adding torch compile annotations to some models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9876
* [Doc] Update multi-input support by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9906
* [Frontend] Chat-based Embeddings API by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9759
* [CI/Build] Add Model Tests for PixtralHF by mgoin in https://github.com/vllm-project/vllm/pull/9813
* [Frontend] Use a proper chat template for VLM2Vec by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9912
* [Bugfix] Fix edge cases for MistralTokenizer by tjohnson31415 in https://github.com/vllm-project/vllm/pull/9625
* [Core] Refactor: Clean up unused argument preemption_mode in Scheduler._preempt by andrejonasson in https://github.com/vllm-project/vllm/pull/9696
* [torch.compile] use interpreter with stable api from pytorch by youkaichao in https://github.com/vllm-project/vllm/pull/9889
* [Bugfix/Core] Remove assertion for Flashinfer k_scale and v_scale by pavanimajety in https://github.com/vllm-project/vllm/pull/9861
* [1/N] pass the complete config from engine to executor by youkaichao in https://github.com/vllm-project/vllm/pull/9933
* [Bugfix] PicklingError on RayTaskError by GeneDer in https://github.com/vllm-project/vllm/pull/9934
* Bump the patch-update group with 10 updates by dependabot in https://github.com/vllm-project/vllm/pull/9897
* [Core][VLM] Add precise multi-modal placeholder tracking by petersalas in https://github.com/vllm-project/vllm/pull/8346
* [ci/build] Have dependabot ignore pinned dependencies by khluu in https://github.com/vllm-project/vllm/pull/9935
* [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models by sroy745 in https://github.com/vllm-project/vllm/pull/9559
* [torch.compile] fix cpu broken code by youkaichao in https://github.com/vllm-project/vllm/pull/9947
* [Docs] Update Granite 3.0 models in supported models table by njhill in https://github.com/vllm-project/vllm/pull/9930
* [Doc] Updated tpu-installation.rst with more details by mikegre-google in https://github.com/vllm-project/vllm/pull/9926
* [2/N] executor pass the complete config to worker/modelrunner by youkaichao in https://github.com/vllm-project/vllm/pull/9938
* [V1] Fix `EngineArgs` refactor on V1 by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9954
* [bugfix] fix chatglm dummy_data_for_glmv by youkaichao in https://github.com/vllm-project/vllm/pull/9955
* [3/N] model runner pass the whole config to model by youkaichao in https://github.com/vllm-project/vllm/pull/9958
* [CI/Build] Quoting around > by nokados in https://github.com/vllm-project/vllm/pull/9956
* [torch.compile] Adding torch compile annotations to vision-language models by CRZbulabula in https://github.com/vllm-project/vllm/pull/9946
* [bugfix] fix tsts by youkaichao in https://github.com/vllm-project/vllm/pull/9959
* [V1] Support per-request seed by njhill in https://github.com/vllm-project/vllm/pull/9945
* [Model] Add support for H2OVL-Mississippi models by cooleel in https://github.com/vllm-project/vllm/pull/9747
* [V1] Fix Configs by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9971
* [Bugfix] Fix MiniCPMV and Mllama BNB bug by jeejeelee in https://github.com/vllm-project/vllm/pull/9917
* [Bugfix]Using the correct type hints by gshtras in https://github.com/vllm-project/vllm/pull/9885
* [Misc] Compute query_start_loc/seq_start_loc on CPU by zhengy001 in https://github.com/vllm-project/vllm/pull/9447
* [Bugfix] Fix E2EL mean and median stats by daitran2k1 in https://github.com/vllm-project/vllm/pull/9984
* [Bugfix][OpenVINO] Fix circular reference 9939 by MengqingCao in https://github.com/vllm-project/vllm/pull/9974
* [Frontend] Multi-Modality Support for Loading Local Image Files by chaunceyjiang in https://github.com/vllm-project/vllm/pull/9915
* [4/N] make quant config first-class citizen by youkaichao in https://github.com/vllm-project/vllm/pull/9978
* [Misc]Reduce BNB static variable by jeejeelee in https://github.com/vllm-project/vllm/pull/9987
* [Model] factoring out MambaMixer out of Jamba by mzusman in https://github.com/vllm-project/vllm/pull/8993
* [CI] Basic Integration Test For TPU by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9968
* [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs by hissu-hyvarinen in https://github.com/vllm-project/vllm/pull/9279
* [Doc] Update VLM doc about loading from local files by ywang96 in https://github.com/vllm-project/vllm/pull/9999
* [Bugfix] Fix `MQLLMEngine` hanging by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9973
* [Misc] Refactor benchmark_throughput.py by lk-chen in https://github.com/vllm-project/vllm/pull/9779
* [Frontend] Add max_tokens prometheus metric by tomeras91 in https://github.com/vllm-project/vllm/pull/9881
* [Bugfix] Upgrade to pytorch 2.5.1 by bnellnm in https://github.com/vllm-project/vllm/pull/10001
* [4.5/N] bugfix for quant config in speculative decode by youkaichao in https://github.com/vllm-project/vllm/pull/10007
* [Bugfix] Respect modules_to_not_convert within awq_marlin by mgoin in https://github.com/vllm-project/vllm/pull/9895
* [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9994
* [Core] Make encoder-decoder inputs a nested structure to be more composable by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9604
* [Bugfix] Fixup Mamba by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/10004
* [BugFix] Lazy import ray by GeneDer in https://github.com/vllm-project/vllm/pull/10021
* [Misc] vllm CLI flags should be ordered for better user readability by chaunceyjiang in https://github.com/vllm-project/vllm/pull/10017
* [Frontend] Fix tcp port reservation for api server by russellb in https://github.com/vllm-project/vllm/pull/10012
* Refactor TPU requirements file and pin build dependencies by richardsliu in https://github.com/vllm-project/vllm/pull/10010
* [Misc] Add logging for CUDA memory by yangalan123 in https://github.com/vllm-project/vllm/pull/10027
* [CI/Build] Limit github CI jobs based on files changed by russellb in https://github.com/vllm-project/vllm/pull/9928
* [Model] Support quantization of PixtralHFTransformer for PixtralHF by mgoin in https://github.com/vllm-project/vllm/pull/9921
* [Feature] Update benchmark_throughput.py to support image input by lk-chen in https://github.com/vllm-project/vllm/pull/9851
* [Misc] Modify BNB parameter name by jeejeelee in https://github.com/vllm-project/vllm/pull/9997
* [CI] Prune tests/models/decoder_only/language/* tests by mgoin in https://github.com/vllm-project/vllm/pull/9940
* [CI] Prune back the number of tests in tests/kernels/* by mgoin in https://github.com/vllm-project/vllm/pull/9932
* [bugfix] fix weak ref in piecewise cudagraph and tractable test by youkaichao in https://github.com/vllm-project/vllm/pull/10048
* [Bugfix] Properly propagate trust_remote_code settings by zifeitong in https://github.com/vllm-project/vllm/pull/10047
* [Bugfix] Fix pickle of input when async output processing is on by wallashss in https://github.com/vllm-project/vllm/pull/9931
* [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode by llsj14 in https://github.com/vllm-project/vllm/pull/9730
* [v1] reduce graph capture time for piecewise cudagraph by youkaichao in https://github.com/vllm-project/vllm/pull/10059
* [Misc] Sort the list of embedding models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10037
* [Model][OpenVINO] Fix regressions from 8346 by petersalas in https://github.com/vllm-project/vllm/pull/10045
* [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer by tjohnson31415 in https://github.com/vllm-project/vllm/pull/10051
* [Bugfix] Gpt-j-6B patch kv_scale to k_scale path by arakowsk-amd in https://github.com/vllm-project/vllm/pull/10063
* [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type by zifeitong in https://github.com/vllm-project/vllm/pull/10054
* [V1] Integrate Piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10058
* [distributed] add function to create ipc buffers directly by youkaichao in https://github.com/vllm-project/vllm/pull/10064
* [CI/Build] drop support for Python 3.8 EOL by aarnphm in https://github.com/vllm-project/vllm/pull/8464
* [CI/Build] Fix large_gpu_mark reason by Isotr0py in https://github.com/vllm-project/vllm/pull/10070
* [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend by kzawora-intel in https://github.com/vllm-project/vllm/pull/6143
* [Hotfix] Fix ruff errors by WoosukKwon in https://github.com/vllm-project/vllm/pull/10073
* [Model][LoRA]LoRA support added for LlamaEmbeddingModel by jeejeelee in https://github.com/vllm-project/vllm/pull/10071
* [Model] Add Idefics3 support by jeejeelee in https://github.com/vllm-project/vllm/pull/9767
* [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration by ericperfect in https://github.com/vllm-project/vllm/pull/10022
* Remove ScaledActivation for AWQ by mgoin in https://github.com/vllm-project/vllm/pull/10057
* [CI/Build] Drop Python 3.8 support by russellb in https://github.com/vllm-project/vllm/pull/10038
* [CI/Build] change conflict PR comment from mergify by russellb in https://github.com/vllm-project/vllm/pull/10080
* [V1] Make v1 more testable by joerunde in https://github.com/vllm-project/vllm/pull/9888
* [CI/Build] Always run the ruff workflow by russellb in https://github.com/vllm-project/vllm/pull/10092
* [core][distributed] add stateless_init_process_group by youkaichao in https://github.com/vllm-project/vllm/pull/10072
* [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 by mgoin in https://github.com/vllm-project/vllm/pull/10095
* [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend by yma11 in https://github.com/vllm-project/vllm/pull/9823
* [Frontend] Adjust try/except blocks in API impl by njhill in https://github.com/vllm-project/vllm/pull/10056
* [Hardware][CPU] Update torch 2.5 by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/9911
* [doc] add back Python 3.8 ABI by youkaichao in https://github.com/vllm-project/vllm/pull/10100
* [V1][BugFix] Fix Generator construction in greedy + seed case by njhill in https://github.com/vllm-project/vllm/pull/10097
* [Misc] Consolidate ModelConfig code related to HF config by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10104
* [CI/Build] re-add codespell to CI by russellb in https://github.com/vllm-project/vllm/pull/10083
* [Doc] Improve benchmark documentation by rafvasq in https://github.com/vllm-project/vllm/pull/9927
* [Core][Distributed] Refactor ipc buffer init in CustomAllreduce by hanzhi713 in https://github.com/vllm-project/vllm/pull/10030
* [CI/Build] Improve mypy + python version matrix by russellb in https://github.com/vllm-project/vllm/pull/10041
* Adds method to read the pooling types from model's files by flaviabeo in https://github.com/vllm-project/vllm/pull/9506
* [Frontend] Fix multiple values for keyword argument error (10075) by DIYer22 in https://github.com/vllm-project/vllm/pull/10076
* [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/10108
* [Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2-VL by li-plus in https://github.com/vllm-project/vllm/pull/10112
* [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. by spliii in https://github.com/vllm-project/vllm/pull/10105
* [Frontend] Tool calling parser for Granite 3.0 models by maxdebayser in https://github.com/vllm-project/vllm/pull/9027
* [Feature] [Spec decode]: Combine chunked prefill with speculative decoding by NickLucche in https://github.com/vllm-project/vllm/pull/9291
* [CI/Build] Always run mypy by russellb in https://github.com/vllm-project/vllm/pull/10122
* [CI/Build] Add shell script linting using shellcheck by russellb in https://github.com/vllm-project/vllm/pull/7925
* [CI/Build] Automate PR body text cleanup by russellb in https://github.com/vllm-project/vllm/pull/10082
* Bump actions/setup-python from 5.2.0 to 5.3.0 by dependabot in https://github.com/vllm-project/vllm/pull/9745
* Online video support for VLMs by litianjian in https://github.com/vllm-project/vllm/pull/10020
* Bump actions/checkout from 4.2.1 to 4.2.2 by dependabot in https://github.com/vllm-project/vllm/pull/9746
* [Misc] Add environment variables collection in collect_env.py tool by ycool in https://github.com/vllm-project/vllm/pull/9293
* [V1] Add all_token_ids attribute to Request by WoosukKwon in https://github.com/vllm-project/vllm/pull/10135
* [V1] Prefix caching (take 2) by comaniac in https://github.com/vllm-project/vllm/pull/9972
* [CI/Build] Give PR cleanup job PR write access by russellb in https://github.com/vllm-project/vllm/pull/10139
* [Doc] Update FAQ links in spec_decode.rst by whyiug in https://github.com/vllm-project/vllm/pull/9662
* [Bugfix] Add error handling when server cannot respond any valid tokens by DearPlanet in https://github.com/vllm-project/vllm/pull/5895
* [Misc] Fix ImportError causing by triton by MengqingCao in https://github.com/vllm-project/vllm/pull/9493
* [Doc] Move CONTRIBUTING to docs site by russellb in https://github.com/vllm-project/vllm/pull/9924
* Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. by sighingnow in https://github.com/vllm-project/vllm/pull/9285
* Add hf_transfer to testing image by mgoin in https://github.com/vllm-project/vllm/pull/10096
* [Misc] Fix typo in 5895 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10145
* [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator by yma11 in https://github.com/vllm-project/vllm/pull/10144
* [Model] Expose size to Idefics3 as mm_processor_kwargs by Isotr0py in https://github.com/vllm-project/vllm/pull/10146
* [V1]Enable APC by default only for text models by ywang96 in https://github.com/vllm-project/vllm/pull/10148
* [CI/Build] Update CPU tests to include all "standard" tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/5481
* Fix edge case Mistral tokenizer by patrickvonplaten in https://github.com/vllm-project/vllm/pull/10152
* Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 by sroy745 in https://github.com/vllm-project/vllm/pull/10136
* [Misc] Improve Web UI by rafvasq in https://github.com/vllm-project/vllm/pull/10090
* [V1] Fix non-cudagraph op name by WoosukKwon in https://github.com/vllm-project/vllm/pull/10166
* [CI/Build] Ignore .gitignored files for shellcheck by ProExpertProg in https://github.com/vllm-project/vllm/pull/10162
* Rename vllm.logging to vllm.logging_utils by flozi00 in https://github.com/vllm-project/vllm/pull/10134
* [torch.compile] Fuse RMSNorm with quant by ProExpertProg in https://github.com/vllm-project/vllm/pull/9138
* [Bugfix] Fix SymIntArrayRef expected to contain only concrete integers by bnellnm in https://github.com/vllm-project/vllm/pull/10170
* [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case by rasmith in https://github.com/vllm-project/vllm/pull/9857
* [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/6892
* [0/N] Rename `MultiModalInputs` to `MultiModalKwargs` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10040
* [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module by mgoin in https://github.com/vllm-project/vllm/pull/10169
* [CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing by Isotr0py in https://github.com/vllm-project/vllm/pull/10161
* [Doc] Adjust RunLLM location by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10176
* [5/N] pass the whole config to model by youkaichao in https://github.com/vllm-project/vllm/pull/9983
* [CI/Build] Add run-hpu-test.sh script by xuechendi in https://github.com/vllm-project/vllm/pull/10167
* [Bugfix] Enable some fp8 and quantized fullgraph tests by bnellnm in https://github.com/vllm-project/vllm/pull/10171
* [bugfix] fix broken tests of mlp speculator by youkaichao in https://github.com/vllm-project/vllm/pull/10177
* [doc] explaining the integration with huggingface by youkaichao in https://github.com/vllm-project/vllm/pull/10173
* bugfix: fix the bug that stream generate not work by caijizhuo in https://github.com/vllm-project/vllm/pull/2756
* [Frontend] add `add_request_id` middleware by cjackal in https://github.com/vllm-project/vllm/pull/9594
* [Frontend][Core] Override HF `config.json` via CLI by KrishnaM251 in https://github.com/vllm-project/vllm/pull/5836
* [CI/Build] Split up models tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10069
* [ci][build] limit cmake version by youkaichao in https://github.com/vllm-project/vllm/pull/10188
* [Doc] Fix typo error in CONTRIBUTING.md by FuryMartin in https://github.com/vllm-project/vllm/pull/10190
* [doc] Polish the integration with huggingface doc by CRZbulabula in https://github.com/vllm-project/vllm/pull/10195
* [Misc] small fixes to function tracing file path by ShawnD200 in https://github.com/vllm-project/vllm/pull/9543
* [misc] improve cloudpickle registration and tests by youkaichao in https://github.com/vllm-project/vllm/pull/10202
* [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py by yansh97 in https://github.com/vllm-project/vllm/pull/10196
* [doc] improve debugging code by youkaichao in https://github.com/vllm-project/vllm/pull/10206
* [6/N] pass whole config to inner model by youkaichao in https://github.com/vllm-project/vllm/pull/10205
* Bump the patch-update group with 5 updates by dependabot in https://github.com/vllm-project/vllm/pull/10210
* [Hardware][CPU] Add embedding models support for CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/10193
* [LoRA][Kernel] Remove the unused libentry module by jeejeelee in https://github.com/vllm-project/vllm/pull/10214
* [V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer by ywang96 in https://github.com/vllm-project/vllm/pull/10211
* [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner by Isotr0py in https://github.com/vllm-project/vllm/pull/10218
* [Metrics] add more metrics by HarryWu99 in https://github.com/vllm-project/vllm/pull/4464
* [Doc] fix doc string typo in block_manager `swap_out` function by yyccli in https://github.com/vllm-project/vllm/pull/10212
* [core][distributed] add stateless process group by youkaichao in https://github.com/vllm-project/vllm/pull/10216
* Bump actions/setup-python from 5.2.0 to 5.3.0 by dependabot in https://github.com/vllm-project/vllm/pull/10209
* [V1] Fix detokenizer ports by WoosukKwon in https://github.com/vllm-project/vllm/pull/10224
* [V1] Do not use inductor for piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10225
* [v1][torch.compile] support managing cudagraph buffer by youkaichao in https://github.com/vllm-project/vllm/pull/10203
* [V1] Use custom ops for piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10227
* Add docs on serving with Llama Stack by terrytangyuan in https://github.com/vllm-project/vllm/pull/10183
* [misc][distributed] auto port selection and disable tests by youkaichao in https://github.com/vllm-project/vllm/pull/10226
* [V1] Enable custom ops with piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10228
* Make shutil rename in python_only_dev by shcheglovnd in https://github.com/vllm-project/vllm/pull/10233
* [V1] `AsyncLLM` Implementation by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/9826
* [doc] update debugging guide by youkaichao in https://github.com/vllm-project/vllm/pull/10236
* [Doc] Update help text for `--distributed-executor-backend` by russellb in https://github.com/vllm-project/vllm/pull/10231
* [1/N] torch.compile user interface design by youkaichao in https://github.com/vllm-project/vllm/pull/10237
* [Misc][LoRA] Replace hardcoded cuda device with configurable argument by jeejeelee in https://github.com/vllm-project/vllm/pull/10223
* Splitting attention kernel file by maleksan85 in https://github.com/vllm-project/vllm/pull/10091
* [doc] explain the class hierarchy in vLLM by youkaichao in https://github.com/vllm-project/vllm/pull/10240
* [CI][CPU]refactor CPU tests to allow to bind with different cores by zhouyuan in https://github.com/vllm-project/vllm/pull/10222
* [BugFix] Do not raise a `ValueError` when `tool_choice` is set to the supported `none` option and `tools` are not defined. by gcalmettes in https://github.com/vllm-project/vllm/pull/10000
* [Misc]Fix Idefics3Model argument by jeejeelee in https://github.com/vllm-project/vllm/pull/10255
* [Bugfix] Fix QwenModel argument by DamonFool in https://github.com/vllm-project/vllm/pull/10262
* [Frontend] Add per-request number of cached token stats by zifeitong in https://github.com/vllm-project/vllm/pull/10174
* [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest by WoosukKwon in https://github.com/vllm-project/vllm/pull/10245
* [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers by sroy745 in https://github.com/vllm-project/vllm/pull/9982
* [LoRA] Adds support for bias in LoRA by followumesh in https://github.com/vllm-project/vllm/pull/5733
* [V1] Enable Inductor when using piecewise CUDA graphs by WoosukKwon in https://github.com/vllm-project/vllm/pull/10268
* [doc] fix location of runllm widget by youkaichao in https://github.com/vllm-project/vllm/pull/10266
* [doc] improve debugging doc by youkaichao in https://github.com/vllm-project/vllm/pull/10270
* Revert "[ci][build] limit cmake version" by youkaichao in https://github.com/vllm-project/vllm/pull/10271
* [V1] Fix CI tests on V1 engine by WoosukKwon in https://github.com/vllm-project/vllm/pull/10272
* [core][distributed] use tcp store directly by youkaichao in https://github.com/vllm-project/vllm/pull/10275
* [V1] Support VLMs with fine-grained scheduling by WoosukKwon in https://github.com/vllm-project/vllm/pull/9871
* Bump to compressed-tensors v0.8.0 by dsikka in https://github.com/vllm-project/vllm/pull/10279
* [Doc] Fix typo in arg_utils.py by xyang16 in https://github.com/vllm-project/vllm/pull/10264
* [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions by imkero in https://github.com/vllm-project/vllm/pull/10221
* [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by FurtherAI in https://github.com/vllm-project/vllm/pull/9944
* [Core] Flashinfer - Remove advance step size restriction by pavanimajety in https://github.com/vllm-project/vllm/pull/10282
* [Model][LoRA]LoRA support added for idefics3 by B-201 in https://github.com/vllm-project/vllm/pull/10281
* [V1] Add missing tokenizer options for `Detokenizer` by ywang96 in https://github.com/vllm-project/vllm/pull/10288
* [1/N] Initial prototype for multi-modal processor by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10044
* [Bugfix] bitsandbytes models fail to run pipeline parallel by HoangCongDuc in https://github.com/vllm-project/vllm/pull/10200
* [Bugfix] Fix tensor parallel for qwen2 classification model by Isotr0py in https://github.com/vllm-project/vllm/pull/10297
* [misc] error early for old-style class by youkaichao in https://github.com/vllm-project/vllm/pull/10304
* [Misc] format.sh: Simplify tool_version_check by russellb in https://github.com/vllm-project/vllm/pull/10305
* [Frontend] Pythonic tool parser by mdepinet in https://github.com/vllm-project/vllm/pull/9859
* [BugFix]: properly deserialize `tool_calls` iterator before processing by mistral-common when MistralTokenizer is used by gcalmettes in https://github.com/vllm-project/vllm/pull/9951
* [Model] Add BNB quantization support for Idefics3 by B-201 in https://github.com/vllm-project/vllm/pull/10310
* [ci][distributed] disable hanging tests by youkaichao in https://github.com/vllm-project/vllm/pull/10317
* [CI/Build] Fix CPU CI online inference timeout by Isotr0py in https://github.com/vllm-project/vllm/pull/10314
* [CI/Build] Make shellcheck happy by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10285
* [Docs] Publish meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/10331
* Support Roberta embedding models by maxdebayser in https://github.com/vllm-project/vllm/pull/9387
* [Perf] Reduce peak memory usage of llama by andoorve in https://github.com/vllm-project/vllm/pull/10339
* [Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server by jxpxxzj in https://github.com/vllm-project/vllm/pull/9583
* [Tool parsing] Improve / correct mistral tool parsing by patrickvonplaten in https://github.com/vllm-project/vllm/pull/10333
* [Bugfix] Fix unable to load some models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10312
* [bugfix] Fix static asymmetric quantization case by ProExpertProg in https://github.com/vllm-project/vllm/pull/10334
* [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/10308
* [Model] Support Qwen2 embeddings and use tags to select model tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10184
* [Bugfix] Qwen-vl output is inconsistent in speculative decoding by skylee-01 in https://github.com/vllm-project/vllm/pull/10350
* [Misc] Consolidate pooler config overrides by DarkLight1337 in https://github.com/vllm-project/vllm/pull/10351
* [Build] skip renaming files for release wheels pipeline by simon-mo in https://github.com/vllm-project/vllm/pull/9671

New Contributors
* gracehonv made their first contribution in https://github.com/vllm-project/vllm/pull/9349
* streaver91 made their first contribution in https://github.com/vllm-project/vllm/pull/9396
* wukaixingxp made their first contribution in https://github.com/vllm-project/vllm/pull/9013
* sssrijan-amazon made their first contribution in https://github.com/vllm-project/vllm/pull/9380
* coolkp made their first contribution in https://github.com/vllm-project/vllm/pull/9477
* yue-anyscale made their first contribution in https://github.com/vllm-project/vllm/pull/9478
* dhiaEddineRhaiem made their first contribution in https://github.com/vllm-project/vllm/pull/9325
* yudian0504 made their first contribution in https://github.com/vllm-project/vllm/pull/9549
* ngrozae made their first contribution in https://github.com/vllm-project/vllm/pull/9552
* Falko1 made their first contribution in https://github.com/vllm-project/vllm/pull/9503
* wangshuai09 made their first contribution in https://github.com/vllm-project/vllm/pull/9536
* gopalsarda made their first contribution in https://github.com/vllm-project/vllm/pull/9580
* guoyuhong made their first contribution in https://github.com/vllm-project/vllm/pull/9550
* JArnoldAMD made their first contribution in https://github.com/vllm-project/vllm/pull/9529
* yuleil made their first contribution in https://github.com/vllm-project/vllm/pull/8234
* sethkimmel3 made their first contribution in https://github.com/vllm-project/vllm/pull/7889
* MengqingCao made their first contribution in https://github.com/vllm-project/vllm/pull/9605
* CRZbulabula made their first contribution in https://github.com/vllm-project/vllm/pull/9614
* faychu made their first contribution in https://github.com/vllm-project/vllm/pull/9248
* vrdn-23 made their first contribution in https://github.com/vllm-project/vllm/pull/9358
* willmj made their first contribution in https://github.com/vllm-project/vllm/pull/9673
* samos123 made their first contribution in https://github.com/vllm-project/vllm/pull/9709
* MErkinSag made their first contribution in https://github.com/vllm-project/vllm/pull/9560
* Alvant made their first contribution in https://github.com/vllm-project/vllm/pull/9717
* kakao-kevin-us made their first contribution in https://github.com/vllm-project/vllm/pull/9704
* madt2709 made their first contribution in https://github.com/vllm-project/vllm/pull/9533
* FerdinandZhong made their first contribution in https://github.com/vllm-project/vllm/pull/9427
* svenseeberg made their first contribution in https://github.com/vllm-project/vllm/pull/9798
* yannicks1 made their first contribution in https://github.com/vllm-project/vllm/pull/9801
* wseaton made their first contribution in https://github.com/vllm-project/vllm/pull/8339
* Went-Liang made their first contribution in https://github.com/vllm-project/vllm/pull/9697
* andrejonasson made their first contribution in https://github.com/vllm-project/vllm/pull/9696
* GeneDer made their first contribution in https://github.com/vllm-project/vllm/pull/9934
* mikegre-google made their first contribution in https://github.com/vllm-project/vllm/pull/9926
* nokados made their first contribution in https://github.com/vllm-project/vllm/pull/9956
* cooleel made their first contribution in https://github.com/vllm-project/vllm/pull/9747
* zhengy001 made their first contribution in https://github.com/vllm-project/vllm/pull/9447
* daitran2k1 made their first contribution in https://github.com/vllm-project/vllm/pull/9984
* chaunceyjiang made their first contribution in https://github.com/vllm-project/vllm/pull/9915
* hissu-hyvarinen made their first contribution in https://github.com/vllm-project/vllm/pull/9279
* lk-chen made their first contribution in https://github.com/vllm-project/vllm/pull/9779
* yangalan123 made their first contribution in https://github.com/vllm-project/vllm/pull/10027
* llsj14 made their first contribution in https://github.com/vllm-project/vllm/pull/9730
* arakowsk-amd made their first contribution in https://github.com/vllm-project/vllm/pull/10063
* kzawora-intel made their first contribution in https://github.com/vllm-project/vllm/pull/6143
* DIYer22 made their first contribution in https://github.com/vllm-project/vllm/pull/10076
* li-plus made their first contribution in https://github.com/vllm-project/vllm/pull/10112
* spliii made their first contribution in https://github.com/vllm-project/vllm/pull/10105
* flozi00 made their first contribution in https://github.com/vllm-project/vllm/pull/10134
* xuechendi made their first contribution in https://github.com/vllm-project/vllm/pull/10167
* caijizhuo made their first contribution in https://github.com/vllm-project/vllm/pull/2756
* cjackal made their first contribution in https://github.com/vllm-project/vllm/pull/9594
* KrishnaM251 made their first contribution in https://github.com/vllm-project/vllm/pull/5836
* FuryMartin made their first contribution in https://github.com/vllm-project/vllm/pull/10190
* ShawnD200 made their first contribution in https://github.com/vllm-project/vllm/pull/9543
* yansh97 made their first contribution in https://github.com/vllm-project/vllm/pull/10196
* yyccli made their first contribution in https://github.com/vllm-project/vllm/pull/10212
* shcheglovnd made their first contribution in https://github.com/vllm-project/vllm/pull/10233
* maleksan85 made their first contribution in https://github.com/vllm-project/vllm/pull/10091
* followumesh made their first contribution in https://github.com/vllm-project/vllm/pull/5733
* imkero made their first contribution in https://github.com/vllm-project/vllm/pull/10221
* B-201 made their first contribution in https://github.com/vllm-project/vllm/pull/10281
* HoangCongDuc made their first contribution in https://github.com/vllm-project/vllm/pull/10200
* mdepinet made their first contribution in https://github.com/vllm-project/vllm/pull/9859
* jxpxxzj made their first contribution in https://github.com/vllm-project/vllm/pull/9583
* skylee-01 made their first contribution in https://github.com/vllm-project/vllm/pull/10350

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.3...v0.6.4

0.6.3.post1

Not secure
Highlights

New Models
* Support Ministral 3B and Ministral 8B via interleaved attention (9414)
* Support multiple and interleaved images for Llama3.2 (9095)
* Support VLM2Vec, the first multimodal embedding model in vLLM (9303)

Important bug fix
* Fix chat API continuous usage stats (9357)
* Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (9034)
* Fix Molmo text-only input bug (9397)
* Fix CUDA 11.8 Build (9386)
* Fix `_version.py` not found issue (9375)

Other Enhancements
* Remove block manager v1 and make block manager v2 default (8704)
* Spec Decode Optimize ngram lookup performance (9333)


What's Changed
* [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by WoosukKwon in https://github.com/vllm-project/vllm/pull/9350
* [Frontend] merge beam search implementations by LunrEclipse in https://github.com/vllm-project/vllm/pull/9296
* [Model] Make llama3.2 support multiple and interleaved images by xiangxu-google in https://github.com/vllm-project/vllm/pull/9095
* [Bugfix] Clean up some cruft in mamba.py by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9343
* [Frontend] Clarify model_type error messages by stevegrubb in https://github.com/vllm-project/vllm/pull/9345
* [Doc] Fix code formatting in spec_decode.rst by mgoin in https://github.com/vllm-project/vllm/pull/9348
* [Bugfix] Update InternVL input mapper to support image embeds by hhzhang16 in https://github.com/vllm-project/vllm/pull/9351
* [BugFix] Fix chat API continuous usage stats by njhill in https://github.com/vllm-project/vllm/pull/9357
* pass ignore_eos parameter to all benchmark_serving calls by gracehonv in https://github.com/vllm-project/vllm/pull/9349
* [Misc] Directly use compressed-tensors for checkpoint definitions by mgoin in https://github.com/vllm-project/vllm/pull/8909
* [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by CatherineSue in https://github.com/vllm-project/vllm/pull/9034
* [Bugfix][CI/Build] Fix CUDA 11.8 Build by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9386
* [Bugfix] Molmo text-only input bug fix by mrsalehi in https://github.com/vllm-project/vllm/pull/9397
* [Misc] Standardize RoPE handling for Qwen2-VL by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9250
* [Model] VLM2Vec, the first multimodal embedding model in vLLM by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9303
* [CI/Build] Test VLM embeddings by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9406
* [Core] Rename input data types by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8688
* [Misc] Consolidate example usage of OpenAI client for multimodal models by ywang96 in https://github.com/vllm-project/vllm/pull/9412
* [Model] Support SDPA attention for Molmo vision backbone by Isotr0py in https://github.com/vllm-project/vllm/pull/9410
* Support mistral interleaved attn by patrickvonplaten in https://github.com/vllm-project/vllm/pull/9414
* [Kernel][Model] Improve continuous batching for Jamba and Mamba by mzusman in https://github.com/vllm-project/vllm/pull/9189
* [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by streaver91 in https://github.com/vllm-project/vllm/pull/9396
* [Performance][Spec Decode] Optimize ngram lookup performance by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9333
* [CI/Build] mypy: Resolve some errors from checking vllm/engine by russellb in https://github.com/vllm-project/vllm/pull/9267
* [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9425
* [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by rasmith in https://github.com/vllm-project/vllm/pull/9391
* Add notes on the use of Slack by terrytangyuan in https://github.com/vllm-project/vllm/pull/9442
* [Kernel] Add Exllama as a backend for compressed-tensors by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9395
* [Misc] Print stack trace using `logger.exception` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9461
* [misc] CUDA Time Layerwise Profiler by LucasWilkinson in https://github.com/vllm-project/vllm/pull/8337
* [Bugfix] Allow prefill of assistant response when using `mistral_common` by sasha0552 in https://github.com/vllm-project/vllm/pull/9446
* [TPU] Call torch._sync(param) during weight loading by WoosukKwon in https://github.com/vllm-project/vllm/pull/9437
* [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/9344
* [Core] Deprecating block manager v1 and make block manager v2 default by KuntaiDu in https://github.com/vllm-project/vllm/pull/8704
* [CI/Build] remove .github from .dockerignore, add dirty repo check by dtrifiro in https://github.com/vllm-project/vllm/pull/9375

New Contributors
* gracehonv made their first contribution in https://github.com/vllm-project/vllm/pull/9349
* streaver91 made their first contribution in https://github.com/vllm-project/vllm/pull/9396

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.3...v0.6.3.post1

0.6.3

Not secure
Highlights

Model Support
* New Models:
* Text: Granite MoE (8206), Mamba (6484, 8533)
* Vision: GLM-4V (9242), Molmo (9016), NVLM-D (9045)
* Reward model support: Qwen2.5-Math-RM-72B (8896)
* Expansion in functionality:
* Add Gemma2 embedding model (9004)
* Support input embeddings for qwen2vl (8856), minicpmv (9237)
* LoRA:
* LoRA support for MiniCPMV2.5 (7199), MiniCPMV2.6 (8943)
* Expand lora modules for mixtral (9008)
* Pipeline parallelism support to remaining text and embedding models (7168, 9090)
* Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (9148)
* Tool use:
* Add support for Llama 3.1 and 3.2 tool use (8343); a client-side sketch follows after this list
* Support tool calling for InternLM2.5 (8405)
* Out of tree support enhancements: Explicit interface for vLLM models and support OOT embedding models (9108)
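
Below is a hedged, client-side sketch of how tool calling might be exercised against a vLLM OpenAI-compatible server once one of the parsers above is enabled. The server command in the comment, the `llama3_json` parser choice, the model name, and the `get_weather` tool are all illustrative assumptions, not taken from this changelog; only the standard `openai` Python client API is used.

```python
# Hedged sketch: tool calling from the client side against a vLLM
# OpenAI-compatible server. Assumed (illustrative) server launch:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --enable-auto-tool-choice --tool-call-parser llama3_json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, defined only for this example
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "What's the weather in Paris today?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```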

Documentation
* New compatibility matrix for mutually exclusive features (8512)
* Reorganized installation doc, note that we publish a per-commit docker image (8931)

Hardware Support
* Cross-attention and Encoder-Decoder models support on x86 CPU backend (9089)
* Support AWQ for CPU backend (7515)
* Add async output processor for xpu (8897)
* Add on-device sampling support for Neuron (8746)

Architectural Enhancements
* Progress in refactoring vLLM's core:
* Spec decode: removing batch expansion (8839, 9298)
* We have made block manager V2 the default. This is an internal refactoring toward a cleaner and better-tested code path (8678).
* Moving beam search from the core to the API level (9105, 9087, 9117, 8928)
* Move guided decoding params into sampling params (8252)
* Torch Compile:
* You can now set the env var `VLLM_TORCH_COMPILE_LEVEL` to control the level of `torch.compile` compilation and integration (9058). Along with various improvements (8982, 9258, 906, 8875), setting `VLLM_TORCH_COMPILE_LEVEL=3` turns on Inductor's full-graph compilation without vLLM's custom ops; a minimal usage sketch follows after this list.
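
A minimal sketch of the compile-level knob described above, assuming vLLM v0.6.3 and that the environment variable is read at vLLM import/startup time; the model name is illustrative.

```python
# Minimal sketch (assumptions: vLLM v0.6.3, env var read at import/startup time,
# illustrative model name): opt into Inductor full-graph compilation.
import os
os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"  # level 3 = Inductor full-graph compilation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
outputs = llm.generate(["vLLM is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```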

Others
* Performance enhancements to turn on multi-step scheduling by default (8804, 8645, 8378)
* Enhancements towards priority scheduling (8965, 8956, 8850)


What's Changed
* [Misc] Update config loading for Qwen2-VL and remove Granite by ywang96 in https://github.com/vllm-project/vllm/pull/8837
* [Build/CI] Upgrade to gcc 10 in the base build Docker image by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8814
* [Docs] Add README to the build docker image by mgoin in https://github.com/vllm-project/vllm/pull/8825
* [CI/Build] Fix missing ci dependencies by fyuan1316 in https://github.com/vllm-project/vllm/pull/8834
* [misc][installation] build from source without compilation by youkaichao in https://github.com/vllm-project/vllm/pull/8818
* [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM by khluu in https://github.com/vllm-project/vllm/pull/8872
* [Bugfix] Include encoder prompts len to non-stream api usage response by Pernekhan in https://github.com/vllm-project/vllm/pull/8861
* [Misc] Change dummy profiling and BOS fallback warns to log once by mgoin in https://github.com/vllm-project/vllm/pull/8820
* [Bugfix] Fix print_warning_once's line info by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8867
* fix validation: Only set tool_choice `auto` if at least one tool is provided by chiragjn in https://github.com/vllm-project/vllm/pull/8568
* [Bugfix] Fixup advance_step.cu warning by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8815
* [BugFix] Fix test breakages from transformers 4.45 upgrade by njhill in https://github.com/vllm-project/vllm/pull/8829
* [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8764
* [Feature] Add support for Llama 3.1 and 3.2 tool use by maxdebayser in https://github.com/vllm-project/vllm/pull/8343
* [Core] Rename `PromptInputs` and `inputs` with backward compatibility by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8876
* [misc] fix collect env by youkaichao in https://github.com/vllm-project/vllm/pull/8894
* [MISC] Fix invalid escape sequence '\' by panpan0000 in https://github.com/vllm-project/vllm/pull/8830
* [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` by Isotr0py in https://github.com/vllm-project/vllm/pull/8892
* [TPU] Update pallas.py to support trillium by bvrockwell in https://github.com/vllm-project/vllm/pull/8871
* [torch.compile] use empty tensor instead of None for profiling by youkaichao in https://github.com/vllm-project/vllm/pull/8875
* [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method by ProExpertProg in https://github.com/vllm-project/vllm/pull/7271
* [Bugfix] fix for deepseek w4a16 by LucasWilkinson in https://github.com/vllm-project/vllm/pull/8906
* [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/8378
* [misc][distributed] add VLLM_SKIP_P2P_CHECK flag by youkaichao in https://github.com/vllm-project/vllm/pull/8911
* [Core] Priority-based scheduling in async engine by schoennenbeck in https://github.com/vllm-project/vllm/pull/8850
* [misc] fix wheel name by youkaichao in https://github.com/vllm-project/vllm/pull/8919
* [Bugfix][Intel] Fix XPU Dockerfile Build by tylertitsworth in https://github.com/vllm-project/vllm/pull/7824
* [Misc] Remove vLLM patch of `BaichuanTokenizer` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8921
* [Bugfix] Fix code for downloading models from modelscope by tastelikefeet in https://github.com/vllm-project/vllm/pull/8443
* [Bugfix] Fix PP for Multi-Step by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/8887
* [CI/Build] Update models tests & examples by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8874
* [Frontend] Make beam search emulator temperature modifiable by nFunctor in https://github.com/vllm-project/vllm/pull/8928
* [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 by heheda12345 in https://github.com/vllm-project/vllm/pull/8891
* [doc] organize installation doc and expose per-commit docker by youkaichao in https://github.com/vllm-project/vllm/pull/8931
* [Core] Improve choice of Python multiprocessing method by russellb in https://github.com/vllm-project/vllm/pull/8823
* [Bugfix] Block manager v2 with preemption and lookahead slots by sroy745 in https://github.com/vllm-project/vllm/pull/8824
* [Bugfix] Fix Marlin MoE act order when is_k_full == False by ElizaWszola in https://github.com/vllm-project/vllm/pull/8741
* [CI/Build] Add test decorator for minimum GPU memory by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8925
* [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8930
* [Model] Support Qwen2.5-Math-RM-72B by zhuzilin in https://github.com/vllm-project/vllm/pull/8896
* [Model][LoRA]LoRA support added for MiniCPMV2.5 by jeejeelee in https://github.com/vllm-project/vllm/pull/7199
* [BugFix] Fix seeded random sampling with encoder-decoder models by njhill in https://github.com/vllm-project/vllm/pull/8870
* [Misc] Fix typo in BlockSpaceManagerV1 by juncheoll in https://github.com/vllm-project/vllm/pull/8944
* [Frontend] Added support for HF's new `continue_final_message` parameter by danieljannai21 in https://github.com/vllm-project/vllm/pull/8942
* [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model by mzusman in https://github.com/vllm-project/vllm/pull/8533
* [Model] support input embeddings for qwen2vl by whyiug in https://github.com/vllm-project/vllm/pull/8856
* [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` by ywang96 in https://github.com/vllm-project/vllm/pull/8951
* [Model][LoRA]LoRA support added for MiniCPMV2.6 by jeejeelee in https://github.com/vllm-project/vllm/pull/8943
* [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg by Isotr0py in https://github.com/vllm-project/vllm/pull/8946
* [Core] Make scheduling policy settable via EngineArgs by schoennenbeck in https://github.com/vllm-project/vllm/pull/8956
* [Misc] Adjust max_position_embeddings for LoRA compatibility by jeejeelee in https://github.com/vllm-project/vllm/pull/8957
* [ci] Add CODEOWNERS for test directories by khluu in https://github.com/vllm-project/vllm/pull/8795
* [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/8975
* [Frontend][Core] Move guided decoding params into sampling params by joerunde in https://github.com/vllm-project/vllm/pull/8252
* [CI/Build] Fix machete generated kernel files ordering by khluu in https://github.com/vllm-project/vllm/pull/8976
* [torch.compile] fix tensor alias by youkaichao in https://github.com/vllm-project/vllm/pull/8982
* [Misc] add process_weights_after_loading for DummyLoader by divakar-amd in https://github.com/vllm-project/vllm/pull/8969
* [Bugfix] Fix Fuyu tensor parallel inference by Isotr0py in https://github.com/vllm-project/vllm/pull/8986
* [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8991
* [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API by schoennenbeck in https://github.com/vllm-project/vllm/pull/8965
* [Doc] Update list of supported models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8987
* Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows by vlsav in https://github.com/vllm-project/vllm/pull/8997
* [Spec Decode] (1/2) Remove batch expansion by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/8839
* [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching by afeldman-nm in https://github.com/vllm-project/vllm/pull/8804
* [Misc] Update Default Image Mapper Error Log by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8977
* [Core] CUDA Graphs for Multi-Step + Chunked-Prefill by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/8645
* [OpenVINO] Enable GPU support for OpenVINO vLLM backend by sshlyapn in https://github.com/vllm-project/vllm/pull/8192
* [Model] Adding Granite MoE. by shawntan in https://github.com/vllm-project/vllm/pull/8206
* [Doc] Update Granite model docs by njhill in https://github.com/vllm-project/vllm/pull/9025
* [Bugfix] example template should not add parallel_tool_prompt if tools is none by tjohnson31415 in https://github.com/vllm-project/vllm/pull/9007
* [Misc] log when using default MoE config by divakar-amd in https://github.com/vllm-project/vllm/pull/8971
* [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser by gcalmettes in https://github.com/vllm-project/vllm/pull/9020
* [Core] Make BlockSpaceManagerV2 the default BlockManager to use. by sroy745 in https://github.com/vllm-project/vllm/pull/8678
* [Frontend] [Neuron] Parse literals out of override-neuron-config by xendo in https://github.com/vllm-project/vllm/pull/8959
* [misc] add forward context for attention by youkaichao in https://github.com/vllm-project/vllm/pull/9029
* Fix failing spec decode test by sroy745 in https://github.com/vllm-project/vllm/pull/9054
* [Bugfix] Weight loading fix for OPT model by domenVres in https://github.com/vllm-project/vllm/pull/9042
* [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model by sydnash in https://github.com/vllm-project/vllm/pull/8405
* [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) by LucasWilkinson in https://github.com/vllm-project/vllm/pull/8845
* [Misc] Enable multi-step output streaming by default by mgoin in https://github.com/vllm-project/vllm/pull/9047
* [Models] Add remaining model PP support by andoorve in https://github.com/vllm-project/vllm/pull/7168
* [Misc] Move registry to its own file by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9064
* [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL by whyiug in https://github.com/vllm-project/vllm/pull/9071
* [Bugfix] Flash attention arches not getting set properly by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9062
* [Model] add a bunch of supported lora modules for mixtral by prashantgupta24 in https://github.com/vllm-project/vllm/pull/9008
* Remove AMD Ray Summit Banner by simon-mo in https://github.com/vllm-project/vllm/pull/9075
* [Hardware][PowerPC] Make oneDNN dependency optional for Power by varad-ahirwadkar in https://github.com/vllm-project/vllm/pull/9039
* [Core][VLM] Test registration for OOT multimodal models by ywang96 in https://github.com/vllm-project/vllm/pull/8717
* Adds truncate_prompt_tokens param for embeddings creation by flaviabeo in https://github.com/vllm-project/vllm/pull/8999
* [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE by ElizaWszola in https://github.com/vllm-project/vllm/pull/8973
* [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang by KuntaiDu in https://github.com/vllm-project/vllm/pull/7412
* [Misc] Improved prefix cache example by Imss27 in https://github.com/vllm-project/vllm/pull/9077
* [Misc] Add random seed for prefix cache benchmark by Imss27 in https://github.com/vllm-project/vllm/pull/9081
* [Misc] Fix CI lint by comaniac in https://github.com/vllm-project/vllm/pull/9085
* [Hardware][Neuron] Add on-device sampling support for Neuron by chongmni-aws in https://github.com/vllm-project/vllm/pull/8746
* [torch.compile] improve allreduce registration by youkaichao in https://github.com/vllm-project/vllm/pull/9061
* [Doc] Update README.md with Ray summit slides by zhuohan123 in https://github.com/vllm-project/vllm/pull/9088
* [Bugfix] use blockmanagerv1 for encoder-decoder by heheda12345 in https://github.com/vllm-project/vllm/pull/9084
* [Bugfix] Fixes for Phi3v and Ultravox Multimodal EmbeddingInputs Support by hhzhang16 in https://github.com/vllm-project/vllm/pull/8979
* [Model] Support Gemma2 embedding model by xyang16 in https://github.com/vllm-project/vllm/pull/9004
* [Bugfix] Deprecate registration of custom configs to huggingface by heheda12345 in https://github.com/vllm-project/vllm/pull/9083
* [Bugfix] Fix order of arguments matters in config.yaml by Imss27 in https://github.com/vllm-project/vllm/pull/8960
* [core] use forward context for flash infer by youkaichao in https://github.com/vllm-project/vllm/pull/9097
* [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model by tjtanaa in https://github.com/vllm-project/vllm/pull/9101
* [Frontend] API support for beam search by LunrEclipse in https://github.com/vllm-project/vllm/pull/9087
* [Misc] Remove user-facing error for removed VLM args by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9104
* [Model] PP support for embedding models and update docs by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9090
* [Bugfix] fix tool_parser error handling when serve a model not support it by liuyanyi in https://github.com/vllm-project/vllm/pull/8709
* [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling by varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/9038
* [Bugfix][Hardware][CPU] Fix CPU model input for decode by Isotr0py in https://github.com/vllm-project/vllm/pull/9044
* [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None by sroy745 in https://github.com/vllm-project/vllm/pull/9103
* [core] remove beam search from the core by youkaichao in https://github.com/vllm-project/vllm/pull/9105
* [Model] Explicit interface for vLLM models and support OOT embedding models by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9108
* [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/9089
* [Core] Refactor GGUF parameters packing and forwarding by Isotr0py in https://github.com/vllm-project/vllm/pull/8859
* [Model] Support NVLM-D and fix QK Norm in InternViT by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9045
* [Doc]: Add deploying_with_k8s guide by haitwang-cloud in https://github.com/vllm-project/vllm/pull/8451
* [CI/Build] Add linting for github actions workflows by russellb in https://github.com/vllm-project/vllm/pull/7876
* [Doc] Include performance benchmark in README by KuntaiDu in https://github.com/vllm-project/vllm/pull/9135
* [misc] fix comment and variable name by youkaichao in https://github.com/vllm-project/vllm/pull/9139
* Add Slack to README by simon-mo in https://github.com/vllm-project/vllm/pull/9137
* [misc] update utils to support comparing multiple settings by youkaichao in https://github.com/vllm-project/vllm/pull/9140
* [Intel GPU] Fix xpu decode input by jikunshang in https://github.com/vllm-project/vllm/pull/9145
* [misc] improve ux on readme by youkaichao in https://github.com/vllm-project/vllm/pull/9147
* [Frontend] API support for beam search for MQLLMEngine by LunrEclipse in https://github.com/vllm-project/vllm/pull/9117
* [Core][Frontend] Add Support for Inference Time mm_processor_kwargs by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9131
* [Frontend] Add Early Validation For Chat Template / Tool Call Parser by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/9151
* [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models by panpan0000 in https://github.com/vllm-project/vllm/pull/8758
* [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing by dtrifiro in https://github.com/vllm-project/vllm/pull/8537
* [Doc] Update vlm.rst to include an example on videos by sayakpaul in https://github.com/vllm-project/vllm/pull/9155
* [Doc] Improve contributing and installation documentation by rafvasq in https://github.com/vllm-project/vllm/pull/9132
* [Bugfix] Try to handle older versions of pytorch by bnellnm in https://github.com/vllm-project/vllm/pull/9086
* mypy: check additional directories by russellb in https://github.com/vllm-project/vllm/pull/9162
* Add `lm-eval` directly to requirements-test.txt by mgoin in https://github.com/vllm-project/vllm/pull/9161
* support bitsandbytes quantization with more models by chenqianfzh in https://github.com/vllm-project/vllm/pull/9148
* Add classifiers in setup.py by terrytangyuan in https://github.com/vllm-project/vllm/pull/9171
* Update link to KServe deployment guide by terrytangyuan in https://github.com/vllm-project/vllm/pull/9173
* [Misc] Improve validation errors around best_of and n by tjohnson31415 in https://github.com/vllm-project/vllm/pull/9167
* [Bugfix][Doc] Report neuron error in output by joerowell in https://github.com/vllm-project/vllm/pull/9159
* [Model] Remap FP8 kv_scale in CommandR and DBRX by hliuca in https://github.com/vllm-project/vllm/pull/9174
* [Frontend] Log the maximum supported concurrency by AlpinDale in https://github.com/vllm-project/vllm/pull/8831
* [Bugfix] Optimize composite weight loading and fix EAGLE weight loading by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9160
* [ci][test] use load dummy for testing by youkaichao in https://github.com/vllm-project/vllm/pull/9165
* [Doc] Fix VLM prompt placeholder sample bug by ycool in https://github.com/vllm-project/vllm/pull/9170
* [Bugfix] Fix lora loading for Compressed Tensors in 9120 by fahadh4ilyas in https://github.com/vllm-project/vllm/pull/9179
* [Bugfix] Access `get_vocab` instead of `vocab` in tool parsers by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9188
* Add Dependabot configuration for GitHub Actions updates by EwoutH in https://github.com/vllm-project/vllm/pull/1217
* [Hardware][CPU] Support AWQ for CPU backend by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/7515
* [CI/Build] mypy: check vllm/entrypoints by russellb in https://github.com/vllm-project/vllm/pull/9194
* [CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 by mgoin in https://github.com/vllm-project/vllm/pull/9130
* [Core] Fix invalid args to _process_request by russellb in https://github.com/vllm-project/vllm/pull/9201
* [misc] improve model support check in another process by youkaichao in https://github.com/vllm-project/vllm/pull/9208
* [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models by mgoin in https://github.com/vllm-project/vllm/pull/9213
* [Bugfix] Machete garbage results for some models (large K dim) by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9212
* [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 by sroy745 in https://github.com/vllm-project/vllm/pull/9149
* [Bugfix] Fix lm_head weights tying with lora for llama by Isotr0py in https://github.com/vllm-project/vllm/pull/9227
* [Model] support input image embedding for minicpmv by whyiug in https://github.com/vllm-project/vllm/pull/9237
* [OpenVINO] Use torch 2.4.0 and newer optimim version by ilya-lavrenov in https://github.com/vllm-project/vllm/pull/9121
* [Bugfix] Fix Machete unittests failing with `NotImplementedError` by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9218
* [Doc] Improve debugging documentation by rafvasq in https://github.com/vllm-project/vllm/pull/9204
* [CI/Build] Make the `Dockerfile.cpu` file's `PIP_EXTRA_INDEX_URL` Configurable as a Build Argument by jyono in https://github.com/vllm-project/vllm/pull/9252
* Suggest codeowners for the core componenets by simon-mo in https://github.com/vllm-project/vllm/pull/9210
* [torch.compile] integration with compilation control by youkaichao in https://github.com/vllm-project/vllm/pull/9058
* Bump actions/github-script from 6 to 7 by dependabot in https://github.com/vllm-project/vllm/pull/9197
* Bump actions/checkout from 3 to 4 by dependabot in https://github.com/vllm-project/vllm/pull/9196
* Bump actions/setup-python from 3 to 5 by dependabot in https://github.com/vllm-project/vllm/pull/9195
* [ci/build] Add placeholder command for custom models test and add comments by khluu in https://github.com/vllm-project/vllm/pull/9262
* [torch.compile] generic decorators by youkaichao in https://github.com/vllm-project/vllm/pull/9258
* [Doc][Neuron] add note to neuron documentation about resolving triton issue by omrishiv in https://github.com/vllm-project/vllm/pull/9257
* [Misc] Fix sampling from sonnet for long context case by Imss27 in https://github.com/vllm-project/vllm/pull/9235
* [misc] hide best_of from engine by youkaichao in https://github.com/vllm-project/vllm/pull/9261
* [Misc] Collect model support info in a single process per model by DarkLight1337 in https://github.com/vllm-project/vllm/pull/9233
* [Misc][LoRA] Support loading LoRA weights for target_modules in reg format by jeejeelee in https://github.com/vllm-project/vllm/pull/9275
* [Bugfix] Fix priority in multiprocessing engine by schoennenbeck in https://github.com/vllm-project/vllm/pull/9277
* [Model] Support Mamba by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6484
* [Kernel] adding fused moe kernel config for L40S TP4 by bringlein in https://github.com/vllm-project/vllm/pull/9245
* [Model] Add GLM-4v support and meet vllm==0.6.2 by sixsixcoder in https://github.com/vllm-project/vllm/pull/9242
* [Doc] Remove outdated comment to avoid misunderstanding by homeffjy in https://github.com/vllm-project/vllm/pull/9287
* [Doc] Compatibility matrix for mutual exclusive features by wallashss in https://github.com/vllm-project/vllm/pull/8512
* [Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected by LucasWilkinson in https://github.com/vllm-project/vllm/pull/9254
* [Bugfix] Sets `is_first_step_output` for TPUModelRunner by allenwang28 in https://github.com/vllm-project/vllm/pull/9202
* [bugfix] fix f-string for error by prashantgupta24 in https://github.com/vllm-project/vllm/pull/9295
* [BugFix] Fix tool call finish reason in streaming case by maxdebayser in https://github.com/vllm-project/vllm/pull/9209
* [SpecDec] Remove Batch Expansion (2/3) by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9298
* [Bugfix] Fix bug of xformer prefill for encoder-decoder by xiangxu-google in https://github.com/vllm-project/vllm/pull/9026
* [Misc][Installation] Improve source installation script and related documentation by cermeng in https://github.com/vllm-project/vllm/pull/9309
* [Bugfix]Fix MiniCPM's LoRA bug by jeejeelee in https://github.com/vllm-project/vllm/pull/9286
* [CI] Fix merge conflict by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/9317
* [Bugfix] Bandaid fix for speculative decoding tests by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/9327
* [Model] Molmo vLLM Integration by mrsalehi in https://github.com/vllm-project/vllm/pull/9016
* [Hardware][intel GPU] add async output process for xpu by jikunshang in https://github.com/vllm-project/vllm/pull/8897
* [CI/Build] setuptools-scm fixes by dtrifiro in https://github.com/vllm-project/vllm/pull/8900
* [Docs] Remove PDF build from Readtehdocs by simon-mo in https://github.com/vllm-project/vllm/pull/9347

New Contributors
* fyuan1316 made their first contribution in https://github.com/vllm-project/vllm/pull/8834
* panpan0000 made their first contribution in https://github.com/vllm-project/vllm/pull/8830
* bvrockwell made their first contribution in https://github.com/vllm-project/vllm/pull/8871
* tylertitsworth made their first contribution in https://github.com/vllm-project/vllm/pull/7824
* tastelikefeet made their first contribution in https://github.com/vllm-project/vllm/pull/8443
* nFunctor made their first contribution in https://github.com/vllm-project/vllm/pull/8928
* zhuzilin made their first contribution in https://github.com/vllm-project/vllm/pull/8896
* juncheoll made their first contribution in https://github.com/vllm-project/vllm/pull/8944
* vlsav made their first contribution in https://github.com/vllm-project/vllm/pull/8997
* sshlyapn made their first contribution in https://github.com/vllm-project/vllm/pull/8192
* gcalmettes made their first contribution in https://github.com/vllm-project/vllm/pull/9020
* xendo made their first contribution in https://github.com/vllm-project/vllm/pull/8959
* domenVres made their first contribution in https://github.com/vllm-project/vllm/pull/9042
* sydnash made their first contribution in https://github.com/vllm-project/vllm/pull/8405
* varad-ahirwadkar made their first contribution in https://github.com/vllm-project/vllm/pull/9039
* flaviabeo made their first contribution in https://github.com/vllm-project/vllm/pull/8999
* chongmni-aws made their first contribution in https://github.com/vllm-project/vllm/pull/8746
* hhzhang16 made their first contribution in https://github.com/vllm-project/vllm/pull/8979
* xyang16 made their first contribution in https://github.com/vllm-project/vllm/pull/9004
* LunrEclipse made their first contribution in https://github.com/vllm-project/vllm/pull/9087
* sayakpaul made their first contribution in https://github.com/vllm-project/vllm/pull/9155
* joerowell made their first contribution in https://github.com/vllm-project/vllm/pull/9159
* AlpinDale made their first contribution in https://github.com/vllm-project/vllm/pull/8831
* ycool made their first contribution in https://github.com/vllm-project/vllm/pull/9170
* fahadh4ilyas made their first contribution in https://github.com/vllm-project/vllm/pull/9179
* EwoutH made their first contribution in https://github.com/vllm-project/vllm/pull/1217
* jyono made their first contribution in https://github.com/vllm-project/vllm/pull/9252
* dependabot made their first contribution in https://github.com/vllm-project/vllm/pull/9197
* bringlein made their first contribution in https://github.com/vllm-project/vllm/pull/9245
* sixsixcoder made their first contribution in https://github.com/vllm-project/vllm/pull/9242
* homeffjy made their first contribution in https://github.com/vllm-project/vllm/pull/9287
* allenwang28 made their first contribution in https://github.com/vllm-project/vllm/pull/9202
* cermeng made their first contribution in https://github.com/vllm-project/vllm/pull/9309
* mrsalehi made their first contribution in https://github.com/vllm-project/vllm/pull/9016

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.2...v0.6.3

0.6.2

Not secure
Highlights

Model Support
* Support Llama 3.2 models (8811, 8822)

`vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16`

* Beam search has been soft-deprecated. We are moving towards a version of beam search that is more performant and also simplifies vLLM's core (8684, 8763, 8713); a minimal migration sketch follows after this list.
* ⚠️ You will now see the following error; this is a breaking change!

> Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the `vllm.LLM.use_beam_search` method for dedicated beam search instead, or set the environment variable `VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1` to suppress this error. For more details, see https://github.com/vllm-project/vllm/issues/8306

* Support for Solar Model (8386), minicpm3 (8297), LLaVA-Onevision model support (8486)
* Enhancements: pp for qwen2-vl (8696), multiple images for qwen-vl (8247), mistral function calling (8515), bitsandbytes support for Gemma2 (8338), tensor parallelism with bitsandbytes quantization (8434)
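
Below is a hedged sketch of the temporary escape hatch quoted in the deprecation error above. It assumes the legacy `SamplingParams`-based beam search fields (`use_beam_search`, `best_of`) are still accepted in v0.6.2 once the environment variable is set; the model name is illustrative.

```python
# Hedged sketch of the escape hatch quoted above: suppress the deprecation error
# and keep the legacy SamplingParams-based beam search while migrating.
# Assumptions: vLLM v0.6.2 still accepts use_beam_search/best_of; model name is illustrative.
import os
os.environ["VLLM_ALLOW_DEPRECATED_BEAM_SEARCH"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # illustrative model
params = SamplingParams(use_beam_search=True, best_of=4, temperature=0.0, max_tokens=32)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```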

Hardware Support
* TPU: implement multi-step scheduling (8489), use Ray for default distributed backend (8389)
* CPU: Enable mrope and support Qwen2-VL on CPU backend (8770)
* AMD: custom paged attention kernel for rocm (8310), and fp8 kv cache support (8577)

Production Engine
* Initial support for priority scheduling (5958)
* Support LoRA lineage and base model metadata management (6315)
* Batch inference for the llm.chat() API (8648); a minimal sketch follows after this list
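
A minimal sketch of batched `llm.chat()` as described above, assuming a list of OpenAI-style conversations can be passed in one call (per 8648); the model name is illustrative.

```python
# Hedged sketch of batched llm.chat(): a list of conversations handled in one call.
# Assumptions: vLLM v0.6.2+, OpenAI-style role/content messages, illustrative model name.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # illustrative model

conversations = [
    [{"role": "user", "content": "Give me one fun fact about octopuses."}],
    [{"role": "user", "content": "Summarize the rules of chess in one sentence."}],
]

outputs = llm.chat(conversations, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```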

Performance
* Introduce `MQLLMEngine` for the API server, boosting throughput by 30% in single-step and 7% in multi-step (8157, 8761, 8584)
* Multi-step scheduling enhancements
* Prompt logprobs support in Multi-step (8199)
* Add output streaming support to multi-step + async (8335)
* Add flashinfer backend (7928); a hedged sketch follows after this list
* Add cuda graph support during decoding for encoder-decoder models (7631)
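
Below is a hedged sketch of switching on the FlashInfer backend and multi-step scheduling mentioned above. The `VLLM_ATTENTION_BACKEND` environment variable and the `num_scheduler_steps` engine argument are assumptions about how these features are toggled (they are not spelled out in this changelog); the model name is illustrative.

```python
# Hedged sketch: FlashInfer attention backend + multi-step scheduling.
# Assumptions (not stated in this changelog): the backend is selected via the
# VLLM_ATTENTION_BACKEND env var and multi-step via num_scheduler_steps; model is illustrative.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    num_scheduler_steps=8,                     # run several decode steps per scheduler iteration
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```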

Others
* Support sampling from HF datasets and image input for benchmark_serving (8495)
* Progress in torch.compile integration (8488, 8480, 8384, 8526, 8445)


What's Changed
* [MISC] Dump model runner inputs when crashing by comaniac in https://github.com/vllm-project/vllm/pull/8305
* [misc] remove engine_use_ray by youkaichao in https://github.com/vllm-project/vllm/pull/8126
* [TPU] Use Ray for default distributed backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/8389
* Fix the AMD weight loading tests by mgoin in https://github.com/vllm-project/vllm/pull/8390
* [Bugfix]: Fix the logic for deciding if tool parsing is used by tomeras91 in https://github.com/vllm-project/vllm/pull/8366
* [Gemma2] add bitsandbytes support for Gemma2 by blueyo0 in https://github.com/vllm-project/vllm/pull/8338
* [Misc] Raise error when using encoder/decoder model with cpu backend by kevin314 in https://github.com/vllm-project/vllm/pull/8355
* [Misc] Use RoPE cache for MRoPE by WoosukKwon in https://github.com/vllm-project/vllm/pull/8396
* [torch.compile] hide slicing under custom op for inductor by youkaichao in https://github.com/vllm-project/vllm/pull/8384
* [Hotfix][VLM] Fixing max position embeddings for Pixtral by ywang96 in https://github.com/vllm-project/vllm/pull/8399
* [Bugfix] Fix InternVL2 inference with various num_patches by Isotr0py in https://github.com/vllm-project/vllm/pull/8375
* [Model] Support multiple images for qwen-vl by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8247
* [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by lnykww in https://github.com/vllm-project/vllm/pull/8403
* [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by vegaluisjose in https://github.com/vllm-project/vllm/pull/8423
* [Bugfix] Offline mode fix by joerunde in https://github.com/vllm-project/vllm/pull/8376
* [multi-step] add flashinfer backend by SolitaryThinker in https://github.com/vllm-project/vllm/pull/7928
* [Core] Add engine option to return only deltas or final output by njhill in https://github.com/vllm-project/vllm/pull/7381
* [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8427
* [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by ywang96 in https://github.com/vllm-project/vllm/pull/8425
* [CI/Build] Disable multi-node test for InternVL2 by ywang96 in https://github.com/vllm-project/vllm/pull/8428
* [Hotfix][Pixtral] Fix multiple images bugs by patrickvonplaten in https://github.com/vllm-project/vllm/pull/8415
* [Bugfix] Fix weight loading issue by rename variable. by wenxcs in https://github.com/vllm-project/vllm/pull/8293
* [Misc] Update Pixtral example by ywang96 in https://github.com/vllm-project/vllm/pull/8431
* [BugFix] fix group_topk by dsikka in https://github.com/vllm-project/vllm/pull/8430
* [Core] Factor out input preprocessing to a separate class by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7329
* [Bugfix] Mapping physical device indices for e2e test utils by ShangmingCai in https://github.com/vllm-project/vllm/pull/8290
* [Bugfix] Bump fastapi and pydantic version by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8435
* [CI/Build] Update pixtral tests to use JSON by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8436
* [Bugfix] Fix async log stats by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8417
* [bugfix] torch profiler bug for single gpu with GPUExecutor by SolitaryThinker in https://github.com/vllm-project/vllm/pull/8354
* bump version to v0.6.1.post1 by simon-mo in https://github.com/vllm-project/vllm/pull/8440
* [CI/Build] Enable InternVL2 PP test only on single node by Isotr0py in https://github.com/vllm-project/vllm/pull/8437
* [doc] recommend pip instead of conda by youkaichao in https://github.com/vllm-project/vllm/pull/8446
* [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by jeejeelee in https://github.com/vllm-project/vllm/pull/8442
* [misc][ci] fix quant test by youkaichao in https://github.com/vllm-project/vllm/pull/8449
* [Installation] Gate FastAPI version for Python 3.8 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8456
* [plugin][torch.compile] allow to add custom compile backend by youkaichao in https://github.com/vllm-project/vllm/pull/8445
* [CI/Build] Reorganize models tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7820
* [Doc] Add oneDNN installation to CPU backend documentation by Isotr0py in https://github.com/vllm-project/vllm/pull/8467
* [HotFix] Fix final output truncation with stop string + streaming by njhill in https://github.com/vllm-project/vllm/pull/8468
* bump version to v0.6.1.post2 by simon-mo in https://github.com/vllm-project/vllm/pull/8473
* [Hardware][intel GPU] bump up ipex version to 2.3 by jikunshang in https://github.com/vllm-project/vllm/pull/8365
* [Kernel][Hardware][Amd]Custom paged attention kernel for rocm by charlifu in https://github.com/vllm-project/vllm/pull/8310
* [Model] support minicpm3 by SUDA-HLT-ywfang in https://github.com/vllm-project/vllm/pull/8297
* [torch.compile] fix functionalization by youkaichao in https://github.com/vllm-project/vllm/pull/8480
* [torch.compile] add a flag to disable custom op by youkaichao in https://github.com/vllm-project/vllm/pull/8488
* [TPU] Implement multi-step scheduling by WoosukKwon in https://github.com/vllm-project/vllm/pull/8489
* [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations by chrisociepa in https://github.com/vllm-project/vllm/pull/8490
* [Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel by Isotr0py in https://github.com/vllm-project/vllm/pull/8357
* [Kernel] Enable 8-bit weights in Fused Marlin MoE by ElizaWszola in https://github.com/vllm-project/vllm/pull/8032
* [Frontend] Expose revision arg in OpenAI server by lewtun in https://github.com/vllm-project/vllm/pull/8501
* [BugFix] Fix clean shutdown issues by njhill in https://github.com/vllm-project/vllm/pull/8492
* [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel by sasha0552 in https://github.com/vllm-project/vllm/pull/8506
* [Kernel] AQ AZP 3/4: Asymmetric quantization kernels by ProExpertProg in https://github.com/vllm-project/vllm/pull/7270
* [doc] update doc on testing and debugging by youkaichao in https://github.com/vllm-project/vllm/pull/8514
* [Bugfix] Bind api server port before starting engine by kevin314 in https://github.com/vllm-project/vllm/pull/8491
* [perf bench] set timeout to debug hanging by simon-mo in https://github.com/vllm-project/vllm/pull/8516
* [misc] small qol fixes for release process by simon-mo in https://github.com/vllm-project/vllm/pull/8517
* [Bugfix] Fix 3.12 builds on main by joerunde in https://github.com/vllm-project/vllm/pull/8510
* [refactor] remove triton based sampler by simon-mo in https://github.com/vllm-project/vllm/pull/8524
* [Frontend] Improve Nullable kv Arg Parsing by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8525
* [Misc][Bugfix] Disable guided decoding for mistral tokenizer by ywang96 in https://github.com/vllm-project/vllm/pull/8521
* [torch.compile] register allreduce operations as custom ops by youkaichao in https://github.com/vllm-project/vllm/pull/8526
* [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change by ruisearch42 in https://github.com/vllm-project/vllm/pull/8509
* [Benchmark] Support sample from HF datasets and image input for benchmark_serving by Isotr0py in https://github.com/vllm-project/vllm/pull/8495
* [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models by sroy745 in https://github.com/vllm-project/vllm/pull/7631
* [Feature][kernel] tensor parallelism with bitsandbytes quantization by chenqianfzh in https://github.com/vllm-project/vllm/pull/8434
* [Model] Add mistral function calling format to all models loaded with "mistral" format by patrickvonplaten in https://github.com/vllm-project/vllm/pull/8515
* [Misc] Don't dump contents of kvcache tensors on errors by njhill in https://github.com/vllm-project/vllm/pull/8527
* [Bugfix] Fix TP > 1 for new granite by joerunde in https://github.com/vllm-project/vllm/pull/8544
* [doc] improve installation doc by youkaichao in https://github.com/vllm-project/vllm/pull/8550
* [CI/Build] Excluding kernels/test_gguf.py from ROCm by alexeykondrat in https://github.com/vllm-project/vllm/pull/8520
* [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8012
* [CI/Build] fix Dockerfile.cpu on podman by dtrifiro in https://github.com/vllm-project/vllm/pull/8540
* [Misc] Add argument to disable FastAPI docs by Jeffwan in https://github.com/vllm-project/vllm/pull/8554
* [CI/Build] Avoid CUDA initialization by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8534
* [CI/Build] Update Ruff version by aarnphm in https://github.com/vllm-project/vllm/pull/8469
* [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8157
* [Core] *Prompt* logprobs support in Multi-step by afeldman-nm in https://github.com/vllm-project/vllm/pull/8199
* [Core] zmq: bind only to 127.0.0.1 for local-only usage by russellb in https://github.com/vllm-project/vllm/pull/8543
* [Model] Support Solar Model by shing100 in https://github.com/vllm-project/vllm/pull/8386
* [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call by gshtras in https://github.com/vllm-project/vllm/pull/8380
* [Kernel] Change interface to Mamba selective_state_update for continuous batching by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8039
* [BugFix] Nonzero exit code if MQLLMEngine startup fails by njhill in https://github.com/vllm-project/vllm/pull/8572
* [Bugfix] add `dead_error` property to engine client by joerunde in https://github.com/vllm-project/vllm/pull/8574
* [Kernel] Remove marlin moe templating on thread_m_blocks by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8573
* [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. by sroy745 in https://github.com/vllm-project/vllm/pull/8545
* Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" by ywang96 in https://github.com/vllm-project/vllm/pull/8593
* [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py by KuntaiDu in https://github.com/vllm-project/vllm/pull/8616
* [MISC] remove engine_use_ray in benchmark_throughput.py by jikunshang in https://github.com/vllm-project/vllm/pull/8615
* [Frontend] Use MQLLMEngine for embeddings models too by njhill in https://github.com/vllm-project/vllm/pull/8584
* [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention by charlifu in https://github.com/vllm-project/vllm/pull/8577
* [Core] simplify logits resort in _apply_top_k_top_p by hidva in https://github.com/vllm-project/vllm/pull/8619
* [Doc] Add documentation for GGUF quantization by Isotr0py in https://github.com/vllm-project/vllm/pull/8618
* Create SECURITY.md by simon-mo in https://github.com/vllm-project/vllm/pull/8642
* [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail by alexeykondrat in https://github.com/vllm-project/vllm/pull/8551
* [Misc] guard against change in cuda library name by bnellnm in https://github.com/vllm-project/vllm/pull/8609
* [Bugfix] Fix Phi3.5 mini and MoE LoRA inference by garg-amit in https://github.com/vllm-project/vllm/pull/8571
* [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata by SolitaryThinker in https://github.com/vllm-project/vllm/pull/8474
* [Core] Support Lora lineage and base model metadata management by Jeffwan in https://github.com/vllm-project/vllm/pull/6315
* [Model] Add OLMoE by Muennighoff in https://github.com/vllm-project/vllm/pull/7922
* [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build by alexeykondrat in https://github.com/vllm-project/vllm/pull/8670
* [Bugfix] Validate SamplingParam n is an int by saumya-saran in https://github.com/vllm-project/vllm/pull/8548
* [Misc] Show AMD GPU topology in `collect_env.py` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8649
* [Bugfix] Config.__init__() got an unexpected keyword argument 'engine' api_server args by Juelianqvq in https://github.com/vllm-project/vllm/pull/8556
* [Bugfix][Core] Fix tekken edge case for mistral tokenizer by patrickvonplaten in https://github.com/vllm-project/vllm/pull/8640
* [Doc] neuron documentation update by omrishiv in https://github.com/vllm-project/vllm/pull/8671
* [Hardware][AWS] update neuron to 2.20 by omrishiv in https://github.com/vllm-project/vllm/pull/8676
* [Bugfix] Fix incorrect llava next feature size calculation by zyddnys in https://github.com/vllm-project/vllm/pull/8496
* [Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8673
* [MISC] add support custom_op check by jikunshang in https://github.com/vllm-project/vllm/pull/8557
* [Core] Factor out common code in `SequenceData` and `Sequence` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8675
* [beam search] add output for manually checking the correctness by youkaichao in https://github.com/vllm-project/vllm/pull/8684
* [Kernel] Build flash-attn from source by ProExpertProg in https://github.com/vllm-project/vllm/pull/8245
* [VLM] Use `SequenceData.from_token_counts` to create dummy data by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8687
* [Doc] Fix typo in AMD installation guide by Imss27 in https://github.com/vllm-project/vllm/pull/8689
* [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 by rasmith in https://github.com/vllm-project/vllm/pull/8646
* [dbrx] refactor dbrx experts to extend FusedMoe class by divakar-amd in https://github.com/vllm-project/vllm/pull/8518
* [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8643
* [Bugfix] Refactor composite weight loading logic by Isotr0py in https://github.com/vllm-project/vllm/pull/8656
* [ci][build] fix vllm-flash-attn by youkaichao in https://github.com/vllm-project/vllm/pull/8699
* [Model] Refactor BLIP/BLIP-2 to support composite model loading by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8407
* [Misc] Use NamedTuple in Multi-image example by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8705
* [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler by statelesshz in https://github.com/vllm-project/vllm/pull/8703
* [Model][VLM] Add LLaVA-Onevision model support by litianjian in https://github.com/vllm-project/vllm/pull/8486
* [SpecDec][Misc] Cleanup, remove bonus token logic. by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/8701
* [build] enable existing pytorch (for GH200, aarch64, nightly) by youkaichao in https://github.com/vllm-project/vllm/pull/8713
* [misc] upgrade mistral-common by youkaichao in https://github.com/vllm-project/vllm/pull/8715
* [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building by tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8702
* [Bugfix] Fix CPU CMake build by ProExpertProg in https://github.com/vllm-project/vllm/pull/8723
* [Bugfix] fix docker build for xpu by yma11 in https://github.com/vllm-project/vllm/pull/8652
* [Core][Frontend] Support Passing Multimodal Processor Kwargs by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8657
* [Hardware][CPU] Refactor CPU model runner by Isotr0py in https://github.com/vllm-project/vllm/pull/8729
* [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/8733
* [Model] Support pp for qwen2-vl by liuyanyi in https://github.com/vllm-project/vllm/pull/8696
* [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size by janimo in https://github.com/vllm-project/vllm/pull/8707
* [CI/Build] use setuptools-scm to set __version__ by dtrifiro in https://github.com/vllm-project/vllm/pull/4738
* [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin by LucasWilkinson in https://github.com/vllm-project/vllm/pull/7701
* [Kernel][LoRA] Add assertion for punica sgmv kernels by jeejeelee in https://github.com/vllm-project/vllm/pull/7585
* [Core] Allow IPv6 in VLLM_HOST_IP with zmq by russellb in https://github.com/vllm-project/vllm/pull/8575
* Fix typical acceptance sampler with correct recovered token ids by jiqing-feng in https://github.com/vllm-project/vllm/pull/8562
* Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse by alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8335
* [Hardware][AMD] ROCm6.2 upgrade by hongxiayang in https://github.com/vllm-project/vllm/pull/8674
* Fix tests in test_scheduler.py that fail with BlockManager V2 by sroy745 in https://github.com/vllm-project/vllm/pull/8728
* re-implement beam search on top of vllm core by youkaichao in https://github.com/vllm-project/vllm/pull/8726
* Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" by simon-mo in https://github.com/vllm-project/vllm/pull/8750
* [MISC] Skip dumping inputs when unpicklable by comaniac in https://github.com/vllm-project/vllm/pull/8744
* [Core][Model] Support loading weights by ID within models by petersalas in https://github.com/vllm-project/vllm/pull/7931
* [Model] Expose Phi3v num_crops as a mm_processor_kwarg by alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8658
* [Bugfix] Fix potentially unsafe custom allreduce synchronization by hanzhi713 in https://github.com/vllm-project/vllm/pull/8558
* [Kernel] Split Marlin MoE kernels into multiple files by ElizaWszola in https://github.com/vllm-project/vllm/pull/8661
* [Frontend] Batch inference for llm.chat() API by aandyw in https://github.com/vllm-project/vllm/pull/8648
* [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` by LucasWilkinson in https://github.com/vllm-project/vllm/pull/8748
* [CI/Build] fix setuptools-scm usage by dtrifiro in https://github.com/vllm-project/vllm/pull/8771
* [misc] soft drop beam search by youkaichao in https://github.com/vllm-project/vllm/pull/8763
* [Misc] Upgrade bitsandbytes to the latest version 0.44.0 by jeejeelee in https://github.com/vllm-project/vllm/pull/8768
* [Core][Bugfix] Support prompt_logprobs returned with speculative decoding by tjohnson31415 in https://github.com/vllm-project/vllm/pull/8047
* [Core] Adding Priority Scheduling by apatke in https://github.com/vllm-project/vllm/pull/5958
* [Bugfix] Use heartbeats instead of health checks by joerunde in https://github.com/vllm-project/vllm/pull/8583
* Fix test_schedule_swapped_simple in test_scheduler.py by sroy745 in https://github.com/vllm-project/vllm/pull/8780
* [Bugfix][Kernel] Implement acquire/release polyfill for Pascal by sasha0552 in https://github.com/vllm-project/vllm/pull/8776
* Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 by sroy745 in https://github.com/vllm-project/vllm/pull/8752
* [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv by zifeitong in https://github.com/vllm-project/vllm/pull/8250
* [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/8770
* [Bugfix] load fc bias from config for eagle by sohamparikh in https://github.com/vllm-project/vllm/pull/8790
* [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer by agt in https://github.com/vllm-project/vllm/pull/8672
* [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node by darthhexx in https://github.com/vllm-project/vllm/pull/8767
* [Misc] Fix minor typo in scheduler by wooyeonlee0 in https://github.com/vllm-project/vllm/pull/8765
* [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade by hongxiayang in https://github.com/vllm-project/vllm/pull/8777
* [Kernel] Fullgraph and opcheck tests by bnellnm in https://github.com/vllm-project/vllm/pull/8479
* [Misc] Add extra deps for openai server image by jeejeelee in https://github.com/vllm-project/vllm/pull/8792
* [VLM][Bugfix] enable internvl running with num_scheduler_steps > 1 by DefTruth in https://github.com/vllm-project/vllm/pull/8614
* [Core] Rename `PromptInputs` and `inputs`, with backwards compatibility by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8760
* [Frontend] MQLLMEngine supports profiling. by abatom in https://github.com/vllm-project/vllm/pull/8761
* [Misc] Support FP8 MoE for compressed-tensors by mgoin in https://github.com/vllm-project/vllm/pull/8588
* Revert "rename PromptInputs and inputs with backward compatibility (8760) by simon-mo in https://github.com/vllm-project/vllm/pull/8810
* [Model] Add support for the multi-modal Llama 3.2 model by heheda12345 in https://github.com/vllm-project/vllm/pull/8811
* [Doc] Update doc for Transformers 4.45 by ywang96 in https://github.com/vllm-project/vllm/pull/8817
* [Misc] Support quantization of MllamaForCausalLM by mgoin in https://github.com/vllm-project/vllm/pull/8822

New Contributors
* blueyo0 made their first contribution in https://github.com/vllm-project/vllm/pull/8338
* lnykww made their first contribution in https://github.com/vllm-project/vllm/pull/8403
* vegaluisjose made their first contribution in https://github.com/vllm-project/vllm/pull/8423
* chrisociepa made their first contribution in https://github.com/vllm-project/vllm/pull/8490
* lewtun made their first contribution in https://github.com/vllm-project/vllm/pull/8501
* russellb made their first contribution in https://github.com/vllm-project/vllm/pull/8543
* shing100 made their first contribution in https://github.com/vllm-project/vllm/pull/8386
* hidva made their first contribution in https://github.com/vllm-project/vllm/pull/8619
* Muennighoff made their first contribution in https://github.com/vllm-project/vllm/pull/7922
* saumya-saran made their first contribution in https://github.com/vllm-project/vllm/pull/8548
* zyddnys made their first contribution in https://github.com/vllm-project/vllm/pull/8496
* Imss27 made their first contribution in https://github.com/vllm-project/vllm/pull/8689
* statelesshz made their first contribution in https://github.com/vllm-project/vllm/pull/8703
* litianjian made their first contribution in https://github.com/vllm-project/vllm/pull/8486
* yma11 made their first contribution in https://github.com/vllm-project/vllm/pull/8652
* liuyanyi made their first contribution in https://github.com/vllm-project/vllm/pull/8696
* janimo made their first contribution in https://github.com/vllm-project/vllm/pull/8707
* jiqing-feng made their first contribution in https://github.com/vllm-project/vllm/pull/8562
* aandyw made their first contribution in https://github.com/vllm-project/vllm/pull/8648
* apatke made their first contribution in https://github.com/vllm-project/vllm/pull/5958
* sohamparikh made their first contribution in https://github.com/vllm-project/vllm/pull/8790
* darthhexx made their first contribution in https://github.com/vllm-project/vllm/pull/8767
* abatom made their first contribution in https://github.com/vllm-project/vllm/pull/8761
* heheda12345 made their first contribution in https://github.com/vllm-project/vllm/pull/8811

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.1...v0.6.2

0.6.1.post2

Not secure
Highlights
* This release contains an important bugfix related to token streaming combined with stop strings (8468); a sketch of the affected usage pattern follows below

What's Changed
* [CI/Build] Enable InternVL2 PP test only on single node by Isotr0py in https://github.com/vllm-project/vllm/pull/8437
* [doc] recommend pip instead of conda by youkaichao in https://github.com/vllm-project/vllm/pull/8446
* [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 by jeejeelee in https://github.com/vllm-project/vllm/pull/8442
* [misc][ci] fix quant test by youkaichao in https://github.com/vllm-project/vllm/pull/8449
* [Installation] Gate FastAPI version for Python 3.8 by DarkLight1337 in https://github.com/vllm-project/vllm/pull/8456
* [plugin][torch.compile] allow to add custom compile backend by youkaichao in https://github.com/vllm-project/vllm/pull/8445
* [CI/Build] Reorganize models tests by DarkLight1337 in https://github.com/vllm-project/vllm/pull/7820
* [Doc] Add oneDNN installation to CPU backend documentation by Isotr0py in https://github.com/vllm-project/vllm/pull/8467
* [HotFix] Fix final output truncation with stop string + streaming by njhill in https://github.com/vllm-project/vllm/pull/8468
* bump version to v0.6.1.post2 by simon-mo in https://github.com/vllm-project/vllm/pull/8473

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.6.1.post1...v0.6.1.post2
