vLLM

Latest version: v0.4.2

0.4.2

Highlights
Features
* [Chunked prefill is ready for testing](https://docs.vllm.ai/en/latest/models/performance.html#chunked-prefill)! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decodes (4580); see the sketch after this list
* Speculative decoding functionalities: logprobs (4378), ngram (4237)
* Support FlashInfer as attention backend (4353)
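
The chunked prefill and FlashInfer items above can be tried from the offline API. Below is a minimal sketch, assuming the `enable_chunked_prefill`/`max_num_batched_tokens` engine arguments and the `VLLM_ATTENTION_BACKEND` environment variable described in the docs linked above; verify the names there if they have since changed.

```python
# Minimal sketch (not an excerpt from the release): chunked prefill plus the
# FlashInfer attention backend. Names below are assumptions based on the
# documentation linked above.
import os

# The attention backend is chosen via an environment variable and must be set
# before vllm is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any supported model
    enable_chunked_prefill=True,      # split long prompts into prefill chunks
    max_num_batched_tokens=2048,      # cap on tokens scheduled per engine step
)

outputs = llm.generate(
    ["Explain in one sentence how chunked prefill reduces inter-token latency."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```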

Models and Enhancements
* Add support for Phi-3-mini (4298, 4372, 4380); a loading sketch follows this list
* Add more histogram metrics (2764, 4523)
* Full tensor parallelism for LoRA layers (3524)
* Expanding Marlin kernel to support all GPTQ models (3922, 4466, 4533)
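
As a quick way to exercise the new Phi-3-mini support, the sketch below loads the public Hugging Face checkpoint; the prompt follows Phi-3's chat template, and whether `trust_remote_code` is needed depends on your `transformers` version (assumptions, not part of the release notes).

```python
# Hypothetical smoke test for the new Phi-3-mini support.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)

# Phi-3 chat-style prompt (shown here for illustration).
prompt = "<|user|>\nWhat does vLLM do?<|end|>\n<|assistant|>\n"
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print(out[0].outputs[0].text)
```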

Dependency Upgrade
* Upgrade to `torch==2.3.0` (4454)
* Upgrade to `tensorizer==2.9.0` (4467)
* Expansion of AMD test suite (4267)

Progress and Dev Experience
* Centralize and document all environment variables (4548, 4574)
* Progress towards fully typed codebase (4337, 4427, 4555, 4450)
* Progress towards pipeline parallelism (4512, 4444, 4566)
* Progress towards multiprocessing based executors (4348, 4402, 4419)
* Progress towards FP8 support (4343, 4332, 4527)


What's Changed
* [Core][Distributed] use existing torch.cuda.device context manager by youkaichao in https://github.com/vllm-project/vllm/pull/4318
* [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by ywang96 in https://github.com/vllm-project/vllm/pull/4279
* [Bugfix] Fix marlin kernel crash on H100 by alexm-nm in https://github.com/vllm-project/vllm/pull/4218
* [Doc] Add note for docker user by youkaichao in https://github.com/vllm-project/vllm/pull/4340
* [Misc] Use public API in benchmark_throughput by zifeitong in https://github.com/vllm-project/vllm/pull/4300
* [Model] Adds Phi-3 support by caiom in https://github.com/vllm-project/vllm/pull/4298
* [Core] Move ray_utils.py from `engine` to `executor` package by njhill in https://github.com/vllm-project/vllm/pull/4347
* [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by Isotr0py in https://github.com/vllm-project/vllm/pull/4324
* [CI/Build] Adding functionality to reset the node's GPUs before processing. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4213
* [Doc] README Phi-3 name fix. by caiom in https://github.com/vllm-project/vllm/pull/4372
* [Core]refactor aqlm quant ops by jikunshang in https://github.com/vllm-project/vllm/pull/4351
* [Mypy] Typing lora folder by rkooo567 in https://github.com/vllm-project/vllm/pull/4337
* [Misc] Optimize flash attention backend log by esmeetu in https://github.com/vllm-project/vllm/pull/4368
* [Core] Add `shutdown()` method to `ExecutorBase` by njhill in https://github.com/vllm-project/vllm/pull/4349
* [Core] Move function tracing setup to util function by njhill in https://github.com/vllm-project/vllm/pull/4352
* [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by hongxiayang in https://github.com/vllm-project/vllm/pull/4376
* [Bugfix] Fix parameter name in `get_tokenizer` by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4107
* [Frontend] Add --log-level option to api server by normster in https://github.com/vllm-project/vllm/pull/4377
* [CI] Disable non-lazy string operation on logging by rkooo567 in https://github.com/vllm-project/vllm/pull/4326
* [Core] Refactoring sampler and support prompt logprob for chunked prefill by rkooo567 in https://github.com/vllm-project/vllm/pull/4309
* [Misc][Refactor] Generalize linear_method to be quant_method by comaniac in https://github.com/vllm-project/vllm/pull/4373
* [Misc] add RFC issue template by youkaichao in https://github.com/vllm-project/vllm/pull/4401
* [Core] Introduce `DistributedGPUExecutor` abstract class by njhill in https://github.com/vllm-project/vllm/pull/4348
* [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by pcmoritz in https://github.com/vllm-project/vllm/pull/4343
* [Frontend][Bugfix] Disallow extra fields in OpenAI API by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4355
* [Misc] Fix logger format typo by esmeetu in https://github.com/vllm-project/vllm/pull/4396
* [ROCm][Hardware][AMD] Enable group query attention for triton FA by hongxiayang in https://github.com/vllm-project/vllm/pull/4406
* [Kernel] Full Tensor Parallelism for LoRA Layers by FurtherAI in https://github.com/vllm-project/vllm/pull/3524
* [Model] Phi-3 4k sliding window temp. fix by caiom in https://github.com/vllm-project/vllm/pull/4380
* [Bugfix][Core] Fix get decoding config from ray by esmeetu in https://github.com/vllm-project/vllm/pull/4335
* [Bugfix] Abort requests when the connection to /v1/completions is interrupted by chestnut-Q in https://github.com/vllm-project/vllm/pull/4363
* [BugFix] Fix `min_tokens` when `eos_token_id` is None by njhill in https://github.com/vllm-project/vllm/pull/4389
* ✨ support local cache for models by prashantgupta24 in https://github.com/vllm-project/vllm/pull/4374
* [BugFix] Fix return type of executor execute_model methods by njhill in https://github.com/vllm-project/vllm/pull/4402
* [BugFix] Resolved Issues For LinearMethod --> QuantConfig by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4418
* [Misc] fix typo in llm_engine init logging by DefTruth in https://github.com/vllm-project/vllm/pull/4428
* Add more Prometheus metrics by ronensc in https://github.com/vllm-project/vllm/pull/2764
* [CI] clean docker cache for neuron by simon-mo in https://github.com/vllm-project/vllm/pull/4441
* [mypy][5/N] Support all typing on model executor by rkooo567 in https://github.com/vllm-project/vllm/pull/4427
* [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3922
* [CI] hotfix: soft fail neuron test by simon-mo in https://github.com/vllm-project/vllm/pull/4458
* [Core][Distributed] use cpu group to broadcast metadata in cpu by youkaichao in https://github.com/vllm-project/vllm/pull/4444
* [Misc] Upgrade to `torch==2.3.0` by mgoin in https://github.com/vllm-project/vllm/pull/4454
* [Bugfix][Kernel] Fix compute_type for MoE kernel by WoosukKwon in https://github.com/vllm-project/vllm/pull/4463
* [Core]Refactor gptq_marlin ops by jikunshang in https://github.com/vllm-project/vllm/pull/4466
* [BugFix] fix num_lookahead_slots missing in async executor by leiwen83 in https://github.com/vllm-project/vllm/pull/4165
* [Doc] add visualization for multi-stage dockerfile by prashantgupta24 in https://github.com/vllm-project/vllm/pull/4456
* [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4332
* [Frontend] Support complex message content for chat completions endpoint by fgreinacher in https://github.com/vllm-project/vllm/pull/3467
* [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by alpayariyak in https://github.com/vllm-project/vllm/pull/4467
* [Bugfix][Minor] Make ignore_eos effective by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4468
* fix_tokenizer_snapshot_download_bug by kingljl in https://github.com/vllm-project/vllm/pull/4493
* Unable to find Punica extension issue during source code installation by kingljl in https://github.com/vllm-project/vllm/pull/4494
* [Core] Centralize GPU Worker construction by njhill in https://github.com/vllm-project/vllm/pull/4419
* [Misc][Typo] type annotation fix by HarryWu99 in https://github.com/vllm-project/vllm/pull/4495
* [Misc] fix typo in block manager by Juelianqvq in https://github.com/vllm-project/vllm/pull/4453
* Allow user to define whitespace pattern for outlines by robcaulk in https://github.com/vllm-project/vllm/pull/4305
* [Misc]Add customized information for models by jeejeelee in https://github.com/vllm-project/vllm/pull/4132
* [Test] Add ignore_eos test by rkooo567 in https://github.com/vllm-project/vllm/pull/4519
* [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by AnyISalIn in https://github.com/vllm-project/vllm/pull/4173
* [Bugfix] Fix 307 Redirect for `/metrics` by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4523
* [Doc] update(example model): for OpenAI compatible serving by fpaupier in https://github.com/vllm-project/vllm/pull/4503
* [Bugfix] Use random seed if seed is -1 by sasha0552 in https://github.com/vllm-project/vllm/pull/4531
* [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by tjohnson31415 in https://github.com/vllm-project/vllm/pull/4534
* [Speculative decoding] Add ngram prompt lookup decoding by leiwen83 in https://github.com/vllm-project/vllm/pull/4237
* [Core] Enable prefix caching with block manager v2 enabled by leiwen83 in https://github.com/vllm-project/vllm/pull/4142
* [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by njhill in https://github.com/vllm-project/vllm/pull/4357
* [Kernel] Update fused_moe tuning script for FP8 by pcmoritz in https://github.com/vllm-project/vllm/pull/4457
* [Bugfix] Add validation for seed by sasha0552 in https://github.com/vllm-project/vllm/pull/4529
* [Bugfix][Core] Fix and refactor logging stats by esmeetu in https://github.com/vllm-project/vllm/pull/4336
* [Core][Distributed] fix pynccl del error by youkaichao in https://github.com/vllm-project/vllm/pull/4508
* [Misc] Remove Mixtral device="cuda" declarations by pcmoritz in https://github.com/vllm-project/vllm/pull/4543
* [Misc] Fix expert_ids shape in MoE by WoosukKwon in https://github.com/vllm-project/vllm/pull/4517
* [MISC] Rework logger to enable pythonic custom logging configuration to be provided by tdg5 in https://github.com/vllm-project/vllm/pull/4273
* [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption by rkooo567 in https://github.com/vllm-project/vllm/pull/4451
* [CI]Add regression tests to ensure the async engine generates metrics by ronensc in https://github.com/vllm-project/vllm/pull/4524
* [mypy][6/N] Fix all the core subdirectory typing by rkooo567 in https://github.com/vllm-project/vllm/pull/4450
* [Core][Distributed] enable multiple tp group by youkaichao in https://github.com/vllm-project/vllm/pull/4512
* [Kernel] Support running GPTQ 8-bit models in Marlin by alexm-nm in https://github.com/vllm-project/vllm/pull/4533
* [mypy][7/N] Cover all directories by rkooo567 in https://github.com/vllm-project/vllm/pull/4555
* [Misc] Exclude the `tests` directory from being packaged by itechbear in https://github.com/vllm-project/vllm/pull/4552
* [BugFix] Include target-device specific requirements.txt in sdist by markmc in https://github.com/vllm-project/vllm/pull/4559
* [Misc] centralize all usage of environment variables by youkaichao in https://github.com/vllm-project/vllm/pull/4548
* [kernel] fix sliding window in prefix prefill Triton kernel by mmoskal in https://github.com/vllm-project/vllm/pull/4405
* [CI/Build] AMD CI pipeline with extended set of tests. by Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4267
* [Core] Ignore infeasible swap requests. by rkooo567 in https://github.com/vllm-project/vllm/pull/4557
* [Core][Distributed] enable allreduce for multiple tp groups by youkaichao in https://github.com/vllm-project/vllm/pull/4566
* [BugFix] Prevent the task of `_force_log` from being garbage collected by Atry in https://github.com/vllm-project/vllm/pull/4567
* [Misc] remove chunk detected debug logs by DefTruth in https://github.com/vllm-project/vllm/pull/4571
* [Doc] add env vars to the doc by youkaichao in https://github.com/vllm-project/vllm/pull/4572
* [Core][Model runner refactoring 1/N] Refactor attn metadata term by rkooo567 in https://github.com/vllm-project/vllm/pull/4518
* [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None by mgoin in https://github.com/vllm-project/vllm/pull/4586
* Fix/async chat serving by schoennenbeck in https://github.com/vllm-project/vllm/pull/2727
* [Kernel] Use flashinfer for decoding by LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4353
* [Speculative decoding] Support target-model logprobs by cadedaniel in https://github.com/vllm-project/vllm/pull/4378
* [Misc] add installation time env vars by youkaichao in https://github.com/vllm-project/vllm/pull/4574
* [Misc][Refactor] Introduce ExecuteModelData by comaniac in https://github.com/vllm-project/vllm/pull/4540
* [Doc] Chunked Prefill Documentation by rkooo567 in https://github.com/vllm-project/vllm/pull/4580
* [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) by mgoin in https://github.com/vllm-project/vllm/pull/4527
* [CI] check size of the wheels by simon-mo in https://github.com/vllm-project/vllm/pull/4319
* [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics by DearPlanet in https://github.com/vllm-project/vllm/pull/3937
* bump version to v0.4.2 by simon-mo in https://github.com/vllm-project/vllm/pull/4600
* [CI] Reduce wheel size by not shipping debug symbols by simon-mo in https://github.com/vllm-project/vllm/pull/4602

New Contributors
* zifeitong made their first contribution in https://github.com/vllm-project/vllm/pull/4300
* caiom made their first contribution in https://github.com/vllm-project/vllm/pull/4298
* Alexei-V-Ivanov-AMD made their first contribution in https://github.com/vllm-project/vllm/pull/4213
* normster made their first contribution in https://github.com/vllm-project/vllm/pull/4377
* FurtherAI made their first contribution in https://github.com/vllm-project/vllm/pull/3524
* chestnut-Q made their first contribution in https://github.com/vllm-project/vllm/pull/4363
* prashantgupta24 made their first contribution in https://github.com/vllm-project/vllm/pull/4374
* fgreinacher made their first contribution in https://github.com/vllm-project/vllm/pull/3467
* alpayariyak made their first contribution in https://github.com/vllm-project/vllm/pull/4467
* HarryWu99 made their first contribution in https://github.com/vllm-project/vllm/pull/4495
* Juelianqvq made their first contribution in https://github.com/vllm-project/vllm/pull/4453
* robcaulk made their first contribution in https://github.com/vllm-project/vllm/pull/4305
* AnyISalIn made their first contribution in https://github.com/vllm-project/vllm/pull/4173
* sasha0552 made their first contribution in https://github.com/vllm-project/vllm/pull/4531
* tdg5 made their first contribution in https://github.com/vllm-project/vllm/pull/4273
* itechbear made their first contribution in https://github.com/vllm-project/vllm/pull/4552
* markmc made their first contribution in https://github.com/vllm-project/vllm/pull/4559
* Atry made their first contribution in https://github.com/vllm-project/vllm/pull/4567
* schoennenbeck made their first contribution in https://github.com/vllm-project/vllm/pull/2727
* DearPlanet made their first contribution in https://github.com/vllm-project/vllm/pull/3937

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.4.1...v0.4.2

0.4.1

Highlights

Features
* Support and enhance Command R+ (3829), MiniCPM (3893), Meta Llama 3 (4175, 4182), and Mixtral 8x22B (4073, 4002)
* Support private model registration, and update our model support policy (3871, 3948)
* Support PyTorch 2.2.1 and Triton 2.2.0 (4061, 4079, 3805, 3904, 4271)
* Add option to use LM Format Enforcer for guided decoding (3868)
* Add option to skip tokenizer and detokenizer initialization (3748); see the sketch after this list
* Add option to load models using `tensorizer` (3476)
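
A minimal sketch of the tokenizer-skipping option (3748): the `skip_tokenizer_init` keyword, the `prompt_token_ids` form of `generate`, and the `detokenize` sampling flag (3749) are assumptions based on those PRs rather than quoted documentation.

```python
# Sketch: run the engine without initializing a tokenizer/detokenizer.
# With the tokenizer skipped, prompts must be supplied as token ids and
# outputs are read back as token ids.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", skip_tokenizer_init=True)

# Token ids produced by an external tokenizer (placeholder values).
prompt_token_ids = [[2, 100, 200, 300]]

outputs = llm.generate(
    prompt_token_ids=prompt_token_ids,
    sampling_params=SamplingParams(max_tokens=16, detokenize=False),
)
print(outputs[0].outputs[0].token_ids)
```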

Enhancements
* vLLM is now mostly type checked by `mypy` (3816, 4006, 4161, 4043)
* Progress towards chunked prefill scheduler (3550, 3853, 4280, 3884)
* Progress towards speculative decoding (3250, 3706, 3894)
* Initial FP8 support with dynamic per-tensor scaling (4118)

Hardware
* Intel CPU inference backend is added (3993, 3634)
* AMD backend is enhanced with Triton kernel and e4m3fn KV cache (3643, 3290)

What's Changed
* [Kernel] Layernorm performance optimization by mawong-amd in https://github.com/vllm-project/vllm/pull/3662
* [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by youkaichao in https://github.com/vllm-project/vllm/pull/3746
* [CI/Build] Make Marlin Tests Green by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3753
* [Misc] Minor fixes in requirements.txt by WoosukKwon in https://github.com/vllm-project/vllm/pull/3769
* [Misc] Some minor simplifications to detokenization logic by njhill in https://github.com/vllm-project/vllm/pull/3670
* [Misc] Fix Benchmark TTFT Calculation for Chat Completions by ywang96 in https://github.com/vllm-project/vllm/pull/3768
* [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by cadedaniel in https://github.com/vllm-project/vllm/pull/3250
* [Misc] Add support for new autogptq checkpoint_format by Qubitium in https://github.com/vllm-project/vllm/pull/3689
* [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by cadedaniel in https://github.com/vllm-project/vllm/pull/3783
* [Hardware][Intel] Add CPU inference backend by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3634
* [HotFix] [CI/Build] Minor fix for CPU backend CI by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3787
* [Frontend][Bugfix] allow using the default middleware with a root path by A-Mahla in https://github.com/vllm-project/vllm/pull/3788
* [Doc] Fix vLLMEngine Doc Page by ywang96 in https://github.com/vllm-project/vllm/pull/3791
* [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by youkaichao in https://github.com/vllm-project/vllm/pull/3801
* Fix crash when try torch.cuda.set_device in worker by leiwen83 in https://github.com/vllm-project/vllm/pull/3770
* [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by mgoin in https://github.com/vllm-project/vllm/pull/3798
* [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by youkaichao in https://github.com/vllm-project/vllm/pull/3803
* [Speculative decoding] Adding configuration object for speculative decoding by cadedaniel in https://github.com/vllm-project/vllm/pull/3706
* [BugFix] Use different mechanism to get vllm version in `is_cpu()` by njhill in https://github.com/vllm-project/vllm/pull/3804
* [Doc] Update README.md by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3806
* [Doc] Update contribution guidelines for better onboarding by michaelfeil in https://github.com/vllm-project/vllm/pull/3819
* [3/N] Refactor scheduler for chunked prefill scheduling by rkooo567 in https://github.com/vllm-project/vllm/pull/3550
* Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by AdrianAbeyta in https://github.com/vllm-project/vllm/pull/3290
* [Misc] Publish 3rd meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/3835
* Fixes the argument for local_tokenizer_group by sighingnow in https://github.com/vllm-project/vllm/pull/3754
* [Core] Enable hf_transfer by default if available by michaelfeil in https://github.com/vllm-project/vllm/pull/3817
* [Bugfix] Add kv_scale input parameter to CPU backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/3840
* [Core] [Frontend] Make detokenization optional by mgerstgrasser in https://github.com/vllm-project/vllm/pull/3749
* [Bugfix] Fix args in benchmark_serving by CatherineSue in https://github.com/vllm-project/vllm/pull/3836
* [Benchmark] Refactor sample_requests in benchmark_throughput by gty111 in https://github.com/vllm-project/vllm/pull/3613
* [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by youkaichao in https://github.com/vllm-project/vllm/pull/3805
* [Hardware][CPU] Update cpu torch to match default of 2.2.1 by mgoin in https://github.com/vllm-project/vllm/pull/3854
* [Model] Cohere CommandR+ by saurabhdash2512 in https://github.com/vllm-project/vllm/pull/3829
* [Core] improve robustness of pynccl by youkaichao in https://github.com/vllm-project/vllm/pull/3860
* [Doc]Add asynchronous engine arguments to documentation. by SeanGallen in https://github.com/vllm-project/vllm/pull/3810
* [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by youkaichao in https://github.com/vllm-project/vllm/pull/3859
* [Misc] Add pytest marker to opt-out of global test cleanup by cadedaniel in https://github.com/vllm-project/vllm/pull/3863
* [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by cadedaniel in https://github.com/vllm-project/vllm/pull/3864
* [Bugfix] Fixing requirements.txt by noamgat in https://github.com/vllm-project/vllm/pull/3865
* [Misc] Define common requirements by WoosukKwon in https://github.com/vllm-project/vllm/pull/3841
* Add option to completion API to truncate prompt tokens by tdoublep in https://github.com/vllm-project/vllm/pull/3144
* [Chunked Prefill][4/n] Chunked prefill scheduler. by rkooo567 in https://github.com/vllm-project/vllm/pull/3853
* [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by Isotr0py in https://github.com/vllm-project/vllm/pull/3869
* [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by youkaichao in https://github.com/vllm-project/vllm/pull/3889
* [Core] enable out-of-tree model register by youkaichao in https://github.com/vllm-project/vllm/pull/3871
* [WIP][Core] latency optimization by youkaichao in https://github.com/vllm-project/vllm/pull/3890
* [Bugfix] Fix Llava inference with Tensor Parallelism. by Isotr0py in https://github.com/vllm-project/vllm/pull/3883
* [Model] add minicpm by SUDA-HLT-ywfang in https://github.com/vllm-project/vllm/pull/3893
* [Bugfix] Added Command-R GPTQ support by egortolmachev in https://github.com/vllm-project/vllm/pull/3849
* [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration by Ki6an in https://github.com/vllm-project/vllm/pull/3767
* [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by mawong-amd in https://github.com/vllm-project/vllm/pull/3782
* [BugFix][Model] Fix commandr RoPE max_position_embeddings by esmeetu in https://github.com/vllm-project/vllm/pull/3919
* [Core] separate distributed_init from worker by youkaichao in https://github.com/vllm-project/vllm/pull/3904
* [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by cadedaniel in https://github.com/vllm-project/vllm/pull/3837
* [Bugfix] Fix KeyError on loading GPT-NeoX by jsato8094 in https://github.com/vllm-project/vllm/pull/3925
* [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by jpvillam-amd in https://github.com/vllm-project/vllm/pull/3643
* [Misc] Avoid loading incorrect LoRA config by jeejeelee in https://github.com/vllm-project/vllm/pull/3777
* [Benchmark] Add cpu options to bench scripts by PZD-CHINA in https://github.com/vllm-project/vllm/pull/3915
* [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by zhaotyer in https://github.com/vllm-project/vllm/pull/3955
* [Bugfix] Fix logits processor when prompt_logprobs is not None by huyiwen in https://github.com/vllm-project/vllm/pull/3899
* [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by tjohnson31415 in https://github.com/vllm-project/vllm/pull/3876
* [Bugfix][ROCm] Add numba to Dockerfile.rocm by WoosukKwon in https://github.com/vllm-project/vllm/pull/3962
* [Model][AMD] ROCm support for 256 head dims for Gemma by jamestwhedbee in https://github.com/vllm-project/vllm/pull/3972
* [Doc] Add doc to state our model support policy by youkaichao in https://github.com/vllm-project/vllm/pull/3948
* [Bugfix] Remove key sorting for `guided_json` parameter in OpenAi compatible Server by dmarasco in https://github.com/vllm-project/vllm/pull/3945
* [Doc] Fix getting stared to use publicly available model by fpaupier in https://github.com/vllm-project/vllm/pull/3963
* [Bugfix] handle hf_config with architectures == None by tjohnson31415 in https://github.com/vllm-project/vllm/pull/3982
* [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by youkaichao in https://github.com/vllm-project/vllm/pull/3950
* [Core][5/N] Fully working chunked prefill e2e by rkooo567 in https://github.com/vllm-project/vllm/pull/3884
* [Core][Model] Use torch.compile to accelerate layernorm in commandr by youkaichao in https://github.com/vllm-project/vllm/pull/3985
* [Test] Add xformer and flash attn tests by rkooo567 in https://github.com/vllm-project/vllm/pull/3961
* [Misc] refactor ops and cache_ops layer by jikunshang in https://github.com/vllm-project/vllm/pull/3913
* [Doc][Installation] delete python setup.py develop by youkaichao in https://github.com/vllm-project/vllm/pull/3989
* [Kernel] Fused MoE Config for Mixtral 8x22 by ywang96 in https://github.com/vllm-project/vllm/pull/4002
* fix-bgmv-kernel-640 by kingljl in https://github.com/vllm-project/vllm/pull/4007
* [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3824
* [Core] Set `linear_weights` directly on the layer by Yard1 in https://github.com/vllm-project/vllm/pull/3977
* [Core][Distributed] make init_distributed_environment compatible with init_process_group by youkaichao in https://github.com/vllm-project/vllm/pull/4014
* Fix echo/logprob OpenAI completion bug by dylanwhawk in https://github.com/vllm-project/vllm/pull/3441
* [Kernel] Add extra punica sizes to support bigger vocabs by Yard1 in https://github.com/vllm-project/vllm/pull/4015
* [BugFix] Fix handling of stop strings and stop token ids by njhill in https://github.com/vllm-project/vllm/pull/3672
* [Doc] Add typing hints / mypy types cleanup by michaelfeil in https://github.com/vllm-project/vllm/pull/3816
* [Core] Support LoRA on quantized models by jeejeelee in https://github.com/vllm-project/vllm/pull/4012
* [Frontend][Core] Move `merge_async_iterators` to utils by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4026
* [Test] Test multiple attn backend for chunked prefill. by rkooo567 in https://github.com/vllm-project/vllm/pull/4023
* [Bugfix] fix type hint for py 3.8 by youkaichao in https://github.com/vllm-project/vllm/pull/4036
* [Misc] Fix typo in scheduler.py by zhuohan123 in https://github.com/vllm-project/vllm/pull/4022
* [mypy] Add mypy type annotation part 1 by rkooo567 in https://github.com/vllm-project/vllm/pull/4006
* [Core] fix custom allreduce default value by youkaichao in https://github.com/vllm-project/vllm/pull/4040
* Fix triton compilation issue by Bellk17 in https://github.com/vllm-project/vllm/pull/3984
* [Bugfix] Fix LoRA bug by jeejeelee in https://github.com/vllm-project/vllm/pull/4032
* [CI/Test] expand ruff and yapf for all supported python version by youkaichao in https://github.com/vllm-project/vllm/pull/4037
* [Bugfix] More type hint fixes for py 3.8 by dylanwhawk in https://github.com/vllm-project/vllm/pull/4039
* [Core][Distributed] improve logging for init dist by youkaichao in https://github.com/vllm-project/vllm/pull/4042
* [Bugfix] fix_log_time_in_metrics by zspo in https://github.com/vllm-project/vllm/pull/4050
* [Bugfix] fix_small_bug_in_neuron_executor by zspo in https://github.com/vllm-project/vllm/pull/4051
* [Kernel] Add punica dimension for Baichuan-13B by jeejeelee in https://github.com/vllm-project/vllm/pull/4053
* [Frontend] [Core] feat: Add model loading using `tensorizer` by sangstar in https://github.com/vllm-project/vllm/pull/3476
* [Core] avoid too many cuda context by caching p2p test by youkaichao in https://github.com/vllm-project/vllm/pull/4021
* [BugFix] Fix tensorizer extra in setup.py by njhill in https://github.com/vllm-project/vllm/pull/4072
* [Docs] document that mixtral 8x22b is supported by simon-mo in https://github.com/vllm-project/vllm/pull/4073
* [Misc] Upgrade triton to 2.2.0 by esmeetu in https://github.com/vllm-project/vllm/pull/4061
* [Bugfix] Fix filelock version requirement by zhuohan123 in https://github.com/vllm-project/vllm/pull/4075
* [Misc][Minor] Fix CPU block num log in CPUExecutor. by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4088
* [Core] Simplifications to executor classes by njhill in https://github.com/vllm-project/vllm/pull/4071
* [Doc] Add better clarity for tensorizer usage by sangstar in https://github.com/vllm-project/vllm/pull/4090
* [Bugfix] Fix ray workers profiling with nsight by rickyyx in https://github.com/vllm-project/vllm/pull/4095
* [Typing] Fix Sequence type GenericAlias only available after Python 3.9. by rkooo567 in https://github.com/vllm-project/vllm/pull/4092
* [Core] Fix engine-use-ray broken by rkooo567 in https://github.com/vllm-project/vllm/pull/4105
* LM Format Enforcer Guided Decoding Support by noamgat in https://github.com/vllm-project/vllm/pull/3868
* [Core] Refactor model loading code by Yard1 in https://github.com/vllm-project/vllm/pull/4097
* [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine by cadedaniel in https://github.com/vllm-project/vllm/pull/3894
* [Misc] [CI] Fix CI failure caught after merge by cadedaniel in https://github.com/vllm-project/vllm/pull/4126
* [CI] Move CPU/AMD tests to after wait by cadedaniel in https://github.com/vllm-project/vllm/pull/4123
* [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication by youkaichao in https://github.com/vllm-project/vllm/pull/4024
* [Bugfix] fix output parsing error for trtllm backend by elinx in https://github.com/vllm-project/vllm/pull/4137
* [Kernel] Add punica dimension for Swallow-MS-7B LoRA by ucciicci in https://github.com/vllm-project/vllm/pull/4134
* [Typing] Mypy typing part 2 by rkooo567 in https://github.com/vllm-project/vllm/pull/4043
* [Core] Add integrity check during initialization; add test for it by youkaichao in https://github.com/vllm-project/vllm/pull/4155
* Allow model to be served under multiple names by hmellor in https://github.com/vllm-project/vllm/pull/2894
* [Bugfix] Get available quantization methods from quantization registry by mgoin in https://github.com/vllm-project/vllm/pull/4098
* [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill by mmoskal in https://github.com/vllm-project/vllm/pull/4128
* [Docs] document that Meta Llama 3 is supported by simon-mo in https://github.com/vllm-project/vllm/pull/4175
* [Bugfix] Support logprobs when using guided_json and other constrained decoding fields by jamestwhedbee in https://github.com/vllm-project/vllm/pull/4149
* [Misc] Bump transformers to latest version by njhill in https://github.com/vllm-project/vllm/pull/4176
* [CI/CD] add neuron docker and ci test scripts by liangfu in https://github.com/vllm-project/vllm/pull/3571
* [Bugfix] Fix CustomAllreduce pcie nvlink topology detection (3974) by agt in https://github.com/vllm-project/vllm/pull/4159
* [Core] add an option to log every function call to for debugging hang/crash in distributed inference by youkaichao in https://github.com/vllm-project/vllm/pull/4079
* Support eos_token_id from generation_config.json by simon-mo in https://github.com/vllm-project/vllm/pull/4182
* [Bugfix] Fix LoRA loading check by jeejeelee in https://github.com/vllm-project/vllm/pull/4138
* Bump version of 0.4.1 by simon-mo in https://github.com/vllm-project/vllm/pull/4177
* [Misc] fix docstrings by UranusSeven in https://github.com/vllm-project/vllm/pull/4191
* [Bugfix][Core] Restore logging of stats in the async engine by ronensc in https://github.com/vllm-project/vllm/pull/4150
* [Misc] add nccl in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/4211
* Pass `tokenizer_revision` when getting tokenizer in openai serving by chiragjn in https://github.com/vllm-project/vllm/pull/4214
* [Bugfix] Add fix for JSON whitespace by ayusher in https://github.com/vllm-project/vllm/pull/4189
* Fix missing docs and out of sync `EngineArgs` by hmellor in https://github.com/vllm-project/vllm/pull/4219
* [Kernel][FP8] Initial support with dynamic per-tensor scaling by comaniac in https://github.com/vllm-project/vllm/pull/4118
* [Frontend] multiple sampling params support by nunjunj in https://github.com/vllm-project/vllm/pull/3570
* Updating lm-format-enforcer version and adding links to decoding libraries in docs by noamgat in https://github.com/vllm-project/vllm/pull/4222
* Don't show default value for flags in `EngineArgs` by hmellor in https://github.com/vllm-project/vllm/pull/4223
* [Doc]: Update the page of adding new models by YeFD in https://github.com/vllm-project/vllm/pull/4236
* Make initialization of tokenizer and detokenizer optional by GeauxEric in https://github.com/vllm-project/vllm/pull/3748
* [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring by hongxiayang in https://github.com/vllm-project/vllm/pull/4129
* [Core][Distributed] fix _is_full_nvlink detection by youkaichao in https://github.com/vllm-project/vllm/pull/4233
* [Misc] Add vision language model support to CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/3968
* [Bugfix] Fix type annotations in CPU model runner by WoosukKwon in https://github.com/vllm-project/vllm/pull/4256
* [Frontend] Enable support for CPU backend in AsyncLLMEngine. by sighingnow in https://github.com/vllm-project/vllm/pull/3993
* [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter by alexm-nm in https://github.com/vllm-project/vllm/pull/4217
* Add example scripts to documentation by hmellor in https://github.com/vllm-project/vllm/pull/4225
* [Core] Scheduler perf fix by rkooo567 in https://github.com/vllm-project/vllm/pull/4270
* [Doc] Update the SkyPilot doc with serving and Llama-3 by Michaelvll in https://github.com/vllm-project/vllm/pull/4276
* [Core][Distributed] use absolute path for library file by youkaichao in https://github.com/vllm-project/vllm/pull/4271
* Fix `autodoc` directives by hmellor in https://github.com/vllm-project/vllm/pull/4272
* [Mypy] Part 3 fix typing for nested directories for most of directory by rkooo567 in https://github.com/vllm-project/vllm/pull/4161
* [Core] Some simplification of WorkerWrapper changes by njhill in https://github.com/vllm-project/vllm/pull/4183
* [Core] Scheduling optimization 2 by rkooo567 in https://github.com/vllm-project/vllm/pull/4280
* [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. by cadedaniel in https://github.com/vllm-project/vllm/pull/3951
* [Bugfix] Fixing max token error message for openai compatible server by jgordley in https://github.com/vllm-project/vllm/pull/4016
* [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper by DefTruth in https://github.com/vllm-project/vllm/pull/4286
* [Core][Logging] Add last frame information for better debugging by youkaichao in https://github.com/vllm-project/vllm/pull/4278
* [CI] Add ccache for wheel builds job by simon-mo in https://github.com/vllm-project/vllm/pull/4281
* AQLM CUDA support by jaemzfleming in https://github.com/vllm-project/vllm/pull/3287
* [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4292
* [Kernel] FP8 support for MoE kernel / Mixtral by pcmoritz in https://github.com/vllm-project/vllm/pull/4244
* [Bugfix] fixed fp8 conflict with aqlm by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4307
* [Core][Distributed] use cpu/gloo to initialize pynccl by youkaichao in https://github.com/vllm-project/vllm/pull/4248
* [CI][Build] change pynvml to nvidia-ml-py by youkaichao in https://github.com/vllm-project/vllm/pull/4302
* [Misc] Reduce supported Punica dtypes by WoosukKwon in https://github.com/vllm-project/vllm/pull/4304

New Contributors
* mawong-amd made their first contribution in https://github.com/vllm-project/vllm/pull/3662
* Qubitium made their first contribution in https://github.com/vllm-project/vllm/pull/3689
* bigPYJ1151 made their first contribution in https://github.com/vllm-project/vllm/pull/3634
* A-Mahla made their first contribution in https://github.com/vllm-project/vllm/pull/3788
* AdrianAbeyta made their first contribution in https://github.com/vllm-project/vllm/pull/3290
* mgerstgrasser made their first contribution in https://github.com/vllm-project/vllm/pull/3749
* CatherineSue made their first contribution in https://github.com/vllm-project/vllm/pull/3836
* saurabhdash2512 made their first contribution in https://github.com/vllm-project/vllm/pull/3829
* SeanGallen made their first contribution in https://github.com/vllm-project/vllm/pull/3810
* SUDA-HLT-ywfang made their first contribution in https://github.com/vllm-project/vllm/pull/3893
* egortolmachev made their first contribution in https://github.com/vllm-project/vllm/pull/3849
* Ki6an made their first contribution in https://github.com/vllm-project/vllm/pull/3767
* jsato8094 made their first contribution in https://github.com/vllm-project/vllm/pull/3925
* jpvillam-amd made their first contribution in https://github.com/vllm-project/vllm/pull/3643
* PZD-CHINA made their first contribution in https://github.com/vllm-project/vllm/pull/3915
* zhaotyer made their first contribution in https://github.com/vllm-project/vllm/pull/3955
* huyiwen made their first contribution in https://github.com/vllm-project/vllm/pull/3899
* dmarasco made their first contribution in https://github.com/vllm-project/vllm/pull/3945
* fpaupier made their first contribution in https://github.com/vllm-project/vllm/pull/3963
* kingljl made their first contribution in https://github.com/vllm-project/vllm/pull/4007
* DarkLight1337 made their first contribution in https://github.com/vllm-project/vllm/pull/4026
* Bellk17 made their first contribution in https://github.com/vllm-project/vllm/pull/3984
* sangstar made their first contribution in https://github.com/vllm-project/vllm/pull/3476
* rickyyx made their first contribution in https://github.com/vllm-project/vllm/pull/4095
* elinx made their first contribution in https://github.com/vllm-project/vllm/pull/4137
* ucciicci made their first contribution in https://github.com/vllm-project/vllm/pull/4134
* mmoskal made their first contribution in https://github.com/vllm-project/vllm/pull/4128
* agt made their first contribution in https://github.com/vllm-project/vllm/pull/4159
* ayusher made their first contribution in https://github.com/vllm-project/vllm/pull/4189
* nunjunj made their first contribution in https://github.com/vllm-project/vllm/pull/3570
* YeFD made their first contribution in https://github.com/vllm-project/vllm/pull/4236
* GeauxEric made their first contribution in https://github.com/vllm-project/vllm/pull/3748
* alexm-nm made their first contribution in https://github.com/vllm-project/vllm/pull/4217
* jgordley made their first contribution in https://github.com/vllm-project/vllm/pull/4016
* DefTruth made their first contribution in https://github.com/vllm-project/vllm/pull/4286
* jaemzfleming made their first contribution in https://github.com/vllm-project/vllm/pull/3287

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.1

0.4.0.post1

Highlight

v0.4.0 lacked support for sm70/75 (Volta/Turing) GPUs; this hotfix release adds it back.

What's Changed
* [Kernel] Layernorm performance optimization by mawong-amd in https://github.com/vllm-project/vllm/pull/3662
* [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by youkaichao in https://github.com/vllm-project/vllm/pull/3746
* [CI/Build] Make Marlin Tests Green by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3753
* [Misc] Minor fixes in requirements.txt by WoosukKwon in https://github.com/vllm-project/vllm/pull/3769
* [Misc] Some minor simplifications to detokenization logic by njhill in https://github.com/vllm-project/vllm/pull/3670
* [Misc] Fix Benchmark TTFT Calculation for Chat Completions by ywang96 in https://github.com/vllm-project/vllm/pull/3768
* [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by cadedaniel in https://github.com/vllm-project/vllm/pull/3250
* [Misc] Add support for new autogptq checkpoint_format by Qubitium in https://github.com/vllm-project/vllm/pull/3689
* [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by cadedaniel in https://github.com/vllm-project/vllm/pull/3783
* [Hardware][Intel] Add CPU inference backend by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3634
* [HotFix] [CI/Build] Minor fix for CPU backend CI by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3787
* [Frontend][Bugfix] allow using the default middleware with a root path by A-Mahla in https://github.com/vllm-project/vllm/pull/3788
* [Doc] Fix vLLMEngine Doc Page by ywang96 in https://github.com/vllm-project/vllm/pull/3791
* [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by youkaichao in https://github.com/vllm-project/vllm/pull/3801
* Fix crash when try torch.cuda.set_device in worker by leiwen83 in https://github.com/vllm-project/vllm/pull/3770
* [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by mgoin in https://github.com/vllm-project/vllm/pull/3798
* [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by youkaichao in https://github.com/vllm-project/vllm/pull/3803

New Contributors
* mawong-amd made their first contribution in https://github.com/vllm-project/vllm/pull/3662
* Qubitium made their first contribution in https://github.com/vllm-project/vllm/pull/3689
* bigPYJ1151 made their first contribution in https://github.com/vllm-project/vllm/pull/3634
* A-Mahla made their first contribution in https://github.com/vllm-project/vllm/pull/3788

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.0.post1

0.4.0

Major changes
Models
* New models: Command-R (3433), Qwen2 MoE (3346), DBRX (3660), XVerse (3610), Jais (3183).
* New vision language model: LLaVA (3042)

Production features
* Automatic prefix caching (2762, 3703), which lets long system prompts be automatically cached across requests. Use the flag `--enable-prefix-caching` to turn it on; see the sketches after this list.
* Support `json_object` in the OpenAI server for arbitrary JSON output, a `--use-delay` flag to improve time to first token under many concurrent requests, and `min_tokens` for EOS suppression.
* Progress in chunked prefill scheduler (3236, 3538), and speculative decoding (3103).
* The custom all-reduce kernel has been re-enabled after more robustness fixes.
* Replaced the cupy dependency due to its bugs.
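
The prefix-caching and `json_object` items above can be sketched as follows. First, enabling automatic prefix caching from the offline API; the `enable_prefix_caching` keyword is assumed to mirror the `--enable-prefix-caching` flag named above.

```python
# Sketch: automatic prefix caching for a long shared system prompt.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

system = "You are a helpful assistant. " * 50   # long shared prefix
questions = ["What is a KV cache?", "What is paged attention?"]

outputs = llm.generate([system + q for q in questions], SamplingParams(max_tokens=32))
for o in outputs:
    print(o.outputs[0].text)
```

Second, requesting arbitrary JSON from the OpenAI-compatible server (started with `python -m vllm.entrypoints.openai.api_server --model ...`); the `response_format` field mirrors the OpenAI API, and the exact server behaviour is an assumption to check against the docs.

```python
# Sketch: arbitrary-JSON output via the OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Describe vLLM as a JSON object."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```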

Hardware
* Improved Neuron support for AWS Inferentia.
* CMake-based build system for extensibility.

Ecosystem
* Extensive serving benchmark refactoring (3277)
* Usage statistics collection (2852)

What's Changed
* allow user chose log level by --log-level instead of fixed 'info'. by AllenDou in https://github.com/vllm-project/vllm/pull/3109
* Reorder kv dtype check to avoid nvcc not found error on AMD platform by cloudhan in https://github.com/vllm-project/vllm/pull/3104
* Add Automatic Prefix Caching by SageMoore in https://github.com/vllm-project/vllm/pull/2762
* Add vLLM version info to logs and openai API server by jasonacox in https://github.com/vllm-project/vllm/pull/3161
* [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark by zhuohan123 in https://github.com/vllm-project/vllm/pull/3158
* Make it easy to profile workers with nsight by pcmoritz in https://github.com/vllm-project/vllm/pull/3162
* [DOC] add setup document to support neuron backend by liangfu in https://github.com/vllm-project/vllm/pull/2777
* [Minor Fix] Remove unused code in benchmark_prefix_caching.py by gty111 in https://github.com/vllm-project/vllm/pull/3171
* Add document for vllm paged attention kernel. by pian13131 in https://github.com/vllm-project/vllm/pull/2978
* enable --gpu-memory-utilization in benchmark_throughput.py by AllenDou in https://github.com/vllm-project/vllm/pull/3175
* [Minor fix] The domain dns.google may cause a socket.gaierror exception by ttbachyinsda in https://github.com/vllm-project/vllm/pull/3176
* Push logprob generation to LLMEngine by Yard1 in https://github.com/vllm-project/vllm/pull/3065
* Add health check, make async Engine more robust by Yard1 in https://github.com/vllm-project/vllm/pull/3015
* Fix the openai benchmarking requests to work with latest OpenAI apis by wangchen615 in https://github.com/vllm-project/vllm/pull/2992
* [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by hongxiayang in https://github.com/vllm-project/vllm/pull/3123
* Store `eos_token_id` in `Sequence` for easy access by njhill in https://github.com/vllm-project/vllm/pull/3166
* [Fix] Avoid pickling entire LLMEngine for Ray workers by njhill in https://github.com/vllm-project/vllm/pull/3207
* [Tests] Add block manager and scheduler tests by rkooo567 in https://github.com/vllm-project/vllm/pull/3108
* [Testing] Fix core tests by cadedaniel in https://github.com/vllm-project/vllm/pull/3224
* A simple addition of `dynamic_ncols=True` by chujiezheng in https://github.com/vllm-project/vllm/pull/3242
* Add GPTQ support for Gemma by TechxGenus in https://github.com/vllm-project/vllm/pull/3200
* Update requirements-dev.txt to include package for benchmarking scripts. by wangchen615 in https://github.com/vllm-project/vllm/pull/3181
* Separate attention backends by WoosukKwon in https://github.com/vllm-project/vllm/pull/3005
* Measure model memory usage by mgoin in https://github.com/vllm-project/vllm/pull/3120
* Possible fix for conflict between Automated Prefix Caching (2762) and multi-LoRA support (1804) by jacobthebanana in https://github.com/vllm-project/vllm/pull/3263
* Fix auto prefix bug by ElizaWszola in https://github.com/vllm-project/vllm/pull/3239
* Connect engine healthcheck to openai server by njhill in https://github.com/vllm-project/vllm/pull/3260
* Feature add lora support for Qwen2 by whyiug in https://github.com/vllm-project/vllm/pull/3177
* [Minor Fix] Fix comments in benchmark_serving by gty111 in https://github.com/vllm-project/vllm/pull/3252
* [Docs] Fix Unmocked Imports by ywang96 in https://github.com/vllm-project/vllm/pull/3275
* [FIX] Make `flash_attn` optional by WoosukKwon in https://github.com/vllm-project/vllm/pull/3269
* Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir by mgoin in https://github.com/vllm-project/vllm/pull/3241
* [FIX] Fix prefix test error on main by zhuohan123 in https://github.com/vllm-project/vllm/pull/3286
* [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by cadedaniel in https://github.com/vllm-project/vllm/pull/3103
* Enhance lora tests with more layer and rank variations by tterrysun in https://github.com/vllm-project/vllm/pull/3243
* [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by dllehr-amd in https://github.com/vllm-project/vllm/pull/3262
* [BugFix] Fix get tokenizer when using ray by esmeetu in https://github.com/vllm-project/vllm/pull/3301
* [Fix] Fix best_of behavior when n=1 by njhill in https://github.com/vllm-project/vllm/pull/3298
* Re-enable the 80 char line width limit by zhuohan123 in https://github.com/vllm-project/vllm/pull/3305
* [docs] Add LoRA support information for models by pcmoritz in https://github.com/vllm-project/vllm/pull/3299
* Add distributed model executor abstraction by zhuohan123 in https://github.com/vllm-project/vllm/pull/3191
* [ROCm] Fix warp and lane calculation in blockReduceSum by kliuae in https://github.com/vllm-project/vllm/pull/3321
* Support Mistral Model Inference with transformers-neuronx by DAIZHENWEI in https://github.com/vllm-project/vllm/pull/3153
* docs: Add BentoML deployment doc by Sherlock113 in https://github.com/vllm-project/vllm/pull/3336
* Fixes 1556 double free by br3no in https://github.com/vllm-project/vllm/pull/3347
* Add kernel for GeGLU with approximate GELU by WoosukKwon in https://github.com/vllm-project/vllm/pull/3337
* [Fix] fix quantization arg when using marlin by DreamTeamWangbowen in https://github.com/vllm-project/vllm/pull/3319
* add hf_transfer to requirements.txt by RonanKMcGovern in https://github.com/vllm-project/vllm/pull/3031
* fix bias in if, ambiguous by hliuca in https://github.com/vllm-project/vllm/pull/3259
* [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by chenxu2048 in https://github.com/vllm-project/vllm/pull/3256
* Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by orsharir in https://github.com/vllm-project/vllm/pull/3350
* Add batched RoPE kernel by tterrysun in https://github.com/vllm-project/vllm/pull/3095
* Fix lint by Yard1 in https://github.com/vllm-project/vllm/pull/3388
* [FIX] Simpler fix for async engine running on ray by zhuohan123 in https://github.com/vllm-project/vllm/pull/3371
* [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by simon-mo in https://github.com/vllm-project/vllm/pull/3383
* allow user to chose which vllm's merics to display in grafana by AllenDou in https://github.com/vllm-project/vllm/pull/3393
* [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by youkaichao in https://github.com/vllm-project/vllm/pull/3389
* Install `flash_attn` in Docker image by tdoublep in https://github.com/vllm-project/vllm/pull/3396
* Add args for mTLS support by declark1 in https://github.com/vllm-project/vllm/pull/3410
* [issue templates] add some issue templates by youkaichao in https://github.com/vllm-project/vllm/pull/3412
* Fix assertion failure in Qwen 1.5 with prefix caching enabled by chenxu2048 in https://github.com/vllm-project/vllm/pull/3373
* fix marlin config repr by qeternity in https://github.com/vllm-project/vllm/pull/3414
* Feature: dynamic shared mem moe_align_block_size_kernel by akhoroshev in https://github.com/vllm-project/vllm/pull/3376
* [Misc] add HOST_IP env var by youkaichao in https://github.com/vllm-project/vllm/pull/3419
* Add chat templates for Falcon by Dinghow in https://github.com/vllm-project/vllm/pull/3420
* Add chat templates for ChatGLM by Dinghow in https://github.com/vllm-project/vllm/pull/3418
* Fix `dist.broadcast` stall without group argument by GindaChen in https://github.com/vllm-project/vllm/pull/3408
* Fix tie_word_embeddings for Qwen2. by fyabc in https://github.com/vllm-project/vllm/pull/3344
* [Fix] Add args for mTLS support by declark1 in https://github.com/vllm-project/vllm/pull/3430
* Fixes the misuse/mixuse of time.time()/time.monotonic() by sighingnow in https://github.com/vllm-project/vllm/pull/3220
* [Misc] add error message in non linux platform by youkaichao in https://github.com/vllm-project/vllm/pull/3438
* Fix issue templates by hmellor in https://github.com/vllm-project/vllm/pull/3436
* fix document error for value and v_vec illustration by laneeeee in https://github.com/vllm-project/vllm/pull/3421
* Asynchronous tokenization by Yard1 in https://github.com/vllm-project/vllm/pull/2879
* Removed Extraneous Print Message From OAI Server by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3440
* [Misc] PR templates by youkaichao in https://github.com/vllm-project/vllm/pull/3413
* Fixes the incorrect argument in the prefix-prefill test cases by sighingnow in https://github.com/vllm-project/vllm/pull/3246
* Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning by ronensc in https://github.com/vllm-project/vllm/pull/2958
* Fix Baichuan chat template by Dinghow in https://github.com/vllm-project/vllm/pull/3340
* [Misc] fix line length for entire codebase by simon-mo in https://github.com/vllm-project/vllm/pull/3444
* Support arbitrary json_object in OpenAI and Context Free Grammar by simon-mo in https://github.com/vllm-project/vllm/pull/3211
* Fix setup.py neuron-ls issue by simon-mo in https://github.com/vllm-project/vllm/pull/2671
* [Misc] Define from_dict and to_dict in InputMetadata by WoosukKwon in https://github.com/vllm-project/vllm/pull/3452
* [CI] Shard tests for LoRA and Kernels to speed up by simon-mo in https://github.com/vllm-project/vllm/pull/3445
* [Bugfix] Make moe_align_block_size AMD-compatible by WoosukKwon in https://github.com/vllm-project/vllm/pull/3470
* CI: Add ROCm Docker Build by simon-mo in https://github.com/vllm-project/vllm/pull/2886
* [Testing] Add test_config.py to CI by cadedaniel in https://github.com/vllm-project/vllm/pull/3437
* [CI/Build] Fix Bad Import In Test by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3473
* [Misc] Fix PR Template by zhuohan123 in https://github.com/vllm-project/vllm/pull/3478
* Cmake based build system by bnellnm in https://github.com/vllm-project/vllm/pull/2830
* [Core] Zero-copy asdict for InputMetadata by Yard1 in https://github.com/vllm-project/vllm/pull/3475
* [Misc] Update README for the Third vLLM Meetup by zhuohan123 in https://github.com/vllm-project/vllm/pull/3479
* [Core] Cache some utils by Yard1 in https://github.com/vllm-project/vllm/pull/3474
* [Core] print error before deadlock by youkaichao in https://github.com/vllm-project/vllm/pull/3459
* [Doc] Add docs about OpenAI compatible server by simon-mo in https://github.com/vllm-project/vllm/pull/3288
* [BugFix] Avoid initializing CUDA too early by njhill in https://github.com/vllm-project/vllm/pull/3487
* Update dockerfile with ModelScope support by ifsheldon in https://github.com/vllm-project/vllm/pull/3429
* [Doc] minor fix to neuron-installation.rst by jimburtoft in https://github.com/vllm-project/vllm/pull/3505
* Revert "[Core] Cache some utils" by simon-mo in https://github.com/vllm-project/vllm/pull/3507
* [Doc] minor fix of spelling in amd-installation.rst by jimburtoft in https://github.com/vllm-project/vllm/pull/3506
* Use lru_cache for some environment detection utils by simon-mo in https://github.com/vllm-project/vllm/pull/3508
* [PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled by ElizaWszola in https://github.com/vllm-project/vllm/pull/3357
* [Core] Add generic typing to `LRUCache` by njhill in https://github.com/vllm-project/vllm/pull/3511
* [Misc] Remove cache stream and cache events by WoosukKwon in https://github.com/vllm-project/vllm/pull/3461
* Abort when nvcc command is not found in the PATH by AllenDou in https://github.com/vllm-project/vllm/pull/3527
* Check for _is_cuda() in compute_num_jobs by bnellnm in https://github.com/vllm-project/vllm/pull/3481
* [Bugfix] Fix ROCm support in CMakeLists.txt by jamestwhedbee in https://github.com/vllm-project/vllm/pull/3534
* [1/n] Triton sampling kernel by Yard1 in https://github.com/vllm-project/vllm/pull/3186
* [1/n][Chunked Prefill] Refactor input query shapes by rkooo567 in https://github.com/vllm-project/vllm/pull/3236
* Migrate `logits` computation and gather to `model_runner` by esmeetu in https://github.com/vllm-project/vllm/pull/3233
* [BugFix] Hot fix in setup.py for neuron build by zhuohan123 in https://github.com/vllm-project/vllm/pull/3537
* [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor by ElizaWszola in https://github.com/vllm-project/vllm/pull/3431
* Fix 1D query issue from `_prune_hidden_states` by rkooo567 in https://github.com/vllm-project/vllm/pull/3539
* [🚀 Ready to be merged] Added support for Jais models by grandiose-pizza in https://github.com/vllm-project/vllm/pull/3183
* [Misc][Log] Add log for tokenizer length not equal to vocabulary size by esmeetu in https://github.com/vllm-project/vllm/pull/3500
* [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config by WoosukKwon in https://github.com/vllm-project/vllm/pull/3551
* [BugFix] gemma loading after quantization or LoRA. by taeminlee in https://github.com/vllm-project/vllm/pull/3553
* [Bugfix][Model] Fix Qwen2 by esmeetu in https://github.com/vllm-project/vllm/pull/3554
* [Hardware][Neuron] Refactor neuron support by zhuohan123 in https://github.com/vllm-project/vllm/pull/3471
* Some fixes for custom allreduce kernels by hanzhi713 in https://github.com/vllm-project/vllm/pull/2760
* Dynamic scheduler delay to improve ITL performance by tdoublep in https://github.com/vllm-project/vllm/pull/3279
* [Core] Improve detokenization performance for prefill by Yard1 in https://github.com/vllm-project/vllm/pull/3469
* [Bugfix] use SoftLockFile instead of LockFile by kota-iizuka in https://github.com/vllm-project/vllm/pull/3578
* [Misc] Fix BLOOM copyright notice by WoosukKwon in https://github.com/vllm-project/vllm/pull/3591
* [Misc] Bump transformers version by ywang96 in https://github.com/vllm-project/vllm/pull/3592
* [BugFix] Fix Falcon tied embeddings by WoosukKwon in https://github.com/vllm-project/vllm/pull/3590
* [BugFix] 1D query fix for MoE models by njhill in https://github.com/vllm-project/vllm/pull/3597
* [CI] typo fix: is_hip --> is_hip() by youkaichao in https://github.com/vllm-project/vllm/pull/3595
* [CI/Build] respect the common environment variable MAX_JOBS by youkaichao in https://github.com/vllm-project/vllm/pull/3600
* [CI/Build] fix flaky test by youkaichao in https://github.com/vllm-project/vllm/pull/3602
* [BugFix] minor fix: method typo in `rotary_embedding.py` file, get_device() -> device by jikunshang in https://github.com/vllm-project/vllm/pull/3604
* [Bugfix] Revert "[Bugfix] use SoftLockFile instead of LockFile (3578)" by WoosukKwon in https://github.com/vllm-project/vllm/pull/3599
* [Model] Add starcoder2 awq support by shaonianyr in https://github.com/vllm-project/vllm/pull/3569
* [Core] Refactor Attention Take 2 by WoosukKwon in https://github.com/vllm-project/vllm/pull/3462
* [Bugfix] fix automatic prefix args and add log info by gty111 in https://github.com/vllm-project/vllm/pull/3608
* [CI] Try introducing isort. by rkooo567 in https://github.com/vllm-project/vllm/pull/3495
* [Core] Adding token ranks along with logprobs by SwapnilDreams100 in https://github.com/vllm-project/vllm/pull/3516
* feat: implement the min_tokens sampling parameter by tjohnson31415 in https://github.com/vllm-project/vllm/pull/3124 (see the sketch after this list)
* [Bugfix] API stream returning two stops by dylanwhawk in https://github.com/vllm-project/vllm/pull/3450
* hotfix isort on logprobs ranks pr by simon-mo in https://github.com/vllm-project/vllm/pull/3622
* [Feature] Add vision language model support. by xwjiang2010 in https://github.com/vllm-project/vllm/pull/3042
* Optimize `_get_ranks` in Sampler by Yard1 in https://github.com/vllm-project/vllm/pull/3623
* [Misc] Include matched stop string/token in responses by njhill in https://github.com/vllm-project/vllm/pull/2976
* Enable more models to inference based on LoRA by jeejeelee in https://github.com/vllm-project/vllm/pull/3382
* [Bugfix] Fix ipv6 address parsing bug by liiliiliil in https://github.com/vllm-project/vllm/pull/3641
* [BugFix] Fix ipv4 address parsing regression by njhill in https://github.com/vllm-project/vllm/pull/3645
* [Kernel] support non-zero cuda devices in punica kernels by jeejeelee in https://github.com/vllm-project/vllm/pull/3636
* [Doc] add lora support by jeejeelee in https://github.com/vllm-project/vllm/pull/3649
* [Misc] Minor fix in KVCache type by WoosukKwon in https://github.com/vllm-project/vllm/pull/3652
* [Core] remove cupy dependency by youkaichao in https://github.com/vllm-project/vllm/pull/3625
* [Bugfix] More faithful implementation of Gemma by WoosukKwon in https://github.com/vllm-project/vllm/pull/3653
* [Bugfix] [Hotfix] fix nccl library name by youkaichao in https://github.com/vllm-project/vllm/pull/3661
* [Model] Add support for DBRX by megha95 in https://github.com/vllm-project/vllm/pull/3660
* [Misc] add the "download-dir" option to the latency/throughput benchmarks by AmadeusChan in https://github.com/vllm-project/vllm/pull/3621
* feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark by ywang96 in https://github.com/vllm-project/vllm/pull/3277
* Add support for Cohere's Command-R model by zeppombal in https://github.com/vllm-project/vllm/pull/3433
* [Docs] Add Command-R to supported models by WoosukKwon in https://github.com/vllm-project/vllm/pull/3669
* [Model] Fix and clean commandr by esmeetu in https://github.com/vllm-project/vllm/pull/3671
* [Model] Add support for xverse by hxer7963 in https://github.com/vllm-project/vllm/pull/3610
* [CI/Build] update default number of jobs and nvcc threads to avoid overloading the system by youkaichao in https://github.com/vllm-project/vllm/pull/3675
* [Kernel] Add Triton MoE kernel configs for DBRX + A100 by WoosukKwon in https://github.com/vllm-project/vllm/pull/3679
* [Core] [Bugfix] Refactor block manager subsystem for better testability by cadedaniel in https://github.com/vllm-project/vllm/pull/3492
* [Model] Add support for Qwen2MoeModel by wenyujin333 in https://github.com/vllm-project/vllm/pull/3346
* [Kernel] DBRX Triton MoE kernel H100 by ywang96 in https://github.com/vllm-project/vllm/pull/3692
* [2/N] Chunked prefill data update by rkooo567 in https://github.com/vllm-project/vllm/pull/3538
* [Bugfix] Update neuron_executor.py to add optional vision_language_config. by adamrb in https://github.com/vllm-project/vllm/pull/3695
* fix benchmark format reporting in buildkite by simon-mo in https://github.com/vllm-project/vllm/pull/3693
* [CI] Add test case to run examples scripts by simon-mo in https://github.com/vllm-project/vllm/pull/3638
* [Core] Support multi-node inference (eager and cuda graph) by esmeetu in https://github.com/vllm-project/vllm/pull/3686
* [Kernel] Add MoE Triton kernel configs for A100 40GB by WoosukKwon in https://github.com/vllm-project/vllm/pull/3700
* [Bugfix] Set enable_prefix_caching=True in prefix caching example by WoosukKwon in https://github.com/vllm-project/vllm/pull/3703 (see the sketch after this list)
* fix logging msg for block manager by simon-mo in https://github.com/vllm-project/vllm/pull/3701
* [Core] fix del of communicator by youkaichao in https://github.com/vllm-project/vllm/pull/3702
* [Benchmark] Change mii to use persistent deployment and support tensor parallel by IKACE in https://github.com/vllm-project/vllm/pull/3628
* bump version to v0.4.0 by simon-mo in https://github.com/vllm-project/vllm/pull/3705
* Revert "bump version to v0.4.0" by youkaichao in https://github.com/vllm-project/vllm/pull/3708
* [Test] Make model tests run again and remove --forked from pytest by rkooo567 in https://github.com/vllm-project/vllm/pull/3631
* [Misc] Minor type annotation fix by WoosukKwon in https://github.com/vllm-project/vllm/pull/3716
* [Core][Test] move local_rank to the last arg with default value to keep api compatible by youkaichao in https://github.com/vllm-project/vllm/pull/3711
* add ccache to docker build image by simon-mo in https://github.com/vllm-project/vllm/pull/3704
* Usage Stats Collection by yhu422 in https://github.com/vllm-project/vllm/pull/2852
* [BugFix] Fix tokenizer out of vocab size by esmeetu in https://github.com/vllm-project/vllm/pull/3685
* [BugFix][Frontend] Fix completion logprobs=0 error by esmeetu in https://github.com/vllm-project/vllm/pull/3731
* [Bugfix] Command-R Max Model Length by ywang96 in https://github.com/vllm-project/vllm/pull/3727
* bump version to v0.4.0 by simon-mo in https://github.com/vllm-project/vllm/pull/3712
* [ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic by hongxiayang in https://github.com/vllm-project/vllm/pull/3699
* usage lib get version another way by simon-mo in https://github.com/vllm-project/vllm/pull/3735
* [BugFix] Use consistent logger everywhere by njhill in https://github.com/vllm-project/vllm/pull/3738
* [Core][Bugfix] cache len of tokenizer by youkaichao in https://github.com/vllm-project/vllm/pull/3741
* Fix build when nvtools is missing by bnellnm in https://github.com/vllm-project/vllm/pull/3698
* CMake build elf without PTX by simon-mo in https://github.com/vllm-project/vllm/pull/3739
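
Two of the entries above introduce user-facing knobs: the `min_tokens` sampling parameter and the `enable_prefix_caching=True` engine flag. The snippet below is a minimal sketch of how they might be combined through the offline `LLM` API, not an official example; the model name is a placeholder for any supported checkpoint.

```python
# Minimal sketch (not from the repository's examples): combining automatic
# prefix caching with the min_tokens sampling parameter. The model name is
# a placeholder.
from vllm import LLM, SamplingParams

# enable_prefix_caching lets requests that share a long common prefix
# reuse already-computed KV-cache blocks.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "You are a concise assistant.\n\n"
prompts = [
    shared_prefix + "Question: What is vLLM?",
    shared_prefix + "Question: What is paged attention?",
]

# min_tokens forces at least that many tokens to be generated before any
# stop condition (EOS or stop strings) is applied.
params = SamplingParams(temperature=0.0, min_tokens=8, max_tokens=64)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```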

New Contributors
* cloudhan made their first contribution in https://github.com/vllm-project/vllm/pull/3104
* SageMoore made their first contribution in https://github.com/vllm-project/vllm/pull/2762
* jasonacox made their first contribution in https://github.com/vllm-project/vllm/pull/3161
* gty111 made their first contribution in https://github.com/vllm-project/vllm/pull/3171
* pian13131 made their first contribution in https://github.com/vllm-project/vllm/pull/2978
* ttbachyinsda made their first contribution in https://github.com/vllm-project/vllm/pull/3176
* wangchen615 made their first contribution in https://github.com/vllm-project/vllm/pull/2992
* chujiezheng made their first contribution in https://github.com/vllm-project/vllm/pull/3242
* TechxGenus made their first contribution in https://github.com/vllm-project/vllm/pull/3200
* mgoin made their first contribution in https://github.com/vllm-project/vllm/pull/3120
* jacobthebanana made their first contribution in https://github.com/vllm-project/vllm/pull/3263
* ElizaWszola made their first contribution in https://github.com/vllm-project/vllm/pull/3239
* DAIZHENWEI made their first contribution in https://github.com/vllm-project/vllm/pull/3153
* Sherlock113 made their first contribution in https://github.com/vllm-project/vllm/pull/3336
* br3no made their first contribution in https://github.com/vllm-project/vllm/pull/3347
* DreamTeamWangbowen made their first contribution in https://github.com/vllm-project/vllm/pull/3319
* RonanKMcGovern made their first contribution in https://github.com/vllm-project/vllm/pull/3031
* hliuca made their first contribution in https://github.com/vllm-project/vllm/pull/3259
* orsharir made their first contribution in https://github.com/vllm-project/vllm/pull/3350
* youkaichao made their first contribution in https://github.com/vllm-project/vllm/pull/3389
* tdoublep made their first contribution in https://github.com/vllm-project/vllm/pull/3396
* declark1 made their first contribution in https://github.com/vllm-project/vllm/pull/3410
* qeternity made their first contribution in https://github.com/vllm-project/vllm/pull/3414
* akhoroshev made their first contribution in https://github.com/vllm-project/vllm/pull/3376
* Dinghow made their first contribution in https://github.com/vllm-project/vllm/pull/3420
* fyabc made their first contribution in https://github.com/vllm-project/vllm/pull/3344
* laneeeee made their first contribution in https://github.com/vllm-project/vllm/pull/3421
* bnellnm made their first contribution in https://github.com/vllm-project/vllm/pull/2830
* ifsheldon made their first contribution in https://github.com/vllm-project/vllm/pull/3429
* jimburtoft made their first contribution in https://github.com/vllm-project/vllm/pull/3505
* grandiose-pizza made their first contribution in https://github.com/vllm-project/vllm/pull/3183
* taeminlee made their first contribution in https://github.com/vllm-project/vllm/pull/3553
* kota-iizuka made their first contribution in https://github.com/vllm-project/vllm/pull/3578
* shaonianyr made their first contribution in https://github.com/vllm-project/vllm/pull/3569
* SwapnilDreams100 made their first contribution in https://github.com/vllm-project/vllm/pull/3516
* tjohnson31415 made their first contribution in https://github.com/vllm-project/vllm/pull/3124
* xwjiang2010 made their first contribution in https://github.com/vllm-project/vllm/pull/3042
* liiliiliil made their first contribution in https://github.com/vllm-project/vllm/pull/3641
* AmadeusChan made their first contribution in https://github.com/vllm-project/vllm/pull/3621
* zeppombal made their first contribution in https://github.com/vllm-project/vllm/pull/3433
* hxer7963 made their first contribution in https://github.com/vllm-project/vllm/pull/3610
* wenyujin333 made their first contribution in https://github.com/vllm-project/vllm/pull/3346
* adamrb made their first contribution in https://github.com/vllm-project/vllm/pull/3695
* IKACE made their first contribution in https://github.com/vllm-project/vllm/pull/3628
* yhu422 made their first contribution in https://github.com/vllm-project/vllm/pull/2852

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.3.3...v0.4.0

0.3.3

Major Changes

* StarCoder2 support
* Performance optimization and LoRA support for Gemma
* 2/3/8-bit GPTQ support (see the loading sketch after this list)
* Integrate Marlin Kernels for Int4 GPTQ inference
* Performance optimization for MoE kernel
* [Experimental] AWS Inferentia2 support
* [Experimental] Structured output (JSON, Regex) in OpenAI Server
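
As a quick illustration of the GPTQ support listed above, a minimal offline-inference sketch might look like the following; this is not an official example, and the checkpoint name is a placeholder for any GPTQ-quantized model on the Hub.

```python
# Minimal sketch (placeholder checkpoint name): loading an Int4 GPTQ model
# with the quantization flag and running a single offline generation.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)
```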

What's Changed
* Update a comment in `benchmark_serving.py` by ronensc in https://github.com/vllm-project/vllm/pull/2934
* Added early stopping to completion APIs by Maxusmusti in https://github.com/vllm-project/vllm/pull/2939
* Migrate MistralForCausalLM to LlamaForCausalLM by esmeetu in https://github.com/vllm-project/vllm/pull/2868
* Use Llama RMSNorm for Gemma by WoosukKwon in https://github.com/vllm-project/vllm/pull/2974
* chore(vllm): codespell for spell checking by mspronesti in https://github.com/vllm-project/vllm/pull/2820
* Optimize GeGLU layer in Gemma by WoosukKwon in https://github.com/vllm-project/vllm/pull/2975
* [FIX] Fix issue 2904 by 44670 in https://github.com/vllm-project/vllm/pull/2983
* Remove Flash Attention in test env by WoosukKwon in https://github.com/vllm-project/vllm/pull/2982
* Include tokens from prompt phase in `counter_generation_tokens` by ronensc in https://github.com/vllm-project/vllm/pull/2802
* Fix nvcc not found in vllm-openai image by zhaoyang-star in https://github.com/vllm-project/vllm/pull/2781
* [Fix] Fix assertion on Mistral YaRN model len by WoosukKwon in https://github.com/vllm-project/vllm/pull/2984
* Port metrics from `aioprometheus` to `prometheus_client` by hmellor in https://github.com/vllm-project/vllm/pull/2730
* Add LogProbs for Chat Completions in OpenAI by jlcmoore in https://github.com/vllm-project/vllm/pull/2918
* Optimized fused MoE Kernel, take 2 by pcmoritz in https://github.com/vllm-project/vllm/pull/2979
* [Minor] Remove gather_cached_kv kernel by WoosukKwon in https://github.com/vllm-project/vllm/pull/3043
* [Minor] Remove unused config file by esmeetu in https://github.com/vllm-project/vllm/pull/3039
* Fix using CuPy for eager mode by esmeetu in https://github.com/vllm-project/vllm/pull/3037
* Fix stablelm by esmeetu in https://github.com/vllm-project/vllm/pull/3038
* Support Orion model by dachengai in https://github.com/vllm-project/vllm/pull/2539
* fix `get_ip` error in pure ipv6 environment by Jingru in https://github.com/vllm-project/vllm/pull/2931
* [Minor] Fix type annotation in fused moe by WoosukKwon in https://github.com/vllm-project/vllm/pull/3045
* Support logit bias for OpenAI API by dylanwhawk in https://github.com/vllm-project/vllm/pull/3027
* [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM by WoosukKwon in https://github.com/vllm-project/vllm/pull/3046
* Enables GQA support in the prefix prefill kernels by sighingnow in https://github.com/vllm-project/vllm/pull/3007
* multi-lora documentation fix by ElefHead in https://github.com/vllm-project/vllm/pull/3064
* Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs by AllenDou in https://github.com/vllm-project/vllm/pull/3070
* Support inference with transformers-neuronx by liangfu in https://github.com/vllm-project/vllm/pull/2569
* Add LoRA support for Gemma by WoosukKwon in https://github.com/vllm-project/vllm/pull/3050
* Add Support for 2/3/8-bit GPTQ Quantization Models by chu-tianxiang in https://github.com/vllm-project/vllm/pull/2330
* Fix: `AttributeError` in OpenAI-compatible server by jaywonchung in https://github.com/vllm-project/vllm/pull/3018
* add cache_config's info to prometheus metrics. by AllenDou in https://github.com/vllm-project/vllm/pull/3100
* Support starcoder2 architecture by sh0416 in https://github.com/vllm-project/vllm/pull/3089
* Fix building from source on WSL by aliencaocao in https://github.com/vllm-project/vllm/pull/3112
* [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams by njhill in https://github.com/vllm-project/vllm/pull/3099
* Add guided decoding for OpenAI API server by felixzhu555 in https://github.com/vllm-project/vllm/pull/2819
* Fix: Output text is always truncated in some models by HyperdriveHustle in https://github.com/vllm-project/vllm/pull/3016
* Remove exclude_unset in streaming response by sh0416 in https://github.com/vllm-project/vllm/pull/3143
* docs: Add tutorial on deploying vLLM model with KServe by terrytangyuan in https://github.com/vllm-project/vllm/pull/2586
* fix relative import path of protocol.py by Huarong in https://github.com/vllm-project/vllm/pull/3134
* Integrate Marlin Kernels for Int4 GPTQ inference by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/2497
* Bump up to v0.3.3 by WoosukKwon in https://github.com/vllm-project/vllm/pull/3129

New Contributors
* Maxusmusti made their first contribution in https://github.com/vllm-project/vllm/pull/2939
* 44670 made their first contribution in https://github.com/vllm-project/vllm/pull/2983
* jlcmoore made their first contribution in https://github.com/vllm-project/vllm/pull/2918
* dachengai made their first contribution in https://github.com/vllm-project/vllm/pull/2539
* dylanwhawk made their first contribution in https://github.com/vllm-project/vllm/pull/3027
* ElefHead made their first contribution in https://github.com/vllm-project/vllm/pull/3064
* AllenDou made their first contribution in https://github.com/vllm-project/vllm/pull/3070
* jaywonchung made their first contribution in https://github.com/vllm-project/vllm/pull/3018
* sh0416 made their first contribution in https://github.com/vllm-project/vllm/pull/3089
* aliencaocao made their first contribution in https://github.com/vllm-project/vllm/pull/3112
* felixzhu555 made their first contribution in https://github.com/vllm-project/vllm/pull/2819
* HyperdriveHustle made their first contribution in https://github.com/vllm-project/vllm/pull/3016
* terrytangyuan made their first contribution in https://github.com/vllm-project/vllm/pull/2586
* Huarong made their first contribution in https://github.com/vllm-project/vllm/pull/3134

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.3.2...v0.3.3

0.3.2

Major Changes

This version adds support for the OLMo and Gemma models, as well as a per-request `seed` parameter.
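
A minimal sketch of the per-request `seed` parameter, assuming the offline `LLM` API and using a placeholder model name:

```python
# Minimal sketch: two requests with the same per-request seed are expected
# to sample identically even at a non-zero temperature.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b")  # placeholder; Gemma support landed in this release

params = SamplingParams(temperature=0.9, max_tokens=32, seed=1234)
first = llm.generate(["Write a haiku about GPUs."], params)
second = llm.generate(["Write a haiku about GPUs."], params)

# With a fixed seed, the two completions should match.
print(first[0].outputs[0].text == second[0].outputs[0].text)
```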

What's Changed
* Defensively copy `sampling_params` by njhill in https://github.com/vllm-project/vllm/pull/2881
* multi-LoRA as extra models in OpenAI server by jvmncs in https://github.com/vllm-project/vllm/pull/2775
* Add code-revision config argument for Hugging Face Hub by mbm-ai in https://github.com/vllm-project/vllm/pull/2892
* [Minor] Small fix to make distributed init logic in worker looks cleaner by zhuohan123 in https://github.com/vllm-project/vllm/pull/2905
* [Test] Add basic correctness test by zhuohan123 in https://github.com/vllm-project/vllm/pull/2908
* Support OLMo models. by Isotr0py in https://github.com/vllm-project/vllm/pull/2832
* Add warning to prevent changes to benchmark api server by simon-mo in https://github.com/vllm-project/vllm/pull/2858
* Fix `vllm:prompt_tokens_total` metric calculation by ronensc in https://github.com/vllm-project/vllm/pull/2869
* [ROCm] include gfx908 as supported by jamestwhedbee in https://github.com/vllm-project/vllm/pull/2792
* [FIX] Fix beam search test by zhuohan123 in https://github.com/vllm-project/vllm/pull/2930
* Make vLLM logging formatting optional by Yard1 in https://github.com/vllm-project/vllm/pull/2877
* Add metrics to RequestOutput by Yard1 in https://github.com/vllm-project/vllm/pull/2876
* Add Gemma model by xiangxu-google in https://github.com/vllm-project/vllm/pull/2964
* Upgrade transformers to v4.38.0 by WoosukKwon in https://github.com/vllm-project/vllm/pull/2965
* [FIX] Add Gemma model to the doc by zhuohan123 in https://github.com/vllm-project/vllm/pull/2966
* [ROCm] Upgrade transformers to v4.38.0 by WoosukKwon in https://github.com/vllm-project/vllm/pull/2967
* Support per-request seed by njhill in https://github.com/vllm-project/vllm/pull/2514
* Bump up version to v0.3.2 by zhuohan123 in https://github.com/vllm-project/vllm/pull/2968

New Contributors
* jvmncs made their first contribution in https://github.com/vllm-project/vllm/pull/2775
* mbm-ai made their first contribution in https://github.com/vllm-project/vllm/pull/2892
* Isotr0py made their first contribution in https://github.com/vllm-project/vllm/pull/2832
* jamestwhedbee made their first contribution in https://github.com/vllm-project/vllm/pull/2792

**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.3.1...v0.3.2
