Highlights
Features
* Support and enhance CommandR+ (3829), MiniCPM (3893), Meta Llama 3 (4175, 4182), and Mixtral 8x22B (4073, 4002)
* Support private (out-of-tree) model registration and update our model support policy (3871, 3948); a registration sketch follows this list
* Support PyTorch 2.2.1 and Triton 2.2.0 (4061, 4079, 3805, 3904, 4271)
* Add option to use LM Format Enforcer for guided decoding (3868)
* Add option to skip initialization of the tokenizer and detokenizer (3748)
* Add option to load models using `tensorizer` (3476)
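
For the out-of-tree model registration added in 3871, a minimal sketch of the intended usage, assuming `ModelRegistry.register_model` is exposed at the top level as described in that PR (for illustration we simply alias vLLM's in-tree Llama implementation; a private model would supply its own class):

```python
from vllm import LLM, ModelRegistry, SamplingParams
# For illustration only: re-register the in-tree Llama implementation under a
# new architecture name. A private model would supply its own implementation
# following the same interface as the classes in vllm/model_executor/models.
from vllm.model_executor.models.llama import LlamaForCausalLM

# The name must match the `architectures` entry in the checkpoint's HF config
# so vLLM can resolve the registered class at load time.
ModelRegistry.register_model("MyPrivateLlamaForCausalLM", LlamaForCausalLM)

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example checkpoint
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```
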
Enhancements
* vLLM is now mostly type checked by `mypy` (3816, 4006, 4161, 4043)
* Progress towards chunked prefill scheduler (3550, 3853, 4280, 3884)
* Progress towards speculative decoding (3250, 3706, 3894)
* Initial FP8 support with dynamic per-tensor scaling (4118)
Hardware
* Intel CPU inference backend is added (3993, 3634); a minimal usage sketch follows this list
* AMD backend is enhanced with Triton kernel and e4m3fn KV cache (3643, 3290)
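
The CPU backend uses the same Python API as the GPU backends; a minimal sketch once vLLM has been built with CPU support (3634), with the model name chosen only as a small example:

```python
from vllm import LLM, SamplingParams

# With a CPU-backend build of vLLM, no GPU is required; the API is unchanged.
llm = LLM(model="facebook/opt-125m")  # small example model that fits in RAM

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
```
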
What's Changed
* [Kernel] Layernorm performance optimization by mawong-amd in https://github.com/vllm-project/vllm/pull/3662
* [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by youkaichao in https://github.com/vllm-project/vllm/pull/3746
* [CI/Build] Make Marlin Tests Green by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3753
* [Misc] Minor fixes in requirements.txt by WoosukKwon in https://github.com/vllm-project/vllm/pull/3769
* [Misc] Some minor simplifications to detokenization logic by njhill in https://github.com/vllm-project/vllm/pull/3670
* [Misc] Fix Benchmark TTFT Calculation for Chat Completions by ywang96 in https://github.com/vllm-project/vllm/pull/3768
* [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by cadedaniel in https://github.com/vllm-project/vllm/pull/3250
* [Misc] Add support for new autogptq checkpoint_format by Qubitium in https://github.com/vllm-project/vllm/pull/3689
* [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by cadedaniel in https://github.com/vllm-project/vllm/pull/3783
* [Hardware][Intel] Add CPU inference backend by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3634
* [HotFix] [CI/Build] Minor fix for CPU backend CI by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3787
* [Frontend][Bugfix] allow using the default middleware with a root path by A-Mahla in https://github.com/vllm-project/vllm/pull/3788
* [Doc] Fix vLLMEngine Doc Page by ywang96 in https://github.com/vllm-project/vllm/pull/3791
* [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by youkaichao in https://github.com/vllm-project/vllm/pull/3801
* Fix crash when try torch.cuda.set_device in worker by leiwen83 in https://github.com/vllm-project/vllm/pull/3770
* [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by mgoin in https://github.com/vllm-project/vllm/pull/3798
* [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by youkaichao in https://github.com/vllm-project/vllm/pull/3803
* [Speculative decoding] Adding configuration object for speculative decoding by cadedaniel in https://github.com/vllm-project/vllm/pull/3706
* [BugFix] Use different mechanism to get vllm version in `is_cpu()` by njhill in https://github.com/vllm-project/vllm/pull/3804
* [Doc] Update README.md by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3806
* [Doc] Update contribution guidelines for better onboarding by michaelfeil in https://github.com/vllm-project/vllm/pull/3819
* [3/N] Refactor scheduler for chunked prefill scheduling by rkooo567 in https://github.com/vllm-project/vllm/pull/3550
* Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by AdrianAbeyta in https://github.com/vllm-project/vllm/pull/3290
* [Misc] Publish 3rd meetup slides by WoosukKwon in https://github.com/vllm-project/vllm/pull/3835
* Fixes the argument for local_tokenizer_group by sighingnow in https://github.com/vllm-project/vllm/pull/3754
* [Core] Enable hf_transfer by default if available by michaelfeil in https://github.com/vllm-project/vllm/pull/3817
* [Bugfix] Add kv_scale input parameter to CPU backend by WoosukKwon in https://github.com/vllm-project/vllm/pull/3840
* [Core] [Frontend] Make detokenization optional by mgerstgrasser in https://github.com/vllm-project/vllm/pull/3749
* [Bugfix] Fix args in benchmark_serving by CatherineSue in https://github.com/vllm-project/vllm/pull/3836
* [Benchmark] Refactor sample_requests in benchmark_throughput by gty111 in https://github.com/vllm-project/vllm/pull/3613
* [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by youkaichao in https://github.com/vllm-project/vllm/pull/3805
* [Hardware][CPU] Update cpu torch to match default of 2.2.1 by mgoin in https://github.com/vllm-project/vllm/pull/3854
* [Model] Cohere CommandR+ by saurabhdash2512 in https://github.com/vllm-project/vllm/pull/3829
* [Core] improve robustness of pynccl by youkaichao in https://github.com/vllm-project/vllm/pull/3860
* [Doc] Add asynchronous engine arguments to documentation by SeanGallen in https://github.com/vllm-project/vllm/pull/3810
* [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by youkaichao in https://github.com/vllm-project/vllm/pull/3859
* [Misc] Add pytest marker to opt-out of global test cleanup by cadedaniel in https://github.com/vllm-project/vllm/pull/3863
* [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by cadedaniel in https://github.com/vllm-project/vllm/pull/3864
* [Bugfix] Fixing requirements.txt by noamgat in https://github.com/vllm-project/vllm/pull/3865
* [Misc] Define common requirements by WoosukKwon in https://github.com/vllm-project/vllm/pull/3841
* Add option to completion API to truncate prompt tokens by tdoublep in https://github.com/vllm-project/vllm/pull/3144
* [Chunked Prefill][4/n] Chunked prefill scheduler. by rkooo567 in https://github.com/vllm-project/vllm/pull/3853
* [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by Isotr0py in https://github.com/vllm-project/vllm/pull/3869
* [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by youkaichao in https://github.com/vllm-project/vllm/pull/3889
* [Core] enable out-of-tree model register by youkaichao in https://github.com/vllm-project/vllm/pull/3871
* [WIP][Core] latency optimization by youkaichao in https://github.com/vllm-project/vllm/pull/3890
* [Bugfix] Fix Llava inference with Tensor Parallelism. by Isotr0py in https://github.com/vllm-project/vllm/pull/3883
* [Model] add minicpm by SUDA-HLT-ywfang in https://github.com/vllm-project/vllm/pull/3893
* [Bugfix] Added Command-R GPTQ support by egortolmachev in https://github.com/vllm-project/vllm/pull/3849
* [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration by Ki6an in https://github.com/vllm-project/vllm/pull/3767
* [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by mawong-amd in https://github.com/vllm-project/vllm/pull/3782
* [BugFix][Model] Fix commandr RoPE max_position_embeddings by esmeetu in https://github.com/vllm-project/vllm/pull/3919
* [Core] separate distributed_init from worker by youkaichao in https://github.com/vllm-project/vllm/pull/3904
* [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by cadedaniel in https://github.com/vllm-project/vllm/pull/3837
* [Bugfix] Fix KeyError on loading GPT-NeoX by jsato8094 in https://github.com/vllm-project/vllm/pull/3925
* [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by jpvillam-amd in https://github.com/vllm-project/vllm/pull/3643
* [Misc] Avoid loading incorrect LoRA config by jeejeelee in https://github.com/vllm-project/vllm/pull/3777
* [Benchmark] Add cpu options to bench scripts by PZD-CHINA in https://github.com/vllm-project/vllm/pull/3915
* [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by zhaotyer in https://github.com/vllm-project/vllm/pull/3955
* [Bugfix] Fix logits processor when prompt_logprobs is not None by huyiwen in https://github.com/vllm-project/vllm/pull/3899
* [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by tjohnson31415 in https://github.com/vllm-project/vllm/pull/3876
* [Bugfix][ROCm] Add numba to Dockerfile.rocm by WoosukKwon in https://github.com/vllm-project/vllm/pull/3962
* [Model][AMD] ROCm support for 256 head dims for Gemma by jamestwhedbee in https://github.com/vllm-project/vllm/pull/3972
* [Doc] Add doc to state our model support policy by youkaichao in https://github.com/vllm-project/vllm/pull/3948
* [Bugfix] Remove key sorting for `guided_json` parameter in OpenAI-compatible server by dmarasco in https://github.com/vllm-project/vllm/pull/3945
* [Doc] Fix getting-started guide to use a publicly available model by fpaupier in https://github.com/vllm-project/vllm/pull/3963
* [Bugfix] handle hf_config with architectures == None by tjohnson31415 in https://github.com/vllm-project/vllm/pull/3982
* [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by youkaichao in https://github.com/vllm-project/vllm/pull/3950
* [Core][5/N] Fully working chunked prefill e2e by rkooo567 in https://github.com/vllm-project/vllm/pull/3884
* [Core][Model] Use torch.compile to accelerate layernorm in commandr by youkaichao in https://github.com/vllm-project/vllm/pull/3985
* [Test] Add xformer and flash attn tests by rkooo567 in https://github.com/vllm-project/vllm/pull/3961
* [Misc] refactor ops and cache_ops layer by jikunshang in https://github.com/vllm-project/vllm/pull/3913
* [Doc][Installation] delete python setup.py develop by youkaichao in https://github.com/vllm-project/vllm/pull/3989
* [Kernel] Fused MoE Config for Mixtral 8x22 by ywang96 in https://github.com/vllm-project/vllm/pull/4002
* fix-bgmv-kernel-640 by kingljl in https://github.com/vllm-project/vllm/pull/4007
* [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/3824
* [Core] Set `linear_weights` directly on the layer by Yard1 in https://github.com/vllm-project/vllm/pull/3977
* [Core][Distributed] make init_distributed_environment compatible with init_process_group by youkaichao in https://github.com/vllm-project/vllm/pull/4014
* Fix echo/logprob OpenAI completion bug by dylanwhawk in https://github.com/vllm-project/vllm/pull/3441
* [Kernel] Add extra punica sizes to support bigger vocabs by Yard1 in https://github.com/vllm-project/vllm/pull/4015
* [BugFix] Fix handling of stop strings and stop token ids by njhill in https://github.com/vllm-project/vllm/pull/3672
* [Doc] Add typing hints / mypy types cleanup by michaelfeil in https://github.com/vllm-project/vllm/pull/3816
* [Core] Support LoRA on quantized models by jeejeelee in https://github.com/vllm-project/vllm/pull/4012
* [Frontend][Core] Move `merge_async_iterators` to utils by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4026
* [Test] Test multiple attn backend for chunked prefill. by rkooo567 in https://github.com/vllm-project/vllm/pull/4023
* [Bugfix] fix type hint for py 3.8 by youkaichao in https://github.com/vllm-project/vllm/pull/4036
* [Misc] Fix typo in scheduler.py by zhuohan123 in https://github.com/vllm-project/vllm/pull/4022
* [mypy] Add mypy type annotation part 1 by rkooo567 in https://github.com/vllm-project/vllm/pull/4006
* [Core] fix custom allreduce default value by youkaichao in https://github.com/vllm-project/vllm/pull/4040
* Fix triton compilation issue by Bellk17 in https://github.com/vllm-project/vllm/pull/3984
* [Bugfix] Fix LoRA bug by jeejeelee in https://github.com/vllm-project/vllm/pull/4032
* [CI/Test] expand ruff and yapf for all supported python version by youkaichao in https://github.com/vllm-project/vllm/pull/4037
* [Bugfix] More type hint fixes for py 3.8 by dylanwhawk in https://github.com/vllm-project/vllm/pull/4039
* [Core][Distributed] improve logging for init dist by youkaichao in https://github.com/vllm-project/vllm/pull/4042
* [Bugfix] Fix log time in metrics by zspo in https://github.com/vllm-project/vllm/pull/4050
* [Bugfix] Fix small bug in Neuron executor by zspo in https://github.com/vllm-project/vllm/pull/4051
* [Kernel] Add punica dimension for Baichuan-13B by jeejeelee in https://github.com/vllm-project/vllm/pull/4053
* [Frontend] [Core] feat: Add model loading using `tensorizer` by sangstar in https://github.com/vllm-project/vllm/pull/3476
* [Core] avoid too many cuda context by caching p2p test by youkaichao in https://github.com/vllm-project/vllm/pull/4021
* [BugFix] Fix tensorizer extra in setup.py by njhill in https://github.com/vllm-project/vllm/pull/4072
* [Docs] document that mixtral 8x22b is supported by simon-mo in https://github.com/vllm-project/vllm/pull/4073
* [Misc] Upgrade triton to 2.2.0 by esmeetu in https://github.com/vllm-project/vllm/pull/4061
* [Bugfix] Fix filelock version requirement by zhuohan123 in https://github.com/vllm-project/vllm/pull/4075
* [Misc][Minor] Fix CPU block num log in CPUExecutor. by bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4088
* [Core] Simplifications to executor classes by njhill in https://github.com/vllm-project/vllm/pull/4071
* [Doc] Add better clarity for tensorizer usage by sangstar in https://github.com/vllm-project/vllm/pull/4090
* [Bugfix] Fix ray workers profiling with nsight by rickyyx in https://github.com/vllm-project/vllm/pull/4095
* [Typing] Fix Sequence type GenericAlias only available after Python 3.9. by rkooo567 in https://github.com/vllm-project/vllm/pull/4092
* [Core] Fix engine-use-ray broken by rkooo567 in https://github.com/vllm-project/vllm/pull/4105
* LM Format Enforcer Guided Decoding Support by noamgat in https://github.com/vllm-project/vllm/pull/3868
* [Core] Refactor model loading code by Yard1 in https://github.com/vllm-project/vllm/pull/4097
* [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine by cadedaniel in https://github.com/vllm-project/vllm/pull/3894
* [Misc] [CI] Fix CI failure caught after merge by cadedaniel in https://github.com/vllm-project/vllm/pull/4126
* [CI] Move CPU/AMD tests to after wait by cadedaniel in https://github.com/vllm-project/vllm/pull/4123
* [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication by youkaichao in https://github.com/vllm-project/vllm/pull/4024
* [Bugfix] fix output parsing error for trtllm backend by elinx in https://github.com/vllm-project/vllm/pull/4137
* [Kernel] Add punica dimension for Swallow-MS-7B LoRA by ucciicci in https://github.com/vllm-project/vllm/pull/4134
* [Typing] Mypy typing part 2 by rkooo567 in https://github.com/vllm-project/vllm/pull/4043
* [Core] Add integrity check during initialization; add test for it by youkaichao in https://github.com/vllm-project/vllm/pull/4155
* Allow model to be served under multiple names by hmellor in https://github.com/vllm-project/vllm/pull/2894
* [Bugfix] Get available quantization methods from quantization registry by mgoin in https://github.com/vllm-project/vllm/pull/4098
* [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill by mmoskal in https://github.com/vllm-project/vllm/pull/4128
* [Docs] document that Meta Llama 3 is supported by simon-mo in https://github.com/vllm-project/vllm/pull/4175
* [Bugfix] Support logprobs when using guided_json and other constrained decoding fields by jamestwhedbee in https://github.com/vllm-project/vllm/pull/4149
* [Misc] Bump transformers to latest version by njhill in https://github.com/vllm-project/vllm/pull/4176
* [CI/CD] add neuron docker and ci test scripts by liangfu in https://github.com/vllm-project/vllm/pull/3571
* [Bugfix] Fix CustomAllreduce pcie nvlink topology detection (3974) by agt in https://github.com/vllm-project/vllm/pull/4159
* [Core] add an option to log every function call for debugging hang/crash in distributed inference by youkaichao in https://github.com/vllm-project/vllm/pull/4079
* Support eos_token_id from generation_config.json by simon-mo in https://github.com/vllm-project/vllm/pull/4182
* [Bugfix] Fix LoRA loading check by jeejeelee in https://github.com/vllm-project/vllm/pull/4138
* Bump version of 0.4.1 by simon-mo in https://github.com/vllm-project/vllm/pull/4177
* [Misc] fix docstrings by UranusSeven in https://github.com/vllm-project/vllm/pull/4191
* [Bugfix][Core] Restore logging of stats in the async engine by ronensc in https://github.com/vllm-project/vllm/pull/4150
* [Misc] add nccl in collect env by youkaichao in https://github.com/vllm-project/vllm/pull/4211
* Pass `tokenizer_revision` when getting tokenizer in openai serving by chiragjn in https://github.com/vllm-project/vllm/pull/4214
* [Bugfix] Add fix for JSON whitespace by ayusher in https://github.com/vllm-project/vllm/pull/4189
* Fix missing docs and out of sync `EngineArgs` by hmellor in https://github.com/vllm-project/vllm/pull/4219
* [Kernel][FP8] Initial support with dynamic per-tensor scaling by comaniac in https://github.com/vllm-project/vllm/pull/4118
* [Frontend] multiple sampling params support by nunjunj in https://github.com/vllm-project/vllm/pull/3570
* Updating lm-format-enforcer version and adding links to decoding libraries in docs by noamgat in https://github.com/vllm-project/vllm/pull/4222
* Don't show default value for flags in `EngineArgs` by hmellor in https://github.com/vllm-project/vllm/pull/4223
* [Doc]: Update the page of adding new models by YeFD in https://github.com/vllm-project/vllm/pull/4236
* Make initialization of tokenizer and detokenizer optional by GeauxEric in https://github.com/vllm-project/vllm/pull/3748
* [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring by hongxiayang in https://github.com/vllm-project/vllm/pull/4129
* [Core][Distributed] fix _is_full_nvlink detection by youkaichao in https://github.com/vllm-project/vllm/pull/4233
* [Misc] Add vision language model support to CPU backend by Isotr0py in https://github.com/vllm-project/vllm/pull/3968
* [Bugfix] Fix type annotations in CPU model runner by WoosukKwon in https://github.com/vllm-project/vllm/pull/4256
* [Frontend] Enable support for CPU backend in AsyncLLMEngine. by sighingnow in https://github.com/vllm-project/vllm/pull/3993
* [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter by alexm-nm in https://github.com/vllm-project/vllm/pull/4217
* Add example scripts to documentation by hmellor in https://github.com/vllm-project/vllm/pull/4225
* [Core] Scheduler perf fix by rkooo567 in https://github.com/vllm-project/vllm/pull/4270
* [Doc] Update the SkyPilot doc with serving and Llama-3 by Michaelvll in https://github.com/vllm-project/vllm/pull/4276
* [Core][Distributed] use absolute path for library file by youkaichao in https://github.com/vllm-project/vllm/pull/4271
* Fix `autodoc` directives by hmellor in https://github.com/vllm-project/vllm/pull/4272
* [Mypy] Part 3 fix typing for nested directories for most of directory by rkooo567 in https://github.com/vllm-project/vllm/pull/4161
* [Core] Some simplification of WorkerWrapper changes by njhill in https://github.com/vllm-project/vllm/pull/4183
* [Core] Scheduling optimization 2 by rkooo567 in https://github.com/vllm-project/vllm/pull/4280
* [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. by cadedaniel in https://github.com/vllm-project/vllm/pull/3951
* [Bugfix] Fixing max token error message for OpenAI-compatible server by jgordley in https://github.com/vllm-project/vllm/pull/4016
* [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper by DefTruth in https://github.com/vllm-project/vllm/pull/4286
* [Core][Logging] Add last frame information for better debugging by youkaichao in https://github.com/vllm-project/vllm/pull/4278
* [CI] Add ccache for wheel builds job by simon-mo in https://github.com/vllm-project/vllm/pull/4281
* AQLM CUDA support by jaemzfleming in https://github.com/vllm-project/vllm/pull/3287
* [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened by DarkLight1337 in https://github.com/vllm-project/vllm/pull/4292
* [Kernel] FP8 support for MoE kernel / Mixtral by pcmoritz in https://github.com/vllm-project/vllm/pull/4244
* [Bugfix] fixed fp8 conflict with aqlm by robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4307
* [Core][Distributed] use cpu/gloo to initialize pynccl by youkaichao in https://github.com/vllm-project/vllm/pull/4248
* [CI][Build] change pynvml to nvidia-ml-py by youkaichao in https://github.com/vllm-project/vllm/pull/4302
* [Misc] Reduce supported Punica dtypes by WoosukKwon in https://github.com/vllm-project/vllm/pull/4304
New Contributors
* mawong-amd made their first contribution in https://github.com/vllm-project/vllm/pull/3662
* Qubitium made their first contribution in https://github.com/vllm-project/vllm/pull/3689
* bigPYJ1151 made their first contribution in https://github.com/vllm-project/vllm/pull/3634
* A-Mahla made their first contribution in https://github.com/vllm-project/vllm/pull/3788
* AdrianAbeyta made their first contribution in https://github.com/vllm-project/vllm/pull/3290
* mgerstgrasser made their first contribution in https://github.com/vllm-project/vllm/pull/3749
* CatherineSue made their first contribution in https://github.com/vllm-project/vllm/pull/3836
* saurabhdash2512 made their first contribution in https://github.com/vllm-project/vllm/pull/3829
* SeanGallen made their first contribution in https://github.com/vllm-project/vllm/pull/3810
* SUDA-HLT-ywfang made their first contribution in https://github.com/vllm-project/vllm/pull/3893
* egortolmachev made their first contribution in https://github.com/vllm-project/vllm/pull/3849
* Ki6an made their first contribution in https://github.com/vllm-project/vllm/pull/3767
* jsato8094 made their first contribution in https://github.com/vllm-project/vllm/pull/3925
* jpvillam-amd made their first contribution in https://github.com/vllm-project/vllm/pull/3643
* PZD-CHINA made their first contribution in https://github.com/vllm-project/vllm/pull/3915
* zhaotyer made their first contribution in https://github.com/vllm-project/vllm/pull/3955
* huyiwen made their first contribution in https://github.com/vllm-project/vllm/pull/3899
* dmarasco made their first contribution in https://github.com/vllm-project/vllm/pull/3945
* fpaupier made their first contribution in https://github.com/vllm-project/vllm/pull/3963
* kingljl made their first contribution in https://github.com/vllm-project/vllm/pull/4007
* DarkLight1337 made their first contribution in https://github.com/vllm-project/vllm/pull/4026
* Bellk17 made their first contribution in https://github.com/vllm-project/vllm/pull/3984
* sangstar made their first contribution in https://github.com/vllm-project/vllm/pull/3476
* rickyyx made their first contribution in https://github.com/vllm-project/vllm/pull/4095
* elinx made their first contribution in https://github.com/vllm-project/vllm/pull/4137
* ucciicci made their first contribution in https://github.com/vllm-project/vllm/pull/4134
* mmoskal made their first contribution in https://github.com/vllm-project/vllm/pull/4128
* agt made their first contribution in https://github.com/vllm-project/vllm/pull/4159
* ayusher made their first contribution in https://github.com/vllm-project/vllm/pull/4189
* nunjunj made their first contribution in https://github.com/vllm-project/vllm/pull/3570
* YeFD made their first contribution in https://github.com/vllm-project/vllm/pull/4236
* GeauxEric made their first contribution in https://github.com/vllm-project/vllm/pull/3748
* alexm-nm made their first contribution in https://github.com/vllm-project/vllm/pull/4217
* jgordley made their first contribution in https://github.com/vllm-project/vllm/pull/4016
* DefTruth made their first contribution in https://github.com/vllm-project/vllm/pull/4286
* jaemzfleming made their first contribution in https://github.com/vllm-project/vllm/pull/3287
**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.4.0...v0.4.1