Major Changes
- Experimental multi-LoRA support (see the usage sketch after this list)
- Experimental prefix caching support
- FP8 (E5M2) KV cache support
- Optimized MoE performance and DeepSeek MoE support
- CI-tested PRs
- Batch completion support in the server (see the second sketch after this list)
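
As a quick illustration of how the experimental multi-LoRA and FP8 KV cache options fit together, here is a minimal offline-inference sketch. It assumes the interfaces introduced in #1804 and #2279 (`enable_lora`, `kv_cache_dtype="fp8_e5m2"`, `LoRARequest`); the model name and adapter path are placeholders, and the exact argument names should be checked against the documentation for this release.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest  # experimental multi-LoRA support (#1804)

# Assumed flags: enable_lora turns on the experimental multi-LoRA path,
# kv_cache_dtype="fp8_e5m2" opts into the FP8-E5M2 KV cache (#2279).
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    kv_cache_dtype="fp8_e5m2",
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can carry its own adapter: LoRARequest takes a name,
# an integer id, and a local path to the adapter weights.
outputs = llm.generate(
    ["Give me a one-line summary of vLLM."],
    sampling_params,
    lora_request=LoRARequest("example-adapter", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```

For batch completion in the OpenAI-compatible server (#2529), the `/v1/completions` endpoint now accepts a list of prompts in a single request. The sketch below posts directly to a locally running server; the server address and model name are placeholders.

```python
import requests

# Batch completion: send several prompts in one /v1/completions request.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",  # placeholder model name
        "prompt": ["Hello, my name is", "The capital of France is"],
        "max_tokens": 16,
    },
)
# The response contains one choice per prompt, indexed in order.
for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])
```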
What's Changed
* Minor fix of type hint by beginlner in https://github.com/vllm-project/vllm/pull/2340
* Build docker image with shared objects from "build" step by payoto in https://github.com/vllm-project/vllm/pull/2237
* Ensure metrics are logged regardless of requests by ichernev in https://github.com/vllm-project/vllm/pull/2347
* Changed scheduler to use deques instead of lists by NadavShmayo in https://github.com/vllm-project/vllm/pull/2290
* Fix eager mode performance by WoosukKwon in https://github.com/vllm-project/vllm/pull/2377
* [Minor] Remove unused code in attention by WoosukKwon in https://github.com/vllm-project/vllm/pull/2384
* Add baichuan chat template jinja file by EvilPsyCHo in https://github.com/vllm-project/vllm/pull/2390
* [Speculative decoding 1/9] Optimized rejection sampler by cadedaniel in https://github.com/vllm-project/vllm/pull/2336
* Fix ipv4 ipv6 dualstack by yunfeng-scale in https://github.com/vllm-project/vllm/pull/2408
* [Minor] Rename phi_1_5 to phi by WoosukKwon in https://github.com/vllm-project/vllm/pull/2385
* [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by litone01 in https://github.com/vllm-project/vllm/pull/1011
* [Minor] Fix the format in quick start guide related to Model Scope by zhuohan123 in https://github.com/vllm-project/vllm/pull/2425
* Add gradio chatbot for openai webserver by arkohut in https://github.com/vllm-project/vllm/pull/2307
* [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by chenxu2048 in https://github.com/vllm-project/vllm/pull/2371
* Allow setting fastapi root_path argument by chiragjn in https://github.com/vllm-project/vllm/pull/2341
* Address Phi modeling update 2 by huiwy in https://github.com/vllm-project/vllm/pull/2428
* Update the error message shown when using V100 GPUs to be more user-friendly, offering clearer advice for beginners (#1901) by chuanzhubin in https://github.com/vllm-project/vllm/pull/2374
* Update quickstart.rst with small clarifying change (fix typo) by nautsimon in https://github.com/vllm-project/vllm/pull/2369
* Aligning `top_p` and `top_k` Sampling by chenxu2048 in https://github.com/vllm-project/vllm/pull/1885
* [Minor] Fix err msg by WoosukKwon in https://github.com/vllm-project/vllm/pull/2431
* [Minor] Optimize cuda graph memory usage by esmeetu in https://github.com/vllm-project/vllm/pull/2437
* [CI] Add Buildkite by simon-mo in https://github.com/vllm-project/vllm/pull/2355
* Announce the second vLLM meetup by WoosukKwon in https://github.com/vllm-project/vllm/pull/2444
* Allow buildkite to retry build on agent lost by simon-mo in https://github.com/vllm-project/vllm/pull/2446
* Fix weight loading for GQA with TP by zhangch9 in https://github.com/vllm-project/vllm/pull/2379
* CI: make sure benchmark script exit on error by simon-mo in https://github.com/vllm-project/vllm/pull/2449
* ci: retry on build failure as well by simon-mo in https://github.com/vllm-project/vllm/pull/2457
* Add StableLM3B model by ita9naiwa in https://github.com/vllm-project/vllm/pull/2372
* OpenAI refactoring by FlorianJoncour in https://github.com/vllm-project/vllm/pull/2360
* [Experimental] Prefix Caching Support by caoshiyi in https://github.com/vllm-project/vllm/pull/1669
* fix stablelm.py tensor-parallel-size bug by YingchaoX in https://github.com/vllm-project/vllm/pull/2482
* Minor fix in prefill cache example by JasonZhu1313 in https://github.com/vllm-project/vllm/pull/2494
* fix: fix some args desc by zspo in https://github.com/vllm-project/vllm/pull/2487
* [Neuron] Add an option to build with neuron by liangfu in https://github.com/vllm-project/vllm/pull/2065
* Don't download both safetensor and bin files. by NikolaBorisov in https://github.com/vllm-project/vllm/pull/2480
* [BugFix] Fix abort_seq_group by beginlner in https://github.com/vllm-project/vllm/pull/2463
* refactor completion api for readability by simon-mo in https://github.com/vllm-project/vllm/pull/2499
* Support OpenAI API server in `benchmark_serving.py` by hmellor in https://github.com/vllm-project/vllm/pull/2172
* Simplify broadcast logic for control messages by zhuohan123 in https://github.com/vllm-project/vllm/pull/2501
* [Bugfix] Fix loading local safetensors models by esmeetu in https://github.com/vllm-project/vllm/pull/2512
* Add benchmark serving to CI by simon-mo in https://github.com/vllm-project/vllm/pull/2505
* Add `group` as an argument in broadcast ops by GindaChen in https://github.com/vllm-project/vllm/pull/2522
* [Fix] Keep `scheduler.running` as deque by njhill in https://github.com/vllm-project/vllm/pull/2523
* migrate pydantic from v1 to v2 by joennlae in https://github.com/vllm-project/vllm/pull/2531
* [Speculative decoding 2/9] Multi-step worker for draft model by cadedaniel in https://github.com/vllm-project/vllm/pull/2424
* Fix "Port could not be cast to integer value as <function get_open_port>" by pcmoritz in https://github.com/vllm-project/vllm/pull/2545
* Add qwen2 by JustinLin610 in https://github.com/vllm-project/vllm/pull/2495
* Fix progress bar and allow HTTPS in `benchmark_serving.py` by hmellor in https://github.com/vllm-project/vllm/pull/2552
* Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by JasonZhu1313 in https://github.com/vllm-project/vllm/pull/2553
* [Feature] Simple API token authentication by taisazero in https://github.com/vllm-project/vllm/pull/1106
* Add multi-LoRA support by Yard1 in https://github.com/vllm-project/vllm/pull/1804
* lint: format all Python files instead of just source code by simon-mo in https://github.com/vllm-project/vllm/pull/2567
* [Bugfix] fix crash if max_tokens=None by NikolaBorisov in https://github.com/vllm-project/vllm/pull/2570
* Added `include_stop_str_in_output` and `length_penalty` parameters to OpenAI API by galatolofederico in https://github.com/vllm-project/vllm/pull/2562
* [Doc] Fix the syntax error in the doc of supported_models. by keli-wen in https://github.com/vllm-project/vllm/pull/2584
* Support Batch Completion in Server by simon-mo in https://github.com/vllm-project/vllm/pull/2529
* fix names and license by JustinLin610 in https://github.com/vllm-project/vllm/pull/2589
* [Fix] Use a correct device when creating OptionalCUDAGuard by sh1ng in https://github.com/vllm-project/vllm/pull/2583
* [ROCm] add support to ROCm 6.0 and MI300 by hongxiayang in https://github.com/vllm-project/vllm/pull/2274
* Support for Stable LM 2 by dakotamahan-stability in https://github.com/vllm-project/vllm/pull/2598
* Don't build punica kernels by default by pcmoritz in https://github.com/vllm-project/vllm/pull/2605
* AWQ: Up to 2.66x higher throughput by casper-hansen in https://github.com/vllm-project/vllm/pull/2566
* Use head_dim in config if exists by xiangxu-google in https://github.com/vllm-project/vllm/pull/2622
* Custom all reduce kernels by hanzhi713 in https://github.com/vllm-project/vllm/pull/2192
* [Minor] Fix warning on Ray dependencies by WoosukKwon in https://github.com/vllm-project/vllm/pull/2630
* Speed up Punica compilation by WoosukKwon in https://github.com/vllm-project/vllm/pull/2632
* Small async_llm_engine refactor by andoorve in https://github.com/vllm-project/vllm/pull/2618
* Update Ray version requirements by simon-mo in https://github.com/vllm-project/vllm/pull/2636
* Support FP8-E5M2 KV Cache by zhaoyang-star in https://github.com/vllm-project/vllm/pull/2279
* Fix error when tp > 1 by zhaoyang-star in https://github.com/vllm-project/vllm/pull/2644
* No repeated IPC open by hanzhi713 in https://github.com/vllm-project/vllm/pull/2642
* ROCm: Allow setting compilation target by rlrs in https://github.com/vllm-project/vllm/pull/2581
* DeepseekMoE support with Fused MoE kernel by zwd003 in https://github.com/vllm-project/vllm/pull/2453
* Fused MOE for Mixtral by pcmoritz in https://github.com/vllm-project/vllm/pull/2542
* Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` by HermitSun in https://github.com/vllm-project/vllm/pull/2664
* Add swap_blocks unit tests by sh1ng in https://github.com/vllm-project/vllm/pull/2616
* Fix a small typo (tenosr -> tensor) by pcmoritz in https://github.com/vllm-project/vllm/pull/2672
* [Minor] Fix false warning when TP=1 by WoosukKwon in https://github.com/vllm-project/vllm/pull/2674
* Add quantized mixtral support by WoosukKwon in https://github.com/vllm-project/vllm/pull/2673
* Bump up version to v0.3.0 by zhuohan123 in https://github.com/vllm-project/vllm/pull/2656
New Contributors
* payoto made their first contribution in https://github.com/vllm-project/vllm/pull/2237
* NadavShmayo made their first contribution in https://github.com/vllm-project/vllm/pull/2290
* EvilPsyCHo made their first contribution in https://github.com/vllm-project/vllm/pull/2390
* litone01 made their first contribution in https://github.com/vllm-project/vllm/pull/1011
* arkohut made their first contribution in https://github.com/vllm-project/vllm/pull/2307
* chiragjn made their first contribution in https://github.com/vllm-project/vllm/pull/2341
* huiwy made their first contribution in https://github.com/vllm-project/vllm/pull/2428
* chuanzhubin made their first contribution in https://github.com/vllm-project/vllm/pull/2374
* nautsimon made their first contribution in https://github.com/vllm-project/vllm/pull/2369
* zhangch9 made their first contribution in https://github.com/vllm-project/vllm/pull/2379
* ita9naiwa made their first contribution in https://github.com/vllm-project/vllm/pull/2372
* caoshiyi made their first contribution in https://github.com/vllm-project/vllm/pull/1669
* YingchaoX made their first contribution in https://github.com/vllm-project/vllm/pull/2482
* JasonZhu1313 made their first contribution in https://github.com/vllm-project/vllm/pull/2494
* zspo made their first contribution in https://github.com/vllm-project/vllm/pull/2487
* liangfu made their first contribution in https://github.com/vllm-project/vllm/pull/2065
* NikolaBorisov made their first contribution in https://github.com/vllm-project/vllm/pull/2480
* GindaChen made their first contribution in https://github.com/vllm-project/vllm/pull/2522
* njhill made their first contribution in https://github.com/vllm-project/vllm/pull/2523
* joennlae made their first contribution in https://github.com/vllm-project/vllm/pull/2531
* pcmoritz made their first contribution in https://github.com/vllm-project/vllm/pull/2545
* JustinLin610 made their first contribution in https://github.com/vllm-project/vllm/pull/2495
* taisazero made their first contribution in https://github.com/vllm-project/vllm/pull/1106
* galatolofederico made their first contribution in https://github.com/vllm-project/vllm/pull/2562
* keli-wen made their first contribution in https://github.com/vllm-project/vllm/pull/2584
* sh1ng made their first contribution in https://github.com/vllm-project/vllm/pull/2583
* hongxiayang made their first contribution in https://github.com/vllm-project/vllm/pull/2274
* dakotamahan-stability made their first contribution in https://github.com/vllm-project/vllm/pull/2598
* xiangxu-google made their first contribution in https://github.com/vllm-project/vllm/pull/2622
* andoorve made their first contribution in https://github.com/vllm-project/vllm/pull/2618
* rlrs made their first contribution in https://github.com/vllm-project/vllm/pull/2581
* zwd003 made their first contribution in https://github.com/vllm-project/vllm/pull/2453
**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.2.7...v0.3.0