Major Changes
- Experimental multi-LoRA support (see the usage sketch after this list)
- Experimental prefix caching support
- FP8 (E5M2) KV cache support
- Optimized MoE performance and DeepSeek MoE support
- CI-tested PRs
- Batch completion support in the server (see the second sketch after this list)
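
As a quick illustration of how the experimental multi-LoRA and FP8 KV cache options fit together, here is a minimal offline-inference sketch. It assumes the interfaces introduced in #1804 and #2279 (`enable_lora`, `kv_cache_dtype="fp8_e5m2"`, `LoRARequest`); the model name and adapter path are placeholders, and the exact argument names should be checked against the documentation for this release.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest  # experimental multi-LoRA support (#1804)

# Assumed flags: enable_lora turns on the experimental multi-LoRA path,
# kv_cache_dtype="fp8_e5m2" opts into the FP8-E5M2 KV cache (#2279).
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    kv_cache_dtype="fp8_e5m2",
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can carry its own adapter: LoRARequest takes a name,
# an integer id, and a local path to the adapter weights.
outputs = llm.generate(
    ["Give me a one-line summary of vLLM."],
    sampling_params,
    lora_request=LoRARequest("example-adapter", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```

For batch completion in the OpenAI-compatible server (#2529), the `/v1/completions` endpoint now accepts a list of prompts in a single request. The sketch below posts directly to a locally running server; the server address and model name are placeholders.

```python
import requests

# Batch completion: send several prompts in one /v1/completions request.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",  # placeholder model name
        "prompt": ["Hello, my name is", "The capital of France is"],
        "max_tokens": 16,
    },
)
# The response contains one choice per prompt, indexed in order.
for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])
```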
What's Changed
* Minor fix of type hint by beginlner in https://github.com/vllm-project/vllm/pull/2340
* Build docker image with shared objects from "build" step by payoto in https://github.com/vllm-project/vllm/pull/2237
* Ensure metrics are logged regardless of requests by ichernev in https://github.com/vllm-project/vllm/pull/2347
* Changed scheduler to use deques instead of lists by NadavShmayo in https://github.com/vllm-project/vllm/pull/2290
* Fix eager mode performance by WoosukKwon in https://github.com/vllm-project/vllm/pull/2377
* [Minor] Remove unused code in attention by WoosukKwon in https://github.com/vllm-project/vllm/pull/2384
* Add baichuan chat template jinja file by EvilPsyCHo in https://github.com/vllm-project/vllm/pull/2390
* [Speculative decoding 1/9] Optimized rejection sampler by cadedaniel in https://github.com/vllm-project/vllm/pull/2336
* Fix ipv4 ipv6 dualstack by yunfeng-scale in https://github.com/vllm-project/vllm/pull/2408
* [Minor] Rename phi_1_5 to phi by WoosukKwon in https://github.com/vllm-project/vllm/pull/2385
* [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by litone01 in https://github.com/vllm-project/vllm/pull/1011
* [Minor] Fix the format in quick start guide related to Model Scope by zhuohan123 in https://github.com/vllm-project/vllm/pull/2425
* Add gradio chatbot for openai webserver by arkohut in https://github.com/vllm-project/vllm/pull/2307
* [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by chenxu2048 in https://github.com/vllm-project/vllm/pull/2371
* Allow setting fastapi root_path argument by chiragjn in https://github.com/vllm-project/vllm/pull/2341
* Address Phi modeling update 2 by huiwy in https://github.com/vllm-project/vllm/pull/2428
* Update the error message shown when using V100 GPUs to be more user-friendly, offering clearer advice for beginners (#1901) by chuanzhubin in https://github.com/vllm-project/vllm/pull/2374
* Update quickstart.rst with small clarifying change (fix typo) by nautsimon in https://github.com/vllm-project/vllm/pull/2369
* Aligning `top_p` and `top_k` Sampling by chenxu2048 in https://github.com/vllm-project/vllm/pull/1885
* [Minor] Fix err msg by WoosukKwon in https://github.com/vllm-project/vllm/pull/2431
* [Minor] Optimize cuda graph memory usage by esmeetu in https://github.com/vllm-project/vllm/pull/2437
* [CI] Add Buildkite by simon-mo in https://github.com/vllm-project/vllm/pull/2355
* Announce the second vLLM meetup by WoosukKwon in https://github.com/vllm-project/vllm/pull/2444
* Allow buildkite to retry build on agent lost by simon-mo in https://github.com/vllm-project/vllm/pull/2446
* Fix weight loading for GQA with TP by zhangch9 in https://github.com/vllm-project/vllm/pull/2379
* CI: make sure benchmark script exit on error by simon-mo in https://github.com/vllm-project/vllm/pull/2449
* ci: retry on build failure as well by simon-mo in https://github.com/vllm-project/vllm/pull/2457
* Add StableLM3B model by ita9naiwa in https://github.com/vllm-project/vllm/pull/2372
* OpenAI refactoring by FlorianJoncour in https://github.com/vllm-project/vllm/pull/2360
* [Experimental] Prefix Caching Support by caoshiyi in https://github.com/vllm-project/vllm/pull/1669
* fix stablelm.py tensor-parallel-size bug by YingchaoX in https://github.com/vllm-project/vllm/pull/2482
* Minor fix in prefill cache example by JasonZhu1313 in https://github.com/vllm-project/vllm/pull/2494
* fix: fix some args desc by zspo in https://github.com/vllm-project/vllm/pull/2487
* [Neuron] Add an option to build with neuron by liangfu in https://github.com/vllm-project/vllm/pull/2065
* Don't download both safetensor and bin files. by NikolaBorisov in https://github.com/vllm-project/vllm/pull/2480
* [BugFix] Fix abort_seq_group by beginlner in https://github.com/vllm-project/vllm/pull/2463
* refactor completion api for readability by simon-mo in https://github.com/vllm-project/vllm/pull/2499
* Support OpenAI API server in `benchmark_serving.py` by hmellor in https://github.com/vllm-project/vllm/pull/2172
* Simplify broadcast logic for control messages by zhuohan123 in https://github.com/vllm-project/vllm/pull/2501
* [Bugfix] Fix loading local safetensors models by esmeetu in https://github.com/vllm-project/vllm/pull/2512
* Add benchmark serving to CI by simon-mo in https://github.com/vllm-project/vllm/pull/2505
* Add `group` as an argument in broadcast ops by GindaChen in https://github.com/vllm-project/vllm/pull/2522
* [Fix] Keep `scheduler.running` as deque by njhill in https://github.com/vllm-project/vllm/pull/2523
* migrate pydantic from v1 to v2 by joennlae in https://github.com/vllm-project/vllm/pull/2531
* [Speculative decoding 2/9] Multi-step worker for draft model by cadedaniel in https://github.com/vllm-project/vllm/pull/2424
* Fix "Port could not be cast to integer value as <function get_open_port>" by pcmoritz in https://github.com/vllm-project/vllm/pull/2545
* Add qwen2 by JustinLin610 in https://github.com/vllm-project/vllm/pull/2495
* Fix progress bar and allow HTTPS in `benchmark_serving.py` by hmellor in https://github.com/vllm-project/vllm/pull/2552
* Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by JasonZhu1313 in https://github.com/vllm-project/vllm/pull/2553
* [Feature] Simple API token authentication by taisazero in https://github.com/vllm-project/vllm/pull/1106
* Add multi-LoRA support by Yard1 in https://github.com/vllm-project/vllm/pull/1804
* lint: format all Python files instead of just source code by simon-mo in https://github.com/vllm-project/vllm/pull/2567
* [Bugfix] fix crash if max_tokens=None by NikolaBorisov in https://github.com/vllm-project/vllm/pull/2570
* Added `include_stop_str_in_output` and `length_penalty` parameters to OpenAI API by galatolofederico in https://github.com/vllm-project/vllm/pull/2562
* [Doc] Fix the syntax error in the doc of supported_models. by keli-wen in https://github.com/vllm-project/vllm/pull/2584
* Support Batch Completion in Server by simon-mo in https://github.com/vllm-project/vllm/pull/2529
* fix names and license by JustinLin610 in https://github.com/vllm-project/vllm/pull/2589
* [Fix] Use a correct device when creating OptionalCUDAGuard by sh1ng in https://github.com/vllm-project/vllm/pull/2583
* [ROCm] add support to ROCm 6.0 and MI300 by hongxiayang in https://github.com/vllm-project/vllm/pull/2274
* Support for Stable LM 2 by dakotamahan-stability in https://github.com/vllm-project/vllm/pull/2598
* Don't build punica kernels by default by pcmoritz in https://github.com/vllm-project/vllm/pull/2605
* AWQ: Up to 2.66x higher throughput by casper-hansen in https://github.com/vllm-project/vllm/pull/2566
* Use head_dim in config if exists by xiangxu-google in https://github.com/vllm-project/vllm/pull/2622
* Custom all reduce kernels by hanzhi713 in https://github.com/vllm-project/vllm/pull/2192
* [Minor] Fix warning on Ray dependencies by WoosukKwon in https://github.com/vllm-project/vllm/pull/2630
* Speed up Punica compilation by WoosukKwon in https://github.com/vllm-project/vllm/pull/2632
* Small async_llm_engine refactor by andoorve in https://github.com/vllm-project/vllm/pull/2618
* Update Ray version requirements by simon-mo in https://github.com/vllm-project/vllm/pull/2636
* Support FP8-E5M2 KV Cache by zhaoyang-star in https://github.com/vllm-project/vllm/pull/2279
* Fix error when tp > 1 by zhaoyang-star in https://github.com/vllm-project/vllm/pull/2644
* No repeated IPC open by hanzhi713 in https://github.com/vllm-project/vllm/pull/2642
* ROCm: Allow setting compilation target by rlrs in https://github.com/vllm-project/vllm/pull/2581
* DeepseekMoE support with Fused MoE kernel by zwd003 in https://github.com/vllm-project/vllm/pull/2453
* Fused MOE for Mixtral by pcmoritz in https://github.com/vllm-project/vllm/pull/2542
* Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` by HermitSun in https://github.com/vllm-project/vllm/pull/2664
* Add swap_blocks unit tests by sh1ng in https://github.com/vllm-project/vllm/pull/2616
* Fix a small typo (tenosr -> tensor) by pcmoritz in https://github.com/vllm-project/vllm/pull/2672
* [Minor] Fix false warning when TP=1 by WoosukKwon in https://github.com/vllm-project/vllm/pull/2674
* Add quantized mixtral support by WoosukKwon in https://github.com/vllm-project/vllm/pull/2673
* Bump up version to v0.3.0 by zhuohan123 in https://github.com/vllm-project/vllm/pull/2656
New Contributors
* payoto made their first contribution in https://github.com/vllm-project/vllm/pull/2237
* NadavShmayo made their first contribution in https://github.com/vllm-project/vllm/pull/2290
* EvilPsyCHo made their first contribution in https://github.com/vllm-project/vllm/pull/2390
* litone01 made their first contribution in https://github.com/vllm-project/vllm/pull/1011
* arkohut made their first contribution in https://github.com/vllm-project/vllm/pull/2307
* chiragjn made their first contribution in https://github.com/vllm-project/vllm/pull/2341
* huiwy made their first contribution in https://github.com/vllm-project/vllm/pull/2428
* chuanzhubin made their first contribution in https://github.com/vllm-project/vllm/pull/2374
* nautsimon made their first contribution in https://github.com/vllm-project/vllm/pull/2369
* zhangch9 made their first contribution in https://github.com/vllm-project/vllm/pull/2379
* ita9naiwa made their first contribution in https://github.com/vllm-project/vllm/pull/2372
* caoshiyi made their first contribution in https://github.com/vllm-project/vllm/pull/1669
* YingchaoX made their first contribution in https://github.com/vllm-project/vllm/pull/2482
* JasonZhu1313 made their first contribution in https://github.com/vllm-project/vllm/pull/2494
* zspo made their first contribution in https://github.com/vllm-project/vllm/pull/2487
* liangfu made their first contribution in https://github.com/vllm-project/vllm/pull/2065
* NikolaBorisov made their first contribution in https://github.com/vllm-project/vllm/pull/2480
* GindaChen made their first contribution in https://github.com/vllm-project/vllm/pull/2522
* njhill made their first contribution in https://github.com/vllm-project/vllm/pull/2523
* joennlae made their first contribution in https://github.com/vllm-project/vllm/pull/2531
* pcmoritz made their first contribution in https://github.com/vllm-project/vllm/pull/2545
* JustinLin610 made their first contribution in https://github.com/vllm-project/vllm/pull/2495
* taisazero made their first contribution in https://github.com/vllm-project/vllm/pull/1106
* galatolofederico made their first contribution in https://github.com/vllm-project/vllm/pull/2562
* keli-wen made their first contribution in https://github.com/vllm-project/vllm/pull/2584
* sh1ng made their first contribution in https://github.com/vllm-project/vllm/pull/2583
* hongxiayang made their first contribution in https://github.com/vllm-project/vllm/pull/2274
* dakotamahan-stability made their first contribution in https://github.com/vllm-project/vllm/pull/2598
* xiangxu-google made their first contribution in https://github.com/vllm-project/vllm/pull/2622
* andoorve made their first contribution in https://github.com/vllm-project/vllm/pull/2618
* rlrs made their first contribution in https://github.com/vllm-project/vllm/pull/2581
* zwd003 made their first contribution in https://github.com/vllm-project/vllm/pull/2453
**Full Changelog**: https://github.com/vllm-project/vllm/compare/v0.2.7...v0.3.0