lmdeploy

Latest version: v0.7.2.post1


0.5.1


What's Changed
🚀 Features
* Support phi3-vision by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1845
* Support internvl2 chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1911
* support gemma2 in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1924
* Add tools to api_server for InternLM2 model by AllentDan in https://github.com/InternLM/lmdeploy/pull/1763 (see the client sketch after this list)
* support internvl2-1b by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1983
* feat: support llama2 and internlm2 on 910B by yao-fengchen in https://github.com/InternLM/lmdeploy/pull/2011
* Support glm 4v by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1947
* support internlm-xcomposer2d5-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1932
* add chat template for codegeex4 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/2013
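
For reference, a minimal sketch of calling the new api_server tools support from an OpenAI-compatible client. The server address, model name lookup, and the weather function schema are illustrative, assuming an api_server started with an InternLM2 chat model:

```python
from openai import OpenAI

# Assumes an LMDeploy OpenAI-compatible server is already running, e.g.
# `lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333`.
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# A hypothetical tool schema for illustration.
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_current_weather',
        'description': 'Get the current weather in a given location',
        'parameters': {
            'type': 'object',
            'properties': {'location': {'type': 'string'}},
            'required': ['location'],
        },
    },
}]

model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': "What's the weather like in Shanghai?"}],
    tools=tools,
)
print(response.choices[0].message)
```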
💥 Improvements
* misc: rm unnecessary files by zhyncs in https://github.com/InternLM/lmdeploy/pull/1875
* drop stop words by grimoire in https://github.com/InternLM/lmdeploy/pull/1823
* Add usage in stream response by fbzhong in https://github.com/InternLM/lmdeploy/pull/1876
* Optimize sampling on pytorch engine. by grimoire in https://github.com/InternLM/lmdeploy/pull/1853
* Remove deprecated chat cli and vl examples by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1899
* vision model use tp number of gpu by irexyc in https://github.com/InternLM/lmdeploy/pull/1854
* misc: add default api_server_url for api_client by zhyncs in https://github.com/InternLM/lmdeploy/pull/1922
* misc: add transformers version check for TurboMind Tokenizer by zhyncs in https://github.com/InternLM/lmdeploy/pull/1917
* fix: append _stats when size > 0 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1809
* refactor: update awq linear and rm legacy by zhyncs in https://github.com/InternLM/lmdeploy/pull/1940
* feat: add gpu topo for check_env by zhyncs in https://github.com/InternLM/lmdeploy/pull/1944
* fix transformers version check for InternVL2 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1952
* Upgrade gradio by AllentDan in https://github.com/InternLM/lmdeploy/pull/1930
* refactor sampling layer setup by irexyc in https://github.com/InternLM/lmdeploy/pull/1912
* Add exception handler to image encoder by irexyc in https://github.com/InternLM/lmdeploy/pull/2010
* Avoid the same session id for openai endpoint by AllentDan in https://github.com/InternLM/lmdeploy/pull/1995
🐞 Bug fixes
* Fix error link reference by zihaomu in https://github.com/InternLM/lmdeploy/pull/1881
* Fix internlm-xcomposer2-vl awq search scale by AllentDan in https://github.com/InternLM/lmdeploy/pull/1890
* fix SamplingDecodeTest and SamplingDecodeTest2 unittest failure by zhyncs in https://github.com/InternLM/lmdeploy/pull/1874
* Fix smem size for fused split-kv reduction by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1909
* fix llama3 chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1956
* fix: set PYTHONIOENCODING to UTF-8 before start tritonserver by zhyncs in https://github.com/InternLM/lmdeploy/pull/1971
* Fix internvl2-40b model export by irexyc in https://github.com/InternLM/lmdeploy/pull/1979
* fix logprobs by irexyc in https://github.com/InternLM/lmdeploy/pull/1968
* fix unexpected argument error when deploying "cogvlm-chat-hf" by AllentDan in https://github.com/InternLM/lmdeploy/pull/1982
* fix mixtral and mistral cache_position by zhyncs in https://github.com/InternLM/lmdeploy/pull/1941
* Fix the session_len assignment logic by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2007
* Fix logprobs openai api by irexyc in https://github.com/InternLM/lmdeploy/pull/1985
* Fix internvl2-40b awq inference by AllentDan in https://github.com/InternLM/lmdeploy/pull/2023
* Fix side effect of 1995 by AllentDan in https://github.com/InternLM/lmdeploy/pull/2033
📚 Documentations
* docs: update faq for turbomind so not found by zhyncs in https://github.com/InternLM/lmdeploy/pull/1877
* [Doc]: Change to sphinx-book-theme in readthedocs by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1880
* docs: update compatibility section in README by zhyncs in https://github.com/InternLM/lmdeploy/pull/1946
* docs: update kv quant doc by zhyncs in https://github.com/InternLM/lmdeploy/pull/1977
* docs: sync the core features in README to index.rst by zhyncs in https://github.com/InternLM/lmdeploy/pull/1988
* Fix table rendering for readthedocs by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1998
* docs: fix Ada compatibility by zhyncs in https://github.com/InternLM/lmdeploy/pull/2016
* update xcomposer2d5 docs by irexyc in https://github.com/InternLM/lmdeploy/pull/2037
🌐 Other
* [ci] add internlm2.5 models into testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1928
* bump version to v0.5.1 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2022

New Contributors
* zihaomu made their first contribution in https://github.com/InternLM/lmdeploy/pull/1881
* fbzhong made their first contribution in https://github.com/InternLM/lmdeploy/pull/1876

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.5.0...v0.5.1

0.5.0


What's Changed
🚀 Features
* support MiniCPM-Llama3-V 2.5 by irexyc in https://github.com/InternLM/lmdeploy/pull/1708
* [Feature]: Support llava for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1641
* Device dispatcher by grimoire in https://github.com/InternLM/lmdeploy/pull/1775
* Add GLM-4-9B-Chat by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1724
* Torch deepseek v2 by grimoire in https://github.com/InternLM/lmdeploy/pull/1621
* Support internvl-chat for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1797
* Add interfaces to the pipeline to obtain logits and ppl by irexyc in https://github.com/InternLM/lmdeploy/pull/1652 (see the sketch after this list)
* [Feature]: Support cogvlm-chat by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1502
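
For reference, a minimal sketch of the new logits/ppl pipeline interfaces. The model ID is illustrative, and `get_ppl`/`get_logits` are assumed to take token ids:

```python
from lmdeploy import pipeline

# Model ID is illustrative; any supported chat model works.
pipe = pipeline('internlm/internlm2-chat-7b')

# Score a piece of text: encode it to token ids first.
input_ids = pipe.tokenizer.encode('Hello, world!')
ppl = pipe.get_ppl(input_ids)        # perplexity of the token sequence
logits = pipe.get_logits(input_ids)  # raw logits for each position
print(ppl, logits.shape)
```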
💥 Improvements
* support mistral and llava_mistral in turbomind by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1579
* Add health endpoint by AllentDan in https://github.com/InternLM/lmdeploy/pull/1679
* upgrade the version of the dependency package peft by grimoire in https://github.com/InternLM/lmdeploy/pull/1687
* Follow the conventional model_name by AllentDan in https://github.com/InternLM/lmdeploy/pull/1677
* API Image URL fetch timeout by vody-am in https://github.com/InternLM/lmdeploy/pull/1684
* Support internlm-xcomposer2-4khd-7b awq by AllentDan in https://github.com/InternLM/lmdeploy/pull/1666
* update dockerfile and docs by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1715
* lazy import VLAsyncEngine to avoid bringing in VLMs dependencies when deploying LLMs by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1714
* feat: align with OpenAI temperature range by zhyncs in https://github.com/InternLM/lmdeploy/pull/1733
* feat: align with OpenAI temperature range in api server by zhyncs in https://github.com/InternLM/lmdeploy/pull/1734
* Refactor converter about get_input_model_registered_name and get_output_model_registered_name_and_config by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1702
* Refine max_new_tokens logic to improve user experience by AllentDan in https://github.com/InternLM/lmdeploy/pull/1705
* Refactor loading weights by grimoire in https://github.com/InternLM/lmdeploy/pull/1603
* refactor config by grimoire in https://github.com/InternLM/lmdeploy/pull/1751
* Add anomaly handler by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1780
* Encode raw image file to base64 by irexyc in https://github.com/InternLM/lmdeploy/pull/1773
* skip inference for oversized inputs by grimoire in https://github.com/InternLM/lmdeploy/pull/1769
* fix: prevent numpy breakage by zhyncs in https://github.com/InternLM/lmdeploy/pull/1791
* More accurate time logging for ImageEncoder and fix concurrent image processing corruption by irexyc in https://github.com/InternLM/lmdeploy/pull/1765
* Optimize kernel launch for triton2.2.0 and triton2.3.0 by grimoire in https://github.com/InternLM/lmdeploy/pull/1499
* feat: auto set awq model_format from hf by zhyncs in https://github.com/InternLM/lmdeploy/pull/1799
* check driver mismatch by grimoire in https://github.com/InternLM/lmdeploy/pull/1811
* PyTorchEngine adapts to the latest internlm2 modeling. by grimoire in https://github.com/InternLM/lmdeploy/pull/1798
* AsyncEngine create cancel task in exception. by grimoire in https://github.com/InternLM/lmdeploy/pull/1807
* compat internlm2 for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1825
* Add model revision & download_dir to cli by irexyc in https://github.com/InternLM/lmdeploy/pull/1814
* fix image encoder request queue by irexyc in https://github.com/InternLM/lmdeploy/pull/1837
* Harden stream callback by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1838
* Support Qwen2-1.5b awq by AllentDan in https://github.com/InternLM/lmdeploy/pull/1793
* remove chat template config in turbomind engine by irexyc in https://github.com/InternLM/lmdeploy/pull/1161
* misc: align PyTorch Engine temperature with TurboMind by zhyncs in https://github.com/InternLM/lmdeploy/pull/1850
* docs: update cache-max-entry-count help message by zhyncs in https://github.com/InternLM/lmdeploy/pull/1892
🐞 Bug fixes
* fix typos by irexyc in https://github.com/InternLM/lmdeploy/pull/1690
* [Bugfix] fix internvl-1.5-chat vision model preprocess and freeze weights by DefTruth in https://github.com/InternLM/lmdeploy/pull/1741
* lock setuptools version in dockerfile by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1770
* Fix openai package can not use proxy stream mode by AllentDan in https://github.com/InternLM/lmdeploy/pull/1692
* Fix finish_reason by AllentDan in https://github.com/InternLM/lmdeploy/pull/1768
* fix uncached stop words by grimoire in https://github.com/InternLM/lmdeploy/pull/1754
* [side-effect] Fix param `--cache-max-entry-count` is not taking effect (1758) by QwertyJack in https://github.com/InternLM/lmdeploy/pull/1778
* support qwen2 1.5b by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1782
* fix falcon attention by grimoire in https://github.com/InternLM/lmdeploy/pull/1761
* Refine AsyncEngine exception handler by AllentDan in https://github.com/InternLM/lmdeploy/pull/1789
* [side-effect] fix weight_type caused by PR 1702 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1795
* fix best_match_model by irexyc in https://github.com/InternLM/lmdeploy/pull/1812
* Fix Request completed log by irexyc in https://github.com/InternLM/lmdeploy/pull/1821
* fix qwen-vl-chat hung by irexyc in https://github.com/InternLM/lmdeploy/pull/1824
* Detokenize with prompt token ids by AllentDan in https://github.com/InternLM/lmdeploy/pull/1753
* Update engine.py to fix small typos by WANGSSSSSSS in https://github.com/InternLM/lmdeploy/pull/1829
* [side-effect] bring back "--cap" argument in chat cli by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1859
* Fix vl session-len by AllentDan in https://github.com/InternLM/lmdeploy/pull/1860
* fix gradio vl "stop_words" by irexyc in https://github.com/InternLM/lmdeploy/pull/1873
* fix qwen2 cache_position for PyTorch Engine when transformers>4.41.2 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1886
* fix model name matching for internvl by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1867
📚 Documentations
* docs: add BentoLMDeploy in README by zhyncs in https://github.com/InternLM/lmdeploy/pull/1736
* [Doc]: Update docs for internlm2.5 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1887
🌐 Other
* add longtext generation benchmark by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1694
* add qwen2 model into testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1772
* fix pr test for newest internlm2 model by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1806
* react test evaluation config by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1861
* bump version to v0.5.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1852

New Contributors
* DefTruth made their first contribution in https://github.com/InternLM/lmdeploy/pull/1741
* QwertyJack made their first contribution in https://github.com/InternLM/lmdeploy/pull/1778
* WANGSSSSSSS made their first contribution in https://github.com/InternLM/lmdeploy/pull/1829

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.4.2...v0.5.0

0.4.2

Highlights

- Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVA, InternLMXComposer2

**Quantization**

```shell
lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ
```


**Inference with quantized model**

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```


- Balance the vision model weights across GPUs when deploying VLMs with multiple GPUs

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```


What's Changed
🚀 Features
* PyTorch Engine hash table based prefix caching by grimoire in https://github.com/InternLM/lmdeploy/pull/1429
* support phi3 by grimoire in https://github.com/InternLM/lmdeploy/pull/1497
* Turbomind prefix caching by ispobock in https://github.com/InternLM/lmdeploy/pull/1450 (see the sketch after this list)
* Enable search scale for awq by AllentDan in https://github.com/InternLM/lmdeploy/pull/1545
* [Feature] Support vl models quantization by AllentDan in https://github.com/InternLM/lmdeploy/pull/1553
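
For reference, a minimal sketch of enabling the new prefix caching (the model ID is illustrative; `enable_prefix_caching` is assumed to be the engine-config switch):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Cached KV blocks are reused across requests that share a prompt prefix.
engine_config = TurbomindEngineConfig(enable_prefix_caching=True)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)

# Both prompts share the same system prefix, so the second can hit the cache.
prompts = ['You are a helpful assistant. Q: 1+1=? A:',
           'You are a helpful assistant. Q: 2+3=? A:']
print(pipe(prompts))
```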
💥 Improvements
* make Qwen compatible with Slora when TP > 1 by jjjjohnson in https://github.com/InternLM/lmdeploy/pull/1518
* Optimize slora by grimoire in https://github.com/InternLM/lmdeploy/pull/1447
* Use a faster format for images in VLMs by isidentical in https://github.com/InternLM/lmdeploy/pull/1575
* add chat-template args to chat cli by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1566
* Get the max session len from config.json by AllentDan in https://github.com/InternLM/lmdeploy/pull/1550
* Optimize w8a8 kernel by grimoire in https://github.com/InternLM/lmdeploy/pull/1353
* support python 3.12 by irexyc in https://github.com/InternLM/lmdeploy/pull/1605
* Optimize moe by grimoire in https://github.com/InternLM/lmdeploy/pull/1520
* Balance vision model weights on multi gpus by irexyc in https://github.com/InternLM/lmdeploy/pull/1591
* Support user-specified IMAGE_TOKEN position for deepseek-vl model by irexyc in https://github.com/InternLM/lmdeploy/pull/1627
* Optimize GQA/MQA by grimoire in https://github.com/InternLM/lmdeploy/pull/1649
🐞 Bug fixes
* fix logger init by AllentDan in https://github.com/InternLM/lmdeploy/pull/1598
* Bugfix: wrongly assign gen_config with True by thelongestusernameofall in https://github.com/InternLM/lmdeploy/pull/1594
* Enable split-kv for attention by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1606
* Fix xcomposer2 vision model process by irexyc in https://github.com/InternLM/lmdeploy/pull/1640
* Fix NTK scaling by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1636
* Fix illegal memory access when seq_len < 64 by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1616
* Fix llava vl template by irexyc in https://github.com/InternLM/lmdeploy/pull/1620
* [side-effect] fix deepseek-vl when tp is 1 by irexyc in https://github.com/InternLM/lmdeploy/pull/1648
* fix logprobs output by irexyc in https://github.com/InternLM/lmdeploy/pull/1561
* fix fused-moe in triton2.2.0 by grimoire in https://github.com/InternLM/lmdeploy/pull/1654
* Align tokenizers in pipeline and api_server benchmark scripts by AllentDan in https://github.com/InternLM/lmdeploy/pull/1650
* [side-effect] fix UnboundLocalError for internlm-xcomposer2-4khd-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1661
* remove paged attention prefill autotune by grimoire in https://github.com/InternLM/lmdeploy/pull/1658
* Fix transformers 4.41.0 prompt may differ after encode decode by AllentDan in https://github.com/InternLM/lmdeploy/pull/1617
📚 Documentations
* Fix typo in w8a8.md by chg0901 in https://github.com/InternLM/lmdeploy/pull/1568
* Update doc for prefix caching by ispobock in https://github.com/InternLM/lmdeploy/pull/1597
* Update VL document by AllentDan in https://github.com/InternLM/lmdeploy/pull/1657
🌐 Other
* remove first empty token check and add input validation testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1549
* add more model into benchmark and evaluate workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1565
* add vl awq testcase and refactor pipeline testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1630
* bump version to v0.4.2 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1644

New Contributors
* isidentical made their first contribution in https://github.com/InternLM/lmdeploy/pull/1575
* chg0901 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1568
* thelongestusernameofall made their first contribution in https://github.com/InternLM/lmdeploy/pull/1594

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.4.1...v0.4.2

0.4.1


What's Changed
🚀 Features
* Add colab demo by AllentDan in https://github.com/InternLM/lmdeploy/pull/1428
* support starcoder2 by grimoire in https://github.com/InternLM/lmdeploy/pull/1468
* support OpenGVLab/InternVL-Chat-V1-5 by irexyc in https://github.com/InternLM/lmdeploy/pull/1490
💥 Improvements
* variable `CTA_H` & fix qkv bias by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1491
* refactor vision model loading by irexyc in https://github.com/InternLM/lmdeploy/pull/1482
* fix installation requirements for windows by irexyc in https://github.com/InternLM/lmdeploy/pull/1531
* Remove split batch inside pipeline inference function by AllentDan in https://github.com/InternLM/lmdeploy/pull/1507
* Remove first empty chunk for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1527
* add benchmark script to profile pipeline APIs by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1528
* Add input validation by AllentDan in https://github.com/InternLM/lmdeploy/pull/1525
🐞 Bug fixes
* fix local variable 'response' referenced before assignment in async_engine.generate by irexyc in https://github.com/InternLM/lmdeploy/pull/1513
* Fix turbomind import in windows by irexyc in https://github.com/InternLM/lmdeploy/pull/1533
* Fix convert qwen2 to turbomind by AllentDan in https://github.com/InternLM/lmdeploy/pull/1546
* Adding api_key and model_name parameters to the restful benchmark by NiuBlibing in https://github.com/InternLM/lmdeploy/pull/1478
📚 Documentations
* update supported models for Baichuan by zhyncs in https://github.com/InternLM/lmdeploy/pull/1485
* Fix typo in w8a8.md by Infinity4B in https://github.com/InternLM/lmdeploy/pull/1523
* complete build.md by YanxingLiu in https://github.com/InternLM/lmdeploy/pull/1508
* update readme wechat qrcode by vansin in https://github.com/InternLM/lmdeploy/pull/1529
* Update docker docs for VL api by vody-am in https://github.com/InternLM/lmdeploy/pull/1534
* Format supported model table using html syntax by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1493
* doc: add example of deploying api server to Kubernetes by uzuku in https://github.com/InternLM/lmdeploy/pull/1488
🌐 Other
* add modelscope and lora testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1506
* bump version to v0.4.1 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1544

New Contributors
* NiuBlibing made their first contribution in https://github.com/InternLM/lmdeploy/pull/1478
* Infinity4B made their first contribution in https://github.com/InternLM/lmdeploy/pull/1523
* YanxingLiu made their first contribution in https://github.com/InternLM/lmdeploy/pull/1508
* vody-am made their first contribution in https://github.com/InternLM/lmdeploy/pull/1534
* uzuku made their first contribution in https://github.com/InternLM/lmdeploy/pull/1488

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.4.0...v0.4.1

0.4.0


Highlights

**Support for Llama3 and additional Vision-Language Models (VLMs):**
- We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, MiniGemini, and InternLMXComposer2.
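
For reference, a minimal sketch of chatting with one of the newly supported models through the pipeline API (the model ID is illustrative; any supported Llama3 or VLM checkpoint can be substituted):

```python
from lmdeploy import pipeline

# Model ID is illustrative.
pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct')
print(pipe(['Please introduce yourself']))
```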

**Introduce online int4/int8 KV quantization and inference**
- data-free online quantization
- supports all NVIDIA GPUs with Volta architecture (sm70) and above
- KV int8 quantization is almost lossless in accuracy, and KV int4 accuracy stays within an acceptable range
- efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40%, respectively, compared to fp16
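
For reference, a minimal sketch of turning on online KV quantization through the engine config (the model ID is illustrative; `quant_policy=8` selects int8 and `quant_policy=4` selects int4):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables online int8 KV cache quantization; 4 enables int4.
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hi, please introduce yourself']))
```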

The following table shows the evaluation results of three LLMs with different KV numerical precisions:

| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | --------------- | ------- | ------- |
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with quantized KV cache.

| model | kv type | test settings | RPS | v.s. kv fp16 |
| ----------------- | ------- | ---------------------------------------- | ----- | ------------ |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |


What's Changed
🚀 Features
* Support qwen1.5 in turbomind engine by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1406
* Online 8/4-bit KV-cache quantization by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1377
* Support qwen1.5-*-AWQ model inference in turbomind by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1430
* support Internvl chat v1.1, v1.2 and v1.2-plus by irexyc in https://github.com/InternLM/lmdeploy/pull/1425
* support Internvl chat llava by irexyc in https://github.com/InternLM/lmdeploy/pull/1426
* Add llama3 chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1461
* Support mini gemini llama by AllentDan in https://github.com/InternLM/lmdeploy/pull/1438
* add interactive api in service for VL models by AllentDan in https://github.com/InternLM/lmdeploy/pull/1444
* support output logprobs with turbomind backend by irexyc in https://github.com/InternLM/lmdeploy/pull/1391 (see the sketch after this list)
* support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1458
* Add qwen1.5 awq quantization by AllentDan in https://github.com/InternLM/lmdeploy/pull/1470
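
For reference, a minimal sketch of requesting logprobs from the turbomind backend (the model ID is illustrative; `logprobs` is assumed to be the `GenerationConfig` field controlling how many top log-probabilities to return per token):

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')  # model ID is illustrative
gen_config = GenerationConfig(logprobs=5, max_new_tokens=32)
resp = pipe(['Hello!'], gen_config=gen_config)
print(resp[0].logprobs)  # per-token top-5 log-probabilities
```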
💥 Improvements
* Reduce binary size, add `sm_89` and `sm_90` targets by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
* Use new event loop instead of the current loop for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1352
* Optimize inference of pytorch engine with tensor parallelism by grimoire in https://github.com/InternLM/lmdeploy/pull/1397
* add llava-v1.6-34b template by irexyc in https://github.com/InternLM/lmdeploy/pull/1408
* Initialize vl encoder first to avoid OOM by AllentDan in https://github.com/InternLM/lmdeploy/pull/1434
* Support model_name customization for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1403
* Expose dynamic split&fuse parameters by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1433
* warning transformers version by grimoire in https://github.com/InternLM/lmdeploy/pull/1453
* Optimize apply_rotary kernel and remove useless inference_mode by grimoire in https://github.com/InternLM/lmdeploy/pull/1457
* set infinity timeout to nccl by grimoire in https://github.com/InternLM/lmdeploy/pull/1465
* Feat: format internlm2 chat template by liujiangning30 in https://github.com/InternLM/lmdeploy/pull/1456
🐞 Bug fixes
* handle SIGTERM by grimoire in https://github.com/InternLM/lmdeploy/pull/1389
* fix chat cli `ArgumentError` error happened in python 3.11 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
* Fix llama_triton_example by AllentDan in https://github.com/InternLM/lmdeploy/pull/1414
* Fix missing --trust-remote-code in converter, a side effect brought by PR 1406, by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1420
* fix sampling kernel by grimoire in https://github.com/InternLM/lmdeploy/pull/1417
* Fix loading single safetensor file error by AllentDan in https://github.com/InternLM/lmdeploy/pull/1427
* remove space in deepseek template by grimoire in https://github.com/InternLM/lmdeploy/pull/1441
* fix free repetition_penalty_workspace_ buffer by irexyc in https://github.com/InternLM/lmdeploy/pull/1467
* fix adapter failure when tp>1 by grimoire in https://github.com/InternLM/lmdeploy/pull/1476
* get model in advance to fix downloading from modelscope error by irexyc in https://github.com/InternLM/lmdeploy/pull/1473
* Fix the side effect in engine_instance brought by 1391 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1480
📚 Documentations
* Add model name corresponding to the test data in the doc by wykvictor in https://github.com/InternLM/lmdeploy/pull/1400
* fix typo in get_started guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1411
* Add async openai demo for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1409
* add the recommendation version for Python Backend by zhyncs in https://github.com/InternLM/lmdeploy/pull/1436
* Update kv quantization and inference guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1412
* update doc for llama3 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1462
🌐 Other
* hack cmakelist.txt in pr_test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1405
* Add benchmark report generated in summary by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1419
* add restful completions v1 test case by ZhoujhZoe in https://github.com/InternLM/lmdeploy/pull/1416
* Add kvint4/8 ete testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1448
* improve rotary embedding of qwen in torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1451
* change cutlass url in ut by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1464
* bump version to v0.4.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1469

New Contributors
* wykvictor made their first contribution in https://github.com/InternLM/lmdeploy/pull/1400
* ZhoujhZoe made their first contribution in https://github.com/InternLM/lmdeploy/pull/1416
* liujiangning30 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1456

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0

0.3.0


Highlights
* Refactor attention and optimize GQA (1258, 1307, 1116), achieving 22+ and 16+ RPS for internlm2-7b and internlm2-20b respectively, about 1.8x faster than vLLM
* Support new models, including Qwen1.5-MoE (1372), DBRX (1367) and DeepSeek-VL (1335)
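
For reference, a minimal sketch of running one of the newly supported models on the PyTorch engine (the model ID is illustrative):

```python
from lmdeploy import pipeline, PytorchEngineConfig

# DBRX, Qwen1.5-MoE and DeepSeek-VL landed in this release; model ID is illustrative.
pipe = pipeline('Qwen/Qwen1.5-MoE-A2.7B-Chat',
                backend_config=PytorchEngineConfig())
print(pipe(['Hi, please introduce yourself']))
```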


What's Changed
🚀 Features
* Add tensor core GQA dispatch for `[4,5,6,8]` by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1258
* upgrade turbomind to v2.1 by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1307, https://github.com/InternLM/lmdeploy/pull/1116
* Support slora to pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1286
* Support qwen for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1265
* Support Triton inference server python backend by ispobock in https://github.com/InternLM/lmdeploy/pull/1329
* torch engine support dbrx by grimoire in https://github.com/InternLM/lmdeploy/pull/1367
* Support qwen2 moe for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1372
* Add deepseek vl by AllentDan in https://github.com/InternLM/lmdeploy/pull/1335
💥 Improvements
* rm unused var by zhyncs in https://github.com/InternLM/lmdeploy/pull/1256
* Expose cache_block_seq_len to API by ispobock in https://github.com/InternLM/lmdeploy/pull/1218
* add chat template for deepseek coder model by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1310
* Add more log info for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1323
* remove cuda cache after loading vison model by irexyc in https://github.com/InternLM/lmdeploy/pull/1325
* Add new chat cli with auto backend feature by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1276
* Update rewritings for qwen by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1351
* lazy import accelerate.init_empty_weights for vl async engine by irexyc in https://github.com/InternLM/lmdeploy/pull/1359
* update lmdeploy pypi packages deps to cuda12 by irexyc in https://github.com/InternLM/lmdeploy/pull/1368
* update `max_prefill_token_num` for low gpu memory by grimoire in https://github.com/InternLM/lmdeploy/pull/1373
* Optimize pipeline of pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1328
🐞 Bug fixes
* fix different stop/bad words length in batch by irexyc in https://github.com/InternLM/lmdeploy/pull/1246
* Fix performance issue of chatbot by ispobock in https://github.com/InternLM/lmdeploy/pull/1295
* add missed argument by irexyc in https://github.com/InternLM/lmdeploy/pull/1317
* Fix dlpack memory leak by ispobock in https://github.com/InternLM/lmdeploy/pull/1344
* Fix invalid context for Internstudio platform by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1354
* fix benchmark generation by grimoire in https://github.com/InternLM/lmdeploy/pull/1349
* fix window attention by grimoire in https://github.com/InternLM/lmdeploy/pull/1341
* fix batchApplyRepetitionPenalty by irexyc in https://github.com/InternLM/lmdeploy/pull/1358
* Fix memory leak of DLManagedTensor by ispobock in https://github.com/InternLM/lmdeploy/pull/1361
* fix vlm inference hung with tp by irexyc in https://github.com/InternLM/lmdeploy/pull/1336
* [Fix] fix the unit test of model name deduce by AllentDan in https://github.com/InternLM/lmdeploy/pull/1382
📚 Documentations
* add citation in readme by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1308
* Add slora example for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1343
🌐 Other
* Add restful interface regression daily test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1302
* Add offline mode for testcase workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1318
* workflow bugfix and add llava-v1.5-13b testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1339
* Add benchmark test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1364
* bump version to v0.3.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1387


**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.6...v0.3.0
