## Highlights
- Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA and InternLM-XComposer2
**Quantization**
```shell
lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ
```
**Inference with quantized model**
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('./InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```
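**Serving the quantized model**
The AWQ model can also be exposed as an OpenAI-compatible service. A minimal sketch, assuming the existing `lmdeploy serve api_server` CLI; the port is illustrative:
```shell
lmdeploy serve api_server ./InternVL-Chat-V1-5-AWQ --backend turbomind --model-format awq --server-port 23333
```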
- Balance vision model weights across multiple GPUs when deploying VLMs
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                backend_config=TurbomindEngineConfig(tp=2))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```
## What's Changed
### 🚀 Features
* PyTorch Engine hash table based prefix caching by grimoire in https://github.com/InternLM/lmdeploy/pull/1429 (enabling sketch after this list)
* support phi3 by grimoire in https://github.com/InternLM/lmdeploy/pull/1497
* Turbomind prefix caching by ispobock in https://github.com/InternLM/lmdeploy/pull/1450
* Enable search scale for awq by AllentDan in https://github.com/InternLM/lmdeploy/pull/1545
* [Feature] Support vl models quantization by AllentDan in https://github.com/InternLM/lmdeploy/pull/1553
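Both prefix-caching entries above (PR 1429 for the PyTorch engine, PR 1450 for TurboMind) are opt-in. A minimal enabling sketch, assuming the `enable_prefix_caching` switch on both engine configs; the model id is illustrative:
```python
from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig

# PyTorch engine: hash-table based prefix caching (PR 1429)
pipe_pt = pipeline('internlm/internlm2-chat-7b',
                   backend_config=PytorchEngineConfig(enable_prefix_caching=True))

# TurboMind engine: prefix caching (PR 1450)
pipe_tm = pipeline('internlm/internlm2-chat-7b',
                   backend_config=TurbomindEngineConfig(enable_prefix_caching=True))
```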
### 💥 Improvements
* make Qwen compatible with Slora when TP > 1 by jjjjohnson in https://github.com/InternLM/lmdeploy/pull/1518
* Optimize slora by grimoire in https://github.com/InternLM/lmdeploy/pull/1447
* Use a faster format for images in VLMs by isidentical in https://github.com/InternLM/lmdeploy/pull/1575
* add chat-template args to chat cli by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1566
* Get the max session len from config.json by AllentDan in https://github.com/InternLM/lmdeploy/pull/1550
* Optimize w8a8 kernel by grimoire in https://github.com/InternLM/lmdeploy/pull/1353
* support python 3.12 by irexyc in https://github.com/InternLM/lmdeploy/pull/1605
* Optimize moe by grimoire in https://github.com/InternLM/lmdeploy/pull/1520
* Balance vision model weights on multi gpus by irexyc in https://github.com/InternLM/lmdeploy/pull/1591
* Support user-specified IMAGE_TOKEN position for deepseek-vl model by irexyc in https://github.com/InternLM/lmdeploy/pull/1627 (usage sketch after this list)
* Optimize GQA/MQA by grimoire in https://github.com/InternLM/lmdeploy/pull/1649
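For the user-specified IMAGE_TOKEN position above, a minimal usage sketch: placing the placeholder token in the prompt controls where the image features are inserted. The import path of `IMAGE_TOKEN` and the model id are assumptions:
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN  # assumed location of the placeholder token

pipe = pipeline('deepseek-ai/deepseek-vl-7b-chat')
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# Put the image token where the image features should appear in the prompt
out = pipe((f'{IMAGE_TOKEN}\ndescribe this image', img))
print(out)
```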
### 🐞 Bug fixes
* fix logger init by AllentDan in https://github.com/InternLM/lmdeploy/pull/1598
* Bugfix: wrongly assign gen_config with True by thelongestusernameofall in https://github.com/InternLM/lmdeploy/pull/1594
* Enable split-kv for attention by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1606
* Fix xcomposer2 vision model process by irexyc in https://github.com/InternLM/lmdeploy/pull/1640
* Fix NTK scaling by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1636
* Fix illegal memory access when seq_len < 64 by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1616
* Fix llava vl template by irexyc in https://github.com/InternLM/lmdeploy/pull/1620
* [side-effect] fix deepseek-vl when tp is 1 by irexyc in https://github.com/InternLM/lmdeploy/pull/1648
* fix logprobs output by irexyc in https://github.com/InternLM/lmdeploy/pull/1561
* fix fused-moe in triton2.2.0 by grimoire in https://github.com/InternLM/lmdeploy/pull/1654
* Align tokenizers in pipeline and api_server benchmark scripts by AllentDan in https://github.com/InternLM/lmdeploy/pull/1650
* [side-effect] fix UnboundLocalError for internlm-xcomposer2-4khd-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1661
* remove paged attention prefill autotune by grimoire in https://github.com/InternLM/lmdeploy/pull/1658
* Fix transformers 4.41.0 prompt may differ after encode decode by AllentDan in https://github.com/InternLM/lmdeploy/pull/1617
### 📚 Documentation
* Fix typo in w8a8.md by chg0901 in https://github.com/InternLM/lmdeploy/pull/1568
* Update doc for prefix caching by ispobock in https://github.com/InternLM/lmdeploy/pull/1597
* Update VL document by AllentDan in https://github.com/InternLM/lmdeploy/pull/1657
### 🌐 Other
* remove first empty token check and add input validation testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1549
* add more model into benchmark and evaluate workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1565
* add vl awq testcase and refactor pipeline testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1630
* bump version to v0.4.2 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1644
## New Contributors
* isidentical made their first contribution in https://github.com/InternLM/lmdeploy/pull/1575
* chg0901 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1568
* thelongestusernameofall made their first contribution in https://github.com/InternLM/lmdeploy/pull/1594
**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.4.1...v0.4.2