<!-- Release notes generated using configuration in .github/release.yml at main -->
## Highlights
**Support for Llama3 and additional Vision-Language Models (VLMs):**
- We now support Llama3 and an extended range of VLMs, including InternVL-Chat v1.1 and v1.2, Mini-Gemini, and InternLM-XComposer2.
**Introduce online int4/int8 KV quantization and inference**
- Data-free online quantization: no calibration dataset is required
- Supports all NVIDIA GPUs of Volta architecture (sm70) and above
- KV int8 quantization is nearly lossless in accuracy, and KV int4 quantization stays within an acceptable accuracy range
- Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively over fp16 (see the usage sketch below)
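As a minimal usage sketch (assuming the v0.4.0 `TurbomindEngineConfig.quant_policy` field, where 8 selects int8 and 4 selects int4 KV; the model id is a placeholder), online KV quantization can be enabled through the `pipeline` API:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy selects the KV cache precision:
#   0 -> fp16 (default), 8 -> online int8 KV, 4 -> online int4 KV.
# Quantization is data-free, so no calibration dataset is needed.
engine_config = TurbomindEngineConfig(quant_policy=8)

pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hi, please introduce yourself']))
```

Switching `quant_policy` to 4 gives the int4 variant evaluated below.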
The following table shows evaluation results for three LLMs under different KV numerical precisions:
|             |         |               | llama2-7b-chat |         |         | internlm2-chat-7b |         |         | qwen1.5-7b-chat |         |         |
| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | --------------- | ------- | ------- |
| dataset     | version | metric        | kv fp16        | kv int8 | kv int4 | kv fp16           | kv int8 | kv int4 | kv fp16         | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
The table below presents LMDeploy's inference performance with quantized KV, measured in requests per second (RPS); a toy throughput probe follows the table.
| model | kv type | test settings | RPS | vs. kv fp16 |
| ----------------- | ------- | ---------------------------------------- | ----- | ------------ |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
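The RPS figures above come from LMDeploy's benchmark scripts; the `ratio` column plausibly corresponds to the `cache_max_entry_count` knob. As a rough stand-in (the model id and workload are placeholders, not the settings in the table), a toy throughput probe over the same API could look like this:

```python
import time

from lmdeploy import pipeline, TurbomindEngineConfig

# Toy probe only: not the benchmark that produced the table above.
pipe = pipeline(
    'meta-llama/Llama-2-7b-chat-hf',  # placeholder model id
    backend_config=TurbomindEngineConfig(
        tp=1,                        # tensor parallelism degree
        cache_max_entry_count=0.8,   # fraction of free GPU memory for the KV cache
        quant_policy=4,              # online int4 KV quantization
    ))

prompts = ['Summarize the theory of relativity.'] * 64  # placeholder workload
start = time.perf_counter()
pipe(prompts)
elapsed = time.perf_counter() - start
print(f'{len(prompts) / elapsed:.2f} requests/s')
```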
## What's Changed
### 🚀 Features
* Support qwen1.5 in turbomind engine by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1406
* Online 8/4-bit KV-cache quantization by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1377
* Support qwen1.5-*-AWQ model inference in turbomind by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1430
* Support InternVL-Chat v1.1, v1.2 and v1.2-plus (usage sketch after this list) by irexyc in https://github.com/InternLM/lmdeploy/pull/1425
* Support InternVL-Chat LLaVA models by irexyc in https://github.com/InternLM/lmdeploy/pull/1426
* Add llama3 chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1461
* Support Mini-Gemini (LLaMA) by AllentDan in https://github.com/InternLM/lmdeploy/pull/1438
* Add interactive API in service for VL models by AllentDan in https://github.com/InternLM/lmdeploy/pull/1444
* Support output logprobs with the turbomind backend by irexyc in https://github.com/InternLM/lmdeploy/pull/1391
* Support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1458
* Add qwen1.5 awq quantization by AllentDan in https://github.com/InternLM/lmdeploy/pull/1470
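For the newly added VLMs, the sketch below shows how they are typically driven through the same `pipeline` API; the repo id, the `lmdeploy.vl.load_image` helper, and the image URL follow LMDeploy's VLM examples but should be treated as assumptions here:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Repo id is an assumption; any of the newly supported VLMs should slot in.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2')

image = load_image(
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# A (prompt, image) tuple forms one multimodal request.
print(pipe(('describe this image', image)))
```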
### 💥 Improvements
* Reduce binary size, add `sm_89` and `sm_90` targets by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
* Use new event loop instead of the current loop for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1352
* Optimize inference of pytorch engine with tensor parallelism by grimoire in https://github.com/InternLM/lmdeploy/pull/1397
* Add llava-v1.6-34b template by irexyc in https://github.com/InternLM/lmdeploy/pull/1408
* Initialize vl encoder first to avoid OOM by AllentDan in https://github.com/InternLM/lmdeploy/pull/1434
* Support model_name customization for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1403
* Expose dynamic split&fuse parameters by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1433
* Warn about the installed transformers version by grimoire in https://github.com/InternLM/lmdeploy/pull/1453
* Optimize the apply_rotary kernel and remove unnecessary inference_mode by grimoire in https://github.com/InternLM/lmdeploy/pull/1457
* Set an infinite timeout for NCCL by grimoire in https://github.com/InternLM/lmdeploy/pull/1465
* Format the internlm2 chat template by liujiangning30 in https://github.com/InternLM/lmdeploy/pull/1456
### 🐞 Bug fixes
* Handle SIGTERM by grimoire in https://github.com/InternLM/lmdeploy/pull/1389
* Fix chat CLI `ArgumentError` raised in Python 3.11 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
* Fix llama_triton_example by AllentDan in https://github.com/InternLM/lmdeploy/pull/1414
* Add missing `--trust-remote-code` in the converter, a side effect of PR 1406, by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1420
* Fix sampling kernel by grimoire in https://github.com/InternLM/lmdeploy/pull/1417
* Fix error when loading a single safetensor file by AllentDan in https://github.com/InternLM/lmdeploy/pull/1427
* Remove space in the deepseek template by grimoire in https://github.com/InternLM/lmdeploy/pull/1441
* Fix freeing of the repetition_penalty_workspace_ buffer by irexyc in https://github.com/InternLM/lmdeploy/pull/1467
* Fix adapter failure when tp>1 by grimoire in https://github.com/InternLM/lmdeploy/pull/1476
* Get the model in advance to fix a ModelScope download error by irexyc in https://github.com/InternLM/lmdeploy/pull/1473
* Fix the side effect in engine_instance brought by PR 1391 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1480
### 📚 Documentation
* Add model name corresponding to the test data in the doc by wykvictor in https://github.com/InternLM/lmdeploy/pull/1400
* Fix typo in the get_started guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1411
* Add async OpenAI demo for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1409
* Add the recommended version for the Python backend by zhyncs in https://github.com/InternLM/lmdeploy/pull/1436
* Update the KV quantization and inference guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1412
* Update doc for Llama3 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1462
### 🌐 Other
* Hack CMakeLists.txt in the pr_test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1405
* Add benchmark report generated in summary by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1419
* Add RESTful completions v1 test case by ZhoujhZoe in https://github.com/InternLM/lmdeploy/pull/1416
* Add kvint4/8 end-to-end test case by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1448
* Improve rotary embedding of qwen in the torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1451
* Change the cutlass URL in unit tests by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1464
* Bump version to v0.4.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1469
## New Contributors
* wykvictor made their first contribution in https://github.com/InternLM/lmdeploy/pull/1400
* ZhoujhZoe made their first contribution in https://github.com/InternLM/lmdeploy/pull/1416
* liujiangning30 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1456
**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0