<!-- Release notes generated using configuration in .github/release.yml at main -->
## Highlights

* The TurboMind engine changed its GPU memory allocation strategy for the k/v cache: the parameter `cache_max_entry_count` now denotes the proportion of GPU **FREE** memory rather than **TOTAL** memory, and its default value is 0.8. This helps prevent OOM issues. A configuration sketch follows this list.
* The pipeline API supports streaming inference. You may give it a try!
  ```python
  from lmdeploy import pipeline

  pipe = pipeline('internlm/internlm2-chat-7b')
  for item in pipe.stream_infer('hi, please intro yourself'):
      print(item)
  ```
* Added API key and SSL support to `api_server`; a client-side sketch follows below.
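
If the default ratio does not suit your deployment, it can be tuned through the engine's backend configuration. Below is a minimal sketch, assuming the `TurbomindEngineConfig` backend config described in the pipeline guide; the 0.5 value is purely illustrative:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is the fraction of *free* GPU memory (measured after
# the model weights are loaded) reserved for the k/v cache. Lower it if you
# still hit OOM; raise it to serve more concurrent sessions.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('hi, please intro yourself'))
```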
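
When an API key is enabled on the server, OpenAI-compatible clients must present it as a bearer token. The following is a minimal client-side sketch against the `/v1/chat/completions` endpoint; the port 23333, the served model name, and `YOUR_API_KEY` are placeholder assumptions, not values fixed by this release:

```python
import requests

# Placeholder key: it must match a key the server was launched with.
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
payload = {
    'model': 'internlm2-chat-7b',  # hypothetical served model name
    'messages': [{'role': 'user', 'content': 'hi, please intro yourself'}],
}
resp = requests.post('http://localhost:23333/v1/chat/completions',
                     headers=headers, json=payload)
print(resp.json())
```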
## What's Changed
### 🚀 Features
* add alignment tools by grimoire in https://github.com/InternLM/lmdeploy/pull/1004
* support min_length for turbomind backend by irexyc in https://github.com/InternLM/lmdeploy/pull/961
* Add stream mode function to pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/974
* [Feature] Add api key and ssl to http server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1048
๐ฅ Improvements
* hide stop-words in output text by grimoire in https://github.com/InternLM/lmdeploy/pull/991
* optimize sleep by grimoire in https://github.com/InternLM/lmdeploy/pull/1034
* set example values to /v1/chat/completions in swagger UI by AllentDan in https://github.com/InternLM/lmdeploy/pull/984
* Update adapters cli argument by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1039
* Fix turbomind end session bug. Add huggingface demo document by AllentDan in https://github.com/InternLM/lmdeploy/pull/1017
* Support linking the custom built mpi by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1025
* sync mem size for tp by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1053
* Remove model name when loading hf model by irexyc in https://github.com/InternLM/lmdeploy/pull/1022
* support internlm2-1_8b by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1073
* Update chat template for internlm2 base model by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1079
### 🐞 Bug fixes
* fix TorchEngine stuck when benchmarking with `tp>1` by grimoire in https://github.com/InternLM/lmdeploy/pull/942
* fix module mapping error of baichuan model by grimoire in https://github.com/InternLM/lmdeploy/pull/977
* fix import error for triton server by RunningLeon in https://github.com/InternLM/lmdeploy/pull/985
* fix qwen-vl example by irexyc in https://github.com/InternLM/lmdeploy/pull/996
* fix missing init file in modules by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1013
* fix tp mem usage by grimoire in https://github.com/InternLM/lmdeploy/pull/987
* update indexes_containing_token function by AllentDan in https://github.com/InternLM/lmdeploy/pull/1050
* fix flash kernel on sm 70 by grimoire in https://github.com/InternLM/lmdeploy/pull/1027
* Fix baichuan2 lora by grimoire in https://github.com/InternLM/lmdeploy/pull/1042
* Fix modelconfig in pytorch engine, support YI. by grimoire in https://github.com/InternLM/lmdeploy/pull/1052
* Fix repetition penalty for long context by irexyc in https://github.com/InternLM/lmdeploy/pull/1037
* [Fix] Support QLinear in rowwise_parallelize_linear_fn and colwise_parallelize_linear_fn by HIT-cwh in https://github.com/InternLM/lmdeploy/pull/1072
### 📚 Documentations
* add docs for evaluation with opencompass by RunningLeon in https://github.com/InternLM/lmdeploy/pull/995
* update docs for kvint8 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1026
* [doc] Introduce project OpenAOE by JiaYingLii in https://github.com/InternLM/lmdeploy/pull/1049
* update pipeline guide and FAQ about OOM by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1051
* docs update cache_max_entry_count for turbomind config by zhyncs in https://github.com/InternLM/lmdeploy/pull/1067
### 🌐 Other
* update ut ci to new server node by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1024
* Ete testcase update by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1023
* fix OOM in BlockManager by zhyncs in https://github.com/InternLM/lmdeploy/pull/973
* fix use engine_config.tp when tp is None by zhyncs in https://github.com/InternLM/lmdeploy/pull/1057
* Fix serve api by moving logger inside process for turbomind by AllentDan in https://github.com/InternLM/lmdeploy/pull/1061
* bump version to v0.2.2 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1076
## New Contributors
* zhyncs made their first contribution in https://github.com/InternLM/lmdeploy/pull/973
* JiaYingLii made their first contribution in https://github.com/InternLM/lmdeploy/pull/1049
**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.1...v0.2.2