LMDeploy

Latest version: v0.4.0

0.4.0

Highlights

**Support for Llama3 and additional Vision-Language Models (VLMs):**
- We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL v1.1 and v1.2, MiniGemini, and InternLM-XComposer2 (a minimal pipeline sketch follows below).
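As a quick illustration of the new Llama3 support, the following is a minimal sketch using the high-level pipeline API; the Hugging Face model id is an assumption, and any supported Llama3 chat checkpoint should work the same way.

```python
# Minimal sketch: chatting with a Llama3 model through the pipeline API.
# The model id below is an assumption; substitute a local path or any
# supported Llama3 chat checkpoint.
from lmdeploy import pipeline

pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct')
print(pipe(['Who are you?']))
```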

**Introduced online int4/int8 KV-cache quantization and inference**
- Data-free online quantization (see the configuration sketch after this list)
- Supports all NVIDIA GPUs with the Volta architecture (sm70) and above
- KV int8 quantization is nearly lossless in accuracy, and KV int4 quantization accuracy stays within an acceptable range
- Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40%, respectively, compared to fp16
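
KV quantization is enabled through the engine configuration. Below is a minimal sketch, assuming the `quant_policy` field of `TurbomindEngineConfig` selects the KV-cache precision (0 for fp16, 4 for int4, 8 for int8); adjust the model path to your setup.

```python
# Minimal sketch: enabling online int8 KV-cache quantization.
# Assumption: quant_policy selects the KV-cache precision
# (0 = fp16, 4 = int4, 8 = int8).
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(quant_policy=8)  # int8 KV cache
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hi, please introduce yourself']))
```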

The following table shows the evaluation results of three LLMs under different KV-cache numerical precisions:

| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | --------------- | ------- | ------- |
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with a quantized KV cache.

| model | kv type | test settings | RPS | vs. kv fp16 |
| ----------------- | ------- | ---------------------------------------- | ----- | ------------ |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
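
For reference, the "test settings" shorthand above roughly maps onto the engine configuration as sketched below. This is an assumption-laden sketch: "ratio" is taken to correspond to `cache_max_entry_count` (the fraction of free GPU memory reserved for the KV cache), while "bs" and "prompts" are benchmark-side parameters rather than engine fields.

```python
# Hedged sketch of an engine config mirroring "tp1 / ratio 0.8" with int4 KV.
# "bs 256 / prompts 10000" are benchmark-script parameters, not engine fields.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=1,                       # tensor parallelism degree
    cache_max_entry_count=0.8,  # assumed mapping for the "ratio" column
    quant_policy=4,             # int4 KV cache; use 8 for int8, 0 for fp16
)
pipe = pipeline('meta-llama/Llama-2-7b-chat-hf', backend_config=engine_config)
```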


What's Changed
🚀 Features
* Support qwen1.5 in turbomind engine by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1406
* Online 8/4-bit KV-cache quantization by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1377
* Support qwen1.5-*-AWQ model inference in turbomind by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1430
* support Internvl chat v1.1, v1.2 and v1.2-plus by irexyc in https://github.com/InternLM/lmdeploy/pull/1425
* support Internvl chat llava by irexyc in https://github.com/InternLM/lmdeploy/pull/1426
* Add llama3 chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1461
* Support mini gemini llama by AllentDan in https://github.com/InternLM/lmdeploy/pull/1438
* add interactive api in service for VL models by AllentDan in https://github.com/InternLM/lmdeploy/pull/1444
* support output logprobs with turbomind backend. by irexyc in https://github.com/InternLM/lmdeploy/pull/1391
* support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1458
* Add qwen1.5 awq quantization by AllentDan in https://github.com/InternLM/lmdeploy/pull/1470
💥 Improvements
* Reduce binary size, add `sm_89` and `sm_90` targets by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
* Use new event loop instead of the current loop for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1352
* Optimize inference of pytorch engine with tensor parallelism by grimoire in https://github.com/InternLM/lmdeploy/pull/1397
* add llava-v1.6-34b template by irexyc in https://github.com/InternLM/lmdeploy/pull/1408
* Initialize vl encoder first to avoid OOM by AllentDan in https://github.com/InternLM/lmdeploy/pull/1434
* Support model_name customization for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1403
* Expose dynamic split&fuse parameters by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1433
* warning transformers version by grimoire in https://github.com/InternLM/lmdeploy/pull/1453
* Optimize apply_rotary kernel and remove useless inference_mode by grimoire in https://github.com/InternLM/lmdeploy/pull/1457
* set infinity timeout to nccl by grimoire in https://github.com/InternLM/lmdeploy/pull/1465
* Feat: format internlm2 chat template by liujiangning30 in https://github.com/InternLM/lmdeploy/pull/1456
🐞 Bug fixes
* handle SIGTERM by grimoire in https://github.com/InternLM/lmdeploy/pull/1389
* fix chat cli `ArgumentError` error happened in python 3.11 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
* Fix llama_triton_example by AllentDan in https://github.com/InternLM/lmdeploy/pull/1414
* miss --trust-remote-code in converter, which is side effect brought by pr 1406 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1420
* fix sampling kernel by grimoire in https://github.com/InternLM/lmdeploy/pull/1417
* Fix loading single safetensor file error by AllentDan in https://github.com/InternLM/lmdeploy/pull/1427
* remove space in deepseek template by grimoire in https://github.com/InternLM/lmdeploy/pull/1441
* fix free repetition_penalty_workspace_ buffer by irexyc in https://github.com/InternLM/lmdeploy/pull/1467
* fix adapter failure when tp>1 by grimoire in https://github.com/InternLM/lmdeploy/pull/1476
* get model in advance to fix downloading from modelscope error by irexyc in https://github.com/InternLM/lmdeploy/pull/1473
* Fix the side effect in engine_intance brought by 1391 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1480
📚 Documentations
* Add model name corresponding to the test data in the doc by wykvictor in https://github.com/InternLM/lmdeploy/pull/1400
* fix typo in get_started guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1411
* Add async openai demo for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1409
* add the recommendation version for Python Backend by zhyncs in https://github.com/InternLM/lmdeploy/pull/1436
* Update kv quantization and inference guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1412
* update doc for llama3 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1462
🌐 Other
* hack cmakelist.txt in pr_test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1405
* Add benchmark report generated in summary by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1419
* add restful completions v1 test case by ZhoujhZoe in https://github.com/InternLM/lmdeploy/pull/1416
* Add kvint4/8 ete testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1448
* impove rotary embedding of qwen in torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1451
* change cutlass url in ut by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1464
* bump version to v0.4.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1469

New Contributors
* wykvictor made their first contribution in https://github.com/InternLM/lmdeploy/pull/1400
* ZhoujhZoe made their first contribution in https://github.com/InternLM/lmdeploy/pull/1416
* liujiangning30 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1456

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0

0.3.0

Highlight
* Refactored attention and optimized GQA (#1258, #1307, #1116), achieving 22+ RPS for internlm2-7b and 16+ RPS for internlm2-20b, about 1.8x faster than vLLM
* Supported new models, including Qwen1.5-MoE (#1372), DBRX (#1367), and DeepSeek-VL (#1335)


What's Changed
🚀 Features
* Add tensor core GQA dispatch for `[4,5,6,8]` by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1258
* upgrade turbomind to v2.1 by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1307, https://github.com/InternLM/lmdeploy/pull/1116
* Support slora to pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1286
* Support qwen for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1265
* Support Triton inference server python backend by ispobock in https://github.com/InternLM/lmdeploy/pull/1329
* torch engine support dbrx by grimoire in https://github.com/InternLM/lmdeploy/pull/1367
* Support qwen2 moe for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1372
* Add deepseek vl by AllentDan in https://github.com/InternLM/lmdeploy/pull/1335
💥 Improvements
* rm unused var by zhyncs in https://github.com/InternLM/lmdeploy/pull/1256
* Expose cache_block_seq_len to API by ispobock in https://github.com/InternLM/lmdeploy/pull/1218
* add chat template for deepseek coder model by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1310
* Add more log info for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1323
* remove cuda cache after loading vison model by irexyc in https://github.com/InternLM/lmdeploy/pull/1325
* Add new chat cli with auto backend feature by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1276
* Update rewritings for qwen by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1351
* lazy import accelerate.init_empty_weights for vl async engine by irexyc in https://github.com/InternLM/lmdeploy/pull/1359
* update lmdeploy pypi packages deps to cuda12 by irexyc in https://github.com/InternLM/lmdeploy/pull/1368
* update `max_prefill_token_num` for low gpu memory by grimoire in https://github.com/InternLM/lmdeploy/pull/1373
* Optimize pipeline of pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1328
🐞 Bug fixes
* fix different stop/bad words length in batch by irexyc in https://github.com/InternLM/lmdeploy/pull/1246
* Fix performance issue of chatbot by ispobock in https://github.com/InternLM/lmdeploy/pull/1295
* add missed argument by irexyc in https://github.com/InternLM/lmdeploy/pull/1317
* Fix dlpack memory leak by ispobock in https://github.com/InternLM/lmdeploy/pull/1344
* Fix invalid context for Internstudio platform by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1354
* fix benchmark generation by grimoire in https://github.com/InternLM/lmdeploy/pull/1349
* fix window attention by grimoire in https://github.com/InternLM/lmdeploy/pull/1341
* fix batchApplyRepetitionPenalty by irexyc in https://github.com/InternLM/lmdeploy/pull/1358
* Fix memory leak of DLManagedTensor by ispobock in https://github.com/InternLM/lmdeploy/pull/1361
* fix vlm inference hung with tp by irexyc in https://github.com/InternLM/lmdeploy/pull/1336
* [Fix] fix the unit test of model name deduce by AllentDan in https://github.com/InternLM/lmdeploy/pull/1382
📚 Documentations
* add citation in readme by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1308
* Add slora example for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1343
🌐 Other
* Add restful interface regrssion daily test workflow. by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1302
* Add offline mode for testcase workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1318
* workflow bugfix and add llava-v1.5-13b testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1339
* Add benchmark test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1364
* bump version to v0.3.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1387


**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.6...v0.3.0

0.2.6

Highlight

Support vision-language model (VLM) inference pipeline and serving.
Currently, the following models are supported: [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat), the LLaVA series [v1.5](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e) and [v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2), and [Yi-VL](https://huggingface.co/01-ai/Yi-VL-6B).

- VLM Inference Pipeline
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

Please refer to the detailed guide [here](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html).

- VLM serving with an OpenAI-compatible server

```shell
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
```
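
Once the server is up, it can be queried with any OpenAI-compatible client. The snippet below is a minimal sketch, assuming the server runs locally on port 8000 and that the served model is registered under the name shown (the model list is available via `GET /v1/models`).

```python
# Minimal sketch: querying the OpenAI-compatible api_server started above.
# Assumptions: server at localhost:8000; model registered as
# 'llava-v1.6-vicuna-7b' (verify via GET /v1/models).
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='none')
response = client.chat.completions.create(
    model='llava-v1.6-vicuna-7b',
    messages=[{'role': 'user', 'content': 'Describe what LMDeploy does.'}],
)
print(response.choices[0].message.content)
```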


- VLM serving with Gradio

```shell
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
```


What's Changed
🚀 Features
* Add inference pipeline for VL models by irexyc in https://github.com/InternLM/lmdeploy/pull/1214
* Support serving VLMs by AllentDan in https://github.com/InternLM/lmdeploy/pull/1285
* Serve VLM by gradio by irexyc in https://github.com/InternLM/lmdeploy/pull/1293
* Add pipeline.chat api for easy use by irexyc in https://github.com/InternLM/lmdeploy/pull/1292
💥 Improvements
* Hide qos functions from swagger UI if not applied by AllentDan in https://github.com/InternLM/lmdeploy/pull/1238
* Color log formatter by grimoire in https://github.com/InternLM/lmdeploy/pull/1247
* optimize filling kv cache kernel in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1251
* Refactor chat template and support accurate name matching. by AllentDan in https://github.com/InternLM/lmdeploy/pull/1216
* Support passing json file to chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1200
* upgrade peft and check adapters by grimoire in https://github.com/InternLM/lmdeploy/pull/1284
* better cache allocation in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1272
* Fall back to base template if there is no chat_template in tokenizer_config.json by AllentDan in https://github.com/InternLM/lmdeploy/pull/1294
🐞 Bug fixes
* lazy load convert_pv jit function by grimoire in https://github.com/InternLM/lmdeploy/pull/1253
* [BUG] fix the case when num_used_blocks < 0 by jjjjohnson in https://github.com/InternLM/lmdeploy/pull/1277
* Check bf16 model in torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1270
* fix bf16 check by grimoire in https://github.com/InternLM/lmdeploy/pull/1281
* [Fix] fix triton server chatbot init error by AllentDan in https://github.com/InternLM/lmdeploy/pull/1278
* Fix concatenate issue in profile serving by ispobock in https://github.com/InternLM/lmdeploy/pull/1282
* fix torch tp lora adapter by grimoire in https://github.com/InternLM/lmdeploy/pull/1300
* Fix crash when api_server loads a turbomind model by irexyc in https://github.com/InternLM/lmdeploy/pull/1304
📚 Documentations
* fix config for readthedocs by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1245
* update badges in README by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1243
* Update serving guide including api_server and gradio by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1248
* rename restful_api.md to api_server.md by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1287
* Update readthedocs index by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1288
🌐 Other
* Parallelize testcase and refactor test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1254
* Accelerate sample request in benchmark script by ispobock in https://github.com/InternLM/lmdeploy/pull/1264
* Update eval ci cfg by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1259
* Test case bugfix and add restful interface testcases. by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1271
* bump version to v0.2.6 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1299

New Contributors
* jjjjohnson made their first contribution in https://github.com/InternLM/lmdeploy/pull/1277

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.5...v0.2.6

0.2.5

What's Changed
🚀 Features
* Support mistral and sliding window attention by grimoire in https://github.com/InternLM/lmdeploy/pull/1075
* torch engine support chatglm3 by grimoire in https://github.com/InternLM/lmdeploy/pull/1159
* Support qwen1.5 in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1160
* Support mixtral for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1133
* Support torch deepseek moe by grimoire in https://github.com/InternLM/lmdeploy/pull/1163
* Support gemma model in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1184
* Auto backend for pipeline and serve when backend is not set to pytorch explicitly by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1211
💥 Improvements
* Fix argument error by ispobock in https://github.com/InternLM/lmdeploy/pull/1193
* Use LifoQueue for turbomind async_stream_infer by AllentDan in https://github.com/InternLM/lmdeploy/pull/1179
* Update interactive output len strategy and response by AllentDan in https://github.com/InternLM/lmdeploy/pull/1164
* Support `min_new_tokens` generation config in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1096
* Batched sampling by grimoire in https://github.com/InternLM/lmdeploy/pull/1197
* refactor the logic of getting `model_name` by AllentDan in https://github.com/InternLM/lmdeploy/pull/1188
* Add parameter `max_prefill_token_num` by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1203
* optmize baichuan in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1223
* check model required transformers version by grimoire in https://github.com/InternLM/lmdeploy/pull/1220
* torch optmize chatglm3 by grimoire in https://github.com/InternLM/lmdeploy/pull/1215
* Async torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1206
* remove unused kernel in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1237
🐞 Bug fixes
* Fix session length for profile generation by ispobock in https://github.com/InternLM/lmdeploy/pull/1181
* fix torch engine infer by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1185
* fix module map by grimoire in https://github.com/InternLM/lmdeploy/pull/1205
* [Fix] Correct session length warning by AllentDan in https://github.com/InternLM/lmdeploy/pull/1207
* Fix all devices occupation when applying tp to torch engine by updating device map by grimoire in https://github.com/InternLM/lmdeploy/pull/1172
* Fix falcon chatglm2 template by grimoire in https://github.com/InternLM/lmdeploy/pull/1168
* [Fix] Avoid AsyncEngine running the same session id by AllentDan in https://github.com/InternLM/lmdeploy/pull/1219
* Fix `None` session_len by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1230
* fix multinomial sampling by grimoire in https://github.com/InternLM/lmdeploy/pull/1228
* fix returning logits in prefill phase of pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1209
* optimize pytorch engine inference with falcon model by grimoire in https://github.com/InternLM/lmdeploy/pull/1234
* fix bf16 multinomial sampling by grimoire in https://github.com/InternLM/lmdeploy/pull/1239
* reduce torchengine prefill mem usage by grimoire in https://github.com/InternLM/lmdeploy/pull/1240
📚 Documentations
* auto generate pipeline api for readthedocs by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1186
* Added tutorial document for deploying lmdeploy on Jetson series boards. by BestAnHongjun in https://github.com/InternLM/lmdeploy/pull/1192
* update doc index by zhyncs in https://github.com/InternLM/lmdeploy/pull/1241
🌐 Other
* Add PR test workflow and check-in more testcases by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1208
* fix pytest version by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1236
* bump version to v0.2.5 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1235

New Contributors
* ispobock made their first contribution in https://github.com/InternLM/lmdeploy/pull/1181
* BestAnHongjun made their first contribution in https://github.com/InternLM/lmdeploy/pull/1192

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.4...v0.2.5

0.2.4

What's Changed
💥 Improvements
* use stricter rules to get weight file by irexyc in https://github.com/InternLM/lmdeploy/pull/1070
* check pytorch engine environment by grimoire in https://github.com/InternLM/lmdeploy/pull/1107
* Update Dockerfile order to launch the http service by `docker run` directly by AllentDan in https://github.com/InternLM/lmdeploy/pull/1162
* Support torch cache_max_entry_count by grimoire in https://github.com/InternLM/lmdeploy/pull/1166
* Remove the manual model conversion during benchmark by lvhan028 in https://github.com/InternLM/lmdeploy/pull/953
* update llama triton example by zhyncs in https://github.com/InternLM/lmdeploy/pull/1153
🐞 Bug fixes
* fix embedding copy size by irexyc in https://github.com/InternLM/lmdeploy/pull/1036
* fix pytorch engine with peft==0.8.2 by grimoire in https://github.com/InternLM/lmdeploy/pull/1122
* support triton2.2 by grimoire in https://github.com/InternLM/lmdeploy/pull/1137
* Add `top_k` in ChatCompletionRequest by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1174
* minor fix benchmark generation guide and script by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1175
📚 Documentations
* docs add debug turbomind guide by zhyncs in https://github.com/InternLM/lmdeploy/pull/1121
🌐 Other
* Add eval ci by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1060
* Ete testcase add more models by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1077
* Fix win ci by irexyc in https://github.com/InternLM/lmdeploy/pull/1132
* bump version to v0.2.4 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1171


**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.3...v0.2.4

0.2.3

What's Changed
🚀 Features
* Support loading model from modelscope by irexyc in https://github.com/InternLM/lmdeploy/pull/1069
💥 Improvements
* Remove caching tokenizer.json by grimoire in https://github.com/InternLM/lmdeploy/pull/1074
* Refactor `get_logger` to remove the dependency of MMLogger from mmengine by yinfan98 in https://github.com/InternLM/lmdeploy/pull/1064
* Use TM_LOG_LEVEL environment variable first by zhyncs in https://github.com/InternLM/lmdeploy/pull/1071
* Speed up the initialization of w8a8 model for torch engine by yinfan98 in https://github.com/InternLM/lmdeploy/pull/1088
* Make logging.logger's behavior consistent with MMLogger by irexyc in https://github.com/InternLM/lmdeploy/pull/1092
* Remove owned_session for torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1097
* Unify engine initialization in pipeline by irexyc in https://github.com/InternLM/lmdeploy/pull/1085
* Add skip_special_tokens in GenerationConfig by grimoire in https://github.com/InternLM/lmdeploy/pull/1091
* Use default stop words for turbomind backend in pipeline by irexyc in https://github.com/InternLM/lmdeploy/pull/1119
* Add input_token_len to Response and update Response document by AllentDan in https://github.com/InternLM/lmdeploy/pull/1115
🐞 Bug fixes
* Fix fast tokenizer swallows prefix space when there are too many white spaces by AllentDan in https://github.com/InternLM/lmdeploy/pull/992
* Fix turbomind CUDA runtime error invalid argument by zhyncs in https://github.com/InternLM/lmdeploy/pull/1100
* Add safety check for incremental decode by AllentDan in https://github.com/InternLM/lmdeploy/pull/1094
* Fix device type of get_ppl for turbomind by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1093
* Fix pipeline init turbomind from workspace by irexyc in https://github.com/InternLM/lmdeploy/pull/1126
* Add dependency version check and fix `ignore_eos` logic by grimoire in https://github.com/InternLM/lmdeploy/pull/1099
* Change configuration_internlm.py to configuration_internlm2.py by HIT-cwh in https://github.com/InternLM/lmdeploy/pull/1129

📚 Documentations
* Update contribution guide by zhyncs in https://github.com/InternLM/lmdeploy/pull/1120
🌐 Other
* Bump version to v0.2.3 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1123

New Contributors
* yinfan98 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1064

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.2...v0.2.3
