lmdeploy

Latest version: v0.4.1


0.4.1


What's Changed
🚀 Features
* Add colab demo by AllentDan in https://github.com/InternLM/lmdeploy/pull/1428
* support starcoder2 by grimoire in https://github.com/InternLM/lmdeploy/pull/1468
* support OpenGVLab/InternVL-Chat-V1-5 by irexyc in https://github.com/InternLM/lmdeploy/pull/1490
💥 Improvements
* variable `CTA_H` & fix qkv bias by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1491
* refactor vision model loading by irexyc in https://github.com/InternLM/lmdeploy/pull/1482
* fix installation requirements for windows by irexyc in https://github.com/InternLM/lmdeploy/pull/1531
* Remove split batch inside pipeline inference function by AllentDan in https://github.com/InternLM/lmdeploy/pull/1507
* Remove first empty chunk for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1527
* add benchmark script to profile pipeline APIs by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1528
* Add input validation by AllentDan in https://github.com/InternLM/lmdeploy/pull/1525
🐞 Bug fixes
* fix local variable 'response' referenced before assignment in async_engine.generate by irexyc in https://github.com/InternLM/lmdeploy/pull/1513
* Fix turbomind import in windows by irexyc in https://github.com/InternLM/lmdeploy/pull/1533
* Fix convert qwen2 to turbomind by AllentDan in https://github.com/InternLM/lmdeploy/pull/1546
* Adding api_key and model_name parameters to the restful benchmark by NiuBlibing in https://github.com/InternLM/lmdeploy/pull/1478
📚 Documentations
* update supported models for Baichuan by zhyncs in https://github.com/InternLM/lmdeploy/pull/1485
* Fix typo in w8a8.md by Infinity4B in https://github.com/InternLM/lmdeploy/pull/1523
* complete build.md by YanxingLiu in https://github.com/InternLM/lmdeploy/pull/1508
* update readme wechat qrcode by vansin in https://github.com/InternLM/lmdeploy/pull/1529
* Update docker docs for VL api by vody-am in https://github.com/InternLM/lmdeploy/pull/1534
* Format supported model table using html syntax by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1493
* doc: add example of deploying api server to Kubernetes by uzuku in https://github.com/InternLM/lmdeploy/pull/1488
🌐 Other
* add modelscope and lora testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1506
* bump version to v0.4.1 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1544

New Contributors
* NiuBlibing made their first contribution in https://github.com/InternLM/lmdeploy/pull/1478
* Infinity4B made their first contribution in https://github.com/InternLM/lmdeploy/pull/1523
* YanxingLiu made their first contribution in https://github.com/InternLM/lmdeploy/pull/1508
* vody-am made their first contribution in https://github.com/InternLM/lmdeploy/pull/1534
* uzuku made their first contribution in https://github.com/InternLM/lmdeploy/pull/1488

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.4.0...v0.4.1

0.4.0


Highlights

**Support for Llama3 and additional Vision-Language Models (VLMs):**
- We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, Mini-Gemini, and InternLM-XComposer2 (a minimal pipeline sketch follows).
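
As a quick illustration, newly supported text models go through the same pipeline API as before. A minimal sketch for Llama3 follows; the model id `meta-llama/Meta-Llama-3-8B-Instruct` is an assumed example, not something prescribed by these notes.

```python
# Minimal sketch: run a Llama3 chat model with the LMDeploy pipeline API.
# Assumes the weights are available from the Hugging Face Hub or a local path.
from lmdeploy import pipeline

pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct')
responses = pipe(['Hi, please introduce yourself', 'Shanghai is'])
print(responses)
```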

**Introduce online int4/int8 KV quantization and inference**
- Data-free online quantization
- Supports all NVIDIA GPUs with Volta architecture (sm70) and above
- KV int8 quantization is nearly lossless in accuracy, and KV int4 quantization accuracy stays within an acceptable range
- Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16 (see the sketch after this list)
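
A minimal sketch of enabling the quantized KV cache through the pipeline API is shown below. It assumes the TurboMind backend, where `quant_policy=8` selects int8 and `quant_policy=4` selects int4; the model id is only an example.

```python
# Minimal sketch: enable online int8 KV-cache quantization (quant_policy=4 selects int4).
# The model id is an example; other models supported by the TurboMind backend work the same way.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(quant_policy=8)  # 0 keeps the fp16 KV cache (default)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hi, please introduce yourself']))
```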

The following table shows the evaluation results of three LLMs at different KV numerical precisions:

| dataset | version | metric | llama2-7b-chat | | | internlm2-chat-7b | | | qwen1.5-7b-chat | | |
| ----------- | ------- | ------------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| | | | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with a quantized KV cache.

| model | kv type | test settings | RPS | v.s. kv fp16 |
| ----------------- | ------- | ---------------------------------------- | ----- | ------------ |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |


What's Changed
🚀 Features
* Support qwen1.5 in turbomind engine by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1406
* Online 8/4-bit KV-cache quantization by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1377
* Support qwen1.5-*-AWQ model inference in turbomind by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1430
* support Internvl chat v1.1, v1.2 and v1.2-plus by irexyc in https://github.com/InternLM/lmdeploy/pull/1425
* support Internvl chat llava by irexyc in https://github.com/InternLM/lmdeploy/pull/1426
* Add llama3 chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1461
* Support mini gemini llama by AllentDan in https://github.com/InternLM/lmdeploy/pull/1438
* add interactive api in service for VL models by AllentDan in https://github.com/InternLM/lmdeploy/pull/1444
* support output logprobs with turbomind backend. by irexyc in https://github.com/InternLM/lmdeploy/pull/1391
* support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by irexyc in https://github.com/InternLM/lmdeploy/pull/1458
* Add qwen1.5 awq quantization by AllentDan in https://github.com/InternLM/lmdeploy/pull/1470
💥 Improvements
* Reduce binary size, add `sm_89` and `sm_90` targets by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
* Use new event loop instead of the current loop for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1352
* Optimize inference of pytorch engine with tensor parallelism by grimoire in https://github.com/InternLM/lmdeploy/pull/1397
* add llava-v1.6-34b template by irexyc in https://github.com/InternLM/lmdeploy/pull/1408
* Initialize vl encoder first to avoid OOM by AllentDan in https://github.com/InternLM/lmdeploy/pull/1434
* Support model_name customization for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1403
* Expose dynamic split&fuse parameters by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1433
* warning transformers version by grimoire in https://github.com/InternLM/lmdeploy/pull/1453
* Optimize apply_rotary kernel and remove useless inference_mode by grimoire in https://github.com/InternLM/lmdeploy/pull/1457
* set infinity timeout to nccl by grimoire in https://github.com/InternLM/lmdeploy/pull/1465
* Feat: format internlm2 chat template by liujiangning30 in https://github.com/InternLM/lmdeploy/pull/1456
🐞 Bug fixes
* handle SIGTERM by grimoire in https://github.com/InternLM/lmdeploy/pull/1389
* fix chat cli `ArgumentError` error happened in python 3.11 by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
* Fix llama_triton_example by AllentDan in https://github.com/InternLM/lmdeploy/pull/1414
* Fix missing --trust-remote-code in converter, a side effect brought by PR 1406, by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1420
* fix sampling kernel by grimoire in https://github.com/InternLM/lmdeploy/pull/1417
* Fix loading single safetensor file error by AllentDan in https://github.com/InternLM/lmdeploy/pull/1427
* remove space in deepseek template by grimoire in https://github.com/InternLM/lmdeploy/pull/1441
* fix free repetition_penalty_workspace_ buffer by irexyc in https://github.com/InternLM/lmdeploy/pull/1467
* fix adapter failure when tp>1 by grimoire in https://github.com/InternLM/lmdeploy/pull/1476
* get model in advance to fix downloading from modelscope error by irexyc in https://github.com/InternLM/lmdeploy/pull/1473
* Fix the side effect in engine_instance brought by PR 1391 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1480
📚 Documentations
* Add model name corresponding to the test data in the doc by wykvictor in https://github.com/InternLM/lmdeploy/pull/1400
* fix typo in get_started guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1411
* Add async openai demo for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1409
* add the recommendation version for Python Backend by zhyncs in https://github.com/InternLM/lmdeploy/pull/1436
* Update kv quantization and inference guide by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1412
* update doc for llama3 by zhyncs in https://github.com/InternLM/lmdeploy/pull/1462
🌐 Other
* hack cmakelist.txt in pr_test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1405
* Add benchmark report generated in summary by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1419
* add restful completions v1 test case by ZhoujhZoe in https://github.com/InternLM/lmdeploy/pull/1416
* Add kvint4/8 ete testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1448
* improve rotary embedding of qwen in torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1451
* change cutlass url in ut by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1464
* bump version to v0.4.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1469

New Contributors
* wykvictor made their first contribution in https://github.com/InternLM/lmdeploy/pull/1400
* ZhoujhZoe made their first contribution in https://github.com/InternLM/lmdeploy/pull/1416
* liujiangning30 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1456

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0

0.3.0


Highlight
* Refactor attention and optimize GQA (PRs 1258, 1307, 1116), achieving 22+ and 16+ RPS for internlm2-7b and internlm2-20b respectively, about 1.8x faster than vLLM
* Support new models, including Qwen1.5-MoE (1372), DBRX (1367) and DeepSeek-VL (1335)


What's Changed
🚀 Features
* Add tensor core GQA dispatch for `[4,5,6,8]` by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1258
* upgrade turbomind to v2.1 by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1307, https://github.com/InternLM/lmdeploy/pull/1116
* Support slora to pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1286
* Support qwen for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1265
* Support Triton inference server python backend by ispobock in https://github.com/InternLM/lmdeploy/pull/1329
* torch engine support dbrx by grimoire in https://github.com/InternLM/lmdeploy/pull/1367
* Support qwen2 moe for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1372
* Add deepseek vl by AllentDan in https://github.com/InternLM/lmdeploy/pull/1335
💥 Improvements
* rm unused var by zhyncs in https://github.com/InternLM/lmdeploy/pull/1256
* Expose cache_block_seq_len to API by ispobock in https://github.com/InternLM/lmdeploy/pull/1218
* add chat template for deepseek coder model by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1310
* Add more log info for api_server by AllentDan in https://github.com/InternLM/lmdeploy/pull/1323
* remove cuda cache after loading vision model by irexyc in https://github.com/InternLM/lmdeploy/pull/1325
* Add new chat cli with auto backend feature by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1276
* Update rewritings for qwen by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1351
* lazy import accelerate.init_empty_weights for vl async engine by irexyc in https://github.com/InternLM/lmdeploy/pull/1359
* update lmdeploy pypi packages deps to cuda12 by irexyc in https://github.com/InternLM/lmdeploy/pull/1368
* update `max_prefill_token_num` for low gpu memory by grimoire in https://github.com/InternLM/lmdeploy/pull/1373
* Optimize pipeline of pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1328
🐞 Bug fixes
* fix different stop/bad words length in batch by irexyc in https://github.com/InternLM/lmdeploy/pull/1246
* Fix performance issue of chatbot by ispobock in https://github.com/InternLM/lmdeploy/pull/1295
* add missed argument by irexyc in https://github.com/InternLM/lmdeploy/pull/1317
* Fix dlpack memory leak by ispobock in https://github.com/InternLM/lmdeploy/pull/1344
* Fix invalid context for Internstudio platform by lzhangzz in https://github.com/InternLM/lmdeploy/pull/1354
* fix benchmark generation by grimoire in https://github.com/InternLM/lmdeploy/pull/1349
* fix window attention by grimoire in https://github.com/InternLM/lmdeploy/pull/1341
* fix batchApplyRepetitionPenalty by irexyc in https://github.com/InternLM/lmdeploy/pull/1358
* Fix memory leak of DLManagedTensor by ispobock in https://github.com/InternLM/lmdeploy/pull/1361
* fix vlm inference hung with tp by irexyc in https://github.com/InternLM/lmdeploy/pull/1336
* [Fix] fix the unit test of model name deduce by AllentDan in https://github.com/InternLM/lmdeploy/pull/1382
📚 Documentations
* add citation in readme by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1308
* Add slora example for pipeline by AllentDan in https://github.com/InternLM/lmdeploy/pull/1343
🌐 Other
* Add restful interface regression daily test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1302
* Add offline mode for testcase workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1318
* workflow bugfix and add llava-v1.5-13b testcase by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1339
* Add benchmark test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1364
* bump version to v0.3.0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1387


**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.6...v0.3.0

0.2.6

Highlight

Support vision-language model (VLM) inference pipeline and serving.
Currently, it supports the following models: [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat), the LLaVA series [v1.5](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e) and [v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2), and [Yi-VL](https://huggingface.co/01-ai/Yi-VL-6B).

- VLM Inference Pipeline

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Build an inference pipeline from a vision-language model
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

# Pair a text prompt with an image and run inference
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

Please refer to the detailed guide [here](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html).

- VLM serving with an OpenAI-compatible server

```shell
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
```

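Once the api_server above is running, it exposes an OpenAI-compatible endpoint. The following client sketch is an illustration only: it assumes the official `openai` Python package, the default localhost port used above, and that the served model name matches what `/v1/models` reports; the image URL reuses the one from the pipeline example.

```python
# Client sketch against the OpenAI-compatible endpoint started above.
# The api_key value is a placeholder; adjust the model name and port to your deployment.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='none')
response = client.chat.completions.create(
    model='liuhaotian/llava-v1.6-vicuna-7b',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)
```
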
- VLM serving with Gradio

```shell
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
```

What's Changed
🚀 Features
* Add inference pipeline for VL models by irexyc in https://github.com/InternLM/lmdeploy/pull/1214
* Support serving VLMs by AllentDan in https://github.com/InternLM/lmdeploy/pull/1285
* Serve VLM by gradio by irexyc in https://github.com/InternLM/lmdeploy/pull/1293
* Add pipeline.chat api for easy use by irexyc in https://github.com/InternLM/lmdeploy/pull/1292
💥 Improvements
* Hide qos functions from swagger UI if not applied by AllentDan in https://github.com/InternLM/lmdeploy/pull/1238
* Color log formatter by grimoire in https://github.com/InternLM/lmdeploy/pull/1247
* optimize filling kv cache kernel in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1251
* Refactor chat template and support accurate name matching. by AllentDan in https://github.com/InternLM/lmdeploy/pull/1216
* Support passing json file to chat template by AllentDan in https://github.com/InternLM/lmdeploy/pull/1200
* upgrade peft and check adapters by grimoire in https://github.com/InternLM/lmdeploy/pull/1284
* better cache allocation in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1272
* Fall back to base template if there is no chat_template in tokenizer_config.json by AllentDan in https://github.com/InternLM/lmdeploy/pull/1294
🐞 Bug fixes
* lazy load convert_pv jit function by grimoire in https://github.com/InternLM/lmdeploy/pull/1253
* [BUG] fix the case when num_used_blocks < 0 by jjjjohnson in https://github.com/InternLM/lmdeploy/pull/1277
* Check bf16 model in torch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/1270
* fix bf16 check by grimoire in https://github.com/InternLM/lmdeploy/pull/1281
* [Fix] fix triton server chatbot init error by AllentDan in https://github.com/InternLM/lmdeploy/pull/1278
* Fix concatenate issue in profile serving by ispobock in https://github.com/InternLM/lmdeploy/pull/1282
* fix torch tp lora adapter by grimoire in https://github.com/InternLM/lmdeploy/pull/1300
* Fix crash when api_server loads a turbomind model by irexyc in https://github.com/InternLM/lmdeploy/pull/1304
📚 Documentations
* fix config for readthedocs by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1245
* update badges in README by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1243
* Update serving guide including api_server and gradio by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1248
* rename restful_api.md to api_server.md by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1287
* Update readthedocs index by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1288
🌐 Other
* Parallelize testcase and refactor test workflow by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1254
* Accelerate sample request in benchmark script by ispobock in https://github.com/InternLM/lmdeploy/pull/1264
* Update eval ci cfg by RunningLeon in https://github.com/InternLM/lmdeploy/pull/1259
* Test case bugfix and add restful interface testcases. by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1271
* bump version to v0.2.6 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1299

New Contributors
* jjjjohnson made their first contribution in https://github.com/InternLM/lmdeploy/pull/1277

**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.2.5...v0.2.6

0.2.5

* Supports the deployment of LMDeploy v0.2.5 on the Jetson series platform.
* The community recruitment information has been updated.

0.2.4

Supports the deployment of LMDeploy v0.2.4 on the Jetson series platform.
