<!-- Release notes generated using configuration in .github/release.yml at main -->
## Highlights
- Optimize W4A16 quantized model inference by implementing new GEMM kernels in the TurboMind engine
- Add GPTQ-INT4 inference (see the sketch after the example below)
- Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
- Optimize the prefill stage of PyTorchEngine inference
- Distinguish between the name of the deployed model and the name of the model's chat template
  Before:
  ```shell
  lmdeploy serve api_server /the/path/of/your/awesome/model \
      --model-name customized_chat_template.json
  ```
  After:
  ```shell
  lmdeploy serve api_server /the/path/of/your/awesome/model \
      --model-name "the served model name" \
      --chat-template customized_chat_template.json
  ```
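
As referenced in the GPTQ-INT4 highlight above, the new quantized path can be exercised through the regular pipeline API. A minimal sketch, assuming a GPTQ-INT4 checkpoint at a placeholder path; `model_format="gptq"` is assumed to mirror the existing `awq` option, so check the docs for the authoritative flag value:

```python
# A minimal sketch, not from the release notes: running a GPTQ-INT4
# checkpoint on the TurboMind engine. The model path is a placeholder
# and model_format="gptq" is assumed to mirror the existing "awq" value.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "/the/path/of/your/gptq-int4-model",  # hypothetical local path
    backend_config=TurbomindEngineConfig(model_format="gptq"),
)
print(pipe(["Hello, how are you?"]))
```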
## What's Changed
### 🚀 Features
* support vlm custom image process parameters in openai input format by irexyc in https://github.com/InternLM/lmdeploy/pull/2245
* New GEMM kernels for weight-only quantization by lzhangzz in https://github.com/InternLM/lmdeploy/pull/2090
* Fix hidden size and support mistral nemo by AllentDan in https://github.com/InternLM/lmdeploy/pull/2215
* Support custom logits processors by AllentDan in https://github.com/InternLM/lmdeploy/pull/2329 (a hedged sketch follows this list)
* support openbmb/MiniCPM-V-2_6 by irexyc in https://github.com/InternLM/lmdeploy/pull/2351
* Support phi3.5 for pytorch engine by RunningLeon in https://github.com/InternLM/lmdeploy/pull/2361
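
The custom logits processor feature (PR #2329) hooks user callables into sampling. A hedged sketch, assuming a transformers-style `(input_ids, scores) -> scores` signature and a `logits_processors` field on `GenerationConfig`; both are assumptions to verify against the lmdeploy docs:

```python
# Hedged sketch of a custom logits processor (PR #2329). The callable
# signature and the GenerationConfig field name are assumptions, not
# the authoritative API.
import torch
from lmdeploy import GenerationConfig, pipeline

def block_token(token_id: int):
    """Return a processor that forbids sampling `token_id`."""
    def processor(input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        scores[..., token_id] = float("-inf")  # mask the token's logit
        return scores
    return processor

pipe = pipeline("/the/path/of/your/awesome/model")  # placeholder path
gen_cfg = GenerationConfig(logits_processors=[block_token(0)])
print(pipe(["Hello"], gen_config=gen_cfg))
```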
### 💥 Improvements
* Remove deprecated arguments from API and clarify model_name and chat_template_name by lvhan028 in https://github.com/InternLM/lmdeploy/pull/1931
* Fix duplicated session_id when pipeline is used by multithreads by irexyc in https://github.com/InternLM/lmdeploy/pull/2134
* remove eviction param by grimoire in https://github.com/InternLM/lmdeploy/pull/2285
* Remove QoS serving by AllentDan in https://github.com/InternLM/lmdeploy/pull/2294
* Support send tool_calls back to internlm2 by AllentDan in https://github.com/InternLM/lmdeploy/pull/2147
* Add stream options to control usage by AllentDan in https://github.com/InternLM/lmdeploy/pull/2313 (exercised in the sketch after this list)
* add device type for pytorch engine in cli by RunningLeon in https://github.com/InternLM/lmdeploy/pull/2321
* Update error status_code to raise error in openai client by AllentDan in https://github.com/InternLM/lmdeploy/pull/2333
* Change to use device instead of device-type in cli by RunningLeon in https://github.com/InternLM/lmdeploy/pull/2337
* Add GEMM test utils by lzhangzz in https://github.com/InternLM/lmdeploy/pull/2342
* Add environment variable to control SILU fusion by lzhangzz in https://github.com/InternLM/lmdeploy/pull/2343
* Use single thread per model instance by lzhangzz in https://github.com/InternLM/lmdeploy/pull/2339
* add cache to speed up docker building by RunningLeon in https://github.com/InternLM/lmdeploy/pull/2344
* add max_prefill_token_num argument in CLI by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2345
* torch engine optimize prefill for long context by grimoire in https://github.com/InternLM/lmdeploy/pull/1962
* Refactor turbomind (1/N) by lzhangzz in https://github.com/InternLM/lmdeploy/pull/2352
* feat(server): enable `seed` parameter for openai compatible server. by DearPlanet in https://github.com/InternLM/lmdeploy/pull/2353
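
Two of the serving improvements above are visible through any OpenAI-compatible client: `stream_options` for usage reporting (PR #2313) and the new `seed` parameter (PR #2353). A hedged sketch using the `openai` Python package; the port, API key, and served model name are placeholders:

```python
# Hedged example: exercising `seed` (PR #2353) and `stream_options`
# (PR #2313) through the OpenAI-compatible server. Endpoint, key, and
# model name below are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
stream = client.chat.completions.create(
    model="the served model name",
    messages=[{"role": "user", "content": "Hello"}],
    seed=42,                                 # reproducible sampling
    stream=True,
    stream_options={"include_usage": True},  # usage stats in final chunk
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:                          # only set on the last chunk
        print("\n", chunk.usage)
```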
### 🐞 Bug fixes
* enable run vlm with pytorch engine in gradio by RunningLeon in https://github.com/InternLM/lmdeploy/pull/2256
* fix side-effect: failed to update tm model config with tm engine config by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2275
* Fix internvl2 template and update docs by irexyc in https://github.com/InternLM/lmdeploy/pull/2292
* fix the issue missing dependencies in the Dockerfile and pip by ColorfulDick in https://github.com/InternLM/lmdeploy/pull/2240
* Fix the way to get "quantization_config" from model's configuration by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2325
* fix(ascend): fix import error of pt engine in cli by CyCle1024 in https://github.com/InternLM/lmdeploy/pull/2328
* Default rope_scaling_factor of TurbomindEngineConfig to None by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2358
* Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2362
### 📚 Documentation
* Reorganize the user guide and update the get_started section by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2038
* cancel support baichuan2 7b awq in pytorch engine by grimoire in https://github.com/InternLM/lmdeploy/pull/2246
* Add user guide about slora serving by AllentDan in https://github.com/InternLM/lmdeploy/pull/2084
### 🌐 Other
* test prtest image update by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/2192
* Update python support version by wuhongsheng in https://github.com/InternLM/lmdeploy/pull/2290
* fix Windows compile error by zhyncs in https://github.com/InternLM/lmdeploy/pull/2303
* fix: follow up 2303 by zhyncs in https://github.com/InternLM/lmdeploy/pull/2307
* [ci] benchmark react by zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/2183
* bump version to v0.6.0a0 by lvhan028 in https://github.com/InternLM/lmdeploy/pull/2371
## New Contributors
* wuhongsheng made their first contribution in https://github.com/InternLM/lmdeploy/pull/2290
* ColorfulDick made their first contribution in https://github.com/InternLM/lmdeploy/pull/2240
* DearPlanet made their first contribution in https://github.com/InternLM/lmdeploy/pull/2353
**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.5.3...v0.6.0a0