Highlights
Check out the release blog post at https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mix them).
- Multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.
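
For readers unfamiliar with chunked prefill, the general idea can be sketched as follows. This is only a conceptual illustration in plain Python, not SGLang's implementation: the function name `split_prefill_chunks` and the `chunk_size` parameter are hypothetical and do not correspond to any SGLang API or flag. A long prompt is split into bounded pieces so that each scheduling step processes at most a fixed number of prompt tokens.

```python
from typing import List


def split_prefill_chunks(prompt_tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt into consecutive chunks of at most chunk_size tokens.

    Hypothetical helper for illustration only; not part of SGLang.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        prompt_tokens[i:i + chunk_size]
        for i in range(0, len(prompt_tokens), chunk_size)
    ]


# A 10-token prompt processed in chunks of at most 4 tokens.
prompt = list(range(10))
chunks = split_prefill_chunks(prompt, chunk_size=4)
print([len(c) for c in chunks])
```

In a "mixed" style, a scheduler would presumably batch one such chunk together with pending decode steps, whereas a "separate" style would run them in different batches. See the blog post linked above for how SGLang actually does this.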
What's Changed
* update hyperparameter guide by merrymercy in https://github.com/sgl-project/sglang/pull/1114
* ci: compatible with fork repo by zhyncs in https://github.com/sgl-project/sglang/pull/1115
* fix: resolve Python.h header missing by zhyncs in https://github.com/sgl-project/sglang/pull/1119
* Fix the deadlock in multi-node tp by merrymercy in https://github.com/sgl-project/sglang/pull/1122
* Mixed style of chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/1013
* Fix port conflicts between local CI and runner CI. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1131
* Fix CI accuracy && time out limit by hnyls2002 in https://github.com/sgl-project/sglang/pull/1133
* fix: use fp16 dtype for sm75 by zhyncs in https://github.com/sgl-project/sglang/pull/1136
* Improve the code style: more comments and remove useless packages by merrymercy in https://github.com/sgl-project/sglang/pull/1139
* Improve benchmark by merrymercy in https://github.com/sgl-project/sglang/pull/1140
* Fix duplicated imports in hf_transformers_utils.py by merrymercy in https://github.com/sgl-project/sglang/pull/1141
* fixed a typo by min-xu-et in https://github.com/sgl-project/sglang/pull/1143
* [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by Michaelvll in https://github.com/sgl-project/sglang/pull/1144
* [Feat]Add support for optional start len of logprobs by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1035
* Optimize MLA/GQA/MQA Triton decoding by ispobock in https://github.com/sgl-project/sglang/pull/1138
* feat: allow streaming for multi-prompt and/or parallel sampling by vhain in https://github.com/sgl-project/sglang/pull/1134
* Improve docs and warnings by merrymercy in https://github.com/sgl-project/sglang/pull/1164
* [Feature] add disable-custom-all-reduce by Xu-Chen in https://github.com/sgl-project/sglang/pull/1148
* misc: add hypervisor vendor by zhyncs in https://github.com/sgl-project/sglang/pull/1165
* support /v1/health using a generation 1 token by LucienShui in https://github.com/sgl-project/sglang/pull/1154
* fix: resolve README render by zhyncs in https://github.com/sgl-project/sglang/pull/1166
* [Feat] Support update weights without restart server by shanyu-sys in https://github.com/sgl-project/sglang/pull/1157
* Improve multi-node stability by merrymercy in https://github.com/sgl-project/sglang/pull/1171
* fix: custom op fallback forward native when lower sm80 by zhyncs in https://github.com/sgl-project/sglang/pull/1177
* [Feature] Add a function to convert sampling_params to kwargs by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1170
* Support min-p sampling by intervitens in https://github.com/sgl-project/sglang/pull/1167
* [Docs] Fix rendering of details in README by Michaelvll in https://github.com/sgl-project/sglang/pull/1179
* Improve code style of sampler by hnyls2002 in https://github.com/sgl-project/sglang/pull/1168
* [Minor] Improve logging and rename the health check endpoint name by merrymercy in https://github.com/sgl-project/sglang/pull/1180
* Fix broken penalty by hnyls2002 in https://github.com/sgl-project/sglang/pull/1184
* Fix benchmark script by Ying1123 in https://github.com/sgl-project/sglang/pull/1185
* [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by kcz358 in https://github.com/sgl-project/sglang/pull/1123
* feat: use gelu_tanh_and_mul by zhyncs in https://github.com/sgl-project/sglang/pull/1193
* Cleanup readme, llava examples, usage examples and nccl init by merrymercy in https://github.com/sgl-project/sglang/pull/1194
* Update README.md by merrymercy in https://github.com/sgl-project/sglang/pull/1198
* [CI] Fix the problem of hf runner too slow by Ying1123 in https://github.com/sgl-project/sglang/pull/1202
* [Fix] the issue of random order when input is a list by Ying1123 in https://github.com/sgl-project/sglang/pull/1199
* Relax the assert in moe throughput test to fix the flaky CI by merrymercy in https://github.com/sgl-project/sglang/pull/1207
* [Fix] Fixing the multi-images error for llava-onevision by kcz358 in https://github.com/sgl-project/sglang/pull/1205
* Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/1186
* [Minor] Improve the function organization in TokenizerManager & improve loggers by merrymercy in https://github.com/sgl-project/sglang/pull/1208
* [Minor] Temporarily skip flaky test by Ying1123 in https://github.com/sgl-project/sglang/pull/1209
* [CI] Fix the issue of unit test hanging by Ying1123 in https://github.com/sgl-project/sglang/pull/1211
* Update CI workflows by merrymercy in https://github.com/sgl-project/sglang/pull/1210
* Update CI runner docs by merrymercy in https://github.com/sgl-project/sglang/pull/1213
* [Feature] Support fp8 e5m2 kv cache with flashinfer by ispobock in https://github.com/sgl-project/sglang/pull/1204
* Update workflow files by merrymercy in https://github.com/sgl-project/sglang/pull/1214
* improve the threshold and ports in tests by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1215
* [CI] Fix CI by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1217
* [Fix] Multi-images loading error by kcz358 in https://github.com/sgl-project/sglang/pull/1218
* [Minor] improve CI and dependencies by hnyls2002 in https://github.com/sgl-project/sglang/pull/1212
* [CI] Parallelize unit tests in CI by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1219
* Move sampler into CUDA graph by hnyls2002 in https://github.com/sgl-project/sglang/pull/1201
* chore: bump v0.2.14 by zhyncs in https://github.com/sgl-project/sglang/pull/1155
* [FEAT] JSON constrained support by havetc in https://github.com/sgl-project/sglang/pull/1125
* Torch compile CI throughput test by hnyls2002 in https://github.com/sgl-project/sglang/pull/1223
* [FEAT] Support batches cancel by caiyueliang in https://github.com/sgl-project/sglang/pull/1222
* [Minor] add delete test and delete tmp file on ci server by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1227
* [FIX] Wrong logger by havetc in https://github.com/sgl-project/sglang/pull/1230
* feat: replace get_act_fn for gpt_bigcode by zhyncs in https://github.com/sgl-project/sglang/pull/1231
* Fix readme by ArtificialZeng in https://github.com/sgl-project/sglang/pull/1236
* Fix bench latency benchmark by hnyls2002 in https://github.com/sgl-project/sglang/pull/1225
* [Minor] Add more type annotations by merrymercy in https://github.com/sgl-project/sglang/pull/1237
* feat: support sm75 with FlashInfer v0.1.6 by zhyncs in https://github.com/sgl-project/sglang/pull/1233
* Update README.md by merrymercy in https://github.com/sgl-project/sglang/pull/1239
* hotfix: revert sampler CUDA Graph by zhyncs in https://github.com/sgl-project/sglang/pull/1242
* Add sglang.bench_latency to CI by merrymercy in https://github.com/sgl-project/sglang/pull/1243
* fix: increase max_new_tokens when testing generation models by zhyncs in https://github.com/sgl-project/sglang/pull/1244
* feat: update GemmaRMSNorm by zhyncs in https://github.com/sgl-project/sglang/pull/1232
* Fix llava on multi images by merrymercy in https://github.com/sgl-project/sglang/pull/1247
* feat: replace GeluAndMul by zhyncs in https://github.com/sgl-project/sglang/pull/1234
* fix: resolve qwen2 moe weight loader by zhyncs in https://github.com/sgl-project/sglang/pull/1252
* chore: bump v0.2.14.post2 by zhyncs in https://github.com/sgl-project/sglang/pull/1250
* make json_schema usable from gen by qeternity in https://github.com/sgl-project/sglang/pull/1254
* fix data racing due to mutable reference using deepcopy by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/1255
* Sampler cudagraph by hnyls2002 in https://github.com/sgl-project/sglang/pull/1253
* fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by lxww302 in https://github.com/sgl-project/sglang/pull/1260
* Transpose mla weight offline by ispobock in https://github.com/sgl-project/sglang/pull/1261
* EXAONE 3.0 Model Support by Deepfocused in https://github.com/sgl-project/sglang/pull/1258
* Update README Support Exaone 3.0 by Deepfocused in https://github.com/sgl-project/sglang/pull/1267
* Report median instead of mean in bench_latency.py by merrymercy in https://github.com/sgl-project/sglang/pull/1269
* Allow more flexible assistant and system response by BabyChouSr in https://github.com/sgl-project/sglang/pull/1256
* fix: resolve the fp8 bug introduced by vLLM 0.5.5 by zhyncs in https://github.com/sgl-project/sglang/pull/1276
* [doc] fix quick start link by ByronHsu in https://github.com/sgl-project/sglang/pull/1282
* Optimize the update flashinfer indices by xiaobochen123 in https://github.com/sgl-project/sglang/pull/1262
* [CI] Add more multi-gpu tests by merrymercy in https://github.com/sgl-project/sglang/pull/1280
* feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by zhyncs in https://github.com/sgl-project/sglang/pull/1285
* [CI] merge all ci tests into one file by merrymercy in https://github.com/sgl-project/sglang/pull/1289
* Support Triton fp8 e5m2 kv cache by ispobock in https://github.com/sgl-project/sglang/pull/1286
* [triton] Remove the zero initialization of qk_acc by directly writing the result by ByronHsu in https://github.com/sgl-project/sglang/pull/1288
* [Chore] Rename model_overide_args to model_override_args by kevin85421 in https://github.com/sgl-project/sglang/pull/1284
* Allow new lines during JSON generation by qeternity in https://github.com/sgl-project/sglang/pull/1277
* fix: resolve fp8 for mixtral by zhyncs in https://github.com/sgl-project/sglang/pull/1290
* ci: add nightly eval by zhyncs in https://github.com/sgl-project/sglang/pull/1291
* Fix the flaky tests in test_moe_eval_accuracy_large.py by merrymercy in https://github.com/sgl-project/sglang/pull/1293
* [doc] Fix more broken links by ByronHsu in https://github.com/sgl-project/sglang/pull/1294
* Fix regex mask by hnyls2002 in https://github.com/sgl-project/sglang/pull/1296
* Fix hang when doing s += None. by max99x in https://github.com/sgl-project/sglang/pull/1297