Sglang

Latest version: v0.3.6

0.3.1

* Remove deprecated configs by merrymercy in https://github.com/sgl-project/sglang/pull/1431
* [Feature] Support LoRA path renaming and add LoRA serving benchmarks by Ying1123 in https://github.com/sgl-project/sglang/pull/1433
* Revert "[Minor] Raise exception for wrong import (#1409)" by Ying1123 in https://github.com/sgl-project/sglang/pull/1432
* Add constrained_json_whitespace_pattern to ServerArgs by zifeitong in https://github.com/sgl-project/sglang/pull/1438
* Clean up model loader by merrymercy in https://github.com/sgl-project/sglang/pull/1440
* Simplify sampler and its error handling by merrymercy in https://github.com/sgl-project/sglang/pull/1441
* [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by HaiShaw in https://github.com/sgl-project/sglang/pull/1420
* Fix torch compile for deepseek-v2 by ispobock in https://github.com/sgl-project/sglang/pull/1442
* Add OLMoE model by janimo in https://github.com/sgl-project/sglang/pull/1444

0.3.0

Highlights
Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mixed).
- Added multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.
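
The chunked prefill item above can be pictured with a small, library-independent sketch. It assumes the general idea only: a long prompt is prefilled in fixed-size pieces, and in "mixed" mode the requests that are already decoding share a step with each piece. The names `split_into_chunks` and `plan_steps` are hypothetical and are not part of the SGLang API; the real scheduler is considerably more involved.

```python
from typing import Dict, Iterator, List


def split_into_chunks(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


def plan_steps(
    prompt: List[int],
    chunk_size: int,
    decoding_requests: List[str],
    mixed: bool,
) -> List[Dict[str, list]]:
    """Build a toy schedule with one step per prefill chunk.

    In mixed mode, every step also lists the requests that decode one
    token alongside the chunk; in separate mode the chunk runs alone.
    (Illustrative assumption, not SGLang's actual scheduling policy.)
    """
    steps = []
    for chunk in split_into_chunks(prompt, chunk_size):
        steps.append({
            "prefill": chunk,
            "decode": list(decoding_requests) if mixed else [],
        })
    return steps


if __name__ == "__main__":
    for step in plan_steps(list(range(10)), 4, ["req-a", "req-b"], mixed=True):
        print(step)
```

With a 10-token prompt and a chunk size of 4, the sketch produces three steps covering 4, 4, and 2 tokens.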

What's Changed
* update hyperparameter guide by merrymercy in https://github.com/sgl-project/sglang/pull/1114
* ci: compatible with fork repo by zhyncs in https://github.com/sgl-project/sglang/pull/1115
* fix: resolve Python.h header missing by zhyncs in https://github.com/sgl-project/sglang/pull/1119
* Fix the deadlock in multi-node tp by merrymercy in https://github.com/sgl-project/sglang/pull/1122
* Mixed style of chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/1013
* Fix port conflicts between local CI and runner CI. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1131
* Fix CI accuracy && time out limit by hnyls2002 in https://github.com/sgl-project/sglang/pull/1133
* fix: use fp16 dtype for sm75 by zhyncs in https://github.com/sgl-project/sglang/pull/1136
* Improve the code style: more comments and remove useless packages by merrymercy in https://github.com/sgl-project/sglang/pull/1139
* Improve benchmark by merrymercy in https://github.com/sgl-project/sglang/pull/1140
* Fix duplicated imports in hf_transformers_utils.py by merrymercy in https://github.com/sgl-project/sglang/pull/1141
* fixed a typo by min-xu-et in https://github.com/sgl-project/sglang/pull/1143
* [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by Michaelvll in https://github.com/sgl-project/sglang/pull/1144
* [Feat]Add support for optional start len of logprobs by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1035
* Optimize MLA/GQA/MQA Triton decoding by ispobock in https://github.com/sgl-project/sglang/pull/1138
* feat: allow streaming for multi-prompt and/or parallel sampling by vhain in https://github.com/sgl-project/sglang/pull/1134
* Improve docs and warnings by merrymercy in https://github.com/sgl-project/sglang/pull/1164
* [Feature] add disable-custom-all-reduce by Xu-Chen in https://github.com/sgl-project/sglang/pull/1148
* misc: add hypervisor vendor by zhyncs in https://github.com/sgl-project/sglang/pull/1165
* support /v1/health using a generation 1 token by LucienShui in https://github.com/sgl-project/sglang/pull/1154
* fix: resolve README render by zhyncs in https://github.com/sgl-project/sglang/pull/1166
* [Feat] Support update weights without restart server by shanyu-sys in https://github.com/sgl-project/sglang/pull/1157
* Improve multi-node stability by merrymercy in https://github.com/sgl-project/sglang/pull/1171
* fix: custom op fallback forward native when lower sm80 by zhyncs in https://github.com/sgl-project/sglang/pull/1177
* [Feature] Add a function to convert sampling_params to kwargs by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1170
* Support min-p sampling by intervitens in https://github.com/sgl-project/sglang/pull/1167
* [Docs] Fix rendering of details in README by Michaelvll in https://github.com/sgl-project/sglang/pull/1179
* Improve code style of sampler by hnyls2002 in https://github.com/sgl-project/sglang/pull/1168
* [Minor] Improve logging and rename the health check endpoint name by merrymercy in https://github.com/sgl-project/sglang/pull/1180
* Fix broken penalty by hnyls2002 in https://github.com/sgl-project/sglang/pull/1184
* Fix benchmark script by Ying1123 in https://github.com/sgl-project/sglang/pull/1185
* [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by kcz358 in https://github.com/sgl-project/sglang/pull/1123
* feat: use gelu_tanh_and_mul by zhyncs in https://github.com/sgl-project/sglang/pull/1193
* Cleanup readme, llava examples, usage examples and nccl init by merrymercy in https://github.com/sgl-project/sglang/pull/1194
* Update README.md by merrymercy in https://github.com/sgl-project/sglang/pull/1198
* [CI] Fix the problem of hf runner too slow by Ying1123 in https://github.com/sgl-project/sglang/pull/1202
* [Fix] the issue of random order when input is a list by Ying1123 in https://github.com/sgl-project/sglang/pull/1199
* Relax the assert in moe throughput test to fix the flaky CI by merrymercy in https://github.com/sgl-project/sglang/pull/1207
* [Fix] Fixing the multi-images error for llava-onevision by kcz358 in https://github.com/sgl-project/sglang/pull/1205
* Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/1186
* [Minor] Improve the function organization in TokenizerManager & improve loggers by merrymercy in https://github.com/sgl-project/sglang/pull/1208
* [Minor] Temporarily skip flaky test by Ying1123 in https://github.com/sgl-project/sglang/pull/1209
* [CI] Fix the issue of unit test hanging by Ying1123 in https://github.com/sgl-project/sglang/pull/1211
* Update CI workflows by merrymercy in https://github.com/sgl-project/sglang/pull/1210
* Update CI runner docs by merrymercy in https://github.com/sgl-project/sglang/pull/1213
* [Feature] Support fp8 e5m2 kv cache with flashinfer by ispobock in https://github.com/sgl-project/sglang/pull/1204
* Update workflow files by merrymercy in https://github.com/sgl-project/sglang/pull/1214
* improve the threshold and ports in tests by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1215
* [CI] Fix CI by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1217
* [Fix] Multi-images loading error by kcz358 in https://github.com/sgl-project/sglang/pull/1218
* [Minor] improve CI and dependencies by hnyls2002 in https://github.com/sgl-project/sglang/pull/1212
* [CI] Parallelize unit tests in CI by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1219
* Move sampler into CUDA graph by hnyls2002 in https://github.com/sgl-project/sglang/pull/1201
* chore: bump v0.2.14 by zhyncs in https://github.com/sgl-project/sglang/pull/1155
* [FEAT] JSON constrained support by havetc in https://github.com/sgl-project/sglang/pull/1125
* Torch compile CI throughput test by hnyls2002 in https://github.com/sgl-project/sglang/pull/1223
* [FEAT] Support batches cancel by caiyueliang in https://github.com/sgl-project/sglang/pull/1222
* [Minor] add delete test and delete tmp file on ci server by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1227
* [FIX] Wrong logger by havetc in https://github.com/sgl-project/sglang/pull/1230
* feat: replace get_act_fn for gpt_bigcode by zhyncs in https://github.com/sgl-project/sglang/pull/1231
* Fix readme by ArtificialZeng in https://github.com/sgl-project/sglang/pull/1236
* Fix bench latency benchmark by hnyls2002 in https://github.com/sgl-project/sglang/pull/1225
* [Minor] Add more type annotations by merrymercy in https://github.com/sgl-project/sglang/pull/1237
* feat: support sm75 with FlashInfer v0.1.6 by zhyncs in https://github.com/sgl-project/sglang/pull/1233
* Update README.md by merrymercy in https://github.com/sgl-project/sglang/pull/1239
* hotfix: revert sampler CUDA Graph by zhyncs in https://github.com/sgl-project/sglang/pull/1242
* Add sglang.bench_latency to CI by merrymercy in https://github.com/sgl-project/sglang/pull/1243
* fix: increase max_new_tokens when testing generation models by zhyncs in https://github.com/sgl-project/sglang/pull/1244
* feat: update GemmaRMSNorm by zhyncs in https://github.com/sgl-project/sglang/pull/1232
* Fix llava on multi images by merrymercy in https://github.com/sgl-project/sglang/pull/1247
* feat: replace GeluAndMul by zhyncs in https://github.com/sgl-project/sglang/pull/1234
* fix: resolve qwen2 moe weight loader by zhyncs in https://github.com/sgl-project/sglang/pull/1252
* chore: bump v0.2.14.post2 by zhyncs in https://github.com/sgl-project/sglang/pull/1250
* make json_schema usable from gen by qeternity in https://github.com/sgl-project/sglang/pull/1254
* fix data racing due to mutable reference using deepcopy by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/1255
* Sampler cudagraph by hnyls2002 in https://github.com/sgl-project/sglang/pull/1253
* fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by lxww302 in https://github.com/sgl-project/sglang/pull/1260
* Transpose mla weight offline by ispobock in https://github.com/sgl-project/sglang/pull/1261
* EXAONE 3.0 Model Support by Deepfocused in https://github.com/sgl-project/sglang/pull/1258
* Update README Support Exaone 3.0 by Deepfocused in https://github.com/sgl-project/sglang/pull/1267
* Report median instead of mean in bench_latency.py by merrymercy in https://github.com/sgl-project/sglang/pull/1269
* Allow more flexible assistant and system response by BabyChouSr in https://github.com/sgl-project/sglang/pull/1256
* fix: resolve the fp8 bug introduced by vLLM 0.5.5 by zhyncs in https://github.com/sgl-project/sglang/pull/1276
* [doc] fix quick start link by ByronHsu in https://github.com/sgl-project/sglang/pull/1282
* Optimize the update flashinfer indices by xiaobochen123 in https://github.com/sgl-project/sglang/pull/1262
* [CI] Add more multi-gpu tests by merrymercy in https://github.com/sgl-project/sglang/pull/1280
* feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by zhyncs in https://github.com/sgl-project/sglang/pull/1285
* [CI] merge all ci tests into one file by merrymercy in https://github.com/sgl-project/sglang/pull/1289
* Support Triton fp8 e5m2 kv cache by ispobock in https://github.com/sgl-project/sglang/pull/1286
* [triton] Remove the zero initialization of qk_acc by directly writing the result by ByronHsu in https://github.com/sgl-project/sglang/pull/1288
* [Chore] Rename model_overide_args to model_override_args by kevin85421 in https://github.com/sgl-project/sglang/pull/1284
* Allow new lines during JSON generation by qeternity in https://github.com/sgl-project/sglang/pull/1277
* fix: resolve fp8 for mixtral by zhyncs in https://github.com/sgl-project/sglang/pull/1290
* ci: add nightly eval by zhyncs in https://github.com/sgl-project/sglang/pull/1291
* Fix the flaky tests in test_moe_eval_accuracy_large.py by merrymercy in https://github.com/sgl-project/sglang/pull/1293
* [doc] Fix more broken links by ByronHsu in https://github.com/sgl-project/sglang/pull/1294
* Fix regex mask by hnyls2002 in https://github.com/sgl-project/sglang/pull/1296
* Fix hang when doing s += None. by max99x in https://github.com/sgl-project/sglang/pull/1297
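
One entry in the list above adds min-p sampling (pull/1167). As a rough illustration of the technique as it is commonly defined, the sketch below keeps only tokens whose probability is at least `min_p` times the probability of the most likely token, renormalizes, and samples. The function names are hypothetical and the code is plain Python; it is not taken from the SGLang sampler, which may differ in details.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs: List[float], min_p: float) -> List[float]:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 for min_p <= 1: the top token survives
    return [p / total for p in kept]


def sample_min_p(
    logits: List[float],
    min_p: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the min-p filtered distribution."""
    rng = rng or random.Random()
    probs = min_p_filter(softmax(logits), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    print(min_p_filter([0.5, 0.3, 0.2], min_p=0.5))
    print(sample_min_p([2.0, 1.0, -3.0], min_p=0.1, rng=random.Random(0)))
```

Because the threshold scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.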

0.2.15

* feat: update nightly gsm8k eval by zhyncs in https://github.com/sgl-project/sglang/pull/1304
* Fix bugs in sampler with CUDA graph / torch.compile by hnyls2002 in https://github.com/sgl-project/sglang/pull/1306
* [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping by merrymercy in https://github.com/sgl-project/sglang/pull/1308
* Support Phi3 mini and medium by janimo in https://github.com/sgl-project/sglang/pull/1299
* Update README.md for llava-onevision instructions by merrymercy in https://github.com/sgl-project/sglang/pull/1313
* Fix llama2 weight loader by merrymercy in https://github.com/sgl-project/sglang/pull/1317
* Fix select by ensuring each request has at least one token by merrymercy in https://github.com/sgl-project/sglang/pull/1318
* misc: speedup load safetensors by zhyncs in https://github.com/sgl-project/sglang/pull/1319
* chore: bump v0.3.0 by zhyncs in https://github.com/sgl-project/sglang/pull/1320
* Fix the flaky test test_moe_eval_accuracy_large.py by merrymercy in https://github.com/sgl-project/sglang/pull/1326
* docs: update news by zhyncs in https://github.com/sgl-project/sglang/pull/1327

New Contributors
* Michaelvll made their first contribution in https://github.com/sgl-project/sglang/pull/1144
* Xu-Chen made their first contribution in https://github.com/sgl-project/sglang/pull/1148
* shanyu-sys made their first contribution in https://github.com/sgl-project/sglang/pull/1157
* intervitens made their first contribution in https://github.com/sgl-project/sglang/pull/1167
* zhaochenyang20 made their first contribution in https://github.com/sgl-project/sglang/pull/1186
* havetc made their first contribution in https://github.com/sgl-project/sglang/pull/1125
* caiyueliang made their first contribution in https://github.com/sgl-project/sglang/pull/1222
* ArtificialZeng made their first contribution in https://github.com/sgl-project/sglang/pull/1236
* lxww302 made their first contribution in https://github.com/sgl-project/sglang/pull/1260
* Deepfocused made their first contribution in https://github.com/sgl-project/sglang/pull/1258
* ByronHsu made their first contribution in https://github.com/sgl-project/sglang/pull/1282
* xiaobochen123 made their first contribution in https://github.com/sgl-project/sglang/pull/1262
* kevin85421 made their first contribution in https://github.com/sgl-project/sglang/pull/1284

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.2.13...v0.3.0

0.2.13

Highlights
* **New Feature**: Support window attention for Gemma-2 (#1056, #1090, #1112), enable chunked prefill by default (#1040, #984), support all sampling penalties (#973)
* **New Models**: Support the embedding model e5-mistral (#983, #987, #988, #997, #1014) and a comprehensive OpenAI-compatible API.
* **Performance**: Accelerate Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek-V2 (#905).
* **More CI Tests**: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
* **Refactor and fix**: More modular, better stability, use more kernels from flashinfer (#907)
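
The sampling penalties mentioned in the highlights (frequency, presence, and repetition) can be sketched as follows. This assumes the widely used definitions: the frequency penalty subtracts an amount proportional to how often a token has already been generated, the presence penalty subtracts a flat amount once a token has appeared, and the repetition penalty divides positive logits (or multiplies negative ones) by a factor. The function `apply_penalties` and the order in which the penalties are combined are illustrative assumptions, not the SGLang implementation.

```python
from collections import Counter
from typing import List


def apply_penalties(
    logits: List[float],
    generated: List[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> List[float]:
    """Return a penalized copy of logits given previously generated token ids."""
    counts = Counter(generated)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # Repetition penalty: push the logit toward "less likely".
        if value > 0:
            value /= repetition_penalty
        else:
            value *= repetition_penalty
        # Frequency penalty scales with the count; presence is a flat offset.
        value -= count * frequency_penalty + presence_penalty
        out[token_id] = value
    return out


if __name__ == "__main__":
    print(apply_penalties(
        [2.0, 1.0, -1.0],
        generated=[0, 0, 2],
        frequency_penalty=0.5,
        presence_penalty=0.25,
        repetition_penalty=2.0,
    ))
```

Tokens that have not been generated yet (index 1 in the example) keep their original logit.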

What's Changed
* fix: set env in runner by zhyncs in https://github.com/sgl-project/sglang/pull/891
* docs: update setup runner by zhyncs in https://github.com/sgl-project/sglang/pull/884
* misc: update cuda graph capture exception log by zhyncs in https://github.com/sgl-project/sglang/pull/894
* chore: add multipart dep for fastapi by zhyncs in https://github.com/sgl-project/sglang/pull/895
* [minor] fixed code formatting doc by min-xu-et in https://github.com/sgl-project/sglang/pull/896
* Bump version to 0.2.9.post1 by Ying1123 in https://github.com/sgl-project/sglang/pull/899
* Update the base image of the docker by Ying1123 in https://github.com/sgl-project/sglang/pull/900
* Reorder CI unit tests. by hnyls2002 in https://github.com/sgl-project/sglang/pull/908
* fixed an error handling in bench_latency.py by min-xu-et in https://github.com/sgl-project/sglang/pull/904
* Add model accuracy test - step 1 by Ying1123 in https://github.com/sgl-project/sglang/pull/866
* latency test enhancement - part 1 by min-xu-et in https://github.com/sgl-project/sglang/pull/909
* Improve the structure of CI by Ying1123 in https://github.com/sgl-project/sglang/pull/911
* fix: use e2e and unit test only for original repo or pr by zhyncs in https://github.com/sgl-project/sglang/pull/912
* misc: add triton in check_env PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/914
* Support MLA for DeepSeek-V2 with Triton - step 1 by ispobock in https://github.com/sgl-project/sglang/pull/905
* enhance latency test - part 2 by min-xu-et in https://github.com/sgl-project/sglang/pull/915
* Make API Key OpenAI-compatible by Ying1123 in https://github.com/sgl-project/sglang/pull/917
* Update hyperparameter_tuning.md by Ying1123 in https://github.com/sgl-project/sglang/pull/918
* Fix CI && python3.8 compatible by hnyls2002 in https://github.com/sgl-project/sglang/pull/920
* Support more OpenAI API test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/916
* Bump version to 0.2.10 by Ying1123 in https://github.com/sgl-project/sglang/pull/923
* latency test enhancement - final part by min-xu-et in https://github.com/sgl-project/sglang/pull/921
* Test openai vision api by Ying1123 in https://github.com/sgl-project/sglang/pull/925
* Test regex in vision api by Ying1123 in https://github.com/sgl-project/sglang/pull/926
* Update README.md by Ying1123 in https://github.com/sgl-project/sglang/pull/927
* Fix prompt len in parallel sampling by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/928
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/935
* Remove leftover auth_token by AidanCooper in https://github.com/sgl-project/sglang/pull/934
* Feat: add alternative choices selection methods by AidanCooper in https://github.com/sgl-project/sglang/pull/835
* Fix union operator by ispobock in https://github.com/sgl-project/sglang/pull/940
* Support multiple args options by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/941
* Fix stuck in `get_new_prefill_batch` by hnyls2002 in https://github.com/sgl-project/sglang/pull/948
* Organize code (rename, movement) by hnyls2002 in https://github.com/sgl-project/sglang/pull/953
* fix nsys cannot profile cuda kernel by mpjlu in https://github.com/sgl-project/sglang/pull/957
* Add support for Batch API test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/936
* Show more error messages for warmup errors by Ying1123 in https://github.com/sgl-project/sglang/pull/932
* misc: update issue template by zhyncs in https://github.com/sgl-project/sglang/pull/963
* misc: simplify test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/964
* misc: add compute capability in check_env by zhyncs in https://github.com/sgl-project/sglang/pull/965
* Make `req_pool_indices` on CPU by hnyls2002 in https://github.com/sgl-project/sglang/pull/960
* misc: fix the req_to_token member change by hnyls2002 in https://github.com/sgl-project/sglang/pull/967
* chore: update vllm to 0.5.4 by zhyncs in https://github.com/sgl-project/sglang/pull/966
* chore: bump v0.2.11 by zhyncs in https://github.com/sgl-project/sglang/pull/970
* Purge self-runner's pip cache weekly by hnyls2002 in https://github.com/sgl-project/sglang/pull/975
* Run purge-cache only in sgl-project by hnyls2002 in https://github.com/sgl-project/sglang/pull/976
* misc: correct the int data type for token ids and indices by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/969
* PrefillAdder abstraction by hnyls2002 in https://github.com/sgl-project/sglang/pull/968
* RadixCache method adjust by hnyls2002 in https://github.com/sgl-project/sglang/pull/977
* Adjust max prefix len by hnyls2002 in https://github.com/sgl-project/sglang/pull/980
* 590 Increase default , track changes in examples and documentation by foszto in https://github.com/sgl-project/sglang/pull/971
* [minor] Update type annotation in tokenizer_manager.py by Ying1123 in https://github.com/sgl-project/sglang/pull/982
* Fix chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/984
* Add llama embedding modules [unreachable code] - step 1/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/983
* Add io struct for embedding models [unreachable code] - step 2/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/987
* Adjust `InputeMetadata` and `ScheduleBatch` by hnyls2002 in https://github.com/sgl-project/sglang/pull/981
* support more options about usage in stream mode by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/985
* Create contributor_guide.md by Ying1123 in https://github.com/sgl-project/sglang/pull/992
* feat: frequency, min_new_tokens, presence, and repetition penalties by vhain in https://github.com/sgl-project/sglang/pull/973
* Move torch.compile configs into cuda_graph_runner.py by Ying1123 in https://github.com/sgl-project/sglang/pull/993
* Add e5-mistral embedding model - step 3/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/988
* test: negative value testing for frequency, presence penalizers by vhain in https://github.com/sgl-project/sglang/pull/995
* support models from www.modelscope.cn by liuyhwangyh in https://github.com/sgl-project/sglang/pull/994
* bugfix: penalizers to be merged before reqs by vhain in https://github.com/sgl-project/sglang/pull/1001
* fix: resolve correctness_test issue by zhyncs in https://github.com/sgl-project/sglang/pull/1002
* Minor bugfix on benchmark serving by ywang96 in https://github.com/sgl-project/sglang/pull/1005
* Add openai embedding API by Ying1123 in https://github.com/sgl-project/sglang/pull/997
* Add skip_tokenizer_init args. by gryffindor-rr in https://github.com/sgl-project/sglang/pull/959
* Fix benchmark latency by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1007
* Some warnings to crash when CI by hnyls2002 in https://github.com/sgl-project/sglang/pull/1009
* Reduce the overhead when cache is disabled by hnyls2002 in https://github.com/sgl-project/sglang/pull/1010
* Support embedding input as a list by Ying1123 in https://github.com/sgl-project/sglang/pull/1014
* misc: update test config by zhyncs in https://github.com/sgl-project/sglang/pull/990
* fix: force max new tokens to be 1 for embedding request by Ying1123 in https://github.com/sgl-project/sglang/pull/1019
* Clean up unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1020
* Fix `input_ids` && rename to `fill_ids` by hnyls2002 in https://github.com/sgl-project/sglang/pull/1021
* feat: use FlashInfer rmsnorm and silu by zhyncs in https://github.com/sgl-project/sglang/pull/907
* misc: update issue template by zhyncs in https://github.com/sgl-project/sglang/pull/1024
* Clean up readme and arguments of chunked prefill by merrymercy in https://github.com/sgl-project/sglang/pull/1022
* Fix wrong assert by hnyls2002 in https://github.com/sgl-project/sglang/pull/1028
* Improve type annotation by merrymercy in https://github.com/sgl-project/sglang/pull/1029
* hotfix: add CustomOp abstraction by zhyncs in https://github.com/sgl-project/sglang/pull/1027
* Fix the case where r.prefix_indices is None by merrymercy in https://github.com/sgl-project/sglang/pull/1031
* Fix triton args init by hnyls2002 in https://github.com/sgl-project/sglang/pull/1034
* Fix the case when max_new_tokens is too large by merrymercy in https://github.com/sgl-project/sglang/pull/1025
* Test the case when max_new_tokens is very large by merrymercy in https://github.com/sgl-project/sglang/pull/1038
* Fix the prefix indices by hnyls2002 in https://github.com/sgl-project/sglang/pull/1037
* Improve end-to-end throughput test and its coverage by merrymercy in https://github.com/sgl-project/sglang/pull/1039
* Delete the useless test/srt/test_throughput.py by merrymercy in https://github.com/sgl-project/sglang/pull/1045
* minor: some potential bugs by hnyls2002 in https://github.com/sgl-project/sglang/pull/1044
* Clean up the comments and names under python/sglang/srt/layers by merrymercy in https://github.com/sgl-project/sglang/pull/1047
* fix: Fix returned prefill logits and add output str test by Ying1123 in https://github.com/sgl-project/sglang/pull/1046
* feat: update Dockerfile by zhyncs in https://github.com/sgl-project/sglang/pull/1033
* docs: update setup github runner by zhyncs in https://github.com/sgl-project/sglang/pull/1050
* Add longer accuracy test on CI by merrymercy in https://github.com/sgl-project/sglang/pull/1049
* Fix accuracy test by merrymercy in https://github.com/sgl-project/sglang/pull/1051
* Re-organize CI tests by merrymercy in https://github.com/sgl-project/sglang/pull/1052
* chore: bump v0.2.12 by zhyncs in https://github.com/sgl-project/sglang/pull/1048
* feat: replace all rmsnorm and silu by zhyncs in https://github.com/sgl-project/sglang/pull/1057
* fix: not use the default port by zhyncs in https://github.com/sgl-project/sglang/pull/1068
* Fix layernorm input shape by ispobock in https://github.com/sgl-project/sglang/pull/1066
* fix: temporary solution for DeepSeek V2 H100 layout conversion issue by zhyncs in https://github.com/sgl-project/sglang/pull/1060
* ci: add cancel pr workflow by zhyncs in https://github.com/sgl-project/sglang/pull/1070
* ci: add moe test by zhyncs in https://github.com/sgl-project/sglang/pull/1053
* fix: use devel for Triton's compiler requirements by zhyncs in https://github.com/sgl-project/sglang/pull/1074
* ci: add accuracy timeout by zhyncs in https://github.com/sgl-project/sglang/pull/1078
* Fix create_abort_task, GenerateReqInput does not have rids. by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1079
* Example file for docker compose and k8s by LucienShui in https://github.com/sgl-project/sglang/pull/1006
* Update the mixtral to use the better FusedMoE layer by merrymercy in https://github.com/sgl-project/sglang/pull/1081
* [Feat] Add window attention for gemma-2 by Ying1123 in https://github.com/sgl-project/sglang/pull/1056
* Fix jump forward final state circular path bug. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1084
* ci: update timeout and retry by zhyncs in https://github.com/sgl-project/sglang/pull/1086
* [Feature] modify Runtime to support skip_tokenizer_init by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1088
* Fix a bug in cuda graph runner by merrymercy in https://github.com/sgl-project/sglang/pull/1094
* ci: remove workflow path trigger by zhyncs in https://github.com/sgl-project/sglang/pull/1096
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/1098
* Update grok 1 model by merrymercy in https://github.com/sgl-project/sglang/pull/1095
* docs: update pr template by zhyncs in https://github.com/sgl-project/sglang/pull/1099
* Use `dtype` to control generate by hnyls2002 in https://github.com/sgl-project/sglang/pull/1082
* [Fix] Compatibility of window attention and cuda graph by Ying1123 in https://github.com/sgl-project/sglang/pull/1090
* docs: update nsys usage by zhyncs in https://github.com/sgl-project/sglang/pull/1103
* Support `stop_token_ids` in sglang API by hnyls2002 in https://github.com/sgl-project/sglang/pull/1092
* Support jinja as chat template file by Ying1123 in https://github.com/sgl-project/sglang/pull/1104
* Use a single workspace for flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/1077
* [Fix] fix the typo bug for window attention by Ying1123 in https://github.com/sgl-project/sglang/pull/1106
* Enable chunked prefill by default by merrymercy in https://github.com/sgl-project/sglang/pull/1040
* [Fix] fix flashinfer usage for window attention by Ying1123 in https://github.com/sgl-project/sglang/pull/1107
* misc: rm unused model_loader by zhyncs in https://github.com/sgl-project/sglang/pull/1110
* [Fix] Window attention compatible with RadixAttention and chunked prefill by Ying1123 in https://github.com/sgl-project/sglang/pull/1112
* set CUDA_DEVICE_MAX_CONNECTIONS=1 by merrymercy in https://github.com/sgl-project/sglang/pull/1113
* chore: bump v0.2.13 by zhyncs in https://github.com/sgl-project/sglang/pull/1111

New Contributors
* min-xu-et made their first contribution in https://github.com/sgl-project/sglang/pull/896
* mpjlu made their first contribution in https://github.com/sgl-project/sglang/pull/957
* xiezhq-hermann made their first contribution in https://github.com/sgl-project/sglang/pull/969
* foszto made their first contribution in https://github.com/sgl-project/sglang/pull/971
* vhain made their first contribution in https://github.com/sgl-project/sglang/pull/973
* liuyhwangyh made their first contribution in https://github.com/sgl-project/sglang/pull/994
* ywang96 made their first contribution in https://github.com/sgl-project/sglang/pull/1005
* gryffindor-rr made their first contribution in https://github.com/sgl-project/sglang/pull/959
* LucienShui made their first contribution in https://github.com/sgl-project/sglang/pull/1006

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.2.9...v0.2.13

0.2.9

Highlights
- **New feature**: Chunked prefill (#800, #811)
- **New models**: Deepseek v2
- **Performance improvement**: vectorized logprob computation
- **Accuracy fix**: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels
- **Feature fix**: fixed many missing logprob-related features in the OpenAI API server
- **CI/CD infra** is now fully ready. The tests cover frontend, backend, accuracy, and performance tests.


What's Changed
* Deepseek v2 support by hnyls2002 in https://github.com/sgl-project/sglang/pull/693
* Fix context length by hnyls2002 in https://github.com/sgl-project/sglang/pull/757
* docs: update model support by zhyncs in https://github.com/sgl-project/sglang/pull/760
* fix: not run workflows on fork repo by zhyncs in https://github.com/sgl-project/sglang/pull/762
* Update supported models by hnyls2002 in https://github.com/sgl-project/sglang/pull/763
* Fix TransformerTokenizer init for chatglm2 & 3 by ispobock in https://github.com/sgl-project/sglang/pull/761
* [Minor] Improve the code style in TokenizerManager by merrymercy in https://github.com/sgl-project/sglang/pull/767
* Update readme by Ying1123 in https://github.com/sgl-project/sglang/pull/769
* feat: add fake tag by zhyncs in https://github.com/sgl-project/sglang/pull/770
* Fix max_tokens for OpenAI chat completion API by merrymercy in https://github.com/sgl-project/sglang/pull/766
* Fix max new tokens by merrymercy in https://github.com/sgl-project/sglang/pull/772
* Move sampling logits to float32 by merrymercy in https://github.com/sgl-project/sglang/pull/773
* minor refactor: move check server args to server_args.py by wisclmy0611 in https://github.com/sgl-project/sglang/pull/774
* Fix return_log_probs with cuda graph by merrymercy in https://github.com/sgl-project/sglang/pull/775
* Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by merrymercy in https://github.com/sgl-project/sglang/pull/776
* Allow disabling flashinfer sampling kernel by merrymercy in https://github.com/sgl-project/sglang/pull/778
* Bump version to 0.2.6 by merrymercy in https://github.com/sgl-project/sglang/pull/779
* fix: replace pillow with PIL in PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/781
* docs: init readthedocs support by zhyncs in https://github.com/sgl-project/sglang/pull/783
* fix: init readthedocs support by zhyncs in https://github.com/sgl-project/sglang/pull/784
* fix: exclude logo png in gitignore by zhyncs in https://github.com/sgl-project/sglang/pull/785
* docs: update index by zhyncs in https://github.com/sgl-project/sglang/pull/786
* Vectorize logprobs computation by Ying1123 in https://github.com/sgl-project/sglang/pull/787
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/788
* docs: make badges center by zhyncs in https://github.com/sgl-project/sglang/pull/789
* chore: add copyright for srt by zhyncs in https://github.com/sgl-project/sglang/pull/790
* Fix echo + logprob for OpenAI API when the prompt is a list by Ying1123 in https://github.com/sgl-project/sglang/pull/791
* Update README.md by Ying1123 in https://github.com/sgl-project/sglang/pull/792
* Lazy-import third-party backends by bgyoon in https://github.com/sgl-project/sglang/pull/794
* Fix lazy import location by Ying1123 in https://github.com/sgl-project/sglang/pull/795
* Fix logging by Ying1123 in https://github.com/sgl-project/sglang/pull/796
* Add role documentation, add system begin & end tokens by objnf-dev in https://github.com/sgl-project/sglang/pull/793
* Chunked prefill support by hnyls2002 in https://github.com/sgl-project/sglang/pull/797
* Revert "Chunked prefill support" by Ying1123 in https://github.com/sgl-project/sglang/pull/799
* Chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/800
* fix: update flashinfer to 0.1.2 to fix sampling for cu118 by zhyncs in https://github.com/sgl-project/sglang/pull/803
* Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by Ying1123 in https://github.com/sgl-project/sglang/pull/805
* feat: add chat template for internlm2-chat by zhyncs in https://github.com/sgl-project/sglang/pull/802
* Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by Ying1123 in https://github.com/sgl-project/sglang/pull/806
* Add support for OpenAI API : offline batch(file) processing by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/699
* Organize public APIs by hnyls2002 in https://github.com/sgl-project/sglang/pull/809
* Remove inf value for chunked prefill size by hnyls2002 in https://github.com/sgl-project/sglang/pull/812
* Revert "Organize public APIs" by Ying1123 in https://github.com/sgl-project/sglang/pull/815
* fix: use v0.2.5 for benchmark by zhyncs in https://github.com/sgl-project/sglang/pull/814
* Fix LiteLLM kwargs by qeternity in https://github.com/sgl-project/sglang/pull/817
* Code structure refactor by hnyls2002 in https://github.com/sgl-project/sglang/pull/807
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/819
* Fix streaming bug by objnf-dev in https://github.com/sgl-project/sglang/pull/820
* feat: add runner by zhyncs in https://github.com/sgl-project/sglang/pull/821
* feat: add pr e2e test by zhyncs in https://github.com/sgl-project/sglang/pull/822
* Support disable_ignore_eos in bench_serving.py by Ying1123 in https://github.com/sgl-project/sglang/pull/824
* Adjust default mem fraction to avoid OOM by Ying1123 in https://github.com/sgl-project/sglang/pull/823
* Add awq_marlin by Ying1123 in https://github.com/sgl-project/sglang/pull/826
* misc: update e2e test benchmark config by zhyncs in https://github.com/sgl-project/sglang/pull/825
* misc: enable e2e test when push by zhyncs in https://github.com/sgl-project/sglang/pull/828
* docs: add set up runner by zhyncs in https://github.com/sgl-project/sglang/pull/829
* chore: bump v0.2.7 by zhyncs in https://github.com/sgl-project/sglang/pull/830
* Add `--max-total-tokens` by hnyls2002 in https://github.com/sgl-project/sglang/pull/840
* Fix List input bug by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/838
* Add req slots leaking check by hnyls2002 in https://github.com/sgl-project/sglang/pull/842
* docs: update README.md by eltociear in https://github.com/sgl-project/sglang/pull/843
* misc: update e2e test paths config by zhyncs in https://github.com/sgl-project/sglang/pull/848
* chore: update flashinfer to v0.1.3 by zhyncs in https://github.com/sgl-project/sglang/pull/850
* Fix llama for classification by Ying1123 in https://github.com/sgl-project/sglang/pull/855
* Add troubleshooting doc by Ying1123 in https://github.com/sgl-project/sglang/pull/856
* Fix #857 by kaifronsdal in https://github.com/sgl-project/sglang/pull/858
* Add support for logprobs in OpenAI chat API by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/852
* Support chunked prefill when radix cache is disabled by hnyls2002 in https://github.com/sgl-project/sglang/pull/811
* misc: update e2e test paths config by zhyncs in https://github.com/sgl-project/sglang/pull/860
* Rename github workflows by Ying1123 in https://github.com/sgl-project/sglang/pull/861
* misc: disable auto release by zhyncs in https://github.com/sgl-project/sglang/pull/862
* misc: add cancel previous at e2e by zhyncs in https://github.com/sgl-project/sglang/pull/864
* Add OpenAI backend to the CI test by Ying1123 in https://github.com/sgl-project/sglang/pull/869
* Fix openai CI tests by Ying1123 in https://github.com/sgl-project/sglang/pull/870
* misc: use pip cache purge and add unit test ci by zhyncs in https://github.com/sgl-project/sglang/pull/871
* misc: update unit test config by zhyncs in https://github.com/sgl-project/sglang/pull/873
* Fix unit tests for the frontend language part by Ying1123 in https://github.com/sgl-project/sglang/pull/872
* bump to 0.2.8 by Ying1123 in https://github.com/sgl-project/sglang/pull/877
* Make scripts under `/test/srt` as unit tests by Ying1123 in https://github.com/sgl-project/sglang/pull/875
* Update runner docs by hnyls2002 in https://github.com/sgl-project/sglang/pull/876
* Improve the coverage of the openai api server test by Ying1123 in https://github.com/sgl-project/sglang/pull/878
* Implement served_model_name to customize model id when use local mode… by dionren in https://github.com/sgl-project/sglang/pull/749
* Update runner docs by hnyls2002 in https://github.com/sgl-project/sglang/pull/879
* Add more unit tests to CI by Ying1123 in https://github.com/sgl-project/sglang/pull/880
* Add accuracy test to CI: MMLU by Ying1123 in https://github.com/sgl-project/sglang/pull/882
* Update workflow name by Ying1123 in https://github.com/sgl-project/sglang/pull/883
* Fix the double BOS problem in the HF chat template by Ying1123 in https://github.com/sgl-project/sglang/pull/888
* Add benchmark: HumanEval by Ying1123 in https://github.com/sgl-project/sglang/pull/889
* Increase openai client limit by Ying1123 in https://github.com/sgl-project/sglang/pull/886
* Bump version to v0.2.9 by Ying1123 in https://github.com/sgl-project/sglang/pull/890
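
The logprob field rename listed above (`prefill_token_logprobs` -> `input_token_logprobs`, `decode_token_logprobs` -> `output_token_logprobs`) may affect client code that reads these fields by name. The following is a minimal sketch of a compatibility shim, assuming the fields arrive as keys of a plain Python dict. That layout and the helper name `migrate_logprob_fields` are illustrative assumptions, not part of SGLang; only the four field names come from the changelog entry.

```python
# Hypothetical migration helper: the dict layout is an assumption;
# only the four field names come from the changelog entry.
RENAMED_FIELDS = {
    "prefill_token_logprobs": "input_token_logprobs",
    "decode_token_logprobs": "output_token_logprobs",
}


def migrate_logprob_fields(meta: dict) -> dict:
    """Return a copy of `meta` with pre-rename keys mapped to the new names."""
    migrated = {}
    for key, value in meta.items():
        new_key = RENAMED_FIELDS.get(key, key)
        # If both old and new keys exist, keep the value stored under the new name.
        if new_key in migrated and key != new_key:
            continue
        migrated[new_key] = value
    return migrated
```

Passing every response dict through such a helper would let the same client code read output produced before and after the rename; the input dict is left unmodified.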

New Contributors
* bgyoon made their first contribution in https://github.com/sgl-project/sglang/pull/794
* objnf-dev made their first contribution in https://github.com/sgl-project/sglang/pull/793
* kaifronsdal made their first contribution in https://github.com/sgl-project/sglang/pull/858
* dionren made their first contribution in https://github.com/sgl-project/sglang/pull/749

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.2.5...v0.2.9

0.2.5

Highlights

- We recently published a [blog post](https://lmsys.org/blog/2024-07-25-sglang-llama3/). Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, on models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with both FP8 and FP16. **SGLang consistently outperforms vLLM**, achieving up to **3.1x** higher throughput on Llama-70B. It often matches and sometimes outperforms TensorRT-LLM.

- We have now automated the release processes for [PyPI](https://pypi.org/project/sglang/), [Docker](https://hub.docker.com/r/lmsysorg/sglang/tags), and [Release](https://github.com/sgl-project/sglang/releases) using GitHub workflows. Previously, because Release was not automated, GitHub Tags were not updated in time, leading to a jump from [v0.2.0](https://github.com/sgl-project/sglang/releases/tag/v0.2.0) directly to [v0.2.5](https://github.com/sgl-project/sglang/releases/tag/v0.2.5).

- Everyone is welcome to try SGLang at https://github.com/sgl-project/sglang and to take part in the community through issues, PRs, and discussions. Cheers!
