Sglang

Latest version: v0.4.4.post3

0.2.13

Highlights
* **New Feature**: Support window attention for Gemma-2 (#1056, #1090, #1112), enable chunked-prefill by default (#1040, #984), support all sampling penalties (#973)
* **New Models**: Support embedding model e5-mistral (#983, #987, #988, #997, #1014) and comprehensive OpenAI-compatible API.
* **Performance**: Accelerate Multi-head Latent Attention (MLA). Bring 2x end-to-end improvement on Deepseek v2 (#905).
* **More CI Tests**: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
* **Refactor and fix**: More modular, better stability, use more kernels from flashinfer (#907)
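
To make the sampling penalties mentioned in the highlights concrete, here is a minimal sketch of how frequency, presence, and repetition penalties are commonly applied to logits. It is not SGLang's implementation: the function name `apply_penalties`, its parameters, and the order in which the penalties are combined are assumptions made for illustration. The `min_new_tokens` penalty is left out.

```python
from collections import Counter
from typing import List


def apply_penalties(
    logits: List[float],
    generated_ids: List[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> List[float]:
    """Return a penalized copy of `logits` (one entry per vocabulary id).

    Hypothetical helper for illustration only; not part of the SGLang API.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # Repetition penalty is multiplicative: positive logits shrink,
        # negative logits are pushed further down.
        if repetition_penalty != 1.0:
            if value > 0:
                value = value / repetition_penalty
            else:
                value = value * repetition_penalty
        # Frequency penalty grows with how often the token was generated.
        value -= frequency_penalty * count
        # Presence penalty is a flat cost once the token has appeared.
        value -= presence_penalty
        out[token_id] = value
    return out


penalized = apply_penalties(
    logits=[2.0, 1.0, -1.0, 0.5],
    generated_ids=[0, 0, 2],
    frequency_penalty=0.5,
    presence_penalty=0.25,
    repetition_penalty=2.0,
)
print(penalized)  # [-0.25, 1.0, -2.75, 0.5]
```

Tokens that were never generated (ids 1 and 3 here) keep their original logits, so the penalties only discourage repeats.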

What's Changed
* fix: set env in runner by zhyncs in https://github.com/sgl-project/sglang/pull/891
* docs: update setup runner by zhyncs in https://github.com/sgl-project/sglang/pull/884
* misc: update cuda graph capture exception log by zhyncs in https://github.com/sgl-project/sglang/pull/894
* chore: add multipart dep for fastapi by zhyncs in https://github.com/sgl-project/sglang/pull/895
* [minor] fixed code formatting doc by min-xu-et in https://github.com/sgl-project/sglang/pull/896
* Bump version to 0.2.9.post1 by Ying1123 in https://github.com/sgl-project/sglang/pull/899
* Update the base image of the docker by Ying1123 in https://github.com/sgl-project/sglang/pull/900
* Reorder CI unit tests. by hnyls2002 in https://github.com/sgl-project/sglang/pull/908
* fixed an error handling in bench_latency.py by min-xu-et in https://github.com/sgl-project/sglang/pull/904
* Add model accuracy test - step 1 by Ying1123 in https://github.com/sgl-project/sglang/pull/866
* latency test enhancement - part 1 by min-xu-et in https://github.com/sgl-project/sglang/pull/909
* Improve the structure of CI by Ying1123 in https://github.com/sgl-project/sglang/pull/911
* fix: use e2e and unit test only for original repo or pr by zhyncs in https://github.com/sgl-project/sglang/pull/912
* misc: add triton in check_env PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/914
* Support MLA for DeepSeek-V2 with Triton - step 1 by ispobock in https://github.com/sgl-project/sglang/pull/905
* enhance latency test - part 2 by min-xu-et in https://github.com/sgl-project/sglang/pull/915
* Make API Key OpenAI-compatible by Ying1123 in https://github.com/sgl-project/sglang/pull/917
* Update hyperparameter_tuning.md by Ying1123 in https://github.com/sgl-project/sglang/pull/918
* Fix CI && python3.8 compatible by hnyls2002 in https://github.com/sgl-project/sglang/pull/920
* Support more OpenAI API test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/916
* Bump version to 0.2.10 by Ying1123 in https://github.com/sgl-project/sglang/pull/923
* latency test enhancement - final part by min-xu-et in https://github.com/sgl-project/sglang/pull/921
* Test openai vision api by Ying1123 in https://github.com/sgl-project/sglang/pull/925
* Test regex in vision api by Ying1123 in https://github.com/sgl-project/sglang/pull/926
* Update README.md by Ying1123 in https://github.com/sgl-project/sglang/pull/927
* Fix prompt len in parallel sampling by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/928
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/935
* Remove leftover auth_token by AidanCooper in https://github.com/sgl-project/sglang/pull/934
* Feat: add alternative choices selection methods by AidanCooper in https://github.com/sgl-project/sglang/pull/835
* Fix union operator by ispobock in https://github.com/sgl-project/sglang/pull/940
* Support multiple args options by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/941
* Fix stuck in `get_new_prefill_batch` by hnyls2002 in https://github.com/sgl-project/sglang/pull/948
* Organize code (rename, movement) by hnyls2002 in https://github.com/sgl-project/sglang/pull/953
* fix nsys cannot profile cuda kernel by mpjlu in https://github.com/sgl-project/sglang/pull/957
* Add support for Batch API test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/936
* Show more error messages for warmup errors by Ying1123 in https://github.com/sgl-project/sglang/pull/932
* misc: update issue template by zhyncs in https://github.com/sgl-project/sglang/pull/963
* misc: simplify test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/964
* misc: add compute capability in check_env by zhyncs in https://github.com/sgl-project/sglang/pull/965
* Make `req_pool_indices` on CPU by hnyls2002 in https://github.com/sgl-project/sglang/pull/960
* misc: fix the req_to_token member change by hnyls2002 in https://github.com/sgl-project/sglang/pull/967
* chore: update vllm to 0.5.4 by zhyncs in https://github.com/sgl-project/sglang/pull/966
* chore: bump v0.2.11 by zhyncs in https://github.com/sgl-project/sglang/pull/970
* Purge self-runner's pip cache weekly by hnyls2002 in https://github.com/sgl-project/sglang/pull/975
* Run purge-cache only in sgl-project by hnyls2002 in https://github.com/sgl-project/sglang/pull/976
* misc: correct the int data type for token ids and indices by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/969
* PrefillAdder abstraction by hnyls2002 in https://github.com/sgl-project/sglang/pull/968
* RadixCache method adjust by hnyls2002 in https://github.com/sgl-project/sglang/pull/977
* Adjust max prefix len by hnyls2002 in https://github.com/sgl-project/sglang/pull/980
* 590 Increase default , track changes in examples and documentation by foszto in https://github.com/sgl-project/sglang/pull/971
* [minor] Update type annotation in tokenizer_manager.py by Ying1123 in https://github.com/sgl-project/sglang/pull/982
* Fix chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/984
* Add llama embedding modules [unreachable code] - step 1/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/983
* Add io struct for embedding models [unreachable code] - step 2/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/987
* Adjust `InputeMetadata` and `ScheduleBatch` by hnyls2002 in https://github.com/sgl-project/sglang/pull/981
* support more options for usage in stream mode by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/985
* Create contributor_guide.md by Ying1123 in https://github.com/sgl-project/sglang/pull/992
* feat: frequency, min_new_tokens, presence, and repetition penalties by vhain in https://github.com/sgl-project/sglang/pull/973
* Move torch.compile configs into cuda_graph_runner.py by Ying1123 in https://github.com/sgl-project/sglang/pull/993
* Add e5-mistral embedding model - step 3/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/988
* test: negative value testing for frequency, presence penalizers by vhain in https://github.com/sgl-project/sglang/pull/995
* support models from www.modelscope.cn by liuyhwangyh in https://github.com/sgl-project/sglang/pull/994
* bugfix: penalizers to be merged before reqs by vhain in https://github.com/sgl-project/sglang/pull/1001
* fix: resolve correctness_test issue by zhyncs in https://github.com/sgl-project/sglang/pull/1002
* Minor bugfix on benchmark serving by ywang96 in https://github.com/sgl-project/sglang/pull/1005
* Add openai embedding API by Ying1123 in https://github.com/sgl-project/sglang/pull/997
* Add skip_tokenizer_init args. by gryffindor-rr in https://github.com/sgl-project/sglang/pull/959
* Fix benchmark latency by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1007
* Some warnings to crash when CI by hnyls2002 in https://github.com/sgl-project/sglang/pull/1009
* Reduce the overhead when cache is disabled by hnyls2002 in https://github.com/sgl-project/sglang/pull/1010
* Support embedding input as a list by Ying1123 in https://github.com/sgl-project/sglang/pull/1014
* misc: update test config by zhyncs in https://github.com/sgl-project/sglang/pull/990
* fix: force max new tokens to be 1 for embedding request by Ying1123 in https://github.com/sgl-project/sglang/pull/1019
* Clean up unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1020
* Fix `input_ids` && rename to `fill_ids` by hnyls2002 in https://github.com/sgl-project/sglang/pull/1021
* feat: use FlashInfer rmsnorm and silu by zhyncs in https://github.com/sgl-project/sglang/pull/907
* misc: update issue template by zhyncs in https://github.com/sgl-project/sglang/pull/1024
* Clean up readme and arguments of chunked prefill by merrymercy in https://github.com/sgl-project/sglang/pull/1022
* Fix wrong assert by hnyls2002 in https://github.com/sgl-project/sglang/pull/1028
* Improve type annotation by merrymercy in https://github.com/sgl-project/sglang/pull/1029
* hotfix: add CustomOp abstraction by zhyncs in https://github.com/sgl-project/sglang/pull/1027
* Fix the case where r.prefix_indices is None by merrymercy in https://github.com/sgl-project/sglang/pull/1031
* Fix triton args init by hnyls2002 in https://github.com/sgl-project/sglang/pull/1034
* Fix the case when max_new_tokens is too large by merrymercy in https://github.com/sgl-project/sglang/pull/1025
* Test the case when max_new_tokens is very large by merrymercy in https://github.com/sgl-project/sglang/pull/1038
* Fix the prefix indices by hnyls2002 in https://github.com/sgl-project/sglang/pull/1037
* Improve end-to-end throughput test and its coverage by merrymercy in https://github.com/sgl-project/sglang/pull/1039
* Delete the useless test/srt/test_throughput.py by merrymercy in https://github.com/sgl-project/sglang/pull/1045
* minor: some potential bugs by hnyls2002 in https://github.com/sgl-project/sglang/pull/1044
* Clean up the comments and names under python/sglang/srt/layers by merrymercy in https://github.com/sgl-project/sglang/pull/1047
* fix: Fix returned prefill logits and add output str test by Ying1123 in https://github.com/sgl-project/sglang/pull/1046
* feat: update Dockerfile by zhyncs in https://github.com/sgl-project/sglang/pull/1033
* docs: update setup github runner by zhyncs in https://github.com/sgl-project/sglang/pull/1050
* Add longer accuracy test on CI by merrymercy in https://github.com/sgl-project/sglang/pull/1049
* Fix accuracy test by merrymercy in https://github.com/sgl-project/sglang/pull/1051
* Re-organize CI tests by merrymercy in https://github.com/sgl-project/sglang/pull/1052
* chore: bump v0.2.12 by zhyncs in https://github.com/sgl-project/sglang/pull/1048
* feat: replace all rmsnorm and silu by zhyncs in https://github.com/sgl-project/sglang/pull/1057
* fix: not use the default port by zhyncs in https://github.com/sgl-project/sglang/pull/1068
* Fix layernorm input shape by ispobock in https://github.com/sgl-project/sglang/pull/1066
* fix: temporary solution for DeepSeek V2 H100 layout conversion issue by zhyncs in https://github.com/sgl-project/sglang/pull/1060
* ci: add cancel pr workflow by zhyncs in https://github.com/sgl-project/sglang/pull/1070
* ci: add moe test by zhyncs in https://github.com/sgl-project/sglang/pull/1053
* fix: use devel for Triton's compiler requirements by zhyncs in https://github.com/sgl-project/sglang/pull/1074
* ci: add accuracy timeout by zhyncs in https://github.com/sgl-project/sglang/pull/1078
* Fix create_abort_task, GenerateReqInput does not have rids. by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1079
* Example file for docker compose and k8s by LucienShui in https://github.com/sgl-project/sglang/pull/1006
* Update the mixtral to use the better FusedMoE layer by merrymercy in https://github.com/sgl-project/sglang/pull/1081
* [Feat] Add window attention for gemma-2 by Ying1123 in https://github.com/sgl-project/sglang/pull/1056
* Fix jump forward final state circular path bug. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1084
* ci: update timeout and retry by zhyncs in https://github.com/sgl-project/sglang/pull/1086
* [Feature] modify Runtime to support skip_tokenizer_init by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1088
* Fix a bug in cuda graph runner by merrymercy in https://github.com/sgl-project/sglang/pull/1094
* ci: remove workflow path trigger by zhyncs in https://github.com/sgl-project/sglang/pull/1096
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/1098
* Update grok 1 model by merrymercy in https://github.com/sgl-project/sglang/pull/1095
* docs: update pr template by zhyncs in https://github.com/sgl-project/sglang/pull/1099
* Use `dtype` to control generate by hnyls2002 in https://github.com/sgl-project/sglang/pull/1082
* [Fix] Compatibility of window attention and cuda graph by Ying1123 in https://github.com/sgl-project/sglang/pull/1090
* docs: update nsys usage by zhyncs in https://github.com/sgl-project/sglang/pull/1103
* Support `stop_token_ids` in sglang API by hnyls2002 in https://github.com/sgl-project/sglang/pull/1092
* Support jinja as chat template file by Ying1123 in https://github.com/sgl-project/sglang/pull/1104
* Use a single workspace for flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/1077
* [Fix] fix the typo bug for window attention by Ying1123 in https://github.com/sgl-project/sglang/pull/1106
* Enable chunked prefill by default by merrymercy in https://github.com/sgl-project/sglang/pull/1040
* [Fix] fix flashinfer usage for window attention by Ying1123 in https://github.com/sgl-project/sglang/pull/1107
* misc: rm unused model_loader by zhyncs in https://github.com/sgl-project/sglang/pull/1110
* [Fix] Window attention compatible with RadixAttention and chunked prefill by Ying1123 in https://github.com/sgl-project/sglang/pull/1112
* set CUDA_DEVICE_MAX_CONNECTIONS=1 by merrymercy in https://github.com/sgl-project/sglang/pull/1113
* chore: bump v0.2.13 by zhyncs in https://github.com/sgl-project/sglang/pull/1111

New Contributors
* min-xu-et made their first contribution in https://github.com/sgl-project/sglang/pull/896
* mpjlu made their first contribution in https://github.com/sgl-project/sglang/pull/957
* xiezhq-hermann made their first contribution in https://github.com/sgl-project/sglang/pull/969
* foszto made their first contribution in https://github.com/sgl-project/sglang/pull/971
* vhain made their first contribution in https://github.com/sgl-project/sglang/pull/973
* liuyhwangyh made their first contribution in https://github.com/sgl-project/sglang/pull/994
* ywang96 made their first contribution in https://github.com/sgl-project/sglang/pull/1005
* gryffindor-rr made their first contribution in https://github.com/sgl-project/sglang/pull/959
* LucienShui made their first contribution in https://github.com/sgl-project/sglang/pull/1006

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.2.9...v0.2.13

0.2.9

Highlights
- **New feature**: Chunked prefill (#800, #811)
- **New models**: Deepseek v2
- **Performance improvement**: vectorized logprob computation
- **Accuracy fix**: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels
- **Feature fix**: fixed many missing logprob-related features in the OpenAI API server
- **CI/CD infra** is now fully ready. The tests cover the frontend, the backend, accuracy, and performance.
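
The idea behind chunked prefill, listed in the highlights above, is to split a long prompt into bounded pieces so that no single forward pass has to process the whole prompt at once. The sketch below shows only the splitting step. It is not SGLang's scheduler: the function name `iter_prefill_chunks` and the `chunk_size` parameter are hypothetical and chosen for illustration.

```python
from typing import Iterator, List


def iter_prefill_chunks(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `token_ids`, each at most `chunk_size` long.

    Hypothetical helper for illustration only; not part of the SGLang API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        # Each chunk would be prefilled in its own step, extending the
        # KV cache built by the chunks before it.
        yield token_ids[start:start + chunk_size]


prompt = list(range(10))
chunks = list(iter_prefill_chunks(prompt, chunk_size=4))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last chunk may be shorter than `chunk_size`, and concatenating the chunks always reproduces the original prompt.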


What's Changed
* Deepseek v2 support by hnyls2002 in https://github.com/sgl-project/sglang/pull/693
* Fix context length by hnyls2002 in https://github.com/sgl-project/sglang/pull/757
* docs: update model support by zhyncs in https://github.com/sgl-project/sglang/pull/760
* fix: not run workflows on fork repo by zhyncs in https://github.com/sgl-project/sglang/pull/762
* Update supported models by hnyls2002 in https://github.com/sgl-project/sglang/pull/763
* Fix TransformerTokenizer init for chatglm2 & 3 by ispobock in https://github.com/sgl-project/sglang/pull/761
* [Minor] Improve the code style in TokenizerManager by merrymercy in https://github.com/sgl-project/sglang/pull/767
* Update readme by Ying1123 in https://github.com/sgl-project/sglang/pull/769
* feat: add fake tag by zhyncs in https://github.com/sgl-project/sglang/pull/770
* Fix max_tokens for OpenAI chat completion API by merrymercy in https://github.com/sgl-project/sglang/pull/766
* Fix max new tokens by merrymercy in https://github.com/sgl-project/sglang/pull/772
* Move sampling logits to float32 by merrymercy in https://github.com/sgl-project/sglang/pull/773
* minor refactor: move check server args to server_args.py by wisclmy0611 in https://github.com/sgl-project/sglang/pull/774
* Fix return_log_probs with cuda graph by merrymercy in https://github.com/sgl-project/sglang/pull/775
* Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by merrymercy in https://github.com/sgl-project/sglang/pull/776
* Allow disabling flashinfer sampling kernel by merrymercy in https://github.com/sgl-project/sglang/pull/778
* Bump version to 0.2.6 by merrymercy in https://github.com/sgl-project/sglang/pull/779
* fix: replace pillow with PIL in PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/781
* docs: init readthedocs support by zhyncs in https://github.com/sgl-project/sglang/pull/783
* fix: init readthedocs support by zhyncs in https://github.com/sgl-project/sglang/pull/784
* fix: exclude logo png in gitignore by zhyncs in https://github.com/sgl-project/sglang/pull/785
* docs: update index by zhyncs in https://github.com/sgl-project/sglang/pull/786
* Vectorize logprobs computation by Ying1123 in https://github.com/sgl-project/sglang/pull/787
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/788
* docs: make badges center by zhyncs in https://github.com/sgl-project/sglang/pull/789
* chore: add copyright for srt by zhyncs in https://github.com/sgl-project/sglang/pull/790
* Fix echo + logprob for OpenAI API when the prompt is a list by Ying1123 in https://github.com/sgl-project/sglang/pull/791
* Update README.md by Ying1123 in https://github.com/sgl-project/sglang/pull/792
* Lazy-import third-party backends by bgyoon in https://github.com/sgl-project/sglang/pull/794
* Fix lazy import location by Ying1123 in https://github.com/sgl-project/sglang/pull/795
* Fix logging by Ying1123 in https://github.com/sgl-project/sglang/pull/796
* Add role documentation, add system begin & end tokens by objnf-dev in https://github.com/sgl-project/sglang/pull/793
* Chunked prefill support by hnyls2002 in https://github.com/sgl-project/sglang/pull/797
* Revert "Chunked prefill support" by Ying1123 in https://github.com/sgl-project/sglang/pull/799
* Chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/800
* fix: update flashinfer to 0.1.2 to fix sampling for cu118 by zhyncs in https://github.com/sgl-project/sglang/pull/803
* Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by Ying1123 in https://github.com/sgl-project/sglang/pull/805
* feat: add chat template for internlm2-chat by zhyncs in https://github.com/sgl-project/sglang/pull/802
* Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by Ying1123 in https://github.com/sgl-project/sglang/pull/806
* Add support for OpenAI API : offline batch(file) processing by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/699
* Organize public APIs by hnyls2002 in https://github.com/sgl-project/sglang/pull/809
* Remove inf value for chunked prefill size by hnyls2002 in https://github.com/sgl-project/sglang/pull/812
* Revert "Organize public APIs" by Ying1123 in https://github.com/sgl-project/sglang/pull/815
* fix: use v0.2.5 for benchmark by zhyncs in https://github.com/sgl-project/sglang/pull/814
* Fix LiteLLM kwargs by qeternity in https://github.com/sgl-project/sglang/pull/817
* Code structure refactor by hnyls2002 in https://github.com/sgl-project/sglang/pull/807
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/819
* Fix streaming bug by objnf-dev in https://github.com/sgl-project/sglang/pull/820
* feat: add runner by zhyncs in https://github.com/sgl-project/sglang/pull/821
* feat: add pr e2e test by zhyncs in https://github.com/sgl-project/sglang/pull/822
* Support disable_ignore_eos in bench_serving.py by Ying1123 in https://github.com/sgl-project/sglang/pull/824
* Adjust default mem fraction to avoid OOM by Ying1123 in https://github.com/sgl-project/sglang/pull/823
* Add awq_marlin by Ying1123 in https://github.com/sgl-project/sglang/pull/826
* misc: update e2e test benchmark config by zhyncs in https://github.com/sgl-project/sglang/pull/825
* misc: enable e2e test when push by zhyncs in https://github.com/sgl-project/sglang/pull/828
* docs: add set up runner by zhyncs in https://github.com/sgl-project/sglang/pull/829
* chore: bump v0.2.7 by zhyncs in https://github.com/sgl-project/sglang/pull/830
* Add `--max-total-tokens` by hnyls2002 in https://github.com/sgl-project/sglang/pull/840
* Fix List input bug by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/838
* Add req slots leaking check by hnyls2002 in https://github.com/sgl-project/sglang/pull/842
* docs: update README.md by eltociear in https://github.com/sgl-project/sglang/pull/843
* misc: update e2e test paths config by zhyncs in https://github.com/sgl-project/sglang/pull/848
* chore: update flashinfer to v0.1.3 by zhyncs in https://github.com/sgl-project/sglang/pull/850
* Fix llama for classification by Ying1123 in https://github.com/sgl-project/sglang/pull/855
* Add troubleshooting doc by Ying1123 in https://github.com/sgl-project/sglang/pull/856
* Fix #857 by kaifronsdal in https://github.com/sgl-project/sglang/pull/858
* Add support for logprobs in OpenAI chat API by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/852
* Support chunked prefill when radix cache is disabled by hnyls2002 in https://github.com/sgl-project/sglang/pull/811
* misc: update e2e test paths config by zhyncs in https://github.com/sgl-project/sglang/pull/860
* Rename github workflows by Ying1123 in https://github.com/sgl-project/sglang/pull/861
* misc: disable auto release by zhyncs in https://github.com/sgl-project/sglang/pull/862
* misc: add cancel previous at e2e by zhyncs in https://github.com/sgl-project/sglang/pull/864
* Add OpenAI backend to the CI test by Ying1123 in https://github.com/sgl-project/sglang/pull/869
* Fix openai CI tests by Ying1123 in https://github.com/sgl-project/sglang/pull/870
* misc: use pip cache purge and add unit test ci by zhyncs in https://github.com/sgl-project/sglang/pull/871
* misc: update unit test config by zhyncs in https://github.com/sgl-project/sglang/pull/873
* Fix unit tests for the frontend language part by Ying1123 in https://github.com/sgl-project/sglang/pull/872
* bump to 0.2.8 by Ying1123 in https://github.com/sgl-project/sglang/pull/877
* Make scripts under `/test/srt` as unit tests by Ying1123 in https://github.com/sgl-project/sglang/pull/875
* Update runner docs by hnyls2002 in https://github.com/sgl-project/sglang/pull/876
* Improve the coverage of the openai api server test by Ying1123 in https://github.com/sgl-project/sglang/pull/878
* Implement served_model_name to customize model id when use local mode… by dionren in https://github.com/sgl-project/sglang/pull/749
* Update runner docs by hnyls2002 in https://github.com/sgl-project/sglang/pull/879
* Add more unit tests to CI by Ying1123 in https://github.com/sgl-project/sglang/pull/880
* Add accuracy test to CI: MMLU by Ying1123 in https://github.com/sgl-project/sglang/pull/882
* Update workflow name by Ying1123 in https://github.com/sgl-project/sglang/pull/883
* Fix the double BOS problem in the HF chat template by Ying1123 in https://github.com/sgl-project/sglang/pull/888
* Add benchmark: HumanEval by Ying1123 in https://github.com/sgl-project/sglang/pull/889
* Increase openai client limit by Ying1123 in https://github.com/sgl-project/sglang/pull/886
* Bump version to v0.2.9 by Ying1123 in https://github.com/sgl-project/sglang/pull/890

New Contributors
* bgyoon made their first contribution in https://github.com/sgl-project/sglang/pull/794
* objnf-dev made their first contribution in https://github.com/sgl-project/sglang/pull/793
* kaifronsdal made their first contribution in https://github.com/sgl-project/sglang/pull/858
* dionren made their first contribution in https://github.com/sgl-project/sglang/pull/749

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.2.5...v0.2.9

0.2.5

Highlights

- We recently published a [blog post](https://lmsys.org/blog/2024-07-25-sglang-llama3/). Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B on A100 and H100 GPUs, using FP8 and FP16. **SGLang consistently outperforms vLLM**, achieving up to **3.1x** higher throughput on Llama-70B. It also often matches and sometimes outperforms TensorRT-LLM.

- We have now automated the release processes for [PyPI](https://pypi.org/project/sglang/), [Docker](https://hub.docker.com/r/lmsysorg/sglang/tags), and [Release](https://github.com/sgl-project/sglang/releases) using GitHub workflows. Previously, because Release was not automated, GitHub Tags were not updated in time, leading to a jump from [v0.2.0](https://github.com/sgl-project/sglang/releases/tag/v0.2.0) directly to [v0.2.5](https://github.com/sgl-project/sglang/releases/tag/v0.2.5).

- Everyone is welcome to try https://github.com/sgl-project/sglang and to take an active part in the community through issues, PRs, and discussions. Cheers!

0.2.0

Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo

What's Changed
* Optimize mem indices management by hnyls2002 in https://github.com/sgl-project/sglang/pull/619
* Unify index operations by hnyls2002 in https://github.com/sgl-project/sglang/pull/620
* Simplify mem state by wisclmy0611 in https://github.com/sgl-project/sglang/pull/623
* Improve tensor parallel performance by Ying1123 in https://github.com/sgl-project/sglang/pull/625
* Bump version to 0.1.21 by Ying1123 in https://github.com/sgl-project/sglang/pull/626
* Fix model forward grad by hnyls2002 in https://github.com/sgl-project/sglang/pull/628
* Update docker file by Ying1123 in https://github.com/sgl-project/sglang/pull/629
* Disable NCCL_NVLS by default by Ying1123 in https://github.com/sgl-project/sglang/pull/631
* Add qwen2 tie word embedding by yileld in https://github.com/sgl-project/sglang/pull/630
* Add support for VertexAI safety settings by AidanCooper in https://github.com/sgl-project/sglang/pull/624
* Fix vertexai by hnyls2002 in https://github.com/sgl-project/sglang/pull/633
* Reduce docker size by hnyls2002 in https://github.com/sgl-project/sglang/pull/632
* clean up step function by Ying1123 in https://github.com/sgl-project/sglang/pull/635
* feat: support internlm2 by zhyncs in https://github.com/sgl-project/sglang/pull/636
* misc: add pre-commit config by zhyncs in https://github.com/sgl-project/sglang/pull/637
* misc: add issue and pr template by zhyncs in https://github.com/sgl-project/sglang/pull/638
* Flashinfer sample kernel by hnyls2002 in https://github.com/sgl-project/sglang/pull/617
* Move `global_server_args_dict` by hnyls2002 in https://github.com/sgl-project/sglang/pull/642
* Increase the capacity of the memory pool by Ying1123 in https://github.com/sgl-project/sglang/pull/643
* feat: add check_env by zhyncs in https://github.com/sgl-project/sglang/pull/645
* Remove the dependency of rpyc by wisclmy0611 in https://github.com/sgl-project/sglang/pull/646
* misc: rm rpyc from PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/649
* fix: set ulimit -n 65535 by zhyncs in https://github.com/sgl-project/sglang/pull/647
* feat: add lint workflow by zhyncs in https://github.com/sgl-project/sglang/pull/648
* fix: resolve lint error by zhyncs in https://github.com/sgl-project/sglang/pull/650
* Remove useless variables in infer_batch.py by Ying1123 in https://github.com/sgl-project/sglang/pull/651
* Detokenize incrementally when streaming by hnyls2002 in https://github.com/sgl-project/sglang/pull/653
* `TokenizerManager.context_len` should inherit from `server_args.conte… by shrirajh in https://github.com/sgl-project/sglang/pull/654
* Remove cached triton launcher by merrymercy in https://github.com/sgl-project/sglang/pull/656
* perf: reduce ttft and itl with stream_interval 1 by zhyncs in https://github.com/sgl-project/sglang/pull/658
* feat: add benchmark serving by zhyncs in https://github.com/sgl-project/sglang/pull/657
* refactor model loader [unreachable code]: initial refactor by Ying1123 in https://github.com/sgl-project/sglang/pull/655
* misc: update SGLang package description by zhyncs in https://github.com/sgl-project/sglang/pull/659
* Update Readme by Ying1123 in https://github.com/sgl-project/sglang/pull/660
* feat: update check env by zhyncs in https://github.com/sgl-project/sglang/pull/661
* Improve docs by Ying1123 in https://github.com/sgl-project/sglang/pull/662
* Add benchmark instructions by Ying1123 in https://github.com/sgl-project/sglang/pull/663
* Fix jump forward when streaming by hnyls2002 in https://github.com/sgl-project/sglang/pull/665
* Fix kill process util by ispobock in https://github.com/sgl-project/sglang/pull/666
* Add support for OpenAI API parallel sampling by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/640
* Update OpenAI API by wisclmy0611 in https://github.com/sgl-project/sglang/pull/667
* Temporary fix invalid sample results by hnyls2002 in https://github.com/sgl-project/sglang/pull/668
* Support random dataset in bench_serving.py by merrymercy in https://github.com/sgl-project/sglang/pull/669
* Revert "Temporary fix invalid sample results" by hnyls2002 in https://github.com/sgl-project/sglang/pull/673
* refactor model loader: initial refactor by Ying1123 in https://github.com/sgl-project/sglang/pull/664
* Fix cuda graph with flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/675
* Tmp fix illegal sample by hnyls2002 in https://github.com/sgl-project/sglang/pull/676
* Update version to 0.1.22 by Ying1123 in https://github.com/sgl-project/sglang/pull/677
* Fallback when sampling failed by ispobock in https://github.com/sgl-project/sglang/pull/678
* feat: support TRT LLM benchmark and multiple benchmarks by zhyncs in https://github.com/sgl-project/sglang/pull/670
* Decouple kv by hnyls2002 in https://github.com/sgl-project/sglang/pull/679
* Support gpt-bigcode model class by hnyls2002 in https://github.com/sgl-project/sglang/pull/681
* support non-streaming benchmark by merrymercy in https://github.com/sgl-project/sglang/pull/682
* Fix StreamExecutor.fork() losing the current role start index. by max99x in https://github.com/sgl-project/sglang/pull/684
* feat: update bench serving by zhyncs in https://github.com/sgl-project/sglang/pull/685
* misc: update output file logic by zhyncs in https://github.com/sgl-project/sglang/pull/686
* Allow disabling streaming in bench by merrymercy in https://github.com/sgl-project/sglang/pull/687
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/688
* Support Deepseek MoE Model by hnyls2002 in https://github.com/sgl-project/sglang/pull/689
* misc: recommend to use chat model for benchmark by zhyncs in https://github.com/sgl-project/sglang/pull/690
* Support Mistral-Nemo by ispobock in https://github.com/sgl-project/sglang/pull/691
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/692
* fix: update bench serving by zhyncs in https://github.com/sgl-project/sglang/pull/694
* misc: update output token logic by zhyncs in https://github.com/sgl-project/sglang/pull/695
* Tune params by Ying1123 in https://github.com/sgl-project/sglang/pull/696
* Fix trt benchmark by Ying1123 in https://github.com/sgl-project/sglang/pull/697
* misc: fix typo by zhyncs in https://github.com/sgl-project/sglang/pull/698
* Fix flashinfer by Ying1123 in https://github.com/sgl-project/sglang/pull/700
* Fix hf config loading by ispobock in https://github.com/sgl-project/sglang/pull/702
* Use min new token ratio at start by hnyls2002 in https://github.com/sgl-project/sglang/pull/701
* feat: add e2e latency by zhyncs in https://github.com/sgl-project/sglang/pull/704
* Update vllm version to support llama3.1 by Ying1123 in https://github.com/sgl-project/sglang/pull/705
* bump version to 0.1.23 by Ying1123 in https://github.com/sgl-project/sglang/pull/706
* Reduce hardcoded logic of kernel usage by wisclmy0611 in https://github.com/sgl-project/sglang/pull/707
* Fix multi-node deadlock by merrymercy in https://github.com/sgl-project/sglang/pull/709
* Auto adjust new ratio by hnyls2002 in https://github.com/sgl-project/sglang/pull/708
* Fix prefill size by Ying1123 in https://github.com/sgl-project/sglang/pull/711
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/712
* docs: update doc by zhyncs in https://github.com/sgl-project/sglang/pull/713
* fix: llama 3.1 405b fp8 by zhyncs in https://github.com/sgl-project/sglang/pull/714
* misc: update doc by zhyncs in https://github.com/sgl-project/sglang/pull/715
* Improve benchmark scripts by Ying1123 in https://github.com/sgl-project/sglang/pull/717
* Bump version to 0.1.24 by Ying1123 in https://github.com/sgl-project/sglang/pull/718
* docs: update supported models by zhyncs in https://github.com/sgl-project/sglang/pull/719
* docs: update comment by zhyncs in https://github.com/sgl-project/sglang/pull/721
* chore: add close inactive issues workflow by zhyncs in https://github.com/sgl-project/sglang/pull/722
* misc: update build instruction by zhyncs in https://github.com/sgl-project/sglang/pull/724
* fix: fp8 config by Ying1123 in https://github.com/sgl-project/sglang/pull/723
* Fix dockerfile and triton cache manager by hnyls2002 in https://github.com/sgl-project/sglang/pull/720
* chore: bump v0.1.25 by zhyncs in https://github.com/sgl-project/sglang/pull/725
* fix: resolve the logo display issue on the PyPI page by zhyncs in https://github.com/sgl-project/sglang/pull/726
* misc: update bug issue template by zhyncs in https://github.com/sgl-project/sglang/pull/727
* Revert "fix: fp8 config" by Ying1123 in https://github.com/sgl-project/sglang/pull/728
* Fix bugs (fp8 checkpoints, triton cache manager) by Ying1123 in https://github.com/sgl-project/sglang/pull/729
* Bump version to 0.2.0 by Ying1123 in https://github.com/sgl-project/sglang/pull/730

New Contributors
* yileld made their first contribution in https://github.com/sgl-project/sglang/pull/630
* AidanCooper made their first contribution in https://github.com/sgl-project/sglang/pull/624
* zhyncs made their first contribution in https://github.com/sgl-project/sglang/pull/636
* shrirajh made their first contribution in https://github.com/sgl-project/sglang/pull/654
* yichuan520030910320 made their first contribution in https://github.com/sgl-project/sglang/pull/640
* max99x made their first contribution in https://github.com/sgl-project/sglang/pull/684

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.1.20...v0.2.0

0.1.20

Highlights
* Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (612)
* Model support: Gemma2, minicpm, Qwen2 MoE
* Docker support (217)
* Various latency optimizations

What's Changed
* Add docker file by Ying1123 in https://github.com/sgl-project/sglang/pull/588
* Add Gemma2 by Ying1123 in https://github.com/sgl-project/sglang/pull/592
* Format by Ying1123 in https://github.com/sgl-project/sglang/pull/593
* Fix Llava model by wisclmy0611 in https://github.com/sgl-project/sglang/pull/594
* fix(detokenizer_manager.py): fix truncated decoded output by Titan-p in https://github.com/sgl-project/sglang/pull/586
* Add `--enable-p2p-check` option by hnyls2002 in https://github.com/sgl-project/sglang/pull/599
* Fix streaming by hnyls2002 in https://github.com/sgl-project/sglang/pull/600
* Reduce number of workspaces for flashinfer by wisclmy0611 in https://github.com/sgl-project/sglang/pull/601
* add `LogitsMetadata` by hnyls2002 in https://github.com/sgl-project/sglang/pull/604
* add minicpm support by Titan-p in https://github.com/sgl-project/sglang/pull/602
* Make sglang compat with vllm 0.5.1 by M0gician in https://github.com/sgl-project/sglang/pull/598
* Add Qwen2 MoE support by M0gician in https://github.com/sgl-project/sglang/pull/603
* Update chat template for qwen and yi-1.5. by for-just-we in https://github.com/sgl-project/sglang/pull/530
* [Feat] Expose logprob options to `sgl.gen` API by huyiwen in https://github.com/sgl-project/sglang/pull/503
* Fix bench latency by merrymercy in https://github.com/sgl-project/sglang/pull/607
* Code clean up: Remove deprecated prefill; move InputMetadata to infer_batch.py by merrymercy in https://github.com/sgl-project/sglang/pull/609
* Clean up the usage of flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/610
* Cleanup attention backend: flashinfer and triton by merrymercy in https://github.com/sgl-project/sglang/pull/611
* Enable cuda graph by default by merrymercy in https://github.com/sgl-project/sglang/pull/612
* Improve benchmark scripts & fix llava by merrymercy in https://github.com/sgl-project/sglang/pull/613
* Memorypool chunked prefetch by hnyls2002 in https://github.com/sgl-project/sglang/pull/614
* Improve benchmark scripts by merrymercy in https://github.com/sgl-project/sglang/pull/615
* Fix memory pool index error by Ying1123 in https://github.com/sgl-project/sglang/pull/616
* Bump version to 0.1.20 by merrymercy in https://github.com/sgl-project/sglang/pull/618

New Contributors
* wisclmy0611 made their first contribution in https://github.com/sgl-project/sglang/pull/594
* Titan-p made their first contribution in https://github.com/sgl-project/sglang/pull/586
* M0gician made their first contribution in https://github.com/sgl-project/sglang/pull/598
* for-just-we made their first contribution in https://github.com/sgl-project/sglang/pull/530

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.1.18...v0.1.20

0.1.18

Highlights
* 2x large batch prefill improvement with the new flashinfer kernels (579)
* Multi-node tensor parallelism (550)
* New model support: ChatGLM (516)

What's Changed
* Fix missing numpy dependency in pyproject.toml by fpreiss in https://github.com/sgl-project/sglang/pull/524
* Fix RAG nb, parea setup (parea -> parea-ai) by fpreiss in https://github.com/sgl-project/sglang/pull/525
* [Minor] Correct Optional type hints in api by fpreiss in https://github.com/sgl-project/sglang/pull/526
* Add ChatGLM Model Support by Qubitium in https://github.com/sgl-project/sglang/pull/516
* Fix Regression: Disable p2p for 4090 by ZX-ModelCloud in https://github.com/sgl-project/sglang/pull/531
* Decode Incrementally by hnyls2002 in https://github.com/sgl-project/sglang/pull/517
* Fix dependency by merrymercy in https://github.com/sgl-project/sglang/pull/538
* Fix dependency & crash issues by Ying1123 in https://github.com/sgl-project/sglang/pull/539
* Higher priority for user input of max_prefill_tokens & format by Ying1123 in https://github.com/sgl-project/sglang/pull/540
* Add disk cache for loading ShareGPT dataset. by hnyls2002 in https://github.com/sgl-project/sglang/pull/542
* Fix tp worker only checking req[0] for stream by Qubitium in https://github.com/sgl-project/sglang/pull/546
* Fix the Jump-Forward with Chinese by hnyls2002 in https://github.com/sgl-project/sglang/pull/551
* Update fused_moe by merrymercy in https://github.com/sgl-project/sglang/pull/553
* Multi-node Tensor Parallelism by Ying1123 in https://github.com/sgl-project/sglang/pull/550
* Update flashinfer to 0.0.5 by merrymercy in https://github.com/sgl-project/sglang/pull/554
* Follow-up fixes for flashinfer 0.0.5 by merrymercy in https://github.com/sgl-project/sglang/pull/556
* Fix latency benchmark by hnyls2002 in https://github.com/sgl-project/sglang/pull/557
* Clean up logits processor by merrymercy in https://github.com/sgl-project/sglang/pull/558
* Update test_flashinfer by hnyls2002 in https://github.com/sgl-project/sglang/pull/560
* Allow running with vllm==0.4.3 by merrymercy in https://github.com/sgl-project/sglang/pull/561
* Add a new argument log_level_http to control the HTTP logging by merrymercy in https://github.com/sgl-project/sglang/pull/563
* Add sglang.bench_latency for offline benchmark by merrymercy in https://github.com/sgl-project/sglang/pull/564
* Warmup cublas by merrymercy in https://github.com/sgl-project/sglang/pull/566
* Increase the thread limit for tp worker managers. by merrymercy in https://github.com/sgl-project/sglang/pull/567
* Update readme by merrymercy in https://github.com/sgl-project/sglang/pull/568
* Expose dtype argument by merrymercy in https://github.com/sgl-project/sglang/pull/569
* Update benchmark script by Ying1123 in https://github.com/sgl-project/sglang/pull/571
* Minor fix in compiler & format by ZackZeng999 in https://github.com/sgl-project/sglang/pull/545
* Update run_batch interface and max_prefill_tokens by Ying1123 in https://github.com/sgl-project/sglang/pull/574
* Fix flashinfer version by PanJason in https://github.com/sgl-project/sglang/pull/576
* [BugFix] gemma loading weights "lm_head.weight" key error by dhgarcia in https://github.com/sgl-project/sglang/pull/577
* Turn on flashinfer by default by Ying1123 in https://github.com/sgl-project/sglang/pull/578
* fix the broken server args by hnyls2002 in https://github.com/sgl-project/sglang/pull/585
* 2x performance improvement for large prefill & Fix workspace conflicts by Ying1123 in https://github.com/sgl-project/sglang/pull/579

New Contributors
* fpreiss made their first contribution in https://github.com/sgl-project/sglang/pull/524
* ZackZeng999 made their first contribution in https://github.com/sgl-project/sglang/pull/545
* PanJason made their first contribution in https://github.com/sgl-project/sglang/pull/576
* dhgarcia made their first contribution in https://github.com/sgl-project/sglang/pull/577

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.1.17...v0.1.18
