Highlights
* **New Features**: Support window attention for Gemma-2 (#1056, #1090, #1112), enable chunked prefill by default (#1040, #984), and support all sampling penalties (#973).
* **New Models**: Support the e5-mistral embedding model (#983, #987, #988, #997, #1014) and a comprehensive OpenAI-compatible API.
* **Performance**: Accelerate Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek-V2 (#905).
* **More CI Tests**: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
* **Refactor and fix**: More modular code, better stability, and more kernels from FlashInfer (#907)
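
For readers unfamiliar with the sampling penalties mentioned above, here is a minimal sketch of how frequency, presence, and repetition penalties are commonly defined. It is not SGLang's implementation: the function name `apply_penalties`, the dict-based logits, and the order in which the penalties are applied are all assumptions made for illustration. The additive form follows the usual OpenAI-style definition, and the multiplicative repetition penalty follows the widely used CTRL-style rule.

```python
from collections import Counter
from typing import Dict, List


def apply_penalties(
    logits: Dict[int, float],
    generated: List[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> Dict[int, float]:
    """Return a penalized copy of `logits` (token id -> raw score).

    Hypothetical helper for illustration only; not taken from SGLang.
    """
    counts = Counter(generated)
    out = dict(logits)
    for token, count in counts.items():
        if token not in out:
            continue
        score = out[token]
        # Repetition penalty (multiplicative): shrink positive scores,
        # push negative scores further down.
        score = score / repetition_penalty if score > 0 else score * repetition_penalty
        # Frequency penalty grows with the number of occurrences;
        # presence penalty is a flat amount applied once per seen token.
        score -= frequency_penalty * count + presence_penalty
        out[token] = score
    return out


logits = {0: 2.0, 1: 1.0, 2: -1.0}
penalized = apply_penalties(
    logits,
    generated=[0, 0, 2],
    frequency_penalty=0.5,
    presence_penalty=0.25,
    repetition_penalty=2.0,
)
print(penalized)  # {0: -0.25, 1: 1.0, 2: -2.75}
```

Token 1 never appeared in the generated sequence, so its score is untouched. Tokens 0 and 2 are pushed down in proportion to how often they were already produced.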
What's Changed
* fix: set env in runner by zhyncs in https://github.com/sgl-project/sglang/pull/891
* docs: update setup runner by zhyncs in https://github.com/sgl-project/sglang/pull/884
* misc: update cuda graph capture exception log by zhyncs in https://github.com/sgl-project/sglang/pull/894
* chore: add multipart dep for fastapi by zhyncs in https://github.com/sgl-project/sglang/pull/895
* [minor] fixed code formatting doc by min-xu-et in https://github.com/sgl-project/sglang/pull/896
* Bump version to 0.2.9.post1 by Ying1123 in https://github.com/sgl-project/sglang/pull/899
* Update the base image of the docker by Ying1123 in https://github.com/sgl-project/sglang/pull/900
* Reorder CI unit tests. by hnyls2002 in https://github.com/sgl-project/sglang/pull/908
* fixed an error handling in bench_latency.py by min-xu-et in https://github.com/sgl-project/sglang/pull/904
* Add model accuracy test - step 1 by Ying1123 in https://github.com/sgl-project/sglang/pull/866
* latency test enhancement - part 1 by min-xu-et in https://github.com/sgl-project/sglang/pull/909
* Improve the structure of CI by Ying1123 in https://github.com/sgl-project/sglang/pull/911
* fix: use e2e and unit test only for original repo or pr by zhyncs in https://github.com/sgl-project/sglang/pull/912
* misc: add triton in check_env PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/914
* Support MLA for DeepSeek-V2 with Triton - step 1 by ispobock in https://github.com/sgl-project/sglang/pull/905
* enhance latency test - part 2 by min-xu-et in https://github.com/sgl-project/sglang/pull/915
* Make API Key OpenAI-compatible by Ying1123 in https://github.com/sgl-project/sglang/pull/917
* Update hyperparameter_tuning.md by Ying1123 in https://github.com/sgl-project/sglang/pull/918
* Fix CI && python3.8 compatible by hnyls2002 in https://github.com/sgl-project/sglang/pull/920
* Support more OpenAI API test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/916
* Bump version to 0.2.10 by Ying1123 in https://github.com/sgl-project/sglang/pull/923
* latency test enhancement - final part by min-xu-et in https://github.com/sgl-project/sglang/pull/921
* Test openai vision api by Ying1123 in https://github.com/sgl-project/sglang/pull/925
* Test regex in vision api by Ying1123 in https://github.com/sgl-project/sglang/pull/926
* Update README.md by Ying1123 in https://github.com/sgl-project/sglang/pull/927
* Fix prompt len in parallel sampling by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/928
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/935
* Remove leftover auth_token by AidanCooper in https://github.com/sgl-project/sglang/pull/934
* Feat: add alternative choices selection methods by AidanCooper in https://github.com/sgl-project/sglang/pull/835
* Fix union operator by ispobock in https://github.com/sgl-project/sglang/pull/940
* Support multiple args options by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/941
* Fix stuck in `get_new_prefill_batch` by hnyls2002 in https://github.com/sgl-project/sglang/pull/948
* Organize code (rename, movement) by hnyls2002 in https://github.com/sgl-project/sglang/pull/953
* fix nsys cannot profile cuda kernel by mpjlu in https://github.com/sgl-project/sglang/pull/957
* Add support for Batch API test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/936
* Show more error messages for warmup errors by Ying1123 in https://github.com/sgl-project/sglang/pull/932
* misc: update issue template by zhyncs in https://github.com/sgl-project/sglang/pull/963
* misc: simplify test by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/964
* misc: add compute capability in check_env by zhyncs in https://github.com/sgl-project/sglang/pull/965
* Make `req_pool_indices` on CPU by hnyls2002 in https://github.com/sgl-project/sglang/pull/960
* misc: fix the req_to_token member change by hnyls2002 in https://github.com/sgl-project/sglang/pull/967
* chore: update vllm to 0.5.4 by zhyncs in https://github.com/sgl-project/sglang/pull/966
* chore: bump v0.2.11 by zhyncs in https://github.com/sgl-project/sglang/pull/970
* Purge self-runner's pip cache weekly by hnyls2002 in https://github.com/sgl-project/sglang/pull/975
* Run purge-cache only in sgl-project by hnyls2002 in https://github.com/sgl-project/sglang/pull/976
* misc: correct the int data type for token ids and indices by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/969
* PrefillAdder abstraction by hnyls2002 in https://github.com/sgl-project/sglang/pull/968
* RadixCache method adjust by hnyls2002 in https://github.com/sgl-project/sglang/pull/977
* Adjust max prefix len by hnyls2002 in https://github.com/sgl-project/sglang/pull/980
* 590 Increase default , track changes in examples and documentation by foszto in https://github.com/sgl-project/sglang/pull/971
* [minor] Update type annotation in tokenizer_manager.py by Ying1123 in https://github.com/sgl-project/sglang/pull/982
* Fix chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/984
* Add llama embedding modules [unreachable code] - step 1/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/983
* Add io struct for embedding models [unreachable code] - step 2/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/987
* Adjust `InputMetadata` and `ScheduleBatch` by hnyls2002 in https://github.com/sgl-project/sglang/pull/981
* support more options about usage in stream mode by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/985
* Create contributor_guide.md by Ying1123 in https://github.com/sgl-project/sglang/pull/992
* feat: frequency, min_new_tokens, presence, and repetition penalties by vhain in https://github.com/sgl-project/sglang/pull/973
* Move torch.compile configs into cuda_graph_runner.py by Ying1123 in https://github.com/sgl-project/sglang/pull/993
* Add e5-mistral embedding model - step 3/3 by Ying1123 in https://github.com/sgl-project/sglang/pull/988
* test: negative value testing for frequency, presence penalizers by vhain in https://github.com/sgl-project/sglang/pull/995
* support models from www.modelscope.cn by liuyhwangyh in https://github.com/sgl-project/sglang/pull/994
* bugfix: penalizers to be merged before reqs by vhain in https://github.com/sgl-project/sglang/pull/1001
* fix: resolve correctness_test issue by zhyncs in https://github.com/sgl-project/sglang/pull/1002
* Minor bugfix on benchmark serving by ywang96 in https://github.com/sgl-project/sglang/pull/1005
* Add openai embedding API by Ying1123 in https://github.com/sgl-project/sglang/pull/997
* Add skip_tokenizer_init args. by gryffindor-rr in https://github.com/sgl-project/sglang/pull/959
* Fix benchmark latency by wisclmy0611 in https://github.com/sgl-project/sglang/pull/1007
* Some warnings to crash when CI by hnyls2002 in https://github.com/sgl-project/sglang/pull/1009
* Reduce the overhead when cache is disabled by hnyls2002 in https://github.com/sgl-project/sglang/pull/1010
* Support embedding input as a list by Ying1123 in https://github.com/sgl-project/sglang/pull/1014
* misc: update test config by zhyncs in https://github.com/sgl-project/sglang/pull/990
* fix: force max new tokens to be 1 for embedding request by Ying1123 in https://github.com/sgl-project/sglang/pull/1019
* Clean up unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1020
* Fix `input_ids` && rename to `fill_ids` by hnyls2002 in https://github.com/sgl-project/sglang/pull/1021
* feat: use FlashInfer rmsnorm and silu by zhyncs in https://github.com/sgl-project/sglang/pull/907
* misc: update issue template by zhyncs in https://github.com/sgl-project/sglang/pull/1024
* Clean up readme and arguments of chunked prefill by merrymercy in https://github.com/sgl-project/sglang/pull/1022
* Fix wrong assert by hnyls2002 in https://github.com/sgl-project/sglang/pull/1028
* Improve type annotation by merrymercy in https://github.com/sgl-project/sglang/pull/1029
* hotfix: add CustomOp abstraction by zhyncs in https://github.com/sgl-project/sglang/pull/1027
* Fix the case where r.prefix_indices is None by merrymercy in https://github.com/sgl-project/sglang/pull/1031
* Fix triton args init by hnyls2002 in https://github.com/sgl-project/sglang/pull/1034
* Fix the case when max_new_tokens is too large by merrymercy in https://github.com/sgl-project/sglang/pull/1025
* Test the case when max_new_tokens is very large by merrymercy in https://github.com/sgl-project/sglang/pull/1038
* Fix the prefix indices by hnyls2002 in https://github.com/sgl-project/sglang/pull/1037
* Improve end-to-end throughput test and its coverage by merrymercy in https://github.com/sgl-project/sglang/pull/1039
* Delete the useless test/srt/test_throughput.py by merrymercy in https://github.com/sgl-project/sglang/pull/1045
* minor: some potential bugs by hnyls2002 in https://github.com/sgl-project/sglang/pull/1044
* Clean up the comments and names under python/sglang/srt/layers by merrymercy in https://github.com/sgl-project/sglang/pull/1047
* fix: Fix returned prefill logits and add output str test by Ying1123 in https://github.com/sgl-project/sglang/pull/1046
* feat: update Dockerfile by zhyncs in https://github.com/sgl-project/sglang/pull/1033
* docs: update setup github runner by zhyncs in https://github.com/sgl-project/sglang/pull/1050
* Add longer accuracy test on CI by merrymercy in https://github.com/sgl-project/sglang/pull/1049
* Fix accuracy test by merrymercy in https://github.com/sgl-project/sglang/pull/1051
* Re-organize CI tests by merrymercy in https://github.com/sgl-project/sglang/pull/1052
* chore: bump v0.2.12 by zhyncs in https://github.com/sgl-project/sglang/pull/1048
* feat: replace all rmsnorm and silu by zhyncs in https://github.com/sgl-project/sglang/pull/1057
* fix: not use the default port by zhyncs in https://github.com/sgl-project/sglang/pull/1068
* Fix layernorm input shape by ispobock in https://github.com/sgl-project/sglang/pull/1066
* fix: temporary solution for DeepSeek V2 H100 layout conversion issue by zhyncs in https://github.com/sgl-project/sglang/pull/1060
* ci: add cancel pr workflow by zhyncs in https://github.com/sgl-project/sglang/pull/1070
* ci: add moe test by zhyncs in https://github.com/sgl-project/sglang/pull/1053
* fix: use devel for Triton's compiler requirements by zhyncs in https://github.com/sgl-project/sglang/pull/1074
* ci: add accuracy timeout by zhyncs in https://github.com/sgl-project/sglang/pull/1078
* Fix create_abort_task, GenerateReqInput does not have rids. by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1079
* Example file for docker compose and k8s by LucienShui in https://github.com/sgl-project/sglang/pull/1006
* Update the mixtral to use the better FusedMoE layer by merrymercy in https://github.com/sgl-project/sglang/pull/1081
* [Feat] Add window attention for gemma-2 by Ying1123 in https://github.com/sgl-project/sglang/pull/1056
* Fix jump forward final state circular path bug. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1084
* ci: update timeout and retry by zhyncs in https://github.com/sgl-project/sglang/pull/1086
* [Feature] modify Runtime to support skip_tokenizer_init by gryffindor-rr in https://github.com/sgl-project/sglang/pull/1088
* Fix a bug in cuda graph runner by merrymercy in https://github.com/sgl-project/sglang/pull/1094
* ci: remove workflow path trigger by zhyncs in https://github.com/sgl-project/sglang/pull/1096
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/1098
* Update grok 1 model by merrymercy in https://github.com/sgl-project/sglang/pull/1095
* docs: update pr template by zhyncs in https://github.com/sgl-project/sglang/pull/1099
* Use `dtype` to control generate by hnyls2002 in https://github.com/sgl-project/sglang/pull/1082
* [Fix] Compatibility of window attention and cuda graph by Ying1123 in https://github.com/sgl-project/sglang/pull/1090
* docs: update nsys usage by zhyncs in https://github.com/sgl-project/sglang/pull/1103
* Support `stop_token_ids` in sglang API by hnyls2002 in https://github.com/sgl-project/sglang/pull/1092
* Support jinja as chat template file by Ying1123 in https://github.com/sgl-project/sglang/pull/1104
* Use a single workspace for flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/1077
* [Fix] fix the typo bug for window attention by Ying1123 in https://github.com/sgl-project/sglang/pull/1106
* Enable chunked prefill by default by merrymercy in https://github.com/sgl-project/sglang/pull/1040
* [Fix] fix flashinfer usage for window attention by Ying1123 in https://github.com/sgl-project/sglang/pull/1107
* misc: rm unused model_loader by zhyncs in https://github.com/sgl-project/sglang/pull/1110
* [Fix] Window attention compatible with RadixAttention and chunked prefill by Ying1123 in https://github.com/sgl-project/sglang/pull/1112
* set CUDA_DEVICE_MAX_CONNECTIONS=1 by merrymercy in https://github.com/sgl-project/sglang/pull/1113
* chore: bump v0.2.13 by zhyncs in https://github.com/sgl-project/sglang/pull/1111
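
Several entries above concern window attention for Gemma-2. As a rough illustration of the idea, the sketch below builds a causal sliding-window mask in plain Python. It is not the kernel SGLang or FlashInfer uses. The function name is hypothetical, and the convention that the window counts the current token is an assumption (implementations differ by one here).

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumes the window includes the current token, so each query sees
    at most `window` keys: itself and the `window - 1` tokens before it.
    """
    return [
        [k <= q and q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Because each query attends to a bounded number of keys, the attention cost per token stops growing once the sequence is longer than the window. This bounded lookback is also why window attention needs care when combined with prefix caching and chunked prefill.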
New Contributors
* min-xu-et made their first contribution in https://github.com/sgl-project/sglang/pull/896
* mpjlu made their first contribution in https://github.com/sgl-project/sglang/pull/957
* xiezhq-hermann made their first contribution in https://github.com/sgl-project/sglang/pull/969
* foszto made their first contribution in https://github.com/sgl-project/sglang/pull/971
* vhain made their first contribution in https://github.com/sgl-project/sglang/pull/973
* liuyhwangyh made their first contribution in https://github.com/sgl-project/sglang/pull/994
* ywang96 made their first contribution in https://github.com/sgl-project/sglang/pull/1005
* gryffindor-rr made their first contribution in https://github.com/sgl-project/sglang/pull/959
* LucienShui made their first contribution in https://github.com/sgl-project/sglang/pull/1006
**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.2.9...v0.2.13