## Highlights
* Reduce CPU overhead by enabling the overlap scheduler by default, yielding **1.1x higher throughput** (#2105, #2067, #2095)
* Support data parallelism for attention and MLA, giving 1.5x higher decoding throughput (#1970, #2061)
* Add a cache-aware load balancer, achieving a 4x higher cache hit rate (#1934)
* Support the xgrammar backend for grammar-guided decoding (#2056); see the usage sketch after this list
* Support Prometheus metrics (#1853, #1981)
* Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
* Support graceful termination (#1838) and watchdog (#1816)
* Support notebook-style documentation (https://sgl-project.github.io/)
* Add an offline benchmark script (#1968)
* Fixes for bugs, deadlocks, NaN issues, and OOMs (#2083, #1850, #1800, #1779, #1789, #1858)
* New models: Phi-3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
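For readers who want to try several of the highlighted features together, here is a minimal sketch that sends a grammar-constrained request to a running server and then scrapes its metrics endpoint. It assumes the server was launched with data-parallel attention, the xgrammar backend, and metrics enabled; the flag names (`--enable-dp-attention`, `--dp-size`, `--grammar-backend`, `--enable-metrics`), the `json_schema` sampling parameter, and the `/generate` and `/metrics` paths are assumptions based on the items above, so check `python -m sglang.launch_server --help` for the exact spelling in your build.

```python
# Hedged sketch, not a definitive reference: the flag, parameter, and endpoint
# names below are assumptions taken from the release highlights; verify them
# against `python -m sglang.launch_server --help` for your installed version.
#
# Assumed launch command for the server this script talks to:
#   python -m sglang.launch_server \
#       --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --enable-dp-attention --dp-size 2 \
#       --grammar-backend xgrammar \
#       --enable-metrics
import json

import requests

BASE_URL = "http://127.0.0.1:30000"  # default sglang server address (assumed)

# Grammar-guided decoding: constrain the output to a JSON schema, which is
# assumed to be routed through the xgrammar backend when it is enabled.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "Return a small JSON object describing a person.",
        "sampling_params": {
            "max_new_tokens": 64,
            "json_schema": json.dumps(schema),  # assumed parameter name
        },
    },
)
print(resp.json())

# Prometheus metrics: with metrics enabled, the server is expected to expose a
# text-format scrape target (path assumed to be /metrics).
print(requests.get(f"{BASE_URL}/metrics").text[:500])
```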
## What's Changed
* Fix edge case for truncated by ByronHsu in https://github.com/sgl-project/sglang/pull/1747
* Fuse more ops & Simplify token mapping by merrymercy in https://github.com/sgl-project/sglang/pull/1758
* [API] add get memory pool size by Ying1123 in https://github.com/sgl-project/sglang/pull/1760
* Fix perf regression for set_kv_buffer by merrymercy in https://github.com/sgl-project/sglang/pull/1765
* [Fix] Fix abort in data parallelism by merrymercy in https://github.com/sgl-project/sglang/pull/1767
* Fix stop condition for <|eom_id|> by merrymercy in https://github.com/sgl-project/sglang/pull/1766
* Update docs by merrymercy in https://github.com/sgl-project/sglang/pull/1768
* Fix missing additional_stop_token_ids by merrymercy in https://github.com/sgl-project/sglang/pull/1769
* Fix out of memory message. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1771
* Crash the server on warnings in CI by merrymercy in https://github.com/sgl-project/sglang/pull/1772
* Fix the perf regression due to additional_stop_token_ids by merrymercy in https://github.com/sgl-project/sglang/pull/1773
* Fix MockTokenizer in the unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1774
* [Bug] Catch any errors caused by parsing json schema by zolinthecow in https://github.com/sgl-project/sglang/pull/1776
* [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/1779
* [Fix] Fix cuda graph padding for triton attention backend by merrymercy in https://github.com/sgl-project/sglang/pull/1782
* Check user-specified model_max_len with hf derived max_model_len by BBuf in https://github.com/sgl-project/sglang/pull/1778
* Re-introduce `get_cuda_graph_seq_len_fill_value` by merrymercy in https://github.com/sgl-project/sglang/pull/1783
* Enhance the test case for chunked prefill and check memory leak by merrymercy in https://github.com/sgl-project/sglang/pull/1785
* Fix seq_lens_sum for cuda graph runner in padded cases by merrymercy in https://github.com/sgl-project/sglang/pull/1789
* Support cuda graph and disable radix cache for Qwen2-VL by yizhang2077 in https://github.com/sgl-project/sglang/pull/1780
* Fix log parsing in the chunked prefill unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1793
* Fix memory leak when doing chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/1787
* [Fix] Fix the log parsing in chunked prefill unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1794
* Revert "Fix memory leak when doing chunked prefill" by merrymercy in https://github.com/sgl-project/sglang/pull/1797
* Fix logprob in the overlapped mode by merrymercy in https://github.com/sgl-project/sglang/pull/1795