Sglang

Latest version: v0.4.4.post3

0.4.0

Highlights

blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/

We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:

- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.
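As a rough illustration of the cache-aware load balancing idea in the highlights above, the sketch below routes a request to the worker that has already seen the longest matching prompt prefix, and falls back to the least-loaded worker when the match is too short. This is a toy model and not SGLang's router: the `Worker` class, the `route` function, the string-based prefix comparison, and the `threshold` value of 0.5 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a: str, b: str) -> int:
    """Return the length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Worker:
    """Hypothetical stand-in for one inference server behind the router."""
    name: str
    load: int = 0                                  # requests routed here so far
    seen: list[str] = field(default_factory=list)  # prompts already routed here

    def best_match(self, prompt: str) -> int:
        # Longest prefix this worker shares with any prompt it has served.
        return max((shared_prefix_len(prompt, p) for p in self.seen), default=0)


def route(prompt: str, workers: list[Worker], threshold: float = 0.5) -> Worker:
    """Pick a worker: prefer a likely cache hit, else balance by load."""
    best = max(workers, key=lambda w: w.best_match(prompt))
    if prompt and best.best_match(prompt) / len(prompt) >= threshold:
        chosen = best                               # enough of the prompt is likely cached
    else:
        chosen = min(workers, key=lambda w: w.load)  # no useful match: least loaded
    chosen.load += 1
    chosen.seen.append(prompt)
    return chosen


workers = [Worker("w0"), Worker("w1")]
system = "You are a helpful assistant. "
a = route(system + "What is 2+2?", workers)       # nothing cached yet, goes to w0
b = route("Translate to French: hello", workers)  # no match, goes to idle w1
c = route(system + "What is 3+3?", workers)       # long shared prefix, back to w0
print(a.name, b.name, c.name)
```

The third request returns to `w0` because most of its prompt matches one already served there, which is the kind of locality that would raise the cache hit rate. A production router would more plausibly compare token sequences in a prefix tree than raw strings in a list.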

What's Changed
* fix: add xgrammar dependency by zhyncs in https://github.com/sgl-project/sglang/pull/2126
* docs: fix module docstrings and copyright headers by XuehaiPan in https://github.com/sgl-project/sglang/pull/2077
* feat(pre-commit): trim unnecessary notebook metadata from git history by XuehaiPan in https://github.com/sgl-project/sglang/pull/2127
* Expose max total num tokens from Runtime & Engine API by henryhmko in https://github.com/sgl-project/sglang/pull/2092
* Only stream output on tp rank 0 by merrymercy in https://github.com/sgl-project/sglang/pull/2124
* Revert "Only stream output on tp rank 0" by merrymercy in https://github.com/sgl-project/sglang/pull/2130
* Add initial support for Intel Gaudi accelerators by ankurneog in https://github.com/sgl-project/sglang/pull/2121
* Add simple CPU offloading support. by janimo in https://github.com/sgl-project/sglang/pull/2081
* Fix grid size in Triton decoding kernel by ispobock in https://github.com/sgl-project/sglang/pull/2134
* [CI] Fix test cases by merrymercy in https://github.com/sgl-project/sglang/pull/2137
* Add concurrency option for benchmark by cermeng in https://github.com/sgl-project/sglang/pull/2136
* Fix dp print message by merrymercy in https://github.com/sgl-project/sglang/pull/2138
* fix: resolve bench_serving args by zhyncs in https://github.com/sgl-project/sglang/pull/2139
* [router] cache-aware load-balancing router v1 by ByronHsu in https://github.com/sgl-project/sglang/pull/2114
* Bump sglang-router to 0.0.5 by ByronHsu in https://github.com/sgl-project/sglang/pull/2142
* update router doc by ByronHsu in https://github.com/sgl-project/sglang/pull/2143
* fix dp_rank env by ByronHsu in https://github.com/sgl-project/sglang/pull/2144
* Add more api routes (completion, health, etc) to the router by ByronHsu in https://github.com/sgl-project/sglang/pull/2146
* add prefix match for certain tenant by ByronHsu in https://github.com/sgl-project/sglang/pull/2147
* Improve sglang router by ByronHsu in https://github.com/sgl-project/sglang/pull/2148
* Merged three native APIs into one: get_server_info by henryhmko in https://github.com/sgl-project/sglang/pull/2152
* feat: remove the dependency on FusedMoE by zhyncs in https://github.com/sgl-project/sglang/pull/2153
* feat: update gitignore and add tuning config for FusedMoE by zhyncs in https://github.com/sgl-project/sglang/pull/2155
* fix: resolve end-of-file-fixer by zhyncs in https://github.com/sgl-project/sglang/pull/2157
* Simplify `Scheduler.update_running_batch` by merrymercy in https://github.com/sgl-project/sglang/pull/2154
* feat: update other MoE models deps by zhyncs in https://github.com/sgl-project/sglang/pull/2156
* Update CI threshold & Improve code style by merrymercy in https://github.com/sgl-project/sglang/pull/2159
* fix: use torch.sum for compatibility by zhyncs in https://github.com/sgl-project/sglang/pull/2161
* Fix mixed chunked prefill in overlap mode by merrymercy in https://github.com/sgl-project/sglang/pull/2158
* Balance CI tests by merrymercy in https://github.com/sgl-project/sglang/pull/2162
* Rename triton_fused_moe -> fused_moe_triton by merrymercy in https://github.com/sgl-project/sglang/pull/2163
* Fix docs by merrymercy in https://github.com/sgl-project/sglang/pull/2164
* [Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b by BBuf in https://github.com/sgl-project/sglang/pull/2167
* Allow overwrite flashinfer use_tensorcore by merrymercy in https://github.com/sgl-project/sglang/pull/2169
* Replace prob based with threshold based load balancing by ByronHsu in https://github.com/sgl-project/sglang/pull/2170
* feat: fused_moe fp8 monkey patch by zhyncs in https://github.com/sgl-project/sglang/pull/2174
* [Fix] Avoid calling fill_vocab_mask for terminated requests by Ubospica in https://github.com/sgl-project/sglang/pull/2175
* [CI] Split test cases in CI for better load balancing by merrymercy in https://github.com/sgl-project/sglang/pull/2180
* Bump rustls from 0.23.16 to 0.23.18 in /rust by dependabot in https://github.com/sgl-project/sglang/pull/2182
* [feat] Refactor session control interface and add CI by Ying1123 in https://github.com/sgl-project/sglang/pull/2173
* [router] Replace print with logger by ByronHsu in https://github.com/sgl-project/sglang/pull/2183
* Use custom allreduce w/ torch.compile by merrymercy in https://github.com/sgl-project/sglang/pull/2185
* [Performance]: Process affinity to CPU cores with multiple sockets support by HaiShaw in https://github.com/sgl-project/sglang/pull/2171
* Update CI threshold by merrymercy in https://github.com/sgl-project/sglang/pull/2186
* Update XGrammar to the latest API by Ubospica in https://github.com/sgl-project/sglang/pull/2176
* [router] Rust e2e test by ByronHsu in https://github.com/sgl-project/sglang/pull/2184
* Input_embeds support by RinRin-32 in https://github.com/sgl-project/sglang/pull/2052
* [CI] Minor fix for CI by merrymercy in https://github.com/sgl-project/sglang/pull/2187
* Rename double sparsity config file by merrymercy in https://github.com/sgl-project/sglang/pull/2188

0.3.6.post2

* Rename DP_RANK to SGLANG_DP_RANK by merrymercy in https://github.com/sgl-project/sglang/pull/2218
* [3rdparty, document] Updated documentation for triton fused_moe kernel tuning on AMD Instinct GPUs by kkHuang-amd in https://github.com/sgl-project/sglang/pull/2191
* Bump sglang-router to 0.0.10 for env name change by ByronHsu in https://github.com/sgl-project/sglang/pull/2226
* fix typo prompts by qibaoyuan in https://github.com/sgl-project/sglang/pull/2224
* Remove fused_moe_grok by merrymercy in https://github.com/sgl-project/sglang/pull/2223
* add profile in offline benchmark & update doc by bjmsong in https://github.com/sgl-project/sglang/pull/2123
* Rename tuned MI300X config files for fused_moe_triton by HaiShaw in https://github.com/sgl-project/sglang/pull/2228
* Update Install Method 2. From source by HaiShaw in https://github.com/sgl-project/sglang/pull/2232
* Fix chunked prefill size for bench_offline_throughput by merrymercy in https://github.com/sgl-project/sglang/pull/2234
* Disable overlap scheduler for multimodal models by merrymercy in https://github.com/sgl-project/sglang/pull/2235
* Add OLMo2 model. by janimo in https://github.com/sgl-project/sglang/pull/2233
* Crash the server correctly during error by merrymercy in https://github.com/sgl-project/sglang/pull/2231
* Fix memory leak during abort by merrymercy in https://github.com/sgl-project/sglang/pull/2238
* fix missing launch server import by qeternity in https://github.com/sgl-project/sglang/pull/2242
* [fix] Fix prefix caching for multi-image/video by Ying1123 in https://github.com/sgl-project/sglang/pull/2239
* Update backend.md by merrymercy in https://github.com/sgl-project/sglang/pull/2250
* Update backend.md by merrymercy in https://github.com/sgl-project/sglang/pull/2251
* Revert "Add simple CPU offloading support" by Ying1123 in https://github.com/sgl-project/sglang/pull/2252
* Revert "Revert "Add simple CPU offloading support"" by Ying1123 in https://github.com/sgl-project/sglang/pull/2253
* Simplify tokenizer manager by merrymercy in https://github.com/sgl-project/sglang/pull/2254
* Fix hash collision for multi modal models by merrymercy in https://github.com/sgl-project/sglang/pull/2256
* [Minor] fix the style for multimodal models by merrymercy in https://github.com/sgl-project/sglang/pull/2257
* chore: bump v0.3.6.post3 by zhyncs in https://github.com/sgl-project/sglang/pull/2259
* minor: add sgl-kernel dir by zhyncs in https://github.com/sgl-project/sglang/pull/2261
* [benchmark] Add fused_moe_triton benchmark and tuning tools by BBuf in https://github.com/sgl-project/sglang/pull/2225
* Fix the default chunked prefill size by merrymercy in https://github.com/sgl-project/sglang/pull/2268
* Support LoRA in Completion API by bjmsong in https://github.com/sgl-project/sglang/pull/2243
* Add new contributors so they can trigger CI automatically by merrymercy in https://github.com/sgl-project/sglang/pull/2269
* update weights from disk by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/2265
* add get weights by parameter name for llama by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/2266
* [CI] Print summary on github actions by merrymercy in https://github.com/sgl-project/sglang/pull/2274
* [CI] Kill zombie processes by merrymercy in https://github.com/sgl-project/sglang/pull/2280
* [FEAT] Support GGUF format by zhengy001 in https://github.com/sgl-project/sglang/pull/2215
* [Fix] fix assertion error for chunked prefill when disabling cache by wangraying in https://github.com/sgl-project/sglang/pull/2282
* Revert "[FEAT] Support GGUF format" by merrymercy in https://github.com/sgl-project/sglang/pull/2285
* Revert "[Fix] fix assertion error for chunked prefill when disabling cache" by merrymercy in https://github.com/sgl-project/sglang/pull/2286
* [CI] Fix ci tests by merrymercy in https://github.com/sgl-project/sglang/pull/2284
* Revert "Revert "[FEAT] Support GGUF format"" by merrymercy in https://github.com/sgl-project/sglang/pull/2287
* feat: add Dockerfile for development by zhyncs in https://github.com/sgl-project/sglang/pull/2289
* [CI] Fix missing files in run_suite.py by merrymercy in https://github.com/sgl-project/sglang/pull/2288
* adapt vllm distributed module to sglang by yizhang2077 in https://github.com/sgl-project/sglang/pull/2244
* Fix chunked prefill when ignore eos by hnyls2002 in https://github.com/sgl-project/sglang/pull/2290
* [CI] Balance CI tests by merrymercy in https://github.com/sgl-project/sglang/pull/2293
* feat: add should_use_tensor_core by zhyncs in https://github.com/sgl-project/sglang/pull/2179
* Feat: upgrade outlines & support compatibility with the old version by gobraves in https://github.com/sgl-project/sglang/pull/2292
* minor: support flashinfer nightly by zhyncs in https://github.com/sgl-project/sglang/pull/2295
* Add a simple torch native attention backend by YangQun1 in https://github.com/sgl-project/sglang/pull/2241
* feat: skip good first issue by zhyncs in https://github.com/sgl-project/sglang/pull/2298
* minor: rm unused _grouped_size_compiled_for_decode_kernels by zhyncs in https://github.com/sgl-project/sglang/pull/2299
* feat: support sgl-kernel pypi by zhyncs in https://github.com/sgl-project/sglang/pull/2302
* Fix logprob for completions by merrymercy in https://github.com/sgl-project/sglang/pull/2301
* feat: use warp reduce as a simple example by zhyncs in https://github.com/sgl-project/sglang/pull/2304
* fix: resolve CodeQL cpp issue by zhyncs in https://github.com/sgl-project/sglang/pull/2305
* misc: update build setup by zhyncs in https://github.com/sgl-project/sglang/pull/2306
* Online weight updates from torch.distributed by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/2279
* [Fix] Fix the padded hash value for image tokens by merrymercy in https://github.com/sgl-project/sglang/pull/2309
* Use rocminfo instead of rocm-smi for more OS/WSL support by HaiShaw in https://github.com/sgl-project/sglang/pull/2310
* [Minor] Fix code style by merrymercy in https://github.com/sgl-project/sglang/pull/2311
* Add more fused moe benchmark utilities by merrymercy in https://github.com/sgl-project/sglang/pull/2314
* Update model_loader deps and qqq quantization deps (2220) by zhyncs in https://github.com/sgl-project/sglang/pull/2318
* Relax to include more AMD GPUs by HaiShaw in https://github.com/sgl-project/sglang/pull/2319
* [feat] Enable chunked prefill for llava-onevision by Ying1123 in https://github.com/sgl-project/sglang/pull/2281
* [Minor] Fix logger and style by merrymercy in https://github.com/sgl-project/sglang/pull/2325
* Revert "[feat] Enable chunked prefill for llava-onevision" by Ying1123 in https://github.com/sgl-project/sglang/pull/2329
* ROCm Container: set SGLANG_SET_CPU_AFFINITY=1 by HaiShaw in https://github.com/sgl-project/sglang/pull/2328
* Add missing license for router wheel by MrAta in https://github.com/sgl-project/sglang/pull/2324
* Improve torch compile for fused moe by merrymercy in https://github.com/sgl-project/sglang/pull/2327
* fix: resolve cmake url for Dockerfile.dev by zhyncs in https://github.com/sgl-project/sglang/pull/2335
* Fix gptq for moe layers by merrymercy in https://github.com/sgl-project/sglang/pull/2300
* [router] Copy license when publishing & bump version by ByronHsu in https://github.com/sgl-project/sglang/pull/2339
* chore: bump v0.4.0 by zhyncs in https://github.com/sgl-project/sglang/pull/2338

New Contributors
* henryhmko made their first contribution in https://github.com/sgl-project/sglang/pull/2092
* ankurneog made their first contribution in https://github.com/sgl-project/sglang/pull/2121
* cermeng made their first contribution in https://github.com/sgl-project/sglang/pull/2136
* Ubospica made their first contribution in https://github.com/sgl-project/sglang/pull/2175
* dependabot made their first contribution in https://github.com/sgl-project/sglang/pull/2182
* RinRin-32 made their first contribution in https://github.com/sgl-project/sglang/pull/2052
* WrRan made their first contribution in https://github.com/sgl-project/sglang/pull/2195
* apemost made their first contribution in https://github.com/sgl-project/sglang/pull/2198
* qibaoyuan made their first contribution in https://github.com/sgl-project/sglang/pull/2224
* zhengy001 made their first contribution in https://github.com/sgl-project/sglang/pull/2215
* wangraying made their first contribution in https://github.com/sgl-project/sglang/pull/2282
* gobraves made their first contribution in https://github.com/sgl-project/sglang/pull/2292
* YangQun1 made their first contribution in https://github.com/sgl-project/sglang/pull/2241
* MrAta made their first contribution in https://github.com/sgl-project/sglang/pull/2324

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.3.6...v0.4.0

0.3.6.post1

* Update sampler.py to skip the success check by merrymercy in https://github.com/sgl-project/sglang/pull/2197
* remove unused imports by WrRan in https://github.com/sgl-project/sglang/pull/2195
* Remove unresolved reference 'self' by apemost in https://github.com/sgl-project/sglang/pull/2198
* using `is not` not `!=` to test `None` by WrRan in https://github.com/sgl-project/sglang/pull/2196
* fix: add cuda-python for xgrammar by zhyncs in https://github.com/sgl-project/sglang/pull/2199
* minor: update check_env by zhyncs in https://github.com/sgl-project/sglang/pull/2201
* add sglang version to get_server_info by binarycrayon in https://github.com/sgl-project/sglang/pull/2206
* docs: update adoption by zhyncs in https://github.com/sgl-project/sglang/pull/2204
* Bump router to 0.0.9 with better logging by ByronHsu in https://github.com/sgl-project/sglang/pull/2207
* Fix rust warning by ByronHsu in https://github.com/sgl-project/sglang/pull/2208
* Fix flaky tests by merrymercy in https://github.com/sgl-project/sglang/pull/2212
* [feat] Support session control for vision language models by Ying1123 in https://github.com/sgl-project/sglang/pull/2210
* Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by merrymercy in https://github.com/sgl-project/sglang/pull/2217
* Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" by merrymercy in https://github.com/sgl-project/sglang/pull/2221
* Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by merrymercy in https://github.com/sgl-project/sglang/pull/2222
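The `is not` versus `!=` change listed above reflects a general Python rule rather than anything specific to SGLang: `!=` dispatches to a class's comparison methods, while `is not` checks object identity and cannot be overridden. The minimal sketch below uses a hypothetical `AlwaysEqual` class, invented for this example, to show how the two tests can disagree.

```python
class AlwaysEqual:
    """Hypothetical class with overridden comparison operators."""

    def __eq__(self, other):
        return True   # claims to equal everything, including None

    def __ne__(self, other):
        return False


obj = AlwaysEqual()

equality_says_set = obj != None      # calls AlwaysEqual.__ne__, so this is False
identity_says_set = obj is not None  # identity check only, so this is True

print(equality_says_set, identity_says_set)
```

Because `None` is a singleton, the identity test always gives the intended answer, which is why PEP 8 recommends `is` and `is not` for comparisons to `None`.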

0.3.6

Highlights
* Reduce CPU overhead by enabling overlap scheduler by default. **1.1x higher throughput**. (2105, 2067, 2095)
* Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (1970, 2061)
* Cache-aware load balancer. 4x higher cache hit rate (1934)
* Support xgrammar backend for grammar-guided decoding (2056)
* Support Prometheus metrics (1853, 1981)
* Support torch 2.5.1 (2069) and torch-native tensor parallelism (1876)
* Support graceful termination (1838) and watchdog (1816)
* Support notebook-style documentation (https://sgl-project.github.io/)
* Add an offline benchmark script (1968)
* Bug, deadlock, NaN, and OOM fixes (2083, 1850, 1800, 1779, 1789, 1858)
* New models: Phi3-small (2062), Gemma-2 reward model (1954), GPT-2 (1833)

What's Changed
* Fix edge case for truncated by ByronHsu in https://github.com/sgl-project/sglang/pull/1747
* Fuse more ops & Simplify token mapping by merrymercy in https://github.com/sgl-project/sglang/pull/1758
* [API] add get memory pool size by Ying1123 in https://github.com/sgl-project/sglang/pull/1760
* Fix perf regression for set_kv_buffer by merrymercy in https://github.com/sgl-project/sglang/pull/1765
* [Fix] Fix abort in data parallelism by merrymercy in https://github.com/sgl-project/sglang/pull/1767
* Fix stop condition for <|eom_id|> by merrymercy in https://github.com/sgl-project/sglang/pull/1766
* Update docs by merrymercy in https://github.com/sgl-project/sglang/pull/1768
* Fix missing additional_stop_token_ids by merrymercy in https://github.com/sgl-project/sglang/pull/1769
* Fix out of memory message. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1771
* Crash the server on warnings in CI by merrymercy in https://github.com/sgl-project/sglang/pull/1772
* Fix the perf regression due to additional_stop_token_ids by merrymercy in https://github.com/sgl-project/sglang/pull/1773
* Fix MockTokenizer in the unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1774
* [Bug] Catch any errors caused by parsing json schema by zolinthecow in https://github.com/sgl-project/sglang/pull/1776
* [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/1779
* [Fix] Fix cuda graph padding for triton attention backend by merrymercy in https://github.com/sgl-project/sglang/pull/1782
* check user-specified model_max_len with hf derived max_model_len by BBuf in https://github.com/sgl-project/sglang/pull/1778
* Re-introduce `get_cuda_graph_seq_len_fill_value` by merrymercy in https://github.com/sgl-project/sglang/pull/1783
* Enhance the test case for chunked prefill and check memory leak by merrymercy in https://github.com/sgl-project/sglang/pull/1785
* Fix seq_lens_sum for cuda graph runner in padded cases by merrymercy in https://github.com/sgl-project/sglang/pull/1789
* Qwen2vl support cuda graph and disable radix cache by yizhang2077 in https://github.com/sgl-project/sglang/pull/1780
* Fix log parsing in the chunked prefill unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1793
* Fix memory leak when doing chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/1787
* [Fix] Fix the log parsing in chunked prefill unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1794
* Revert "Fix memory leak when doing chunked prefill" by merrymercy in https://github.com/sgl-project/sglang/pull/1797
* Fix logprob in the overlapped mode by merrymercy in https://github.com/sgl-project/sglang/pull/1795

0.3.5.post2

* fix a small typo in docs by BBuf in https://github.com/sgl-project/sglang/pull/2047
* Fix core (MI300X) with --enable-overlap by HaiShaw in https://github.com/sgl-project/sglang/pull/2048
* Add Tensor Parallel to torch_native_llama by kwen2501 in https://github.com/sgl-project/sglang/pull/1876
* Add get_amdgpu_memory_capacity() by HaiShaw in https://github.com/sgl-project/sglang/pull/2049
* Fix weight update for data parallelism by merrymercy in https://github.com/sgl-project/sglang/pull/2050
* Support DP MLA by ispobock in https://github.com/sgl-project/sglang/pull/1970
* Fix illegal memory access in overlap mode & Use more fused triton kernels for building meta data by merrymercy in https://github.com/sgl-project/sglang/pull/2051
* chore: update torch v2.5.1 by zhyncs in https://github.com/sgl-project/sglang/pull/1849
* Revert "chore: update torch v2.5.1" by merrymercy in https://github.com/sgl-project/sglang/pull/2063
* Remove monkey_patch_vllm_dummy_weight_loader by merrymercy in https://github.com/sgl-project/sglang/pull/2064
* Deprecate --disable-flashinfer and --disable-flashinfer-sampling by merrymercy in https://github.com/sgl-project/sglang/pull/2065
* Support cuda graph for DP attention by ispobock in https://github.com/sgl-project/sglang/pull/2061
* Rename arguments `--disable-nan-detection` to `--enable-nan-detection` by merrymercy in https://github.com/sgl-project/sglang/pull/2066
* [Performance] Update xgrammar-related constrained decoding by DarkSharpness in https://github.com/sgl-project/sglang/pull/2056
* add phi-3 small support by Tushar-ml in https://github.com/sgl-project/sglang/pull/2062
* [Minor] Fix styles for overlap mode by merrymercy in https://github.com/sgl-project/sglang/pull/2068
* Fix cuda illegal memory access in overlap mode by merrymercy in https://github.com/sgl-project/sglang/pull/2070
* Tune the threshold for accuracy tests in CI by merrymercy in https://github.com/sgl-project/sglang/pull/2071
* Crash the CI jobs on model import errors by merrymercy in https://github.com/sgl-project/sglang/pull/2072
* support set role as 'tool' by yukavio in https://github.com/sgl-project/sglang/pull/2075
* feat: update torch 2.5.1 by zhyncs in https://github.com/sgl-project/sglang/pull/2069
* Rename layer_idx to layer_id for consistency by janimo in https://github.com/sgl-project/sglang/pull/2078
* Fix chunked prefill with output logprob by merrymercy in https://github.com/sgl-project/sglang/pull/2083
* Allow passing extra request body to bench_offline_throughput.py by merrymercy in https://github.com/sgl-project/sglang/pull/2085
* Simplify logits penalizer by merrymercy in https://github.com/sgl-project/sglang/pull/2086
* Use cuda event wait and synchronization instead of busy waiting by merrymercy in https://github.com/sgl-project/sglang/pull/2089
* Fix: incorrect top_logprobs in chat completion by ajwaitz in https://github.com/sgl-project/sglang/pull/2088
* minor: update gsm8k eval by zhyncs in https://github.com/sgl-project/sglang/pull/2091
* Use native fp8 format on MI300X by HaiShaw in https://github.com/sgl-project/sglang/pull/2094
* minor: add dataset dump and questions shuffle by zhyncs in https://github.com/sgl-project/sglang/pull/2093
* Make constrained decoding work for overlap scheduler by merrymercy in https://github.com/sgl-project/sglang/pull/2095
* Set schedule policy more conservative for DP attention by ispobock in https://github.com/sgl-project/sglang/pull/2096
* Enable overlap by default by merrymercy in https://github.com/sgl-project/sglang/pull/2067
* Update nightly-eval.yml by merrymercy in https://github.com/sgl-project/sglang/pull/2100
* [feat] Add session control by Ying1123 in https://github.com/sgl-project/sglang/pull/2073
* Allow skipping warmup in bench_offline_throughput.py by merrymercy in https://github.com/sgl-project/sglang/pull/2103
* Move test_session_id.py to playground by merrymercy in https://github.com/sgl-project/sglang/pull/2104
* Enable overlap scheduler by default for the triton attention backend by merrymercy in https://github.com/sgl-project/sglang/pull/2105
* Error out when torchao-config option is not recognized by jerryzh168 in https://github.com/sgl-project/sglang/pull/2107
* Turn off autotune for scaled mm for fp8 dynamic quant in torchao by jerryzh168 in https://github.com/sgl-project/sglang/pull/2116
* ROCm: Fix MoE padding for non-FP8 cases by HaiShaw in https://github.com/sgl-project/sglang/pull/2111
* Add support for Qwen2-VL-based embedding models by james-p-xu in https://github.com/sgl-project/sglang/pull/2055
* [router] add base_gpu_id server args & merged radix tree python reference by ByronHsu in https://github.com/sgl-project/sglang/pull/2115
* Fix 2037 - Context length check does not take into account pad tokens for visual models by jakep-allenai in https://github.com/sgl-project/sglang/pull/2106
* Rename sglang.bench_latency to sglang.bench_one_batch by merrymercy in https://github.com/sgl-project/sglang/pull/2118
* Benchmark with Pytorch Profiler easily by bjmsong in https://github.com/sgl-project/sglang/pull/2110
* [minor] Clean up unused imports by merrymercy in https://github.com/sgl-project/sglang/pull/2122
* minor: update gsm8k threshold by zhyncs in https://github.com/sgl-project/sglang/pull/2125
* chore: bump v0.3.6 by zhyncs in https://github.com/sgl-project/sglang/pull/2120

New Contributors
* zolinthecow made their first contribution in https://github.com/sgl-project/sglang/pull/1776
* BBuf made their first contribution in https://github.com/sgl-project/sglang/pull/1778
* DarkSharpness made their first contribution in https://github.com/sgl-project/sglang/pull/1752
* hliuca made their first contribution in https://github.com/sgl-project/sglang/pull/1799
* liuyanyi made their first contribution in https://github.com/sgl-project/sglang/pull/1823
* DanielC12321 made their first contribution in https://github.com/sgl-project/sglang/pull/1833
* geeker-smallwhite made their first contribution in https://github.com/sgl-project/sglang/pull/1855
* yichiche made their first contribution in https://github.com/sgl-project/sglang/pull/1871
* inakineitor made their first contribution in https://github.com/sgl-project/sglang/pull/1902
* Lzhang-hub made their first contribution in https://github.com/sgl-project/sglang/pull/1853
* XuehaiPan made their first contribution in https://github.com/sgl-project/sglang/pull/1926
* austin362667 made their first contribution in https://github.com/sgl-project/sglang/pull/1891
* binarycrayon made their first contribution in https://github.com/sgl-project/sglang/pull/1933
* aqweteddy made their first contribution in https://github.com/sgl-project/sglang/pull/1954
* leishaoSC made their first contribution in https://github.com/sgl-project/sglang/pull/1966
* kursataktas made their first contribution in https://github.com/sgl-project/sglang/pull/1745
* HuanzhiMao made their first contribution in https://github.com/sgl-project/sglang/pull/1982
* james-p-xu made their first contribution in https://github.com/sgl-project/sglang/pull/1995
* RangiLyu made their first contribution in https://github.com/sgl-project/sglang/pull/1994
* chottolabs made their first contribution in https://github.com/sgl-project/sglang/pull/2026
* ethe made their first contribution in https://github.com/sgl-project/sglang/pull/2028
* w1ndseeker made their first contribution in https://github.com/sgl-project/sglang/pull/2038
* kwen2501 made their first contribution in https://github.com/sgl-project/sglang/pull/1876
* Tushar-ml made their first contribution in https://github.com/sgl-project/sglang/pull/2062
* yukavio made their first contribution in https://github.com/sgl-project/sglang/pull/2075
* ajwaitz made their first contribution in https://github.com/sgl-project/sglang/pull/2088
* jakep-allenai made their first contribution in https://github.com/sgl-project/sglang/pull/2106
* bjmsong made their first contribution in https://github.com/sgl-project/sglang/pull/2110

**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.3.4.post1...v0.3.6

0.3.5.post1

* Do not let invalid grammar crash the server by merrymercy in https://github.com/sgl-project/sglang/pull/2023
* Fix dependency and error message for xgrammar by merrymercy in https://github.com/sgl-project/sglang/pull/2024
* set content to empty string by chottolabs in https://github.com/sgl-project/sglang/pull/2026
* chore: open lto and optimization in release profile by ethe in https://github.com/sgl-project/sglang/pull/2028
* Add download_dir ServerArgs property by pjyi2147 in https://github.com/sgl-project/sglang/pull/2027
* Github runner instructions for AMD by HaiShaw in https://github.com/sgl-project/sglang/pull/2031
* Fix torch.compile for MoE by merrymercy in https://github.com/sgl-project/sglang/pull/2033
* Fix unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/2034
* Fix outlines version by merrymercy in https://github.com/sgl-project/sglang/pull/2036
* Expose no_stop_trim and skip_special_tokens in openai api by merrymercy in https://github.com/sgl-project/sglang/pull/2039
* Offline LLM Engine Benchmark Throughput by zolinthecow in https://github.com/sgl-project/sglang/pull/1968
* fix: align enable_overlap_scheduler naming between code and docs by w1ndseeker in https://github.com/sgl-project/sglang/pull/2038
* Fix the default arguments of bench_offline_throughput.py & simplify detokenizer manager by merrymercy in https://github.com/sgl-project/sglang/pull/2042
* benchmark json schema by DarkSharpness in https://github.com/sgl-project/sglang/pull/2030
* Fix json benchmark by merrymercy in https://github.com/sgl-project/sglang/pull/2043
* [Fix] Adjust default chunked prefill size and cuda graph max bs according to GPU memory capacity by merrymercy in https://github.com/sgl-project/sglang/pull/2044
