## Highlights
* Reduce CPU overhead by enabling the overlap scheduler by default, yielding **1.1x higher throughput** (#2105, #2067, #2095)
* Support data parallelism for attention and MLA, giving 1.5x higher decoding throughput (#1970, #2061)
* Add a cache-aware load balancer, achieving a 4x higher cache hit rate (#1934)
* Support the xgrammar backend for grammar-guided decoding (#2056); see the usage sketch after this list
* Support Prometheus metrics (#1853, #1981)
* Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
* Support graceful termination (#1838) and watchdog (#1816)
* Support notebook-style documentation (https://sgl-project.github.io/)
* Add an offline benchmark script (#1968)
* Fixes for bugs, deadlocks, NaN issues, and OOMs (#2083, #1850, #1800, #1779, #1789, #1858)
* New models: Phi-3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
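For readers who want to try several of the highlighted features together, here is a minimal sketch that sends a grammar-constrained request to a running server and then scrapes its metrics endpoint. It assumes the server was launched with data-parallel attention, the xgrammar backend, and metrics enabled; the flag names (`--enable-dp-attention`, `--dp-size`, `--grammar-backend`, `--enable-metrics`), the `json_schema` sampling parameter, and the `/generate` and `/metrics` paths are assumptions based on the items above, so check `python -m sglang.launch_server --help` for the exact spelling in your build.

```python
# Hedged sketch, not a definitive reference: the flag, parameter, and endpoint
# names below are assumptions taken from the release highlights; verify them
# against `python -m sglang.launch_server --help` for your installed version.
#
# Assumed launch command for the server this script talks to:
#   python -m sglang.launch_server \
#       --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --enable-dp-attention --dp-size 2 \
#       --grammar-backend xgrammar \
#       --enable-metrics
import json

import requests

BASE_URL = "http://127.0.0.1:30000"  # default sglang server address (assumed)

# Grammar-guided decoding: constrain the output to a JSON schema, which is
# assumed to be routed through the xgrammar backend when it is enabled.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "Return a small JSON object describing a person.",
        "sampling_params": {
            "max_new_tokens": 64,
            "json_schema": json.dumps(schema),  # assumed parameter name
        },
    },
)
print(resp.json())

# Prometheus metrics: with metrics enabled, the server is expected to expose a
# text-format scrape target (path assumed to be /metrics).
print(requests.get(f"{BASE_URL}/metrics").text[:500])
```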
## What's Changed
* Fix edge case for truncated by ByronHsu in https://github.com/sgl-project/sglang/pull/1747
* Fuse more ops & Simplify token mapping by merrymercy in https://github.com/sgl-project/sglang/pull/1758
* [API] add get memory pool size by Ying1123 in https://github.com/sgl-project/sglang/pull/1760
* Fix perf regression for set_kv_buffer by merrymercy in https://github.com/sgl-project/sglang/pull/1765
* [Fix] Fix abort in data parallelism by merrymercy in https://github.com/sgl-project/sglang/pull/1767
* Fix stop condition for <|eom_id|> by merrymercy in https://github.com/sgl-project/sglang/pull/1766
* Update docs by merrymercy in https://github.com/sgl-project/sglang/pull/1768
* Fix missing additional_stop_token_ids by merrymercy in https://github.com/sgl-project/sglang/pull/1769
* Fix out of memory message. by hnyls2002 in https://github.com/sgl-project/sglang/pull/1771
* Crash the server on warnings in CI by merrymercy in https://github.com/sgl-project/sglang/pull/1772
* Fix the perf regression due to additional_stop_token_ids by merrymercy in https://github.com/sgl-project/sglang/pull/1773
* Fix MockTokenizer in the unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1774
* [Bug] Catch any errors caused by parsing json schema by zolinthecow in https://github.com/sgl-project/sglang/pull/1776
* [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/1779
* [Fix] Fix cuda graph padding for triton attention backend by merrymercy in https://github.com/sgl-project/sglang/pull/1782
* Check user-specified model_max_len with hf derived max_model_len by BBuf in https://github.com/sgl-project/sglang/pull/1778
* Re-introduce `get_cuda_graph_seq_len_fill_value` by merrymercy in https://github.com/sgl-project/sglang/pull/1783
* Enhance the test case for chunked prefill and check memory leak by merrymercy in https://github.com/sgl-project/sglang/pull/1785
* Fix seq_lens_sum for cuda graph runner in padded cases by merrymercy in https://github.com/sgl-project/sglang/pull/1789
* Support cuda graph and disable radix cache for Qwen2-VL by yizhang2077 in https://github.com/sgl-project/sglang/pull/1780
* Fix log parsing in the chunked prefill unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1793
* Fix memory leak when doing chunked prefill by hnyls2002 in https://github.com/sgl-project/sglang/pull/1787
* [Fix] Fix the log parsing in chunked prefill unit tests by merrymercy in https://github.com/sgl-project/sglang/pull/1794
* Revert "Fix memory leak when doing chunked prefill" by merrymercy in https://github.com/sgl-project/sglang/pull/1797
* Fix logprob in the overlapped mode by merrymercy in https://github.com/sgl-project/sglang/pull/1795