Highlights
The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. By combining FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, SGLang now reaches nearly **100 tokens/s** on these models, which currently makes it the fastest open-source implementation. Look out for new optimizations coming soon!
Thanks very much to the xAI Team, NVIDIA Team, AMD Team, LinkedIn Team, Baseten Team, Oracle Team, Meituan Team, and the open-source community users for their contributions!
Beyond the users mentioned in the [announcement](https://github.com/sgl-project/sglang/discussions/3322), teams such as Tencent and Ant Group also use SGLang to accelerate DeepSeek R1 inference. We are very happy to have received recognition and usage from these teams!
There will surely be bugs that we discover and quickly patch in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel: https://slack.sglang.ai/ Cheers!
Optimizations
- **AMD Performance Leadership**: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's [technical blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html)
- **Enhanced FlashInfer MLA Support**: Now fully compatible with radix cache, chunked prefill, and MTP optimizations. Enable with `--enable-flashinfer-mla`
- **Advanced MTP Capabilities**: Both Triton and FlashInfer backends now offer comprehensive [Multi-Token Prediction](https://docs.sglang.ai/references/deepseek.html#multi-token-prediction) support that is compatible with radix cache and chunked prefill and easily tunable via the [bench_speculative](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script.
- **DeepGEMM Integration**: Full integration of [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) for NVIDIA Hopper architectures. Enable with `export SGL_ENABLE_JIT_DEEPGEMM=1`
- **Pioneering INT8 Quantization**: First industry implementation of INT8 support for DeepSeek R1 models:
  - [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
  - [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- **Other Optimizations**:
  - Blackwell architecture Block Scale FP8 GEMM support
  - Support page size greater than 1 https://github.com/sgl-project/sglang/pull/4356
  - Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
  - Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) https://github.com/sgl-project/sglang/pull/4390
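
As a rough illustration of how the two switches named above fit together, here is a minimal Python sketch that only assembles a launch command and environment without starting a server. The `--enable-flashinfer-mla` flag and the `SGL_ENABLE_JIT_DEEPGEMM` variable come from these notes. The `build_launch` helper is hypothetical, and the `sglang.launch_server` entry point, the `--model-path`, `--tp`, and `--trust-remote-code` arguments, and the `deepseek-ai/DeepSeek-V3` model ID are assumptions that should be checked against the server documentation for your installed version. MTP options are left out on purpose.

```python
import os
import shlex


def build_launch(model_path, tp_size, flashinfer_mla=True, jit_deepgemm=True):
    """Return (argv, env) for a server launch; nothing is executed here."""
    argv = [
        "python3", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,               # assumed argument name
        "--tp", str(tp_size),                     # assumed tensor-parallel flag
        "--trust-remote-code",                    # assumed to be required
    ]
    if flashinfer_mla:
        # Flag named in the release notes.
        argv.append("--enable-flashinfer-mla")

    # Copy the current environment so the caller's process is not modified.
    env = dict(os.environ)
    if jit_deepgemm:
        # Environment variable named in the release notes (Hopper GPUs).
        env["SGL_ENABLE_JIT_DEEPGEMM"] = "1"
    return argv, env


argv, env = build_launch("deepseek-ai/DeepSeek-V3", tp_size=8)
print(shlex.join(argv))  # shell-quoted command line for inspection
```

On a machine with suitable GPUs, the result could be handed to `subprocess.run(argv, env=env)`. Building the command separately from running it keeps the sketch inspectable on any machine.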
Coming soon
- Integrate Flash Attention https://github.com/sgl-project/sglang/issues/4385
- Integrate FlashMLA https://github.com/sgl-project/sglang/issues/4384
- EAGLE 2 optimization https://github.com/sgl-project/sglang/pull/4383
- EAGLE 3 day one support https://github.com/sgl-project/sglang/pull/4247
- Integrate DeepEP https://github.com/sgl-project/sglang/pull/4232
- Prefill and Decoding Disaggregation
What's Changed
* update flashinfer-python by zhyncs in https://github.com/sgl-project/sglang/pull/3557
* fix doc by zhyncs in https://github.com/sgl-project/sglang/pull/3558
* Add support for OpenAI API o1 model by ChuyueSun in https://github.com/sgl-project/sglang/pull/3363
* fix sgl-kernel codestyle by BBuf in https://github.com/sgl-project/sglang/pull/3563
* docs: update install by zhyncs in https://github.com/sgl-project/sglang/pull/3581
* Copy config files for MI300X to support in virtualized environments by yosoyjay in https://github.com/sgl-project/sglang/pull/3505
* ROCm docker: triton update by HaiShaw in https://github.com/sgl-project/sglang/pull/3584
* [fix] added support for vlm in offline inference by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3548
* Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by ispobock in https://github.com/sgl-project/sglang/pull/3582
* [CI] Improve Docs CI Efficiency by shuaills in https://github.com/sgl-project/sglang/pull/3587
* doc: emphasize and notify the usage of chat_template by mickqian in https://github.com/sgl-project/sglang/pull/3589
* fix eagle unit test by zhyncs in https://github.com/sgl-project/sglang/pull/3591
* fix high qps crash when enable mtp by zhyncs in https://github.com/sgl-project/sglang/pull/3592
* fix apply_token_bitmask_inplace_cuda by zhyncs in https://github.com/sgl-project/sglang/pull/3594
* [docs] added favicon to sphinx html by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3564
* fix lockfile and port_registry file permission error by Jiadalee in https://github.com/sgl-project/sglang/pull/3598
* feat: Support Qwen 2.5 vl by mickqian in https://github.com/sgl-project/sglang/pull/3258
* [ROCm] Use `tl.range()` in block GEMM kernels with `num_stages` set by host. by whchung in https://github.com/sgl-project/sglang/pull/3535
* Update to latest amd image. by saienduri in https://github.com/sgl-project/sglang/pull/3597
* Benchmark for reasoning models by simveit in https://github.com/sgl-project/sglang/pull/3532
* Draft of updated doc for sampling params. by simveit in https://github.com/sgl-project/sglang/pull/3260
* [docs] Update sampling_params.md by shuaills in https://github.com/sgl-project/sglang/pull/3617
* [docker] added rdma support by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3619
* Revert "[ROCm] Use `tl.range()` in block GEMM kernels with `num_stage… by zhyncs in https://github.com/sgl-project/sglang/pull/3632
* add mtp unit test by zhyncs in https://github.com/sgl-project/sglang/pull/3634
* update unit test by zhyncs in https://github.com/sgl-project/sglang/pull/3636
* chore: bump v0.4.3.post1 by zhyncs in https://github.com/sgl-project/sglang/pull/3638
* h800 deepseek r1 config and support multi-gpu block-gemm tuning by BBuf in https://github.com/sgl-project/sglang/pull/3639
* feat: support flashinfer mla with prefix cache by zhyncs in https://github.com/sgl-project/sglang/pull/3643
* chore: update flashinfer v0.2.1.post2 by zhyncs in https://github.com/sgl-project/sglang/pull/3644
* chore: bump v0.4.3.post2 by zhyncs in https://github.com/sgl-project/sglang/pull/3645
* use transformers 4.48.3 by zhyncs in https://github.com/sgl-project/sglang/pull/3650
* [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by whchung in https://github.com/sgl-project/sglang/pull/3616
* [ROCm] Optimal MOE Tuning for AMD Radeon Graphics by BruceXcluding in https://github.com/sgl-project/sglang/pull/3567
* Deploy multi-node inference (LWS method) using sglang in a K8s cluster by whybeyoung in https://github.com/sgl-project/sglang/pull/3624
* Update amd docker image. by saienduri in https://github.com/sgl-project/sglang/pull/3654
* [Feature] Apply Cublas Grouped Gemm kernel by Fridge003 in https://github.com/sgl-project/sglang/pull/3629
* update pr-test by zhyncs in https://github.com/sgl-project/sglang/pull/3663
* Fix draft decode max batch size by ispobock in https://github.com/sgl-project/sglang/pull/3676
* fix: remove dependency on latest transformers impl by mickqian in https://github.com/sgl-project/sglang/pull/3635
* AMD Prefill optimize by fsx950223 in https://github.com/sgl-project/sglang/pull/3665
* fix: apply cache size limit of attention mask for VisionAttention by mickqian in https://github.com/sgl-project/sglang/pull/3657
* set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by zhyncs in https://github.com/sgl-project/sglang/pull/3698
* use warp shuffle style reduce and flashinfer vectorize by BBuf in https://github.com/sgl-project/sglang/pull/3628
* [Docs] Add SkyPilot DeepSeek example by Michaelvll in https://github.com/sgl-project/sglang/pull/3706
* [k8s] remove unnecessary hostIPC for security concern by panpan0000 in https://github.com/sgl-project/sglang/pull/3700
* [moe] optim: reduce memory consumption in fused_moe by ch-wan in https://github.com/sgl-project/sglang/pull/3692
* [Improve] Fix Multi-User Port Allocation Conflicts by shuaills in https://github.com/sgl-project/sglang/pull/3601
* Variance measure for reasoning benchmark by simveit in https://github.com/sgl-project/sglang/pull/3677
* Docs: Fix layout with sub-section by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3710
* add control for cutlass fp8 blockwise gemm by yizhang2077 in https://github.com/sgl-project/sglang/pull/3727
* revert BLOCK and num_warps on HIP by HaiShaw in https://github.com/sgl-project/sglang/pull/3722
* Optimize triton attention custom mask by ispobock in https://github.com/sgl-project/sglang/pull/3731
* [Bugfix] Fix scores mask for moe topk by Chen-XiaoBing in https://github.com/sgl-project/sglang/pull/3705
* [Docs] Modify ep related server args and remove cublas part of deepseek by Fridge003 in https://github.com/sgl-project/sglang/pull/3732
* [Fix] Fix bugs and refactor codes in lora for better scalability. by aoshen524 in https://github.com/sgl-project/sglang/pull/3652
* docs: fix 404 link by trayvonpan in https://github.com/sgl-project/sglang/pull/3588
* [docs] added torch.compile cache to dpsk manual by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3737
* AMD/ROCm: update AITER repo to ROCm/aiter by HaiShaw in https://github.com/sgl-project/sglang/pull/3747
* feat: update grouped_topk to support softmax and sigmoid by zixuanzhang226 in https://github.com/sgl-project/sglang/pull/3680
* feat: Add SageMaker support by andjsmi in https://github.com/sgl-project/sglang/pull/3740
* Change description of nvidia jetson docs by shahizat in https://github.com/sgl-project/sglang/pull/3761
* [Fix] fix OpenAI API adapter tokenizer encoding by shuaills in https://github.com/sgl-project/sglang/pull/3432
* [bug] fixed batch api by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3754
* Adjustments to docs by simveit in https://github.com/sgl-project/sglang/pull/3733
* docs: Add offline engine launch example and documentation by shuaills in https://github.com/sgl-project/sglang/pull/3771
* Update offline_engine_api.ipynb by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3773
* Support Qwen RM model. by simveit in https://github.com/sgl-project/sglang/pull/3772
* Add support for nvidia modelopt fp8 kv cache by Edwardf0t1 in https://github.com/sgl-project/sglang/pull/3223
* Tiny fix Olmo2 by fzyzcjy in https://github.com/sgl-project/sglang/pull/3348
* fix lm head weights in Qwen models by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3777
* Fix weight loader error when LM head weights are tied by fzyzcjy in https://github.com/sgl-project/sglang/pull/3766
* Let DetokenizerManager use TypeBasedDispatcher by fzyzcjy in https://github.com/sgl-project/sglang/pull/3117
* bench: Add a benchmark for vLM: MMMU by mickqian in https://github.com/sgl-project/sglang/pull/3562
* Extract generation_manager from tokenizer_manager by fzyzcjy in https://github.com/sgl-project/sglang/pull/3115
* Rename TokenizerManager to StdOrchestrator by fzyzcjy in https://github.com/sgl-project/sglang/pull/3116
* [Docs]Add instruction for manually stopping nsys profiler by Fridge003 in https://github.com/sgl-project/sglang/pull/3795
* Hierarchical Caching for SGLang by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/2693
* Update readme by merrymercy in https://github.com/sgl-project/sglang/pull/3809
* Fix dependency by merrymercy in https://github.com/sgl-project/sglang/pull/3813
* Refactor flashinfer logic for deepseek v3 and fix accuracy bug by Fridge003 in https://github.com/sgl-project/sglang/pull/3785
* Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by laixinn in https://github.com/sgl-project/sglang/pull/3730
* Fix pandas dependency in CI by merrymercy in https://github.com/sgl-project/sglang/pull/3818
* Revert "Rename TokenizerManager to StdOrchestrator" by merrymercy in https://github.com/sgl-project/sglang/pull/3828
* Revert "Extract generation_manager from tokenizer_manager" by merrymercy in https://github.com/sgl-project/sglang/pull/3829
* Fix CI and install docs by merrymercy in https://github.com/sgl-project/sglang/pull/3821
* typos by WrRan in https://github.com/sgl-project/sglang/pull/3801
* doc: fix dead link in router.md by He1pa in https://github.com/sgl-project/sglang/pull/3799
* Fix doc site copyright to current year by wilsonwu in https://github.com/sgl-project/sglang/pull/3741
* [Doc] Fix typo in server-argument description by yuanheng-zhao in https://github.com/sgl-project/sglang/pull/3641
* [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 by lcskrishna in https://github.com/sgl-project/sglang/pull/3237
* [BugFix]: Add missing clamp to llavavid by PanJason in https://github.com/sgl-project/sglang/pull/3787
* [dist] made timeout configurable by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3803
* Fix allgather ops inside cuda graphs by nvcastet in https://github.com/sgl-project/sglang/pull/3709
* fix capture_bs by fsx950223 in https://github.com/sgl-project/sglang/pull/3857
* [BugFix] Fix crash when receiving a req with structured output in DP attention mode. by hcyz33 in https://github.com/sgl-project/sglang/pull/3841
* Fix maximum recursion depth triggered on exception exit by kebe7jun in https://github.com/sgl-project/sglang/pull/3519
* [doc] added quantization doc for dpsk by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3843
* [doc] fixed dpsk quant faq by FrankLeeeee in https://github.com/sgl-project/sglang/pull/3865
* Expert Parallelism (EP) Support for DeepSeek V3/R1 by sleepcoo in https://github.com/sgl-project/sglang/pull/3602
* Revert recent changes by simveit in https://github.com/sgl-project/sglang/pull/3845
* Feature/improve docs by simveit in https://github.com/sgl-project/sglang/pull/3860
* [Feature] Support llguidance for constrained decoding by JC1DA in https://github.com/sgl-project/sglang/pull/3298
* Move dpsk docs forward a step by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3894
* Docs: Reorganize dpsk links by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3900
* Implemented frontend docs by simveit in https://github.com/sgl-project/sglang/pull/3791
* [doc] update sponsorship by whybeyoung in https://github.com/sgl-project/sglang/pull/3903
* [Rocm] Fix to the rocm_mla_decode_rope.py returning random result by Chi-Chu319 in https://github.com/sgl-project/sglang/pull/3898
* [doc] Update document for flashinfer mla by Fridge003 in https://github.com/sgl-project/sglang/pull/3907
* Add return hidden state in the native API by Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3897
* [Docs] Disable notebook CI when merge to main by xqoasis in https://github.com/sgl-project/sglang/pull/3905
* [Docs] Improve DPSK docs in dark mode by hebiao064 in https://github.com/sgl-project/sglang/pull/3914
* [Doc] Add experimental tag for flashinfer mla by Fridge003 in https://github.com/sgl-project/sglang/pull/3925
* Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by laixinn in https://github.com/sgl-project/sglang/pull/3922
* xgrammar 0.1.14 by qeternity in https://github.com/sgl-project/sglang/pull/3593
* revert "Docs: Reorganize dpsk links 3900" by zhyncs in https://github.com/sgl-project/sglang/pull/3933
* upgrade flashinfer v0.2.2.post1 by zhyncs in https://github.com/sgl-project/sglang/pull/3934
* Fix the doc link for sampling params by Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3861
* [feat] Add Vertex AI compatible prediction route for /generate by KCFindstr in https://github.com/sgl-project/sglang/pull/3866
* [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) by yiakwy-xpu-ml-framework-team in https://github.com/sgl-project/sglang/pull/3613
* Fix bench_serving not recognizing OPENAI_API_KEY by kebe7jun in https://github.com/sgl-project/sglang/pull/3870
* set a strict sgl-kernel version by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3950
* [Bugfix] Fix tokenizer_manager not getting 400 when req is too long by CatherineSue in https://github.com/sgl-project/sglang/pull/3678
* [Feature] integrate Structural Tag in xgrammar backend for function calling by minleminzui in https://github.com/sgl-project/sglang/pull/3566
* SGLang + Verl by fzyzcjy in https://github.com/sgl-project/sglang/pull/3852
* Remove unused imports from rocm mla kernel. by lcskrishna in https://github.com/sgl-project/sglang/pull/3963
* Update cutlass dependency by elfiegg in https://github.com/sgl-project/sglang/pull/3966
* [Feature]Support ragged prefill in flashinfer mla backend by Fridge003 in https://github.com/sgl-project/sglang/pull/3967
* Docs: add type hint to sampling parameters by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3975
* Add redline to highlight main process by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3977
* rename FunctionCallReqInput to ParseFunctionCallReq by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3976
* Docs: add special warning to engine docs by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3979
* Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/3982
* Move return_hidden_states to the generate input by Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3985
* Update CODEOWNERS by merrymercy in https://github.com/sgl-project/sglang/pull/3989
* add deepgemm and sglang fp8 block-wise gemm benchmark by BBuf in https://github.com/sgl-project/sglang/pull/3893
* fix typo by BBuf in https://github.com/sgl-project/sglang/pull/3991
* Fix all gather torch compile by ispobock in https://github.com/sgl-project/sglang/pull/3992
* Add accuracy test for TP torch compile by ispobock in https://github.com/sgl-project/sglang/pull/3994
* Enable custom AR for AMD GPUs and maintain it in sgl-kernel by hubertlu-tw in https://github.com/sgl-project/sglang/pull/3406
* Add Benchmark for DeepGEMM Group GEMM by hebiao064 in https://github.com/sgl-project/sglang/pull/3993
* [feat] add small vocab table for eagle's draft model[1]. by Zhou-sx in https://github.com/sgl-project/sglang/pull/3822
* Add fast decode plan for flashinfer mla by Fridge003 in https://github.com/sgl-project/sglang/pull/3987
* Revert "Add fast decode plan for flashinfer mla" by merrymercy in https://github.com/sgl-project/sglang/pull/4008
* Add examples to token-in-token-out for LLM by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4010
* Fix nightly-test CI by yinfan98 in https://github.com/sgl-project/sglang/pull/3826
* Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark by hebiao064 in https://github.com/sgl-project/sglang/pull/4014
* Improve code styles by merrymercy in https://github.com/sgl-project/sglang/pull/4021
* Clean up custom allreduce by merrymercy in https://github.com/sgl-project/sglang/pull/4029
* remove cache configs in model definitions by merrymercy in https://github.com/sgl-project/sglang/pull/4031
* Update metrics documentation by binarycrayon in https://github.com/sgl-project/sglang/pull/3264
* Reorganize c++ source files in sgl-kernel with multiple folders by merrymercy in https://github.com/sgl-project/sglang/pull/4025
* Reorganize python source files in sgl-kernel with multiple files by merrymercy in https://github.com/sgl-project/sglang/pull/4027
* Misc clean up; Remove the support of jump forward by merrymercy in https://github.com/sgl-project/sglang/pull/4032
* Docs: Fix sampling parameter by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4034
* Remove outdated test utils and fix links for the doc of sampling params by Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/3999
* Add examples in sampling parameters by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4039
* Share target model embed and head weights for nextn by ispobock in https://github.com/sgl-project/sglang/pull/4033
* Add a link to the roadmap in README.md by merrymercy in https://github.com/sgl-project/sglang/pull/4043
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/4044
* Fix assert options.num_stages != 0 error in the latest ROCm build image by kkHuang-amd in https://github.com/sgl-project/sglang/pull/4049
* Reasoning parser by xihuai18 in https://github.com/sgl-project/sglang/pull/4000
* HotFix for 3988 using blockwise_int8 by xihuai18 in https://github.com/sgl-project/sglang/pull/4023
* Fix breakage problem when using custom_ar by kkHuang-amd in https://github.com/sgl-project/sglang/pull/4052
* ROCm: update aiter and its usage to fused moe (bfloat16, fp8, fp8 block-quant) by HaiShaw in https://github.com/sgl-project/sglang/pull/4053
* Fix `debug_tensor_dump_output_folder` optional key missing by Qubitium in https://github.com/sgl-project/sglang/pull/4046
* Remove grafana dashboard's datasource uid by kebe7jun in https://github.com/sgl-project/sglang/pull/4051
* [Fix & Style] Refactor the grammar backend to reduce human errors and improve readability by DarkSharpness in https://github.com/sgl-project/sglang/pull/4030
* [XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. by cboss6 in https://github.com/sgl-project/sglang/pull/3954
* sgl-router - issues on routing and project build. (3870) by michaelfeil in https://github.com/sgl-project/sglang/pull/3948
* fix: support gelu_new activation function in gpt2 by Xiuyu-Li in https://github.com/sgl-project/sglang/pull/3712
* remove unused max_jobs by sgjzfzzf in https://github.com/sgl-project/sglang/pull/3607
* [Feature] Add test for speculative_token_map by Achazwl in https://github.com/sgl-project/sglang/pull/4016
* Revert "Fix nightly-test CI" by merrymercy in https://github.com/sgl-project/sglang/pull/4065
* Update nextn ci test by ispobock in https://github.com/sgl-project/sglang/pull/4071
* Simplify eagle tests and TP sync in grammar backend by merrymercy in https://github.com/sgl-project/sglang/pull/4066
* Add examples for returning hidden states when using the server by Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4074
* [Minor] more code cleanup by merrymercy in https://github.com/sgl-project/sglang/pull/4077
* test: add vlm to token in & out example by mickqian in https://github.com/sgl-project/sglang/pull/3941
* [QUANT] Add GPTQModel Dynamic Quantization + `lm_head` Quantization by Qubitium in https://github.com/sgl-project/sglang/pull/3790
* bench: add dataset param for bench_multiturn by zeroorhero in https://github.com/sgl-project/sglang/pull/3990
* ROCM: AITER BLOCK GEMM by BruceXcluding in https://github.com/sgl-project/sglang/pull/4075
* [Eagle] Refactor eagle speculative decoding by Ying1123 in https://github.com/sgl-project/sglang/pull/3986
* Fix the moe padding conditional logic by HaiShaw in https://github.com/sgl-project/sglang/pull/4081
* [Revision] Add fast decode plan for flashinfer mla by Fridge003 in https://github.com/sgl-project/sglang/pull/4012
* Fix triton kernel illegal memory issue for eagle by ispobock in https://github.com/sgl-project/sglang/pull/4100
* Add update_weights_from_disk endpoint to Engine by jhinpan in https://github.com/sgl-project/sglang/pull/4102
* Add DeepSeek optimization ablations documentation by M0gician in https://github.com/sgl-project/sglang/pull/4107
* reorganize dpsk docs by zhaochenyang20 in https://github.com/sgl-project/sglang/pull/4108
* Add examples for server token-in-token-out by Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/4103
* revert deepseek docs by zhyncs in https://github.com/sgl-project/sglang/pull/4109
* Create release-docker-amd-nightly.yml by saienduri in https://github.com/sgl-project/sglang/pull/4105
* remove testing on PR workflow change by saienduri in https://github.com/sgl-project/sglang/pull/4110
* Debug radixcache: refactor recursive helper methods by luzengxiangcn in https://github.com/sgl-project/sglang/pull/3029
* Online serving benchmarks of real datasets for hierarchical KV caching by PanJason in https://github.com/sgl-project/sglang/pull/3211
* fix cross-reference error and spelling mistakes by samzong in https://github.com/sgl-project/sglang/pull/4101
* fix Non-consecutive header level increase in docs/router/router.md by samzong in https://github.com/sgl-project/sglang/pull/4099
* chore: bump v0.4.3.post3 by zhyncs in https://github.com/sgl-project/sglang/pull/4114
* [Hotfix] Fix incomplete token_to_kv_pool refactor by Edenzzzz in https://github.com/sgl-project/sglang/pull/4121
* Remove prefill-only-one-req by merrymercy in https://github.com/sgl-project/sglang/pull/4117
* Add a pointer to the real KV cache pool by xiezhq-hermann in https://github.com/sgl-project/sglang/pull/4113
* feat: support docs auto live-reload with sphinx-autobuild by samzong in https://github.com/sgl-project/sglang/pull/4111
* EAGLE docs by simveit in https://github.com/sgl-project/sglang/pull/4038
* Add codeowners for eagle implementations by Ying1123 in https://github.com/sgl-project/sglang/pull/4131
* Add tag suffix to nightly docker builds. by saienduri in https://github.com/sgl-project/sglang/pull/4129
* remove unused max_jobs in setup_rocm.py by sgjzfzzf in https://github.com/sgl-project/sglang/pull/4126
* Split the __init__ of scheduler as smaller functions. Improve the eagle tests by merrymercy in https://github.com/sgl-project/sglang/pull/4128
* [Minor] make the `__init__` function of model_runner.py shorter by merrymercy in https://github.com/sgl-project/sglang/pull/4132
* AMD/ROCm: update base image string by kkHuang-amd in https://github.com/sgl-project/sglang/pull/4137
* Update CODEOWNER by merrymercy in https://github.com/sgl-project/sglang/pull/4138
* fix bench serving bug by Lzhang-hub in https://github.com/sgl-project/sglang/pull/4135
* Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle by merrymercy in https://github.com/sgl-project/sglang/pull/4134
* Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant by yinfan98 in https://github.com/sgl-project/sglang/pull/4147
* Fix constrained generation errors by adding datasets dependency by olliestanley in https://github.com/sgl-project/sglang/pull/4142