Highlights
- We did extensive engineering work to improve baseline performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, on models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with FP8 and FP16. See the latest [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
What's Changed
* Optimize mem indices management by hnyls2002 in https://github.com/sgl-project/sglang/pull/619
* Unify index operations by hnyls2002 in https://github.com/sgl-project/sglang/pull/620
* Simplify mem state by wisclmy0611 in https://github.com/sgl-project/sglang/pull/623
* Improve tensor parallel performance by Ying1123 in https://github.com/sgl-project/sglang/pull/625
* Bump version to 0.1.21 by Ying1123 in https://github.com/sgl-project/sglang/pull/626
* Fix model forward grad by hnyls2002 in https://github.com/sgl-project/sglang/pull/628
* Update docker file by Ying1123 in https://github.com/sgl-project/sglang/pull/629
* Disable NCCL_NVLS by default by Ying1123 in https://github.com/sgl-project/sglang/pull/631
* Add qwen2 tie word embedding by yileld in https://github.com/sgl-project/sglang/pull/630
* Add support for VertexAI safety settings by AidanCooper in https://github.com/sgl-project/sglang/pull/624
* Fix vertexai by hnyls2002 in https://github.com/sgl-project/sglang/pull/633
* Reduce docker size by hnyls2002 in https://github.com/sgl-project/sglang/pull/632
* clean up step function by Ying1123 in https://github.com/sgl-project/sglang/pull/635
* feat: support internlm2 by zhyncs in https://github.com/sgl-project/sglang/pull/636
* misc: add pre-commit config by zhyncs in https://github.com/sgl-project/sglang/pull/637
* misc: add issue and pr template by zhyncs in https://github.com/sgl-project/sglang/pull/638
* Flashinfer sample kernel by hnyls2002 in https://github.com/sgl-project/sglang/pull/617
* Move `global_server_args_dict` by hnyls2002 in https://github.com/sgl-project/sglang/pull/642
* Increase the capacity of the memory pool by Ying1123 in https://github.com/sgl-project/sglang/pull/643
* feat: add check_env by zhyncs in https://github.com/sgl-project/sglang/pull/645
* Remove the dependency of rpyc by wisclmy0611 in https://github.com/sgl-project/sglang/pull/646
* misc: rm rpyc from PACKAGE_LIST by zhyncs in https://github.com/sgl-project/sglang/pull/649
* fix: set ulimit -n 65535 by zhyncs in https://github.com/sgl-project/sglang/pull/647
* feat: add lint workflow by zhyncs in https://github.com/sgl-project/sglang/pull/648
* fix: resolve lint error by zhyncs in https://github.com/sgl-project/sglang/pull/650
* Remove useless variables in infer_batch.py by Ying1123 in https://github.com/sgl-project/sglang/pull/651
* Detokenize incrementally when streaming by hnyls2002 in https://github.com/sgl-project/sglang/pull/653
* `TokenizerManager.context_len` should inherit from `server_args.conte… by shrirajh in https://github.com/sgl-project/sglang/pull/654
* Remove cached triton launcher by merrymercy in https://github.com/sgl-project/sglang/pull/656
* perf: reduce ttft and itl with stream_interval 1 by zhyncs in https://github.com/sgl-project/sglang/pull/658
* feat: add benchmark serving by zhyncs in https://github.com/sgl-project/sglang/pull/657
* refactor model loader [unreachable code]: initial refactor by Ying1123 in https://github.com/sgl-project/sglang/pull/655
* misc: update SGLang package description by zhyncs in https://github.com/sgl-project/sglang/pull/659
* Update Readme by Ying1123 in https://github.com/sgl-project/sglang/pull/660
* feat: update check env by zhyncs in https://github.com/sgl-project/sglang/pull/661
* Improve docs by Ying1123 in https://github.com/sgl-project/sglang/pull/662
* Add benchmark instructions by Ying1123 in https://github.com/sgl-project/sglang/pull/663
* Fix jump forward when streaming by hnyls2002 in https://github.com/sgl-project/sglang/pull/665
* Fix kill process util by ispobock in https://github.com/sgl-project/sglang/pull/666
* Add support for OpenAI API parallel sampling by yichuan520030910320 in https://github.com/sgl-project/sglang/pull/640
* Update OpenAI API by wisclmy0611 in https://github.com/sgl-project/sglang/pull/667
* Temporary fix invalid sample results by hnyls2002 in https://github.com/sgl-project/sglang/pull/668
* Support random dataset in bench_serving.py by merrymercy in https://github.com/sgl-project/sglang/pull/669
* Revert "Temporary fix invalid sample results" by hnyls2002 in https://github.com/sgl-project/sglang/pull/673
* refactor model loader: initial refactor by Ying1123 in https://github.com/sgl-project/sglang/pull/664
* Fix cuda graph with flashinfer by merrymercy in https://github.com/sgl-project/sglang/pull/675
* Tmp fix illegal sample by hnyls2002 in https://github.com/sgl-project/sglang/pull/676
* Update version to 0.1.22 by Ying1123 in https://github.com/sgl-project/sglang/pull/677
* Fallback when sampling failed by ispobock in https://github.com/sgl-project/sglang/pull/678
* feat: support TRT LLM benchmark and multiple benchmarks by zhyncs in https://github.com/sgl-project/sglang/pull/670
* Decouple kv by hnyls2002 in https://github.com/sgl-project/sglang/pull/679
* Support gpt-bigcode model class by hnyls2002 in https://github.com/sgl-project/sglang/pull/681
* support non-streaming benchmark by merrymercy in https://github.com/sgl-project/sglang/pull/682
* Fix StreamExecutor.fork() losing the current role start index. by max99x in https://github.com/sgl-project/sglang/pull/684
* feat: update bench serving by zhyncs in https://github.com/sgl-project/sglang/pull/685
* misc: update output file logic by zhyncs in https://github.com/sgl-project/sglang/pull/686
* Allow disabling streaming in bench by merrymercy in https://github.com/sgl-project/sglang/pull/687
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/688
* Support Deepseek MoE Model by hnyls2002 in https://github.com/sgl-project/sglang/pull/689
* misc: recommend to use chat model for benchmark by zhyncs in https://github.com/sgl-project/sglang/pull/690
* Support Mistral-Nemo by ispobock in https://github.com/sgl-project/sglang/pull/691
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/692
* fix: update bench serving by zhyncs in https://github.com/sgl-project/sglang/pull/694
* misc: update output token logic by zhyncs in https://github.com/sgl-project/sglang/pull/695
* Tune params by Ying1123 in https://github.com/sgl-project/sglang/pull/696
* Fix trt benchmark by Ying1123 in https://github.com/sgl-project/sglang/pull/697
* misc: fix typo by zhyncs in https://github.com/sgl-project/sglang/pull/698
* Fix flashinfer by Ying1123 in https://github.com/sgl-project/sglang/pull/700
* Fix hf config loading by ispobock in https://github.com/sgl-project/sglang/pull/702
* Use min new token ratio at start by hnyls2002 in https://github.com/sgl-project/sglang/pull/701
* feat: add e2e latency by zhyncs in https://github.com/sgl-project/sglang/pull/704
* Update vllm version to support llama3.1 by Ying1123 in https://github.com/sgl-project/sglang/pull/705
* bump version to 0.1.23 by Ying1123 in https://github.com/sgl-project/sglang/pull/706
* Reduce hardcoded logic of kernel usage by wisclmy0611 in https://github.com/sgl-project/sglang/pull/707
* Fix multi-node deadlock by merrymercy in https://github.com/sgl-project/sglang/pull/709
* Auto adjust new ratio by hnyls2002 in https://github.com/sgl-project/sglang/pull/708
* Fix prefill size by Ying1123 in https://github.com/sgl-project/sglang/pull/711
* docs: update README by zhyncs in https://github.com/sgl-project/sglang/pull/712
* docs: update doc by zhyncs in https://github.com/sgl-project/sglang/pull/713
* fix: llama 3.1 405b fp8 by zhyncs in https://github.com/sgl-project/sglang/pull/714
* misc: update doc by zhyncs in https://github.com/sgl-project/sglang/pull/715
* Improve benchmark scripts by Ying1123 in https://github.com/sgl-project/sglang/pull/717
* Bump version to 0.1.24 by Ying1123 in https://github.com/sgl-project/sglang/pull/718
* docs: update supported models by zhyncs in https://github.com/sgl-project/sglang/pull/719
* docs: update comment by zhyncs in https://github.com/sgl-project/sglang/pull/721
* chore: add close inactive issues workflow by zhyncs in https://github.com/sgl-project/sglang/pull/722
* misc: update build instruction by zhyncs in https://github.com/sgl-project/sglang/pull/724
* fix: fp8 config by Ying1123 in https://github.com/sgl-project/sglang/pull/723
* Fix dockerfile and triton cache manager by hnyls2002 in https://github.com/sgl-project/sglang/pull/720
* chore: bump v0.1.25 by zhyncs in https://github.com/sgl-project/sglang/pull/725
* fix: resolve the logo display issue on the PyPI page by zhyncs in https://github.com/sgl-project/sglang/pull/726
* misc: update bug issue template by zhyncs in https://github.com/sgl-project/sglang/pull/727
* Revert "fix: fp8 config" by Ying1123 in https://github.com/sgl-project/sglang/pull/728
* Fix bugs (fp8 checkpoints, triton cache manager) by Ying1123 in https://github.com/sgl-project/sglang/pull/729
* Bump version to 0.2.0 by Ying1123 in https://github.com/sgl-project/sglang/pull/730
New Contributors
* yileld made their first contribution in https://github.com/sgl-project/sglang/pull/630
* AidanCooper made their first contribution in https://github.com/sgl-project/sglang/pull/624
* zhyncs made their first contribution in https://github.com/sgl-project/sglang/pull/636
* shrirajh made their first contribution in https://github.com/sgl-project/sglang/pull/654
* yichuan520030910320 made their first contribution in https://github.com/sgl-project/sglang/pull/640
* max99x made their first contribution in https://github.com/sgl-project/sglang/pull/684
**Full Changelog**: https://github.com/sgl-project/sglang/compare/v0.1.20...v0.2.0