Major changes:
* Introduced new attention kernels, including flash decoding and a kernel for speculative decoding, to improve inference efficiency.
* Implemented an initial Python wrapper (via pybind11), simplifying integration and extending accessibility.
* Added support for new models, including Baichuan/Baichuan2 and ChatGLM.
* Added support for Jinja chat templates, enabling per-model prompt customization.
* Added usage statistics to responses for compatibility with the OpenAI API.
* Enabled ccache to speed up builds and shorten development cycles.
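The usage-statistics bullet above follows the OpenAI API convention, in which each response carries a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`. A minimal sketch of constructing such an object (the `make_usage` helper is hypothetical, not ScaleLLM's actual code; only the field names come from the OpenAI API):

```python
import json

def make_usage(prompt_tokens: int, completion_tokens: int) -> dict:
    """Build an OpenAI-compatible usage block for a response.

    Field names follow the OpenAI chat/completions "usage" object;
    the helper itself is an illustrative assumption.
    """
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

# Example: a request with 12 prompt tokens and 30 generated tokens.
print(json.dumps(make_usage(12, 30)))
```

Clients that already parse OpenAI responses can read this block unchanged, which is the point of matching the field names exactly.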
What's Changed
* add timestamp into ccache cache key by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/42
* use ${GITHUB_SHA} in cache key by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/43
* replace GITHUB_SHA with ${{ github.sha }} by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/44
* encapsulate class of time for performance tracking. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/46
* upgrade paged_atten kernel to v0.2.7 by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/47
* [feat] add speculative decoding. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/50
* added a new attention kernel for speculative decoding by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/52
* added support for small page size. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/53
* enable flash decoding for both prefill and decode phase. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/54
* enable split-k for flash decoding and fix bugs. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/59
* [ut] add unit tests for speculative scheduler. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/57
* added a custom command to generate instantiation for flashinfer by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/61
* add custom command to generate instantiation for flash-attn by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/62
* added gpu memory profiling to decide kv cache size precisely. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/63
* moved attention related files into attention subfolder by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/65
* add pybind11 to support python user interface. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/64
* added support to build python wrapper with installed pytorch ( pre-cxx11 abi) by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/66
* merge huggingface tokenizers and safetensors rust projects into one. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/67
* more changes to support python wrapper by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/68
* [feat] added attention handler for different implementations by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/71
* [perf] enabled speedup for GQA and MQA decoding. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/72
* [perf] use a separate cuda stream for kv cache by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/73
* [models] added baichuan/baichuan2 model support. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/70
* [minor] cleanup redundant code for models. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/74
* [feat] moved rope logic into attention handler to support applying positional embedding on the fly by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/76
* [refactor] replace dtype and device with options since they are used together usually by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/77
* [refactor] move cutlass and flashinfer into third_party folder by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/78
* [refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/80
* [models] support both baichuan and baichuan2 by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/81
* [models] fix chatglm model issue. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/82
**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.5...v0.0.6