ScaleLLM

Latest version: v0.2.2

0.0.9

Major changes
* Enabled speculative decoding and updated the README (see the sketch below)
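
Speculative decoding pairs the serving model with a cheap draft model: the draft proposes a few tokens autoregressively, the target verifies all of them in a single batched forward pass, and rejection sampling guarantees the output still follows the target's distribution. Below is a minimal sketch of that loop, assuming hypothetical `draft_model`/`target_model` callables rather than ScaleLLM's real classes:

```python
import torch

def speculative_step(draft_model, target_model, input_ids, k=4):
    """One speculative-decoding step. `draft_model(ids)` and
    `target_model(ids, num_positions=...)` are hypothetical callables
    returning next-token probability rows; this sketches the technique,
    not ScaleLLM's actual interface."""
    # 1. Draft phase: sample k proposal tokens autoregressively.
    ids, proposal, draft_probs = input_ids, [], []
    for _ in range(k):
        p = draft_model(ids)                       # shape (vocab,)
        tok = torch.multinomial(p, 1)
        proposal.append(tok)
        draft_probs.append(p)
        ids = torch.cat([ids, tok])

    # 2. Verify phase: one target pass yields a distribution per position.
    target_probs = target_model(ids, num_positions=k + 1)  # (k+1, vocab)

    # 3. Rejection sampling: accept token t with probability min(1, q(t)/p(t)).
    accepted = []
    for i, tok in enumerate(proposal):
        q, p = target_probs[i][tok], draft_probs[i][tok]
        if torch.rand(()) < torch.clamp(q / p, max=1.0):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(q - p, 0), renormalized.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1))
            break
    else:
        # All k proposals accepted: take a bonus token from the last position.
        accepted.append(torch.multinomial(target_probs[k], 1))
    return torch.cat(accepted)
```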

What's Changed
* [refactor] add implicit conversion between slice and vector by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/134
* [refactor] change tokenizer special tokens from token to token + id. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/135
* [feat] support tensor parallelism for MQA/GQA models when num_kv_heads < world_size by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/137 (see the sharding sketch after this list)
* [refactor] refactoring for sequence by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/140
* [unittest] added more unittests for speculative decoding by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/141
* [unittest] added more unittests for pos_embedding, sampler and rejection_sampler. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/142
* [feat] added support for kv_cache with different strides. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/143
* [feat] enable speculative decoding and update readme by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/145
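
PR #137 above covers GQA/MQA models that have fewer KV heads than tensor-parallel ranks, where a KV head must be replicated across several ranks instead of split. A minimal sketch of picking one rank's slice of a KV projection weight under that scheme; the function name, shapes, and replication layout are assumptions for illustration:

```python
import torch

def shard_kv_heads(k_proj: torch.Tensor, num_kv_heads: int,
                   world_size: int, rank: int) -> torch.Tensor:
    """Select this rank's rows of a (num_kv_heads * head_dim, hidden)
    KV projection weight, replicating heads when num_kv_heads < world_size."""
    head_dim = k_proj.shape[0] // num_kv_heads
    if num_kv_heads >= world_size:
        # Standard tensor parallelism: each rank owns a contiguous head group.
        heads_per_rank = num_kv_heads // world_size
        start = rank * heads_per_rank * head_dim
        return k_proj[start:start + heads_per_rank * head_dim]
    # Fewer KV heads than ranks: each head serves world_size // num_kv_heads
    # consecutive ranks, so several ranks hold identical copies.
    replicas_per_head = world_size // num_kv_heads
    head = rank // replicas_per_head
    return k_proj[head * head_dim:(head + 1) * head_dim]

# Example: 2 KV heads, 4-way tensor parallelism -> each head is held by 2 ranks.
w = torch.randn(2 * 128, 4096)
print([shard_kv_heads(w, 2, 4, r).shape for r in range(4)])
```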


**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.8...v0.0.9

0.0.8

Major changes
* Added Meta Llama 3 and Google Gemma support
* Added CUDA graph support for decoding (see the sketch below)
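
CUDA graphs cut per-kernel launch overhead in the decode phase by capturing the single-token forward once and replaying it with fixed shapes and buffers. A minimal capture/replay sketch using PyTorch's documented `torch.cuda.CUDAGraph` API; the Linear layer stands in for the real decode forward, and ScaleLLM additionally fixes KV-cache pointers and pads batches to pre-captured graph sizes:

```python
import torch

model = torch.nn.Linear(4096, 32000).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")    # fixed (batch, hidden)

with torch.no_grad():
    # Warm up on a side stream so capture sees a steady-state allocator.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_out = model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)              # recorded, not executed

# Each decode step: refill the static input buffer, then replay the graph.
static_in.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()                                     # results land in static_out
print(static_out.shape)
```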

What's Changed
* [model] added support for Google Gemma-2b model by 936187425 in https://github.com/vectorch-ai/ScaleLLM/pull/103
* [feat] added rms norm residual kernel by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/125
* [fix] fix data accuracy issue for gemma by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/126
* [refactor] added options for LLMEngine, SpeculativeEngine and Scheduler. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/127
* [feat] enable cuda graph for decoding by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/129
* [bugfix] fix cuda graph capture issue for tensor parallelism by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/130
* [feat] optimize batch size for cuda graph by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/132

New Contributors
* 936187425 made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/103

**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.7...v0.0.8

0.0.7

Major changes
* Dynamic prefix cache (see the sketch below)
* Dynamic split-fuse scheduler
* Speculative decoding
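
The dynamic prefix cache lets sequences that share a prompt prefix reuse the same KV-cache blocks instead of recomputing them, with LRU eviction reclaiming cold blocks (PRs #86 and #89 below). A minimal sketch of the idea; block size, hashing, and eviction details here are illustrative assumptions, not ScaleLLM's implementation:

```python
from collections import OrderedDict

class PrefixCache:
    """Block-level prefix cache with LRU eviction. Keys hash the full
    prefix up to each block boundary, so a block is only shared when
    everything before it matches too."""

    def __init__(self, block_size=16, capacity=1024):
        self.block_size = block_size
        self.capacity = capacity
        self.blocks = OrderedDict()      # prefix hash -> kv block id

    def match(self, token_ids):
        """Return kv block ids for the longest cached prefix."""
        hits = []
        for end in range(self.block_size, len(token_ids) + 1, self.block_size):
            key = hash(tuple(token_ids[:end]))
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)         # mark recently used
            hits.append(self.blocks[key])
        return hits

    def insert(self, token_ids, block_ids):
        """Register the fully filled blocks of a finished prefill."""
        for n, block_id in enumerate(block_ids):
            end = (n + 1) * self.block_size
            if end > len(token_ids):
                break                            # skip the partial tail block
            self.blocks[hash(tuple(token_ids[:end]))] = block_id
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
```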

What's Changed
* [feat] add support for cudagraph and its unit test. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/79
* [feat] add block id lifecycle management for block sharing scenarios. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/85
* [feat] added prefix cache to share kv cache across sequences. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/86
* [feat] enable prefix cache in block manager by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/87
* [feat] added LRU policy into prefix cache. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/89
* [refactor] move batch related logic into a class by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/90
* [fix] replace submodules git path with https path to avoid permission issue. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/92
* [feat] add max tokens to process to support dynamic split-fuse by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/93
* [feat] return prompt string directly in echo mode to avoid decode cost and avoid showing appended prefix tokens. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/94
* [fix] added small page size support for flash attention. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/95
* [fix] adjust kv_cache_pos to give at least one token to generate logits by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/96
* added layernorm benchmark by dongxianzhe in https://github.com/vectorch-ai/ScaleLLM/pull/97
* [feat] added dynamic split-fuse support in continuous scheduler by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/98
* [refactor] move model output process logic into batch by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/99
* [feat] added engine type to allow LLM and SSM share sequence. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/100
* [feat] added speculative engine class without implementation. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/101
* [refactor] moved top_k and top_p from sampler to logits process. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/102
* [workflow] added clang-format workflow by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/105
* [fix] only run git-clang-format against c/c++ files by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/106
* [feat] added prompt blocks sharing across n sequences by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/107
* [feat] Added selected tokens to return logits from model execution. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/109
* [feat] added rejection sampler for speculative decoding. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/112
* [feat] enable speculative decoding for simple server by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/113
* [feat] mask out rejected tokens with -1 in Rejection Sampler by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/114
* [feat] added sampling support for multiple query decoding by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/115
* [feat] added stream support for n > 1 scenarios by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/116
* [feat] enable speculative decoding for scalellm. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/117
* [feat] cancel request if rpc is not ok by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/118
* [fix] put finish reason into a separate response by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/119
* [feat] added skip_special_tokens support for tokenizers by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/120

New Contributors
* dongxianzhe made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/97

**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.6...v0.0.7

0.0.6

Major changes
* Introduced new kernels to improve efficiency.
* Implemented an initial Python wrapper to simplify integration.
* Added new models, including Baichuan2 and ChatGLM.
* Added support for Jinja chat templates (see the sketch below).
* Added usage statistics to responses for compatibility with the OpenAI API.
* Enabled ccache to speed up builds.
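
A chat template is just a Jinja program rendered over the message list to produce the model's prompt string. A minimal sketch with the `jinja2` package; the template string below is a generic example, not a ScaleLLM default:

```python
from jinja2 import Template

# Generic role-tagged template; real models ship their own template strings.
CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "<|{{ m['role'] }}|>\n{{ m['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is ScaleLLM?"},
]
prompt = Template(CHAT_TEMPLATE).render(messages=messages)
print(prompt)
```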

What's Changed
* add timestamp into ccache cache key by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/42
* use ${GITHUB_SHA} in cache key by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/43
* replace GITHUB_SHA with ${{ github.sha }} by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/44
* encapsulate class of time for performance tracking. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/46
* upgrade paged_atten kernel to v0.2.7 by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/47
* [feat] add speculative decoding. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/50
* added a new attention kernel for speculative decoding by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/52
* added support for small page size. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/53
* enable flash decoding for both prefill and decode phase. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/54
* enable split-k for flash decoding and fix bugs. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/59
* [ut] add unit tests for speculative scheduler. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/57
* added a custom command to generate instantiation for flashinfer by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/61
* add custom command to generate instantiation for flash-attn by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/62
* added gpu memory profiling to decide kv cache size precisely. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/63
* moved attention related files into attention subfolder by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/65
* add pybind11 to support python user interface. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/64
* added support to build python wrapper with installed pytorch (pre-cxx11 ABI) by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/66
* merge huggingface tokenizers and safetensors rust projects into one. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/67
* more changes to support python wrapper by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/68
* [feat] added attention handler for different implementations by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/71
* [perf] enabled speed up for gqa and mqa decoding. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/72
* [perf] use a separate cuda stream for kv cache by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/73
* [models] added baichuan/baichuan2 model support. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/70
* [minor] cleanup redundant code for models. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/74
* [feat] moved rope logic into attention handler to support applying positional embedding on the fly by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/76
* [refactor] replace dtype and device with options since they are used together usually by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/77
* [refactor] move cutlass and flashinfer into third_party folder by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/78
* [refactor] split model forward function into two: 1) get hidden states, 2) get logits from hidden states by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/80
* [models] support both baichuan and baichuan2 by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/81
* [models] fix chatglm model issue. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/82


**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.5...v0.0.6

0.0.5

Major changes
* Added Qwen, ChatGLM and Phi2 support.
* Added tiktoken tokenizer support (see the sketch below).
* Enabled more custom kernels for sampling.
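
tiktoken is OpenAI's BPE tokenizer library; supporting it means models whose tokenizer ships as a tiktoken encoding can be served directly. A minimal round-trip with the public `tiktoken` API (the encoding name is only a familiar example, not what a given model config selects):

```python
import tiktoken

# Load a BPE encoding, then encode and decode a sample string.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("ScaleLLM supports tiktoken tokenizers.")
print(ids)
print(enc.decode(ids))
```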

What's Changed
* [docs] add speculative decoding design docs. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/33
* [docs] add devel image in CONTRIBUTING.md. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/35
* [refactor] rename Executor to ThreadPool. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/36

New Contributors
* liutongxuan made their first contribution in https://github.com/vectorch-ai/ScaleLLM/pull/33

**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.4...v0.0.5

0.0.4

Major changes
* Added a Docker image build for [CUDA 11.8](https://hub.docker.com/r/vectorchai/scalellm_cu118/tags).
* Added exception-handling logic to the HTTP server.

**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.3-fix...v0.0.4

v0.0.3-fix
* Added support for the [Yi Chat Model](https://huggingface.co/01-ai/Yi-34B-Chat).
* Added support for overriding args.
* Replaced libevhtp with Boost.Asio in the HTTP server to fix an `epoll_wait` not-implemented error on old Linux kernels.

**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.2...v0.0.3-fix
