ScaleLLM is a high-performance inference system for large language models, designed for production environments. It supports most popular open-source models, including Llama2, Bloom, GPT-NeoX, and more.
## [Key Features](https://github.com/vectorch-ai/ScaleLLM#key-features)
* High Performance: ScaleLLM is optimized for high-performance LLM inference.
* Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
* OpenAI-compatible API: An efficient Golang REST API server compatible with the OpenAI API.
* Hugging Face Models Integration: Seamless integration with most popular HF models.
* Customizable: Offers flexibility for customization to meet your specific needs.
* Production Ready: Designed to be deployed in production environments.
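Because the server speaks the OpenAI wire format, any standard OpenAI client or plain HTTP works against it. A minimal sketch of a chat-completion request body follows; the URL, port, and model name are illustrative assumptions, not ScaleLLM defaults, so adjust them to your deployment.

```python
import json

# Hypothetical endpoint -- replace with your ScaleLLM server's address.
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-2-7b",  # any model the server has loaded
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."}
    ],
    "temperature": 0.7,
}

# The body is ordinary OpenAI-style JSON, so existing OpenAI SDKs work
# unchanged once pointed at the server's base URL.
body = json.dumps(payload)
print(body)
```

Since the request shape is standard, tools built for the OpenAI API (SDKs, proxies, playgrounds) can be reused without modification.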
## [Supported Models](https://github.com/vectorch-ai/ScaleLLM#supported-models)
| Models   | Tensor Parallel | Quantization | HF model examples |
|----------|-----------------|--------------|-------------------|
| Llama2   | Yes | Yes | meta-llama/Llama-2-7b, TheBloke/Llama-2-13B-chat-GPTQ, TheBloke/Llama-2-70B-AWQ |
| Aquila   | Yes | Yes | BAAI/Aquila-7B, BAAI/AquilaChat-7B |
| Bloom    | Yes | Yes | bigscience/bloom |
| GPT-J    | Yes | Yes | EleutherAI/gpt-j-6b |
| GPT-NeoX | Yes | --  | EleutherAI/gpt-neox-20b |
| GPT2     | Yes | --  | gpt2 |
| InternLM | Yes | Yes | internlm/internlm-7b |
| Mistral  | Yes | Yes | mistralai/Mistral-7B-v0.1 |
| MPT      | Yes | Yes | mosaicml/mpt-30b |
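The Tensor Parallel column above means each model's weight matrices can be sharded across devices so that every device computes only a slice of each layer. The following NumPy sketch illustrates the idea with a column-parallel matrix multiply; it is a conceptual example under assumed shapes, not ScaleLLM's actual implementation.

```python
import numpy as np

# Illustrative shapes only: a batch of activations and one full weight matrix.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 6))   # full linear-layer weight, 6 output features

# Shard W column-wise across 2 "devices": each holds half the output features.
W0, W1 = np.split(W, 2, axis=1)

# Each device computes its partial output independently...
y0 = x @ W0
y1 = x @ W1

# ...and an all-gather concatenates the shards into the full result.
y = np.concatenate([y0, y1], axis=1)

# The sharded computation matches the unsharded one.
assert np.allclose(y, x @ W)
```

Splitting this way lets each device store and multiply only its own weight shard, which is what makes large models fit across multiple GPUs.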