What's Changed
A new release, one that once again took longer than it should have, but it brings some cool new features.
- **ExllamaV2 tensor parallel**: You can now run ExllamaV2-quantized models on multiple GPUs. This should be the fastest multi-GPU experience for ExllamaV2 models.
- **Support for Command-R+**
- **Support for DBRX**
- **Support for Llama-3**
- **Support for Qwen 2 MoE**
- **`min_tokens` sampling param**: You can now set a minimum number of tokens to generate (see the sketch after this list).
- **Fused MoE for AWQ and GPTQ quants**: The AWQ and GPTQ kernels have been updated with optimized fused MoE code, so MoE models using these quants should be significantly faster now.
- **CMake build system**: Slightly faster, much cleaner builds.
- **CPU support**: You can now run Aphrodite on CPU-only systems! For now, this needs an AVX512-compatible CPU.
- **Speculative Decoding**: Speculative decoding is finally here! You can use either a draft model or the built-in prompt-lookup (n-gram) decoding (see the sketch after this list).
- **Chunked Prefill**: Previously, Aphrodite processed each prompt in a single chunk of up to the model's full context length. Now you can enable this option (via `--enable-chunked-prefill`) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit (see the sketch after this list). Does not currently work with context shift or FP8 KV cache.
- **Context Shift reworked**: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and re-use them.
- **FP8 E4M3 KV Cache**: This is for ROCm only; support will be extended to NVIDIA soon. E4M3 offers higher quality than E5M2, but doesn't lead to any throughput increase.
- **Auto-truncation in API**: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` in a request to truncate any prompt longer than 1024 tokens (see the request sketch after this list).
- **Support for Llava vision models**: Currently, Llava 1.5 is supported. The next release should bring 1.6 along with a proper GPT4-V-compatible API.
- **LM Format Enforcer**: You can now use LMFE for guided generation.
- **EETQ Quantization**: Support has been added for EETQ, a SOTA 8-bit quantization method.
- **Arbitrary GGUF model support**: GGUF support was previously limited to Llama models; now any GGUF model is supported. You will still need to convert the model beforehand, however.
- **Aphrodite CLI app**: You no longer have to type `python -m aphrodite...`. Simply type `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- **Sharded GGUF support**: You can now load sharded GGUF models. Pre-conversion needed.
- **NVIDIA P100/GP100 support**: Support has been restored.
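A minimal offline-inference sketch for the new `min_tokens` parameter, assuming Aphrodite keeps the vLLM-style `LLM`/`SamplingParams` Python API (the model name is just an example):

```python
from aphrodite import LLM, SamplingParams  # assumed vLLM-style entry points

llm = LLM(model="meta-llama/Meta-Llama-3-8B")  # example model

# min_tokens keeps EOS/stop tokens suppressed until at least 32 tokens
# have been generated; max_tokens still caps the total output length.
params = SamplingParams(min_tokens=32, max_tokens=256, temperature=0.8)

out = llm.generate(["Once upon a time"], params)
print(out[0].outputs[0].text)
```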
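For speculative decoding, here is a sketch of the built-in prompt-lookup (n-gram) mode, assuming Aphrodite mirrors vLLM's engine arguments; the exact argument names here are an assumption:

```python
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",  # example target model
    speculative_model="[ngram]",         # built-in prompt-lookup decoding (assumed name)
    num_speculative_tokens=5,            # tokens proposed per step
    ngram_prompt_lookup_max=4,           # max n-gram length to match in the prompt
)
out = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```

Prompt lookup needs no separate draft model, which makes it the easier mode to try first; a draft model can be passed in place of `"[ngram]"` if you have a small model sharing the target's vocabulary.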
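Chunked prefill is enabled with `--enable-chunked-prefill` on the server; in Python, the equivalent engine arguments would look like this sketch (argument names assumed to mirror vLLM's):

```python
from aphrodite import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",  # example model
    enable_chunked_prefill=True,
    # token budget per prefill chunk; 768 is the stated default
    max_num_batched_tokens=768,
)
```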
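And a request-side sketch for auto-truncation, assuming the usual OpenAI-compatible completions route; the host, port, and model name are examples:

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",  # example host/port
    json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "A very long prompt ...",
        "max_tokens": 64,
        # left-truncate: keep only the last 1024 prompt tokens
        "truncate_prompt_tokens": 1024,
    },
)
print(resp.json()["choices"][0]["text"])
```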
Thanks to all the new contributors!
**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.5.2...v0.5.3