Aphrodite-engine

Latest version: v0.5.1

0.5.1

What's Changed
* feat(openai): Apply chat template for GGUF loader by drummerv in https://github.com/PygmalionAI/aphrodite-engine/pull/312
* Calculate total memory usage. by sgsdxzy in https://github.com/PygmalionAI/aphrodite-engine/pull/316
* chore: add new iMatrix quants by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/320
* fix: optimize AQLM dequantization by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/325

New Contributors
* drummerv made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/312

**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.5.0...v0.5.1

0.5.0

Aphrodite Engine, Release v0.5.0: It's Quantin' Time Edition

It's been over a month since our last release. The notes below were rewritten with Opus from my crude hand-written release notes.

New Features

- **Exllamav2 Quantization**: Exllamav2 quantization has been added, although it's currently limited to a single GPU due to kernel constraints.

- **On-the-Fly Quantization**: With the help of `bitsandbytes` and `smoothquant+`, we now support on-the-fly quantization of FP16 models. Use `--load-in-4bit` for lightning-fast 4-bit quantization with `smoothquant+`, `--load-in-smooth` for 8-bit quantization using `smoothquant+`, and `--load-in-8bit` for 8-bit quantization using the `bitsandbytes` library (note: this option is quite slow). `--load-in-4bit` needs Ampere GPUs and above, the other two need Turing and above.

- **Marlin Quantization**: Marlin quantization support has arrived, promising improved speeds at high batch sizes. Convert your GPTQ models to Marlin, but keep in mind that they must be 4-bit, with a group_size of -1 or 128, and act_order set to False.

- **AQLM Quantization**: We now support the state-of-the-art 2-bit quantization scheme, AQLM. Please note that both quantization and inference are extremely slow with this method. Quantizing llama-2 70b on 8x A100s reportedly takes 12 days, and on a single 3090 it takes 70 seconds to reach the prompt processing phase. Use this option with caution, as the long wait may cause the engine to time out (the timeout is set to 60 seconds).

- **INT8 KV Cache Quantization**: In addition to fp8_e5m2, we now support INT8 KV Cache. Unlike FP8, it doesn't improve throughput (which stays the same), but it should offer higher quality thanks to the calibration process. It uses the `smoothquant` algorithm for quantization.

- **Implicit GGUF Model Conversion**: Simply point the `--model` flag to your GGUF file, and it will work out of the box. Be aware that this process requires a considerable amount of RAM to load the model, convert tensors to a PyTorch state_dict, and then load them. Plan accordingly or convert first if you're short on RAM.

- **LoRA support in the API**: The API now supports loading LoRAs and running inference with them! Please refer to the wiki for detailed instructions.

- **New Model Support**: We've added support for a wide range of models, including OPT, Baichuan, Bloom, ChatGLM, Falcon, Gemma, GPT2, GPT Bigcode, InternLM2, MPT, OLMo, Qwen, Qwen2, and StableLM.

- **Fused Mixtral MoE**: Mixtral models (FP16 only) now utilize tensor parallelism with fused kernels, replacing the previous expert parallelism approach. Quantized Mixtrals still have this limitation, but we plan to address it by the next release.

- **Fused Top-K Kernels for MoE**: This improvement benefits Mixtral and DeepSeek-MoE models by accelerating the top-k operation using custom CUDA kernels instead of `torch.topk`.

- **Enhanced OpenAI Endpoint**: The OpenAI endpoint has been refactored, introducing JSON and Regex schemas, as well as a detokenization endpoint.

- **LoRA Support for Mixtral Models**: You can now use LoRA with Mixtral models.

- **Fine-Grained Seeds**: Control the randomness of individual requests with per-request seeds.

- **Context Shift**: We have a naive context shifting mechanism. While it's not as effective as we'd like, it's available for experimentation purposes. Enable it using the `--context-shift` flag.

- **Cubic Sampling**: Building upon quadratic sampling's smoothing_factor, we now support smoothing_curve.

- **Navi AMD GPU Support**: GPUs like the 7900 XTX are now supported, although support is still experimental and requires significant compilation effort due to xformers.

- **Kobold API Deprecation**: The standalone Kobold API has been deprecated and merged into the OpenAI API. Launch the OpenAI API with the `--launch-kobold-api` flag to enable the Kobold routes. Please note that Kobold routes are not protected with the API key.

- **LoRA Support for Quantized Models**: We've added LoRA support for GPTQ and AWQ quantized models.

- **Logging Experience Overhaul**: We've revamped the logging experience using a custom `loguru` class, inspired by tabbyAPI's recent changes.

- **Informative Logging Metrics**: Logging has been enhanced to display model memory usage and reduce display bloat, among other improvements.

- **Ray Worker Health Check**: The engine now performs health checks on Ray workers, promptly reporting any silent failures or timeouts.
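
The hardware requirements for the on-the-fly quantization flags listed above can be captured in a small pre-flight check. This is a minimal sketch and not part of Aphrodite: the helper name `pick_load_flag` and the `mode` labels are hypothetical, and it assumes the usual CUDA compute capabilities (7.5 for Turing, 8.0 and up for Ampere and newer). Only the three flag names come from the release notes.

```python
from typing import Tuple

# Minimum CUDA compute capability per flag, following the notes above:
# --load-in-4bit needs Ampere or newer, the other two need Turing or newer.
# Assumption: Turing is (7, 5) and Ampere starts at (8, 0).
MIN_CAPABILITY = {
    "--load-in-4bit": (8, 0),
    "--load-in-smooth": (7, 5),
    "--load-in-8bit": (7, 5),
}

# Hypothetical short labels mapped to the flags named in the release notes.
MODE_TO_FLAG = {
    "4bit": "--load-in-4bit",      # 4-bit via smoothquant+
    "smooth": "--load-in-smooth",  # 8-bit via smoothquant+
    "8bit": "--load-in-8bit",      # 8-bit via bitsandbytes (slow)
}


def pick_load_flag(mode: str, capability: Tuple[int, int]) -> str:
    """Return the CLI flag for `mode`, or raise if the GPU is too old."""
    flag = MODE_TO_FLAG[mode]
    required = MIN_CAPABILITY[flag]
    # Tuples compare element-wise, so (7, 5) < (8, 0) is True.
    if capability < required:
        raise ValueError(
            f"{flag} needs compute capability >= {required}, got {capability}"
        )
    return flag
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple this helper expects.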

Bug Fixes

- Resolved an issue where `smoothing_factor` would break at high batch sizes.
- Fixed a bug with LoRA vocab embeddings.
- Addressed the missing CUDA suffixes in the version number (e.g., `0.5.0+cu118`). The suffix is now appended when using a CUDA version other than 12.1.
- Dynatemp is now specified as separate min/max values instead of a range. The Kobold endpoint still accepts a range as input.
- Fixed worker initialization in WSL.
- Removed the accidental inclusion of FP8 kernels in the ROCm build process.
- The EOS token is now removed from the output by default, independently of the API.
- Resolved memory leaks caused by NCCL CUDA graphs.
- Improved garbage collection for LoRAs.
- Optimized the execution of embedded runtime scripts.
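
The CUDA suffix rule from the list above (append a suffix such as `+cu118` unless the build uses CUDA 12.1) can be sketched as follows. The function name and signature are hypothetical and are not taken from Aphrodite's build scripts; the suffix itself is an ordinary PEP 440 local version label.

```python
def version_with_cuda_suffix(
    base_version: str, cuda_version: str, default_cuda: str = "12.1"
) -> str:
    """Append a +cuXYZ local version label for non-default CUDA builds."""
    if cuda_version == default_cuda:
        # Builds against the default CUDA version carry no suffix.
        return base_version
    # Keep only major and minor, e.g. "11.8" -> "118".
    major, minor = cuda_version.split(".")[:2]
    return f"{base_version}+cu{major}{minor}"


print(version_with_cuda_suffix("0.5.0", "11.8"))  # 0.5.0+cu118
print(version_with_cuda_suffix("0.5.0", "12.1"))  # 0.5.0
```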

Upcoming Improvements

Here's a sneak peek at what we're working on for the next release:

- Investigating tensor parallelism with Exllamav2
- Addressing the issue of missing GPU blocks for GGUF and Exl2 (we already have a fix for FP16, GPTQ, and AWQ)

New Contributors
* anon998 made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/253
* sgsdxzy made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/256
* SwadicalRag made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/268
* thomas-xin made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/260
* StefanDanielSchwarz made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/264
* Pyroserenus made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/296
* Autumnlight02 made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/288

**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.4.9...v0.5.0

0.4.9

Another hotfix to follow v0.4.8. Fixed issues with GGUF and added the Hadamard tensors back to the wheel.

**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.4.8...v0.4.9

0.4.8

**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.4.7...v0.4.8

Quick hotfix to v0.4.7, as it wasn't including LoRAs in the wheel.

0.4.7

What's Changed
Lots of new additions, after a long time.

New features and additions
- Dynamic Temperature. (StefanGliga)
- Switch from Ray to NCCL for control-plane communications. Massive speedup for parallelism
- Support for prefix caching. Needs to be sent as `prefix + prompt`. Not in API servers yet
- Support for S-LoRA. Basically, load multiple LoRAs and pick one for inference. Not in the API servers yet
- Speed up AWQ throughput with new kernels - close to GPTQ speeds now
- Custom all-reduce kernels for parallelism. Massive improvements to throughput - parallel setups are now faster than single-gpu setups even at low batch sizes
- Add Context-Free Grammar support. EBNF format is currently supported
- Add GGUF support
- Add QuIP support
- Add Marlin support
- Add `/metrics` for Kobold server
- Add Quadratic Sampler
- Add Deepseek-MoE support with fused kernels
- Add Grafana + Prometheus production monitoring support
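
Dynamic Temperature, from the list above, adjusts the sampling temperature per token. The notes do not give the formula, so the sketch below uses one common entropy-based formulation, which may differ from what Aphrodite actually implements. The function name and the `exponent` parameter are assumptions made for illustration.

```python
import math
from typing import Sequence


def dynamic_temperature(
    logits: Sequence[float],
    min_temp: float,
    max_temp: float,
    exponent: float = 1.0,
) -> float:
    """Map the entropy of the token distribution onto [min_temp, max_temp]."""
    # Numerically stable softmax at temperature 1.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Shannon entropy; the maximum, log(n), is reached by a uniform distribution.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(logits))
    if max_entropy == 0.0:
        return min_temp  # a single candidate token has no uncertainty

    normalized = entropy / max_entropy
    # Confident (low-entropy) distributions get a temperature near min_temp,
    # uncertain (high-entropy) ones a temperature near max_temp.
    return min_temp + (max_temp - min_temp) * normalized ** exponent
```

The resulting temperature would then divide the logits before sampling, exactly as a fixed temperature does.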

Bug Fixes and Small Optimizations
- Fix temperature always being set to 1
- Fix server crash when logprobs contained NaN or -inf (miku448)
- Switch to deques in the scheduler instead of lists. Reduces complexity from quadratic to linear
- Fix eager_mode performance by not excessively padding for every iteration
- Optimize memory usage with CUDA graph by tying `max_num_seqs` with the captured batch size. Lower the value to lower memory usage
- Fix both safetensors and PyTorch bins being downloaded
- Fix crash with `max_tokens=None`
- Fix multi-gpu on WSL
- Fix some outputs returning token_id=0 at high concurrency (50h100a)
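
The scheduler change from lists to deques in the list above comes down to the cost of removing items from the front of a queue. The sketch below is a generic illustration, not Aphrodite's scheduler code, and the function names are hypothetical: `list.pop(0)` shifts every remaining element, so draining a queue of n items is quadratic, while `deque.popleft()` is constant time per item, so the drain is linear.

```python
from collections import deque
from typing import List


def drain_with_list(n: int) -> List[int]:
    """Drain a FIFO queue backed by a list: O(n) per pop, O(n^2) overall."""
    waiting = list(range(n))
    order = []
    while waiting:
        order.append(waiting.pop(0))  # shifts all remaining elements left
    return order


def drain_with_deque(n: int) -> List[int]:
    """Drain a FIFO queue backed by a deque: O(1) per pop, O(n) overall."""
    waiting = deque(range(n))
    order = []
    while waiting:
        order.append(waiting.popleft())  # no shifting needed
    return order
```

Both functions return the items in the same order, so the swap changes only the running time and not the scheduling behavior.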

0.4.6

What's Changed
* Set CPU Affinity: Electric Boogaloo V2 by KaraKaraWitch in https://github.com/PygmalionAI/aphrodite-engine/pull/187
* chore: backlog 1 by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/191
* feat: support GPTQ 2, 3, and 8bit quants by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/181
* feat: FP8 KV Cache (ENG-4) by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/185
* feat: tokenizer endpoint for OpenAI API by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/195
* feat: rejection sampler by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/197
* feat: better mixtral parallelism by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/193
* fix: triton compile error by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/200
* feat: reduce sampler overhead by making it less blocking by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/198
* fix: test units by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/201
* merge branch 'dev' into 'main' by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/203
* feat: bump cuda to 12.1 by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/205
* bump version to 0.4.6 by AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/204

New Contributors
* KaraKaraWitch made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/187

**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.4.5...v0.4.6
