- Fix dynamic generator fallback mode (was broken for prompts longer than max_input_len)
- Fix inference on ROCm wave64 devices
- Made model conversion script part of `exllamav2` package
- CPU optimizations

- Added Q6 and Q8 cache modes
- Defragment cache in dynamic generator
- Use SDPA with Torch 2.3.0+
- Updated wheels to Torch 2.3.1
- Added Python 3.12 wheels, plus Python 3.9 for ROCm

- Option to keep calibration states in VRAM while measuring
- Fix for Q4 cache for odd key/value sizes (MiniCPM specifically)
- Alternative `fasttensors` option on Windows to solve system memory issues
- Prefix filter with multiple prefixes