- Support Pixtral
- Refactoring for more multimodal support
- Faster filter evaluation
- Various optimizations and bugfixes
- Various quality of life improvements

- No longer use safetensors for loading weights (fix virtual memory issues on Windows especially)
- Disable fasttensors option (now redundant)
- Prioritize HF Tokenizers model when both HF and SPM models available
- Add XTC sampler
- Add YaRN support
- Various fixes and QoL improvements

- TP: fallback SDPA mode when flash-attn is unavailable
- Faster filter/grammar path
- Add DRY
- Fix issues since 0.1.9 (streams/graphs) when loading certain models via Tabby
- Banish Râul

- Add experimental tensor-parallel mode. Currently supports Llama(1+2+3), Qwen2 and Mistral models
- CUDA Graphs to reduce overhead and CPU bottlenecking
- Various other optimizations
- Some bugfixes