Highlights
Blog post: https://huggingface.co/blog/EricB/mistralrs-v0-5-0
Thank you to all contributors for this release! This release includes the following highlights but also countless improvements, fixes, and optimizations.
- Support for many **more models**:
- Gemma 3
- Qwen 2.5 VL
- Mistral Small 3.1
- Phi 4 Multimodal (image only)
- **Native tool calling support** for:
- Llama 3.1/3.2/3.3
- Mistral Small 3
- Mistral Nemo
- Hermes 2 Pro
- Hermes 3
- **Tensor Parallelism** support (NCCL)!
- **FlashAttention V3** support and integration in PagedAttention
- **30x** reduction in ISQ times on Metal!
- Revamped **prefix cacher system**
What's Changed
* Allow using library in CurrentThread runtime by sgrebnov in https://github.com/EricLBuehler/mistral.rs/pull/1082
* Improve accuracy of uqff auto device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1084
* DeepSeekV3 sigmoid support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1092
* GPU-accelerated sampling (+5% decode perf) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1094
* Fix missing perceiver_config in qwen2vl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1096
* More topk methods for deepseek 2/3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1097
* More accurate layer size computation for deepseek 2/3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1098
* Improve streaming UX by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1102
* Faster fp8 blockwise dequant by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1100
* DS2/3 paged attn by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1103
* Faster bincount by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1104
* PagedAttention prompt chunking support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1105
* Refactor server SSE by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1107
* PagedAttention + FlashAttention (and FlashAttention V3) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1109
* Take KEEP_ALIVE_INTERVAL into account by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1111
* Refactor enable of flash attn by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1110
* Fix imatrix isq quantize_onto by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1112
* Tensor parallelism and pipeline parallelism by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1113
* Bump openssl from 0.10.69 to 0.10.70 by dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1121
* Allow chat streaming to use tools by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1088
* New file format for imatrix: `.cimatrix` by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1004
* Fix isq with bias for column parallel by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1128
* Multi-node support for tensor parallelism by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1125
* Add an NCCL feature flag by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1129
* Fix mistral 2501 gguf by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1131
* Add jinja strftime_now function by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1132
* Multiple models multi node by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1136
* Remove unexpected cp behavior by jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1141
* Revamp speculative decoding! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1027
* Fuse MLP mul-and-act by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1142
* Short-circuit dry sampling: +6% T/s by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1143
* Integrate fused MLP mul-act for more models! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1144
* Use cudarc 0.13.5 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1145
* Handle HF_HUB_CACHE env var by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1146
* FlashAttention V2/V3 metadata with support for device location by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1148
* FP8 blockwise dequant cuda kernel by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1149
* Blockwise FP8 CUDA for cc < 800 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1150
* Fix chat sampling response by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1154
* Multiple processes for TP by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1152
* Ensure we do not bind the port for daemon processes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1158
* Handle CUDA_NVCC_FLAGS in flash attn v3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1160
* build fix for arm. by jamesvren in https://github.com/EricLBuehler/mistral.rs/pull/1164
* Working PrefixCacherV2! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1168
* Implement Phi-4 Multimodal! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1163
* No extra split/cat pair in rope by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1169
* Remove gpu<>cpu sync for faster long-context by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1170
* Refactor NCCL device mappers by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1172
* Bump ring from 0.17.11 to 0.17.13 by dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1179
* DSV3/R1 fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1173
* Fix diffusion device mapping by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1187
* Internal abstraction for distributed op by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1188
* Make Sequence::set_toks more safe by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1190
* Fix CI tests out of storage? by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1191
* Internal abstraction for distributed op by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1189
* Fix build_cuda_all.yaml CI by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1193
* Support tensor parallelism for vision models! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1194
* Always pass _USE_MATH_DEFINES for CUDA by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1195
* Remove matmul via f16 framework by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1196
* Remove API for matmul_via_f16 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1197
* Add UQFF text/vision model API by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1198
* Complete qwen2_5_vl, and some fixes by brrr in https://github.com/EricLBuehler/mistral.rs/pull/1184
* Implement Gemma 3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1201
* Add Gemma 3 vision support! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1202
* Manually fixup sentencepiece detok by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1204
* More vision models with TP by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1200
* Fix topology link in the docs by etiennebalit in https://github.com/EricLBuehler/mistral.rs/pull/1205
* Gemma3 1b support and optimized rotating cache by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1206
* Improve rotating kv cache, prefix cacher system by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1207
* Better handling for kvcache set_len by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1208
* Update deps and use rand 0.9 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1210
* Update hf hub dep, add initial blockwise fp8 GEMM tests by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1212
* Growable RotatingKvCache and fixes for Phi-4 mini by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1215
* Gemma 3 cuda fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1217
* Add pydantic schema examples! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1219
* Sliding window attention fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1220
* adapt to rig crate as client by benliao in https://github.com/EricLBuehler/mistral.rs/pull/1214
* Implement Mistral 3! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1221
* Metal SDPA with masking by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1225
* Send [DONE] SSE chunk per openai spec by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1226
* Fix handling of device when compiled for but disabled nccl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1227
* Fix nccl blocking case by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1228
* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1229
* OpenAI API compatability fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1230
* [Breaking] Automatic server logging by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1231
* Use default stream for flash attn by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1232
* Bump version to 0.5.0 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1233
New Contributors
* sgrebnov made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1082
* jncraton made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1141
* jamesvren made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1164
* brrr made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1184
* etiennebalit made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1205
* benliao made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1214
**Full Changelog**: https://github.com/EricLBuehler/mistral.rs/compare/v0.4.0...v0.5.0