Mistralrs

Latest version: v0.5.0

Safety actively analyzes 722491 Python packages for vulnerabilities to keep your Python projects secure.

Page 1 of 7

1.79.0

What's Changed
* Update Dockerfile by Reckon-11 in https://github.com/EricLBuehler/mistral.rs/pull/895
* Add the Qwen2-VL model by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/894
* ISQ for mistralrs-bench by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/902
* Use tokenizers v0.20 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/904
* Fix metal sdpa for v stride by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/905
* Better parsing of the image path by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/906
* Add some Metal kernels for HQQ dequant by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/907
* Handle assistant messages with 'tool_calls' by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/824
* Attention-fused softmax for Metal by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/908
* Metal qmatmul mat-mat product (5.4x performance increase) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/909
* Support --dtype in mistralrs bench by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/911
* Metal: Use mtl resource shared to avoid one copy by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/914
* Preallocated KV cache by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/916
* Fixes for kv cache grow by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/917
* Dont always compile with fp8, bf16 for cuda by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/920
* Expand attnmask on cuda by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/923
* Faster CUDA prompt speeds by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/925
* Paged Attention alibi support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/926
* Default to SDPA for faster VLlama PP T/s by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/927
* VLlama vision model ISQ support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/928
* Support fp8 on Metal by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/930
* Bump rustls from 0.23.15 to 0.23.18 by dependabot in https://github.com/EricLBuehler/mistral.rs/pull/932
* Calculate perplexity of ISQ models by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/931
* Integrate fast MLX kernel for SDPA with long seqlen by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/933
* Always cast image to rgb8 for qwenvl2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/936
* Fix etag missing in hf hub by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/934
* Fix some examples for vllama 3.2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/937
* Improve memory efficency of vllama by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/938
* Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/939
* Expose a public tokenization API by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/940
* Prepare for v0.3.4 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/942

New Contributors
* Reckon-11 made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/895

**Full Changelog**: https://github.com/EricLBuehler/mistral.rs/compare/v0.3.2...v0.3.4

1.75

0.5.0

Highlights

Blog post: https://huggingface.co/blog/EricB/mistralrs-v0-5-0

Thank you to all contributors for this release! This release includes the following highlights but also countless improvements, fixes, and optimizations.

- Support for many **more models**:
- Gemma 3
- Qwen 2.5 VL
- Mistral Small 3.1
- Phi 4 Multimodal (image only)
- **Native tool calling support** for:
- Llama 3.1/3.2/3.3
- Mistral Small 3
- Mistral Nemo
- Hermes 2 Pro
- Hermes 3
- **Tensor Parallelism** support (NCCL)!
- **FlashAttention V3** support and integration in PagedAttention
- **30x** reduction in ISQ times on Metal!
- Revamped **prefix cacher system**

What's Changed
* Allow using library in CurrentThread runtime by sgrebnov in https://github.com/EricLBuehler/mistral.rs/pull/1082
* Improve accuracy of uqff auto device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1084
* DeepSeekV3 sigmoid support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1092
* GPU-accelerated sampling (+5% decode perf) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1094
* Fix missing perceiver_config in qwen2vl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1096
* More topk methods for deepseek 2/3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1097
* More accurate layer size computation for deepseek 2/3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1098
* Improve streaming UX by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1102
* Faster fp8 blockwise dequant by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1100
* DS2/3 paged attn by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1103
* Faster bincount by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1104
* PagedAttention prompt chunking support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1105
* Refactor server SSE by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1107
* PagedAttention + FlashAttention (and FlashAttention V3) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1109
* Take KEEP_ALIVE_INTERVAL into account by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1111
* Refactor enable of flash attn by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1110
* Fix imatrix isq quantize_onto by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1112
* Tensor parallelism and pipeline parallelism by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1113
* Bump openssl from 0.10.69 to 0.10.70 by dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1121
* Allow chat streaming to use tools by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1088
* New file format for imatrix: `.cimatrix` by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1004
* Fix isq with bias for column parallel by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1128
* Multi-node support for tensor parallelism by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1125
* Add an NCCL feature flag by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1129
* Fix mistral 2501 gguf by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1131
* Add jinja strftime_now function by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1132
* Multiple models multi node by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1136
* Remove unexpected cp behavior by jncraton in https://github.com/EricLBuehler/mistral.rs/pull/1141
* Revamp speculative decoding! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1027
* Fuse MLP mul-and-act by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1142
* Short-circuit dry sampling: +6% T/s by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1143
* Integrate fused MLP mul-act for more models! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1144
* Use cudarc 0.13.5 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1145
* Handle HF_HUB_CACHE env var by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1146
* FlashAttention V2/V3 metadata with support for device location by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1148
* FP8 blockwise dequant cuda kernel by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1149
* Blockwise FP8 CUDA for cc < 800 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1150
* Fix chat sampling response by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1154
* Multiple processes for TP by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1152
* Ensure we do not bind the port for daemon processes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1158
* Handle CUDA_NVCC_FLAGS in flash attn v3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1160
* build fix for arm. by jamesvren in https://github.com/EricLBuehler/mistral.rs/pull/1164
* Working PrefixCacherV2! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1168
* Implement Phi-4 Multimodal! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1163
* No extra split/cat pair in rope by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1169
* Remove gpu<>cpu sync for faster long-context by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1170
* Refactor NCCL device mappers by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1172
* Bump ring from 0.17.11 to 0.17.13 by dependabot in https://github.com/EricLBuehler/mistral.rs/pull/1179
* DSV3/R1 fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1173
* Fix diffusion device mapping by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1187
* Internal abstraction for distributed op by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1188
* Make Sequence::set_toks more safe by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1190
* Fix CI tests out of storage? by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1191
* Internal abstraction for distributed op by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1189
* Fix build_cuda_all.yaml CI by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1193
* Support tensor parallelism for vision models! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1194
* Always pass _USE_MATH_DEFINES for CUDA by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1195
* Remove matmul via f16 framework by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1196
* Remove API for matmul_via_f16 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1197
* Add UQFF text/vision model API by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1198
* Complete qwen2_5_vl, and some fixes by brrr in https://github.com/EricLBuehler/mistral.rs/pull/1184
* Implement Gemma 3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1201
* Add Gemma 3 vision support! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1202
* Manually fixup sentencepiece detok by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1204
* More vision models with TP by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1200
* Fix topology link in the docs by etiennebalit in https://github.com/EricLBuehler/mistral.rs/pull/1205
* Gemma3 1b support and optimized rotating cache by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1206
* Improve rotating kv cache, prefix cacher system by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1207
* Better handling for kvcache set_len by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1208
* Update deps and use rand 0.9 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1210
* Update hf hub dep, add initial blockwise fp8 GEMM tests by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1212
* Growable RotatingKvCache and fixes for Phi-4 mini by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1215
* Gemma 3 cuda fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1217
* Add pydantic schema examples! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1219
* Sliding window attention fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1220
* adapt to rig crate as client by benliao in https://github.com/EricLBuehler/mistral.rs/pull/1214
* Implement Mistral 3! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1221
* Metal SDPA with masking by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1225
* Send [DONE] SSE chunk per openai spec by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1226
* Fix handling of device when compiled for but disabled nccl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1227
* Fix nccl blocking case by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1228
* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1229
* OpenAI API compatability fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1230
* [Breaking] Automatic server logging by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1231
* Use default stream for flash attn by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1232
* Bump version to 0.5.0 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1233

New Contributors
* sgrebnov made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1082
* jncraton made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1141
* jamesvren made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1164
* brrr made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1184
* etiennebalit made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1205
* benliao made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/1214

**Full Changelog**: https://github.com/EricLBuehler/mistral.rs/compare/v0.4.0...v0.5.0

0.4.0

New features
- 🔥 New models!
- DeepSeek V2
- DeepSeek V3 and R1
- MiniCpm-O 2.6
- 🧮 Imatrix quantization
- ⚙️ Automatic device mapping
- BNB quantization
- Support blockwise FP8 dequantization and FP8 on Metal
- Integrate the llguidance library (mmoskal)
- Metal PagedAttention
- Many fixes and improvements from contributors!

Breaking changes
- The Rust device mapping API has changed.

MSRV
The MSRV of this release is **1.83.0**.

What's Changed
* Use CUDA_COMPUTE_CAP if nvidia-smi not found by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/944
* fix(docs): fix broken link by sammcj in https://github.com/EricLBuehler/mistral.rs/pull/945
* Better diffusion interactive mode by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/948
* Implement Imatrix for ISQ by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/949
* Support imatrix quantization for vision models by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/950
* Perplexity calculations with imatrix by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/952
* set minimum rustc version to 1.82 by mmoskal in https://github.com/EricLBuehler/mistral.rs/pull/957
* Fix append_sliding_window by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/958
* Fix completion api behavior of best_of by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/959
* Ensure support for cuda cc 5.3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/960
* Improve test speeds on Windows by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/961
* use llguidance library for constraints (including json schemas) by mmoskal in https://github.com/EricLBuehler/mistral.rs/pull/899
* Fix metal fp8 quantization by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/962
* Fix example gguf_locally to match chat template requirements by msk in https://github.com/EricLBuehler/mistral.rs/pull/966
* Bitsandbytes quantization: loading and kernels by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/967
* updated the tokenizers dependency of core to 0.21 by vkomenda in https://github.com/EricLBuehler/mistral.rs/pull/975
* Remove outdated binaries mention in the readme by BafS in https://github.com/EricLBuehler/mistral.rs/pull/973
* Improve error handling by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/974
* Add None check to prevent panic in evict_all_to_cpu in prefix_cacher.rs by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/979
* Include start offset for metal bitwise ops by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/978
* Fail fast on TcpListener bind errors by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/982
* Inplace softmax long-seqlen attention optimizations by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/984
* Fix cuda cublaslt when using vllama mask by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/985
* Add cross attn quantization for mllama by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/987
* fix mistralrs-server ignoring interactive_mode arg by haricot in https://github.com/EricLBuehler/mistral.rs/pull/990
* Adding streaming function to mistralrs server. by Narsil in https://github.com/EricLBuehler/mistral.rs/pull/986
* Fixes for bnb and more apis in mistralrs-quant by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/972
* Support send + sync in loader by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/991
* More vllama optimizations by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/992
* Update docs by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/993
* Use metal autorelease to optimize memory usage by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/996
* Partial Fix for Sliding Window Attention by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/994
* Only dep on objc when building on metal by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/998
* Prefix cacher v2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1000
* Add `--cpu` flag to `mistralrs-server` by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/997
* Metal PagedAttention support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1001
* Fix cross attention + prefix cacher v2 support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1006
* Support for normal cache for mllama, phi3v, qwen2vl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1007
* Cleaner creation of dummy pa input metadata by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1014
* Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models by guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1009
* Support device mapping for Paged Attention by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1011
* Prefix cacher fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1018
* More fixes for the prefix cacher by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1019
* Support uqff for idefics3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1020
* Prepare for v0.3.5 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1021
* Cleaner pipeline no prefix cache setting by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1022
* Support uqff load/save for idefics3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1023
* Update license for 2025 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1024
* Implement DeepSeekV2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1010
* Use cudarc fork to fix CUDA build on Windows by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1032
* Fix metal paged attn phi3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1033
* Use float8 mistralrs_cudarc_fork feature by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1034
* Patch prefix caching to fix incorrect outputs by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1035
* Allocate paged attn cache as empty instead of zeros by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1036
* Remove ug and cudarc transient dep by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1037
* Rename MemoryGpuConfig::Amount->MbAmount by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1038
* CUDA dequant kernels conditional compilation by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1039
* F16 support for mllama, introduce FloatInfo by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1041
* Automatic device mapping support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1042
* Support automatic device mapping for gguf models by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1044
* Support loading models without ISQ using device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1045
* Fix GGUF auto device mapping by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1047
* More efficient loading of safetensors when casting by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1048
* Fix Loading and Running on CPU by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1052
* Work on better device mapping for mllama by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1049
* Mention interactive mode or server port in readme for gguf by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1055
* Fix panic in mistralrs-server by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/981
* Include device memory avail in device map err by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1060
* Fix `--cpu` on cuda by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1056
* Improve pagedattn support in mistralrs bench by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1063
* Paged attention support for multi gpu by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1059
* Ergonomic automatic device mapping support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1054
* Examples for automatic device mapping by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1065
* Fix metal pagedattn half8 vec impl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1067
* Improve support for GGUF auto device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1069
* Fix missing field in idefics3 during loading by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1070
* Fix missing field in idefics3 during loading by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1072
* Fix paged attention for vision models on multiple devices by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1071
* Fixes for idefics3 and idefics2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1073
* Improve automatic device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1076
* Implement the DeepSeekV3 model (support full DeepSeek R1) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1077
* Don't print GGUF model metadata when silent=true by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1079
* Allow `ChatCompletionChunkResponse` (and therefore streaming) to have `Usage`. by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1078
* Support loading blockwise quantized fp8 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1080
* Implement MiniCpm-O 2.6 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1074
* Bump version to v0.4.0 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1081

New Contributors
* sammcj made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/945
* mmoskal made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/957
* vkomenda made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/975
* BafS made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/973
* cdoko made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/974
* Narsil made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/986

**Full Changelog**: https://github.com/EricLBuehler/mistral.rs/compare/v0.3.4...v0.4.0

0.3.4

New features
- Qwen2-VL support
- Idefics 3/SmolVLM support
- ️‍🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX, llama.cpp)!
- 🗂️ More efficient non-PagedAttention KV cache implementation!
- Public tokenization API

Python wheels
The wheels now include support for Windows, Linux, and Mac with x84_64 and aarch64.

MSRV

0.3.2

Key changes
- General improvements and fixes
- ISQ FP8
- GPTQ Marlin
- 26% performance boost on Metal
- Python package wheels are available. See below and the various PyPi packages.

What's Changed
* Update docs and deps by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/804
* Support Qwen 2.5 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/805
* Update docs with clarifications and notes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/806
* Improved inverting for Attention Mask by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/811
* Fix `repeat_interleave` by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/812
* Use f32 for neg inf in cross attn mask by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/814
* Improve UQFF memory efficiency by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/813
* Update Metal, CUDA Candle impls and ISQ by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/816
* chore: update pagedattention.cu by eltociear in https://github.com/EricLBuehler/mistral.rs/pull/822
* MLlama - if f16, load vision model in f32 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/820
* ci: Upgrade actions by polarathene in https://github.com/EricLBuehler/mistral.rs/pull/823
* docs: added a top button because of readme length by bhargavshirin in https://github.com/EricLBuehler/mistral.rs/pull/833
* Typo in error of model architecture enum by nikolaydubina in https://github.com/EricLBuehler/mistral.rs/pull/835
* Expose config for Rust api, tweak modekind by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/841
* Add ISQ FP8 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/832
* Fix Metal F8 build errors by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/846
* Bump pyo3 from 0.22.3 to 0.22.4 by dependabot in https://github.com/EricLBuehler/mistral.rs/pull/854
* Generate standalone UQFF models by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/849
* Update README.MD by kaleaditya779 in https://github.com/EricLBuehler/mistral.rs/pull/848
* Add GPTQ Marlin support for 4 and 8 bit by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/856
* Adds wrap_help feature to clap by DaveTJones in https://github.com/EricLBuehler/mistral.rs/pull/858
* Patch UQFF metal generation by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/857
* Add GGUF Qwen 2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/860
* Avoid duplicate Metal command buffer encodings during ISQ by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/861
* Fix for isnanf by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/859
* Fix some metal warnings by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/862
* Support interactive mode markdown bold/italics via ANSI codes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/879
* Even better V-Llama accuracy by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/881
* Trim whitespace (such as carriage returns) from nvidia-smi output. by asaddi in https://github.com/EricLBuehler/mistral.rs/pull/880
* MODEL_ID not "MODEL_ID" by simonw in https://github.com/EricLBuehler/mistral.rs/pull/863
* Sync ggml metal kernels by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/885
* Increase Metal decoding T/s by 26% by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/887
* Remove pretty-printer by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/889
* Fix typo in documentation by msk in https://github.com/EricLBuehler/mistral.rs/pull/888
* fix Half-Quadratic Quantization and Dequantization on CPU by haricot in https://github.com/EricLBuehler/mistral.rs/pull/873
* Prepare for v0.3.2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/891

New Contributors
* bhargavshirin made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/833
* nikolaydubina made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/835
* kaleaditya779 made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/848
* DaveTJones made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/858
* asaddi made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/880
* simonw made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/863
* msk made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/888
* haricot made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/873

**Full Changelog**: https://github.com/EricLBuehler/mistral.rs/compare/v0.3.1...v0.3.2

Page 1 of 7

Releases

Has known vulnerabilities

Mistralrs

Page 1 of 7

1.79.0

1.75

0.5.0

0.4.0

0.3.4

0.3.2

Page 1 of 7

Links

Releases