New features
- 🔥 New models!
  - DeepSeek V2
  - DeepSeek V3 and R1
  - MiniCpm-O 2.6
- 🧮 Imatrix quantization
- ⚙️ Automatic device mapping
- BNB quantization
- Support blockwise FP8 dequantization and FP8 on Metal
- Integrate the llguidance library for constraints, including JSON schemas (contributed by mmoskal)
- Metal PagedAttention
- Many fixes and improvements from contributors!
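
For readers unfamiliar with the blockwise FP8 dequantization mentioned above, the idea can be sketched as follows. This is a minimal illustration and not the implementation in mistral.rs: the function names `e4m3_to_f32` and `dequant_blockwise`, the row-major byte layout, and the use of one `f32` multiplier per square tile are all assumptions made for the example. It decodes the E4M3 "FN" encoding (4 exponent bits with bias 7, 3 mantissa bits, no infinities) and multiplies each decoded value by the scale of the tile it falls in.

```rust
/// Decode one FP8 E4M3 byte (the "FN" variant: no infinities, and the
/// all-ones exponent with all-ones mantissa encodes NaN) into an f32.
fn e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man_bits = byte & 0x07;
    if exp == 0x0F && man_bits == 0x07 {
        return f32::NAN;
    }
    let man = man_bits as f32;
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, fixed exponent of 2^-6.
        (man / 8.0) * 2f32.powi(-6)
    } else {
        // Normal: implicit leading 1, exponent bias of 7.
        (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Dequantize a row-major `rows x cols` FP8 matrix (hypothetical layout).
/// `scales` holds one f32 multiplier per `block x block` tile, stored
/// row-major with ceil(cols / block) entries per row of tiles.
fn dequant_blockwise(
    weights: &[u8],
    scales: &[f32],
    rows: usize,
    cols: usize,
    block: usize,
) -> Vec<f32> {
    let scale_rows = (rows + block - 1) / block;
    let scale_cols = (cols + block - 1) / block;
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(scales.len(), scale_rows * scale_cols);

    let mut out = Vec::with_capacity(rows * cols);
    for r in 0..rows {
        for c in 0..cols {
            // Every element in the same tile shares one scale.
            let scale = scales[(r / block) * scale_cols + c / block];
            out.push(e4m3_to_f32(weights[r * cols + c]) * scale);
        }
    }
    out
}
```

Calling `dequant_blockwise(&bytes, &scales, rows, cols, block)` returns a `Vec<f32>` of length `rows * cols`. Using a scale per tile rather than one per tensor keeps the quantization error local to each block, at the cost of storing the additional scale matrix.
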
Breaking changes
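- The Rust device mapping API has changed.
- The rename of `MemoryGpuConfig::Amount` to `MbAmount` (pull 1038 in the list below) may also require updates in Rust code.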
MSRV
The MSRV of this release is **1.83.0**.
What's Changed
* Use CUDA_COMPUTE_CAP if nvidia-smi not found by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/944
* fix(docs): fix broken link by sammcj in https://github.com/EricLBuehler/mistral.rs/pull/945
* Better diffusion interactive mode by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/948
* Implement Imatrix for ISQ by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/949
* Support imatrix quantization for vision models by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/950
* Perplexity calculations with imatrix by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/952
* set minimum rustc version to 1.82 by mmoskal in https://github.com/EricLBuehler/mistral.rs/pull/957
* Fix append_sliding_window by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/958
* Fix completion api behavior of best_of by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/959
* Ensure support for cuda cc 5.3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/960
* Improve test speeds on Windows by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/961
* use llguidance library for constraints (including json schemas) by mmoskal in https://github.com/EricLBuehler/mistral.rs/pull/899
* Fix metal fp8 quantization by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/962
* Fix example gguf_locally to match chat template requirements by msk in https://github.com/EricLBuehler/mistral.rs/pull/966
* Bitsandbytes quantization: loading and kernels by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/967
* updated the tokenizers dependency of core to 0.21 by vkomenda in https://github.com/EricLBuehler/mistral.rs/pull/975
* Remove outdated binaries mention in the readme by BafS in https://github.com/EricLBuehler/mistral.rs/pull/973
* Improve error handling by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/974
* Add None check to prevent panic in evict_all_to_cpu in prefix_cacher.rs by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/979
* Include start offset for metal bitwise ops by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/978
* Fail fast on TcpListener bind errors by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/982
* Inplace softmax long-seqlen attention optimizations by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/984
* Fix cuda cublaslt when using vllama mask by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/985
* Add cross attn quantization for mllama by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/987
* fix mistralrs-server ignoring interactive_mode arg by haricot in https://github.com/EricLBuehler/mistral.rs/pull/990
* Adding streaming function to mistralrs server. by Narsil in https://github.com/EricLBuehler/mistral.rs/pull/986
* Fixes for bnb and more apis in mistralrs-quant by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/972
* Support send + sync in loader by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/991
* More vllama optimizations by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/992
* Update docs by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/993
* Use metal autorelease to optimize memory usage by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/996
* Partial Fix for Sliding Window Attention by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/994
* Only dep on objc when building on metal by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/998
* Prefix cacher v2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1000
* Add `--cpu` flag to `mistralrs-server` by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/997
* Metal PagedAttention support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1001
* Fix cross attention + prefix cacher v2 support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1006
* Support for normal cache for mllama, phi3v, qwen2vl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1007
* Cleaner creation of dummy pa input metadata by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1014
* Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models by guoqingbao in https://github.com/EricLBuehler/mistral.rs/pull/1009
* Support device mapping for Paged Attention by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1011
* Prefix cacher fixes by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1018
* More fixes for the prefix cacher by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1019
* Support uqff for idefics3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1020
* Prepare for v0.3.5 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1021
* Cleaner pipeline no prefix cache setting by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1022
* Support uqff load/save for idefics3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1023
* Update license for 2025 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1024
* Implement DeepSeekV2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1010
* Use cudarc fork to fix CUDA build on Windows by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1032
* Fix metal paged attn phi3 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1033
* Use float8 mistralrs_cudarc_fork feature by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1034
* Patch prefix caching to fix incorrect outputs by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1035
* Allocate paged attn cache as empty instead of zeros by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1036
* Remove ug and cudarc transient dep by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1037
* Rename MemoryGpuConfig::Amount->MbAmount by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1038
* CUDA dequant kernels conditional compilation by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1039
* F16 support for mllama, introduce FloatInfo by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1041
* Automatic device mapping support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1042
* Support automatic device mapping for gguf models by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1044
* Support loading models without ISQ using device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1045
* Fix GGUF auto device mapping by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1047
* More efficient loading of safetensors when casting by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1048
* Fix Loading and Running on CPU by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1052
* Work on better device mapping for mllama by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1049
* Mention interactive mode or server port in readme for gguf by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1055
* Fix panic in mistralrs-server by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/981
* Include device memory avail in device map err by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1060
* Fix `--cpu` on cuda by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1056
* Improve pagedattn support in mistralrs bench by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1063
* Paged attention support for multi gpu by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1059
* Ergonomic automatic device mapping support by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1054
* Examples for automatic device mapping by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1065
* Fix metal pagedattn half8 vec impl by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1067
* Improve support for GGUF auto device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1069
* Fix missing field in idefics3 during loading by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1070
* Fix missing field in idefics3 during loading by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1072
* Fix paged attention for vision models on multiple devices by cdoko in https://github.com/EricLBuehler/mistral.rs/pull/1071
* Fixes for idefics3 and idefics2 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1073
* Improve automatic device map by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1076
* Implement the DeepSeekV3 model (support full DeepSeek R1) by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1077
* Don't print GGUF model metadata when silent=true by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1079
* Allow `ChatCompletionChunkResponse` (and therefore streaming) to have `Usage`. by Jeadie in https://github.com/EricLBuehler/mistral.rs/pull/1078
* Support loading blockwise quantized fp8 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1080
* Implement MiniCpm-O 2.6 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1074
* Bump version to v0.4.0 by EricLBuehler in https://github.com/EricLBuehler/mistral.rs/pull/1081
New Contributors
* sammcj made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/945
* mmoskal made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/957
* vkomenda made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/975
* BafS made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/973
* cdoko made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/974
* Narsil made their first contribution in https://github.com/EricLBuehler/mistral.rs/pull/986
**Full Changelog**: https://github.com/EricLBuehler/mistral.rs/compare/v0.3.4...v0.4.0