- Add BitBlas backend support
- Simpler `HQQLinear` from weights: `HQQLinear.from_weights(W, bias, etc.)` (see the sketch below)
- Fix memory leak while swapping layers for the TorchAO backend
- Add `HQQLinear.unpack()` call
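A minimal sketch of the new `from_weights` path, assuming it takes the same `BaseQuantizeConfig` object as the regular `HQQLinear` constructor; the argument list beyond `W` and `bias` is an assumption, as is what `unpack()` returns:

```python
import torch
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

# Hypothetical weight/bias for a 4096 -> 4096 linear layer
W    = torch.randn(4096, 4096, dtype=torch.float16)
bias = torch.zeros(4096, dtype=torch.float16)

# Assumption: from_weights accepts a quant config like the HQQLinear constructor does
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_layer    = HQQLinear.from_weights(W, bias, quant_config)

# Assumption: unpack() exposes the bit-unpacked quantized weights
W_unpacked = hqq_layer.unpack()
```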
0.1.7.post3
- Enable CPU quantization and runtime (see the sketch below)
- `_load_state_dict` fix
- Fix `extra_repr` in `HQQLinear`
- Fix `from_quantized` bugs
- Fix `|` typing
- Fix 3-bit `axis=1` slicing bug
- Add 5/6-bit for testing
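A hedged sketch of the CPU path, assuming the existing `device`/`compute_dtype` arguments of the `HQQLinear` constructor now accept `'cpu'`:

```python
import torch
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
linear       = torch.nn.Linear(4096, 4096, bias=True)

# Assumption: device='cpu' selects the new CPU quantization/runtime path
hqq_cpu = HQQLinear(linear, quant_config=quant_config, compute_dtype=torch.float32, device='cpu')
out     = hqq_cpu(torch.randn(2, 4096))
```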
0.1.7.post2
- Various bug fixes, especially with `AutoHQQHFModel` and the patching logic, to make it work with any transformers model.
- Readme refactoring.
- Whisper example.
0.1.7
- Faster inference with torchao / marlin 4-bit kernels (see the sketch below)
- Multi-GPU support for `model.quantize()`
- Custom HF generator
- Various bug fixes and improvements
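A hedged sketch of enabling the faster kernels after quantization; `AutoHQQHFModel.quantize_model` and `prepare_for_inference` with the `"torchao_int4"` backend name follow the library's usage pattern, but the exact names and defaults may differ across versions:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical model id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 4-bit quantization config (group_size chosen for illustration)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device='cuda')

# Assumption: this swaps HQQ layers to the torchao 4-bit kernel for faster inference
prepare_for_inference(model, backend="torchao_int4")
```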
0.1.6.post2
Same as [v0.1.6](https://github.com/mobiusml/hqq/releases/tag/0.1.6) with `setup.py` fixes:
- `find_packages` fix: https://github.com/mobiusml/hqq/pull/25
- Auto-build CUDA kernels via the PyPI package: https://github.com/mobiusml/hqq/pull/26
0.1.6.post1
Same as [v0.1.6](https://github.com/mobiusml/hqq/releases/tag/0.1.6) with a `find_packages` fix: https://github.com/mobiusml/hqq/pull/25