Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.
Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- pip install now automatically compiles the CUDA kernels.
- The CUDA backend is automatically detected and used when available.
- You can quantize any HF model automatically via AutoHQQHFModel (see the sketch after this list).
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).
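
As a quick illustration of the AutoHQQHFModel feature above, here is a minimal sketch. The model id and quantization settings are just examples, and the compute_dtype/device argument names are assumed to match the current README, so treat this as a sketch rather than a definitive recipe:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

# Load the fp16 model first (model id is just an example)
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 2-bit quantization config; group_size=16 is an example value
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)

# Quantize directly on the target device (see "Quantize on target device" above)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
```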
Bugs
- Fixed Peft bias dtype.
- Removed auto backend setting in LoRA.
- All HQQLinear dtype/device-related overloads now return self, which should resolve a couple of issues.
Other
- Refactored the backends (backprop backends are now used by default); see the sketch after this list for setting a backend explicitly.
- Added type annotations.
- Ran Ruff to fix and reformat all Python files.
- Refactored ATEN for reference tensors.
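
Since the automatic backend setting was removed from the LoRA path and backprop backends are now the default, the backend can still be selected explicitly. A minimal sketch (HQQBackend.PYTORCH is just an example choice):

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# Explicitly select a backend for all HQQLinear layers.
# HQQBackend.PYTORCH is used here as an example; other variants
# (e.g. the compiled or ATEN backends) can be selected the same way.
HQQLinear.set_backend(HQQBackend.PYTORCH)
```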
Issues
- Using CUDA streams for offloading is faster but uses more memory (~700MB extra with Llama2-7B at 2-bit, group size 16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
- The shared-memory CUDA kernels are slightly slower than the kernels without shared memory, for reasons not yet understood.
- The block size setting doesn't have much influence on the speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be implemented on the ATen/CUDA side.