Hqq

Latest version: v0.2.5


Page 3 of 4

0.1.6

Use v0.1.6.post1 instead, unless you clone the repository and install from source.

Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- pip install automatically compiles the CUDA kernels now.
- CUDA backend automatically detected and used when available.
- You can quantize any HF model automatically via AutoHQQHFModel.
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).

Bugs
- Fix Peft bias dtype.
- Removed auto backend setting in LoRA.
- All HQQLinear dtype/device-related overloads now return self which should solve a couple of issues.

Other
- Refactor backends (using backprop backends by default now).
- Added typing.
- Ruff fix and reformat all Python files.
- Refactor ATEN for reference tensors.

Issues
- Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit/gs=16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
- Shared memory CUDA kernels are a bit slower than the kernels without shared memory; the cause is not yet known.
- The block size setting doesn't have much influence on the speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.

0.1.5

New features
- Added support for multi-gpu FSDP QLoRA training (https://github.com/mobiusml/hqq/pull/17)

Issues
- torch.compile and the PYTORCH_COMPILE backend break with view_as_float=True. No known solution for the moment.
- Inference is a bit slower with view_as_float=True. Solution: after training, the user can revert to int bitpacking.
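
To make the bitpacking mentioned above concrete, here is a minimal pure-Python sketch that packs 2-bit codes four per byte and unpacks them again. The function names `pack_2bit` and `unpack_2bit` and the bit layout are illustrative assumptions, not hqq's actual packing format.

```python
def pack_2bit(values):
    """Pack integers in [0, 3] into bytes, four values per byte."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    if any(v < 0 or v > 3 for v in values):
        raise ValueError("values must fit in 2 bits")
    packed = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        # First value goes in the two highest bits, last in the two lowest.
        packed.append((a << 6) | (b << 4) | (c << 2) | d)
    return bytes(packed)


def unpack_2bit(packed):
    """Inverse of pack_2bit: recover four 2-bit values from each byte."""
    values = []
    for byte in packed:
        values.extend(((byte >> 6) & 3, (byte >> 4) & 3, (byte >> 2) & 3, byte & 3))
    return values
```

Eight 2-bit codes fit in two bytes this way, which is where the storage saving of low-bit quantization comes from.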

0.1.4

New features
- Added 1-bit support with CUDA dequant kernels.

0.1.3.post1

New features
- meta_offloading support: allows offloading the quantization metadata to the CPU, achieving true n-bit storage on the GPU.
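
The "true n-bit storage" claim can be illustrated with a small back-of-the-envelope calculation. The sketch below assumes a simplified model in which every group of weights stores one scale and one zero-point, each in 16 bits; the helper `bits_per_weight` and its parameters are hypothetical, and hqq's real metadata layout may differ (for example, if the metadata is itself quantized).

```python
def bits_per_weight(nbits, group_size, meta_bits=16, meta_on_gpu=True):
    """Effective GPU bits per weight for group-wise quantization.

    Assumes one scale and one zero-point per group, each stored in
    `meta_bits` bits (an illustrative simplification).
    """
    meta = (2 * meta_bits) / group_size if meta_on_gpu else 0.0
    return nbits + meta


# 2-bit weights, group size 16, metadata kept on the GPU
with_meta = bits_per_weight(2, 16)
# Same setting, metadata offloaded to the CPU
offloaded = bits_per_weight(2, 16, meta_on_gpu=False)
```

Under these assumptions, the per-group metadata doubles the GPU footprint of a 2-bit model with group size 16, and offloading it brings the GPU cost back to the nominal 2 bits per weight.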

0.1.3

New features
- Added CUDA kernels for dequantization (up to 2-3x inference speed-up vs. PyTorch)
- Added support for compute_dtype parameter (useful for float32/bfloat16 LoRA training)
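
As a rough illustration of what a dequantization step computes, the sketch below quantizes one group of weights with plain min-max affine rounding and then reconstructs them as `(code - zero) * scale`. This is a generic textbook scheme chosen for clarity; the function names are hypothetical, and it is not hqq's actual quantization procedure or CUDA kernel.

```python
def quantize_group(weights, nbits):
    """Min-max affine quantization of one group of floats."""
    levels = (1 << nbits) - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = -lo / scale
    codes = [min(levels, max(0, round(w / scale + zero))) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Reconstruct approximate weights from integer codes."""
    return [(q - zero) * scale for q in codes]


weights = [-1.0, -0.4, 0.1, 0.7, 2.0]
codes, scale, zero = quantize_group(weights, nbits=2)
restored = dequantize_group(codes, scale, zero)
```

The reconstruction error for each weight is at most half a quantization step (`scale / 2`), and the result can then be cast to whatever compute dtype the layer uses.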

0.1.2.post1

Bug fixes
- Fixed LoRA adapter loading.
