hqq (Half-Quadratic Quantization)

Latest version: v0.2.2

0.1.7

- Faster inference with torchao / marlin 4-bit kernels (see the sketch after this list)
- Multi-gpu support for `model.quantize()`
- Custom HF generator
- Various bug fixes/improvements
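
The new kernels and multi-GPU quantization plug into the usual HQQ quantization flow. Below is a minimal sketch assuming the API described in the hqq README around this release; the module paths (`hqq.models.hf.base`, `hqq.utils.patching`), the backend names, and the example model ID are assumptions rather than something verified against v0.1.7:

```python
# Minimal sketch, not verified against v0.1.7; module paths and backend
# names are assumptions based on the hqq README.
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel          # assumed location
from hqq.utils.patching import prepare_for_inference   # assumed location

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 4-bit, group-size 64; axis=1 is what the optimized 4-bit kernels expect
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Quantize directly on the target device (this release adds multi-GPU
# support for model.quantize(); a single device is shown here)
AutoHQQHFModel.quantize_model(
    model,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device="cuda:0",
)

# Patch the quantized layers to use the faster 4-bit inference kernels
prepare_for_inference(model, backend="torchao_int4")  # or backend="marlin"
```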

0.1.6.post2

Same as [v0.1.6](https://github.com/mobiusml/hqq/releases/tag/0.1.6) with setup.py fixes:

- find_packages fix: https://github.com/mobiusml/hqq/pull/25
- Auto-build CUDA kernels via the PyPI package: https://github.com/mobiusml/hqq/pull/26

0.1.6.post1

Same as [v0.1.6](https://github.com/mobiusml/hqq/releases/tag/0.1.6) with a find_packages fix: https://github.com/mobiusml/hqq/pull/25

0.1.6

Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.

Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- `pip install` now automatically compiles the CUDA kernels.
- CUDA backend automatically detected and used when available (see the sketch after this list).
- You can quantize any HF model automatically via AutoHQQHFModel.
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).
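
A minimal sketch of the layer-level API behind several of the items above (quantizing on the target device, meta-offloading, backend selection). Argument and enum names follow the hqq README of this period and are not verified against v0.1.6 exactly:

```python
# Sketch only; argument names follow the hqq README of this period and are
# not verified against v0.1.6.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

layer = nn.Linear(4096, 4096, bias=False)

# 2-bit, group-size 16; offload_meta=True enables the pinned-memory
# meta-offloading mentioned above (assumed flag name)
quant_config = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)

# Quantization happens directly on the requested device
hqq_layer = HQQLinear(layer, quant_config=quant_config,
                      compute_dtype=torch.float16, device="cuda")

# The CUDA (ATEN) backend is picked automatically when the compiled kernels
# are available; it can also be set explicitly:
HQQLinear.set_backend(HQQBackend.ATEN)

out = hqq_layer(torch.randn(1, 4096, dtype=torch.float16, device="cuda"))
```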

Bugs
- Fix Peft bias dtype.
- Removed auto backend setting in LoRA.
- All HQQLinear dtype/device-related overloads now return `self`, which should resolve a couple of issues (see the short example below).
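
For illustration only, and assuming the overloads include `.cuda()` and `.half()`, returning `self` lets the usual chained-call pattern work on an already-quantized layer:

```python
# Hypothetical snippet: `hqq_layer` is an existing HQQLinear instance.
# Since the device/dtype overloads return self, calls can be chained:
hqq_layer = hqq_layer.cuda().half()
```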

Other
- Refactored backends (backprop backends are now used by default).
- Added typing.
- Applied Ruff fixes and reformatted all Python files.
- Refactored ATEN for reference tensors.

Issues
- Using CUDA streams for offloading is faster but uses more memory (about +700MB with Llama2-7B at 2-bit / group-size 16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
- Shared-memory CUDA kernels are, for some reason, a bit slower than the kernels without shared memory.
- The block-size setting doesn't have much influence on speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.

0.1.5

New features
- Added support for multi-gpu FSDP QLoRA training (https://github.com/mobiusml/hqq/pull/17)

Issues
- torch.compile and the PYTORCH_COMPILE backend break with `view_as_float=True`. No known solution for the moment.
- Inference is a bit slower with `view_as_float=True`. Workaround: after training, revert to int bit-packing, as sketched below.
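
A sketch of the `view_as_float` option referenced above, assuming it is exposed as a `BaseQuantizeConfig` flag (the bit-packed integer data is viewed as a float tensor so FSDP can shard it during QLoRA training):

```python
# Sketch; assumes view_as_float is a BaseQuantizeConfig flag, as in the
# FSDP QLoRA integration. The float view is needed during FSDP training;
# reverting to int bit-packing afterwards restores full inference speed.
from hqq.core.quantize import BaseQuantizeConfig

# During FSDP QLoRA training:
train_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=True)

# After training, for inference:
infer_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=False)
```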

0.1.4

New features
- Added 1-bit support with CUDA dequant kernels (see the sketch below).
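
The 1-bit mode is selected through the standard quantization config; the group size below is only illustrative:

```python
# Sketch: 1-bit weights via the usual config (group size chosen for illustration)
from hqq.core.quantize import BaseQuantizeConfig

quant_config = BaseQuantizeConfig(nbits=1, group_size=32)
```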
