Efficient Marlin int4*fp16 kernel on Ampere GPUs, AWQ checkpoint loading
efrantar, GPTQ author, released [Marlin](https://github.com/IST-DASLab/marlin), an optimized CUDA kernel for int4*fp16 matrix multiplication on Ampere GPUs, with **per-group symmetric quantization** support (without act-order), which [significantly outperforms other existing kernels](https://github.com/IST-DASLab/marlin/issues/2#issuecomment-1923290721) when using batching.
This kernel can be used in AutoGPTQ by loading models with the `use_marlin=True` argument. Using this flag repacks the quantized weights, since the Marlin kernel expects a different layout. The repacked weights are then saved locally so they do not need to be repacked again on later loads. Example:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)

print(tokenizer.decode(res[0]))
```

```
Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]

<s> Is quantization a good compression technique?

Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
```
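Since Marlin targets Ampere (and newer) GPUs, it can be convenient to gate the flag on the device's compute capability when running on mixed hardware. The sketch below is only an illustration of that idea; the fallback logic is not part of the AutoGPTQ API, only the `use_marlin` flag shown above is.

```python
import torch
from auto_gptq import AutoGPTQForCausalLM

# Marlin requires an Ampere-class GPU (compute capability >= 8.0).
major, minor = torch.cuda.get_device_capability(0)
use_marlin = major >= 8

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    torch_dtype=torch.float16,
    use_marlin=use_marlin,  # fall back to the default kernel on pre-Ampere GPUs
    device="cuda:0",
)
```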
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
* add marlin kernel by qwopqwop200 in https://github.com/AutoGPTQ/AutoGPTQ/pull/514
* updated marlin serialization by rib-2 in https://github.com/AutoGPTQ/AutoGPTQ/pull/522
* Marlin repacking CUDA kernel by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/539
* Marlin kernel can be built against any compute capability by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/540
Ability to load AWQ checkpoints in AutoGPTQ
**Note: The AWQ checkpoint repacking step is currently slow; a faster implementation is possible.**
[AWQ](https://arxiv.org/abs/2306.00978)'s [original implementation](https://github.com/mit-han-lab/llm-awq) adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. AutoGPTQ can now load AWQ checkpoints and repack them to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
Example:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)

print(tokenizer.decode(res[0]))
```

```
Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]

<s> Is quantization a good compression technique?

Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
```
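Because the repacking pass above takes several minutes, one option is to persist the repacked model once with `save_quantized` and point later loads at the local copy. This is a sketch of that idea; whether a reload of the saved checkpoint fully skips the repacking step is an assumption here, not something stated in these notes.

```python
# Assumption: `model` above now holds the repacked (GPTQ-layout) weights,
# so saving it once lets later runs load from disk instead of repacking.
model.save_quantized("Llama-2-13B-chat-AWQ-repacked")  # hypothetical local directory

# Later runs (assumption: loads without repeating the repacking step):
# model = AutoGPTQForCausalLM.from_quantized(
#     "Llama-2-13B-chat-AWQ-repacked", torch_dtype=torch.float16, device="cuda:0"
# )
```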
* Support inference with AWQ models by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/484
Support for Qwen2, LongLLaMA and DeciLM models
These models can be quantized with AutoGPTQ.
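For a newly supported architecture, the quantization flow is the usual AutoGPTQ one. Below is a minimal sketch assuming a Qwen2-architecture checkpoint; the model id, calibration text and output directory are placeholders.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_model_id = "Qwen/Qwen1.5-0.5B"  # placeholder Qwen2-architecture checkpoint
quantized_model_dir = "qwen2-gptq-4bit"    # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id)

# Calibration data: a small set of tokenized examples (use a real calibration set in practice).
examples = [
    tokenizer("AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to int4
    group_size=128,  # per-group quantization granularity
    desc_act=False,  # no act-order, which keeps the checkpoint compatible with the Marlin kernel
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_id, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```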
* Add qwen2 by JustinLin610 in https://github.com/AutoGPTQ/AutoGPTQ/pull/519
* Change deci_lm model type to deci by LaaZa in https://github.com/AutoGPTQ/AutoGPTQ/pull/491
* Support for LongLLaMA models. by LaaZa in https://github.com/AutoGPTQ/AutoGPTQ/pull/442
Other changes and bugfixes
* Update version & install instructions by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/485
* fix the support of Qwen by hzhwcmhf in https://github.com/AutoGPTQ/AutoGPTQ/pull/495
* rocm6.0 compatible exllama by seungrokj in https://github.com/AutoGPTQ/AutoGPTQ/pull/515
* Untie weights for safetensors serialization by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/536
* marlin update version 0.1.1 and fix marlin bug by qwopqwop200 in https://github.com/AutoGPTQ/AutoGPTQ/pull/524
* Use ruff for linting by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/537
* Fix wheels build for torch==2.2.0 by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/541
* Fix repo owners in workflows by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/542
* Disable peft compatibility by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/543
* Improve README by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/544
* Add ROCm dockerfile by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/545
* Make all tests pass by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/546
* Fix cuda wheel build workflows by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/547
* Use bash in workflows by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/548
* Dissociate Windows & Linux CUDA build by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/549
* Add more guards on compute capability in Marlin kernel by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/550
New Contributors
* hzhwcmhf made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/495
* rib-2 made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/522
* seungrokj made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/515
**Full Changelog**: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0