Efficient Marlin int4*fp16 kernel on Ampere GPUs, AWQ checkpoint loading
efrantar, GPTQ author, released [Marlin](https://github.com/IST-DASLab/marlin), an optimized CUDA kernel for int4*fp16 matrix multiplication on Ampere GPUs, with **per-group symmetric quantization** support (without act-order), which [significantly outperforms other existing kernels](https://github.com/IST-DASLab/marlin/issues/2#issuecomment-1923290721) when using batching.
This kernel can be used in AutoGPTQ by loading models with the `use_marlin=True` argument. Using this flag repacks the quantized weights, since the Marlin kernel expects a different layout. The repacked weights are then saved locally so they do not need to be repacked again on later loads. Example:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)

print(tokenizer.decode(res[0]))
```

```
Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]

<s> Is quantization a good compression technique?

Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
```
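Since Marlin targets Ampere (and newer) GPUs, it can be convenient to gate the flag on the device's compute capability when running on mixed hardware. The sketch below is only an illustration of that idea; the fallback logic is not part of the AutoGPTQ API, only the `use_marlin` flag shown above is.

```python
import torch
from auto_gptq import AutoGPTQForCausalLM

# Marlin requires an Ampere-class GPU (compute capability >= 8.0).
major, minor = torch.cuda.get_device_capability(0)
use_marlin = major >= 8

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    torch_dtype=torch.float16,
    use_marlin=use_marlin,  # fall back to the default kernel on pre-Ampere GPUs
    device="cuda:0",
)
```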
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
* add marlin kernel by qwopqwop200 in https://github.com/AutoGPTQ/AutoGPTQ/pull/514
* updated marlin serialization by rib-2 in https://github.com/AutoGPTQ/AutoGPTQ/pull/522
* Marlin repacking CUDA kernel by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/539
* Marlin kernel can be built against any compute capability by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/540
Ability to load AWQ checkpoints in AutoGPTQ
**Note: The AWQ checkpoint repacking step is currently slow; a faster implementation is possible.**
[AWQ](https://arxiv.org/abs/2306.00978)'s [original implementation](https://github.com/mit-han-lab/llm-awq) adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. AutoGPTQ can now load AWQ checkpoints and repack them to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
Example:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)

print(tokenizer.decode(res[0]))
```

```
Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]

<s> Is quantization a good compression technique?

Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
```
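Because the repacking pass above takes several minutes, one option is to persist the repacked model once with `save_quantized` and point later loads at the local copy. This is a sketch of that idea; whether a reload of the saved checkpoint fully skips the repacking step is an assumption here, not something stated in these notes.

```python
# Assumption: `model` above now holds the repacked (GPTQ-layout) weights,
# so saving it once lets later runs load from disk instead of repacking.
model.save_quantized("Llama-2-13B-chat-AWQ-repacked")  # hypothetical local directory

# Later runs (assumption: loads without repeating the repacking step):
# model = AutoGPTQForCausalLM.from_quantized(
#     "Llama-2-13B-chat-AWQ-repacked", torch_dtype=torch.float16, device="cuda:0"
# )
```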
* Support inference with AWQ models by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/484
Support for Qwen2, LongLLaMA and DeciLM models
These models can be quantized with AutoGPTQ.
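For a newly supported architecture, the quantization flow is the usual AutoGPTQ one. Below is a minimal sketch assuming a Qwen2-architecture checkpoint; the model id, calibration text and output directory are placeholders.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_model_id = "Qwen/Qwen1.5-0.5B"  # placeholder Qwen2-architecture checkpoint
quantized_model_dir = "qwen2-gptq-4bit"    # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id)

# Calibration data: a small set of tokenized examples (use a real calibration set in practice).
examples = [
    tokenizer("AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to int4
    group_size=128,  # per-group quantization granularity
    desc_act=False,  # no act-order, which keeps the checkpoint compatible with the Marlin kernel
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_id, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```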
* Add qwen2 by JustinLin610 in https://github.com/AutoGPTQ/AutoGPTQ/pull/519
* Change deci_lm model type to deci by LaaZa in https://github.com/AutoGPTQ/AutoGPTQ/pull/491
* Support for LongLLaMA models. by LaaZa in https://github.com/AutoGPTQ/AutoGPTQ/pull/442
Other changes and bugfixes
* Update version & install instructions by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/485
* fix the support of Qwen by hzhwcmhf in https://github.com/AutoGPTQ/AutoGPTQ/pull/495
* rocm6.0 compatible exllama by seungrokj in https://github.com/AutoGPTQ/AutoGPTQ/pull/515
* Untie weights for safetensors serialization by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/536
* marlin update version 0.1.1 and fix marlin bug by qwopqwop200 in https://github.com/AutoGPTQ/AutoGPTQ/pull/524
* Use ruff for linting by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/537
* Fix wheels build for torch==2.2.0 by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/541
* Fix repo owners in workflows by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/542
* Disable peft compatibility by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/543
* Improve README by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/544
* Add ROCm dockerfile by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/545
* Make all tests pass by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/546
* Fix cuda wheel build workflows by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/547
* Use bash in workflows by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/548
* Dissociate Windows & Linux CUDA build by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/549
* Add more guards on compute capability in Marlin kernel by fxmarty in https://github.com/AutoGPTQ/AutoGPTQ/pull/550
New Contributors
* hzhwcmhf made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/495
* rib-2 made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/522
* seungrokj made their first contribution in https://github.com/AutoGPTQ/AutoGPTQ/pull/515
**Full Changelog**: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0