GPTQModel


1.7.4

What's Changed
⚡ Faster `packing` during post-quantization model weight save.
⚡ `Triton` kernel now validated for `Intel/XPU` when the Intel Triton package is installed.
⚡ New `compile()` API that uses torch compilation to improve tps by ~4-8% (see the sketch below). You may need to disable flash_attention for some kernels.
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate `bpw` (bits per weight) calculations.
🐛 Fix `ROCm` compilation in `setup.py`.
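
A minimal sketch of the new `compile()` call, assuming the no-argument form; the model id below is a placeholder, not from these notes:

```python
from gptqmodel import GPTQModel

# Placeholder model id, used only to keep the sketch self-contained.
model = GPTQModel.load("ModelCloud/example-gptq-4bit")
model.compile()  # applies torch compilation; these notes report a ~4-8% tps gain
```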

* Fix exllama slow pack() by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1128
* use optimized torch.round() codes by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1131
* fix shape mismatch for packing by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1132
* Speed up triton dequant by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1136
* add torch compile with backend aot_ts by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1139
* disable sampling by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1141
* mod triton-xpu by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1135
* suppress dynamo error by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1143
* fix bpw by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1150
* [FIX] fix incorrectly saved the slow tokenizer by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1151
* Add mod chat by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1154
* optimize pack by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1153
* add quant time test by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1155
* Export to hf model by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1157
* Fix bpw calculation by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1163
* Inference speed test by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1159

New Contributors
* isaranto made their first contribution in https://github.com/ModelCloud/GPTQModel/pull/1162

**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v1.7.3...v1.7.4

1.7.3

What's Changed

⚡ Telechat2 (China Telecom) model support (see the quantization sketch below).
⚡ PhiMoE model support.
🐛 Fix `lm_head` weights being duplicated in post-quantization save() for models with tied embeddings.
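
For illustration only, a hedged sketch of quantizing a newly supported architecture; the model id and calibration text are placeholders:

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("Tele-AI/TeleChat2-7B", quant_config)  # placeholder id

# A real run needs a few hundred calibration rows; two short strings are
# shown only to keep the sketch self-contained.
calibration = [
    "GPTQ calibrates layer-by-layer against sample activations.",
    "Short placeholder calibration text.",
]
model.quantize(calibration)
model.save("telechat2-gptq-4bit")
```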

* Add util.tensor_parameters() by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1107
* add require_dtype by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1109
* [MODEL] Add Telechat2 (China Telecom) by 1096125073 in https://github.com/ModelCloud/GPTQModel/pull/1106
* [FIX] Filter weight-sharing tensors when save by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1112
* Add telechat test by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1111
* [FIX] fix convert_gptq_to_mlx_weights by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1113
* add test_parameter_count.py by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1115
* Add gpqa eval task by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1117
* [FIX] Call tied_weights() after load_checkpoint_in_model() by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1119
* add phimoe support by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1118

New Contributors
* 1096125073 made their first contribution in https://github.com/ModelCloud/GPTQModel/pull/1106

**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v1.7.2...v1.7.3

1.7.2

What's Changed

⚡ Effective BPW (bits per weight) is now logged during load().
⚡ Reduced loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduced memory usage during MLX conversion.
🐛 Fix Marlin kernel auto-selection not checking the CUDA compute version.

* remove catching module error by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1088
* [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1090
* [FIX] monkey patch only once by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1091
* check CC >= 8 for marlin, fixed 1092 by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1093
* check compute capability for marlin in validate_device() by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1095
* torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1096
* fix local model path & marlin test by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1097
* mod bits info by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1100
* Reduce memory usage in mlx conversion by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1099
* cleanup mlx code by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1101


**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v1.7.0...v1.7.2

1.7.0

What's Changed

⚡ `backend.MLX` added for runtime conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+).
⚡ Export of GPTQ models to MLX is now also possible; we have added MLX-exported models to [huggingface.co/ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2). See the sketch below.
⚡ `lm_head` quantization is now fully supported by GPTQModel without external package dependencies.
🐛 Fixed `setup.py` not correctly detecting incompatible `setuptools`/`wheel` packages.
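
A hedged sketch of both features; the `BACKEND.MLX` member and the `lm_head` config flag are inferred from these notes, and the model id is a placeholder:

```python
from gptqmodel import GPTQModel, BACKEND, QuantizeConfig

# Runtime GPTQ -> MLX conversion and execution on Apple Silicon (M1+).
model = GPTQModel.load("ModelCloud/example-gptq-4bit", backend=BACKEND.MLX)

# lm_head quantization without external package dependencies; the exact
# flag name here is an assumption.
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)
```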


* [CI] run tests with linux tag by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1067
* Add backend.MLX by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1061
* add mlx generate test by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1069
* [CI] upload source in build step by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1070
* code review by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1072
* [CI] install mlx by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1071
* Add option to quantize `lm_head` by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1037
* fix test_packing by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1073
* [CI] add mlx test by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1074
* [CI] fix ci release env name by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1078
* update mlx test by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1079
* convert to mlx support desc_act true by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1082
* [CI] add extra-index-url for pip install by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1083
* catch module error for setup.py by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1084

**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v1.6.1...v1.7.0

1.6.1

What's Changed

🎉 New OpenAI-API-compatible endpoint via `model.serve(host, port)`; see the sketch below.
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed `sym=False` loading regression.
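
A minimal sketch, assuming `serve()` blocks and speaks the OpenAI REST protocol; the host, port, and model id are placeholders:

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/example-gptq-4bit")  # placeholder id
# Any OpenAI-compatible client pointed at http://127.0.0.1:8000 should then
# be able to send chat/completion requests.
model.serve("127.0.0.1", 8000)
```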

* code opt by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1038
* fix marlin validate rocm & do validate() if backend not AUTO by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1040
* add global rocm check by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1043
* [FIX] pass sym to make_quant by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1046
* enable flash attn for loading quantized by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1045
* add flash_attn2 test by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1047
* enable flash_attention only when device is cuda by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1050
* move flash attn test to correct folder by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1052
* Expose openai server api by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1048
* update openai server by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1058
* don't download whl for xpu env by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1059
* remove build tag for normal release by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1063
* disable flash attn 2 for internlm by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1065


**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v1.6.0...v1.6.1

1.6.0

What's Changed

⚡ 25% faster quantization and 35% lower VRAM usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPUs.
💫 Auto-tokenizer loading via the load() API. For most models, you no longer need to manually initialize a tokenizer for inference or quantization; see the sketch below.
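
A hedged sketch of the simplified flow; the `tokenizer` and `device` attribute access is an assumption based on these notes, and the model id is a placeholder:

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/example-gptq-4bit")
# No separate AutoTokenizer step: load() now also initializes the tokenizer.
inputs = model.tokenizer("Hello, GPTQModel!", return_tensors="pt").to(model.device)
print(model.tokenizer.decode(model.generate(**inputs)[0]))
```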

* note about `batch_size` to speed up quant by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/992
* Add ROCm support by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/993
* Add bits test by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/995
* note about rocm support by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/998
* [FIX] wrong variable name by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/997
* update rocm version tag by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/999
* Auto-tokenizer will be called within `load()` by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/996
* update transformers by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1001
* [FIX] torch qlinear forward by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1002
* cleanup marlin info by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/1004
* Use custom forward hook by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1003
* fix hooked linear init by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1011
* add HookedConv1D by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1012
* record fwd time by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1013
* add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1015
* [FIX] quantize_config could not read from config.json by ZX-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1022
* Fix quant time by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1025
* fix forward hook by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1027
* Fix hooked conv2d by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1030
* clean cache by CL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/1032


**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v1.5.1...v1.6.0
