What's Changed
Added Llama-3.1 support, added Gemma2 27B quantized inference support via vLLM, added automatic `pad_token` normalization, and fixed AutoRound quant compatibility with vLLM/SGLang.
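For Gemma2 27B (and other GPTQ checkpoints) served through the vLLM path, loading looks roughly like the sketch below. This is a minimal sketch assuming this release's `GPTQModel.from_quantized` entry point and `BACKEND` enum; the model id is hypothetical, so substitute your own quantized checkpoint.

```python
from gptqmodel import GPTQModel, BACKEND

# Hypothetical model id for illustration; any GPTQ-quantized
# Gemma2 27B or Llama-3.1 checkpoint should load the same way.
model_id = "ModelCloud/gemma-2-27b-gptq-4bit"

# Route inference through the vLLM backend. pad_token
# normalization is now applied automatically on load.
model = GPTQModel.from_quantized(model_id, backend=BACKEND.VLLM)
```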
* [CI] by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/238, https://github.com/ModelCloud/GPTQModel/pull/236, https://github.com/ModelCloud/GPTQModel/pull/237, https://github.com/ModelCloud/GPTQModel/pull/241, https://github.com/ModelCloud/GPTQModel/pull/242, https://github.com/ModelCloud/GPTQModel/pull/243, https://github.com/ModelCloud/GPTQModel/pull/246, https://github.com/ModelCloud/GPTQModel/pull/247, https://github.com/ModelCloud/GPTQModel/pull/250
* [FIX] explicitly call torch.no_grad() by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/239
* Bitblas update by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/249
* [FIX] calibration average calculation when the calibration dataset arg is passed as tensors by Qubitium, LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/254, https://github.com/ModelCloud/GPTQModel/pull/258
* [MODEL] Gemma2 27B can now load with vLLM by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/257
* [OPTIMIZE] to optimize vLLM inference, set an environment variable 'VLLM_ATTENTI… (see sketch below) by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/260
* [FIX] hard-set batch_size to 1 for transformers 4.43.0 due to a compat regression by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/279
* [FIX] vLLM Llama 3.1 support by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/280
* Use better default values for quantization config (see sketch below) by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/281
* [REFACTOR] Clean up backend and model_type usage by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/276
* [FIX] allow auto_round lm_head quantization by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/282
* [FIX] [MODEL] Llama-3.1-8B-Instruct's eos_token_id is a list by CSY-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/284
* [FIX] add release_vllm_model and import destroy_model_parallel inside it by LRL-ModelCloud in https://github.com/ModelCloud/GPTQModel/pull/288
* [FIX] AutoRound quant compatibility with vLLM/SGLang by Qubitium in https://github.com/ModelCloud/GPTQModel/pull/287
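The truncated [OPTIMIZE] title above appears to refer to vLLM's documented `VLLM_ATTENTION_BACKEND` environment variable; that is an assumption on our part, since the PR title is cut off. If so, it must be set before vLLM initializes:

```python
import os

# Assumption: the truncated PR title refers to vLLM's documented
# VLLM_ATTENTION_BACKEND variable. Set it before any vLLM model is
# created; accepted values include "FLASH_ATTN", "FLASHINFER", and
# "XFORMERS", depending on your vLLM build.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized("your-quantized-model", backend=BACKEND.VLLM)
```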
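For the improved quantization-config defaults (#281), a minimal quantization sketch, assuming the `QuantizeConfig` class and `from_pretrained`/`quantize` flow from the repo README. `bits` and `group_size` are shown explicitly for illustration, but with this release the defaults should be reasonable if you omit them:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Explicit values shown for illustration; this release ships better
# defaults, so QuantizeConfig() alone is a reasonable starting point.
quant_config = QuantizeConfig(bits=4, group_size=128)

# A tiny calibration set for illustration only; real calibration
# needs representative data (tensor inputs also work, per #254/#258).
calibration_dataset = [
    "gptqmodel is an easy-to-use llm quantization toolkit.",
    "The quick brown fox jumps over the lazy dog.",
]

model = GPTQModel.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", quant_config)
model.quantize(calibration_dataset)
model.save_quantized("Meta-Llama-3.1-8B-Instruct-gptq-4bit")
```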
**Full Changelog**: https://github.com/ModelCloud/GPTQModel/compare/v0.9.8...v0.9.9