Post-training Quantization:
Features:
- Introduced the optional `backup_mode` parameter in `nncf.compress_weights()` to specify the data type for embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are INT8_ASYM (the default), INT8_SYM, and NONE, which retains the original floating-point precision of the model weights.
- Added the `quantizer_propagation_rule` parameter, providing fine-grained control over quantizer propagation. This advanced option is intended for models where quantizers with different granularity could otherwise be merged into a single per-tensor quantizer, which may reduce model accuracy.
- Introduced the `nncf.data.generate_text_data` API method, which uses an LLM to generate data for subsequent data-aware optimization. See the [example](examples/llm_compression/openvino/tiny_llama_synthetic_data/) for details.
- (OpenVINO) Extended support for data-free and data-aware weight compression methods in `nncf.compress_weights()` with NF4 per-channel quantization, which makes compressed LLMs more accurate and faster on NPU.
- (OpenVINO) Introduced a new option `statistics_path` to cache and reuse statistics for `nncf.compress_weights()`, reducing the time required to find optimal compression configurations. See the [TinyLlama example](examples/llm_compression/openvino/tiny_llama_find_hyperparams) for details.
- (TorchFX, Experimental) Added support for quantization and weight compression of [Torch FX](https://pytorch.org/docs/stable/fx.html) models. The compressed models can be executed directly via `torch.compile(compressed_model, backend="openvino")` (see details [here](https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html)). Added an [INT8 quantization example](examples/post_training_quantization/torch_fx/resnet18). Supported features:
  - INT8 quantization with SmoothQuant, MinMax, FastBiasCorrection, and BiasCorrection algorithms via `nncf.quantize()`.
  - Data-free INT8, INT4, and mixed-precision weight compression with `nncf.compress_weights()`.
- (PyTorch, Experimental) Added model tracing and pre- and post-execution hooks based on `TorchFunctionMode`.
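
As a rough illustration of the `backup_mode` and `statistics_path` options described above, the following sketch wraps `nncf.compress_weights()` in a small helper. The helper name `compress_4bit`, its arguments, and the choice of `INT4_SYM` are hypothetical. It also assumes that the backup options are exposed as members of `nncf.BackupMode` and that `statistics_path` is passed through `nncf.AdvancedCompressionParameters`; check the API reference for your NNCF version before relying on either.

```python
# Option names taken from the release note for `backup_mode`.
BACKUP_CHOICES = ("INT8_ASYM", "INT8_SYM", "NONE")


def compress_4bit(model, backup="INT8_ASYM", dataset=None, statistics_dir=None):
    """Hypothetical helper: 4-bit weight compression with a chosen backup precision."""
    if backup not in BACKUP_CHOICES:
        raise ValueError(f"backup must be one of {BACKUP_CHOICES}, got {backup!r}")

    # Imported lazily so the helper can be defined without NNCF installed.
    import nncf

    kwargs = {
        # Illustrative choice of 4-bit mode; other 4-bit modes may suit a model better.
        "mode": nncf.CompressWeightsMode.INT4_SYM,
        # Assumption: the options are members of nncf.BackupMode.
        "backup_mode": getattr(nncf.BackupMode, backup),
    }
    if dataset is not None:
        # An nncf.Dataset enables the data-aware algorithms.
        kwargs["dataset"] = dataset
    if statistics_dir is not None:
        # Assumption: statistics_path is set via AdvancedCompressionParameters.
        kwargs["advanced_parameters"] = nncf.AdvancedCompressionParameters(
            statistics_path=statistics_dir
        )
    return nncf.compress_weights(model, **kwargs)
```

Under these assumptions, `compress_4bit(ov_model, backup="NONE")` would leave embeddings, convolutions, and last linear layers in their original floating-point precision, while passing `statistics_dir` together with a dataset would let repeated runs reuse cached statistics.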
Fixes:
- Resolved an issue with redundant quantizer insertion before elementwise operations, reducing noise introduced by quantization.
- Fixed a type mismatch issue in `nncf.quantize_with_accuracy_control()`.
- Fixed the BiasCorrection algorithm for specific branching cases.
- (OpenVINO) Fixed the GPTQ weight compression method for Stable Diffusion models.
- (OpenVINO) Fixed an issue with variational statistics processing in `nncf.compress_weights()`.
- (PyTorch, ONNX) Aligned the quantization setup for the scaled dot product attention pattern with OpenVINO.
Improvements:
- Reduced peak memory usage by 30-50% for data-aware `nncf.compress_weights()` with AWQ, Scale Estimation, LoRA, and mixed-precision algorithms.
- Reduced compression time by 10-20% for `nncf.compress_weights()` with the AWQ algorithm.
- Aligned ignored-subgraph behavior across different `networkx` versions.
- Extended the ignored patterns with the RoPE block for the `nncf.ModelType.TRANSFORMER` scheme.
- (OpenVINO) Extended the ignored scope for the `nncf.ModelType.TRANSFORMER` scheme with the GroupNorm metatype.
- (ONNX) Extended the SE-block ignored pattern variant for `torchvision` mobilenet_v3.
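
The ignored-pattern changes above apply when the `nncf.ModelType.TRANSFORMER` scheme is selected. A minimal sketch of how that scheme is typically requested through `nncf.quantize()` follows; the helper name `quantize_transformer` and its default `subset_size` are illustrative assumptions, not part of NNCF.

```python
def quantize_transformer(model, samples, transform_fn=None, subset_size=300):
    """Hypothetical helper: INT8 post-training quantization with the transformer scheme."""
    if subset_size <= 0:
        raise ValueError("subset_size must be a positive number of calibration samples")

    # Imported lazily so the helper can be defined without NNCF installed.
    import nncf

    # nncf.Dataset wraps an iterable of samples; transform_fn maps each item
    # to the inputs the model expects.
    calibration_dataset = nncf.Dataset(samples, transform_fn)
    return nncf.quantize(
        model,
        calibration_dataset,
        model_type=nncf.ModelType.TRANSFORMER,  # enables the transformer-specific ignored patterns
        subset_size=subset_size,
    )
```

The patterns listed above (RoPE block, GroupNorm) are then excluded by NNCF itself; no extra arguments are needed in the call.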
Tutorials:
- [Post-Training Optimization of Llama-3.2-11B-Vision Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/mllama-3.2/mllama-3.2.ipynb)
- [Post-Training Optimization of YOLOv11 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/yolov11-optimization/yolov11-object-detection.ipynb)
- [Post-Training Optimization of Whisper in Automatic speech recognition with OpenVINO Generate API](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/whisper-asr-genai/whisper-asr-genai.ipynb)
- [Post-Training Optimization of Pixtral Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/pixtral/pixtral.ipynb)
- [Post-Training Optimization of LLM ReAct Agent Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/llm-agent-react/llm-agent-react.ipynb)
- [Post-Training Optimization of CatVTON Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/catvton/catvton.ipynb)
- [Post-Training Optimization of Stable Diffusion v3 Model in Torch FX Representation](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/stable-diffusion-v3/stable-diffusion-v3-torch-fx.ipynb)
Known issues:
- (ONNX) The `nncf.quantize()` method can generate inaccurate INT8 results for MobileNet models with the BiasCorrection algorithm.
Deprecations/Removals:
- Migrated the build and package configuration from `setup.py` to `pyproject.toml`, in line with the Python packaging standards outlined in PEP 517 and PEP 518. Installation through `setup.py` no longer works. Installation from PyPI and Conda is not affected.
- Removed support for Python 3.8.
- (PyTorch) `nncf.torch.create_compressed_model()` function has been deprecated.
Requirements:
- Updated ONNX (1.17.0) and ONNXRuntime (1.19.2) versions.
- Updated PyTorch (2.5.1) and Torchvision (0.20.1) versions.
- Updated NumPy (<2.2.0) version support.
- Updated Ultralytics (8.3.22) version.
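
To compare a local environment against the versions listed above, a small standard-library sketch such as the following may help. The notes do not state how strictly these versions are pinned, so the script only reports differences; the PyPI distribution names in `LISTED` are assumed, and the helper names are illustrative.

```python
from importlib import metadata

# Versions as listed in the Requirements notes; distribution names are assumed.
LISTED = {
    "onnx": "1.17.0",
    "onnxruntime": "1.19.2",
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "ultralytics": "8.3.22",
}


def installed_versions(packages):
    """Return {name: installed version or None} for each distribution name."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(listed, found):
    """Build one human-readable line per package comparing listed and installed versions."""
    lines = []
    for name, want in listed.items():
        have = found.get(name)
        if have is None:
            status = "not installed"
        elif have.split("+")[0] == want:  # ignore local build tags such as "+cpu"
            status = "matches"
        else:
            status = f"differs (installed {have})"
        lines.append(f"{name}: listed {want}, {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report(LISTED, installed_versions(LISTED))))
```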
**Acknowledgements**
Thanks for contributions from the OpenVINO developer community:
rk119
zina-cs