## Optimum INC CLI
Integration of Intel Neural Compressor dynamic quantization into the Optimum command-line interface. Example commands:
```bash
optimum-cli inc --help
optimum-cli inc quantize --help
optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output int8_distilbert/
```
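For reference, a minimal sketch of loading the quantized model produced by the command above, assuming the `INCModelForQuestionAnswering` class exposed by `optimum.intel` follows the usual `from_pretrained` pattern of Optimum model classes:

```python
from optimum.intel import INCModelForQuestionAnswering

# Load the dynamically quantized model saved by the CLI command above
# (directory name taken from the --output argument).
model = INCModelForQuestionAnswering.from_pretrained("int8_distilbert")
```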
* Add Optimum INC CLI to apply dynamic quantization by echarlaix in https://github.com/huggingface/optimum-intel/pull/280
## Leverage past key values for OpenVINO decoder models
Adds the possibility to reuse the pre-computed key / values from previous decoding steps, making inference faster. This is enabled by default when exporting the model.
```python
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(model_id, export=True)
```
To disable it, `use_cache` can be set to `False` when loading the model:
```python
model = OVModelForCausalLM.from_pretrained(model_id, export=True, use_cache=False)
```
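As an illustration, a minimal generation sketch (the prompt is a placeholder, and `model_id` is assumed to point to a causal LM such as `gpt2`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
# With use_cache=True (the default), the key / values computed at previous
# decoding steps are reused, so each new token only needs a forward pass
# over the last position.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```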
* Enable the possibility to use the pre-computed key / values for OpenVINO decoder models by echarlaix in https://github.com/huggingface/optimum-intel/pull/274
## INC config summarizing optimization details
* Add `INCConfig` by echarlaix in https://github.com/huggingface/optimum-intel/pull/263
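As a quick sketch of what this enables, assuming `INCConfig` follows the usual `from_pretrained` pattern of Optimum configuration classes and that the model directory was produced by the CLI example above:

```python
from optimum.intel import INCConfig

# Load the configuration saved alongside the quantized model; it records
# which optimizations were applied (the attribute name below is an
# assumption for illustration).
inc_config = INCConfig.from_pretrained("int8_distilbert")
print(inc_config.quantization)
```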
## Fixes
* Remove dynamic shapes restriction for GPU devices by helena-intel in https://github.com/huggingface/optimum-intel/pull/262
* Enable OpenVINO model caching for CPU devices by helena-intel in https://github.com/huggingface/optimum-intel/pull/281
* Fix the `.to()` method for causal language models (see the sketch after this list) by helena-intel in https://github.com/huggingface/optimum-intel/pull/284
* Fix PyTorch model saving for `transformers>=4.28.0` when optimized with `OVTrainer` by echarlaix in https://github.com/huggingface/optimum-intel/pull/285
* Update the task names used for the ONNX and OpenVINO export for `optimum>=1.8.0` by echarlaix in https://github.com/huggingface/optimum-intel/pull/286
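Regarding the `.to()` fix, a minimal usage sketch (the model id is a placeholder; device names follow OpenVINO conventions, e.g. `"cpu"`, `"gpu"`):

```python
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
# Set the OpenVINO device the model will be compiled for.
model = model.to("gpu")
```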