- Highlights
- Features
- Improvements
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
**Highlights**
- Integrated Intel Neural Compressor into MSFT ONNX Runtime ([#16288](https://github.com/microsoft/onnxruntime/pull/16288)) and Olive ([#411](https://github.com/microsoft/Olive/pull/411), [#412](https://github.com/microsoft/Olive/pull/412), [#469](https://github.com/microsoft/Olive/pull/469)).
- Supported low-precision data types (INT4, NF4, FP4) and [Weight-Only](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md) Quantization algorithms, including RTN, [AWQ](https://arxiv.org/abs/2306.00978), [GPTQ](https://arxiv.org/abs/2210.17323), and TEQ, on ONNX Runtime and PyTorch for LLM optimization (see the sketch after this list).
- Supported the SparseGPT pruner ([88adfc](https://github.com/intel/neural-compressor/commit/88adfc99f6b2edf0144c7344be9236b6e1030b54)).
- Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids).
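
A minimal sketch of the new Weight-Only Quantization path on the PyTorch backend, following the Weight-Only Quantization doc linked above. The toy model here is a placeholder for a real LLM (e.g., a Hugging Face causal LM):

```python
# Hedged sketch: INT4 RTN Weight-Only Quantization with the PyTorch backend.
# The toy model is a placeholder for a real LLM.
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

float_model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # regex matched against supported op types
            "weight": {
                "bits": 4,         # INT4 weights
                "group_size": 32,  # group-wise scales; -1 means per-channel
                "scheme": "sym",   # symmetric quantization
                "algorithm": "RTN",  # AWQ/GPTQ/TEQ additionally need a calibration dataloader
            },
        },
    },
)
q_model = quantization.fit(float_model, conf)
q_model.save("./saved_results")
```
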
**Features**
- [Quantization] Support ONNX Runtime quantization and inference for DNNL EP ([79be8b](https://github.com/intel/neural-compressor/commit/79be8b99c676f0b0b10ea56eff71868fc8696910))
- [Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP ([750bb9](https://github.com/intel/neural-compressor/commit/750bb9bc28566d2c2189fde2e6edc8bf3ce3cbbb))
- [Quantization] Support low-precision and Weight-Only Quantization (WOQ) algorithms, including RTN ([501440](https://github.com/intel/neural-compressor/commit/501440ab560056e2e3a1a75c922361ebf614fc04), [19ab16](https://github.com/intel/neural-compressor/commit/19ab16c1275aed3efea0267c384e203790f04c03), [859315](https://github.com/intel/neural-compressor/commit/85931587d6fb9fd10d16e5c750dc5fdc519bda73)), AWQ ([2562f2](https://github.com/intel/neural-compressor/commit/2562f29842e3eac4a28d11ca4502376375b893bf), [641d42](https://github.com/intel/neural-compressor/commit/641d42b2ebf873e87aa7d5bb0b2fcd518550022f)), GPTQ ([b5ac3c](https://github.com/intel/neural-compressor/commit/b5ac3c4492c7f21ea0e6910eba11a637b67405f1), [6ba783](https://github.com/intel/neural-compressor/commit/6ba78372cc846ab961b73b9b7007ec41e75341e8)), and TEQ ([d2f995](https://github.com/intel/neural-compressor/commit/d2f995bf00bf808eb318887e8bbbea6e0529740e), [9ff7f0](https://github.com/intel/neural-compressor/commit/9ff7f01c3ca9f5aba0aff01260d58ce3007a8f4c)) for PyTorch
- [Quantization] Support NF4 and FP4 data types for PyTorch Weight-Only Quantization ([3d11b5](https://github.com/intel/neural-compressor/commit/3d11b5e78d7bddcdee56f354de9d1f78a3da2033))
- [Quantization] Support low-precision and Weight-Only Quantization algorithms, including RTN, AWQ, and GPTQ for ONNX Runtime ([da4c92](https://github.com/intel/neural-compressor/commit/da4c92cdcc1a16df2643a87ab35b49b277c2fb5b))
- [Quantization] Support layer-wise quantization ([d9d1fc](https://github.com/intel/neural-compressor/commit/d9d1fccf67ce32e545bc9986936edebce01c500a)) and enable it with SmoothQuant ([ec9ae9](https://github.com/intel/neural-compressor/commit/ec9ae913abfffd138dec55d3915fde52a96f6445))
- [Pruning] Add SparseGPT pruner and refactor the pruning class ([88adfc](https://github.com/intel/neural-compressor/commit/88adfc99f6b2edf0144c7344be9236b6e1030b54))
- [Pruning] Add Hyper-parameter Optimization algorithm for pruning ([6613cf](https://github.com/intel/neural-compressor/commit/6613cfa9c7b8a06b3b85f35e2cf3ba2663766fd3))
- [Model Export] Support PyTorch-to-ONNX (PT2ONNX) dynamic quantization export ([165532](https://github.com/intel/neural-compressor/commit/16553260b23fbe237dc66726d9e3a1637a6e0cb1)) (see the export sketch after this list)
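
A sketch of the PT2ONNX dynamic quantization export flow, under the assumption that it goes through the `Torch2ONNXConfig` API described in the export documentation; the toy model and tensor names are placeholders:

```python
# Hedged sketch: quantize dynamically, then export PyTorch -> ONNX.
# Assumes the Torch2ONNXConfig export path (see docs/source/export.md);
# the toy model and I/O names are placeholders.
import torch
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.config import Torch2ONNXConfig

float_model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)
q_model = quantization.fit(float_model, PostTrainingQuantConfig(approach="dynamic"))

export_conf = Torch2ONNXConfig(
    dtype="int8",
    opset_version=14,
    example_inputs=torch.randn(1, 64),  # match your model's input shape
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}},
)
q_model.export("int8-model.onnx", export_conf)
```
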
**Improvements**
- [Common] Clean up dataloader usage in examples ([1044d8](https://github.com/intel/neural-compressor/commit/1044d8d4b722315cc62e9b4b80573e4cd7706465), [a2931e](https://github.com/intel/neural-compressor/commit/a2931eaa4052eec195be3c79a13f7bfa23e54473), [447cc7](https://github.com/intel/neural-compressor/commit/447cc7f2a70b15943c87494662aff32c740b62c8))
- [Common] Enhance ONNX Runtime backend check ([4ce9de](https://github.com/intel/neural-compressor/commit/4ce9de5feb472dbab57a3bb9369c8b7ba1c57305))
- [Strategy] Add block-wise distributed fallback in basic strategy ([ea309f](https://github.com/intel/neural-compressor/commit/ea309f51925be25d3cc0ecfb32922789e3b645cb))
- [Strategy] Enhance strategy exit policy ([d19b42](https://github.com/intel/neural-compressor/commit/d19b42f9193f455990a9b4bfdd47d2795e04b154))
- [Quantization] Add WeightOnlyLinear for the Weight-Only approach to enable low-memory inference ([00bbf8](https://github.com/intel/neural-compressor/commit/00bbf8413e863d1ac4b3ad3c35d95371c9bba023))
- [Quantization] Support more ONNX Runtime direct INT8 ops ([b9ce61](https://github.com/intel/neural-compressor/commit/b9ce61a860cc793123575e549c2c174474e93ef9))
- [Quantization] Support TensorFlow per-channel MatMul quantization ([cf5589](https://github.com/intel/neural-compressor/commit/cf55895b8d5c6c6280fe70d437db93bf76cd76d0))
- [Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant ([084eda](https://github.com/intel/neural-compressor/commit/084edad14a0235c529dc04ce65cd044c32a61047)) (see the SmoothQuant sketch after this list)
- [Quantization] Enhance ONNX SmoothQuant tuning structure ([f0d51c](https://github.com/intel/neural-compressor/commit/f0d51c2cd35b94972a7db2caea2f2d0fd39dc61b))
- [Quantization] Enhance PyTorch SmoothQuant tuning structure ([81da40](https://github.com/intel/neural-compressor/commit/81da4039f47f671fc670df95482aa97caecf4afd))
- [Quantization] Update PyTorch examples dataloader to support transformers 4.31.x ([59371f](https://github.com/intel/neural-compressor/commit/59371feeea7f63bf60c4386f90bcf70569b69284))
- [Quantization] Enhance ONNX Runtime backend setting for GPU EP support ([295535](https://github.com/intel/neural-compressor/commit/295535ac8b0f957deda236f4b06e5565b43974fd))
- [Pruning] Refactor pruning ([92d14d](https://github.com/intel/neural-compressor/commit/92d14d7f8409451c0d8dfc4fc4ab1a0352de7248))
- [Mixed Precision] Update the list of supported layers for Keras mixed precision ([692c8b](https://github.com/intel/neural-compressor/commit/692c8bbc16fb7c4913ebb8ec699ce35757067b41))
- [Mixed Precision] Introduce quant_level into mixed precision ([0dc6a9](https://github.com/intel/neural-compressor/commit/0dc6a92f07b8cad14a0d1967095476a5db7815e3))
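
A sketch of enabling SmoothQuant with the new alpha auto-tuning via the `smooth_quant` recipe; `float_model` and `calib_dataloader` are placeholders for a user model and calibration data:

```python
# Hedged sketch: PyTorch SmoothQuant with automatic alpha tuning.
# `float_model` and `calib_dataloader` are user-supplied placeholders.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": "auto"},  # or a fixed float such as 0.5
    },
)
q_model = quantization.fit(float_model, conf, calib_dataloader=calib_dataloader)
```
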
**Productivity**
- [Ecosystem] MSFT Olive integrates SmoothQuant and 3 LLM examples ([#411](https://github.com/microsoft/Olive/pull/411), [#412](https://github.com/microsoft/Olive/pull/412), [#469](https://github.com/microsoft/Olive/pull/469))
- [Ecosystem] MSFT ONNX Runtime integrates SmoothQuant static quantization ([#16288](https://github.com/microsoft/onnxruntime/pull/16288))
- [Neural Insights] Support PyTorch FX tensor inspection and integrate it with Neural Insights ([775def](https://github.com/intel/neural-compressor/commit/775deff8e10187a793b902f2dbe248961824d8a0), [74a785](https://github.com/intel/neural-compressor/commit/74a785ef2ad3d494b452680c577959a934b2fcb0))
- [Neural Insights] Add step-by-step diagnosis cases ([99c3b0](https://github.com/intel/neural-compressor/commit/99c3b06b3a90a33434b4a387035459dfe0607e34))
- [Neural Solution] Resource management and user-facing API enhancement ([fbba10](https://github.com/intel/neural-compressor/commit/fbba10cf10d4ee8540e22d2e7ef0b70d4e6e0583))
- [Auto CI] Integrate automatic code-scan and bug-fix tools into CI ([f77a2c](https://github.com/intel/neural-compressor/commit/f77a2c7606cdd2a0dec39c61d5ab95325272bcf2), [06cc38](https://github.com/intel/neural-compressor/commit/06cc3829eb1fa38db8404272999ee1cc11fa4dff))
**Bug Fixes**
- Fix bugs in PyTorch SmoothQuant ([0349b9](https://github.com/intel/neural-compressor/commit/0349b9ae2e0399900725eb9ec6f7013ae9df3eda), [8f3645](https://github.com/intel/neural-compressor/commit/8f3645289998b28f4206e9fb48c2f4f2123527c1))
- Fix PyTorch dataloader batch size issue ([6a98d0](https://github.com/intel/neural-compressor/commit/6a98d0ba7bacd238782f85928d84b5d1ff720d12))
- Fix bugs for ONNX Runtime CUDA EP ([a1b566](https://github.com/intel/neural-compressor/commit/a1b566fb5607c3a8e508d0d24350a36f3c8c0b0a), [d1f315](https://github.com/intel/neural-compressor/commit/d1f315f359440382d713a0a20c7927c7c0d252a1))
- Fix a bug in the ONNX Runtime adapter where the _rename_node function fails for models larger than 2 GB ([1f6b1a](https://github.com/intel/neural-compressor/commit/1f6b1adc09a3fb5ae43cd0e721bc4430b636f596))
- Fix ONNX Runtime diagnosis bug ([f10e26](https://github.com/intel/neural-compressor/commit/f10e26390da84c4d3ef68c4f23c11c62b31cfa1a))
- Update Neural Solution example and fix gRPC port issue ([528868](https://github.com/intel/neural-compressor/commit/5288684ba89fd50c325a72abdd4899c568b33dbd))
- Fix the objective initialization issue ([9d7546](https://github.com/intel/neural-compressor/commit/9d7546fd5dc2a4cced6238f940ddb1ad1a4f893f))
- Fix reshape issue in the Bayesian strategy ([77cb83](https://github.com/intel/neural-compressor/commit/77cb836060e83082ff71c1ac862e7b8aceda08e1))
- Fix CVEs ([d86922](https://github.com/intel/neural-compressor/commit/d869227695a544dfc8f26a1306c386c8858ffc16), [2bbfcd](https://github.com/intel/neural-compressor/commit/2bbfcd38ab4faba656847d7cc7df9b34d18c079d), [fc71fa](https://github.com/intel/neural-compressor/commit/fc71fac7dc6e51b2b259e35ed054d926e91a96fe))
**Examples**
- Add Weight-Only LLM examples for PyTorch ([4b24be](https://github.com/intel/neural-compressor/commit/4b24be1ec31bf9838c6052752f2530aa4814a630), [66f7c1](https://github.com/intel/neural-compressor/commit/66f7c10d566a6217395c2a6c34dea0c32d5a0ad3), [aa457a](https://github.com/intel/neural-compressor/commit/aa457a3f966a4f8dcdacd599834d0b05a38170bf))
- Add Weight-Only LLM examples for ONNX Runtime ([10c133](https://github.com/intel/neural-compressor/commit/10c133162e725c8d96f514a9b7e986730d594c02)) (see the configuration sketch after this list)
- Enable 3 ONNX Runtime examples: CodeBERT ([5e584e](https://github.com/intel/neural-compressor/commit/5e584e6e74e039cfc269dfc1972e9f6a1687d41e)), LayoutLMv2 FUNSD ([5f0b17](https://github.com/intel/neural-compressor/commit/5f0b17e977b095bf6c2ba9e907b004025511369e)), and Table Transformer ([eb8a95](https://github.com/intel/neural-compressor/commit/eb8a956dc056075d96a309831c177a63b81a76ee))
- Add ONNX Runtime LLM SmoothQuant example for Llama-7B ([7fbcf5](https://github.com/intel/neural-compressor/commit/7fbcf54d9f331c0f4cb38767de1224e3bf3f0db9))
- Enable 2 TensorFlow examples: ViT ([94df99](https://github.com/intel/neural-compressor/commit/94df9977513ae10af3139d2462f5e8ff8ca4329c)) and GraphSAGE ([29ec82](https://github.com/intel/neural-compressor/commit/29ec821fad39f5780d8b0f6be41460c98295e227))
- Add easy get-started notebooks ([d7b608](https://github.com/intel/neural-compressor/commit/d7b608b341013e5669c2ab90a6ba59b663aa63a7), [6ee846](https://github.com/intel/neural-compressor/commit/6ee8466d17bc3a3fee9e7722d13e1e5e9e2d63cd))
- Add a multi-card magnitude pruning use case ([909618](https://github.com/intel/neural-compressor/commit/9096188ef18901ab56416601cd965b974800545c))
- Unify ONNX Runtime prepare model scripts ([5ecb13](https://github.com/intel/neural-compressor/commit/5ecb134988eef0d62bc43858ae7321b52ecc8590))
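
For the ONNX Runtime Weight-Only examples, a hedged configuration sketch; the model path and the MatMul-only `op_type_dict` scope are illustrative assumptions, not part of this release:

```python
# Hedged sketch: INT4 RTN Weight-Only Quantization of an ONNX LLM.
# "llama-7b/model.onnx" is a placeholder path; limiting the scope to
# MatMul ops is an illustrative choice.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        "MatMul": {"weight": {"bits": 4, "group_size": 32, "algorithm": "RTN"}},
    },
)
q_model = quantization.fit("llama-7b/model.onnx", conf)
q_model.save("llama-7b-int4.onnx")
```
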
**Validated Configurations**
- CentOS 8.4 & Ubuntu 22.04
- Python 3.7, 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.11, 2.12, 2.13
- ITEX 1.1.0, 1.2.0, 2.13.0.0
- PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.1+cpu
- ONNX Runtime 1.13.1, 1.14.1, 1.15.1
- MXNet 1.9.1