New models supported in the ONNX export
Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.
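For example, a newly supported architecture can be exported with the CLI; the checkpoint name below is only an illustration:

```
optimum-cli export onnx --model facebook/wav2vec2-base-960h wav2vec2_onnx/
```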
* Add PoolFormer support in exporters.onnx by BakingBrains in https://github.com/huggingface/optimum/pull/646
* Support pegasus exporters by mht-sharma in https://github.com/huggingface/optimum/pull/620
* Audio models support with `optimum.exporters.onnx` by michaelbenayoun in https://github.com/huggingface/optimum/pull/622
* Add MPNet ONNX export by jplu in https://github.com/huggingface/optimum/pull/691
* Add stable diffusion VAE encoder export by echarlaix in https://github.com/huggingface/optimum/pull/705
* Add vision encoder decoder model in exporters by mht-sharma in https://github.com/huggingface/optimum/pull/588
* Nystromformer ONNX export by whr778 in https://github.com/huggingface/optimum/pull/728
* Support Splinter exporters (555) by Allanbeddouk in https://github.com/huggingface/optimum/pull/736
* Add gpt-neo-x support by sidthekidder in https://github.com/huggingface/optimum/pull/745
New models supported in BetterTransformer
A few additional architectures are supported in BetterTransformer: RoCBert, RoFormer, Marian.
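As a usage sketch (the Marian checkpoint below is only an illustration), a supported model is converted with `BetterTransformer.transform`:

```python
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

# Load a Marian checkpoint (illustrative) and convert it to its BetterTransformer version
model = AutoModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model = BetterTransformer.transform(model)
```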
* Add RoCBert support for Bettertransformer by shogohida in https://github.com/huggingface/optimum/pull/542
* Add better transformer support for RoFormer by manish-p-gupta in https://github.com/huggingface/optimum/pull/680
* added BetterTransformer support for Marian by IlyasMoutawwakil in https://github.com/huggingface/optimum/pull/808
Additional tasks supported in the ONNX Runtime integration
The following classes have been added: `ORTModelForMaskedLM`, `ORTModelForVision2Seq`, `ORTModelForAudioClassification`, `ORTModelForCTC`, `ORTModelForAudioXVector`, `ORTModelForAudioFrameClassification`, and `ORTStableDiffusionPipeline`.
References: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models
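For instance, a minimal fill-mask sketch with `ORTModelForMaskedLM` (the checkpoint name is only an illustration):

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForMaskedLM

# Export the PyTorch model to ONNX on the fly and run it with ONNX Runtime
model = ORTModelForMaskedLM.from_pretrained("distilbert-base-uncased", export=True)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("Paris is the [MASK] of France."))
```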
* Add ORTModelForMaskedLM class by JingyaHuang in https://github.com/huggingface/optimum/pull/729
* Add ORTModelForVision2Seq for VisionEncoderDecoder models inference by mht-sharma in https://github.com/huggingface/optimum/pull/742
* Add ORTModelXXX for audio by mht-sharma in https://github.com/huggingface/optimum/pull/774
* Add stable diffusion onnx runtime pipeline by echarlaix in https://github.com/huggingface/optimum/pull/786
Support for ONNX export from PyTorch in float16
In the ONNX export, the options `--fp16 --device cuda` can be passed to export in float16 when a GPU is available, relying directly on the native [`torch.onnx.export`](https://pytorch.org/docs/stable/onnx.html#torch.onnx.export).
Example: `optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/`
* Support ONNX export on `torch.float16` type by fxmarty in https://github.com/huggingface/optimum/pull/749
TFLite export
TFLite export is now supported, with static shapes:
```
optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/
```
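The exported model can then be run with the TensorFlow Lite interpreter. This is only a sketch: it assumes the export above produced a `bert_tflite/model.tflite` file, and feeds dummy inputs:

```python
import numpy as np
import tensorflow as tf

# Assumed output file name of the export above
interpreter = tf.lite.Interpreter(model_path="bert_tflite/model.tflite")
interpreter.allocate_tensors()

# Feed dummy inputs matching the static shapes fixed at export time
for detail in interpreter.get_input_details():
    dummy = np.zeros(detail["shape"], dtype=detail["dtype"])
    interpreter.set_tensor(detail["index"], dummy)
interpreter.invoke()

output_details = interpreter.get_output_details()
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```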
* `exporters.tflite` initial support by michaelbenayoun in https://github.com/huggingface/optimum/pull/716
* TFLite auto-encoder models by michaelbenayoun in https://github.com/huggingface/optimum/pull/757
* [TFLite Export] Adds support for ResNet by sayakpaul in https://github.com/huggingface/optimum/pull/813
ONNX Runtime optimization and quantization directly in the CLI
* Add optimize and quantize command CLI by jplu in https://github.com/huggingface/optimum/pull/700
* Support ONNX Runtime optimizations in exporters.onnx by fxmarty in https://github.com/huggingface/optimum/pull/807
The ONNX export can optionally apply ONNX Runtime optimizations directly during the export, by passing an option from `--optimize O1` up to `--optimize O4`:

```
optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/
```
ONNX Runtime quantization is supported directly from the command line, using `optimum-cli onnxruntime quantize`:

```
optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512
```
ONNX Runtime optimization is supported directly from the command line, using `optimum-cli onnxruntime optimize`:

```
optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3
```
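The resulting models can then be loaded back with the corresponding `ORTModel` class. A sketch, assuming the quantized file was saved under the default name `model_quantized.onnx` in the same directory:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# file_name is an assumption on the default name of the quantized model
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert_onnx", file_name="model_quantized.onnx"
)
```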
ORTModelForCausalLM supports decoding with a single ONNX
Up to now, two ONNX files were used for decoders:
* One handling the first forward pass, where no past key values have been cached yet, and thus not taking them as input.
* One handling the subsequent forward passes, where past key values have been cached, and thus taking them as input.
This release introduces support, in the ONNX export and in `ORTModelForCausalLM`, for a single ONNX handling both steps of the decoding. This **reduces memory usage**, as weights are no longer duplicated between two separate models during inference.
A single ONNX for decoders can be used by passing `use_merged=True` to `ORTModelForCausalLM.from_pretrained`, loading directly from a PyTorch model:
```python
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)
```
Alternatively, exporting a single ONNX for decoders is the default behavior of the ONNX export, and the result can later be used, for example, with `ORTModelForCausalLM`. The command `optimum-cli export onnx --model gpt2 gpt2_onnx/` will produce:
```
└── gpt2_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json
```
The `decoder_model.onnx` and `decoder_with_past_model.onnx` files are kept separate for backward compatibility, but using solely `decoder_model_merged.onnx` during inference is enough.
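A minimal generation sketch using the exported directory from above:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# The exported directory contains the tokenizer files as well
tokenizer = AutoTokenizer.from_pretrained("gpt2_onnx")
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx", use_merged=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
# Past key values are reused across decoding steps by the single merged ONNX
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```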
* Enable inference with a merged decoder in `ORTModelForCausalLM` by JingyaHuang in https://github.com/huggingface/optimum/pull/647
Single-file ORTModels accept numpy arrays
`ORTModel` now accepts numpy arrays as inputs, and returns numpy arrays as outputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX file.
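A sketch of passing numpy inputs (the checkpoint is only an illustration):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = ORTModelForSequenceClassification.from_pretrained(checkpoint, export=True)

# return_tensors="np" yields numpy arrays; the outputs are numpy arrays as well
inputs = tokenizer("I love this movie!", return_tensors="np")
outputs = model(**inputs)
print(type(outputs.logits))
```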
* Accept numpy.ndarray as input and output to ORTModel by fxmarty in https://github.com/huggingface/optimum/pull/790
ORTOptimizer support for ORTModelForCausalLM
* ORTOptimizer support ORTModelForCausalLM by fxmarty in https://github.com/huggingface/optimum/pull/794
* Support IO Binding for merged decoder by fxmarty in https://github.com/huggingface/optimum/pull/797
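A sketch of optimizing a causal LM with `ORTOptimizer` (the optimization level below is only an illustration):

```python
from optimum.onnxruntime import ORTModelForCausalLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

# Apply basic and extended graph optimizations, then save the optimized model
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="gpt2_optimized", optimization_config=optimization_config)
```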
Breaking changes
* In the ONNX export, exporting models as several ONNX files (encoder, decoder) is now the default behavior: https://github.com/huggingface/optimum/pull/747. The old behavior is still accessible with `--monolith` (see the example after this list).
* In decoders, reusing past key values is now the default in the ONNX export: https://github.com/huggingface/optimum/pull/748. The old behavior is still accessible by explicitly passing, for example, `--task causal-lm` instead of `--task causal-lm-with-past`.
* BigBird support in the ONNX export is removed, due to the `block_sparse` attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: https://github.com/huggingface/optimum/pull/778
* The parameter `from_transformers` of `ORTModel.from_pretrained` will be deprecated in favor of `export`.
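For reference, the previous behavior (a single ONNX file, without past key value reuse) can still be obtained with, for example:

```
optimum-cli export onnx --model gpt2 --monolith --task causal-lm gpt2_onnx/
```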
Bugfixes and improvements
* Fix disable shape inference for optimization by regisss in https://github.com/huggingface/optimum/pull/652
* Fix uninformative message when passing `use_cache=True` to ORTModel and no ONNX with cache is available by fxmarty in https://github.com/huggingface/optimum/pull/650
* Fix provider options when several providers are passed by fxmarty in https://github.com/huggingface/optimum/pull/653
* Add TensorRT engine to ONNX Runtime GPU documentation by fxmarty in https://github.com/huggingface/optimum/pull/657
* Improve documentation around ONNX export by fxmarty in https://github.com/huggingface/optimum/pull/666
* minor updates on ONNX config guide by mszsorondo in https://github.com/huggingface/optimum/pull/662
* Fix FlaubertOnnxConfig by michaelbenayoun in https://github.com/huggingface/optimum/pull/669
* Use nvcr.io/nvidia/tensorrt image for GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/660
* Better Transformer doc fix by HamidShojanazeri in https://github.com/huggingface/optimum/pull/670
* Add support for LongT5 optimization using ORT transformer optimizer script by kunal-vaishnavi in https://github.com/huggingface/optimum/pull/683
* Add test for missing execution providers error messages by fxmarty in https://github.com/huggingface/optimum/pull/659
* ONNX transformation to cast int64 constants to int32 when possible by fxmarty in https://github.com/huggingface/optimum/pull/655
* Add missing normalized configs by fxmarty in https://github.com/huggingface/optimum/pull/694
* Remove code duplication in ORTModel's load_model by fxmarty in https://github.com/huggingface/optimum/pull/695
* Test more architectures in ORTModel by fxmarty in https://github.com/huggingface/optimum/pull/675
* Avoid initializing unwanted attributes for ORTModel's having several inference sessions by fxmarty in https://github.com/huggingface/optimum/pull/696
* Fix the ORTQuantizer loading from specific file by echarlaix in https://github.com/huggingface/optimum/pull/701
* Add saving of diffusion model additional components for onnx export by echarlaix in https://github.com/huggingface/optimum/pull/699
* Fix whisper export by mht-sharma in https://github.com/huggingface/optimum/pull/629
* Support trust remote code option in ONNX export and ONNX Runtime integration by fxmarty in https://github.com/huggingface/optimum/pull/702
* Add nightly tests on dependencies dev versions by fxmarty in https://github.com/huggingface/optimum/pull/703
* Fix exception condition by mht-sharma in https://github.com/huggingface/optimum/pull/706
* Add ORTModelForMultipleChoice to the documentation by fxmarty in https://github.com/huggingface/optimum/pull/712
* Fix yaml format for dev tests by fxmarty in https://github.com/huggingface/optimum/pull/710
* Add ONNX Runtime training benchmark by JingyaHuang in https://github.com/huggingface/optimum/pull/592
* Allow `from optimum.onnxruntime import QuantizationConfig` by fxmarty in https://github.com/huggingface/optimum/pull/715
* Fix documentation for doctest tests to pass by fxmarty in https://github.com/huggingface/optimum/pull/713
* Use transformers>=4.26.0 in setup.py by fxmarty in https://github.com/huggingface/optimum/pull/723
* Fix GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/724
* Fix ONNX Runtime inference in `ORTTrainer` by JingyaHuang in https://github.com/huggingface/optimum/pull/709
* `onnxruntime/modeling_ort.py` refactor, part 1 by michaelbenayoun in https://github.com/huggingface/optimum/pull/698
* Update docker and doc of ORT Trainer by JingyaHuang in https://github.com/huggingface/optimum/pull/725
* Add test for code examples in the documentation and docstrings by fxmarty in https://github.com/huggingface/optimum/pull/704
* add image classification example to optimum by prathikr in https://github.com/huggingface/optimum/pull/711
* Add TensorrtExecutionProvider modeling tests by fxmarty in https://github.com/huggingface/optimum/pull/722
* Whisper shape inference fix by michaelbenayoun in https://github.com/huggingface/optimum/pull/726
* Add some redirections to Optimum Habana's documentation by regisss in https://github.com/huggingface/optimum/pull/735
* Patch `ORTTrainer` inference with ONNX Runtime backend by JingyaHuang in https://github.com/huggingface/optimum/pull/737
* Remove dead code in whisper ONNX output by fxmarty in https://github.com/huggingface/optimum/pull/741
* Unpin protobuf 3.20.1 by fxmarty in https://github.com/huggingface/optimum/pull/738
* Fix speech2text export by mht-sharma in https://github.com/huggingface/optimum/pull/746
* Raise error on double call to `BetterTransformer.transform()` by fxmarty in https://github.com/huggingface/optimum/pull/750
* `exporters.onnx` output names and dynamic axes fix by michaelbenayoun in https://github.com/huggingface/optimum/pull/731
* Fix NNCF supported quantization strategies README table by echarlaix in https://github.com/huggingface/optimum/pull/752
* Add GPU tests for BetterTransformer by fxmarty in https://github.com/huggingface/optimum/pull/751
* Fix doctest by fxmarty in https://github.com/huggingface/optimum/pull/759
* Fix ONNX Runtime cache usage for decoders, add relevant tests by fxmarty in https://github.com/huggingface/optimum/pull/756
* Fix GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/758
* Update quality tooling for formatting by regisss in https://github.com/huggingface/optimum/pull/760
* Fix wrong shapes used at ONNX export and validation by fxmarty in https://github.com/huggingface/optimum/pull/764
* Change type annotation by michaelbenayoun in https://github.com/huggingface/optimum/pull/768
* Fix stable diffusion ONNX export by echarlaix in https://github.com/huggingface/optimum/pull/762
* Disable ONNX Runtime provider check on Windows by fxmarty in https://github.com/huggingface/optimum/pull/771
* Fix FusionOptions following ORT 1.14 release by fxmarty in https://github.com/huggingface/optimum/pull/772
* Unpin numpy <1.24.0 by fxmarty in https://github.com/huggingface/optimum/pull/773
* Fix flaky ONNX Runtime generation test with past key value reuse by fxmarty in https://github.com/huggingface/optimum/pull/765
* Fix output shape dimension for OnnxConfigWithPast by fxmarty in https://github.com/huggingface/optimum/pull/780
* Fix used shapes, device at ONNX export by fxmarty in https://github.com/huggingface/optimum/pull/777
* Pin numpy only for tensorflow export by fxmarty in https://github.com/huggingface/optimum/pull/781
* Fixed broken paper space links by Muhtasham in https://github.com/huggingface/optimum/pull/766
* Temporarily disable python 3.9 + macOS test due to onnxruntime 1.14 regression by fxmarty in https://github.com/huggingface/optimum/pull/783
* Update ORT Training to 1.14.0 by JingyaHuang in https://github.com/huggingface/optimum/pull/787
* Temporarily disable segformer TensorRT test by fxmarty in https://github.com/huggingface/optimum/pull/799
* Use a stateful ordered_input_names in ORTModel by fxmarty in https://github.com/huggingface/optimum/pull/796
* Test ORTOptimizer with IO Binding by fxmarty in https://github.com/huggingface/optimum/pull/801
* [`BT`] Add stable layer-norm Wav2vec2 by younesbelkada in https://github.com/huggingface/optimum/pull/803
* Update rules for ruff by regisss in https://github.com/huggingface/optimum/pull/806
* Improve orttrainer test by JingyaHuang in https://github.com/huggingface/optimum/pull/779
* Fix ORT quantization for TensorRT documentation by fxmarty in https://github.com/huggingface/optimum/pull/812
* Fix GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/814
* Update ONNX Runtime training doc - use torchrun by JingyaHuang in https://github.com/huggingface/optimum/pull/820
* Fix ONNX export tests by fxmarty in https://github.com/huggingface/optimum/pull/822
* All back workflow dispatch on GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/823
* BetterTransformer pipeline padding issue fix by vrdn-23 in https://github.com/huggingface/optimum/pull/821
* Fix optimum pipeline initialization by fxmarty in https://github.com/huggingface/optimum/pull/824
* Fix failing GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/829
* Remove feature dimension as dynamic axes for stable diffusion ONNX export by echarlaix in https://github.com/huggingface/optimum/pull/816
* Fix pipeline task dropping arguments bug by fxmarty in https://github.com/huggingface/optimum/pull/828
* Fix ORTQuantizer behavior with ORTModelForCausalLM by fxmarty in https://github.com/huggingface/optimum/pull/831
* Update tests by mht-sharma in https://github.com/huggingface/optimum/pull/826
* Fix exporters GPU CI by fxmarty in https://github.com/huggingface/optimum/pull/835
* Keep intermediary models for ONNX causal-lm by fxmarty in https://github.com/huggingface/optimum/pull/834
* Fix duplicate name merged decoder by fxmarty in https://github.com/huggingface/optimum/pull/837
* Apply lazy import for exporters by JingyaHuang in https://github.com/huggingface/optimum/pull/836
**Full Changelog**: https://github.com/huggingface/optimum/compare/v1.6.0...v1.7.0