This patch release fixes a few bugs with the PyTorch 2.0 release, and includes a few new features as well.
## Breaking change: constant outputs removed from ONNX encoder-decoder models
We removed some constant past key values outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code; however, we recommend using the newly exported models, as this change removes unnecessary `Identity` nodes. You can re-export your models as sketched after the reference below.
* Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by fxmarty in https://github.com/huggingface/optimum/pull/872
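A minimal re-export sketch, assuming the `from_transformers=True` export argument available in this release, with `t5-small` as a stand-in for any encoder-decoder checkpoint:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Re-export the model to ONNX: the resulting decoder-with-past model
# no longer carries the removed constant past key values outputs.
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
model.save_pretrained("t5_small_onnx")
```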
## `torch.nn.functional.scaled_dot_product_attention` support for decoders in BetterTransformer
PyTorch 2.0 introduces, in beta, [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html), a fastpath for attention extending its accelerated transformer features. This is included in `optimum.bettertransformer` and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.
Beware that this is still experimental and speedups have yet to be validated on all architectures.
PyTorch's `scaled_dot_product_attention` makes it possible to use [flash attention](https://arxiv.org/abs/2205.14135) and [memory efficient attention](https://arxiv.org/abs/2112.05682) natively in PyTorch.
Usage is as follows:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
```
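Under the hood, PyTorch 2.0 lets you choose which `scaled_dot_product_attention` backend may be dispatched through the `torch.backends.cuda.sdp_kernel` context manager. A minimal sketch of combining it with a transformed model, assuming a CUDA device is available (the backend flags are PyTorch's, not Optimum's):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")
model = BetterTransformer.transform(model)

inputs = tokenizer("Hello, my dog is", return_tensors="pt").to("cuda")

# Disable the math fallback so that only the fused flash / memory-efficient
# kernels may be dispatched; PyTorch raises an error if neither is usable.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):
    outputs = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```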
Inference benchmark (on fp16):
| Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|--------------|------------|-----------------------|------------------|-------------------|-------------------------------|---------|------------------------|------------------------------------|----------------|
| gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
| gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
| opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
| gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |
Training benchmark (on fp16):
| Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|-------|------------|-----------------|------------------------------|------------------------------------------|---------|------------------------|------------------------------------|----------------|
| gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
| gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
| gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |
Benchmarks can be reproduced using the [inference script](https://github.com/huggingface/optimum/blob/main/tests/benchmark/benchmark_bettertransformer.py) and [training script](https://github.com/huggingface/optimum/blob/main/tests/benchmark/benchmark_bettertransformer_training_minimal.py):
```
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0
```
* Add scaled_dot_product_attention support for decoder models by fxmarty in https://github.com/huggingface/optimum/pull/853
* Support scaled_dot_product_attention for t5 by fxmarty in https://github.com/huggingface/optimum/pull/856
* [`BT`] add decoder benchmark script by younesbelkada in https://github.com/huggingface/optimum/pull/857
* [`BT`] Fix bt benchmark by younesbelkada in https://github.com/huggingface/optimum/pull/858
* Fix pytorch version check in bettertransformer by fxmarty in https://github.com/huggingface/optimum/pull/862
* [`BT`] Add fp16 support by younesbelkada in https://github.com/huggingface/optimum/pull/859
* [`BT`] Add decoder training support by younesbelkada in https://github.com/huggingface/optimum/pull/860
* Bart support scaled_dot_product_attention by fxmarty in https://github.com/huggingface/optimum/pull/863
* [`BT`] add `accelerate_test` markers by younesbelkada in https://github.com/huggingface/optimum/pull/864
* Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by fxmarty in https://github.com/huggingface/optimum/pull/865
* Add bettertransformer reverse transform by fxmarty in https://github.com/huggingface/optimum/pull/868
* Add bettertransformer training benchmark script by fxmarty in https://github.com/huggingface/optimum/pull/873
## New architectures in the ONNX export
Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, and OPT. A short usage sketch follows the list below.
* Adding ONNX support for ImageGPT by adit299 in https://github.com/huggingface/optimum/pull/819
* Add ONNX support for RegNet by asrimanth in https://github.com/huggingface/optimum/pull/833
* Adding support for Facebook's OPT models by hivaze in https://github.com/huggingface/optimum/pull/852
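As an illustration, a newly supported decoder such as OPT can now be exported and run with ONNX Runtime. A minimal sketch, assuming the `from_transformers=True` export path of this release and using `facebook/opt-125m` as an example checkpoint:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Export the PyTorch checkpoint to ONNX on the fly and load it with ONNX Runtime.
model = ORTModelForCausalLM.from_pretrained("facebook/opt-125m", from_transformers=True)

inputs = tokenizer("Hello, my dog is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```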
## (WIP) TFLite export with quantization support
Work continues on the TFLite export, now with quantization support. This is still in progress and not documented yet.
* Quantization with TFLite by michaelbenayoun in https://github.com/huggingface/optimum/pull/854
## Bugfixes and improvements
* Update documentation by echarlaix in https://github.com/huggingface/optimum/pull/843
* Fix typo in documentation by regisss in https://github.com/huggingface/optimum/pull/848
* Remove redundant code by mht-sharma in https://github.com/huggingface/optimum/pull/841
* Update README by echarlaix in https://github.com/huggingface/optimum/pull/850
* Update documentation by echarlaix in https://github.com/huggingface/optimum/pull/855
* Remove iobinding ORTModelForCTC by mht-sharma in https://github.com/huggingface/optimum/pull/840
* Fix typo in documentation by echarlaix in https://github.com/huggingface/optimum/pull/861
* Fix causal-lm ONNX axis names by fxmarty in https://github.com/huggingface/optimum/pull/871
* add NNCF openvino notebook by echarlaix in https://github.com/huggingface/optimum/pull/875
* Remove positional-only parameters not support by python < v3.8 by echarlaix in https://github.com/huggingface/optimum/pull/881
* lazy import for task manager by JingyaHuang in https://github.com/huggingface/optimum/pull/844
* Remove onnx and ort dependencies on the TasksManager by michaelbenayoun in https://github.com/huggingface/optimum/pull/846
* Reactivate export & optimization tests for causal-lm models by fxmarty in https://github.com/huggingface/optimum/pull/885
* Fix ONNX export on transformers 4.27 release by fxmarty in https://github.com/huggingface/optimum/pull/884
* Do not use scaled_dot_product_attention for stable diffusion onnx export by fxmarty in https://github.com/huggingface/optimum/pull/888
* Fix loading of an ONNX stable diffusion model when config doesn't match by echarlaix in https://github.com/huggingface/optimum/pull/887
* Automatic framework detection in TasksManager for large models by fxmarty in https://github.com/huggingface/optimum/pull/883
* Fix WavLM onnx export upon torch 2.0 release by fxmarty in https://github.com/huggingface/optimum/pull/889
* Fix PushToHubMixin._create_repo according to transformers 4.27 release by fxmarty in https://github.com/huggingface/optimum/pull/892
* Fix stable diffusion framework detection by fxmarty in https://github.com/huggingface/optimum/pull/893
* Add donut CPU inference ORT by mht-sharma in https://github.com/huggingface/optimum/pull/761
* Fix check_model for large merged ONNX models by fxmarty in https://github.com/huggingface/optimum/pull/896
* Drop python 3.7 support by fxmarty in https://github.com/huggingface/optimum/pull/891
* Fix dummy label generator for vision tasks by JingyaHuang in https://github.com/huggingface/optimum/pull/900
* Add stable diffusion dummy object by echarlaix in https://github.com/huggingface/optimum/pull/899
* Automatic support for large ONNX models in ORTOptimizer by fxmarty in https://github.com/huggingface/optimum/pull/886
* Remove subprocess calls in ONNX export by fxmarty in https://github.com/huggingface/optimum/pull/897
* Registering mechanism for the `TasksManager` by michaelbenayoun in https://github.com/huggingface/optimum/pull/898
* add option to run inference with ort by prathikr in https://github.com/huggingface/optimum/pull/838
* Check min diffusers version by echarlaix in https://github.com/huggingface/optimum/pull/902
* Update bug-report.yml by lewtun in https://github.com/huggingface/optimum/pull/895
* Fix axis name for seq2seq ONNX models by fxmarty in https://github.com/huggingface/optimum/pull/904
* Fix GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/909
* Fix misleading error message in ORTOptimizer by fxmarty in https://github.com/huggingface/optimum/pull/910
* Delete all Docker images before building the doc of Optimum by regisss in https://github.com/huggingface/optimum/pull/911
* Fix onnx export preprocessors save by fxmarty in https://github.com/huggingface/optimum/pull/913
* Fix GPU CI by fxmarty in https://github.com/huggingface/optimum/pull/914
## New Contributors
* adit299 made their first contribution in https://github.com/huggingface/optimum/pull/819
* asrimanth made their first contribution in https://github.com/huggingface/optimum/pull/833
* hivaze made their first contribution in https://github.com/huggingface/optimum/pull/852
**Full Changelog**: https://github.com/huggingface/optimum/compare/v1.2.0...v1.7.2