This patch release fixes a few bugs with the PyTorch 2.0 release, and includes a few new features as well.
## Breaking change: constant outputs removed from ONNX encoder-decoder models
We removed some constant past key values outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code; however, we recommend using the newly exported models, as this change removes unnecessary `Identity` nodes. You can re-export your models as sketched after the reference below.
* Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by fxmarty in https://github.com/huggingface/optimum/pull/872
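A minimal re-export sketch, assuming the `from_transformers=True` export argument available in this release, with `t5-small` as a stand-in for any encoder-decoder checkpoint:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Re-export the model to ONNX: the resulting decoder-with-past model
# no longer carries the removed constant past key values outputs.
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
model.save_pretrained("t5_small_onnx")
```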
## `torch.nn.functional.scaled_dot_product_attention` support for decoders in BetterTransformer
PyTorch 2.0 introduces, in beta, [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html), a fastpath for attention extending its accelerated transformer features. This is included in `optimum.bettertransformer` and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.
Beware that this is still experimental and speedups have yet to be validated on all architectures.
PyTorch's `scaled_dot_product_attention` makes it possible to use [flash attention](https://arxiv.org/abs/2205.14135) and [memory efficient attention](https://arxiv.org/abs/2112.05682) natively in PyTorch.
Usage is as follows:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
```
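Under the hood, PyTorch 2.0 lets you choose which `scaled_dot_product_attention` backend may be dispatched through the `torch.backends.cuda.sdp_kernel` context manager. A minimal sketch of combining it with a transformed model, assuming a CUDA device is available (the backend flags are PyTorch's, not Optimum's):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")
model = BetterTransformer.transform(model)

inputs = tokenizer("Hello, my dog is", return_tensors="pt").to("cuda")

# Disable the math fallback so that only the fused flash / memory-efficient
# kernels may be dispatched; PyTorch raises an error if neither is usable.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):
    outputs = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```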
Inference benchmark (on fp16):
| Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|--------------|------------|-----------------------|------------------|-------------------|-------------------------------|---------|------------------------|------------------------------------|----------------|
| gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
| gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
| opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
| gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |
Training benchmark (on fp16):
| Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|-------|------------|-----------------|------------------------------|------------------------------------------|---------|------------------------|------------------------------------|----------------|
| gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
| gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
| gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |
Benchmarks can be reproduced using the [inference script](https://github.com/huggingface/optimum/blob/main/tests/benchmark/benchmark_bettertransformer.py) and [training script](https://github.com/huggingface/optimum/blob/main/tests/benchmark/benchmark_bettertransformer_training_minimal.py):
```
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0
```
* Add scaled_dot_product_attention support for decoder models by fxmarty in https://github.com/huggingface/optimum/pull/853
* Support scaled_dot_product_attention for t5 by fxmarty in https://github.com/huggingface/optimum/pull/856
* [`BT`] add decoder benchmark script by younesbelkada in https://github.com/huggingface/optimum/pull/857
* [`BT`] Fix bt benchmark by younesbelkada in https://github.com/huggingface/optimum/pull/858
* Fix pytorch version check in bettertransformer by fxmarty in https://github.com/huggingface/optimum/pull/862
* [`BT`] Add fp16 support by younesbelkada in https://github.com/huggingface/optimum/pull/859
* [`BT`] Add decoder training support by younesbelkada in https://github.com/huggingface/optimum/pull/860
* Bart support scaled_dot_product_attention by fxmarty in https://github.com/huggingface/optimum/pull/863
* [`BT`] add `accelerate_test` markers by younesbelkada in https://github.com/huggingface/optimum/pull/864
* Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by fxmarty in https://github.com/huggingface/optimum/pull/865
* Add bettertransformer reverse transform by fxmarty in https://github.com/huggingface/optimum/pull/868
* Add bettertransformer training benchmark script by fxmarty in https://github.com/huggingface/optimum/pull/873
## New architectures in the ONNX export
Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, and OPT. A short usage sketch follows the list below.
* Adding ONNX support for ImageGPT by adit299 in https://github.com/huggingface/optimum/pull/819
* Add ONNX support for RegNet by asrimanth in https://github.com/huggingface/optimum/pull/833
* Adding support for Facebook's OPT models by hivaze in https://github.com/huggingface/optimum/pull/852
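As an illustration, a newly supported decoder such as OPT can now be exported and run with ONNX Runtime. A minimal sketch, assuming the `from_transformers=True` export path of this release and using `facebook/opt-125m` as an example checkpoint:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Export the PyTorch checkpoint to ONNX on the fly and load it with ONNX Runtime.
model = ORTModelForCausalLM.from_pretrained("facebook/opt-125m", from_transformers=True)

inputs = tokenizer("Hello, my dog is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```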
## (WIP) TFLite export with quantization support
Work continues on the TFLite export, now with quantization support. This is still in progress and not documented yet.
* Quantization with TFLite by michaelbenayoun in https://github.com/huggingface/optimum/pull/854
## Bugfixes and improvements
* Update documentation by echarlaix in https://github.com/huggingface/optimum/pull/843
* Fix typo in documentation by regisss in https://github.com/huggingface/optimum/pull/848
* Remove redundant code by mht-sharma in https://github.com/huggingface/optimum/pull/841
* Update README by echarlaix in https://github.com/huggingface/optimum/pull/850
* Update documentation by echarlaix in https://github.com/huggingface/optimum/pull/855
* Remove iobinding ORTModelForCTC by mht-sharma in https://github.com/huggingface/optimum/pull/840
* Fix typo in documentation by echarlaix in https://github.com/huggingface/optimum/pull/861
* Fix causal-lm ONNX axis names by fxmarty in https://github.com/huggingface/optimum/pull/871
* add NNCF openvino notebook by echarlaix in https://github.com/huggingface/optimum/pull/875
* Remove positional-only parameters not support by python < v3.8 by echarlaix in https://github.com/huggingface/optimum/pull/881
* lazy import for task manager by JingyaHuang in https://github.com/huggingface/optimum/pull/844
* Remove onnx and ort dependencies on the TasksManager by michaelbenayoun in https://github.com/huggingface/optimum/pull/846
* Reactivate export & optimization tests for causal-lm models by fxmarty in https://github.com/huggingface/optimum/pull/885
* Fix ONNX export on transformers 4.27 release by fxmarty in https://github.com/huggingface/optimum/pull/884
* Do not use scaled_dot_product_attention for stable diffusion onnx export by fxmarty in https://github.com/huggingface/optimum/pull/888
* Fix loading of an ONNX stable diffusion model when config doesn't match by echarlaix in https://github.com/huggingface/optimum/pull/887
* Automatic framework detection in TasksManager for large models by fxmarty in https://github.com/huggingface/optimum/pull/883
* Fix WavLM onnx export upon torch 2.0 release by fxmarty in https://github.com/huggingface/optimum/pull/889
* Fix PushToHubMixin._create_repo according to transformers 4.27 release by fxmarty in https://github.com/huggingface/optimum/pull/892
* Fix stable diffusion framework detection by fxmarty in https://github.com/huggingface/optimum/pull/893
* Add donut CPU inference ORT by mht-sharma in https://github.com/huggingface/optimum/pull/761
* Fix check_model for large merged ONNX models by fxmarty in https://github.com/huggingface/optimum/pull/896
* Drop python 3.7 support by fxmarty in https://github.com/huggingface/optimum/pull/891
* Fix dummy label generator for vision tasks by JingyaHuang in https://github.com/huggingface/optimum/pull/900
* Add stable diffusion dummy object by echarlaix in https://github.com/huggingface/optimum/pull/899
* Automatic support for large ONNX models in ORTOptimizer by fxmarty in https://github.com/huggingface/optimum/pull/886
* Remove subprocess calls in ONNX export by fxmarty in https://github.com/huggingface/optimum/pull/897
* Registering mechanism for the `TasksManager` by michaelbenayoun in https://github.com/huggingface/optimum/pull/898
* add option to run inference with ort by prathikr in https://github.com/huggingface/optimum/pull/838
* Check min diffusers version by echarlaix in https://github.com/huggingface/optimum/pull/902
* Update bug-report.yml by lewtun in https://github.com/huggingface/optimum/pull/895
* Fix axis name for seq2seq ONNX models by fxmarty in https://github.com/huggingface/optimum/pull/904
* Fix GPU tests by fxmarty in https://github.com/huggingface/optimum/pull/909
* Fix misleading error message in ORTOptimizer by fxmarty in https://github.com/huggingface/optimum/pull/910
* Delete all Docker images before building the doc of Optimum by regisss in https://github.com/huggingface/optimum/pull/911
* Fix onnx export preprocessors save by fxmarty in https://github.com/huggingface/optimum/pull/913
* Fix GPU CI by fxmarty in https://github.com/huggingface/optimum/pull/914
## New Contributors
* adit299 made their first contribution in https://github.com/huggingface/optimum/pull/819
* asrimanth made their first contribution in https://github.com/huggingface/optimum/pull/833
* hivaze made their first contribution in https://github.com/huggingface/optimum/pull/852
**Full Changelog**: https://github.com/huggingface/optimum/compare/v1.2.0...v1.7.2