v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework
## ONNX rework
This version introduces a new package, `transformers.onnx`, which can be used to export models to ONNX. Unlike the previous implementation, this approach is designed as an easily extensible package where users may define their own ONNX configurations for the models they wish to export.
```bash
python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/
Validating ONNX model...
        -[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'})
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 8, 768) matchs (2, 8, 768)
                -[✓] all values close (atol: 0.0001)
        - Validating ONNX Model output "pooler_output":
                -[✓] (2, 768) matchs (2, 768)
                -[✓] all values close (atol: 0.0001)
All good, model saved at: onnx/bert-base-cased/model.onnx
```
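The package is organized around configuration objects that describe how a given architecture maps to an ONNX graph. As a minimal sketch of what a custom configuration could look like (the `inputs`/`outputs` properties mapping tensor names to their dynamic axes are the extension point; check the `transformers.onnx` source for the exact base-class interface, and treat `MyBertOnnxConfig` as a hypothetical name):

```py
from collections import OrderedDict

from transformers.onnx import OnnxConfig

# Hypothetical configuration: declares the tensors the exported graph
# consumes and produces, and which of their axes are dynamic.
class MyBertOnnxConfig(OnnxConfig):
    @property
    def inputs(self):
        return OrderedDict(
            [
                ("input_ids", {0: "batch", 1: "sequence"}),
                ("attention_mask", {0: "batch", 1: "sequence"}),
                ("token_type_ids", {0: "batch", 1: "sequence"}),
            ]
        )

    @property
    def outputs(self):
        return OrderedDict(
            [
                ("last_hidden_state", {0: "batch", 1: "sequence"}),
                ("pooler_output", {0: "batch"}),
            ]
        )
```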
- [RFC] Laying down building stone for more flexible ONNX export capabilities 11786 (mfuntowicz)
## CANINE model
Four new models are released as part of the CANINE implementation: `CanineForSequenceClassification`, `CanineForMultipleChoice`, `CanineForTokenClassification` and `CanineForQuestionAnswering`, in PyTorch.
The CANINE model was proposed in [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. It's among the first papers to train a Transformer without an explicit tokenization step (such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece). Instead, the model is trained directly at the Unicode character level. Training at the character level inevitably yields longer sequences, which CANINE solves with an efficient downsampling strategy before applying a deep Transformer encoder.
- Add CANINE 12024 (NielsRogge)
Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=canine
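Since CANINE operates on characters, the accompanying `CanineTokenizer` simply maps each character to its Unicode code point; no vocabulary merges are involved. A minimal PyTorch sketch using one of the Hub checkpoints above (the sequence-classification head is freshly initialized here and would still need fine-tuning):

```py
import torch

from transformers import CanineForSequenceClassification, CanineTokenizer

tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=2)

# Each character becomes one input id (its Unicode code point).
inputs = tokenizer(["Life is like a box of chocolates."], padding="longest", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
```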
## Tokenizer training
This version introduces a new method, `train_new_from_iterator`, to train a tokenizer from scratch based on the configuration of an existing one.
```py
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

# We train on batches of texts, 1000 at a time here.
batch_size = 1000
corpus = (dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=20000)
```
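The result is a regular fast tokenizer, so it can be saved and reloaded like any other pretrained tokenizer (the directory name below is arbitrary):

```py
new_tokenizer.save_pretrained("my-new-tokenizer")
reloaded_tokenizer = AutoTokenizer.from_pretrained("my-new-tokenizer")
```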
- Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) 12420 (SaulLu)
- Easily train a new fast tokenizer from a given one 12361 (sgugger)
## TensorFlow examples
The `TFTrainer` is now deprecated and replaced by the native Keras API. Version v4.9.0 marks the end of a long rework of the TensorFlow examples, making them more Keras-idiomatic, clearer, and more robust; a minimal training sketch follows the list below.
- NER example for Tensorflow 12469 (Rocketknight1)
- TF summarization example 12617 (Rocketknight1)
- Adding TF translation example 12667 (Rocketknight1)
- Deprecate TFTrainer 12706 (Rocketknight1)
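In Keras terms, the reworked examples reduce to the standard compile/fit loop. A minimal sketch, assuming `tf_train_dataset` is a `tf.data.Dataset` yielding (tokenized features, integer labels) pairs; the example scripts linked above show the full, tested versions:

```py
import tensorflow as tf

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Plain Keras training loop instead of TFTrainer.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(tf_train_dataset, epochs=3)
```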
## TensorFlow implementations
HuBERT is now implemented in TensorFlow:
- Add TFHubertModel 12206 (will-rice)
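A minimal usage sketch; HuBERT shares the `Wav2Vec2FeatureExtractor`, and `facebook/hubert-base-ls960` is the base PyTorch checkpoint on the Hub (hence `from_pt=True` to convert its weights on the fly if no TensorFlow weights are hosted):

```py
import numpy as np

from transformers import TFHubertModel, Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = TFHubertModel.from_pretrained("facebook/hubert-base-ls960", from_pt=True)

speech = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```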
## Breaking changes
When `load_best_model_at_end` was set to `True` in the `TrainingArguments`, a `save_strategy` different from the `eval_strategy` was accepted, but the `save_strategy` was silently overwritten by the `eval_strategy` (keeping track of the best model requires an evaluation each time there is a save). This led to a lot of confusion, with users not understanding why their script was not doing what it was told. This situation now raises an error indicating that `save_strategy` and `eval_strategy` must be set to the same value, and, if that value is `"steps"`, that `save_steps` must be a round multiple of `eval_steps`.
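Concretely, a configuration that satisfies the new check could look like this (note that the argument is spelled `evaluation_strategy` in `TrainingArguments`):

```py
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",  # must match the evaluation strategy
    eval_steps=100,
    save_steps=200,  # must be a round multiple of eval_steps
)
```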
## General improvements and bugfixes
- Update description of TrainingArgs param save_strategy 12328 (sam-qordoba)
- [Deepspeed] new docs 12077 (stas00)
- [ray] try fixing import error 12338 (richardliaw)
- [examples/Flax] move the examples table up 12341 (patil-suraj)
- Fix torchscript tests 12336 (LysandreJik)
- Add flax/jax quickstart 12342 (marcvanzee)
- Fixed a typo in readme 12356 (MichalPitr)
- Fix exception in prediction loop occurring for certain batch sizes 12350 (jglaser)
- Add FlaxBigBird QuestionAnswering script 12233 (vasudevgupta7)
- Replace NotebookProgressReporter by ProgressReporter in Ray Tune run 12357 (krfricke)
- [examples] remove extra white space from log format 12360 (stas00)
- fixed multiplechoice tokenization 12362 (cronoik)
- [trainer] add main_process_first context manager 12351 (stas00)
- [Examples] Replicates the new --log_level feature to all trainer-based pytorch 12359 (bhadreshpsavani)
- [Examples] Update Example Template for `--log_level` feature 12365 (bhadreshpsavani)
- [Examples] Replace `print` statement with `logger.info` in QA example utils 12368 (bhadreshpsavani)
- Onnx export v2 fixes 12388 (LysandreJik)
- [Documentation] Warn that DataCollatorForWholeWordMask is limited to BertTokenizer-like tokenizers 12371 (ionicsolutions)
- Update run_mlm.py 12344 (TahaAslani)
- Add possibility to maintain full copies of files 12312 (sgugger)
- [CI] add dependency table sync verification 12364 (stas00)
- [Examples] Added context manager to datasets map 12367 (bhadreshpsavani)
- [Flax community event] Add more description to readme 12398 (patrickvonplaten)
- Remove the need for `einsum` in Albert's attention computation 12394 (mfuntowicz)
- [Flax] Adapt flax examples to include `push_to_hub` 12391 (patrickvonplaten)
- Tensorflow LM examples 12358 (Rocketknight1)
- [Deepspeed] match the trainer log level 12401 (stas00)
- [Flax] Add T5 pretraining script 12355 (patrickvonplaten)
- [models] respect dtype of the model when instantiating it 12316 (stas00)
- Rename detr targets to labels 12280 (NielsRogge)
- Add out of vocabulary error to ASR models 12288 (will-rice)
- Fix TFWav2Vec2 SpecAugment 12289 (will-rice)
- [example/flax] add summarization readme 12393 (patil-suraj)
- [Flax] Example scripts - correct weight decay 12409 (patrickvonplaten)
- fix ids_to_tokens naming error in tokenizer of deberta v2 12412 (hjptriplebee)
- Minor fixes in original RAG training script 12395 (shamanez)
- Added talks 12415 (suzana-ilic)
- [modelcard] fix 12422 (stas00)
- Add option to save on each training node 12421 (sgugger)
- Added to talks section 12433 (suzana-ilic)
- Fix default bool in argparser 12424 (sgugger)
- Add default bos_token and eos_token for tokenizer of deberta_v2 12429 (hjptriplebee)
- fix typo in mt5 configuration docstring 12432 (fcakyon)
- Add to talks section 12442 (suzana-ilic)
- [JAX/Flax readme] add philosophy doc 12419 (patil-suraj)
- [Flax] Add wav2vec2 12271 (patrickvonplaten)
- Add test for a WordLevel tokenizer model 12437 (SaulLu)
- [Flax community event] How to use hub during training 12447 (patrickvonplaten)
- [Wav2Vec2, Hubert] Fix ctc loss test 12458 (patrickvonplaten)
- Comment fast GPU TF tests 12452 (LysandreJik)
- Fix training_args.py barrier for torch_xla 12464 (jysohn23)
- Added talk details 12465 (suzana-ilic)
- Add TPU README 12463 (patrickvonplaten)
- Import check_inits handling of duplicate definitions. 12467 (Iwontbecreative)
- Validation split added: custom data files sgugger, patil-suraj 12407 (Souvic)
- Fixing bug with param count without embeddings 12461 (TevenLeScao)
- [roberta] fix lm_head.decoder.weight ignore_key handling 12446 (stas00)
- Rework notebooks and move them to the Notebooks repo 12471 (sgugger)
- fixed typo in flax-projects readme 12466 (mplemay)
- Fix TAPAS test uncovered by 12446 12480 (LysandreJik)
- Add guide on how to build demos for the Flax sprint 12468 (osanseviero)
- Add `Repository` import to the FLAX example script 12501 (LysandreJik)
- [examples/flax] clip style image-text training example 12491 (patil-suraj)
- [Flax] Fix wav2vec2 pretrain arguments 12498 (Wikidepia)
- [Flax] ViT training example 12300 (patil-suraj)
- Fix order of state and input in Flax Quickstart README 12510 (navjotts)
- [Flax] Dataset streaming example 12470 (patrickvonplaten)
- [Flax] Correct flax training scripts 12514 (patrickvonplaten)
- [Flax] Correct logging steps flax 12515 (patrickvonplaten)
- [Flax] Fix another bug in logging steps 12516 (patrickvonplaten)
- [Wav2Vec2] Flax - Adapt wav2vec2 script 12520 (patrickvonplaten)
- [Flax] Fix hybrid clip 12519 (patil-suraj)
- [RoFormer] Fix some issues 12397 (JunnYu)
- FlaxGPTNeo 12493 (patil-suraj)
- Updated README 12540 (suzana-ilic)
- Edit readme 12541 (SaulLu)
- implementing tflxmertmodel integration test 12497 (sadakmed)
- [Flax] Adapt examples to be able to use eval_steps and save_steps 12543 (patrickvonplaten)
- [examples/flax] add adafactor optimizer 12544 (patil-suraj)
- [Flax] Add FlaxMBart 12236 (stancld)
- Add a warning for broken ProphetNet fine-tuning 12511 (JetRunner)
- [trainer] add option to ignore keys for the train function too (11719) 12551 (shabie)
- MLM training fails with no validation file(same as 12406 for pytorch now) 12517 (Souvic)
- [Flax] Allow retraining from save checkpoint 12559 (patrickvonplaten)
- Adding prepare_decoder_input_ids_from_labels methods to all TF ConditionalGeneration models 12560 (Rocketknight1)
- Remove tf.roll wherever not needed 12512 (szutenberg)
- Double check for attribute num_examples 12562 (sgugger)
- [examples/hybrid_clip] fix loading clip vision model 12566 (patil-suraj)
- Remove logging of GPU count etc from run_t5_mlm_flax.py 12569 (ibraheem-moosa)
- raise exception when arguments to pipeline are incomplete 12548 (hwijeen)
- Init pickle 12567 (sgugger)
- Fix group_lengths for short datasets 12558 (sgugger)
- Don't stop at num_epochs when using IterableDataset 12561 (sgugger)
- Fixing the pipeline optimization by reindexing targets (V2) 12330 (Narsil)
- Fix MT5 init 12591 (sgugger)
- [model.from_pretrained] raise exception early on failed load 12574 (stas00)
- [doc] fix broken ref 12597 (stas00)
- Add Flax sprint project evaluation section 12592 (osanseviero)
- This will reduce "Already borrowed error": 12550 (Narsil)
- [Flax] Add flax marian 12595 (patrickvonplaten)
- [Flax] Fix cur step flax examples 12608 (patrickvonplaten)
- Simplify unk token 12582 (sgugger)
- Fix arg count for partial functions 12609 (sgugger)
- Pass `model_kwargs` when loading a model in `pipeline()` 12449 (aphedges)
- [Flax] Fix mt5 auto 12612 (patrickvonplaten)
- [Flax Marian] Add marian flax example 12614 (patrickvonplaten)
- [FLax] Fix marian docs 2 12615 (patrickvonplaten)
- [debugging utils] minor doc improvements 12525 (stas00)
- [doc] DP/PP/TP/etc parallelism 12524 (stas00)
- [doc] fix anchor 12620 (stas00)
- [Examples][Flax] added test file in summarization example 12630 (bhadreshpsavani)
- Point to the right file for hybrid CLIP 12599 (edugp)
- [flax]fix jax array type check 12638 (patil-suraj)
- Add tokenizer_file parameter to PreTrainedTokenizerFast docstring 12624 (lewisbails)
- Skip TestMarian_MT_EN 12649 (LysandreJik)
- The extended trainer tests should require torch 12650 (LysandreJik)
- Pickle auto models 12654 (sgugger)
- Pipeline should be agnostic 12656 (LysandreJik)
- Fix transfo xl integration test 12652 (LysandreJik)
- Remove SageMaker documentation 12657 (philschmid)
- Fixed docs 12646 (KickItLikeShika)
- fix typo in modeling_t5.py docstring 12640 (PhilipMay)
- Translate README.md to Simplified Chinese 12596 (JetRunner)
- Fix typo in README_zh-hans.md 12663 (JetRunner)
- Updates timeline for project evaluation 12660 (osanseviero)
- [WIP] Patch BigBird tokenization test 12653 (LysandreJik)
- **encode_plus() shouldn't run for W2V2CTC 12655 (LysandreJik)
- Add ByT5 option to example run_t5_mlm_flax.py 12634 (mapmeld)
- Wrong model is used in example, should be character instead of subword model 12676 (jsteggink)
- [Blenderbot] Fix docs 12227 (patrickvonplaten)
- Add option to load a pretrained model with mismatched shapes 12664 (sgugger)
- Fix minor docstring typos. 12682 (qqaatw)
- [tokenizer.prepare_seq2seq_batch] change deprecation to be easily actionable 12669 (stas00)
- [Flax Generation] Correct inconsistencies PyTorch/Flax 12662 (patrickvonplaten)
- [Deepspeed] adapt multiple models, add zero_to_fp32 tests 12477 (stas00)
- Add timeout to CI. 12684 (LysandreJik)
- Fix Tensorflow Bart-like positional encoding 11897 (JunnYu)
- [Deepspeed] non-native optimizers are mostly ok with zero-offload 12690 (stas00)
- Fix multiple choice doc examples 12679 (sgugger)
- Provide mask_time_indices to `_mask_hidden_states` to avoid double masking 12692 (mfuntowicz)
- Update TF examples README 12703 (Rocketknight1)
- Fix uninitialized variables when `config.mask_feature_prob > 0` 12705 (mfuntowicz)
- Only test the files impacted by changes in the diff 12644 (sgugger)
- flax model parallel training 12590 (patil-suraj)
- [test] split test into 4 sub-tests to avoid timeout 12710 (stas00)
- [trainer] release tmp memory in checkpoint load 12718 (stas00)
- [Flax] Correct shift labels for seq2seq models in Flax 12720 (patrickvonplaten)
- Fix typo in Speech2TextForConditionalGeneration example 12716 (will-rice)
- Init adds its own files as impacted 12709 (sgugger)
- LXMERT integration test typo 12736 (LysandreJik)
- Fix AutoModel tests 12733 (LysandreJik)
- Skip test while the model is not available 12739 (LysandreJik)
- Skip test while the model is not available 12740 (LysandreJik)
- Translate README.md to Traditional Chinese 12701 (qqaatw)
- Fix MBart failing test 12737 (LysandreJik)
- Patch T5 device test 12742 (LysandreJik)
- Fix DETR integration test 12734 (LysandreJik)
- Fix led torchscript 12735 (LysandreJik)
- Remove framework mention 12731 (LysandreJik)
- [doc] parallelism: Which Strategy To Use When 12712 (stas00)
- [doc] performance: batch sizes 12725 (stas00)
- Replace specific tokenizer in log message by AutoTokenizer 12745 (SaulLu)
- [Wav2Vec2] Correctly pad mask indices for PreTraining 12748 (patrickvonplaten)
- [doc] testing: how to trigger a self-push workflow 12724 (stas00)
- add intel-tensorflow-avx512 to the candidates 12751 (zzhou612)
- [flax/model_parallel] fix typos 12757 (patil-suraj)
- Turn on eval mode when exporting to ONNX 12758 (mfuntowicz)
- Preserve `list` type of `additional_special_tokens` in `special_token_map` 12759 (SaulLu)
- [Wav2Vec2] Padded vectors should not allowed to be sampled 12764 (patrickvonplaten)
- Add tokenizers class mismatch detection between `cls` and checkpoint 12619 (europeanplaice)
- Fix push_to_hub docstring and make it appear in doc 12770 (sgugger)
- [ray] Fix `datasets_modules` ImportError with Ray Tune 12749 (Yard1)
- Longer timeout for slow tests 12779 (LysandreJik)
- Enforce eval and save strategies are compatible when --load_best_model_at_end 12786 (sgugger)
- [CIs] add troubleshooting docs 12791 (stas00)
- Fix Padded Batch Error 12282 12487 (will-rice)
- Flax MLM: Allow validation split when loading dataset from local file 12689 (fgaim)
- [Longformer] Correct longformer docs 12809 (patrickvonplaten)
- [CLIP/docs] add and fix examples 12810 (patil-suraj)
- [trainer] sanity checks for `save_steps=0|None` and `logging_steps=0` 12796 (stas00)
- Expose get_config() on ModelTesters 12812 (LysandreJik)
- Refactor slow sentencepiece tokenizers. 11716 (PhilipMay)
- Refer warmup_ratio when setting warmup_num_steps. 12818 (tsuchm)
- Add versioning system to fast tokenizer files 12713 (sgugger)
- Add _CHECKPOINT_FOR_DOC to all models 12811 (LysandreJik)