Compilation caching system
Compiling a model before training it can be a real bottleneck: on small datasets, for example, compilation can take longer than training itself. To mitigate this, we introduce a caching system connected directly to the Hugging Face Hub.
Before starting compilation, the `TrainiumTrainer` checks whether the needed compilation files are on the Hub and, if so, fetches them, so users do not have to do it themselves.
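The check-then-fetch behavior described above can be sketched roughly as follows. This is only an illustration, not the `TrainiumTrainer` implementation: the function name, the file names, and the two callables are hypothetical stand-ins for Hub access (with the `huggingface_hub` library they could plausibly be backed by `list_repo_files` and `hf_hub_download`), and the sketch assumes the cache is reused only when every needed file is present.

```python
from typing import Callable, Iterable, List


def fetch_cached_files(
    needed: Iterable[str],
    list_remote: Callable[[], Iterable[str]],
    download: Callable[[str], str],
) -> List[str]:
    """Return local paths of the cached files, or [] if any file is missing.

    Hypothetical helper: not part of any real library.
    """
    needed = list(needed)
    available = set(list_remote())
    # Assumption: reuse the cache only when every needed file is available.
    if not all(name in available for name in needed):
        return []
    return [download(name) for name in needed]


# Offline stand-ins for the Hub so the sketch runs without network access.
remote_files = {"model.neff", "graph.pb"}  # hypothetical file names
fake_download = lambda name: f"/tmp/cache/{name}"

hit = fetch_cached_files(["model.neff"], lambda: remote_files, fake_download)
miss = fetch_cached_files(["other.neff"], lambda: remote_files, fake_download)
```

On a miss the caller would fall back to compiling as usual.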
Custom cache repo
Since each user might want their own cache repo, either to push to it or to keep it private, a custom repo can be specified via the `CUSTOM_CACHE_REPO` environment variable:
bash
CUSTOM_CACHE_REPO=michaelbenayoun/cache_test python train.py
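On the Python side, resolving the repo from the environment might look like the sketch below. The fallback repo name and the `resolve_cache_repo` helper are hypothetical; only the `CUSTOM_CACHE_REPO` variable name comes from the text above.

```python
import os

# Hypothetical fallback; the actual default repo name is not stated here.
DEFAULT_CACHE_REPO = "some-org/default-cache"


def resolve_cache_repo(env=None):
    """Return the custom cache repo if set, else the (assumed) default."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default.
    return env.get("CUSTOM_CACHE_REPO") or DEFAULT_CACHE_REPO


repo = resolve_cache_repo({"CUSTOM_CACHE_REPO": "michaelbenayoun/cache_test"})
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.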
Neuron export
PyTorch models can be exported to a serialized TorchScript module compiled by the Neuron compiler ([`neuron-cc`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuron-cc/command-line-reference.html#neuron-compiler-cli-reference) or [`neuronx-cc`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/api-reference-guide/neuron-compiler-cli-reference-guide.html#neuron-compiler-cli-reference-guide)), which can then be used on AWS [INF2](https://aws.amazon.com/ec2/instance-types/inf2/) or [INF1](https://aws.amazon.com/ec2/instance-types/inf1/) instances.
Example: Export the BERT model with static shapes:
optimum-cli export neuron --help
optimum-cli export neuron --model bert-base-uncased --sequence_length 128 --batch_size 16 bert_neuron/
By default, on INF2, `matmul` operations are cast from `fp32` to `bf16`, while on INF1 all operations are cast to `bf16`. Use `--auto_cast` to choose which operations are auto-cast and `--auto_cast_type` to set the data type they are cast to.
Example: Auto-cast __all__ operations (*this option can potentially lower precision/accuracy*) to the `fp16` data type:
optimum-cli export neuron --model bert-base-uncased --auto_cast all --auto_cast_type fp16 bert_neuron/
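When the export is driven from a script, the command lines above can be assembled programmatically. The sketch below uses only the flags shown in this section; `build_export_command` is a hypothetical helper, not part of `optimum-cli`, and it does not validate flag values.

```python
import shlex
from typing import List, Optional


def build_export_command(
    model: str,
    output_dir: str,
    sequence_length: Optional[int] = None,
    batch_size: Optional[int] = None,
    auto_cast: Optional[str] = None,
    auto_cast_type: Optional[str] = None,
) -> List[str]:
    """Build an `optimum-cli export neuron` argument list (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "neuron", "--model", model]
    optional_flags = [
        ("--sequence_length", sequence_length),
        ("--batch_size", batch_size),
        ("--auto_cast", auto_cast),
        ("--auto_cast_type", auto_cast_type),
    ]
    # Only emit the flags that were explicitly set.
    for flag, value in optional_flags:
        if value is not None:
            cmd += [flag, str(value)]
    # The output directory is the final positional argument.
    cmd.append(output_dir)
    return cmd


cmd = build_export_command(
    "bert-base-uncased", "bert_neuron/", auto_cast="all", auto_cast_type="fp16"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `optimum-cli` and the Neuron compiler are installed.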