* Python versions: 3.9, 3.10, 3.11
* Operating system: Linux
* PyTorch: 2.4.0.dev20240429
* TensorFlow: 2.17.0.dev20240509
See [this section](https://github.com/google-ai-edge/ai-edge-torch/tree/v0.1.1?tab=readme-ov-file#installation) of the README for installation instructions.
## PyTorch Converter (Beta)
### Functionality
First release of a direct path from PyTorch to the TFLite runtime ([blog post](https://developers.googleblog.com/en/ai-edge-torch-high-performance-inference-of-pytorch-models-on-mobile-devices/)).
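A minimal conversion sketch (the torchvision model below is just an illustrative choice; the `convert` and `export` calls follow the converter README linked above):

```python
import ai_edge_torch
import torch
import torchvision

# Any torch.nn.Module in inference mode can be handed to the converter.
model = torchvision.models.resnet18().eval()

# Sample inputs matching the model's forward() signature.
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert to a TFLite-backed edge model, validate it in Python, then
# serialize the TFLite flatbuffer for on-device deployment.
edge_model = ai_edge_torch.convert(model, sample_inputs)
output = edge_model(*sample_inputs)
edge_model.export("resnet18.tflite")
```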
### Coverage
* Verified successful conversion of PyTorch to TFLite on a Beta test set of 72 PyTorch models readily available from [torchvision](https://pytorch.org/vision/0.9/models.html), [torchaudio](https://pytorch.org/audio/stable/models.html), [timm](https://github.com/huggingface/pytorch-image-models?tab=readme-ov-file#models), [HuggingFace transformers](https://github.com/huggingface/transformers/), and open-source GitHub repositories (such as [Yolox](https://github.com/Megvii-BaseDetection/YOLOX/tree/main), [U2Net](https://github.com/xuebinqin/U-2-Net/tree/master), and [IS-Net](https://github.com/xuebinqin/DIS)), spanning computer vision, text, audio, and speech applications.
### Performance
* Excellent CPU performance for the converted models, leveraging the TFLite XNNPACK delegate.
* A subset of the Beta test set can be fully delegated to GPU, while the remaining models are partially delegated or unsupported.
* QNN delegate ([available here](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.22.0.240425.zip)) supports most models in the Beta test set with significant average acceleration relative to CPU (20X) and GPU (5X) using Qualcomm’s DSP and neural processing units.
### Quantization
* Support for dynamic quantization with [PT2E](https://pytorch.org/tutorials/prototype/quantization_in_pytorch_2_0_export_tutorial.html).
* Support for [post-training quantization](https://www.tensorflow.org/lite/performance/post_training_quantization) via the TFLite converter.
* The AI Edge Torch Converter APIs for both quantization frameworks are documented [here](https://github.com/google-ai-edge/ai-edge-torch/blob/v0.1.1/docs/pytorch_converter/README.md#quantization); a sketch of the PT2E dynamic-quantization path follows this list.
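A sketch of the PT2E dynamic-quantization path, based on the converter quantization docs linked above; the module paths (`ai_edge_torch.quantize.pt2e_quantizer`, `torch._export.capture_pre_autograd_graph`) reflect this release and may change in later versions:

```python
import ai_edge_torch
import torch
import torchvision
from ai_edge_torch.quantize.pt2e_quantizer import PT2EQuantizer
from ai_edge_torch.quantize.pt2e_quantizer import get_symmetric_quantization_config
from ai_edge_torch.quantize.quant_config import QuantConfig
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

model = torchvision.models.resnet18().eval()
sample_args = (torch.randn(1, 3, 224, 224),)

# Configure a PT2E quantizer for dynamic, per-channel symmetric quantization.
quantizer = PT2EQuantizer().set_global(
    get_symmetric_quantization_config(is_per_channel=True, is_dynamic=True)
)

# Export, prepare, calibrate, and convert the model with PT2E.
pt2e_model = capture_pre_autograd_graph(model, sample_args)
pt2e_model = prepare_pt2e(pt2e_model, quantizer)
pt2e_model(*sample_args)  # run once so observers record ranges
pt2e_model = convert_pt2e(pt2e_model, fold_quantize=False)

# Pass the quantizer to the AI Edge Torch converter so the TFLite
# flatbuffer is produced with the matching quantization scheme.
edge_model = ai_edge_torch.convert(
    pt2e_model, sample_args, quant_config=QuantConfig(pt2e_quantizer=quantizer)
)
edge_model.export("resnet18_int8_dynamic.tflite")
```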
### Known Issues
* Inference latency with quantized models is higher than with unquantized models in some cases.
## Generative API (Alpha)
### Functionality
* Provides PyTorch-native [building blocks](https://github.com/google-ai-edge/ai-edge-torch/tree/v0.1.1/ai_edge_torch/generative/layers) to compose LLMs using mobile-friendly abstractions for performant execution on the TFLite runtime.
* [Examples](https://github.com/google-ai-edge/ai-edge-torch/tree/v0.1.1/ai_edge_torch/generative/examples) of authoring Gemma, TinyLlama, and Phi-2 via the Edge Generative API and converting them to TFLite.
* Support for 8-bit dynamic-range quantization ([documentation](https://github.com/google-ai-edge/ai-edge-torch/tree/v0.1.1/ai_edge_torch/generative/quantize)); see the sketch after this list.
* Integration with the [MediaPipe LLM Inference API](https://github.com/google-ai-edge/ai-edge-torch/tree/v0.1.1/ai_edge_torch/generative#use-mediapipe-llm-inference-api) for easy integration into mobile apps, with a prompt interface.
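A hedged sketch of converting an authored LLM with 8-bit dynamic-range quantization. The toy model below is only a stand-in for an LLM built from the generative building blocks (a real model would come from the linked Gemma/TinyLlama/Phi-2 examples), and the recipe name is an assumption based on the linked quantize docs and may differ across releases:

```python
import ai_edge_torch
import torch
import torch.nn as nn
from ai_edge_torch.generative.quantize import quant_recipes

# Hypothetical stand-in for an LLM authored with the generative building
# blocks; it only mirrors the (tokens, input_pos) calling convention used
# by the linked examples.
class ToyLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens, input_pos):
        del input_pos  # the toy model ignores KV-cache positions
        return self.proj(self.embed(tokens))

model = ToyLM().eval()
tokens = torch.zeros((1, 64), dtype=torch.long)
input_pos = torch.arange(0, 64)

# 8-bit dynamic-range quantization recipe (recipe name assumed from the
# generative quantize docs linked above).
quant_config = quant_recipes.full_linear_int8_dynamic_recipe()

edge_model = ai_edge_torch.convert(
    model, (tokens, input_pos), quant_config=quant_config
)
edge_model.export("toy_lm_int8_dynamic.tflite")
```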
### Known Issues
* The conversion and serialization process is not yet optimized for LLMs; it requires keeping multiple copies of the weights in memory for transformations and for serialization/deserialization.
* Runtime execution of LLMs in TFLite is missing some memory optimizations and is inefficient during memory unpacking on XNNPACK.