This is the first release of Optimum TPU to support the JetStream PyTorch engine as a backend for Text Generation Inference (TGI).
[JetStream](https://github.com/AI-Hypercomputer/JetStream) is a throughput- and memory-optimized engine for LLM inference on TPUs, and its [PyTorch implementation](https://github.com/AI-Hypercomputer/jetstream-pytorch) allows for seamless integration into the TGI code. The supported models are, for now, Llama 2 and Llama 3, Gemma 1, and Mixtral; serving inference on these models has yielded close to a 10x improvement in tokens/sec compared to the previously used backend (PyTorch XLA/transformers).
On top of that, quantization can be used to serve with even fewer resources while maintaining similar throughput and quality.
Details follow.
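As an illustration, serving with the new backend could look roughly like the sketch below. The image tag, environment variable names (`JETSTREAM_PT`, `QUANTIZATION`), and resource settings are illustrative assumptions, not verbatim from this release; check the optimum-tpu serving documentation for the exact names.

```shell
# Sketch: launch TGI on a TPU VM with the JetStream PyTorch backend.
# Image tag and environment variable names are illustrative assumptions.
docker run -p 8080:80 \
    --shm-size 16G \
    --privileged \
    -v ~/hf_data:/data \
    -e HF_TOKEN=${HF_TOKEN} \
    -e JETSTREAM_PT=1 \
    -e QUANTIZATION=1 \
    ghcr.io/huggingface/optimum-tpu:v0.2.0-tgi \
    --model-id meta-llama/Meta-Llama-3-8B \
    --max-input-length 512 \
    --max-total-tokens 1024
```

Once the server is up, generations can be requested through TGI's standard HTTP endpoints, for example with `curl` or the `huggingface_hub` client.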
## What's Changed
* Update colab examples by wenxindongwork in https://github.com/huggingface/optimum-tpu/pull/86
* ci(docker): update torch-xla to 2.4.0 by tengomucho in https://github.com/huggingface/optimum-tpu/pull/89
* Introduce Jetstream/Pytorch in TGI by tengomucho in https://github.com/huggingface/optimum-tpu/pull/88
* Llama3 on TGI - Jetstream Pytorch by tengomucho in https://github.com/huggingface/optimum-tpu/pull/90
* Update Jetstream Pytorch revision by tengomucho in https://github.com/huggingface/optimum-tpu/pull/91
* Correct extra token, start preparing docker image for TGI/Jetstream Pt by tengomucho in https://github.com/huggingface/optimum-tpu/pull/93
* Fix generation using Jetstream Pytorch by tengomucho in https://github.com/huggingface/optimum-tpu/pull/94
* Fix slow tests by tengomucho in https://github.com/huggingface/optimum-tpu/pull/95
* Cleanup and fixes for TGI by tengomucho in https://github.com/huggingface/optimum-tpu/pull/96
* Small TGI enhancements by tengomucho in https://github.com/huggingface/optimum-tpu/pull/97
* fix(TGI Jetstream Pt): prefill should be done with max input size by tengomucho in https://github.com/huggingface/optimum-tpu/pull/98
* Gemma on TGI Jetstream Pytorch by tengomucho in https://github.com/huggingface/optimum-tpu/pull/99
* Fix ci nightly jetstream by tengomucho in https://github.com/huggingface/optimum-tpu/pull/101
* CI ephemeral TPUs by tengomucho in https://github.com/huggingface/optimum-tpu/pull/102
* Added Mixtral on TGI / Jetstream Pytorch by tengomucho in https://github.com/huggingface/optimum-tpu/pull/103
* Add CLI to install dependencies by tengomucho in https://github.com/huggingface/optimum-tpu/pull/104
* CI: mount hub cache and fix issues with cli by tengomucho in https://github.com/huggingface/optimum-tpu/pull/106
* fix(docker): correct jetstream installation in TGI docker image by tengomucho in https://github.com/huggingface/optimum-tpu/pull/107
* docs: Add training guide and improve documentation consistency by baptistecolle in https://github.com/huggingface/optimum-tpu/pull/110
* Quantization Jetstream Pytorch by tengomucho in https://github.com/huggingface/optimum-tpu/pull/111
* fix: graceful shutdown was not working with entrypoint, exec launcher by co42 in https://github.com/huggingface/optimum-tpu/pull/112
* fix(doc): correct link to deploy page by tengomucho in https://github.com/huggingface/optimum-tpu/pull/115
* More Jetstream Pytorch fixes, prepare for release by tengomucho in https://github.com/huggingface/optimum-tpu/pull/116
## New Contributors
* wenxindongwork made their first contribution in https://github.com/huggingface/optimum-tpu/pull/86
* baptistecolle made their first contribution in https://github.com/huggingface/optimum-tpu/pull/110
* co42 made their first contribution in https://github.com/huggingface/optimum-tpu/pull/112
**Full Changelog**: https://github.com/huggingface/optimum-tpu/compare/v0.1.5...v0.2.0