We're excited to announce Petals 2.0.0, the largest Petals release to date!
## Highlights
**Support for LLaMA and LLaMA 2.** We've added support for **inference and fine-tuning** of any model based on 🤗 Transformers [`LlamaModel`](https://huggingface.co/docs/transformers/main/model_doc/llama), including all variants of [LLaMA](https://github.com/facebookresearch/llama/blob/llama_v1/MODEL_CARD.md) and [LLaMA 2](https://ai.meta.com/llama/), one of the strongest open-source models available today. The public swarm hosts the largest variants of these models, LLaMA-65B and LLaMA 2 (70B and 70B-Chat), serving inference at **up to 5-6 tokens/sec**.
- You can try them in the **[chatbot web app](https://chat.petals.dev)** or in **[our Colab tutorial](https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing)**.
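For reference, here is a minimal client-side sketch. The repo name is a placeholder: use any LLaMA-family checkpoint currently hosted by the swarm (see https://health.petals.dev), and note that gated repos such as LLaMA 2 also require passing your 🤗 access token.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Placeholder repo: pick a model listed as hosted at https://health.petals.dev.
# Gated repos (e.g., meta-llama/Llama-2-70b-chat-hf) also need token="hf_..." below.
model_name = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Embeddings run locally; the transformer blocks are executed by servers in the swarm
inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))
```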
**4-bit quantization.** We've integrated efficient 4-bit (NF4) quantization from the recent ["QLoRA: Efficient Finetuning of Quantized LLMs"](https://arxiv.org/abs/2305.14314) paper. Compared to the 8-bit quantization we used previously, this takes **~40% less GPU memory** (and thus **~40% fewer servers**) to fit all model blocks and gives a **~2x speedup** for token-by-token inference, with a relatively small quality loss.
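As a rough back-of-the-envelope illustration only (not Petals' exact memory accounting): halving the precision of block weights roughly halves their footprint, while attention caches and a few tensors kept in 16-bit explain why the end-to-end saving is closer to ~40%.

```python
# Illustrative numbers for a 65B-parameter model; real servers also hold attention
# caches and keep some tensors (e.g., layer norms) in 16-bit precision.
n_params = 65e9
print(f"8-bit weights:       ~{n_params * 1.0 / 2**30:.0f} GiB")  # ~61 GiB
print(f"4-bit (NF4) weights: ~{n_params * 0.5 / 2**30:.0f} GiB")  # ~30 GiB
```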
**Pre-loading LoRA adapters, such as Guanaco.** Servers can now pre-load LoRA adapters compatible with the 🤗 [PEFT](https://github.com/huggingface/peft) library, which can add extra functionality to the model you host. You can do this using the `--adapters` argument on the server (e.g., `--adapters repo1/adapter1 repo2/adapter2`). These adapters are activated at a client's request: specifically, the client may specify `.from_pretrained(..., active_adapter="repo1/adapter1")` when loading a distributed model. One example is [Guanaco](https://huggingface.co/timdettmers/guanaco-65b), an **instruction-finetuned adapter** for LLaMA that turns it into a helpful chatbot carefully following the user's instructions. You can try LLaMA with this adapter in our [chatbot](https://chat.petals.dev) app.
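A minimal client-side sketch of requesting an adapter (assuming the base model repo below is hosted by the swarm and the servers you hit have pre-loaded this adapter via `--adapters`):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "huggyllama/llama-65b"        # placeholder base model repo
adapter_name = "timdettmers/guanaco-65b"   # must be pre-loaded by servers via --adapters

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, active_adapter=adapter_name)

prompt = "### Human: What is Petals?\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
```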
**Direct server-to-server communication.** Previously, servers didn't send tensors to each other directly due to specifics of our fault-tolerant inference algorithm. This update changes that, which saves round-trip time between servers and a client and leads to substantial speedups for clients located far away from the servers they're using.
**Shortest-path routing for inference.** Previously, the client didn't properly choose geographically close and fast servers, so it could end up with a slow inference chain, especially if the swarm had many servers located far away from it. Now, the client builds a full graph of client-server and server-server latencies, as well as server inference speeds, to find the **fastest chain** of servers for inference among all possible ones. It also considers the amount of GPU memory left for attention caches, so it doesn't pick a nearby server that doesn't actually have memory for the request.
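To illustrate the idea (this is a toy sketch, not Petals' actual routing code): given measured latencies between the client and candidate servers holding consecutive block ranges, the client effectively solves a shortest-path problem to pick the fastest chain.

```python
import heapq

def fastest_chain(latency, start="client", goal="client_back"):
    """Dijkstra over a directed latency graph (toy illustration)."""
    dist, prev, heap = {start: 0.0}, {}, [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, ping in latency.get(node, {}).items():
            if d + ping < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = d + ping, node
                heapq.heappush(heap, (d + ping, nxt))
    chain, node = [], goal
    while node != start:
        chain.append(node)
        node = prev[node]
    return list(reversed(chain))[:-1], dist[goal]

# Hypothetical swarm: two candidate servers for blocks 0-39, two for blocks 40-79
latency = {
    "client":   {"A[0:40]": 0.08, "B[0:40]": 0.25},
    "A[0:40]":  {"C[40:80]": 0.05, "D[40:80]": 0.30},
    "B[0:40]":  {"C[40:80]": 0.02, "D[40:80]": 0.02},
    "C[40:80]": {"client_back": 0.08},
    "D[40:80]": {"client_back": 0.30},
}
chain, total = fastest_chain(latency)
print(chain, f"~{total:.2f} s")  # ['A[0:40]', 'C[40:80]'] ~0.21 s
```

The real implementation also weighs per-server inference throughput and free attention-cache memory, not just latencies.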
**Loading models directly from the 🤗 Model Hub and `Auto` classes.** Starting with Petals 2.0.0, models **do not need to be converted** to a special format to be hosted by Petals. Instead, both clients and servers can load models directly from the 🤗 [Model Hub](https://huggingface.co/models), fetching only the shards they need for their part of the model. Furthermore, you can write code supporting multiple architectures at once using `Auto` classes, such as `AutoDistributedConfig.from_pretrained(...)` and `AutoDistributedModelForCausalLM.from_pretrained(...)`. The [guide](https://github.com/bigscience-workshop/petals/wiki/Run-a-custom-model-with-Petals) for adding new model architectures to Petals has also become much simpler, since the Petals code is now generalized across architectures and the model conversion step is gone.
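For instance, a short sketch of the `Auto` classes in action (the repo names are just examples of supported architectures):

```python
from petals import AutoDistributedConfig

# The same code path works for different supported architectures (e.g., BLOOM and LLaMA)
for repo in ["bigscience/bloom", "huggyllama/llama-65b"]:
    config = AutoDistributedConfig.from_pretrained(repo)
    print(f"{repo}: model_type={config.model_type}, {config.num_hidden_layers} blocks")
```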
**Fine-tuning examples.** We've switched most examples to LLaMA-65B and fixed previously reported bugs. In particular, the ["Getting started"](https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing) notebook now includes a simple example of deep prompt tuning on a dummy task, and the [sequence classification notebook](https://colab.research.google.com/github/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb) now uses LLaMA-65B and improved hyperparameters for stable training.
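A condensed sketch of what deep prompt tuning looks like with Petals (the repo name is a placeholder, and the `tuning_mode="deep_ptune"` / `pre_seq_len` arguments follow the ones used in the notebooks; see the notebooks for a complete, tested walkthrough):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "huggyllama/llama-65b"  # placeholder: use a model hosted by the swarm

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Trainable prompt embeddings live on your device; the frozen blocks stay in the swarm
model = AutoDistributedModelForCausalLM.from_pretrained(
    model_name, tuning_mode="deep_ptune", pre_seq_len=16
)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

batch = tokenizer("This is a dummy training sentence.", return_tensors="pt")["input_ids"]
logits = model(batch).logits
# Standard next-token prediction loss on the returned logits
loss = F.cross_entropy(logits[:, :-1].flatten(0, 1), batch[:, 1:].flatten())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```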
**Upgraded swarm monitor.** The [swarm monitor](https://health.petals.dev) now shows much more info about each server, including pre-loaded LoRA adapters, detailed performance numbers, latencies to potential next servers, and so on. All of this info is published to the DHT, so you don't need to ping each server to fetch it. We've also added a "Contributor" column, so contributors hosting 10+ blocks get a chance to **publish their name and advertise their company or a social media account** in exchange for hosting a server for Petals. The name (or link) shown there can be set with the server's `--public_name` argument.
## What's Changed
* Remove unused imports and attributes by mryab in https://github.com/bigscience-workshop/petals/pull/324
* Determine block dtype in a unified manner by mryab in https://github.com/bigscience-workshop/petals/pull/325
* Use number of tokens for attn_cache_size by mryab in https://github.com/bigscience-workshop/petals/pull/286
* Add LLaMA support by borzunov in https://github.com/bigscience-workshop/petals/pull/323
* Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} by borzunov in https://github.com/bigscience-workshop/petals/pull/329
* Fix llama's lm_head.weight.requires_grad by borzunov in https://github.com/bigscience-workshop/petals/pull/330
* Show license links when loading models by borzunov in https://github.com/bigscience-workshop/petals/pull/332
* Add benchmark scripts by borzunov in https://github.com/bigscience-workshop/petals/pull/319
* Fix warmup steps and minor issues in benchmarks by borzunov in https://github.com/bigscience-workshop/petals/pull/334
* Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) by borzunov in https://github.com/bigscience-workshop/petals/pull/337
* Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) by borzunov in https://github.com/bigscience-workshop/petals/pull/333
* Allow free_disk_space_for() remove arbitrary files from Petals cache by borzunov in https://github.com/bigscience-workshop/petals/pull/339
* Implement direct server-to-server communication by borzunov in https://github.com/bigscience-workshop/petals/pull/331
* Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 by borzunov in https://github.com/bigscience-workshop/petals/pull/340
* Delete deprecated petals.cli scripts by borzunov in https://github.com/bigscience-workshop/petals/pull/336
* Use bitsandbytes 0.40.0.post4 with bias hotfix by borzunov in https://github.com/bigscience-workshop/petals/pull/342
* Support peft LoRA adapters by artek0chumak in https://github.com/bigscience-workshop/petals/pull/335
* Fix convergence issues and switch to LLaMA in the SST-2 example by mryab in https://github.com/bigscience-workshop/petals/pull/343
* Mention LLaMA in readme by borzunov in https://github.com/bigscience-workshop/petals/pull/344
* Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes by borzunov in https://github.com/bigscience-workshop/petals/pull/345
* Fix Docker build by avoiding Python 3.11 by borzunov in https://github.com/bigscience-workshop/petals/pull/348
* Support LLaMA repos without "-hf" suffix by borzunov in https://github.com/bigscience-workshop/petals/pull/349
* Estimate adapter memory overhead in choose_num_blocks() by justheuristic in https://github.com/bigscience-workshop/petals/pull/346
* Spam less in server logs by borzunov in https://github.com/bigscience-workshop/petals/pull/350
* Remove unused import os by justheuristic in https://github.com/bigscience-workshop/petals/pull/352
* Test that bitsandbytes is not imported when it's not used by borzunov in https://github.com/bigscience-workshop/petals/pull/351
* Fix bugs in _choose_num_blocks() added in 346 by borzunov in https://github.com/bigscience-workshop/petals/pull/354
* Switch adapters slightly faster by justheuristic in https://github.com/bigscience-workshop/petals/pull/353
* Share more info about a server in DHT by borzunov in https://github.com/bigscience-workshop/petals/pull/355
* Make a server ping next servers by borzunov in https://github.com/bigscience-workshop/petals/pull/356
* Use bitsandbytes 0.40.1.post1 by borzunov in https://github.com/bigscience-workshop/petals/pull/357
* Update readme and "Getting started" link by borzunov in https://github.com/bigscience-workshop/petals/pull/360
* Report inference, forward, and network RPS separately by borzunov in https://github.com/bigscience-workshop/petals/pull/358
* Fix typo in generation_algorithms.py by eltociear in https://github.com/bigscience-workshop/petals/pull/364
* Implement shortest-path routing for inference by borzunov in https://github.com/bigscience-workshop/petals/pull/362
* Update readme to show new models by borzunov in https://github.com/bigscience-workshop/petals/pull/365
* Require transformers < 4.31.0 until we're compatible by borzunov in https://github.com/bigscience-workshop/petals/pull/369
* Fix AssertionError on rebalancing by borzunov in https://github.com/bigscience-workshop/petals/pull/370
* Update transformers to 4.31.0 and peft to 0.4.0 by borzunov in https://github.com/bigscience-workshop/petals/pull/371
* Fix readme code example, require Python < 3.11 until supported by borzunov in https://github.com/bigscience-workshop/petals/pull/374
* Fix handler memory leak, get rid of mp.Manager by justheuristic in https://github.com/bigscience-workshop/petals/pull/373
* Inherit bitsandbytes compute dtype correctly (override peft quirk) by justheuristic in https://github.com/bigscience-workshop/petals/pull/377
* Fix --token arg by borzunov in https://github.com/bigscience-workshop/petals/pull/378
* Support Llama 2 by borzunov in https://github.com/bigscience-workshop/petals/pull/379
* Require accelerate>=0.20.3 as transformers do by borzunov in https://github.com/bigscience-workshop/petals/pull/383
* Bump version to 2.0.0.post1 by borzunov in https://github.com/bigscience-workshop/petals/pull/384
## New Contributors
* eltociear made their first contribution in https://github.com/bigscience-workshop/petals/pull/364
**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v1.1.5...v2.0.0.post1