Petals

Latest version: v2.2.0.post1


2.2.0

Highlights

🦅 **Falcon support.** Petals now supports all models based on [Falcon](https://huggingface.co/blog/falcon), including [Falcon 180B](https://huggingface.co/tiiuae/falcon-180B) released today. We improved the 🤗 Transformers `FalconModel` implementation to be up to 40% faster on recent GPUs. Our [chatbot app](http://chat.petals.dev) runs Falcon 180B-Chat at ~2 tokens/sec.

Falcon-40B is licensed under Apache 2.0, so you can load it by specifying `tiiuae/falcon-40b` or `tiiuae/falcon-40b-instruct` as the model name. Falcon-180B is released under a custom license, and it is not yet clear whether we can provide a Python interface for inference and fine-tuning of this model. For now, it is only available in the chatbot app, and we are waiting for further clarification from TII on this issue.
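
For reference, here is a minimal client-side sketch of loading Falcon-40B over the public swarm, assuming the usual Petals workflow with `AutoDistributedModelForCausalLM` (swap in `tiiuae/falcon-40b-instruct` for the instruct variant):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "tiiuae/falcon-40b"  # or "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run distributed inference through servers in the public swarm
inputs = tokenizer("A quick question about falcons:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```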

๐Ÿ **Native macOS support.** You can run Petals clients and servers on macOS natively - just install [Homebrew](https://brew.sh/) and run these commands:

```bash
brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2
```


If your computer has an Apple M1/M2 chip, the Petals server will use the integrated GPU automatically. We recommend hosting only Llama-based models for now, since other supported architectures do not yet work efficiently on M1/M2 chips. We also recommend using Python 3.10+ on macOS (Homebrew installs it automatically).

🔌 **Serving custom models.** Custom models now automatically show up at https://health.petals.dev as "not officially supported" models. As a reminder, you are not limited to the models listed at https://health.petals.dev: you can run a server hosting any model based on the BLOOM, Llama, or Falcon architecture (provided the model license allows it), or even add support for a [new architecture](https://github.com/bigscience-workshop/petals/wiki/Run-a-custom-model-with-Petals) yourself. This release also improves Petals compatibility with some popular Llama-based models (e.g., models from [NousResearch](https://huggingface.co/NousResearch)).
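
For example, hosting and then using a custom Llama-based model boils down to a pair of one-liners. This is a rough sketch with a placeholder repo name (`your-org/your-llama-finetune` is hypothetical):

```python
from petals import AutoDistributedModelForCausalLM

# Server side (shell): python -m petals.cli.run_server your-org/your-llama-finetune
# Client side: once at least one server hosts the blocks, load the model by repo name.
model = AutoDistributedModelForCausalLM.from_pretrained("your-org/your-llama-finetune")
```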

๐Ÿž **Bug fixes.** This release also fixes inference of prefix-tuned models, which was broken in Petals 2.1.0.

What's Changed
* Require transformers>=4.32.0 by borzunov in https://github.com/bigscience-workshop/petals/pull/479
* Fix requiring transformers>=4.32.0 by borzunov in https://github.com/bigscience-workshop/petals/pull/480
* Rewrite MemoryCache alloc_timeout logic by justheuristic in https://github.com/bigscience-workshop/petals/pull/434
* Refactor readme by borzunov in https://github.com/bigscience-workshop/petals/pull/482
* Support macOS natively by borzunov in https://github.com/bigscience-workshop/petals/pull/477
* Remove no-op process in PrioritizedTaskPool by borzunov in https://github.com/bigscience-workshop/petals/pull/484
* Fix `.generate(input_ids=...)` by borzunov in https://github.com/bigscience-workshop/petals/pull/485
* Wait for DHT storing state OFFLINE on shutdown by borzunov in https://github.com/bigscience-workshop/petals/pull/486
* Fix race condition in MemoryCache by borzunov in https://github.com/bigscience-workshop/petals/pull/487
* Replace dots in repo names when building DHT prefixes by borzunov in https://github.com/bigscience-workshop/petals/pull/489
* Create model index in DHT by borzunov in https://github.com/bigscience-workshop/petals/pull/491
* Force use_cache=True by borzunov in https://github.com/bigscience-workshop/petals/pull/496
* Force use_cache=True in config only by borzunov in https://github.com/bigscience-workshop/petals/pull/497
* Add Falcon support by borzunov in https://github.com/bigscience-workshop/petals/pull/499
* Fix prompt tuning after 464 by borzunov in https://github.com/bigscience-workshop/petals/pull/501
* Optimize the Falcon block for inference by mryab in https://github.com/bigscience-workshop/petals/pull/500


**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v2.1.0...v2.2.0

2.1.0

Highlights

🔌 **Compatibility with 🤗 Transformers generation utils.** Petals models now directly use 🤗 Transformers **[.generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate)** implementation instead of custom generation code. This means that you can use a variety of generation methods and constraints implemented in 🤗 Transformers (e.g., `repetition_penalty`, beam search, etc.) and expect an exact match between Petals and a model running locally.

Most common methods are compatible with reusing inference sessions, so you can run `.generate()` multiple times without reprocessing the dialogue history from scratch:

```python
with model.inference_session(max_length=100):
    outputs1 = model.generate(user_prompt1, repetition_penalty=1.2)
    outputs2 = model.generate(user_prompt2, repetition_penalty=1.2)
```


⚡ **Faster loading of Stable Beluga 2.** We repacked [Stable Beluga 2](https://huggingface.co/petals-team/StableBeluga2), the most popular model at the moment, to increase its loading speed and minimize RAM and disk space requirements. The repacked version can be loaded from the `petals-team/StableBeluga2` repository and is fully compatible with clients and servers using the standard repository (`stabilityai/StableBeluga2`).

Now, clients **need to download only 1.05 GB of data** to run Stable Beluga 2 (instead of ~20 GB needed before) and require only 4 GB of RAM (instead of ~20 GB required before). Servers need to download and store **2x less data** and load the model from disk significantly faster. If you're switching from the old repository, don't forget to remove the old cache in the `~/.cache/petals/models--stabilityai--StableBeluga2` directory to save disk space.

โฑ๏ธ **More responsive inference.** In older versions, servers could become unresponsive for a few seconds while processing large prefixes (thousands of tokens) on inference. This release allows to perform small inference requests (a few tokens) in the middle of processing a large request, thus avoiding freezes during token-by-token inference caused by someone processing a large prefix.

🔒 **Minor improvements.** This release adds support for loading weights in the [safetensors](https://github.com/huggingface/safetensors) format on servers and adds the `blocked_servers` client option to avoid a given set of servers:

```python
from petals import AutoDistributedModelForCausalLM

blocked_servers = ["12D3KooWA6g...", "12D3KooWGyD..."]  # Full peer IDs from https://health.petals.dev
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, blocked_servers=blocked_servers)
```


๐Ÿž **Bug fixes.** This release also includes a variety of bug fixes allowing to speed up the [chatbot app](https://chat.petals.dev) and fine-tuning, better bypass recently disconnect servers, improve rebalancing algorithm and usability of benchmarks, fix throughput measurements and installation on ARM CPUs.

We also fixed Petals compatibility with the latest releases of 🤗 [Transformers](https://github.com/huggingface/transformers), [Accelerate](https://github.com/huggingface/accelerate), and [PEFT](https://github.com/huggingface/peft) libraries.

Breaking changes

📖 **Default inference sessions.** If you run `.generate()` or forward passes **inside** an `.inference_session()` context, they now **use the opened session by default**. These snippets are now equivalent:

```python
# Using default session
with model.inference_session(max_length=100):
    output_ids = model.generate(input_ids, max_new_tokens=3)

# Explicitly specifying a session
with model.inference_session(max_length=100) as sess:
    output_ids = model.generate(input_ids, max_new_tokens=3, session=sess)
```


Earlier, the first snippet created a new session, which confused many people and led to bugs.

โžก๏ธ **Renaming.** We renamed `SequenceManagerConfig` to **[petals.ClientConfig](https://github.com/bigscience-workshop/petals/blob/main/src/petals/client/config.py#L10)** and `petals.dht_utils` to **[petals.utils.dht](https://github.com/bigscience-workshop/petals/blob/main/src/petals/utils/dht.py)**. The old names now lead to `DeprecationWarning`s and will be removed in Petals 2.2.0+.

What's Changed
* Fix stale link by bot66 in https://github.com/bigscience-workshop/petals/pull/418
* Add Discord badge and more Discord links to readme by borzunov in https://github.com/bigscience-workshop/petals/pull/422
* Add connect_timeout by borzunov in https://github.com/bigscience-workshop/petals/pull/423
* Add Stable Beluga 2 to readme by borzunov in https://github.com/bigscience-workshop/petals/pull/424
* Penalize servers that use relays during rebalancing by borzunov in https://github.com/bigscience-workshop/petals/pull/428
* Fix petals.utils.ping for servers with client-mode DHT by borzunov in https://github.com/bigscience-workshop/petals/pull/430
* Fix typo and make blocks message more informative by vadi2 in https://github.com/bigscience-workshop/petals/pull/437
* Update Discord links from channels to forums by borzunov in https://github.com/bigscience-workshop/petals/pull/440
* Remove distracting links from readme by borzunov in https://github.com/bigscience-workshop/petals/pull/441
* Remove deprecated comment in fine-tuning notebook by borzunov in https://github.com/bigscience-workshop/petals/pull/443
* Use bitsandbytes 0.41.1 by borzunov in https://github.com/bigscience-workshop/petals/pull/442
* [Refactor] extract block forward, backward and inference into a separate file by justheuristic in https://github.com/bigscience-workshop/petals/pull/435
* Override float32 in config to bfloat16 by borzunov in https://github.com/bigscience-workshop/petals/pull/431
* Prefer longer servers for fine-tuning, exclude unreachable by borzunov in https://github.com/bigscience-workshop/petals/pull/448
* Force using --new_swarm instead of empty --initial_peers by borzunov in https://github.com/bigscience-workshop/petals/pull/451
* Test Llama, rebalancing, throughput eval, and all CLI scripts by borzunov in https://github.com/bigscience-workshop/petals/pull/452
* benchmarks: Aggregate speed among workers, set default dtype torch32 by borzunov in https://github.com/bigscience-workshop/petals/pull/454
* Use torch.cuda.synchronize for compute throughput by justheuristic in https://github.com/bigscience-workshop/petals/pull/456
* Prioritize short inference, unmerge pools for long inference by borzunov in https://github.com/bigscience-workshop/petals/pull/458
* Bump version to 2.0.1.post2 by borzunov in https://github.com/bigscience-workshop/petals/pull/459
* Add `blocked_servers` argument by borzunov in https://github.com/bigscience-workshop/petals/pull/462
* Add customizable input tensors by artek0chumak in https://github.com/bigscience-workshop/petals/pull/445
* Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht by borzunov in https://github.com/bigscience-workshop/petals/pull/463
* Make client compatible with transformers' GenerationMixin by borzunov in https://github.com/bigscience-workshop/petals/pull/464
* Temporarily require peft<0.5.0, transformers<4.32.0 by justheuristic in https://github.com/bigscience-workshop/petals/pull/470
* Support transformers 4.32.x by justheuristic in https://github.com/bigscience-workshop/petals/pull/471
* Change transformers version assert by justheuristic in https://github.com/bigscience-workshop/petals/pull/472
* Support loading weights from Safetensors on server by borzunov in https://github.com/bigscience-workshop/petals/pull/473
* Update peft to 0.5.0 version by artek0chumak in https://github.com/bigscience-workshop/petals/pull/475
* Hide excess key message by borzunov in https://github.com/bigscience-workshop/petals/pull/476
* Bump version to 2.1.0 by borzunov in https://github.com/bigscience-workshop/petals/pull/474
* Don't install cpufeature on non-x86_64 machines by borzunov in https://github.com/bigscience-workshop/petals/pull/478

New Contributors
* bot66 made their first contribution in https://github.com/bigscience-workshop/petals/pull/418

**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v2.0.1...v2.1.0

2.0.1

Highlights

๐Ÿ›ฃ๏ธ **Inference of longer sequences.** We extended the max sequence length to **8192 tokens for Llama 2** and added chunking to avoid server out-of-memory errors (happened when processing long prefixes). This became possible thanks to multi-query attention used in Llama 2, which uses 8x less GPU memory for attention caches. Now you can process longer sequences using a Petals client and have dialogues of up to 8192 tokens at https://chat.petals.dev

๐Ÿ **Python 3.11 support.** Petals clients and servers now work on Python 3.11.

๐Ÿž **Bug fixes.** We fixed the server's `--token` argument (used to provide your ๐Ÿค— Model Hub [access token](https://huggingface.co/settings/tokens) for loading Llama 2), possible deadlocks in the server, issues with fine-tuning speed (servers available via relays are deprioritized) and other minor load balancing issues.

🪟 **Running server on Windows.** We made a better [guide](https://github.com/bigscience-workshop/petals/wiki/Run-Petals-server-on-Windows) for running a server in WSL (Windows Subsystem for Linux).

📦 **Running server on Runpod.** We added a [guide](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#cloud-providers) for using a Petals template on Runpod.

What's Changed
* Update to petals.dev by justheuristic in https://github.com/bigscience-workshop/petals/pull/390
* Bump version to 2.0.0.post3 by borzunov in https://github.com/bigscience-workshop/petals/pull/391
* Fix --attn_cache_tokens default by borzunov in https://github.com/bigscience-workshop/petals/pull/392
* Fix deadlocks in MemoryCache by borzunov in https://github.com/bigscience-workshop/petals/pull/396
* Support Python 3.11 by borzunov in https://github.com/bigscience-workshop/petals/pull/393
* Fix routing through relay, default network RPS, --token, logging, readme by borzunov in https://github.com/bigscience-workshop/petals/pull/399
* If speedtest fails, assume network speed of 100 Mbit/s by borzunov in https://github.com/bigscience-workshop/petals/pull/404
* Split long sequences into chunks by justheuristic in https://github.com/bigscience-workshop/petals/pull/403
* Add Llama 2, WSL instructions to readme by borzunov in https://github.com/bigscience-workshop/petals/pull/406
* Update README.md by borzunov in https://github.com/bigscience-workshop/petals/pull/407
* Update commands for hosting Llama 2 in readme by borzunov in https://github.com/bigscience-workshop/petals/pull/409
* Update --update_period and --expiration defaults by borzunov in https://github.com/bigscience-workshop/petals/pull/410
* Bump version to 2.0.1 by borzunov in https://github.com/bigscience-workshop/petals/pull/411


**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v2.0.0.post1...v2.0.1

2.0.0.post1

We're excited to announce Petals 2.0.0, the largest Petals release to date!

Highlights

🦙 **Support for LLaMA and LLaMA 2.** We've added support for **inference and fine-tuning** of any models based on 🤗 Transformers [`LlamaModel`](https://huggingface.co/docs/transformers/main/model_doc/llama), including all variants of [LLaMA](https://github.com/facebookresearch/llama/blob/llama_v1/MODEL_CARD.md) and [LLaMA 2](https://ai.meta.com/llama/), one of the strongest open-source models available today. The public swarm hosts the largest variants of these models, LLaMA-65B and LLaMA 2 (70B and 70B-Chat), providing inference at speeds of **up to 5-6 tokens/sec**.

- You can try them in the 💬 **[chatbot web app](https://chat.petals.dev)** or in 🚀 **[our Colab tutorial](https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing)**.

๐Ÿ—œ๏ธ **4-bit quantization.** We've integrated efficient 4-bit (NF4) quantization from the recent ["QLoRA: Efficient Finetuning of Quantized LLMs"](https://arxiv.org/abs/2305.14314) paper. This allows to use ~40% less GPU memory (thus, **~40% less servers**) to fit all model blocks and have **~2x speedup** for token-by-token inference, compared to the 8-bit quantization we previously used, with relatively small quality loss.

🔌 **Pre-loading LoRA adapters, such as Guanaco.** Servers can now pre-load LoRA adapters compatible with the 🤗 [PEFT](https://github.com/huggingface/peft) library, which may add extra functionality to the model you host. You can do this using the `--adapters` argument on the server (e.g., `--adapters repo1/adapter1 repo2/adapter2`). These adapters are activated at a client's request: specifically, the client may pass `.from_pretrained(..., active_adapter="repo1/adapter1")` when loading a distributed model. One example is [Guanaco](https://huggingface.co/timdettmers/guanaco-65b), an **instruction-finetuned adapter** for LLaMA that turns it into a helpful chatbot that carefully follows the user's instructions. You can try LLaMA with this adapter in our [chatbot](https://chat.petals.dev) app.
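
Putting both sides together, with repo names used only as examples (the base-model name below is a placeholder; the Guanaco adapter is the one linked above):

```python
from petals import AutoDistributedModelForCausalLM

# Server side (shell): python -m petals.cli.run_server <base_model> --adapters timdettmers/guanaco-65b
# Client side: ask servers to activate the pre-loaded adapter for your requests.
model = AutoDistributedModelForCausalLM.from_pretrained(
    "your-org/llama-65b-hf",                    # placeholder base model repo
    active_adapter="timdettmers/guanaco-65b",   # LoRA adapter compatible with PEFT
)
```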

โžก๏ธ **Direct server-to-server communication.** Previously, servers didn't send tensors to each other directly due to specifics of our fault-tolerant inference algorithm. This update changes that, which saves round-trip time between servers and a client and leads to substantial speedups for clients located far away from servers they're using.

๐Ÿ›ฃ๏ธ **Shortest-path routing for inference.** Previously, a client didn't properly choose geographically close and fast servers, so the client could choose a slow inference chain, especially if the swarm has many servers located for away from it. Now, the client builds a full graph of client-server and server-server latencies, as well as server inference speeds, to find the **fastest chain** of servers for inference among all possible ones. It also considers the amount of GPU memory left for attention caches, so that we don't choose a close server that doesn't actually have memory for our request.

🌎 **Loading models directly from 🤗 Model Hub and `Auto` classes.** Starting from Petals 2.0.0, models **do not need to be converted** to a special format to be hosted by Petals. Instead, both clients and servers can load models directly from the 🤗 [Model Hub](https://huggingface.co/models), fetching only the shards they need to host their part of the model. Furthermore, you can write code supporting multiple architectures at once using `Auto` classes, such as `AutoDistributedConfig.from_pretrained(...)` and `AutoDistributedModelForCausalLM.from_pretrained(...)`. The [guide](https://github.com/bigscience-workshop/petals/wiki/Run-a-custom-model-with-Petals) for adding new model architectures to Petals also became much simpler, since the Petals code is now generalized across architectures and the model conversion step is gone.
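
For instance, the same client code can target different architectures just by changing the repo name; a brief sketch using the `Auto` classes named above (the BLOOM repo is used as an example):

```python
from petals import AutoDistributedConfig, AutoDistributedModelForCausalLM

# The Auto classes pick the right distributed implementation from the repo's config
config = AutoDistributedConfig.from_pretrained("bigscience/bloom")
print(config.model_type)  # "bloom"

model = AutoDistributedModelForCausalLM.from_pretrained("bigscience/bloom")
```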

๐Ÿ‹๏ธ **Fine-tuning examples.** We've switched most examples to LLaMA-65B and fixed previously reported bugs. In particular, the ["Getting started"](https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing) notebook now includes a simple example of deep prompt tuning on a dummy task, and the [sequence classification notebook](https://colab.research.google.com/github/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb) uses LLaMA-65B and improved hyperparameters for a stable training.

๐Ÿ–ฅ๏ธ **Upgraded swarm monitor.** The [swarm monitor](https://health.petals.dev) now contains much more info about the server, including pre-loaded LoRA adapters, detailed performance info, latencies to potential next servers, and so on. All these info is published to DHT, so you don't need to ping each server to fetch it. We've also added a "Contributor" column, so that contributors hosting 10+ blocks get a chance to **publish their name, advertise their company or a social media account** in exchange to hosting a server for Petals. A name (or a link) shown there may be specified using the server's `--public_name` argument.

What's Changed
* Remove unused imports and attributes by mryab in https://github.com/bigscience-workshop/petals/pull/324
* Determine block dtype in a unified manner by mryab in https://github.com/bigscience-workshop/petals/pull/325
* Use number of tokens for attn_cache_size by mryab in https://github.com/bigscience-workshop/petals/pull/286
* Add LLaMA support by borzunov in https://github.com/bigscience-workshop/petals/pull/323
* Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} by borzunov in https://github.com/bigscience-workshop/petals/pull/329
* Fix llama's lm_head.weight.requires_grad by borzunov in https://github.com/bigscience-workshop/petals/pull/330
* Show license links when loading models by borzunov in https://github.com/bigscience-workshop/petals/pull/332
* Add benchmark scripts by borzunov in https://github.com/bigscience-workshop/petals/pull/319
* Fix warmup steps and minor issues in benchmarks by borzunov in https://github.com/bigscience-workshop/petals/pull/334
* Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) by borzunov in https://github.com/bigscience-workshop/petals/pull/337
* Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) by borzunov in https://github.com/bigscience-workshop/petals/pull/333
* Allow free_disk_space_for() remove arbitrary files from Petals cache by borzunov in https://github.com/bigscience-workshop/petals/pull/339
* Implement direct server-to-server communication by borzunov in https://github.com/bigscience-workshop/petals/pull/331
* Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 by borzunov in https://github.com/bigscience-workshop/petals/pull/340
* Delete deprecated petals.cli scripts by borzunov in https://github.com/bigscience-workshop/petals/pull/336
* Use bitsandbytes 0.40.0.post4 with bias hotfix by borzunov in https://github.com/bigscience-workshop/petals/pull/342
* Support peft LoRA adapters by artek0chumak in https://github.com/bigscience-workshop/petals/pull/335
* Fix convergence issues and switch to LLaMA in the SST-2 example by mryab in https://github.com/bigscience-workshop/petals/pull/343
* Mention LLaMA in readme by borzunov in https://github.com/bigscience-workshop/petals/pull/344
* Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes by borzunov in https://github.com/bigscience-workshop/petals/pull/345
* Fix Docker build by avoiding Python 3.11 by borzunov in https://github.com/bigscience-workshop/petals/pull/348
* Support LLaMA repos without "-hf" suffix by borzunov in https://github.com/bigscience-workshop/petals/pull/349
* Estimate adapter memory overhead in choose_num_blocks() by justheuristic in https://github.com/bigscience-workshop/petals/pull/346
* Spam less in server logs by borzunov in https://github.com/bigscience-workshop/petals/pull/350
* Remove unused import os by justheuristic in https://github.com/bigscience-workshop/petals/pull/352
* Test that bitsandbytes is not imported when it's not used by borzunov in https://github.com/bigscience-workshop/petals/pull/351
* Fix bugs in _choose_num_blocks() added in 346 by borzunov in https://github.com/bigscience-workshop/petals/pull/354
* Switch adapters slightly faster by justheuristic in https://github.com/bigscience-workshop/petals/pull/353
* Share more info about a server in DHT by borzunov in https://github.com/bigscience-workshop/petals/pull/355
* Make a server ping next servers by borzunov in https://github.com/bigscience-workshop/petals/pull/356
* Use bitsandbytes 0.40.1.post1 by borzunov in https://github.com/bigscience-workshop/petals/pull/357
* Update readme and "Getting started" link by borzunov in https://github.com/bigscience-workshop/petals/pull/360
* Report inference, forward, and network RPS separately by borzunov in https://github.com/bigscience-workshop/petals/pull/358
* Fix typo in generation_algorithms.py by eltociear in https://github.com/bigscience-workshop/petals/pull/364
* Implement shortest-path routing for inference by borzunov in https://github.com/bigscience-workshop/petals/pull/362
* Update readme to show new models by borzunov in https://github.com/bigscience-workshop/petals/pull/365
* Require transformers < 4.31.0 until we're compatible by borzunov in https://github.com/bigscience-workshop/petals/pull/369
* Fix AssertionError on rebalancing by borzunov in https://github.com/bigscience-workshop/petals/pull/370
* Update transformers to 4.31.0 and peft to 0.4.0 by borzunov in https://github.com/bigscience-workshop/petals/pull/371
* Fix readme code example, require Python < 3.11 until supported by borzunov in https://github.com/bigscience-workshop/petals/pull/374
* Fix handler memory leak, get rid of mp.Manager by justheuristic in https://github.com/bigscience-workshop/petals/pull/373
* Inherit bitsandbytes compute dtype correctly (override peft quirk) by justheuristic in https://github.com/bigscience-workshop/petals/pull/377
* Fix --token arg by borzunov in https://github.com/bigscience-workshop/petals/pull/378
* Support Llama 2 by borzunov in https://github.com/bigscience-workshop/petals/pull/379
* Require accelerate>=0.20.3 as transformers do by borzunov in https://github.com/bigscience-workshop/petals/pull/383
* Bump version to 2.0.0.post1 by borzunov in https://github.com/bigscience-workshop/petals/pull/384

New Contributors
* eltociear made their first contribution in https://github.com/bigscience-workshop/petals/pull/364

**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v1.1.5...v2.0.0.post1

1.1.5

Highlights

**โฑ Faster fine-tuning.** Fine-tuning uses ~2x less traffic (tensors are now sent in bfloat16 by default) and builds routes using a heuristic maximizing the swarm's throughput. This should address timeout errors that could happen during fine-tuning.

**๐Ÿž Bug fixes.** On servers, this release fixes out-of-memory errors and freezing network throughput evals. On clients, it fixes issues with slicing `RemoteSequential` and silently ignoring unsupported `.generate()` kwargs. Also, this release fixes warnings originated from `hivemind.p2p` and `hivemind.compression`.

**๐Ÿ›ฃ๏ธ Updated throughput formula.** We have updated the throughput formula to reflect that servers hosting many blocks still run forward and backward passes through only one block at a time. Don't be surprised if your throughput became smaller than in 1.1.4 โ€” these numbers are not directly comparable!

**๐Ÿ–ผ๏ธ Improved lower-level interfaces.** We have refactored lower-level interfaces, such as `RemoteSequential` and `RemoteSequenceManager`, to be more reliable (e.g. when doing retries) and much easier to use. Some rarely used low-level functions in `petals.dht_utils` were removed.

What's Changed
* Fix OOMs happening in case of accelerate >= 0.16.0 by borzunov in https://github.com/bigscience-workshop/petals/pull/310
* Refactor RemoteSequenceManager by borzunov in https://github.com/bigscience-workshop/petals/pull/309
* Update hivemind to 1.1.8, enable efficient bfloat16 encoding by borzunov in https://github.com/bigscience-workshop/petals/pull/311
* Replace .make_sequence(..., mode="random") with mode="max_throughput" by borzunov in https://github.com/bigscience-workshop/petals/pull/313
* Divide compute throughput by average no. of used blocks by borzunov in https://github.com/bigscience-workshop/petals/pull/314
* Raise error for unexpected .generate() kwargs by borzunov in https://github.com/bigscience-workshop/petals/pull/315
* Abort speedtest if it runs too long by borzunov in https://github.com/bigscience-workshop/petals/pull/316
* Bump version to 1.1.5 by borzunov in https://github.com/bigscience-workshop/petals/pull/312


**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v1.1.4...v1.1.5

1.1.4

Highlights

๐Ÿ—๏ธ **8-bit servers support more GPUs.** A [bitsandbytes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.38.0) update brings 8-bit support to older generations of NVIDIA GPUs, as well as the [GeForce 16](https://ru.wikipedia.org/wiki/GeForce_16) GPU series (e.g. 1660 Ti). Please try Petals 1.1.4 if you previously had errors like `Your GPU does not support Int8 Matmul!` and `cublasLt ran into an error!` on some GPUs. This version also loads weights in 8-bit by default when [tensor parallelism](https://github.com/bigscience-workshop/petals/wiki/FAQ:-Frequently-asked-questions#managing-gpus) is enabled.

โฑ๏ธ **Servers start faster.** Servers take ~2x less time to load block weights from the disk cache to the GPU memory. The next release will also reduce the time it takes to download the weights from the Internet, since they will be downloaded in 8-bit instead of 16-bit.

🧵 **Multi-threaded clients work faster.** Earlier, multi-threaded clients were actually performing only one network request at a time due to a bug in hivemind. This bug was recently fixed in [hivemind](https://github.com/learning-at-home/hivemind/releases/tag/1.1.7). This significantly improves the speed of the [chat.petals.ml](https://github.com/borzunov/chat.petals.ml) app when multiple users chat concurrently.

โฑ๏ธ **Clients start faster.** Clients take ~10% less time to load the model, since they build a route through remote servers in parallel with loading the local part of the model (input/output embeddings).

🌳 **Relaxed dependency requirements.** We relaxed version requirements for [transformers](https://github.com/huggingface/transformers) and other [huggingface](https://github.com/huggingface) libraries, so you can update them independently of Petals. In particular, Petals works with PyTorch 2.0 and the latest `transformers` release. Also, we fixed a bug where the client loaded a model in float32 by default (instead of bfloat16/float16) in some `transformers` releases. Please try Petals 1.1.4 if you previously had out-of-memory errors when running the client.

What's Changed

* Speed up loading blocks using init with meta weights by mryab in https://github.com/bigscience-workshop/petals/pull/285
* Add benchmarks to readme by borzunov in https://github.com/bigscience-workshop/petals/pull/284
* Fix invalid author email in setup.cfg by borzunov in https://github.com/bigscience-workshop/petals/pull/287
* Hotfix: Increase daemon_startup_timeout by borzunov in https://github.com/bigscience-workshop/petals/pull/292
* Update bitsandbytes, hivemind, transformers by justheuristic in https://github.com/bigscience-workshop/petals/pull/290
* Fix deps, enable 8-bit by default for TP by borzunov in https://github.com/bigscience-workshop/petals/pull/298
* Add Python 3.10 to CI by borzunov in https://github.com/bigscience-workshop/petals/pull/299
* Remove CustomLinear8bitLt by borzunov in https://github.com/bigscience-workshop/petals/pull/297
* Remove use_auto_relay=True in client by borzunov in https://github.com/bigscience-workshop/petals/pull/300
* Start SequenceManager's thread only after first .make_sequence() by borzunov in https://github.com/bigscience-workshop/petals/pull/301
* Require bitsandbytes == 0.38.0.post2, hivemind == 1.1.7 by borzunov in https://github.com/bigscience-workshop/petals/pull/302
* Suggest commands for Docker first by borzunov in https://github.com/bigscience-workshop/petals/pull/304
* Relax the rest of Hugging Face dependencies by borzunov in https://github.com/bigscience-workshop/petals/pull/305
* Force transformers to use config.torch_dtype by default by borzunov in https://github.com/bigscience-workshop/petals/pull/307
* Bump version to 1.1.4 by borzunov in https://github.com/bigscience-workshop/petals/pull/306


**Full Changelog**: https://github.com/bigscience-workshop/petals/compare/v1.1.3...v1.1.4
