New documentation
The whole documentation has been revamped. Go take a look at it [here](https://huggingface.co/docs/accelerate)!
* Complete revamp of the docs by muellerzr in 495
New gather_for_metrics method
When doing distributed evaluation, the dataloader loops back to the beginning of the dataset to make batches whose size is a round multiple of the number of processes. This makes the gathered predictions slightly longer than the dataset, which used to require some truncating. This is now all done behind the scenes if you replace the `gather` you did in evaluation with `gather_for_metrics`.
* Reenable Gather for Metrics by muellerzr in 590
* Fix gather_for_metrics by muellerzr in 578
* Add a gather_for_metrics capability by muellerzr in 540
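To see why the truncation was needed, here is a minimal pure-Python sketch of the wrap-around described above. It does not use Accelerate and is not how the library implements it: `shard_indices` and `gather_all` are hypothetical helpers, and padding to a multiple of `num_processes * batch_size` is a simplifying assumption made for this illustration.

```python
import math


def shard_indices(dataset_len, num_processes, batch_size):
    """Hypothetical helper: wrap indices back to the start of the dataset so
    every process receives the same number of full batches."""
    total_batch = num_processes * batch_size
    padded_len = math.ceil(dataset_len / total_batch) * total_batch
    indices = [i % dataset_len for i in range(padded_len)]
    shards = [[] for _ in range(num_processes)]
    for start in range(0, padded_len, total_batch):
        for rank in range(num_processes):
            lo = start + rank * batch_size
            shards[rank].append(indices[lo:lo + batch_size])
    return shards


def gather_all(shards):
    """Hypothetical helper: concatenate the per-process batches step by step,
    the way a plain gather would return them."""
    gathered = []
    for step_batches in zip(*shards):
        for batch in step_batches:
            gathered.extend(batch)
    return gathered


dataset_len = 10
shards = shard_indices(dataset_len, num_processes=4, batch_size=2)

# A plain gather returns the duplicated samples from the wrap-around too.
predictions = gather_all(shards)

# This is the manual truncation that gather_for_metrics now handles for you.
truncated = predictions[:dataset_len]

print(len(predictions), len(truncated))
```

With 10 samples, 4 processes and a batch size of 2, the plain gather yields 16 predictions, and the last 6 are repeats of the first samples. In an evaluation loop, switching the `gather` call to `gather_for_metrics` is meant to remove the need for that final slice.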
Balanced device maps
When loading big models for inference, `device_map="auto"` used to fill the GPUs sequentially, making it hard to use a batch size > 1. It now balances the weights evenly across the GPUs, so if you have more GPU space than the model size, you can do predictions with a bigger batch size!
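The difference between the two strategies can be illustrated with a toy allocator. This is only a sketch under stated assumptions: `sequential_map`, `balanced_map` and `usage` are hypothetical helpers, sizes are in arbitrary GB units, and the actual computation in Accelerate takes more into account than this.

```python
def sequential_map(layer_sizes, gpu_capacity, num_gpus):
    """Hypothetical helper: fill each GPU to capacity before moving on."""
    device_map, device, used = {}, 0, 0
    for name, size in layer_sizes.items():
        if used + size > gpu_capacity and device < num_gpus - 1:
            device, used = device + 1, 0
        device_map[name] = device
        used += size
    return device_map


def balanced_map(layer_sizes, num_gpus):
    """Hypothetical helper: aim for an equal share of the weights per GPU."""
    target = sum(layer_sizes.values()) / num_gpus
    device_map, device, used = {}, 0, 0
    for name, size in layer_sizes.items():
        if used > 0 and used + size > target and device < num_gpus - 1:
            device, used = device + 1, 0
        device_map[name] = device
        used += size
    return device_map


def usage(device_map, layer_sizes, num_gpus):
    """Total weight size placed on each GPU."""
    totals = [0] * num_gpus
    for name, device in device_map.items():
        totals[device] += layer_sizes[name]
    return totals


# Toy model: 8 layers of 2 GB each, on 2 GPUs with 12 GB each.
layers = {f"layer{i}": 2 for i in range(8)}
seq_usage = usage(sequential_map(layers, gpu_capacity=12, num_gpus=2), layers, 2)
bal_usage = usage(balanced_map(layers, num_gpus=2), layers, 2)

print("sequential:", seq_usage)
print("balanced:  ", bal_usage)
```

In this toy setup the sequential strategy leaves no free memory on the first GPU, while the balanced one leaves 4 GB free on each GPU, which is the headroom a bigger batch needs.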
M1 GPU support
Accelerate now supports M1 GPUs. To learn more about how to set up your environment, see the [documentation](https://huggingface.co/docs/accelerate/v0.12.0/en/usage_guides/mps#accelerated-pytorch-training-on-mac).
* M1 GPU `mps` device integration by pacman100 in 596
What's new?
* Small fixed for balanced device maps by sgugger in 583
* Add balanced option for auto device map creation by sgugger in 534
* fixing deepspeed slow tests issue by pacman100 in 604
* add more conditions on casting by younesbelkada in 606
* Remove redundant `.run` in `WandBTracker`. by zh-plus in 605
* Fix some typos + wordings by muellerzr in 603
* reorg of test scripts and minor changes to tests by pacman100 in 602
* Move warning by muellerzr in 598
* Shorthand way to grab a tracker by muellerzr in 594
* Pin deepspeed by muellerzr in 595
* Improve docstring by muellerzr in 591
* TESTS! by muellerzr in 589
* Fix DispatchDataloader by sgugger in 588
* Use main_process_first in the examples by muellerzr in 581
* Skip and raise NotImplementedError for gather_for_metrics for now by muellerzr in 580
* minor FSDP launcher fix by pacman100 in 579
* Refine test in set_module_tensor_to_device by sgugger in 577
* Fix `set_module_tensor_to_device` by sgugger in 576
* Add 8 bit support - chapter II by younesbelkada in 539
* Fix tests, add wandb to gitignore by muellerzr in 573
* Fix step by muellerzr in 572
* Speed up main CI by muellerzr in 571
* ccl version check and import different module according to version by sywangyi in 567
* set default num_cpu_threads_per_process to improve oob performance by sywangyi in 562
* Add a tqdm helper by muellerzr in 564
* Rename actions to be a bit more accurate by muellerzr in 568
* Fix clean by muellerzr in 569
* enhancements and fixes for FSDP and DeepSpeed by pacman100 in 532
* fix: saving model weights by csarron in 556
* add on_main_process decorators by ZhiyuanChen in 488
* Update imports.py by KimBioInfoStudio in 554
* unpin `datasets` by lhoestq in 563
* Create good defaults in `accelerate launch` by muellerzr in 553
* Fix a few minor issues with example code in docs by BenjaminBossan in 551
* deepspeed version `0.6.7` fix by pacman100 in 544
* Rename test extras to testing by muellerzr in 545
* Add production testing + fix failing CI by muellerzr in 547
* Add a gather_for_metrics capability by muellerzr in 540
* Allow for kwargs to be passed to trackers by muellerzr in 542
* Add support for downcasting bf16 on TPUs by muellerzr in 523
* Add more documentation for device maps computations by sgugger in 530
* Restyle prepare one by muellerzr in 531
* Pick a better default for offload_state_dict by sgugger in 529
* fix some parameter setting does not work for CPU DDP and bf16 fail in… by sywangyi in 527
* Fix accelerate tests command by sgugger in 528
Significant community contributions
The following contributors have made significant changes to the library over the last release:
* sywangyi
  * ccl version check and import different module according to version (567)
  * set default num_cpu_threads_per_process to improve oob performance (562)
  * fix some parameter setting does not work for CPU DDP and bf16 fail in… (527)
* ZhiyuanChen
  * add on_main_process decorators (488)