Product Research Enterprise Plans Docs

Lightning

Latest version: v2.5.1

Safety actively analyzes 723929 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 4 of 32

2.2.0.rc0

This is a preview release for Lightning 2.2.0.

2.1.4

Not secure

Fabric

Fixed

- Fixed an issue preventing Fabric to run on CPU when the system's CUDA driver is outdated or broken ([19234](https://github.com/Lightning-AI/lightning/pull/19234))
- Fixed typo in kwarg in SpikeDetection ([19282](https://github.com/Lightning-AI/lightning/pull/19282))

---

PyTorch

Fixed

- Fixed `Trainer` not expanding the `default_root_dir` if it has the `~` (home) prefix ([19179](https://github.com/Lightning-AI/lightning/pull/19179))
- Fixed warning for Dataloader if `num_workers=1` and CPU count is 1 ([19224](https://github.com/Lightning-AI/lightning/pull/19224))
- Fixed `WandbLogger.watch()` method annotation to accept `None` for the log parameter ([19237](https://github.com/Lightning-AI/lightning/pull/19237))
- Fixed an issue preventing the Trainer to run on CPU when the system's CUDA driver is outdated or broken ([19234](https://github.com/Lightning-AI/lightning/pull/19234))
- Fixed an issue with the ModelCheckpoint callback not saving relative symlinks with `ModelCheckpoint(save_last="link")` ([19303](https://github.com/Lightning-AI/lightning/pull/19303))
- Fixed issue where the `_restricted_classmethod_impl` would incorrectly raise a TypeError on inspection rather than on call ([19332](https://github.com/Lightning-AI/lightning/pull/19332))
- Fixed exporting `__version__` in `__init__` ([19221](https://github.com/Lightning-AI/lightning/pull/19221))

---

**Full Changelog**: https://github.com/Lightning-AI/pytorch-lightning/compare/2.1.3...2.1.4

Contributors

andyland asingh9530 awaelchli Borda daturkel dipta007 lauritsf mjbommar shenmishajing tchaton

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

2.1.3

Not secure

App

Changed

- Lightning App: Use the batch get endpoint (19180)
- Drop starsessions from App's requirements (18470)
- Optimize loading time for chunks to be there (19109)

---

Data

Added

- Add fault tolerance `StreamingDataset` (19052)
- Add numpy support for the `StreamingDataset` (19050)
- Add fault tolerance for the `StreamingDataset` (19049)
- Add direct s3 support to the `StreamingDataset` (19044)
- Add disk usage check before downloading files (19041)

Changed

- Cleanup chunks right away if the dataset doesn't fit within the cache in `StreamingDataset` (19168)
- `StreamingDataset` improve deletion strategy (19118)
- Improve `StreamingDataset` Speed (19114)
- Remove time in the Data Processor progress bar (19108)
- Optimize loading time for chunks to be there (19109)
- Resolve path for `StreamingDataset` (19094)
- Make input dir in `DataProcessor` required (18910)
- Remove the `LightningDataset` relying on un-maintained torchdata (19019)

Fixed

- Resolve checkpointing for the Streaming Dataset (19123)
- Resolve Item Loader bugs (19017)

---

Fabric

Fixed

- Avoid moving the model to device if `move_to_device=False` is passed (19152)
- Fixed broadcast at initialization in `MPIEnvironment` (19074)

---

PyTorch

Changed

- `LightningCLI` no longer allows setting a normal class instance as default. A `lazy_instance` can be used instead (18822)

Fixed

- Fixed checks for local file protocol due to fsspec changes in 2023.10.0 (19023)
- Fixed automatic detection of 'last.ckpt' files to respect the extension when filtering (17072)
- Fixed an issue where setting `CHECKPOINT_JOIN_CHAR` or `CHECKPOINT_EQUALS_CHAR` would only work on the `ModelCheckpoint` class but not on an instance (19054)
- Fixed `ModelCheckpoint` not expanding the `dirpath` if it has the `~` (home) prefix (19058)
- Fixed handling checkpoint dirpath suffix in NeptuneLogger (18863)
- Fixed an edge case where `ModelCheckpoint` would alternate between versioned and unversioned filename (19064)
- Fixed broadcast at initialization in `MPIEnvironment` (19074)
- Fixed the tensor conversion in `self.log` to respect the default dtype (19046)

---

**Full Changelog**: https://github.com/Lightning-AI/lightning/compare/2.1.2...2.1.3

Contributors

AleksanderWWW, awaelchli, borda, carmocca, dependabot[bot], mauvilsa, MF-FOOM, tchaton, yassersouri

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

2.1.2

Not secure

App

Changed

- Forced plugin server to use localhost (18976)
- Enabled bundling additional files into app source (18980)
- Limited rate of requests to http queue (18981)

---

Fabric

Fixed

- Fixed precision default from environment (18928)

---

PyTorch

Fixed

- Fixed an issue causing permission errors on Windows when attempting to create a symlink for the "last" checkpoint (18942)
- Fixed an issue where Metric instances from `torchmetrics` wouldn't get moved to the device when using FSDP (18954)
- Fixed an issue preventing the user to `Trainer.save_checkpoint()` an FSDP model when `Trainer.test/validate/predict()` ran after `Trainer.fit()` (18992)

---

Contributors

awaelchli, carmocca, ethanwharris, tchaton

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

**Full Changelog**: https://github.com/Lightning-AI/lightning/compare/2.1.1...2.1.2

2.1.1

Not secure

App

Added

- add flow `fail()` (18883)

Fixed

- Fix failing lightning cli entry point (18821)

---

Fabric

Changed

- Calling a method other than `forward` that invokes submodules is now an error when the model is wrapped (e.g., with DDP) (18819)

Fixed

- Fixed false-positive warnings about method calls on the Fabric-wrapped module (18819)
- Refined the FSDP saving logic and error messaging when the path exists (18884)
- Fixed layer conversion under `Fabric.init_module()` context manager when using the `BitsandbytesPrecision` plugin (18914)

---

PyTorch

Fixed

- Fixed an issue when replacing an existing `last.ckpt` file with a symlink (18793)
- Fixed an issue when `BatchSizeFinder` `steps_per_trial` parameter ends up defining how many validation batches to run during the entire training (18394)
- Fixed an issue saving the `last.ckpt` file when using `ModelCheckpoint` on a remote filesystem, and no logger is used (18867)
- Refined the FSDP saving logic and error messaging when the path exists (18884)
- Fixed an issue parsing the version from folders that don't include a version number in `TensorBoardLogger` and `CSVLogger` (18897)

---

Contributors

awaelchli, borda, BoringDonut, carmocca, hiaoxui, ioangatop, nohalon, rasbt, tchaton

_If we forgot someone due to not matching commit email with GitHub account, let us know :]_

**Full Changelog**: https://github.com/Lightning-AI/lightning/compare/2.1.0...2.1.1

2.1.0

Not secure

[Lightning AI](https://lightning.ai) is excited to announce the release of Lightning 2.1 :zap: It's the culmination of work from 79 contributors who have worked on features, bug-fixes, and documentation for a total of over 750+ commits since v2.0.

The theme of 2.1 is "bigger, better, faster": **Bigger** because training large multi-billion parameter models has gotten even more efficient thanks to FSDP, efficient initialization and sharded checkpointing improvements, **better** because it's easier than ever to scale models without making substantial code changes or installing third-party packages and **faster** because it leverages the latest hardware features to speed up training in low-bit precision thanks to new precision plugins like bitsandbytes and transformer engine.
And of course, as the name implies, this release fully leverages the latest features in [PyTorch 2.1](https://pytorch.org/blog/pytorch-2-1/) :tada:

- [Highlights](highlights)
- [Improvements To Large-Scale Training With FSDP](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-fsdp)
- [True Half-Precision](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-half-precision)
- [Bitsandbytes Quantization](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-bitsandbytes)
- [Transformer Engine](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-transformer-engine)
- [Lightning on TPU Goes Brrr](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-tpu)
- [Granular Control Over Checkpoints in Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#highlights-fabric-checkpoints)
- [Backward Incompatible Changes](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#bc-changes)
- [PyTorch Lightning](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#bc-changes-pytorch)
- [Lightning Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#bc-changes-fabric)
- [Full Changelog](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog)
- [PyTorch Lightning](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog-pytorch)
- [Lightning Fabric](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog-fabric)
- [Lightning App](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#changelog-app)
- [Contributors](https://github.com/Lightning-AI/lightning/releases/tag/2.1.0#contributors)

<a name="highlights"></a>
Highlights

<a name="highlights-fsdp"></a>
Improvements To Large-Scale Training With FSDP

The FSDP strategy for training large billion-parameter models gets substantial improvements and new features in Lightning 2.1, both in Trainer and Fabric (in case you didn't know, [Fabric](https://lightning.ai/docs/fabric/stable) is the latest addition to the Lightning family of tools to scale models without the boilerplate code).
FSDP is now more user-friendly to configure, has memory management and speed improvements, and we have a brand new end-to-end user guide with best practices ([Trainer](https://lightning.ai/docs/pytorch/latest/advanced/model_parallel/fsdp.html), [Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_parallel/fsdp.html)).

Efficient Saving and Loading of Large Checkpoints

When training large billion-parameter models with FSDP, saving and resuming training, or even just loading model parameters for finetuning can be challenging, as users are are often plagued by out-of-memory errors and speed bottlenecks.

In 2.1, we made several improvements. Starting with saving checkpoints, we added support for distributed/sharded checkpoints, enabled through the setting `state_dict_type` in the strategy ([18364](https://github.com/Lightning-AI/lightning/pull/18364), [#18358](https://github.com/Lightning-AI/lightning/pull/18358)):

**Trainer:**
python
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

Default used by the strategy
strategy = FSDPStrategy(state_dict_type="full")

Enable saving distributed checkpoints
strategy = FSDPStrategy(state_dict_type="sharded")

trainer = L.Trainer(strategy=strategy, ...)

**Fabric:**
python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

Saving distributed checkpoints is the default
strategy = FSDPStrategy(state_dict_type="sharded")

Save consolidated (single file) checkpoints
strategy = FSDPStrategy(state_dict_type="full")

fabric = L.Fabric(strategy=strategy, ...)

Distributed checkpoints are the fastest and most memory efficient way to save the state of very large models.
The distributed checkpoint format also makes it efficient to load these checkpoints back for resuming training in parallel, and it reduces the impact on CPU memory usage significantly. Furthermore, we've also introduced lazy-loading for non-distributed checkpoints ([18150](https://github.com/Lightning-AI/lightning/pull/18150), [#18379](https://github.com/Lightning-AI/lightning/pull/18379)), which greatly reduces the impact on CPU memory usage when loading a consolidated (single-file) checkpoint (e.g. for finetuning). Learn more about these features in our FSDP guides ([Trainer](https://lightning.ai/docs/pytorch/latest/advanced/model_parallel/fsdp.html), [Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_parallel/fsdp.html)).

Fast and Memory-Optimized Initialization

A major challenge that users face when working with large models such as LLMs is dealing with the extreme memory requirements. Even something as simple as instantiating a model becomes non-trivial if the model is so large it won't fit in a single GPU or even a single machine. In Lightning 2.1, we are introducing empty-weights initialization through the `Fabric.init_module()` ([17462](https://github.com/Lightning-AI/lightning/pull/17462), [#17627](https://github.com/Lightning-AI/lightning/pull/17627)) and `Trainer.init_module()`/`LightningModule.configure_model()` ([#18004](https://github.com/Lightning-AI/lightning/pull/18004), [#18004](https://github.com/Lightning-AI/lightning/pull/18004), [#18385](https://github.com/Lightning-AI/lightning/pull/18385)) methods:

**Trainer:**
python
import lightning as L

class MyModel(L.LightningModule):
def __init__(self):
super().__init__()
Delay initialization of model to `configure_model()`

def configure_model(self):
Model initialized in correct precision and weights on meta-device
self.model = ...

...

trainer = L.Trainer(strategy="fsdp", ...)
trainer.fit(model)

**Fabric:**
python
import lightning as L

fabric = L.Fabric(strategy="fsdp", ...)

Model initialized in correct precision and weights on meta-device
with fabric.init_module(empty_init=True):
model = ...

You can also initialize buffers and tensors directly on device and dtype
with fabric.init_tensor():
model.mask.create()
model.kv_cache.create()
x = torch.randn(4, 128)

Materialization and sharding of model happens inside here
model = fabric.setup(model)

Read more about this new feature and its other benefits in our docs ([Trainer](https://lightning.ai/docs/pytorch/latest/advanced/model_init.html), [Fabric](https://lightning.ai/docs/fabric/latest/advanced/model_init.html)).

User-Friendly Configuration

We made it super easy to configure the sharding- and activation-checkpointing policy when you want to auto-wrap particular layers of your model for advanced control ([18045](https://github.com/Lightning-AI/lightning/pull/18045), [#18084](https://github.com/Lightning-AI/lightning/pull/18084)).

diff
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.wrap import ModuleWrapPolicy

- strategy = FSDPStrategy(auto_wrap_policy=ModuleWrapPolicy({MyTransformerBlock}))
+ strategy = FSDPStrategy(auto_wrap_policy={MyTransformerBlock})
trainer = L.Trainer(strategy=strategy, ...)

Furthermore, the sharding strategy can now be conveniently set with a string value ([18087](https://github.com/Lightning-AI/lightning/pull/18087)):

diff
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.fully_sharded_data_parallel import ShardingStrategy

- strategy = FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
+ strategy = FSDPStrategy(sharding_strategy="SHARD_GRAD_OP")
trainer = L.Trainer(strategy=strategy, ...)

You no longer need to remember the long PyTorch imports! Fabric also supports all these improvements shown above.

<a name="highlights-half-precision"></a>
True Half-Precision

Lightning now supports true half-precision for training and inference with all built-in strategies ([18193](https://github.com/Lightning-AI/lightning/pull/18193), [#18217](https://github.com/Lightning-AI/lightning/pull/18217), [#18213](https://github.com/Lightning-AI/lightning/pull/18213), [#18219](https://github.com/Lightning-AI/lightning/pull/18219)). With this setting, the memory required to store the model weights is only half of what is normally needed when running with float32. In addition, you get the same speed benefits as mixed precision training (`precision="16-mixed"`) has:

python
import lightning as L

default
trainer = L.Trainer(precision="32-true")

train with model weights in `torch.float16`
trainer = L.Trainer(precision="16-true")

train with model weights in `torch.bfloat16`
(if hardware supports it)
trainer = L.Trainer(precision="bf16-true")

The same settings are also available in Fabric! We recommend to try bfloat16 training (`precision="bf16-true"`) as it is often more numerically stable than regular 16-bit precision (`precision="16-true"`).

<a name="highlights-bitsandbytes"></a>
Bitsandbytes Quantization

With the new [Bitsandbytes precision plugin](https://lightning.ai/docs/pytorch/latest/common/precision_intermediate.html#quantization-via-bitsandbytes) [18655](https://github.com/Lightning-AI/lightning/pull/18655), you can now quantize your model for significant memory savings during training, finetuning, or inference with a selection of several state-of-the-art quantization algorithms (int8, fp4, nf4 and more). For the first time, Trainer and Fabric make [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) easy to use for general models.

**Trainer:**
python
import lightning as L
from lightning.pytorch.plugins import BitsandbytesPrecisionPlugin

this will pick out the compute dtype automatically, by default `bfloat16`
precision = BitsandbytesPrecisionPlugin("nf4-dq")
trainer = L.Trainer(plugins=precision)

**Fabric:**
python
import lightning as L
from lightning.fabric.plugins import BitsandbytesPrecision

this will pick out the compute dtype automatically, by default `bfloat16`
precision = BitsandbytesPrecision("nf4-dq")
trainer = L.Fabric(plugins=precision)

[Learn more!](https://lightning.ai/docs/pytorch/latest/common/precision_intermediate.html#quantization-via-bitsandbytes)

<a name="highlights-transformer-engine"></a>
Transformer Engine

The [Transformer Engine by NVIDIA](https://docs.nvidia.com/deeplearning/transformer-engine) is a library for accelerating transformer layers on the new Hopper (H100) generation of GPUs. With the integration in Lightning Trainer and Fabric ([#17597](https://github.com/Lightning-AI/lightning/pull/17597), [#18459](https://github.com/Lightning-AI/lightning/pull/18459)), you have easy access to the 8-bit mixed precision for significant speed ups:

**Trainer:**
python
import lightning as L

Select 8-bit mixed precision via TransformerEngine, with model weights in float16
trainer = L.Trainer(precision="transformer-engine-float16")

**Fabric:**
python
import lightning as L

Select 8-bit mixed precision via TransformerEngine, with model weights in float16
fabric = L.Fabric(precision="transformer-engine-float16")

More configuration options are available through the respective plugins in [Trainer](https://lightning.ai/docs/pytorch/latest/common/precision_intermediate.html#float8-mixed-precision-via-nvidia-s-transformerengine) and [Fabric](https://lightning.ai/docs/fabric/latest/fundamentals/precision.html#float8-mixed-precision-via-nvidia-s-transformerengine).

<a name="highlights-tpu"></a>
Lightning on TPU Goes Brrr

Lightning 2.1 runs on the latest generation of TPU hardware on Google Cloud! TPU-v4 and TPU-v5 ([17227](https://github.com/Lightning-AI/lightning/pull/17227)) are now fully supported both in Fabric and Trainer and run using the new PjRT runtime by default ([#17352](https://github.com/Lightning-AI/lightning/pull/17352)). PjRT is the runtime used by Jax and has shown an average improvement of 35% on benchmarks.

**Trainer:**
python
import lightning as L

trainer = L.Trainer(accelerator="tpu", devices=8)
model = MyModel()
trainer.fit(model) uses PjRT if available

**Fabric:**
python
import lightning as L

def train(fabric):
...

fabric = L.Fabric(accelerator="tpu")
fabric.launch(train) uses PjRT if available

And what's even more exciting, you can now scale massive multi-billion parameter models on TPUs using FSDP ([17421](https://github.com/Lightning-AI/lightning/pull/17421)).

python
import lightning as L
from lightning.fabric.strategies import XLAFSDPStrategy

strategy = XLAFSDPStrategy(
Most arguments from the PyTorch native FSDP strategy are also available here!
auto_wrap_policy={Block},
activation_checkpointing_policy={Block},
state_dict_type="full",
sequential_save=True,
)

fabric = L.Fabric(devices=8, strategy=strategy)
fabric.launch(finetune)

You can find a full end-to-end finetuning example script in our [Lit-GPT repository](https://github.com/Lightning-AI/lit-gpt/blob/main/xla/finetune/adapter.py). The new XLA-FSDP strategy is experimental and currently only available in Fabric. Support in the Trainer will follow in the future.

<a name="highlights-fabric-checkpoints"></a>
Granular Control Over Checkpoints in Fabric

Several improvements for checkpoint saving and loading have landed in Fabric, enabling more fine-grained control over what is saved/loaded while reducing boilerplate code:

1. There is a new `Fabric.load_raw()` method with which you can load model- or optimizer state-dicts saved externally by a non-Fabric application (e.g., raw PyTorch) ([18049](https://github.com/Lightning-AI/lightning/pull/18049))

python
import lightning as L

fabric = L.Fabric()
model = MyModel()

A model weights file saved by your friend who doesn't use Fabric
fabric.load_raw("path/to/model.pt", model)

Equivalent to this:
model.load_state_dict(torch.load("path/to/model.pt"))

2. A new parameter `Fabric.load(..., strict=True|False)` to disable strict loading ([17645](https://github.com/Lightning-AI/lightning/pull/17645))

python
import lightning as L

fabric = L.Fabric()
model = MyModel()
state = {"model": model}

strict loading is the default
fabric.load("path/to/checkpoint.ckpt", state, strict=True)

disable strict loading
fabric.load("path/to/checkpoint.ckpt", state, strict=False)

3. A new parameter `Fabric.save(..., filter=...)` that enables you to exclude certain parameters of your model without writing boilerplate code for it ([17845](https://github.com/Lightning-AI/lightning/pull/17845))

python
import lightning as L

fabric = L.Fabric()
model, optimizer = ...

state = {"model": model, "optimizer": optimizer, "foo": 123}

save only the weights that match a pattern
filter = {"model": lambda k, v: "weight" in k}
fabric.save("path/to/checkpoint.ckpt", state, filter=filter)

You can read more about the new options in our [checkpoint guide](https://lightning.ai/docs/fabric/latest/guide/checkpoint.html).

<a name="bc-changes"></a>
Backward Incompatible Changes

The release of PyTorch Lightning 2.0 was a big step into a new chapter: It brought a more polished API and removed a lot of legacy code and outdated as well as experimental features, at the cost of a long list of breaking changes resulting in more work needed than usual to upgrade from 1.9 to 2.0. Moving forward, we promised to maintain full backward compatibility of our public core APIs to guarantee a smooth upgrade experience for everyone, and with 2.1 we are happy to deliver on this promise. A few exceptions were made in places where the change was justified if it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.

<a name="bc-changes-pytorch"></a>
PyTorch Lightning

TPU/XLA Changes

When selecting device indices via `devices=[i]`, the Trainer now selects the i-th TPU core (0-based, previously it was 1-based) ([17227](https://github.com/Lightning-AI/lightning/pull/17227))

**Before:**
python
Selects the first TPU core (1-based index)
trainer = Trainer(accelerator="tpu", devices=[1])

**Now:**
python
Selects the second TPU core (0-based index)
trainer = Trainer(accelerator="tpu", devices=[1])

Multi-GPU in Jupyter Notebooks

Due to lack of reliability, Trainer now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([18291](https://github.com/Lightning-AI/lightning/pull/18291))

**Before:**
python
import lightning as L

In Jupyter notebooks, this would select all available GPUs (DDP)
trainer = L.Trainer(accelerator="cuda", devices="auto")

**Now:**
python
In Jupyter notebooks, this now selects only one GPU (the first)
trainer = L.Trainer(accelerator="cuda", devices="auto")

You can still explicitly select multiple
trainer = L.Trainer(accelerator="cuda", devices=8)

Device Access in Setup Hook

- During `LightningModule.setup()`, the `self.device` now returns the device the module *will be placed on* instead of `cpu` ([18021](https://github.com/Lightning-AI/lightning/pull/18021))

**Before:**
python
def setup(self, stage):
CPU regardless of the accelerator used
print(self.device)

**Now:**
python
def setup(self, stage):
CPU/CUDA/MPS/XLA depending on accelerator
print(self.device)

Miscellaneous Changes

- `self.log`ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations ([17334](https://github.com/Lightning-AI/lightning/pull/17334))
- The `FSDPStrategy` now loads checkpoints after the `configure_model`/`configure_sharded_model` hook ([18358](https://github.com/Lightning-AI/lightning/pull/18358))
- The `FSDPStrategy.load_optimizer_state_dict` and `FSDPStrategy.load_model_state_dict` are a no-op now ([18358](https://github.com/Lightning-AI/lightning/pull/18358))
- Removed experimental support for `torchdistx` due to a lack of project maintenance ([17995](https://github.com/Lightning-AI/lightning/pull/17995))
- Dropped support for PyTorch 1.11 ([18691](https://github.com/Lightning-AI/lightning/pull/18691))

<a name="bc-changes-fabric"></a>
Lightning Fabric

We thank the community for the amazing feedback we got for [Fabric](https://lightning.ai/docs/fabric/stable/) so far - keep it coming. The list of breaking changes is short and won't affect the vast majority of users.

Sharding Context Manager in Fabric.run()

We removed automatic sharding support with `Fabric.run` or using `fabric.launch(fn)`. This only impacts FSDP and DeepSpeed strategy users who use this way of launching. Please note that `Fabric.run` is a legacy construct from the `LightningLite` days, and is not recommended today. Please instantiate your large FSDP or DeepSpeed model under the newly added `fabric.init_module` context manager ([17832](https://github.com/Lightning-AI/lightning/pull/17832)).

**Before:**
python
import lightning as L

def train(fabric):
FSDP's `enable_wrap` context or `deepspeed.zero.Init()`
were applied automaticaly here
model = LargeModel()
...

fabric = L.Fabric()
fabric.launch(train)

**Now:**
python
def train(fabric):
Use `init_module` explicitly to apply these context managers
with fabric.init_module():
model = LargeModel()
...

Multi-GPU in Jupyter Notebooks

Due to lack of reliability, Fabric now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([18291](https://github.com/Lightning-AI/lightning/pull/18291))

**Before:**
python
import lightning as L

In Jupyter notebooks, this would select all available GPUs (DDP)
fabric = L.Fabric(accelerator="cuda", devices="auto")

**Now:**
python
In Jupyter notebooks, this now selects only one GPU (the first)
fabric = L.Fabric(accelerator="cuda", devices="auto")

You can still explicitly select multiple
fabric = L.Fabric(accelerator="cuda", devices=8)

<a name="changelog"></a>
CHANGELOG

<a name="changelog-pytorch"></a>
PyTorch Lightning

<details><summary>Added</summary>

- Added `metrics_format` attribute to `RichProgressBarTheme` class ([18373](https://github.com/Lightning-AI/lightning/pull/18373))
- Added `CHECKPOINT_EQUALS_CHAR` attribute to `ModelCheckpoint` class ([17999](https://github.com/Lightning-AI/lightning/pull/17999))
- Added `**summarize_kwargs` to `ModelSummary` and `RichModelSummary` callbacks ([16788](https://github.com/Lightning-AI/lightning/pull/16788))
- Added support for the `max_size_cycle|max_size|min_size` iteration modes during evaluation ([17163](https://github.com/Lightning-AI/lightning/pull/17163))
- Added support for the TPU-v4 architecture ([17227](https://github.com/Lightning-AI/lightning/pull/17227))
- Added support for XLA's new PJRT runtime ([17352](https://github.com/Lightning-AI/lightning/pull/17352))
- Check for invalid TPU device inputs ([17227](https://github.com/Lightning-AI/lightning/pull/17227))
- Added `XLAStrategy(sync_module_states=bool)` to control whether to broadcast the parameters to all devices ([17522](https://github.com/Lightning-AI/lightning/pull/17522))
- Added support for multiple optimizer parameter groups when using the FSDP strategy ([17309](https://github.com/Lightning-AI/lightning/pull/17309))
- Enabled saving the full model state dict when using the `FSDPStrategy` ([16558](https://github.com/Lightning-AI/lightning/pull/16558))
- Update `LightningDataModule.from_datasets` to support arbitrary iterables ([17402](https://github.com/Lightning-AI/lightning/pull/17402))
- Run the DDP wrapper in a CUDA stream ([17334](https://github.com/Lightning-AI/lightning/pull/17334))
- Added `SaveConfigCallback.save_config` to ease use cases such as saving the config to a logger ([17475](https://github.com/Lightning-AI/lightning/pull/17475))
- Enabled optional file versioning of model checkpoints ([17320](https://github.com/Lightning-AI/lightning/pull/17320))
- Added the process group timeout argument `FSDPStrategy(timeout=...)` for the FSDP strategy ([17274](https://github.com/Lightning-AI/lightning/pull/17274))
- Added `FSDPStrategy(activation_checkpointing_policy=...)` to customize the layer policy for automatic activation checkpointing (requires torch>=2.1) ([18045](https://github.com/Lightning-AI/lightning/pull/18045))
- Added CLI option `--map-to-cpu` to the checkpoint upgrade script to enable converting GPU checkpoints on a CPU-only machine ([17527](https://github.com/Lightning-AI/lightning/pull/17527))
- Added non-layer param count to the model summary ([17005](https://github.com/Lightning-AI/lightning/pull/17005))
- Updated `LearningRateMonitor` to log monitored values to `trainer.callback_metrics` ([17626](https://github.com/Lightning-AI/lightning/pull/17626))
- Added `log_weight_decay` argument to `LearningRateMonitor` callback ([18439](https://github.com/Lightning-AI/lightning/pull/18439))
- Added `Trainer.print()` to print on local rank zero only ([17980](https://github.com/Lightning-AI/lightning/pull/17980))
- Added `Trainer.init_module()` context manager to instantiate large models efficiently directly on device, dtype ([18004](https://github.com/Lightning-AI/lightning/pull/18004))
* Creates the model parameters in the desired dtype (`torch.float32`, `torch.float64`) depending on the 'true' precision choice in `Trainer(precision='32-true'|'64-true')`
- Added the `LightningModule.configure_model()` hook to instantiate large models efficiently directly on device, dtype, and with sharding support ([18004](https://github.com/Lightning-AI/lightning/pull/18004))
* Handles initialization for FSDP models before wrapping and the Zero stage 3 initialization for DeepSpeed before sharding
- Added support for meta-device initialization with `Trainer.init_module(empty_init=True)` in FSDP ([18385](https://github.com/Lightning-AI/lightning/pull/18385))
- Added `lightning.pytorch.plugins.PrecisionPlugin.module_init_context()` and `lightning.pytorch.strategies.Strategy.tensor_init_context()` context managers to control model and tensor instantiation ([18004](https://github.com/Lightning-AI/lightning/pull/18004))
- Automatically call `xla_model.mark_step()` before saving checkpoints with XLA ([17882](https://github.com/Lightning-AI/lightning/pull/17882))
- Added a callback for spike-detection ([18014](https://github.com/Lightning-AI/lightning/pull/18014))
- Added the ability to set the `torch.distributed.fsdp.ShardingStrategy` via string in `FSDPStrategy` ([18087](https://github.com/Lightning-AI/lightning/pull/18087))
- Improved error messages when attempting to load a DeepSpeed checkpoint at an invalid path ([17795](https://github.com/Lightning-AI/lightning/pull/17795))
- Allowed accessing rank information in the main process before processes are launched when using the `XLAStrategy` ([18194](https://github.com/Lightning-AI/lightning/pull/18194))
- Added support for true half-precision training via `Trainer(precision="16-true"|"bf16-true")` ([18193](https://github.com/Lightning-AI/lightning/pull/18193), [#18217](https://github.com/Lightning-AI/lightning/pull/18217), [#18213](https://github.com/Lightning-AI/lightning/pull/18213), [#18219](https://github.com/Lightning-AI/lightning/pull/18219))
- Added automatic process cleanup to avoid zombie child processes and stalls when exceptions are raised ([18218](https://github.com/Lightning-AI/lightning/pull/18218))
- Added validation of user input for `devices` and `num_nodes` when running with `SLURM` or `TorchElastic` ([18292](https://github.com/Lightning-AI/lightning/pull/18292))
- Added support for saving checkpoints with either full state-dict or sharded state dict via `FSDPStrategy(state_dict_type="full"|"sharded")` ([18364](https://github.com/Lightning-AI/lightning/pull/18364))
- Added support for loading sharded/distributed checkpoints in FSDP ([18358](https://github.com/Lightning-AI/lightning/pull/18358))
- Made the text delimiter in the rich progress bar configurable ([18372](https://github.com/Lightning-AI/lightning/pull/18372))
- Improved the error messaging and instructions when handling custom batch samplers in distributed settings ([18402](https://github.com/Lightning-AI/lightning/pull/18402))
- Added support for mixed 8-bit precision as `Trainer(precision="transformer-engine")` using [Nvidia's Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine) ([#18459](https://github.com/Lightning-AI/lightning/pull/18459))
- Added support for linear layer quantization with `Trainer(plugins=BitsandbytesPrecision())` using [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) ([#18655](https://github.com/Lightning-AI/lightning/pull/18655))
- Added support for passing the process group to the `FSDPStrategy` ([18583](https://github.com/Lightning-AI/lightning/pull/18583))
- Enabled the default process group configuration for FSDP's hybrid sharding ([18583](https://github.com/Lightning-AI/lightning/pull/18583))
- Added `lightning.pytorch.utilities.suggested_max_num_workers` to assist with setting a good value in distributed settings ([18591](https://github.com/Lightning-AI/lightning/pull/18591))
- Improved the `num_workers` warning to give a more accurate upper limit on the `num_workers` suggestion ([18591](https://github.com/Lightning-AI/lightning/pull/18591))
- Added `lightning.pytorch.utilities.is_shared_filesystem` utility function to automatically check whether the filesystem is shared between machines ([18586](https://github.com/Lightning-AI/lightning/pull/18586))
- Added support for returning an object of type `Mapping` from `LightningModule.training_step()` ([18657](https://github.com/Lightning-AI/lightning/pull/18657))
- Added the hook `LightningModule.on_validation_model_zero_grad()` to allow overriding the behavior of zeroing the gradients before entering the validation loop ([18710](https://github.com/Lightning-AI/lightning/pull/18710))

</details>

<details><summary>Changed</summary>

- Changed default metric formatting from `round(..., 3)` to `".3f"` format string in `MetricsTextColumn` class ([18483](https://github.com/Lightning-AI/lightning/pull/18483))
- Removed the limitation to call `self.trainer.model.parameters()` in `LightningModule.configure_optimizers()` ([17309](https://github.com/Lightning-AI/lightning/pull/17309))
- `Trainer(accelerator="tpu", devices=[i])"` now selects the i-th TPU core (0-based, previously it was 1-based) ([17227](https://github.com/Lightning-AI/lightning/pull/17227))
- Allow using iterable-style datasets with TPUs ([17331](https://github.com/Lightning-AI/lightning/pull/17331))
- Increased the minimum XLA requirement to 1.13 ([17368](https://github.com/Lightning-AI/lightning/pull/17368))
- `self.log`ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations ([17334](https://github.com/Lightning-AI/lightning/pull/17334))
- Made the run initialization in `WandbLogger` lazy to avoid creating artifacts when the CLI is used ([17573](https://github.com/Lightning-AI/lightning/pull/17573))
- Simplified redirection of `*_step` methods in strategies by removing the `_LightningModuleWrapperBase` wrapper module ([17531](https://github.com/Lightning-AI/lightning/pull/17531))
- Support kwargs input for LayerSummary ([17709](https://github.com/Lightning-AI/lightning/pull/17709))
- Dropped support for `wandb` versions older than 0.12.0 in `WandbLogger` ([17876](https://github.com/Lightning-AI/lightning/pull/17876))
- During `LightningModule.setup()`, the `self.device` now returns the device the module will be placed on instead of `cpu` ([18021](https://github.com/Lightning-AI/lightning/pull/18021))
- Increased the minimum supported `wandb` version for `WandbLogger` from 0.12.0 to 0.12.10 ([18171](https://github.com/Lightning-AI/lightning/pull/18171))
- The input tensors now get cast to the right precision type before transfer to the device ([18264](https://github.com/Lightning-AI/lightning/pull/18264))
- Improved the formatting of emitted warnings ([18288](https://github.com/Lightning-AI/lightning/pull/18288))
- Broadcast and reduction of tensors with XLA-based strategies now preserve the input's device ([18275](https://github.com/Lightning-AI/lightning/pull/18275))
- The `FSDPStrategy` now loads checkpoints after the `configure_model`/`configure_sharded_model` hook ([18358](https://github.com/Lightning-AI/lightning/pull/18358))
- The `FSDPStrategy.load_optimizer_state_dict` and `FSDPStrategy.load_model_state_dict` are a no-op now ([18358](https://github.com/Lightning-AI/lightning/pull/18358))
- The `Trainer.num_val_batches`, `Trainer.num_test_batches` and `Trainer.num_sanity_val_batches` now return a list of sizes per dataloader instead of a single integer ([18441](https://github.com/Lightning-AI/lightning/pull/18441))
- The `*_step(dataloader_iter)` flavor now no longer takes the `batch_idx` in the signature ([18390](https://github.com/Lightning-AI/lightning/pull/18390))
- Calling `next(dataloader_iter)` now returns a triplet `(batch, batch_idx, dataloader_idx)` ([18390](https://github.com/Lightning-AI/lightning/pull/18390))
- Calling `next(combined_loader)` now returns a triplet `(batch, batch_idx, dataloader_idx)` ([18390](https://github.com/Lightning-AI/lightning/pull/18390))
- Due to lack of reliability, Trainer now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([18291](https://github.com/Lightning-AI/lightning/pull/18291))
- Made the `batch_idx` argument optional in `validation_step`, `test_step` and `predict_step` to maintain consistency with `training_step` ([18512](https://github.com/Lightning-AI/lightning/pull/18512))
- The `TQDMProgressBar` now consistently shows it/s for the speed even when the iteration time becomes larger than one second ([18593](https://github.com/Lightning-AI/lightning/pull/18593))
- The `LightningDataModule.load_from_checkpoint` and `LightningModule.load_from_checkpoint` methods now raise an error if they are called on an instance instead of the class ([18432](https://github.com/Lightning-AI/lightning/pull/18432))
- Enabled launching via `torchrun` in a SLURM environment; the `TorchElasticEnvironment` now gets chosen over the `SLURMEnvironment` if both are detected ([18618](https://github.com/Lightning-AI/lightning/pull/18618))
- If not set by the user, Lightning will set `OMP_NUM_THREADS` to `num_cpus / num_processes` when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks ([18677](https://github.com/Lightning-AI/lightning/pull/18677))
- The `ModelCheckpoint` no longer deletes files under the save-top-k mechanism when resuming from a folder that is not the same as the current checkpoint folder ([18750](https://github.com/Lightning-AI/lightning/pull/18750))
- The `ModelCheckpoint` no longer deletes the file that was passed to `Trainer.fit(ckpt_path=...)` ([18750](https://github.com/Lightning-AI/lightning/pull/18750))
- Calling `trainer.fit()` twice now raises an error with strategies that spawn subprocesses through `multiprocessing` (ddp_spawn, xla) ([18776](https://github.com/Lightning-AI/lightning/pull/18776))
- The `ModelCheckpoint` now saves a symbolic link if `save_last=True` and `save_top_k != 0` ([18748](https://github.com/Lightning-AI/lightning/pull/18748))

</details>

<details><summary>Deprecated</summary>

- Deprecated the `SingleTPUStrategy` (`strategy="single_tpu"`) in favor of `SingleDeviceXLAStrategy` (`strategy="single_xla"`) ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `TPUAccelerator` in favor of `XLAAccelerator` ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `TPUPrecisionPlugin` in favor of `XLAPrecisionPlugin` ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `TPUBf16PrecisionPlugin` in favor of `XLABf16PrecisionPlugin` ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `Strategy.post_training_step` method ([17531](https://github.com/Lightning-AI/lightning/pull/17531))
- Deprecated the `LightningModule.configure_sharded_model` hook in favor of `LightningModule.configure_model` ([18004](https://github.com/Lightning-AI/lightning/pull/18004))
- Deprecated the `LightningDoublePrecisionModule` wrapper in favor of calling `Trainer.precision_plugin.convert_input()` ([18209](https://github.com/Lightning-AI/lightning/pull/18209))

</details>

<details><summary>Removed</summary>

- Removed the `XLAStrategy.is_distributed` property. It is always True ([17381](https://github.com/Lightning-AI/lightning/pull/17381))
- Removed the `SingleTPUStrategy.is_distributed` property. It is always False ([17381](https://github.com/Lightning-AI/lightning/pull/17381))
- Removed experimental support for `torchdistx` due to a lack of project maintenance ([17995](https://github.com/Lightning-AI/lightning/pull/17995))
- Removed support for PyTorch 1.11 ([18691](https://github.com/Lightning-AI/lightning/pull/18691))

</details>

<details><summary>Fixed</summary>

- Fixed an issue with reusing the same model across multiple trainer stages when using the `DeepSpeedStrategy` ([17531](https://github.com/Lightning-AI/lightning/pull/17531))
- Fixed the saving and loading of FSDP optimizer states ([17819](https://github.com/Lightning-AI/lightning/pull/17819))
- Fixed FSDP re-applying activation checkpointing when the user had manually applied it already ([18006](https://github.com/Lightning-AI/lightning/pull/18006))
- Fixed issue where unexpected exceptions would leave the default torch dtype modified when using true precision settings ([18500](https://github.com/Lightning-AI/lightning/pull/18500))
- Fixed issue where not including the `batch_idx` argument in the `training_step` would disable gradient accumulation ([18619](https://github.com/Lightning-AI/lightning/pull/18619))
- Fixed the replacement of callbacks returned in `LightningModule.configure_callbacks` when the callback was a subclass of an existing Trainer callback ([18508](https://github.com/Lightning-AI/lightning/pull/18508))
- Fixed `Trainer.log_dir` not returning the correct directory for the `CSVLogger` ([18548](https://github.com/Lightning-AI/lightning/pull/18548))
- Fixed redundant input-type casting in FSDP precision ([18630](https://github.com/Lightning-AI/lightning/pull/18630))
- Fixed numerical issues when reducing values in low precision with `self.log` ([18686](https://github.com/Lightning-AI/lightning/pull/18686))
- Fixed an issue that would cause the gradients to be erased if validation happened in the middle of a gradient accumulation phase ([18710](https://github.com/Lightning-AI/lightning/pull/18710))
- Fixed redundant file writes in `CSVLogger` ([18567](https://github.com/Lightning-AI/lightning/pull/18567))
- Fixed an issue that could lead to checkpoint files being deleted accidentally when resuming training ([18750](https://github.com/Lightning-AI/lightning/pull/18750))

</details>

<a name="changelog-fabric"></a>
Lightning Fabric

<details><summary>Added</summary>

- Added support for the TPU-v4 architecture ([17227](https://github.com/Lightning-AI/lightning/pull/17227))
- Added support for XLA's new PJRT runtime ([17352](https://github.com/Lightning-AI/lightning/pull/17352))
- Added support for Fully Sharded Data Parallel (FSDP) training with XLA ([18126](https://github.com/Lightning-AI/lightning/pull/18126), [#18424](https://github.com/Lightning-AI/lightning/pull/18424), [#18430](https://github.com/Lightning-AI/lightning/pull/18430))
- Check for invalid TPU device inputs ([17227](https://github.com/Lightning-AI/lightning/pull/17227))
- Added `XLAStrategy(sync_module_states=bool)` to control whether to broadcast the parameters to all devices ([17522](https://github.com/Lightning-AI/lightning/pull/17522))
- Added support for joint setup of model and optimizer with FSDP ([17305](https://github.com/Lightning-AI/lightning/pull/17305))
- Added support for handling multiple parameter groups in optimizers set up with FSDP ([17305](https://github.com/Lightning-AI/lightning/pull/17305))
- Added support for saving and loading sharded model and optimizer state with `FSDPStrategy` ([17323](https://github.com/Lightning-AI/lightning/pull/17323))
- Added a warning when calling methods on `_FabricModule` that bypass the strategy-specific wrappers ([17424](https://github.com/Lightning-AI/lightning/pull/17424))
- Added `Fabric.init_tensor()` context manager to instantiate tensors efficiently directly on device and dtype ([17488](https://github.com/Lightning-AI/lightning/pull/17488))
- Added `Fabric.init_module()` context manager to instantiate large models efficiently directly on device, dtype, and with sharding support ([17462](https://github.com/Lightning-AI/lightning/pull/17462))
* Creates the model parameters in the desired dtype (`torch.float32`, `torch.float64`, `torch.float16`, or `torch.bfloat16`) depending on the 'true' precision choice in `Fabric(precision='32-true'|'64-true'|'16-true'|'bf16-true')`
* Handles initialization for FSDP models before wrapping and the Zero stage 3 initialization for DeepSpeed before sharding
- Added support for empty weight initialization with `Fabric.init_module(empty_init=True)` for checkpoint loading ([17627](https://github.com/Lightning-AI/lightning/pull/17627))
- Added support for meta-device initialization with `Fabric.init_module(empty_init=True)` in FSDP ([18122](https://github.com/Lightning-AI/lightning/pull/18122))
- Added `lightning.fabric.plugins.Precision.module_init_context()` and `lightning.fabric.strategies.Strategy.module_init_context()` context managers to control model and tensor instantiation ([17462](https://github.com/Lightning-AI/lightning/pull/17462))
- `lightning.fabric.strategies.Strategy.tensor_init_context()` context manager to instantiate tensors efficiently directly on device and dtype ([17607](https://github.com/Lightning-AI/lightning/pull/17607))
- Run the DDP wrapper in a CUDA stream ([17334](https://github.com/Lightning-AI/lightning/pull/17334))
- Added support for true half-precision as `Fabric(precision="16-true"|"bf16-true")` ([17287](https://github.com/Lightning-AI/lightning/pull/17287))
- Added support for mixed 8-bit precision as `Fabric(precision="transformer-engine")` using [Nvidia's Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine) ([#17597](https://github.com/Lightning-AI/lightning/pull/17597))
- Added support for linear layer quantization with `Fabric(plugins=BitsandbytesPrecision())` using [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) ([#18655](https://github.com/Lightning-AI/lightning/pull/18655))
- Added error messaging for missed `.launch()` when it is required ([17570](https://github.com/Lightning-AI/lightning/pull/17570))
- Added support for saving checkpoints with either full state-dict or sharded state dict via `FSDPStrategy(state_dict_type="full"|"sharded")` ([17526](https://github.com/Lightning-AI/lightning/pull/17526))
- Added support for loading a full-state checkpoint file into a sharded model ([17623](https://github.com/Lightning-AI/lightning/pull/17623))
- Added support for calling hooks on a LightningModule via `Fabric.call` ([17874](https://github.com/Lightning-AI/lightning/pull/17874))
- Added the parameter `Fabric.load(..., strict=True|False)` to enable non-strict loading of partial checkpoint state ([17645](https://github.com/Lightning-AI/lightning/pull/17645))
- Added the parameter `Fabric.save(..., filter=...)` to enable saving a partial checkpoint state ([17845](https://github.com/Lightning-AI/lightning/pull/17845))
- Added support for loading optimizer states from a full-state checkpoint file ([17747](https://github.com/Lightning-AI/lightning/pull/17747))
- Automatically call `xla_model.mark_step()` before saving checkpoints with XLA ([17882](https://github.com/Lightning-AI/lightning/pull/17882))
- Automatically call `xla_model.mark_step()` after `optimizer.step()` with XLA ([17883](https://github.com/Lightning-AI/lightning/pull/17883))
- Added support for all half-precision modes in FSDP precision plugin ([17807](https://github.com/Lightning-AI/lightning/pull/17807))
- Added `FSDPStrategy(activation_checkpointing_policy=...)` to customize the layer policy for automatic activation checkpointing (requires torch>=2.1) ([18045](https://github.com/Lightning-AI/lightning/pull/18045))
- Added a callback for spike-detection ([18014](https://github.com/Lightning-AI/lightning/pull/18014))
- Added the ability to set the `torch.distributed.fsdp.ShardingStrategy` via string in `FSDPStrategy` ([18087](https://github.com/Lightning-AI/lightning/pull/18087))
- Improved error messages when attempting to load a DeepSpeed checkpoint at an invalid path ([17795](https://github.com/Lightning-AI/lightning/pull/17795))
- Added `Fabric.load_raw()` for loading raw PyTorch state dict checkpoints for model or optimizer objects ([18049](https://github.com/Lightning-AI/lightning/pull/18049))
- Allowed accessing rank information in the main process before processes are launched when using the `XLAStrategy` ([18194](https://github.com/Lightning-AI/lightning/pull/18194))
- Added automatic process cleanup to avoid zombie child processes and stalls when exceptions are raised ([18218](https://github.com/Lightning-AI/lightning/pull/18218))
- Added validation of user input for `devices` and `num_nodes` when running with `SLURM` or `TorchElastic` ([18292](https://github.com/Lightning-AI/lightning/pull/18292))
- Improved the error messaging and instructions when handling custom batch samplers in distributed settings ([18402](https://github.com/Lightning-AI/lightning/pull/18402))
- Added support for saving and loading stateful objects other than modules and optimizers ([18513](https://github.com/Lightning-AI/lightning/pull/18513))
- Enabled the default process group configuration for FSDP's hybrid sharding ([18583](https://github.com/Lightning-AI/lightning/pull/18583))
- Added `lightning.fabric.utilities.suggested_max_num_workers` to assist with setting a good value in distributed settings ([18591](https://github.com/Lightning-AI/lightning/pull/18591))
- Added `lightning.fabric.utilities.is_shared_filesystem` utility function to automatically check whether the filesystem is shared between machines ([18586](https://github.com/Lightning-AI/lightning/pull/18586))
- Removed support for PyTorch 1.11 ([18691](https://github.com/Lightning-AI/lightning/pull/18691))
- Added support for passing the argument `.load_state_dict(..., assign=True|False)` on Fabric-wrapped modules in PyTorch 2.1 or newer ([18690](https://github.com/Lightning-AI/lightning/pull/18690))

</details>

<details><summary>Changed</summary>

- Allow using iterable-style datasets with TPUs ([17331](https://github.com/Lightning-AI/lightning/pull/17331))
- Increased the minimum XLA requirement to 1.13 ([17368](https://github.com/Lightning-AI/lightning/pull/17368))
- Fabric argument validation now only raises an error if conflicting settings are set through the CLI ([17679](https://github.com/Lightning-AI/lightning/pull/17679))
- DataLoader re-instantiation is now only performed when a distributed sampler is required ([18191](https://github.com/Lightning-AI/lightning/pull/18191))
- Improved the formatting of emitted warnings ([18288](https://github.com/Lightning-AI/lightning/pull/18288))
- Broadcast and reduction of tensors with XLA-based strategies now preserve the input's device ([18275](https://github.com/Lightning-AI/lightning/pull/18275))
- Due to lack of reliability, Fabric now only runs on one GPU instead of all GPUs in a Jupyter notebook if `devices="auto"` (default) ([18291](https://github.com/Lightning-AI/lightning/pull/18291))
- Enabled launching via `torchrun` in a SLURM environment; the `TorchElasticEnvironment` now gets chosen over the `SLURMEnvironment` if both are detected ([18618](https://github.com/Lightning-AI/lightning/pull/18618))
- If not set by the user, Lightning will set `OMP_NUM_THREADS` to `num_cpus / num_processes` when launching subprocesses (e.g. when DDP is used) to avoid system overload for CPU-intensive tasks ([18677](https://github.com/Lightning-AI/lightning/pull/18677))

</details>

<details><summary>Deprecated</summary>

- Deprecated the `DDPStrategy.is_distributed` property. This strategy is distributed by definition ([17381](https://github.com/Lightning-AI/lightning/pull/17381))
- Deprecated the `SingleTPUStrategy` (`strategy="single_tpu"`) in favor of `SingleDeviceXLAStrategy` (`strategy="single_xla"`) ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `TPUAccelerator` in favor of `XLAAccelerator` ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `TPUPrecision` in favor of `XLAPrecision` ([17383](https://github.com/Lightning-AI/lightning/pull/17383))
- Deprecated the `TPUBf16Precision` in favor of `XLABf16Precision` ([17383](https://github.com/Lightning-AI/lightning/pull/17383))

</details>

<details><summary>Removed</summary>

- Removed automatic sharding support with `Fabric.run` or using `fabric.launch(fn)`. This only impacts FSDP and DeepSpeed strategy users. Please instantiate your module under the newly added `fabric.init_module` context manager ([17832](https://github.com/Lightning-AI/lightning/pull/17832))
- Removed the unsupported `checkpoint_io` argument from the `FSDPStrategy` ([18192](https://github.com/Lightning-AI/lightning/pull/18192))

</details>

<details><summary>Fixed</summary>

- Fixed issue where running on TPUs would select the wrong device index ([17227](https://github.com/Lightning-AI/lightning/pull/17227))
- Removed the need to call `.launch()` when using the DP-strategy (`strategy="dp"`) ([17931](https://github.com/Lightning-AI/lightning/pull/17931))
- Fixed FSDP re-applying activation checkpointing when the user had manually applied it already ([18006](https://github.com/Lightning-AI/lightning/pull/18006))
- Fixed FSDP re-wrapping the module root when the user had manually wrapped the model ([18054](https://github.com/Lightning-AI/lightning/pull/18054))
- Fixed issue where unexpected exceptions would leave the default torch dtype modified when using true precision settings ([18500](https://github.com/Lightning-AI/lightning/pull/18500))
- Fixed redundant input-type casting in FSDP precision ([18630](https://github.com/Lightning-AI/lightning/pull/18630))
- Fixed an issue with `find_usable_cuda_devices(0)` incorrectly returning a list of devices ([18722](https://github.com/Lightning-AI/lightning/pull/18722))
- Fixed redundant file writes in `CSVLogger` ([18567](https://github.com/Lightning-AI/lightning/pull/18567))

</details>

<a name="changelog-app"></a>
Lightning App

<details><summary>Added</summary>

- Allow customizing `gradio` components with lightning colors ([17054](https://github.com/Lightning-AI/lightning/pull/17054))

</details>

<details><summary>Changed</summary>

- Changed `LocalSourceCodeDir` cache_location to not use home in some certain cases ([17491](https://github.com/Lightning-AI/lightning/pull/17491))

</details>

<details><summary>Removed</summary>

- Remove cluster commands from the CLI ([18151](https://github.com/Lightning-AI/lightning/pull/18151))

</details>

**Full commit list**: https://github.com/Lightning-AI/lightning/compare/2.0.0...2.1.0

<a name="contributors"></a>
Contributors

Veteran

adamjstewart akreuzer ethanwharris dmitsf lantiga nicolai86 pl-ghost carmocca awaelchli justusschock edenlightning belerico lightningforever nisheethlahoti tchaton yurijmikhalevich mauvilsa rlizzo rusmux yhl48 Liyang90 jerome-habana JustinGoheen Borda speediedan SkafteNicki dcfidalgo

New

saryazdi parambharat kshitij12345 woqidaideshi colehawkins md-121 gkroiz idc9 BoringDonut OmerShubi ishandutta0098 ryan597 leng-yue alicanb One-sixth santurini SpirinEgor KogaiIrina shanmugamr1992 janeyx99 asmith26 dingusagar AleksanderWWW strawberrypie solyaH kaczmarj voidful water-vapor bkiat1123 rhiga2 baskrahmer felipewhitaker mukhery Quasar-Kim robieta one-matrix jere357 schmidt-ai schuhschuh anio rjarun8 callumhay minhlong94 klieret giorgioskij shihaoyin JonathanRayner NripeshN marcimarc1 bilelomrani1 NikolasWolke 0x404 quintenroets Borodin amorehead SebastianGer ioangatop Tribhuvan0 f0k sameertantry kwsp nik777 matsumotosan

Did you know?

When Chuck Norris trains a neural network, it not only learns, but it also gains the ability to defend itself from adversarial attacks by roundhouse kicking them into submission.

Page 4 of 32

Releases

Has known vulnerabilities

Previous Next

Lightning

Page 4 of 32

2.2.0.rc0

2.1.4

2.1.3

2.1.2

2.1.1

2.1.0

Page 4 of 32

Links

Releases