A module in PyTorch is always either in `train` (default) or `eval` mode.
This improvement gives users more visibility into the state of their model and helps debug issues, for example, when you need to make sure certain layers of the model are frozen.
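As a minimal sketch of what the per-layer mode reflects (the `LitModel` class and its layers are invented for this example):

```python
import torch.nn as nn
import lightning as L

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(32, 32)  # pretend this is a frozen feature extractor
        self.head = nn.Linear(32, 2)

model = LitModel()
model.backbone.eval()  # put only the backbone in eval mode

# Each module tracks its own mode via the `training` flag, which the
# model summary now surfaces per layer in the "Mode" column.
print(model.backbone.training)  # False -> shown as "eval"
print(model.head.training)      # True  -> shown as "train"
```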
<a name="highlights-forward-methods"></a>
### Special Forward Methods in Fabric
Until now, Lightning Fabric warned the user if the forward pass of the model, or of a subset of its modules, ran through a method other than the dedicated `forward` method of the PyTorch module. The reason is that PyTorch needs to run special hooks for DDP/FSDP and other strategies to function properly, and bypassing the real `forward` method skips these hooks and leads to correctness issues.
In Lightning Fabric 2.3, we added a [feature to explicitly mark alternative forward methods](https://lightning.ai/docs/fabric/latest/api/wrappers.html#using-methods-other-than-forward-for-computation) so that Fabric can add the necessary rerouting behind the scenes:
```python
import lightning as L

fabric = L.Fabric(devices=2, strategy="ddp")
fabric.launch()

model = MyModel()
model = fabric.setup(model)

# OK: Calling the model directly
output = model(input)

# ERROR: Calling another method that calls forward indirectly
prediction = model.generate(input)

# New: Mark special forward methods explicitly before using them
model.mark_forward_method(model.generate)

# OK: Now can use `model.generate()` in DDP/FSDP without issues
prediction = model.generate(input)
```
Find the [full example](https://lightning.ai/docs/fabric/latest/api/wrappers.html#using-methods-other-than-forward-for-computation) and more details in our docs.
<a name="bc-changes"></a>
## Notable Changes
The 2.0 series of Lightning releases guarantees core API stability: no name changes, argument renames, hook removals, etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Below we list a few behavioral changes that we considered justified because they significantly improve the user experience, improve performance, or fix the correctness of a feature. These changes will likely not impact most users.
### Skipping the training step in DDP
It is no longer allowed to skip `training_step()` by returning `None` in distributed training ([#19918](https://github.com/Lightning-AI/pytorch-lightning/pull/19918)). The following usage was previously possible but would result in unpredictable hangs and timeouts in distributed training:
```python
def training_step(self, batch):
    loss = ...
    if loss.isnan():
        # No longer allowed in multi-GPU!
        # Raises error in Lightning >= 2.3
        return None
    return loss
```
We decided to raise an error if the user attempts to return `None` from `training_step()` when running in a multi-GPU setting.
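If you need to effectively skip the update on a bad batch, one possible workaround is to return a zero-valued loss that is still connected to all parameters, so that `backward()` runs on every rank and DDP's gradient hooks stay synchronized. This is a hedged sketch, not an official API; `compute_loss` is a hypothetical helper:

```python
import torch
import lightning as L

class LitModel(L.LightningModule):
    def training_step(self, batch):
        loss = self.compute_loss(batch)  # hypothetical helper for this sketch
        if torch.isnan(loss):
            # Instead of returning None, return a zero-valued loss that is
            # still connected to every parameter: backward() then runs on all
            # ranks with zero gradients, keeping DDP's collectives in sync.
            return sum(p.sum() for p in self.parameters()) * 0.0
        return loss
```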
### Miscellaneous Changes
- Dropped support for PyTorch 1.13 ([#19300](https://github.com/Lightning-AI/lightning/pull/19300)). With every new Lightning release, we add official support for the latest PyTorch stable version and drop the oldest version in our support window.
- The `prepare_data()` hook in `LightningModule` and `LightningDataModule` is now subject to a barrier without timeout so that long-running tasks are not interrupted ([#19448](https://github.com/Lightning-AI/lightning/pull/19448)). Similarly, the `Fabric.rank_zero_first` context manager now uses an infinite barrier ([#19448](https://github.com/Lightning-AI/lightning/pull/19448)); see the sketch after this list.
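For context, `Fabric.rank_zero_first` lets rank 0 run a block of code before the other ranks proceed, which is typically used for one-time work like downloading a dataset. A minimal sketch, where `download_dataset` is a hypothetical long-running function:

```python
import lightning as L

fabric = L.Fabric(devices=2)
fabric.launch()

# Rank 0 enters the block first (e.g., to download a dataset once);
# the other processes wait at a barrier that no longer times out,
# then run the block after rank 0 finishes.
with fabric.rank_zero_first():
    dataset = download_dataset()  # hypothetical long-running function
```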
<a name="changelog"></a>
## CHANGELOG
<a name="changelog-pytorch"></a>
### PyTorch Lightning
<details><summary>Added</summary>
- The `ModelSummary` and `RichModelSummary` callbacks now display the training mode of each layer in the column "Mode" ([#19468](https://github.com/Lightning-AI/lightning/pull/19468))
- Added `load_from_checkpoint` support for `LightningCLI` when using dependency injection ([#18105](https://github.com/Lightning-AI/lightning/pull/18105))
- Added robust timer duration parsing with an informative error message when parsing fails ([#19513](https://github.com/Lightning-AI/pytorch-lightning/pull/19513))
- Added `on_exception` hook to `LightningDataModule` ([#19601](https://github.com/Lightning-AI/pytorch-lightning/pull/19601)); see the sketch below
- Added support for PyTorch 2.3 ([#19708](https://github.com/Lightning-AI/pytorch-lightning/pull/19708))
- Added `ModelParallelStrategy` to support 2D parallelism ([#19878](https://github.com/Lightning-AI/pytorch-lightning/pull/19878), [#19888](https://github.com/Lightning-AI/pytorch-lightning/pull/19888))
- Added a call to `torch.distributed.destroy_process_group` in atexit handler if process group needs destruction ([#19931](https://github.com/Lightning-AI/pytorch-lightning/pull/19931))
- Added support for configuring hybrid-sharding by passing a tuple for the `FSDPStrategy(device_mesh=...)` argument ([#19504](https://github.com/Lightning-AI/pytorch-lightning/pull/19504))
</details>
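As a hedged illustration of the new `on_exception` hook listed above (assuming the hook receives the raised exception; the cleanup body is invented for this sketch):

```python
import lightning as L

class MyDataModule(L.LightningDataModule):
    def on_exception(self, exception: BaseException) -> None:
        # Hypothetical cleanup when an exception interrupts the run,
        # e.g., closing file handles or releasing a database connection.
        print(f"Interrupted by {type(exception).__name__}; cleaning up...")
```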
<details><summary>Changed</summary>
- The `prepare_data()` hook in `LightningModule` and `LightningDataModule` is now subject to a barrier without timeout so that long-running tasks are not interrupted ([#19448](https://github.com/Lightning-AI/lightning/pull/19448))
- Relaxed the requirement for custom batch samplers to expose `drop_last` for prediction ([#19678](https://github.com/Lightning-AI/pytorch-lightning/pull/19678))
- It is no longer allowed to skip `training_step()` by returning `None` in distributed training ([#19918](https://github.com/Lightning-AI/pytorch-lightning/pull/19918))
</details>
<details><summary>Removed</summary>
- Removed the Bagua integration (`Trainer(strategy="bagua")`) ([#19445](https://github.com/Lightning-AI/lightning/pull/19445))
- Removed support for PyTorch 1.13 ([#19706](https://github.com/Lightning-AI/lightning/pull/19706))
</details>
<details><summary>Fixed</summary>
- Fixed a matrix shape mismatch issue when running a model loaded from a quantized checkpoint (bitsandbytes) ([#19886](https://github.com/Lightning-AI/lightning/pull/19886))
- Fixed `WandbLogger.log_hyperparameters()` raising an error if hyperparameters are not JSON serializable ([#19769](https://github.com/Lightning-AI/pytorch-lightning/pull/19769))
- Fixed an issue with the LightningCLI not being able to set the `ModelCheckpoint(save_last=...)` argument ([#19808](https://github.com/Lightning-AI/pytorch-lightning/pull/19808))
- Fixed an issue causing a `ValueError` for certain objects, such as TorchMetrics, when dumping hyperparameters to YAML ([#19804](https://github.com/Lightning-AI/pytorch-lightning/pull/19804))
- Fixed resetting `epoch_loop.restarting` to avoid a full validation run after `LearningRateFinder` ([#19818](https://github.com/Lightning-AI/pytorch-lightning/issues/19818))
</details>
<a name="changelog-fabric"></a>
### Lightning Fabric
<details><summary>Added</summary>
- Added sanitization for classes before logging them as hyperparameters ([#19771](https://github.com/Lightning-AI/pytorch-lightning/pull/19771))
- Enabled consolidating distributed checkpoints through `fabric consolidate` in the new CLI ([#19560](https://github.com/Lightning-AI/pytorch-lightning/pull/19560))
- Added the ability to explicitly mark forward methods in Fabric via `_FabricModule.mark_forward_method()` ([#19690](https://github.com/Lightning-AI/pytorch-lightning/pull/19690))
- Added support for PyTorch 2.3 ([#19708](https://github.com/Lightning-AI/pytorch-lightning/pull/19708))
- Added `ModelParallelStrategy` to support 2D parallelism ([#19846](https://github.com/Lightning-AI/pytorch-lightning/pull/19846), [#19852](https://github.com/Lightning-AI/pytorch-lightning/pull/19852), [#19870](https://github.com/Lightning-AI/pytorch-lightning/pull/19870), [#19872](https://github.com/Lightning-AI/pytorch-lightning/pull/19872))
- Added a call to `torch.distributed.destroy_process_group` in atexit handler if process group needs destruction ([#19931](https://github.com/Lightning-AI/pytorch-lightning/pull/19931))
- Added support for configuring hybrid-sharding by passing a tuple for the `FSDPStrategy(device_mesh=...)` argument ([#19504](https://github.com/Lightning-AI/pytorch-lightning/pull/19504))
</details>
<details><summary>Changed</summary>
- Renamed `lightning run model` to `fabric run` ([#19442](https://github.com/Lightning-AI/pytorch-lightning/pull/19442), [#19527](https://github.com/Lightning-AI/pytorch-lightning/pull/19527))
- The `Fabric.rank_zero_first` context manager now uses a barrier without timeout so that long-running tasks are not interrupted ([#19448](https://github.com/Lightning-AI/lightning/pull/19448))
- Fabric now raises an error if you forget to call `fabric.backward()` when it is needed by the strategy or precision selection ([#19447](https://github.com/Lightning-AI/lightning/pull/19447), [#19493](https://github.com/Lightning-AI/lightning/pull/19493))
- `_BackwardSyncControl` can now control what to do when gradient accumulation is disabled ([#19577](https://github.com/Lightning-AI/lightning/pull/19577))
</details>
<details><summary>Removed</summary>
- Removed support for PyTorch 1.13 ([#19706](https://github.com/Lightning-AI/lightning/pull/19706))
</details>
<details><summary>Fixed</summary>
- Fixed a matrix shape mismatch issue when running a model loaded from a quantized checkpoint (bitsandbytes) ([#19886](https://github.com/Lightning-AI/lightning/pull/19886))
</details>
<br/>
**Full commit list**: [2.2.0 -> 2.3.0](https://github.com/Lightning-AI/lightning/compare/2.2.0...2.3.0)
<a name="contributors"></a>
## Contributors
We thank all our contributors who submitted pull requests for features, bug fixes, and documentation updates.
### New Contributors
* cauyxy made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19437
* mwip made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19518
* kylebgorman made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19513
* kashif made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19520
* ash0ts made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19451
* dimitri-voytan made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19524
* ankitgola005 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19615
* invisprints made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19629
* kvenkman made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19465
* fnhirwa made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19640
* inyong37 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19677
* clumsy made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19601
* judidoko made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19692
* Lunamos made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19701
* dominicgkerr made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19727
* daavoo made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19774
* Peiffap made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19805
* IvanYashchuk made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19926
* ringohoffman made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19904
* afspies made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19847
* fedebotu made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19822
* mariovas3 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19808
* Bhavay-2001 made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19947
* V0XNIHILI made their first contribution in https://github.com/Lightning-AI/pytorch-lightning/pull/19771
## Did you know?
Chuck Norris is a big fan and daily user of Lightning Studio.