Added
- Added `extract_batch_size` utility and corresponding tests to extract batch dimension from multiple batch types (8357)
- Added support for named parameter groups in `LearningRateMonitor` (7987)
- Added `dataclass` support for `pytorch_lightning.utilities.apply_to_collection` (7935); see the sketch after this list
- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with `fsspec` (7617)
- Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow
- Added `LightningCLI` support for config files on object stores (7521)
- Added `ModelPruning(prune_on_train_epoch_end=True|False)` to choose when to apply pruning (7704)
- Added support for checkpointing based on a provided time interval during training (7515)
- Progress tracking
* Added dataclasses for progress tracking (6603, 7574, 8140, 8362)
    * Added `{,load_}state_dict` to the progress tracking dataclasses (8140)
    * Connected the progress tracking dataclasses to the loops (8244, 8362)
    * Stopped resetting the total counters of the progress tracking dataclasses (8475)
- Added support for passing a `LightningDataModule` positionally as the second argument to `trainer.{validate,test,predict}` (7431)
- Added the `ckpt_path` argument to `trainer.predict` (7430)
- Added `clip_grad_by_value` support for TPUs (7025)
- Added support for passing any class to `is_overridden` (7918)
- Added `sub_dir` parameter to `TensorBoardLogger` (6195)
- Added correct `dataloader_idx` to batch transfer hooks (6241)
- Added `include_none=bool` argument to `apply_to_collection` (7769)
- Added `apply_to_collections` to apply a function to two zipped collections (7769)
- Added `ddp_fully_sharded` support (7487)
- Added `should_rank_save_checkpoint` property to Training Plugins (7684)
- Added `log_grad_norm` hook to `LightningModule` to customize the logging of gradient norms (7873)
- Added `save_config_filename` init argument to `LightningCLI` to ease resolving name conflicts (7741)
- Added `save_config_overwrite` init argument to `LightningCLI` to ease overwriting existing config files (8059)
- Added reset dataloader hooks to Training Plugins and Accelerators (7861)
- Added trainer stage hooks for Training Plugins and Accelerators (7864)
- Added the `on_before_optimizer_step` hook (8048)
- Added IPU Accelerator (7867)
- Fault-tolerant training
* Added `{,load_}state_dict` to `ResultCollection` (7948)
* Added `{,load_}state_dict` to `Loops` (8197)
* Set `Loop.restarting=False` at the end of the first iteration (8362)
    * Saved the loops' state with the checkpoint (opt-in) (8362)
    * Saved a checkpoint to restore the state on exception (opt-in) (8362)
    * Added `state_dict` and `load_state_dict` utilities for `CombinedLoader`, along with utilities for the dataloaders (8364)
- Added `rank_zero_only` to `LightningModule.log` function (7966)
- Added `metric_attribute` to `LightningModule.log` function (7966)
- Added a warning if `Trainer(log_every_n_steps)` is a value too high for the training dataloader (7734)
- Added `LightningCLI` support for argument links applied on instantiation (7895)
- Added `LightningCLI` support for configurable callbacks that should always be present (7964)
- Added DeepSpeed Infinity support and updated to DeepSpeed 0.4.0 (7234)
- Added support for `torch.nn.UninitializedParameter` in `ModelSummary` (7642)
- Added support for `LightningModule.save_hyperparameters` when `LightningModule` is a dataclass (7992)
- Added support for overriding `optimizer_zero_grad` and `optimizer_step` when using `accumulate_grad_batches` (7980)
- Added `logger` boolean flag to `save_hyperparameters` (7960)
- Added support for calling scripts using the module syntax (`python -m package.script`) (8073)
- Added support for optimizers and learning rate schedulers to `LightningCLI` (8093)
- Added XLA Profiler (8014)
- Added `PrecisionPlugin.{pre,post}_backward` (8328)
- Added `on_load_checkpoint` and `on_save_checkpoint` hooks to the `PrecisionPlugin` base class (7831)
- Added `max_depth` parameter in `ModelSummary` (8062)
- Added `XLAStatsMonitor` callback (8235)
- Added `restore` function and `restarting` attribute to base `Loop` (8247)
- Added `FastForwardSampler` and `CaptureIterableDataset` (8307)
- Added support for `save_hyperparameters` in `LightningDataModule` (3792)
- Added `ModelCheckpoint(save_on_train_epoch_end)` to choose when to run the saving logic (8389)
- Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` (5102)
- Added support for `accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'` (7808)
- Added `tpu_spawn_debug` to the plugin registry (7933)
- Enabled traditional/manual launching of DDP processes through `LOCAL_RANK` and `NODE_RANK` environment variable assignments (7480)
- Added `quantize_on_fit_end` argument to `QuantizationAwareTraining` (8464)
- Added experimental support for loop specialization (8226)
- Added support for `devices` flag to Trainer (8440)
- Added private `prevent_trainer_and_dataloaders_deepcopy` context manager on the `LightningModule` (8472)
- Added support for providing callables to the `LightningCLI` instead of types (8400)
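A minimal sketch of the new dataclass support in `apply_to_collection` (referenced earlier in this list); the `Batch` dataclass and the tensor shapes are illustrative assumptions, not part of the release:
```python
from dataclasses import dataclass

import torch
from pytorch_lightning.utilities import apply_to_collection


# Hypothetical dataclass, used only for illustration.
@dataclass
class Batch:
    images: torch.Tensor
    targets: torch.Tensor


batch = Batch(images=torch.rand(4, 3, 32, 32), targets=torch.zeros(4))

# `apply_to_collection` now also recurses into dataclass fields, applying
# the given function to every value of the requested type.
batch_on_cpu = apply_to_collection(batch, torch.Tensor, lambda t: t.to("cpu"))
```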
Changed
- Decoupled the device parsing logic from the Accelerator connector and moved it to the Trainer (8180)
- Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values (7539)
- Log epoch metrics before the `on_evaluation_end` hook (7272)
- Explicitly disallow calling `self.log(on_epoch=False)` during epoch-only or single-call hooks (7874)
- Changed these `Trainer` methods to be protected: `call_setup_hook`, `call_configure_sharded_model`, `pre_dispatch`, `dispatch`, `post_dispatch`, `call_teardown_hook`, `run_train`, `run_sanity_check`, `run_evaluate`, `run_evaluation`, `run_predict`, `track_output_for_epoch_end`
- Changed `metrics_to_scalars` to work with any collection or value (7888)
- Changed `clip_grad_norm` to use `torch.nn.utils.clip_grad_norm_` (7025)
- Validation is now always run inside the training epoch scope (7357)
- `ModelCheckpoint` now runs at the end of the training epoch by default (8389)
- `EarlyStopping` now runs at the end of the training epoch by default (8286)
- Refactored Loops
    * Moved attributes `global_step`, `current_epoch`, `max/min_steps`, `max/min_epochs`, `batch_idx`, and `total_batch_idx` to the `TrainLoop` (7437)
    * Refactored result handling in training loop (7506)
    * Moved attributes `hiddens` and `split_idx` to the `TrainLoop` (7507)
* Refactored the logic around manual and automatic optimization inside the optimizer loop (7526)
* Simplified "should run validation" logic (7682)
* Simplified logic for updating the learning rate for schedulers (7682)
* Removed the `on_epoch` guard from the "should stop" validation check (7701)
* Refactored internal loop interface; added new classes `FitLoop`, `TrainingEpochLoop`, `TrainingBatchLoop` (7871, 8077)
* Removed `pytorch_lightning/trainer/training_loop.py` (7985)
* Refactored evaluation loop interface; added new classes `DataLoaderLoop`, `EvaluationLoop`, `EvaluationEpochLoop` (7990, 8077)
* Removed `pytorch_lightning/trainer/evaluation_loop.py` (8056)
* Restricted public access to several internal functions (8024)
    * Refactored the trainer `_run_*` functions and separated the evaluation loops (8065)
* Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` (7700, 8077)
* Removed `pytorch_lightning/trainer/predict_loop.py` (8094)
* Moved result teardown to the loops (8245)
    * Improved the `Loop` API to better handle the `state_dict` and `progress` of child loops (8334)
- Refactored logging
* Renamed and moved `core/step_result.py` to `trainer/connectors/logger_connector/result.py` (7736)
    * Dramatically simplified the `LoggerConnector` (7882)
    * `trainer.{logged,progress_bar,callback}_metrics` are now updated on-demand (7882)
    * Completely overhauled the `Result` object in favor of `ResultMetric` (7882)
    * Improved epoch-level reduction time and overall memory usage (7882)
    * Allowed passing `self.log(batch_size=...)` (7891)
    * Each of the training loops now keeps its own results collection (7891)
    * Removed `EpochResultStore` and `HookResultStore` in favor of `ResultCollection` (7909)
    * Removed `MetricsHolder` (7909)
- Moved `ignore_scalar_return_in_dp` warning suppression to the DataParallelPlugin class (7421)
- Changed the behaviour when logging evaluation step metrics to no longer append `/epoch_*` to the metric name (7351)
- Raised a `ValueError` when a `None` value is logged with `self.log` (7771)
- Changed `resolve_training_type_plugins` to allow setting `num_nodes` and `sync_batchnorm` from the `Trainer` settings (7026)
- `seed_everything(workers=True)` is now the default in the `LightningCLI` (7504)
- Changed `model.state_dict()` in `CheckpointConnector` to allow `training_type_plugin` to customize the model's `state_dict()` (7474)
- `MLFlowLogger` now uses the environment variable `MLFLOW_TRACKING_URI` as the default tracking URI (7457)
- Changed `Trainer` arg and functionality from `reload_dataloaders_every_epoch` to `reload_dataloaders_every_n_epochs` (5043)
- Changed `WandbLogger(log_model={True/'all'})` to log models as artifacts (6231)
- `MLFlowLogger` now accepts `run_name` as a constructor argument (7622)
- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic (7579)
- `Trainer.fit` now raises an error when using manual optimization with unsupported features such as `gradient_clip_val` or `accumulate_grad_batches` (7788)
- Accelerator hooks are called regardless of whether the `LightningModule` overrides the same hooks (7826)
- Moved profilers to their own file (7822)
- The `on_after_backward` hook is now called on accumulating iterations. Use the `on_before_optimizer_step` hook to mimic the old behaviour (8328); see the sketch after this list
- The mixed precision loss is no longer unscaled before the `on_after_backward` hook. Use the `on_before_optimizer_step` hook to mimic the old behaviour (8328)
- The `TrainingTypePlugin.{pre,post}_backward` hooks no longer take the `optimizer, opt_idx, should_accumulate` arguments (8328)
- The `PrecisionPlugin.backward` hook no longer returns a value (8328)
- The `PrecisionPlugin.backward` hook no longer takes a `should_accumulate` argument (8328)
- Added the `on_before_backward` hook (7865)
- `LightningCLI` now aborts with a clearer message if the config file already exists, and no longer saves the config during `fast_dev_run` (7963)
- Saved the `LightningCLI` config on `setup` and only on the main process (8017)
- Dropped the `LightningCLI` `ArgumentParser` when pickling (8017)
- Skipped `broadcast` if distributed is not initialized for the spawn plugins (8017)
- `Trainer(resume_from_checkpoint=...)` now restores the model directly after `LightningModule.setup()`, which is before `LightningModule.configure_sharded_model()` (7652)
- Moved `torch.cuda.set_device()` to enable collective calls earlier in setup (8312)
- Used XLA utility API to move data to CPU (Single TPU core) (8078)
- Improved error messages in `replace_sampler` when the `DataLoader` attributes are not included in the signature or the signature is missing optional arguments (8519)
- Moved the `DeviceDtypeModuleMixin` and `HyperparametersMixin` mixins to `core` (8396)
- The `default_root_dir` is now returned as the `log_dir` when the logger is a `LoggerCollection` (8187)
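To illustrate the `on_after_backward` timing changes above, a rough sketch of moving gradient-norm logging to the new `on_before_optimizer_step` hook; the module and metric name are placeholders, and the `(optimizer, optimizer_idx)` signature is assumed from the hook added in this release:
```python
import torch
import pytorch_lightning as pl


class GradNormLoggingModel(pl.LightningModule):
    # This kind of logging previously lived in `on_after_backward`, which is now
    # also called on accumulation steps and before the mixed precision loss is unscaled.
    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # Runs once per optimizer step, after unscaling, so the gradients are final here.
        norms = [p.grad.detach().norm(2) for p in self.parameters() if p.grad is not None]
        if norms:
            self.log("grad_total_norm", torch.norm(torch.stack(norms)))
```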
Deprecated
- Deprecated `LightningModule.loaded_optimizer_states_dict` (8229)
- Standardized the dataloaders arguments of `trainer.{fit,validate,test,tune}` (7431)
- Deprecated `DataModule` properties: `has_prepared_data`, `has_setup_fit`, `has_setup_validate`, `has_setup_test`, `has_setup_predict`, `has_teardown_fit`, `has_teardown_validate`, `has_teardown_test`, `has_teardown_predict` (7657)
- Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` (7422)
- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` (7026)
- Deprecated `self.log(sync_dist_op)` in favor of `self.log(reduce_fx)` (7891)
- Deprecated `is_overridden(model=...)` in favor of `is_overridden(instance=...)` (7918)
- Deprecated automatically detaching returned extras with grads (7994)
- Deprecated the default value of the `monitor` argument in the `EarlyStopping` callback to enforce `monitor` as a required argument (7907)
- Deprecated importing `rank_zero_{warn,deprecation}` directly from `pytorch_lightning.utilities.distributed` (8085)
- Deprecated the use of `CheckpointConnector.hpc_load()` in favor of `CheckpointConnector.restore()` (7652)
- Deprecated `ModelCheckpoint(every_n_val_epochs)` in favor of `ModelCheckpoint(every_n_epochs)` (8383); see the migration sketch after this list
- Deprecated `DDPPlugin.task_idx` in favor of `DDPPlugin.local_rank` (8203)
- Deprecated the `Trainer.train_loop` property in favor of `Trainer.fit_loop` (8025)
- Deprecated the `Trainer.disable_validation` property in favor of `not Trainer.enable_validation` (8291)
- Deprecated `mode` parameter in `ModelSummary` in favor of `max_depth` (8062)
- Deprecated `reload_dataloaders_every_epoch` argument of `Trainer` in favor of `reload_dataloaders_every_n_epochs` (5043)
- Deprecated `distributed_backend` argument for `Trainer` (8575)
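A short migration sketch for two of the renamed arguments above (`every_n_val_epochs` and `reload_dataloaders_every_epoch`); the numeric values are placeholders:
```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Deprecated:  ModelCheckpoint(every_n_val_epochs=2)
# Deprecated:  Trainer(reload_dataloaders_every_epoch=True)

checkpoint = ModelCheckpoint(every_n_epochs=2)
trainer = Trainer(callbacks=[checkpoint], reload_dataloaders_every_n_epochs=1)
```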
Removed
- Dropped official support/testing for PyTorch <1.6 (8288)
- Removed `ProfilerConnector` (7654)
- Pruned deprecated classification metrics from `pytorch_lightning.metrics.functional.classification` (7499)
- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` (7510)
- Removed deprecated trainer attributes - `get_model` and `accelerator_backend` (7502)
- Removed support for automatically monitoring the `val_loss` key with `ModelCheckpoint`. Pass your `monitor` of choice to the `ModelCheckpoint` instance instead (8293); see the sketch after this list
- Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please open a discussion explaining your use case if you relied on these (7644)
- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` (7503)
- Removed `RPCPlugin` and `RPCSequentialPlugin`. If you were successfully using these plugins, please open a GitHub discussion about your use case (8101)
- Removed deprecated trainer attributes - `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` (7501)
- Removed deprecated `optimizer` argument in `LightningModule.manual_backward()`; Toggling optimizers in manual optimization should be done using `LightningModule.{un}toggle_optimizer()` (8287)
- Removed DeepSpeed FP16 Exception as FP32 is now supported (8462)
- Removed environment variable `PL_EXP_VERSION` from DDP subprocesses (7403)
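Given the removals above, a rough sketch of the updated patterns: the monitored metric is passed explicitly to `ModelCheckpoint`, and `manual_backward` is called without an `optimizer` argument (the `compute_loss` call and the metric name are placeholders):
```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# `val_loss` is no longer monitored automatically; pass the metric explicitly.
checkpoint = ModelCheckpoint(monitor="val_loss")


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        loss = self.compute_loss(batch)  # placeholder for the actual loss computation
        self.manual_backward(loss)  # the removed `optimizer` argument is no longer passed
        opt.step()
```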
Fixed
- Fixed the `GPUStatsMonitor` callback to use the correct GPU IDs if `CUDA_VISIBLE_DEVICES` is set (8260)
- Fixed `lr_scheduler` checkpointed state by calling `update_lr_schedulers` before saving checkpoints (7877)
- Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (7685)
- Fixed dev debugger memory growing due to tracking events even when disabled (7875)
- Fixed `None` loss keys getting added in `training_epoch_end` when using manual optimization and not returning a loss (7772)
- Fixed a bug where `precision=64` with `accelerator='ddp_spawn'` would throw a pickle error (6924)
- Fixed the existing `epoch` value in `logged_metrics` being overridden when it was already logged by the user (7982)
- Fixed support for manual optimization with DeepSpeed (7970)
- Fixed `dataloader_idx` argument value when predicting with only one `DataLoader` (7941)
- Fixed passing the `stage` argument of `Callback.{setup,teardown}` as a keyword (7973)
- Fixed metrics generated during validation sanity checking not being cleaned up at the end of the check (8171)
- Fixed `log_gpu_memory` metrics not being added to `logging` when nothing else is logged (8174)
- Fixed a bug where calling `log` with a `Metric` instance would raise an error if it was a nested attribute of the model (8181)
- Fixed a bug where using `precision=64` would cause buffers with complex dtype to be cast to real (8208)
- Fixed `is_overridden` returning true for wrapped functions with no changes (8296)
- Fixed a bug where `truncated_bptt_steps` would throw an `AttributeError` when the target RNN has multiple hidden states (8145)
- Fixed `self.optimizers()` not returning a single optimizer if it had been wrapped (8326)
- Fixed the `on_after_backward` hook not getting called when using manual optimization and no plugins (8328)
- Fixed the `LightningModule.backward` hook only getting called with the `apex` plugin when using manual optimization (8328)
- Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks (7378)
- Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator (6208)
- Fixed missing call to `LightningModule.untoggle_optimizer` in training loop when running gradient accumulation with multiple optimizers (8284)
- Fixed the hash of `LightningEnum` to be based on its value instead of its name (8421)
- Fixed a bug where an extra checkpoint was saved at the end of training if the `val_check_interval` did not align with the number of training batches (7724)
- Fixed `move_data_to_device` to return the batch if the object `to` function didn't return `self` (8433)
- Fixed progress bar updates for TPU Pod training (8258)
- Fixed clearing dataloader references before attaching new dataloaders in consecutive `Trainer.{fit,validate,test,predict}` runs (8442)
- Fixed memory leaks on GPU by moving `optimizer_states`, `ResultCollection.extra`, `ResultMetric` attributes, and `LoggerConnector` metrics to `cpu`. Also, delete the DDP wrapper on `teardown` (8490)
- Fixed the `SWA` callback to use the `LightningModule` `prevent_trainer_and_dataloaders_deepcopy` context manager to avoid OOM (8472)
- Fixed `ModelPruning` callback `on_save_checkpoint` to avoid making a `deepcopy` potentially leading to OOM (8472)
- Fixed the sampler replacement logic for `DataLoader`s which do not define all `DataLoader` attributes as `__init__` parameters (8519)
- Fixed DeepSpeed Windows support (8488)
- Fixed DeepSpeed not properly setting the trainer `lr_schedulers` attribute (8527)
- Fixed experiment version and log-dir divergence in DDP when using multiple `Trainer` instances in sequence (7403)
- Enabled manual optimization for TPUs (8458)
- Fixed `accumulate_grad_batches` not being recomputed during model reload (5334)
- Fixed a `TypeError` when wrapping optimizers in the `HorovodPlugin` and running `Trainer.test` (7840)
- Fixed `BackboneFinetuning` restoration (8501)
- Fixed `lr_scheduler` with metric (e.g. `torch.optim.lr_scheduler.ReduceLROnPlateau`) when using `automatic_optimization = False` (7643); see the sketch at the end of this list
- Fixed `DeepSpeed` breaking with no schedulers ([8580](https://github.com/PyTorchLightning/pytorch-lightning/pull/8580))
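For the `ReduceLROnPlateau` fix above, a sketch of stepping a metric-based scheduler by hand under `automatic_optimization = False`; it assumes the scheduler was returned from `configure_optimizers`, that a `train_loss` metric has been logged, and that the optimizer stepping follows the manual optimization sketch in the Removed section:
```python
import pytorch_lightning as pl


class ManualPlateauModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False

    def training_epoch_end(self, outputs):
        # With manual optimization, metric-based schedulers must be stepped by hand.
        scheduler = self.lr_schedulers()
        metric = self.trainer.callback_metrics.get("train_loss")
        if metric is not None:
            scheduler.step(metric)
```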
Contributors
00sapo AffineParameter ajtritt akihironitta ananthsub aniketmaurya aslisabanci awaelchli bamblebam Borda borisdayma carmocca dalek-who DavidMChan davors72 dcfidalgo ddrevicky deepsource-autofix djthegr8 edenlightning edgarriba eladsegal ethanwharris eugeneh101 fepegar gaoteng-git gtauzin i-aki-y janhenriklambrechts jiwidi justusschock karthikrangasai kaushikb11 loic-beheshti Lucklyric ManuelPalermo mauvilsa maxoppelt neggert nikvaessen nisheethlahoti pre-commit-ci rohitgr7 ruotianluo satishjasthi SeanNaren shirayu shuyingsunshine21 sid-sundrani Sileadim simran2905 stancld t-vi tchaton theblackfly theodumont tilman151 tomy0000000 tshu-w vatch123 WrRan yifuwang
_If we forgot someone, let us know :]_