This release is a buffer in case 1.0 breaks compatibility for anyone who upgrades. 0.10.0 has all the bug fixes and features of 1.0 but remains 100% backward compatible. The 1.0 release will follow within the next 24 hours.
Overview
The major changes are:
- Results objects are deprecated (we hated them too haha)
- This means dataflow and logging have been decoupled
To log:
```python
def any_step(...):
    self.log('something', i_computed)
```
Separately, return whatever you want from methods:
```python
def training_step(...):
    return loss
```
or
```python
def training_step(...):
    return {'loss': loss, 'whatever': [1, 'want']}
```
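Putting the two together, a minimal `training_step` under the new API might look like the sketch below (the metric name, the dummy loss, and the `LitModel` class are illustrative only, and the model is assumed to define its own `forward`):

```python
import torch.nn.functional as F
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def training_step(self, batch, batch_idx):
        x, y = batch
        # compute the loss with whatever forward pass the model defines
        loss = F.cross_entropy(self(x), y)
        # logging is decoupled from the return value
        self.log('train_loss', loss)
        # return a scalar loss, or a dict that contains it
        return loss
```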
Detailed changes
Added
- Added new Metrics API (3868, 3921)
- Enable PyTorch 1.7 compatibility (3541)
- Added `LightningModule.to_torchscript` to support exporting as `ScriptModule` (3258); see the sketch after this list
- Added warning when dropping unpicklable `hparams` (2874)
- Added embedding (EMB) similarity (3349)
- Added `ModelCheckpoint.to_yaml` method (3048)
- Allow `ModelCheckpoint` monitor to be `None`, meaning it will always save (3630)
- Disabled optimizers setup during testing (3059)
- Added support for datamodules to save and load checkpoints when training (3563)
- Added support for datamodule in learning rate finder (3425)
- Added gradient clip test for native AMP (3754)
- Added dist lib to enable syncing anything across devices (3762)
- Added `broadcast` to `TPUBackend` (3814)
- Added `XLADeviceUtils` class to check XLA device type (3274)
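As a rough illustration of the new `to_torchscript` hook, the sketch below (the `LitModel` class and file name are made up for this example) exports a module so it can be loaded without Lightning:

```python
import torch
from torch import nn
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    # minimal module defined only for this example
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

model = LitModel()
scripted = model.to_torchscript()         # returns a torch.jit.ScriptModule
torch.jit.save(scripted, "lit_model.pt")  # loadable with plain PyTorch
```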
Changed
- Refactored accelerator backends:
* moved TPU `xxx_step` to backend (3118)
* refactored DDP backend `forward` (3119)
* refactored GPU backend `__step` (3120)
* refactored Horovod backend (3121, 3122)
* remove obscure forward call in eval + CPU backend `___step` (3123)
* reduced all simplified forward (3126)
* added hook base method (3127)
* refactor eval loop to use hooks; use `test_mode` so we can split it later (3129)
* moved `___step_end` hooks (3130)
* training forward refactor (3134)
* training AMP scaling refactor (3135)
* eval step scaling factor (3136)
* add eval loop object to streamline eval loop (3138)
* refactored dataloader process hook (3139)
* refactored inner eval loop (3141)
* final inner eval loop hooks (3154)
* clean up hooks in `run_evaluation` (3156)
* clean up data reset (3161)
* expand eval loop out (3165)
* moved hooks around in eval loop (3195)
* remove `_evaluate` fx (3197)
* `Trainer.fit` hook clean up (3198)
* DDPs train hooks (3203)
* refactor DDP backend (3204, 3207, 3208, 3209, 3210)
* reduced accelerator selection (3211)
* group prepare data hook (3212)
* added data connector (3285)
* modular is_overridden (3290)
* adding `Trainer.tune()` (3293)
* move `run_pretrain_routine` -> `setup_training` (3294)
* move train outside of setup training (3297)
* move `prepare_data` to data connector (3307)
* moved accelerator router (3309)
* train loop refactor - moving train loop to own object (3310, 3312, 3313, 3314)
* duplicate data interface definition up into DataHooks class (3344)
* inner train loop (3359, 3361, 3362, 3363, 3365, 3366, 3367, 3368, 3369, 3370, 3371, 3372, 3373, 3374, 3375, 3376, 3385, 3388, 3397)
* all logging related calls in a connector (3395)
* device parser (3400, 3405)
* added model connector (3407)
* moved eval loop logging to loggers (3408)
* moved eval loop (3412, 3408)
* trainer/separate argparse (3421, 3428, 3432)
* move `lr_finder` (3434)
* organize args (3435, 3442, 3447, 3448, 3449, 3456)
* move specific accelerator code (3457)
* group connectors (3472)
* accelerator connector methods x/n (3469, 3470, 3474)
* merge backends (3476, 3477, 3478, 3480, 3482)
* apex plugin (3502)
* precision plugins (3504)
* Result - make monitor default to `checkpoint_on` to simplify (3571)
* reference to the Trainer on the `LightningDataModule` (3684)
* add `.log` to lightning module (3686, 3699, 3701, 3704, 3715); see the logging sketch after this list
* enable tracking original metric when step and epoch are both true (3685)
* deprecated results obj, added support for simpler comms (3681)
* move backends back to individual files (3712)
* fixes logging for eval steps (3763)
* decoupled DDP, DDP spawn (3733, 3766, 3767, 3774, 3802, 3806)
* remove weight loading hack for ddp_cpu (3808)
* separate `torchelastic` from DDP (3810)
* separate SLURM from DDP (3809)
* decoupled DDP2 (3816)
* bug fix with logging val epoch end + monitor (3812)
* decoupled DDP, DDP spawn (3733, 3817, 3819, 3927)
* callback system and init DDP (3836)
* adding compute environments (3837, 3842)
* epoch can now log independently (3843)
* test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (3848)
* fixed `init_slurm_connection` causing hostname errors (3856)
* moves init apex from LM to apex connector (3923)
* moves sync bn to each backend (3925)
* moves configure ddp to each backend (3924)
- Deprecation warning (3844)
- Changed `LearningRateLogger` to `LearningRateMonitor` (3251)
- Used `fsspec` instead of `gfile` for all IO (3320)
* Swapped `torch.load` for `fsspec` load in DDP spawn backend (3787)
* Swapped `torch.load` for `fsspec` load in cloud_io loading (3692)
* Added support for `to_disk()` to use remote filepaths with `fsspec` (3930)
* Updated model_checkpoint's to_yaml to use `fsspec` open (3801)
* Fixed `fsspec` being inconsistent when doing `fs.ls` (3805)
- Refactor `GPUStatsMonitor` to improve training speed (3257)
- Changed IoU score behavior for classes absent in target and pred (3098)
- Changed IoU `remove_bg` bool to `ignore_index` optional int (3098)
- Changed defaults of `save_top_k` and `save_last` to `None` in ModelCheckpoint (3680)
- `row_log_interval` and `log_save_interval` are now based on training loop's `global_step` instead of epoch-internal batch index (3667)
- Silenced some warnings and verified DDP refactors (3483)
- Cleaning up stale logger tests (3490)
- Allow `ModelCheckpoint` monitor to be `None` (3633)
- Enable `None` model checkpoint default (3669)
- Skipped `best_model_path` if `checkpoint_callback` is `None` (2962)
- Used `raise .. from ..` to explicitly chain exceptions (3750)
- Mocking loggers (3596, 3617, 3851, 3859, 3884, 3853, 3910, 3889, 3926)
- Write predictions in `LightningModule` instead of `EvalResult` (3882)
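For reference, a minimal sketch of the `.log` call referenced above (the metric name and the `compute_loss` helper are hypothetical; `on_step`/`on_epoch` control whether the value is tracked per step, per epoch, or both):

```python
def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = self.compute_loss(x, y)  # hypothetical helper for this example
    # track the metric at both step and epoch level, and show it in the progress bar
    self.log('val_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
```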
Deprecated
- Deprecated `TrainResult` and `EvalResult`, use `self.log` and `self.write` from the `LightningModule` to log metrics and write predictions. `training_step` can now only return a scalar (for the loss) or a dictionary with anything you want. (3681)
- Deprecated `early_stop_callback` Trainer argument (3845)
- Renamed Trainer arguments `row_log_interval` >> `log_every_n_steps` and `log_save_interval` >> `flush_logs_every_n_steps` (3748); see the sketch below
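A minimal sketch of the renamed arguments (the values are arbitrary, and `EarlyStopping` is the callback that replaces the deprecated `early_stop_callback` flag):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

trainer = Trainer(
    log_every_n_steps=50,          # was: row_log_interval
    flush_logs_every_n_steps=100,  # was: log_save_interval
    callbacks=[EarlyStopping(monitor='val_loss')],  # replaces early_stop_callback=True
)
```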
Removed
- Removed experimental Metric API (3868, 3943, 3949, 3946); the changes listed below were made before the final removal:
* Added `EmbeddingSimilarity` metric (3349, 3358)
* Added hooks to metric module interface (2528)
* Added error when AUROC metric is used for multiclass problems (3350)
* Fixed `ModelCheckpoint` with `save_top_k=-1` option not tracking the best models when a monitor metric is available (3735)
* Fixed counter-intuitive error being thrown in `Accuracy` metric for zero target tensor (3764)
* Fixed aggregation of metrics (3517)
* Fixed Metric aggregation (3321)
* Fixed RMSLE metric (3188)
* Renamed `reduction` to `class_reduction` in classification metrics (3322)
* Changed `class_reduction` similar to sklearn for classification metrics (3322)
* Renaming of precision recall metric (3308)
Fixed
- Fixed `on_train_batch_start` hook to end epoch early (3700)
- Fixed `num_sanity_val_steps` so it is clipped to `limit_val_batches` (2917)
- Fixed ONNX model save on GPU (3145)
- Fixed `GpuUsageLogger` to work on different platforms (3008)
- Fixed auto-scale batch size not dumping `auto_lr_find` parameter (3151)
- Fixed `batch_outputs` with optimizer frequencies (3229)
- Fixed setting batch size in `LightningModule.datamodule` when using `auto_scale_batch_size` (3266)
- Fixed Horovod distributed backend compatibility with native AMP (3404)
- Fixed batch size auto scaling exceeding the size of the dataset (3271)
- Fixed getting `experiment_id` from MLFlow only once instead of each training loop (3394)
- Fixed `overfit_batches` which now correctly disables shuffling for the training loader. (3501)
- Fixed gradient norm tracking for `row_log_interval > 1` (3489)
- Fixed `ModelCheckpoint` name formatting (3164)
- Fixed auto-scale batch size (3151)
- Fixed example implementation of AutoEncoder (3190)
- Fixed invalid paths when remote logging with TensorBoard (3236)
- Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor (3252)
- Fixed (weights only) checkpoints loading without PL (3287)
- Fixed `gather_all_tensors` cross GPUs in DDP (3319)
- Fixed CometML save dir (3419)
- Fixed forward key metrics (3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (3465)
- Fixed global step increment in training loop when `training_epoch_end` hook is used (3673)
- Fixed dataloader shuffling not getting turned off with `overfit_batches > 0` and `distributed_backend = "ddp"` (3534)
- Fixed determinism in `DDPSpawnBackend` when using `seed_everything` in main process (3335)
- Fixed `ModelCheckpoint` `period` to actually save every `period` epochs (3630)
- Fixed `val_progress_bar` total with `num_sanity_val_steps` (3751)
- Fixed Tuner dump: add `current_epoch` to dumped_params (3261)
- Fixed `current_epoch` and `global_step` properties mismatch between `Trainer` and `LightningModule` (3785)
- Fixed learning rate scheduler for optimizers with internal state (3897)
- Fixed `tbptt_reduce_fx` when non-floating tensors are logged (3796)
- Fixed model checkpoint frequency (3852)
- Fixed logging a non-tensor scalar with `Result` breaking subsequent epoch aggregation (3855)
- Fixed `TrainerEvaluationLoopMixin` activating `model.train()` at the end (3858)
- Fixed `overfit_batches` when used with multiple val/test dataloaders (3857)
- Fixed `training_step` so it can return `None` (3862)
- Fixed NaN initialization for checkpointing (3863)
- Fixed `load_from_checkpoint` (2776)
- Fixed incorrect `batch_sizes` when the Dataloader returns a dict with multiple tensors (3668)
- Fixed unexpected signature for `validation_step` (3947)
Contributors
abrahambotros, akihironitta, ananthsub, ananyahjha93, awaelchli, Borda, c00k1ez, carmocca, f4hy, GimmickNG, jbschiratti, justusschock, LeeJZh, lezwon, Lucas-Steinmann, maxjeblick, monney, mpariente, nateraw, nrupatunga, patrickorlando, PhilJd, rohitgr7, s-rog, ShomyLiu, SkafteNicki, Sordie, teddykoker, tgaddair, Vozf, williamFalcon, XDynames, ydcjeff
_If we forgot someone due to not matching the commit email with GitHub account, let us know :]_