In this release we have refactored some names inside every algorithm, in particular:
* we have introduced the concept of `policy_step`: a counter of the (distributed) policy steps, i.e. the number of times the policy is called to produce an action given an observation. An environment step does not take the action repeat into account, so if one has `n` ranks and `m` environments per rank, every environment step advances the counter by `policy_steps = n * m` (see the sketch below)
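
As a rough illustration, here is a minimal sketch of how such a counter advances during data collection; the variable names and numbers are illustrative assumptions, not the actual ones used in the code:

```python
# Minimal sketch of how the policy-step counter advances, assuming `world_size`
# ranks and `num_envs` environments per rank. Names/values are illustrative only.
world_size = 2          # number of distributed ranks (assumption for the example)
num_envs = 4            # environments per rank (assumption for the example)
policy_steps_per_env_step = world_size * num_envs  # policy_steps = n * m

policy_step = 0
for env_step in range(10):
    # every rank queries the policy once per environment it owns,
    # regardless of the action repeat applied inside the environment
    policy_step += policy_steps_per_env_step

print(policy_step)  # 10 environment steps -> 80 policy steps
```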
We have also refactored the hydra configs, in particular:
* we have introduced the `metric`, `checkpoint` and `buffer` configs, which contain the hyperparameters shared by those objects across every algorithm
* the `metric` config has the `metric.log_every` parameter, which controls the logging frequency. Since the `policy_step` counter is rarely exactly divisible by `metric.log_every`, logging happens as soon as `policy_step - last_log >= cfg.metric.log_every`, where `last_log` is set to the current `policy_step` every time something is logged (see the sketch after this list)
* the `checkpoint` config has the `every` and `resume_from` parameters. The `every` parameter works like `metric.log_every`, while `resume_from` specifies the experiment folder, which must contain the `.hydra` folder, to resume training from; resuming is currently supported only by the Dreamer algorithms (see the second sketch after this list)
* `num_envs` and `clip_reward` have been moved to the `env` config
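
Below is a minimal sketch of the "log as soon as the threshold is crossed" rule described above for `metric.log_every` (the `checkpoint.every` parameter follows the same pattern); `cfg` and the loop are stand-ins for the resolved hydra config and the training loop:

```python
# Sketch of the threshold-based logging condition described above.
# `cfg.metric.log_every`, `last_log` and `policy_step` mirror the text;
# the loop and numbers are purely illustrative.
from types import SimpleNamespace

cfg = SimpleNamespace(metric=SimpleNamespace(log_every=1000))

last_log = 0
policy_step = 0
for _ in range(100):
    policy_step += 128  # e.g. 16 ranks * 8 envs per rank
    if policy_step - last_log >= cfg.metric.log_every:
        # ... log the aggregated metrics here ...
        last_log = policy_step  # reset the reference point to the current step
```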
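
And a hedged sketch of what `checkpoint.resume_from` points to: the experiment folder is expected to contain the `.hydra` directory that Hydra writes at launch time. This is an assumption-based illustration of reading that folder back, not the framework's actual resume code:

```python
from pathlib import Path

from omegaconf import OmegaConf


def load_resume_config(resume_from: str):
    """Load the original run configuration from a previous experiment folder.

    Illustration only: it assumes the standard Hydra layout in which the
    output folder contains `.hydra/config.yaml`.
    """
    hydra_dir = Path(resume_from) / ".hydra"
    if not hydra_dir.is_dir():
        raise FileNotFoundError(f"'{resume_from}' does not contain a .hydra folder")
    return OmegaConf.load(hydra_dir / "config.yaml")
```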