Breaking Changes
- All customs environments (e.g. the ``BitFlippingEnv`` or ``IdentityEnv``) were moved to ``stable_baselines3.common.envs`` folder
- Refactored ``HER`` which is now the ``HerReplayBuffer`` class that can be passed to any off-policy algorithm
- Handle timeout termination properly for off-policy algorithms (when using ``TimeLimit``)
- Renamed ``_last_dones`` and ``dones`` to ``_last_episode_starts`` and ``episode_starts`` in ``RolloutBuffer``.
- Removed ``ObsDictWrapper`` as ``Dict`` observation spaces are now supported
python
her_kwargs = dict(n_sampled_goal=2, goal_selection_strategy="future", online_sampling=True)
SB3 < 1.1.0
model = HER("MlpPolicy", env, model_class=SAC, **her_kwargs)
SB3 >= 1.1.0:
model = SAC("MultiInputPolicy", env, replay_buffer_class=HerReplayBuffer, replay_buffer_kwargs=her_kwargs)
- Updated the KL Divergence estimator in the PPO algorithm to be positive definite and have lower variance (09tangriro)
- Updated the KL Divergence check in the PPO algorithm to be before the gradient update step rather than after end of epoch (09tangriro)
- Removed parameter ``channels_last`` from ``is_image_space`` as it can be inferred.
- The logger object is now an attribute ``model.logger`` that be set by the user using ``model.set_logger()``
- Changed the signature of ``logger.configure`` and ``utils.configure_logger``, they now return a ``Logger`` object
- Removed ``Logger.CURRENT`` and ``Logger.DEFAULT``
- Moved ``warn(), debug(), log(), info(), dump()`` methods to the ``Logger`` class
- ``.learn()`` now throws an import error when the user tries to log to tensorboard but the package is not installed
New Features
- Added support for single-level ``Dict`` observation space (JadenTravnik)
- Added ``DictRolloutBuffer`` ``DictReplayBuffer`` to support dictionary observations (JadenTravnik)
- Added ``StackedObservations`` and ``StackedDictObservations`` that are used within ``VecFrameStack``
- Added simple 4x4 room Dict test environments
- ``HerReplayBuffer`` now supports ``VecNormalize`` when ``online_sampling=False``
- Added [VecMonitor](https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/vec_monitor.py) and [VecExtractDictObs](https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/vec_extract_dict_obs.py) wrappers to handle gym3-style vectorized environments (vwxyzjn)
- Ignored the terminal observation if the it is not provided by the environment
such as the gym3-style vectorized environments. (vwxyzjn)
- Added policy_base as input to the OnPolicyAlgorithm for more flexibility (09tangriro)
- Added support for image observation when using ``HER``
- Added ``replay_buffer_class`` and ``replay_buffer_kwargs`` arguments to off-policy algorithms
- Added ``kl_divergence`` helper for ``Distribution`` classes (09tangriro)
- Added support for vector environments with ``num_envs > 1`` (benblack769)
- Added ``wrapper_kwargs`` argument to ``make_vec_env`` (amy12xx)
Bug Fixes
- Fixed potential issue when calling off-policy algorithms with default arguments multiple times (the size of the replay buffer would be the same)
- Fixed loading of ``ent_coef`` for ``SAC`` and ``TQC``, it was not optimized anymore (thanks Atlis)
- Fixed saving of ``A2C`` and ``PPO`` policy when using gSDE (thanks liusida)
- Fixed a bug where no output would be shown even if ``verbose>=1`` after passing ``verbose=0`` once
- Fixed observation buffers dtype in DictReplayBuffer (c-rizz)
- Fixed EvalCallback tensorboard logs being logged with the incorrect timestep. They are now written with the timestep at which they were recorded. (skandermoalla)
Others
- Added ``flake8-bugbear`` to tests dependencies to find likely bugs
- Updated ``env_checker`` to reflect support of dict observation spaces
- Added Code of Conduct
- Added tests for GAE and lambda return computation
- Updated distribution entropy test (thanks 09tangriro)
- Added sanity check ``batch_size > 1`` in PPO to avoid NaN in advantage normalization
Documentation:
- Added gym pybullet drones project (JacopoPan)
- Added link to SuperSuit in projects (justinkterry)
- Fixed DQN example (thanks ltbd78)
- Clarified channel-first/channel-last recommendation
- Update sphinx environment installation instructions (tom-doerr)
- Clarified pip installation in Zsh (tom-doerr)
- Clarified return computation for on-policy algorithms (TD(lambda) estimate was used)
- Added example for using ``ProcgenEnv``
- Added note about advanced custom policy example for off-policy algorithms
- Fixed DQN unicode checkmarks
- Updated migration guide (juancroldan)
- Pinned ``docutils==0.16`` to avoid issue with rtd theme
- Clarified callback ``save_freq`` definition
- Added doc on how to pass a custom logger
- Remove recurrent policies from ``A2C`` docs (bstee615)