Breaking Changes:
- ``evaluate_policy`` now returns the standard deviation of the reward per episode
  as its second return value (instead of ``n_steps``)
- ``evaluate_policy`` now returns a list of the episode lengths as its second return value
  when ``return_episode_rewards`` is set to ``True`` (instead of ``n_steps``); see the example after this list
- Callbacks are now called after each ``env.step()`` for consistency (they were previously called
  every ``n_steps`` in algorithms like ``A2C`` or ``PPO2``)
- Removed unused code in ``common/a2c/utils.py`` (``calc_entropy_softmax``, ``make_path``)
- **Refactoring, including removed files and moving functions.**
- Algorithms no longer import from each other, and ``common`` does not import from algorithms.
- ``a2c/utils.py`` removed and split into other files:
- common/tf_util.py: ``sample``, ``calc_entropy``, ``mse``, ``avg_norm``, ``total_episode_reward_logger``,
  ``q_explained_variance``, ``gradient_add``, ``check_shape``,
  ``seq_to_batch``, ``batch_to_seq``.
- common/tf_layers.py: ``conv``, ``linear``, ``lstm``, ``_ln``, ``lnlstm``, ``conv_to_fc``, ``ortho_init``.
- a2c/a2c.py: ``discount_with_dones``.
- acer/acer_simple.py: ``get_by_index``, ``EpisodeStats``.
- common/schedules.py: ``constant``, ``linear_schedule``, ``middle_drop``, ``double_linear_con``, ``double_middle_drop``,
``SCHEDULES``, ``Scheduler``.
- ``trpo_mpi/utils.py`` functions moved (``traj_segment_generator`` moved to ``common/runners.py``, ``flatten_lists`` to ``common/misc_util.py``).
- ``ppo2/ppo2.py`` functions moved (``safe_mean`` to ``common/math_util.py``, ``constfn`` and ``get_schedule_fn`` to ``common/schedules.py``).
- ``sac/policies.py`` function ``mlp`` moved to ``common/tf_layers.py``.
- ``sac/sac.py`` function ``get_vars`` removed (replaced with ``tf_util.get_trainable_vars``).
- ``deepq/replay_buffer.py`` renamed to ``common/buffers.py``.
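A minimal sketch of the new ``evaluate_policy`` return values, assuming the helper is imported from
``stable_baselines.common.evaluation`` and that ``model`` and ``env`` are an already trained model and
an evaluation environment:

.. code-block:: python

    from stable_baselines.common.evaluation import evaluate_policy

    # Default: mean and standard deviation of the per-episode reward
    # (the second return value used to be n_steps)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

    # With return_episode_rewards=True: list of per-episode rewards and
    # list of episode lengths (the second return value used to be n_steps)
    episode_rewards, episode_lengths = evaluate_policy(
        model, env, n_eval_episodes=10, return_episode_rewards=True
    )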
New Features:
- Parallelized updating and sampling from the replay buffer in DQN. (flodorner)
- Docker build script, ``scripts/build_docker.sh``, can push images automatically.
- Added callback collection
- Added ``unwrap_vec_normalize`` and ``sync_envs_normalization`` in the ``vec_env`` module
  to synchronize two ``VecNormalize`` environments (see the example after this list)
- Added a seeding method for vectorized environments. (NeoExtended)
- Added an ``extend`` method to store batches of experience in ``ReplayBuffer``. (solliet)
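A minimal sketch of the new normalization helpers, assuming both are importable from
``stable_baselines.common.vec_env``; the ``CartPole-v1`` environment and the ``training=False``
flag are only illustrative choices:

.. code-block:: python

    import gym

    from stable_baselines.common.vec_env import (
        DummyVecEnv,
        VecNormalize,
        sync_envs_normalization,
        unwrap_vec_normalize,
    )

    env = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v1")]))
    eval_env = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v1")]), training=False)

    # Copy the running observation/reward statistics from the training env
    # to the evaluation env so that both normalize identically
    sync_envs_normalization(env, eval_env)

    # Retrieve the VecNormalize wrapper of an env if present (returns None otherwise)
    vec_normalize = unwrap_vec_normalize(env)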
Bug Fixes:
- Fixed Docker images via ``scripts/build_docker.sh`` and ``Dockerfile``: GPU image now contains ``tensorflow-gpu``,
  and both images have ``stable_baselines`` installed in development mode in the correct directory for mounting.
- Fixed Docker GPU run script, ``scripts/run_docker_gpu.sh``, to work with the new NVIDIA Container Toolkit.
- Repeated calls to ``RLModel.learn()`` now preserve internal counters for some episode
  logging statistics that used to be zeroed at the start of every call (see the example after this list).
- Fixed ``DummyVecEnv.render`` for ``num_envs > 1``. This used to print a warning and then not render at all. (shwang)
- Fixed a bug in ``PPO2``, ``ACER``, ``A2C``, and ``ACKTR`` where repeated calls to ``learn(total_timesteps)`` reset
  the environment on every call, potentially biasing samples toward early episode timesteps. (shwang)
- Fixed by adding lazy property ``ActorCriticRLModel.runner``. Subclasses now use lazily-generated
``self.runner`` instead of reinitializing a new Runner every time ``learn()`` is called.
- Fixed a bug in ``check_env`` where it would fail on high-dimensional action spaces
- Fixed ``Monitor.close()`` that was not calling the parent method
- Fixed a bug in ``BaseRLModel`` when seeding vectorized environments. (NeoExtended)
- Fixed ``num_timesteps`` computation to be consistent between algorithms (it is now updated after ``env.step()``).
  Only ``TRPO`` and ``PPO1`` update it differently (after synchronization) because they rely on MPI
- Fixed bug in ``TRPO`` with NaN standardized advantages (richardwu)
- Fixed partial minibatch computation in ExpertDataset (richardwu)
- Fixed normalization (with ``VecNormalize``) for off-policy algorithms
- Fixed ``sync_envs_normalization`` to sync the reward normalization too
- Bumped minimum Gym version (>=0.11)
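A minimal sketch of continuing training over repeated ``learn()`` calls, which the fixes above make
well-behaved; ``PPO2`` on ``CartPole-v1`` is only an example, and ``reset_num_timesteps`` is the
pre-existing argument that controls whether the timestep counter restarts:

.. code-block:: python

    from stable_baselines import PPO2

    model = PPO2("MlpPolicy", "CartPole-v1", verbose=0)

    # First call: num_timesteps starts at zero
    model.learn(total_timesteps=10000)

    # Second call: the environment is no longer reset on every call; pass
    # reset_num_timesteps=False to also keep counting num_timesteps (and the
    # related episode logging statistics) from where the first call stopped
    model.learn(total_timesteps=10000, reset_num_timesteps=False)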
Others:
- Removed redundant return value from ``a2c.utils::total_episode_reward_logger``. (shwang)
- Cleanup and refactoring in ``common/identity_env.py`` (shwang)
- Added a Makefile to simplify common development tasks (build the doc, type check, run the tests)
Documentation:
- Added a dedicated page for callbacks
- Fixed example for creating a GIF (KuKuXia)
- Changed the Colab links in the README to point to the notebooks repo
- Fixed a typo in the Reinforcement Learning Tips and Tricks page. (mmcenta)