In this release we have:
* refactored the recurrent PPO implementation. In particular:
  * A single LSTM model is used, taking as input the current observation, the previously played action, and the previous recurrent state, i.e., `LSTM([o_t, a_t-1], h_t-1)`. The LSTM has an optional pre-MLP and post-MLP, which can be controlled in the corresponding `algo/ppo_recurrent.yaml` config (see the sketch after this list)
  * A feature extractor is used to extract features from the observations, whether they are vectors or images
* Every PPO algorithm now computes the bootstrapped value and adds it to the current reward whenever an environment is truncated (see the second sketch below)
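
Below is a minimal sketch of the shape of the refactored recurrent model, assuming vector observations and plain PyTorch modules; the class name, argument names, and layer sizes are illustrative and do not match the library's actual code.

```python
import torch
import torch.nn as nn


class RecurrentPPOModel(nn.Module):
    """Illustrative sketch: feature extractor -> optional pre-MLP ->
    single LSTM over [features(o_t), a_{t-1}] -> optional post-MLP."""

    def __init__(self, obs_dim: int, act_dim: int, feat_dim: int = 64,
                 hidden_dim: int = 128, pre_mlp: bool = True, post_mlp: bool = True):
        super().__init__()
        # Feature extractor: a simple MLP for vector observations;
        # image observations would use a CNN encoder instead.
        self.feature_extractor = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.Tanh())
        lstm_in = feat_dim + act_dim
        # Optional pre-MLP between the extracted features and the LSTM.
        self.pre_mlp = nn.Sequential(nn.Linear(lstm_in, lstm_in), nn.Tanh()) if pre_mlp else nn.Identity()
        self.lstm = nn.LSTM(lstm_in, hidden_dim, batch_first=False)
        # Optional post-MLP on top of the LSTM output.
        self.post_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh()) if post_mlp else nn.Identity()
        self.actor = nn.Linear(hidden_dim, act_dim)
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, obs, prev_action, state):
        # obs: [T, B, obs_dim], prev_action: [T, B, act_dim],
        # state: (h_{t-1}, c_{t-1}), each of shape [1, B, hidden_dim]
        feats = self.feature_extractor(obs)
        x = self.pre_mlp(torch.cat([feats, prev_action], dim=-1))
        out, state = self.lstm(x, state)
        out = self.post_mlp(out)
        return self.actor(out), self.critic(out), state
```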
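
And a minimal sketch of the truncation bootstrap, assuming a Gymnasium-style `truncated` flag and a `critic` value network; the function name is hypothetical, and whether the discount factor is applied at this point or inside the later return/GAE computation is an assumption made here for illustration.

```python
import torch


@torch.no_grad()
def bootstrap_truncated(rewards, next_obs, truncated, critic, gamma: float = 0.99):
    # rewards: [T, B], next_obs: [T, B, obs_dim], truncated: [T, B] booleans.
    # Add the critic's estimate of V(o_{t+1}) to the reward only where the
    # episode was truncated (e.g. by a time limit), so the truncation is not
    # treated as a true terminal state when computing returns.
    next_values = critic(next_obs).squeeze(-1)  # V(o_{t+1}) for every step
    return rewards + gamma * next_values * truncated.float()
```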