* New observation space with better normalization, improving performance of both central and multi-agent PPO
* Extra observations and a new reward function for multi-agent PPO to learn non-greedy, cooperative, and fair behavior that takes other UEs into account
* Support for continuous instead of episodic training
* Refactoring, fixes, improvements

Details: [v0.10 details](https://github.com/CN-UPB/deep-rl-mobility-management/blob/master/docs/mdp.md#v010-fair-cooperative-multi-agent)