## New Features
- **Improved PPO Implementation**
  - Added feature extraction (`Extractor`) using convolutional layers for improved state representation.
  - Modified `Actor` to output a Gaussian distribution instead of a categorical one (see the actor sketch after this list).
  - Adjusted the `Critic` network architecture for better value estimation.
  - Integrated GAE (Generalized Advantage Estimation) and MC (Monte Carlo) value estimation as selectable options (see the GAE sketch after this list).
- **Enhanced Trainer and Buffer**
  - `OnPolicyTrainer` now supports multiple value estimation methods (`gae` and `mc`).
  - `RolloutBuffer` now supports reward normalization and advantage normalization.
  - Improved buffer memory management and device handling for efficient training.
- **CI/CD Improvements**
  - Updated the GitHub Actions workflow to report coverage from `coverage.xml` instead of `pytest-coverage.txt`.
  - Added Codecov integration to GitHub Actions, using the `CODECOV` secret for token authentication.
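A minimal sketch of what the convolutional extractor and Gaussian actor look like; the layer sizes, observation shape, and state-independent log-std are illustrative assumptions, and only the `Extractor`/`Actor` names come from the notes above:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Extractor(nn.Module):
    """Convolutional feature extractor (layer sizes are illustrative only)."""
    def __init__(self, in_channels: int, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(feature_dim)  # infers the flattened size on first call

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(self.conv(obs)))

class Actor(nn.Module):
    """Outputs a diagonal Gaussian over continuous actions instead of a categorical."""
    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.mu = nn.Linear(feature_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, features: torch.Tensor) -> Normal:
        return Normal(self.mu(features), self.log_std.exp())

# Usage: sample an action and the log-probability needed for the PPO ratio.
obs = torch.randn(1, 4, 84, 84)               # dummy stacked-frame observation
dist = Actor(256, action_dim=2)(Extractor(4)(obs))
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)  # sum over action dimensions
```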
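The `gae` option and the buffer-side advantage normalization can be summarized with a short sketch. This is a generic GAE implementation with optional advantage normalization, not the exact `RolloutBuffer`/`OnPolicyTrainer` API; tensor shapes, flag names, and defaults are assumptions:

```python
import torch

def compute_gae(rewards, values, dones, last_value,
                gamma=0.99, lam=0.95, normalize_advantages=True):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, values, dones: float tensors of shape (T,); dones[t] is 1.0 when the
    episode ended at step t. last_value bootstraps the state after the final step.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(rewards.shape[0])):
        next_value = last_value if t == rewards.shape[0] - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]      # mask the bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values              # regression targets for the critic
    if normalize_advantages:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages
```

The `mc` option replaces the bootstrapped deltas with full discounted returns; a matching sketch appears under Performance Improvements.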
## Bug Fixes
- Fixed incorrect reshaping of stored tensors in `BaseBuffer`.
- Fixed the `MSELoss` computation so prediction and target tensors have matching shapes (illustrated after this list).
- Corrected improper usage of `torch.argmax` in discrete action selection.
- Resolved missing `.to(device)` calls for tensor operations in `RolloutBuffer`.
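The `MSELoss` fix refers to a common PyTorch pitfall: a critic output of shape `(N, 1)` compared against returns of shape `(N,)` broadcasts to an `(N, N)` matrix before averaging, silently distorting the loss (PyTorch only emits a `UserWarning`). A hedged illustration of the kind of fix involved, with placeholder variable names:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
values = torch.randn(64, 1)    # typical critic output: (batch, 1)
returns = torch.randn(64)      # stored returns: (batch,)

# Broken: shapes (64, 1) vs (64,) broadcast to (64, 64) before the mean.
# loss = criterion(values, returns)

# Fixed: align the shapes before computing the loss.
loss = criterion(values.squeeze(-1), returns)
```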
## Performance Improvements
- Optimized batch processing by avoiding redundant `.detach()` calls.
- Refactored advantage computation in GAE for efficiency.
- Modified learning rates for `PPO` (`lr_actor = 1e-4`, `lr_critic = 3e-4`) for more stable training.
- Updated `compute_returns_and_advantages_mc` to apply reward normalization (see the sketch below).
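A sketch of Monte Carlo return computation with the reward normalization mentioned above; the function name mirrors the changelog entry, but its signature, the placement of the normalization, and the advantage definition are assumptions:

```python
import torch

def compute_returns_and_advantages_mc(rewards, values, dones,
                                      gamma=0.99, normalize_rewards=True):
    """Discounted Monte Carlo returns with optional reward normalization.

    rewards, values, dones: float tensors of shape (T,); dones[t] is 1.0 when
    the episode ended at step t, so returns do not leak across episodes.
    """
    if normalize_rewards:
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    advantages = returns - values   # value baseline subtracted from full returns
    return returns, advantages
```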