-----
**ENHANCEMENTS**
- Slurm:
- Add support for scheduling with GPU options. Currently supports the following GPU-related options: `-G/--gpus,
--gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu`.
- Add gres.conf and slurm_parallelcluster_gres.conf in order to enable GPU options. slurm_parallelcluster_gres.conf
is automatically generated by the node daemon and contains GPU information from compute instances. If you need to
specify additional GRES options manually, modify gres.conf and avoid changing slurm_parallelcluster_gres.conf when
possible.
- Integrate GPU requirements into the scaling logic: the cluster scales automatically to satisfy the GPU/CPU
requirements of pending jobs. When submitting GPU jobs, CPU/node/task information is not required but is preferred
in order to avoid ambiguity. If only GPU requirements are specified, the cluster scales up to the minimum number of
nodes required to satisfy all GPU requirements.
- Slurm daemons now keep running when the cluster is stopped, for better stability. However, submitting jobs while
the cluster is stopped is not recommended.
- Change jobwatcher logic to consider both GPU and CPU when making scaling decisions for Slurm jobs. In general,
the cluster will scale up to the minimum number of nodes needed to satisfy all GPU/CPU requirements.
- Reduce the number of calls to ASG in nodewatcher to avoid throttling, especially at cluster scale-down.
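
To illustrate the "minimum number of nodes needed to satisfy all GPU/CPU requirements" idea, here is a minimal sketch in Python. It is not the jobwatcher implementation: the `PendingJob` class, the `nodes_required` function, and the per-node capacities are hypothetical names chosen for this example, and the calculation is a simplified aggregate estimate that ignores per-job and per-node placement constraints.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingJob:
    """Hypothetical summary of the resources requested by one pending job."""
    cpus: int = 0
    gpus: int = 0


def nodes_required(jobs, cpus_per_node, gpus_per_node):
    """Estimate the node count needed to cover the total CPU and GPU demand.

    Simplifying assumption: resources are pooled across jobs, so this is a
    lower bound rather than an exact packing of jobs onto nodes.
    """
    total_cpus = sum(job.cpus for job in jobs)
    total_gpus = sum(job.gpus for job in jobs)

    if total_cpus and cpus_per_node <= 0:
        raise ValueError("CPU jobs are pending but nodes have no CPUs")
    if total_gpus and gpus_per_node <= 0:
        raise ValueError("GPU jobs are pending but nodes have no GPUs")

    nodes_for_cpus = math.ceil(total_cpus / cpus_per_node) if total_cpus else 0
    nodes_for_gpus = math.ceil(total_gpus / gpus_per_node) if total_gpus else 0

    # Whichever resource is scarcer determines how many nodes are needed.
    return max(nodes_for_cpus, nodes_for_gpus)


# One job asks for 4 CPUs and 1 GPU; another specifies only 6 GPUs.
jobs = [PendingJob(cpus=4, gpus=1), PendingJob(gpus=6)]
needed = nodes_required(jobs, cpus_per_node=8, gpus_per_node=4)
```

In this example the GPU demand (7 GPUs on 4-GPU nodes) drives the result, which mirrors the case where only GPU requirements constrain the scale-up.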
**CHANGES**
- Increase max number of SQS messages that can be processed by sqswatcher in a single batch from 50 to 200. This
improves the scaling time, especially with increased ASG launch rates.
- Increase faulty node termination timeout from 1 minute to 5 minutes in order to give the scheduler additional
time to recover when under heavy load.
**BUG FIXES**
- Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had already been
removed from the ASG Desired count. In rare circumstances, this caused the cluster to overscale.
- Improve handling of errors that occur when adding/removing nodes from the scheduler config.
- Fix bug that was causing failures in sqswatcher when ADD and REMOVE events for the same host are fetched together.