This release introduces many additional optimizations, leading to a speedup of more than 7x on data with more than 300K rows.
- All internal statistics (histograms, gradient/hessian sums) are now stored as `f32`. For any summing aggregation, the values are cast to `f64` before being summed, so that higher precision is maintained.
- All gradients are aligned in memory before the feature histograms are calculated. This accounts for about half of the performance improvement.
- The data is realigned in memory before each tree is constructed. This accounts for most of the remaining speed gain.
- The histograms, which were originally a hashmap of vectors, have been converted to a jagged matrix, a data structure with faster access.
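As a rough illustration of the jagged-matrix idea, the sketch below stores rows of differing length in one flat buffer plus a vector of row end offsets. The type `JaggedMatrix` and its methods `from_rows` and `row` are hypothetical names chosen for this example; they are not taken from the library's source and may differ from the actual implementation.

```rust
/// Hypothetical jagged matrix: rows of differing length share one flat buffer.
/// In a histogram setting, each row could hold the bins of one feature.
struct JaggedMatrix<T> {
    /// All rows stored back to back.
    data: Vec<T>,
    /// `ends[i]` is one past the index of the last element of row `i`.
    ends: Vec<usize>,
}

impl<T> JaggedMatrix<T> {
    /// Flatten a vector of rows into a single contiguous buffer.
    fn from_rows(rows: Vec<Vec<T>>) -> Self {
        let mut data = Vec::new();
        let mut ends = Vec::with_capacity(rows.len());
        for row in rows {
            data.extend(row);
            ends.push(data.len());
        }
        JaggedMatrix { data, ends }
    }

    /// Borrow row `i` as a slice. No hashing is needed, only two offset lookups.
    fn row(&self, i: usize) -> &[T] {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.data[start..self.ends[i]]
    }
}
```

Compared with a hashmap of vectors, a lookup here is plain index arithmetic, and neighbouring rows sit next to each other in memory.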
Aligning the data in memory reduces the overall number of cache misses, which drastically improves performance.
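The two ideas above, aligning values in memory and summing `f32` values in `f64`, can be sketched as follows. This is a minimal sketch under assumptions, not the library's code: the helper names `gather` and `sum_f64` are hypothetical, and "aligned" is interpreted here as copying the values for the rows of interest into one contiguous buffer in the order they will be read.

```rust
/// Hypothetical helper: copy the values for the rows listed in `index` into a
/// new contiguous buffer, so later passes read memory sequentially instead of
/// jumping around the original array.
fn gather(values: &[f32], index: &[usize]) -> Vec<f32> {
    index.iter().map(|&i| values[i]).collect()
}

/// Hypothetical helper: sum `f32` values using an `f64` accumulator, so that
/// rounding error does not build up at `f32` precision.
fn sum_f64(values: &[f32]) -> f64 {
    values.iter().map(|&v| f64::from(v)).sum()
}
```

For example, `sum_f64(&gather(&grad, &index))` would total the gradients of the rows in `index` while keeping the stored values in `f32`.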