Handling loads and stores with SIMD is tricky. The problem is not the up-casting on load, but the down-casting at the end of the loop. In AVX2 it's a drag! We keep that for another day and use AVX2 only for the actual math and value clipping. The current Haswell variants run at roughly 15-19 GB/s, as opposed to under 500 MB/s for serial code.
```sh
------------------------------------------------------------------------------------------------------------
Benchmark                                              Time        CPU  Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
fma_u8_haswell<1536d>/min_time:10.000/threads:1      248 ns     248 ns    56523758 abs_delta=8.20566 bytes=18.6111G/s pairs=4.03886M/s relative_error=2.16737m
wsum_u8_haswell<1536d>/min_time:10.000/threads:1     197 ns     197 ns    71164289 abs_delta=7.76442 bytes=15.5983G/s pairs=5.07757M/s relative_error=2.86599m
fma_u8_sapphire<1536d>/min_time:10.000/threads:1    70.9 ns    70.9 ns   197581878 abs_delta=9.2812 bytes=64.9908G/s pairs=14.1039M/s relative_error=2.45142m
wsum_u8_sapphire<1536d>/min_time:10.000/threads:1   51.2 ns    51.2 ns   275604255 abs_delta=8.89144 bytes=60.0323G/s pairs=19.5418M/s relative_error=3.28203m
fma_u8_serial<1536d>/min_time:10.000/threads:1      9749 ns    9748 ns     1428411 abs_delta=1.66854 bytes=472.69M/s pairs=102.58k/s relative_error=440.882u
wsum_u8_serial<1536d>/min_time:10.000/threads:1     9455 ns    9455 ns     1488320 abs_delta=2.32787 bytes=324.901M/s pairs=105.762k/s relative_error=859.403u
```
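
To make the up-cast / math / clip / down-cast split more concrete, here is a minimal sketch of a `u8` weighted sum in C. It is not the library's actual kernel: the function names, the signature, and the choice to spill lanes into a scalar buffer for the down-cast are all assumptions made for illustration. It also assumes an x86-64 target and GCC or Clang, since it relies on `__attribute__((target(...)))` and `__builtin_cpu_supports`.

```c
#include <immintrin.h> // AVX2, FMA, and SSE intrinsics
#include <stddef.h>
#include <stdint.h>

// All names below are hypothetical and exist only for this illustration.

// Scalar path: out[i] = clip(alpha * a[i] + beta * b[i], 0, 255), rounded to nearest.
static void wsum_u8_serial_sketch(uint8_t const *a, uint8_t const *b, size_t n,
                                  float alpha, float beta, uint8_t *out) {
    for (size_t i = 0; i != n; ++i) {
        float sum = alpha * (float)a[i] + beta * (float)b[i];
        if (sum < 0.0f) sum = 0.0f;
        if (sum > 255.0f) sum = 255.0f;
        // Uses the same MXCSR rounding mode as `_mm256_cvtps_epi32` below.
        out[i] = (uint8_t)_mm_cvtss_si32(_mm_set_ss(sum));
    }
}

// AVX2 path: vectorized up-cast, math, and clipping; scalar down-cast (an assumption).
__attribute__((target("avx2,fma")))
static void wsum_u8_avx2_sketch(uint8_t const *a, uint8_t const *b, size_t n,
                                float alpha, float beta, uint8_t *out) {
    __m256 alpha_vec = _mm256_set1_ps(alpha);
    __m256 beta_vec = _mm256_set1_ps(beta);
    __m256 lo = _mm256_setzero_ps();
    __m256 hi = _mm256_set1_ps(255.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        // Up-cast: load 8 bytes, zero-extend to 8x i32, convert to 8x f32.
        __m256 a_f32 = _mm256_cvtepi32_ps(
            _mm256_cvtepu8_epi32(_mm_loadl_epi64((__m128i const *)(a + i))));
        __m256 b_f32 = _mm256_cvtepi32_ps(
            _mm256_cvtepu8_epi32(_mm_loadl_epi64((__m128i const *)(b + i))));
        // Math: alpha * a + beta * b, with one multiply fused into the add.
        __m256 sum = _mm256_fmadd_ps(alpha_vec, a_f32, _mm256_mul_ps(beta_vec, b_f32));
        // Clipping: clamp to [0, 255] in floats, then round to 32-bit integers.
        sum = _mm256_min_ps(_mm256_max_ps(sum, lo), hi);
        __m256i sum_i32 = _mm256_cvtps_epi32(sum);
        // Down-cast: spill the 8 lanes and narrow them one by one.
        int32_t lanes[8];
        _mm256_storeu_si256((__m256i *)lanes, sum_i32);
        for (int j = 0; j != 8; ++j) out[i + j] = (uint8_t)lanes[j];
    }
    // Tail: fewer than 8 elements remain.
    wsum_u8_serial_sketch(a + i, b + i, n - i, alpha, beta, out + i);
}

// Dispatcher: picks the AVX2 path only when the CPU reports AVX2 and FMA support.
void wsum_u8_sketch(uint8_t const *a, uint8_t const *b, size_t n,
                    float alpha, float beta, uint8_t *out) {
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        wsum_u8_avx2_sketch(a, b, n, alpha, beta, out);
    else
        wsum_u8_serial_sketch(a, b, n, alpha, beta, out);
}
```

Calling `wsum_u8_sketch(a, b, n, 1.5f, -1.0f, out)` scales, adds, and saturates each byte pair. The per-lane narrowing loop is the part a fully vectorized version would replace with pack and permute instructions, which is where AVX2 gets awkward.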