SimSIMD is expanding and growing __closer to a fully-fledged BLAS library. BLAS level 1 for now, but it's a start!__ It will prioritize mixed- and low-precision vector math, favoring modern AI workloads. For image & media processing workloads, the new `fma` and `wsum` kernels approach 65 GB/s per core on Intel Sapphire Rapids. That's __100x faster__ than serial code for `u8` inputs with `f32` scaling and accumulation.
The new kernels implement the following element-wise operations:
```math
\text{FMA}_i(A, B, C, \alpha, \beta) = \alpha \cdot A_i \cdot B_i + \beta \cdot C_i
```

```math
\text{WSum}_i(A, B, \alpha, \beta) = \alpha \cdot A_i + \beta \cdot B_i
```
In NumPy terms:
```py
import numpy as np

def wsum(A: np.ndarray, B: np.ndarray, Alpha: float, Beta: float) -> np.ndarray:
    """Weighted sum: Alpha * A + Beta * B, cast back to the input dtype."""
    assert A.dtype == B.dtype, "Input types must match and define the output type"
    return (Alpha * A + Beta * B).astype(A.dtype)

def fma(A: np.ndarray, B: np.ndarray, C: np.ndarray, Alpha: float, Beta: float) -> np.ndarray:
    """Fused multiply-add: Alpha * A * B + Beta * C, cast back to the input dtype."""
    assert A.dtype == B.dtype and A.dtype == C.dtype, "Input types must match and define the output type"
    return (Alpha * A * B + Beta * C).astype(A.dtype)
```
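For example, with `u8` inputs and floating-point scaling factors, the arithmetic happens at higher precision and the result is cast back to `u8`, which is the mixed-precision case highlighted above. A quick illustration of the reference functions, using made-up values:

```py
import numpy as np

# Made-up u8 inputs and scaling factors, purely illustrative;
# `wsum` and `fma` are the reference functions defined above.
a = np.array([10, 200, 30], dtype=np.uint8)
b = np.array([40, 50, 60], dtype=np.uint8)
c = np.array([7, 8, 9], dtype=np.uint8)

print(wsum(a, b, Alpha=0.5, Beta=0.5))     # [ 25 125  45], still u8
print(fma(a, b, c, Alpha=0.01, Beta=1.0))  # [ 11 108  27], still u8
```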
This tiny set of operations is enough to implement a wide range of algorithms:
- To scale a vector by a scalar, just call **WSum** with $\beta = 0$.
- To sum two vectors, just call **WSum** with $\alpha = \beta = 1$.
- To average two vectors, just call **WSum** with $\alpha = \beta = 0.5$.
- To multiply vectors element-wise, just call **FMA** with $\beta = 0$.
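In the same NumPy terms, each of these reduces to a one-line wrapper. A minimal sketch reusing the reference `wsum` and `fma` functions defined above; the wrapper names are made up for illustration:

```py
import numpy as np

def scale(A: np.ndarray, Alpha: float) -> np.ndarray:
    # Scale a vector by a scalar: WSum with Beta = 0 (the second operand is ignored).
    return wsum(A, A, Alpha=Alpha, Beta=0.0)

def add(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Sum two vectors: WSum with Alpha = Beta = 1.
    return wsum(A, B, Alpha=1.0, Beta=1.0)

def average(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Average two vectors: WSum with Alpha = Beta = 0.5.
    return wsum(A, B, Alpha=0.5, Beta=0.5)

def multiply(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Multiply vectors element-wise: FMA with Beta = 0 (the accumulator C is ignored).
    return fma(A, B, A, Alpha=1.0, Beta=0.0)
```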
### Benchmarks
On Intel Sapphire Rapids:
```sh
Run on (16 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 2048 KiB (x8)
  L3 Unified 61440 KiB (x1)
```