simsimd Changelog

5.1.1

![Preview](https://github.com/ashvardanian/www.ashvardanian.com/blob/main/static/discount-on-keyword-arguments-in-python/Discount-on-Keyword-Arguments.jpg?raw=true)

Most importantly, this is the first SimSIMD release deprecating Python 3.6, released in 2016. Now, 8 years later, we deprecated it to more broadly utilize the [Fast Calling Convention](https://docs.python.org/3/c-api/structures.html#c.METH_FASTCALL). Read more in a dedicated article on the cost of function arguments parsing in Pyhton - [35% discount on keyword arguments](https://ashvardanian.com/posts/discount-on-keyword-arguments-in-python/) 😄

- Thanks to stuartatnosible for noticing `dtype=` issues 👓
- Thanks to MarkReedZ for accelerating `bf16` dot-product on Arm 🦾

5.1.0

Major additions:

- [x] Set Intersections & Sparse Distances 100
- [x] Bilinear Forms & Curved Spaces 149, suggested by guyrosin
- [x] Reduced Memory Consumption 142, suggested by Charlyo
- [x] Fixing accuracy issues 153, spotted by cbornet

Minor fixes:

- Extends benchmarks, thanks to MarkReedZ
- Cleans up CMake, suggested by rschu1ze
- Exposing `f16`, `i8`, `b8` to Python buffers in [33f1b13](https://github.com/ashvardanian/SimSIMD/pull/154/commits/33f1b132816922f87fe898907858ca5d17300f14)
- Documenting algorithms in [405bef0](https://github.com/ashvardanian/SimSIMD/pull/154/commits/405bef0b021eb43a1698360d656650f91c7cad75)
- Exposing `bf16` to Rust, thanks to Wyctus

Some crazy findings:

- `rsqrt` precision on Arm is not documented at all
- `rsqrt` approximation for `double` in AVX-512 is only 6x more accurate, than for `float`
- SciPy raises the wrong errors when overflows - `ValueError` instead of `OverflowError`
- SciPy occasionally misuses `math.sqrt` over `numpy.sqrt` when dealing with NumPy array
- `sqrt` in `libc` [is bit-precise](https://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html)

5.0.1

5.0.0

Handling `bf16` & Dynamic Dispatch on Arm

This major release adds new capability levels for Arm allowing us to differentiate `f16`, `bf16`. and `i8`-supporting generations of CPUs, becoming increasingly popular in the datacenter. Similar to speedups on AMD Genoa, on Arm Graviton3 the `bf16` kernels perform very well:

dot_bf16_neon_1536d/min_time:10.000/threads:1 183 ns 183 ns 76204478 abs_delta=0 bytes=33.5194G/s pairs=5.45563M/s relative_error=0
cos_bf16_neon_1536d/min_time:10.000/threads:1 239 ns 239 ns 58180403 abs_delta=0 bytes=25.7056G/s pairs=4.18386M/s relative_error=0
l2sq_bf16_neon_1536d/min_time:10.000/threads:1 312 ns 312 ns 43724273 abs_delta=0 bytes=19.7064G/s pairs=3.20742M/s relative_error=0

The `bf16` kernels reach 33 GB/s as opposed to 19 GB/s for `f16`:

dot_f16_neon_1536d/min_time:10.000/threads:1 323 ns 323 ns 43311367 abs_delta=82.3015n bytes=19.0324G/s pairs=3.09772M/s relative_error=109.717n
cos_f16_neon_1536d/min_time:10.000/threads:1 367 ns 367 ns 38007895 abs_delta=1.5456m bytes=16.7349G/s pairs=2.72377M/s relative_error=6.19568m
l2sq_f16_neon_1536d/min_time:10.000/threads:1 341 ns 341 ns 41010555 abs_delta=66.7783n bytes=18.0436G/s pairs=2.93679M/s relative_error=133.449n

Researching MMLA Extensions & Future Work

Arm supports 2x2 matrix multiplications for `i8` and `bf16`. All of our initial attempts with eknag to use them for faster cosine computations for different length vectors have failed. Old measurements:

cos_i8_neon_16d/min_time:10.000/threads:1 5.41 ns 5.41 ns 1000000000 abs_delta=910.184u bytes=5.91441G/s pairs=184.825M/s relative_error=4.20295m
cos_i8_neon_64d/min_time:10.000/threads:1 7.63 ns 7.63 ns 1000000000 abs_delta=939.825u bytes=16.7729G/s pairs=131.039M/s relative_error=3.82144m
cos_i8_neon_1536d/min_time:10.000/threads:1 101 ns 101 ns 139085845 abs_delta=917.35u bytes=30.394G/s pairs=9.89387M/s relative_error=3.63925m

Attempts with `i8` for different dimensionality vectors:

cos_i8_neon_16d/min_time:10.000/threads:1 5.72 ns 5.72 ns 1000000000 abs_delta=0.282084 bytes=5.59562G/s pairs=174.863M/s relative_error=1.15086
cos_i8_neon_64d/min_time:10.000/threads:1 8.40 ns 8.40 ns 1000000000 abs_delta=0.234385 bytes=15.2345G/s pairs=119.02M/s relative_error=0.923009
cos_i8_neon_1536d/min_time:10.000/threads:1 117 ns 117 ns 118998604 abs_delta=0.23264 bytes=26.2707G/s pairs=8.55167M/s relative_error=0.920099

This line of work is to be continued, in parallel with new similarity metrics and distance functions.

Other Improvements

- Added the missing Haswell kernels for `f32` spatial distances on older x86 CPUs. Thanks to makarr 👏
- Added Python typing extensions for the native modules. Thanks to cbornet 👏
- Fixed offline builds from the `.tar.gz` packages, that were missing a `VERSION` file. Thanks to OD-Ice 👏

4.4.0

New `bf16` kernels

The "brain-float-16" is a popular machine learning format. It's broadly supported in hardware and is very machine-friendly, but software support is still lagging behind - https://github.com/numpy/numpy/issues/19808. Most importantly, low-precision `bf16` dot-products are supported by the most recent Zen4-based AMD Genoa CPUs. Those have up-to 96 cores, and just one of those cores is capable of computing 86 GB/s worth of such dot-products.

bash
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
dot_bf16_haswell_1536d/min_time:10.000/threads:1 203 ns 203 ns 68785823 abs_delta=29.879n bytes=30.1978G/s pairs=4.91501M/s relative_error=39.8289n
dot_bf16_haswell_1536b/min_time:10.000/threads:1 93.0 ns 93.0 ns 150582910 abs_delta=24.8365n bytes=33.0344G/s pairs=10.7534M/s relative_error=33.1108n
dot_bf16_genoa_1536d/min_time:10.000/threads:1 71.0 ns 71.0 ns 197340105 abs_delta=23.6042n bytes=86.5917G/s pairs=14.0937M/s relative_error=31.4977n
dot_bf16_genoa_1536b/min_time:10.000/threads:1 36.1 ns 36.1 ns 387637713 abs_delta=22.3063n bytes=85.0019G/s pairs=27.6699M/s relative_error=29.7341n
dot_bf16_serial_1536d/min_time:10.000/threads:1 15992 ns 15991 ns 874491 abs_delta=311.896n bytes=384.216M/s pairs=62.5352k/s relative_error=415.887n
dot_bf16_serial_1536b/min_time:10.000/threads:1 7979 ns 7978 ns 1754703 abs_delta=193.719n bytes=385.045M/s pairs=125.34k/s relative_error=258.429n
dot_bf16c_serial_1536d/min_time:10.000/threads:1 16430 ns 16429 ns 852438 abs_delta=251.692n bytes=373.964M/s pairs=60.8665k/s relative_error=336.065n
dot_bf16c_serial_1536b/min_time:10.000/threads:1 8207 ns 8202 ns 1707289 abs_delta=165.209n bytes=374.54M/s pairs=121.92k/s relative_error=220.35n
vdot_bf16c_serial_1536d/min_time:10.000/threads:1 16489 ns 16488 ns 849194 abs_delta=247.646n bytes=372.639M/s pairs=60.6509k/s relative_error=330.485n
vdot_bf16c_serial_1536b/min_time:10.000/threads:1 8224 ns 8217 ns 1704397 abs_delta=162.036n bytes=373.839M/s pairs=121.693k/s relative_error=216.042n

That's a steep 3x improvement over single-precision FMA throughput we can obtain by simply shifting `bf16` left by 16 bits and using `_mm256_fmadd_ps` intrinsic / `vfmadd` instruction available since Intel Haswell.

Faster `i8` kernels

We can't directly use `_mm512_dpbusd_epi32` every time we want to compute a low-precision integer dot-product, as it's asymmetric with respect to the sign of the input arguments:

> Signed(ZeroExtend16(a.byte[4*j]) * SignExtend16(b.byte[4*j]))

In the past we would just upcast to 16-bit integers and resort to `_mm512_dpwssds_epi32`. It is a much more costly multiplication circuit, and, assuming that I avoid loop unrolling, also implies 2x fewer scalars per loop. But for cosine distances there is something simple we can do. Assuming that we multiply the vector by itself, even if a certain vector component is negative, its square will always be positive. So we can avoid the expensive 16-bit operation at least where we compute the vector norms:

c
a_abs_vec = _mm512_abs_epi8(a_vec);
b_abs_vec = _mm512_abs_epi8(b_vec);
a2_i32s_vec = _mm512_dpbusds_epi32(a2_i32s_vec, a_abs_vec, a_abs_vec);
b2_i32s_vec = _mm512_dpbusds_epi32(b2_i32s_vec, b_abs_vec, b_abs_vec);

On Intel Sapphire Rapids it resulted in a higher single-thread utilization, but didn't lead to improvements on other platforms.

sh
---------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
cos_i8_haswell_1536d/min_time:10.000/threads:1 92.4 ns 92.4 ns 151487077 abs_delta=105.739u bytes=33.2344G/s pairs=10.8185M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1 92.4 ns 92.4 ns 151478714 abs_delta=0 bytes=33.2383G/s pairs=10.8198M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1 61.6 ns 61.6 ns 227445214 abs_delta=0 bytes=49.898G/s pairs=16.2428M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1 61.5 ns 61.5 ns 227609621 abs_delta=0 bytes=49.9167G/s pairs=16.2489M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1 299 ns 299 ns 46788061 abs_delta=0 bytes=10.2666G/s pairs=3.34198M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1 299 ns 299 ns 46787275 abs_delta=0 bytes=10.2663G/s pairs=3.34191M/s relative_error=0

New timings:

sh
---------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
cos_i8_haswell_1536d/min_time:10.000/threads:1 92.4 ns 92.4 ns 151463294 abs_delta=105.739u bytes=33.2359G/s pairs=10.819M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1 92.4 ns 92.4 ns 151470519 abs_delta=0 bytes=33.2392G/s pairs=10.82M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1 48.1 ns 48.1 ns 292087642 abs_delta=0 bytes=63.8408G/s pairs=20.7815M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1 48.2 ns 48.2 ns 291716009 abs_delta=0 bytes=63.7662G/s pairs=20.7572M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1 299 ns 299 ns 46784120 abs_delta=0 bytes=10.2647G/s pairs=3.34139M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1 299 ns 299 ns 46781350 abs_delta=0 bytes=10.2654G/s pairs=3.3416M/s relative_error=0

4.3.1

[4.3.1](https://github.com/ashvardanian/SimSIMD/compare/v4.3.0...v4.3.1) (2024-04-10)

Make

* Lighter Rust package ([c3bc06c](https://github.com/ashvardanian/SimSIMD/commit/c3bc06c4181e856f72f57c3f32f14bd7d2938d3d))

Simsimd

Page 9 of 21