* Add bfloat16 support to gemlite kernels by mobicham in https://github.com/mobiusml/gemlite/pull/24
0.4.3
- Add faster packing / unpacking utils
- Set `MIN_SIZE = 64` for Gemma 3
- Update caches
0.4.2.post1
- Avoid recompilation when the batch-size `M` changes: https://github.com/mobiusml/gemlite/commit/dcc2455d6187ec58338d1746e4c780f0718f70c3
- Expose autotune `M` logic via `set_autotune_setting()`: https://github.com/mobiusml/gemlite/commit/37dab275d95bbfa67aa7dac718b123dbfad054a4
- Fix bug related to config caching that was ignoring the pre-loaded cache: https://github.com/mobiusml/gemlite/commit/3c4ab53c54b55e00f94223a5eadedfcee1815f1f
0.4.2
* Auto-load pre-warmed caches for A100, H100, 4090, A6000 Ada.
* Auto-set FP16 acc dtype for consumer GPUs.
* Enable/disable cache overwriting while loading.
* Fix `splitK_gemv` bug with large block-sizes (Flux).
* Force `M` to powers of 2 to avoid re-compilation during the prefill phase.
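
The idea behind forcing `M` to powers of 2 can be sketched as follows. This is a minimal illustration, not gemlite's actual code: the helper name `bucket_m` is hypothetical, and it assumes the kernel is specialized (compiled or autotuned) once per distinct `M` value it sees.

```python
def bucket_m(m: int) -> int:
    """Round a batch size M up to the next power of 2 (hypothetical helper)."""
    if m < 1:
        raise ValueError("M must be a positive integer")
    # (m - 1).bit_length() is the number of bits needed to represent m - 1,
    # so shifting 1 by that amount gives the smallest power of 2 >= m.
    return 1 << (m - 1).bit_length()


# During prefill, M takes many different values (one per prompt length).
# Bucketing collapses them into a handful of specialization keys.
prompt_lengths = range(1, 1025)
buckets = sorted({bucket_m(m) for m in prompt_lengths})
print(buckets)
```

Under that assumption, 1024 distinct prompt lengths map onto only 11 buckets, so at most 11 specializations are needed instead of one per length.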
0.4.1
Fix bugs related to config caching.
0.4.0
* Improved performance on the A100 and H100.
* Flexible bitpacking support (32-bit / 8-bit, over cols or rows).
* Best config caching over all kernels.
* Helper functions for easier usage.
* `GEMV_SPLITK` kernel for better performance at batch-size=1 with non-packed data.
* Improved accuracy via dumping for 8-bit weights with GEMV kernels.
* Max-autotuning.
* Avoid out-of-shared-memory by limiting `num_stages` based on the GPU device.
* Various bug fixes.
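
For readers unfamiliar with bitpacking, here is a rough pure-Python sketch of packing 4-bit values into 32-bit or 8-bit words. The function names, the low-nibble-first ordering, and the flat list layout are all assumptions made for illustration; gemlite's real packing runs on tensors and its layout may differ.

```python
def pack_4bit(values, word_bits=32):
    """Pack 4-bit integers into words of `word_bits` bits (hypothetical helper)."""
    per_word = word_bits // 4  # 8 values per 32-bit word, 2 per 8-bit word
    if len(values) % per_word != 0:
        raise ValueError("number of values must be a multiple of values-per-word")
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, value in enumerate(values[start:start + per_word]):
            if not 0 <= value < 16:
                raise ValueError("values must fit in 4 bits")
            # Assumed ordering: first value goes in the lowest 4 bits.
            word |= value << (4 * slot)
        words.append(word)
    return words


def unpack_4bit(words, word_bits=32):
    """Inverse of pack_4bit: extract the 4-bit values from each word."""
    per_word = word_bits // 4
    return [(word >> (4 * slot)) & 0xF for word in words for slot in range(per_word)]


weights = [1, 2, 3, 4, 5, 6, 7, 8, 15, 0, 15, 0, 9, 9, 9, 9]
packed32 = pack_4bit(weights, word_bits=32)  # 2 words
packed8 = pack_4bit(weights, word_bits=8)    # 8 words
assert unpack_4bit(packed32, 32) == weights
assert unpack_4bit(packed8, 8) == weights
```

Packing over columns versus rows would correspond to choosing which axis of the weight matrix the consecutive values are taken from before they are combined into a word.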