### Added

- Add option to pre-build LUTs for most operations, improving performance for repeated operations with the same sparsity layout
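The idea behind pre-built LUTs can be sketched in plain Python (this is a hypothetical illustration, not this library's actual API): a lookup table maps each nonzero block of a sparsity layout to its position, so repeated operations on the same layout can reuse the table instead of re-scanning the layout on every call.

```python
# Hypothetical sketch (not this library's API): pre-build a lookup table
# (LUT) from a block-sparsity layout so that repeated operations with the
# same layout skip the per-call scan over the layout.

def build_lut(layout):
    """Map each nonzero sparsity block to its (row, col) position.

    `layout` is a 2D list of 0/1 flags, one per sparsity block; entry i
    of the LUT tells kernel instance i which block it owns.
    """
    lut = []
    for row, row_flags in enumerate(layout):
        for col, flag in enumerate(row_flags):
            if flag:
                lut.append((row, col))
    return lut

# A 3x3 block layout with 4 nonzero blocks:
layout = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]
lut = build_lut(layout)
# Kernel instance i reads lut[i] instead of re-scanning the layout;
# pre-building once amortizes this cost across repeated operations.
```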
### Changed

- Rework all operations to use the new `torch.library.triton_op()` approach, enabling JIT compilation and better compatibility
- Rework kernels to support Triton block sizes larger than the sparsity block size via masking
- Rework kernels to autotune Triton block sizes rather than using fixed block sizes
- Rework operations to support dtype autocasting
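The masking change above can be illustrated with a minimal pure-Python sketch (hypothetical, not this library's code): when the compute block is larger than the sparsity block, out-of-range lanes are masked off, mirroring the `mask=` argument of Triton's `tl.load`/`tl.store`.

```python
# Hypothetical sketch: a Triton-style block larger than the sparsity
# block only touches valid elements because out-of-range offsets are
# masked, analogous to tl.load(ptr + offs, mask=offs < n).

def masked_indices(triton_block_size, sparsity_block_size):
    """Return the offsets within a compute block that fall inside the
    (possibly smaller) sparsity block; the rest are masked off."""
    offs = list(range(triton_block_size))
    mask = [o < sparsity_block_size for o in offs]
    return [o for o, m in zip(offs, mask) if m]

# With an autotuned compute block of 8 over a sparsity block of 5,
# only the first 5 lanes participate in loads and stores:
valid = masked_indices(8, 5)
```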
### Removed

- Remove manual specification of Triton block sizes