Supports
- Simple RNNT loss with Atomic Locks implementation
Improvements
- Improve runtime speed of numba loss
- Fix slow data movement of the costs tensor from llForward to the PyTorch data view in numba
  - Previously this was a linear host-side loop (scaling with batch size) that cost roughly 10x as much as the kernels themselves.
  - Fixed by writing a small kernel that copies the data and updates the costs.
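
A minimal sketch of the copy kernel idea described above, not the actual implementation. The names `copy_negated_costs` and `update_costs` are hypothetical, and the "update" step is assumed to be a negation (cost = negative log-likelihood). One thread handles one batch element, so the per-sample host loop is replaced by a single kernel launch. A NumPy fallback is included only so the sketch also runs on machines without CUDA.

```python
import numpy as np

try:
    from numba import cuda
    HAS_CUDA = cuda.is_available()
except Exception:  # numba missing or CUDA driver unusable
    HAS_CUDA = False

if HAS_CUDA:
    @cuda.jit
    def copy_negated_costs(ll_forward, costs):
        # Hypothetical kernel: one thread per batch element.
        idx = cuda.grid(1)
        if idx < ll_forward.shape[0]:
            # Assumed update: the cost is the negative log-likelihood.
            costs[idx] = -ll_forward[idx]


def update_costs(ll_forward, costs):
    """Write -ll_forward into costs with a single launch instead of a host loop."""
    n = ll_forward.shape[0]
    if HAS_CUDA:
        threads = 32
        blocks = max(1, (n + threads - 1) // threads)
        # In a real loss both buffers would already live on the device;
        # the transfers here exist only to keep the sketch self-contained.
        d_ll = cuda.to_device(ll_forward)
        d_costs = cuda.to_device(costs)
        copy_negated_costs[blocks, threads](d_ll, d_costs)
        d_costs.copy_to_host(costs)
    else:
        # CPU fallback with the same semantics as the kernel.
        np.negative(ll_forward, out=costs)
    return costs


if __name__ == "__main__":
    ll = np.array([-1.5, -2.0, -0.25], dtype=np.float32)
    print(update_costs(ll, np.zeros_like(ll)))
```

The benefit comes from keeping both buffers on the device: a per-element loop over device memory pays launch or synchronization overhead on every iteration, whereas the kernel pays it once per batch.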