Added
- Initial release of the custom tokeniser library
- `Tokeniser` class with support for:
  - `tokenise()` using DP segmentation (see the first sketch after this list)
  - Custom token map and count map loading
  - One-hot encoding support (NumPy & PyTorch; see the second sketch below)
  - Token and token ID visualisation functions
  - `token_map()`, `token_count_map()`, `max_token_length()` accessors
- Full support for:
  - 0.5B val-only vocab
  - 1B val + test vocab
  - JSON-based token and count maps from the SlimPajama corpus (see the loading sketch below)
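The DP segmentation used by `tokenise()` can be illustrated with a minimal Viterbi-style sketch. This is not the library's actual implementation: the function name `dp_tokenise`, the `token_counts` dict, and the unigram log-probability scoring are all illustrative assumptions.

```python
import math

def dp_tokenise(text, token_counts, max_token_length):
    """Maximum-likelihood segmentation of `text` under a unigram
    model estimated from corpus counts (illustrative sketch)."""
    total = sum(token_counts.values())
    n = len(text)
    # best[i] = (log-prob, back-pointer) for the best split of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_length), i):
            piece = text[j:i]
            if piece in token_counts and best[j][0] > -math.inf:
                score = best[j][0] + math.log(token_counts[piece] / total)
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, i = [], n
    while i > 0:  # walk back-pointers to recover the token sequence
        j = best[i][1]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]

# e.g. dp_tokenise("unhappyness",
#                  {"un": 50, "happy": 80, "unhappy": 5, "ness": 30}, 7)
# -> ['un', 'happy', 'ness']
```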
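One-hot encoding of token IDs reduces to indexing an identity matrix in NumPy or calling the built-in helper in PyTorch. The `ids` and `vocab_size` inputs below are assumptions for illustration, not the library's real signature.

```python
import numpy as np
import torch
import torch.nn.functional as F

ids = [3, 0, 2]   # example token IDs
vocab_size = 5

# NumPy: index an identity matrix by the ID sequence -> shape (3, 5)
one_hot_np = np.eye(vocab_size, dtype=np.float32)[ids]

# PyTorch: F.one_hot returns int64, so cast for use as model input
one_hot_pt = F.one_hot(torch.tensor(ids), num_classes=vocab_size).float()
```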
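Loading the JSON token and count maps might look like the following sketch; the file names and the `{"token": id}` / `{"token": count}` layout are assumptions about the shipped files, not a documented format.

```python
import json

with open("token_map.json", encoding="utf-8") as f:
    token_map = json.load(f)      # token string -> integer ID
with open("token_count_map.json", encoding="utf-8") as f:
    token_counts = json.load(f)   # token string -> corpus count

# Longest key bounds the DP window during segmentation
max_token_length = max(len(t) for t in token_map)
```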