Tokeniser-py

Latest version: v0.1.2


0.1.2

Changed
- Updated the project URL to point to the actual GitHub repository.

Notes
- Built on top of a **custom token creation algorithm** that is not based on any standard BPE/WordPiece method
- Vocabulary extracted from the SlimPajama dataset
- Token count files are kept under 2 GB for compatibility with Git LFS (and Hugging Face storage)

0.1.1

Changed
- Updated the default import example in the README to show creating a class instance with default parameters.

0.1.0

Added
- Initial release of custom tokeniser library
- Tokeniser class with support for:
  - `tokenise()` using DP segmentation
  - Custom token map and count map loading
  - One-hot encoding support (NumPy & PyTorch)
  - Token and token ID visualisation functions
  - `token_map()`, `token_count_map()`, `max_token_length()` accessors
- Full support for:
  - 0.5B val-only vocab
  - 1B val + test vocab
- JSON-based token and count maps from SlimPajama corpus
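
Example
- The changelog does not describe how `tokenise()` scores candidate splits, so the following is only a generic sketch of dynamic-programming (DP) segmentation, not the library's implementation. It assumes a "fewest tokens" objective with single characters as a fallback. The function `segment`, the toy `vocab`, and the `max_len` parameter are hypothetical names used for illustration.

```python
def segment(text, vocab, max_len):
    """Split `text` into the fewest tokens using dynamic programming.

    Assumption: a piece is usable if it is in `vocab`, or if it is a single
    character (fallback so that every input can be segmented).
    """
    n = len(text)
    inf = float("inf")
    best = [0] + [inf] * n   # best[i]: minimum token count for text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token in that split

    for end in range(1, n + 1):
        # Only look back as far as the longest token allows.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab or end - start == 1:
                candidate = best[start] + 1
                if candidate < best[end]:
                    best[end] = candidate
                    back[end] = start

    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Toy vocabulary (hypothetical); real token maps are loaded from JSON files.
vocab = {"token", "tokenis", "is", "er"}
tokens = segment("tokeniser", vocab, max_len=7)
print(tokens)
```

  With this toy vocabulary the two-token split (`tokenis` + `er`) is preferred over the three-token split (`token` + `is` + `er`), because the sketch minimises the number of tokens.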
