Latest version: v0.1.2
The information on this page was curated by experts in our Cybersecurity Intelligence Team.
A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (val) and 1B (val+test) tokens of SlimPajama. It uses a novel token generation algorithm and a dynamic programming-based segmentation method for fast, interpretable tokenisation. The segmentation method can also be used to tokenise text with custom token maps.
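To illustrate what dynamic programming-based segmentation over a custom token map can look like, here is a minimal sketch. It is not this package's API or its actual algorithm: the function name `segment`, the `token_map` argument (token string to ID), and the objective (fewest tokens covering the text) are all assumptions made for the example.

```python
from typing import Dict, List


def segment(text: str, token_map: Dict[str, int]) -> List[int]:
    """Split `text` into the fewest tokens found in `token_map`.

    Hypothetical helper for illustration only. The fewest-tokens
    objective is an assumption, not the package's documented behaviour.
    """
    # Longest token bounds how far back each step has to look.
    max_len = max((len(tok) for tok in token_map), default=0)
    n = len(text)
    inf = float("inf")

    # best[i]: minimum number of tokens needed to cover text[:i].
    # back[i]: start index of the last token in that best cover.
    best = [inf] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and text[start:end] in token_map:
                best[end] = best[start] + 1
                back[end] = start

    if best[n] == inf:
        raise ValueError("text cannot be segmented with this token map")

    # Walk the back-pointers from the end to recover the token IDs.
    ids: List[int] = []
    pos = n
    while pos > 0:
        start = back[pos]
        ids.append(token_map[text[start:pos]])
        pos = start
    ids.reverse()
    return ids
```

With the made-up map `{"a": 0, "b": 1, "ab": 2, "abc": 3, "c": 4, "bc": 5}`, `segment("abcab", token_map)` returns `[3, 2]`, that is `"abc"` followed by `"ab"`. Because each position records which token ended there, the chosen split can be inspected directly, which is one way such a method stays interpretable.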
No known vulnerabilities found