- Added SentencePiece support for tokenizing large datasets. By default, if a dataset is larger than 100,000 lines, a random sample of 100,000 lines is used to build the tokenizer (a minimal sketch of this flow appears after this list).
- Switched to a `loky`-based backend for multiprocessing. This fixes several runtime bugs on Windows and improves overall stability for parallel text generation (see the `loky` sketch after this list).
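
A minimal sketch of the sample-then-train flow described in the first item, assuming the standard `sentencepiece` Python package. The file names, `MAX_LINES` constant, and `vocab_size` are illustrative assumptions, not the library's actual internals (beyond the 100,000-line sample size noted above):

```python
import random

import sentencepiece as spm

MAX_LINES = 100_000  # sample threshold described above

# Hypothetical input path for illustration.
with open("dataset.txt", encoding="utf-8") as f:
    lines = f.readlines()

# If the dataset exceeds the threshold, train on a random 100k-line sample.
if len(lines) > MAX_LINES:
    lines = random.sample(lines, MAX_LINES)

with open("tokenizer_sample.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)

# Train a SentencePiece tokenizer on the (possibly sampled) text.
spm.SentencePieceTrainer.train(
    input="tokenizer_sample.txt",
    model_prefix="tokenizer",
    vocab_size=8_000,  # illustrative; choose to suit your dataset
)
```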
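
And a sketch of fanning generation out across processes with `loky`'s reusable executor; `generate_text` and the prompts are hypothetical stand-ins for the library's actual generation call:

```python
from loky import get_reusable_executor


def generate_text(prompt: str) -> str:
    # Hypothetical stand-in for the library's generation call.
    return f"generated continuation of: {prompt}"


if __name__ == "__main__":
    prompts = ["Once upon a time", "In a galaxy", "The quick brown fox"]

    # loky spawns (rather than forks) worker processes and serializes
    # tasks with cloudpickle, which is what makes it reliable on Windows.
    executor = get_reusable_executor(max_workers=4)
    for result in executor.map(generate_text, prompts):
        print(result)
```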