1. Update imports:

    ```python
    # Old
    from smolbpe.gpt4Tokenizer import GPT4Tokenizer
    # New
    from smolbpe.tokenizer import Tokenizer
    ```
2. Update class initialization:

    ```python
    # Old
    tokenizer = GPT4Tokenizer(output='vocab.json')
    # New
    tokenizer = Tokenizer(
        output='vocab.json',
        special_tokens=['<|start|>', '<|end|>']  # Optional
    )
    ```
3. Update CLI commands:

    ```sh
    # Old
    gpt4tokenizer --text input.txt --vocab_size 400
    # New
    tokenizer --text input.txt --vocab_size 400 --special_tokens "<|start|>" "<|end|>"
    ```
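For context on what `special_tokens` does: special tokens are typically split out of the input before byte-pair merging so each one always maps to a single, fixed id and is never broken apart by merges. A minimal regex-based sketch of that idea (illustrative only, not smolbpe's actual implementation; the ids are made up):

```python
import re

# Hypothetical ids for the special tokens (not smolbpe's real assignments).
special_tokens = {'<|start|>': 256, '<|end|>': 257}

# A capturing group in re.split keeps the matched delimiters in the output.
pattern = '(' + '|'.join(re.escape(t) for t in special_tokens) + ')'

def split_specials(text):
    # Special tokens come back as standalone chunks; everything else
    # would then be byte-pair encoded as ordinary text.
    return [chunk for chunk in re.split(pattern, text) if chunk]

print(split_specials("<|start|>Hello world!<|end|>"))
# ['<|start|>', 'Hello world!', '<|end|>']
```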
## Example Usage

```python
from smolbpe.tokenizer import Tokenizer

# Initialize with special tokens
tokenizer = Tokenizer(
    output='vocab.json',
    special_tokens=['<|start|>', '<|end|>']
)

# Train on your data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokenizer.train(text, vocab_size=400)

# Encode text with special tokens
text = "<|start|>Hello world!<|end|>"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print(decoded)  # "<|start|>Hello world!<|end|>"
```
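For intuition about what `train` is doing: BPE training repeatedly finds the most frequent adjacent pair of ids in the byte sequence and merges it into a new token id. A self-contained sketch of one merge step (illustrative only, not smolbpe's code; the helper names are hypothetical):

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent id pairs and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aababab".encode('utf-8'))   # [97, 97, 98, 97, 98, 97, 98]
pair = most_frequent_pair(ids)          # (97, 98): the bytes for 'a', 'b'
ids = merge(ids, pair, 256)             # first new id after the 256 raw bytes
print(ids)  # [97, 256, 256, 256]
```

Real training simply repeats this until the vocabulary reaches `vocab_size`, recording each merge so `encode` can replay it.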
## Bug Fixes
- Fixed empty statistics handling during training
- Improved Unicode character handling
- Better error messages for invalid inputs
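The Unicode handling matters because a byte-level BPE tokenizer operates on UTF-8 bytes rather than characters, so every character, including multi-byte ones like accented letters and emoji, reduces to byte values 0-255. A quick stdlib illustration:

```python
# Any Unicode text maps to a sequence of UTF-8 byte values (0-255),
# so a byte-level tokenizer needs no per-character vocabulary.
text = "héllo 🙂"
ids = list(text.encode('utf-8'))
print(len(text), len(ids))  # 7 characters, 11 bytes

# Decoding reverses the mapping losslessly.
assert bytes(ids).decode('utf-8') == text
```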
## Documentation Updates
- Added examples for special tokens usage
- Updated CLI documentation
- Improved code comments
## Contributors
- Vover - Core development and maintenance
## Links
- [GitHub Repository](https://github.com/T4ras123/SmolBPE)
- [PyPI Package](https://pypi.org/project/smolbpe/)
- [Issue Tracker](https://github.com/T4ras123/SmolBPE/issues)
---
For any issues or questions, please [open an issue](https://github.com/T4ras123/SmolBPE/issues/new) on GitHub.