PyTokenCounter

Latest version: v1.7.0

1.5.1

This patch release introduces `tc` as an additional CLI entry point, providing a more convenient alias for `tokencount`.

**Changes**
- Added `tc` as a recognized command-line entry point alongside `tokencount`.
- Updated documentation to reflect this addition.

This update improves usability while maintaining full compatibility with existing functionality.

1.5.0

This release introduces expanded file handling options, improved CLI functionality, enhanced error handling, and a license update from MIT to GPL v3.

**New Features & Enhancements**

**Expanded File Handling Capabilities**
- Added `excludeBinary` and `includeHidden` parameters to:
  - `TokenizeDir`, `GetNumTokenDir`, `TokenizeFiles`, and `GetNumTokenFiles`.
- Allows users to explicitly include or exclude binary and hidden files when processing directories and files (see the sketch below).
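
A minimal sketch of the new options; the import path and the positional directory argument are assumptions, while `excludeBinary` and `includeHidden` come from the notes above:

```python
# Sketch only: the import path and call shape are assumptions.
import PyTokenCounter as ptc

# Count tokens under ./docs, skipping binary files but including hidden ones.
total = ptc.GetNumTokenDir(
    "./docs",
    excludeBinary=True,   # skip binary files (pass False to process them)
    includeHidden=True,   # hidden files are only processed when this is True
)
print(total)
```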

**CLI Enhancements**
- New command-line options in `cli.py`:
  - `--no-exclude-binary` (`-b`): Allows processing of binary files.
  - `--include-hidden` (`-H`): Enables tokenization of hidden files.
- Updated `README.md`:
  - Documented the new CLI options.
  - Improved CLI examples for clarity.
  - Added a structured table for ignored binary files to improve readability.

**Improved Exception Handling & Logging**
- Refined `UnsupportedEncodingError` handling:
  - Now utilizes `rich.panel` for structured and readable error messages.
  - Prevents redundant error display by tracking `_printed` status.
- Improved error handling in `ReadTextFile` (see the sketch below):
  - Now explicitly catches `UnicodeDecodeError` when reading files.
  - Ensures the file that caused the error is recorded in the error message for better debugging.
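
A caller-side sketch of how the refined error might be handled; the exception's import location is an assumption:

```python
# Illustrative only: the import path of UnsupportedEncodingError is assumed.
import PyTokenCounter as ptc
from PyTokenCounter import UnsupportedEncodingError

try:
    tokens = ptc.TokenizeFile("data/archive.bin")
except UnsupportedEncodingError as err:
    # The message now records which file failed, so it can be logged or skipped.
    print(f"Skipping unreadable file: {err}")
```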

**License Update: MIT → GPL v3**
- The project has transitioned from the MIT License to GNU General Public License v3 (GPL v3).

1.4.0

This release introduces several enhancements to improve tokenization workflows for LLMs:

New Features:
1. **Map Tokens to String Values:**
   - Added the `mapTokens` option to the `tokenize` and `count` functions, allowing token IDs to be mapped to their string values (see the sketch after this list).
   - Introduced the standalone `MapTokens` function for independent token mapping.
   - Added `mapTokens` support to the CLI.

2. **Default Model Update:**
   - The default model for all functions and CLI operations is now `"gpt-4o"`.

3. **CLI Enhancements:**
   - Added an `--output` option to save results to a JSON file.
   - Introduced colored logging for better visibility of CLI outputs.

4. **Expanded Model and Encoding Support:**
   - Added support for more models and encodings, ensuring compatibility with a wider range of LLMs.
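
As referenced above, a sketch of the `mapTokens` workflow; `TokenizeStr` is an assumed function name, and the exact return shapes are not specified in these notes:

```python
# Sketch only: `TokenizeStr` is an assumed name; `mapTokens` and `MapTokens`
# come from the notes above. The model argument is omitted to use the new
# "gpt-4o" default.
import PyTokenCounter as ptc

text = "Hello, world!"

# Token IDs mapped to their string values in a single call.
mapped = ptc.TokenizeStr(text, mapTokens=True)

# Or tokenize first and map the IDs independently.
ids = ptc.TokenizeStr(text)
mapped2 = ptc.MapTokens(ids)
```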

For questions or feedback, feel free to [open an issue](https://github.com/kgruiz/PyTokenCounter/issues/new).

1.3.0

Version 1.3.0 introduces several key updates to enhance functionality, usability, and reliability. This release focuses on refining encoding handling, adding useful CLI features, and improving user experience with visual progress feedback.

Key Updates:

- **Progress Bars**: Implemented using the `rich` library, progress bars provide clear, real-time feedback for longer operations, making processes easier to track.
- **Quiet Mode**: A new CLI option to suppress unnecessary output for a streamlined experience.

Bug Fix:
A critical issue with character handling has been resolved. Previously, files were read in their correct encoding, but `tiktoken` expects UTF-8 input. This mismatch caused problems with special characters like `é` and differences between straight and typographic apostrophes (`'` vs. `’`), leading to errors and inconsistent replacements. The input handling has been updated to ensure consistent UTF-8 processing, eliminating these issues.

Updated Methods:
Several methods have been introduced or updated to improve flexibility when working with encodings and models (a short usage sketch follows the list):
- `GetEncodingNameForModel`: Returns the encoding name as a string for a specified model.
- `GetEncodingForModel`: Outputs the `tiktoken.Encoding` object for a given model.
- `GetModelForEncodingName`: Maps encoding names to their corresponding model names.
- `GetModelForEncoding`: Maps a `tiktoken.Encoding` object to its model name.
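
A short usage sketch of these helpers; the import path and the example return values shown in the comments are assumptions:

```python
# Sketch only: import path and example return values are assumptions.
import PyTokenCounter as ptc

print(ptc.GetEncodingNameForModel("gpt-4"))        # e.g. "cl100k_base"
enc = ptc.GetEncodingForModel("gpt-4")             # a tiktoken.Encoding object
print(ptc.GetModelForEncodingName("cl100k_base"))  # maps back to the model name(s)
print(ptc.GetModelForEncoding(enc))                # same, from the Encoding object
```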

New CLI Commands:
- `get-model`: Retrieve the model name from a given encoding.
- `get-encoding`: Retrieve the encoding name from a given model.

Testing Framework:
A comprehensive testing suite has been introduced, ensuring greater reliability and robustness for all features.

Version 1.3.0 represents a substantial improvement, addressing critical issues while introducing new tools and enhancements. For detailed information, refer to the [documentation](https://github.com/kgruiz/PyTokenCounter#readme).

1.2.1

What's Changed
- **Fixed error handling**: `TokenizeFile` (and possibly other functions) caught an `UnsupportedEncodingError` that was never actually thrown. `ReadTextFile` signaled non-text files by returning a tuple `(None, encoding)` instead of raising an error, so `TokenizeStr` (and others) received the tuple instead of a valid string.
- **Improved logic**: `ReadTextFile` now raises an `UnsupportedEncodingError` instead of returning a tuple, which fixes the error handling issue and makes the logic cleaner (see the sketch below).
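
A simplified illustration of the new behavior (not the package's actual code): raise on undecodable input instead of returning a sentinel tuple.

```python
# Simplified sketch, not the actual implementation.
class UnsupportedEncodingError(Exception):
    """Raised when a file cannot be read as text."""

def ReadTextFileSketch(filePath: str) -> str:
    try:
        with open(filePath, "r", encoding="utf-8") as fileObj:
            return fileObj.read()
    except UnicodeDecodeError as err:
        # Raising keeps callers such as TokenizeStr from ever receiving a (None, encoding) tuple.
        raise UnsupportedEncodingError(f"{filePath}: {err}") from err
```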

Notes
This patch release addresses a specific bug in error handling. Updating is recommended for smoother and more consistent behavior.

1.2.0

This update improves the `TokenizeFiles` and `GetNumTokenFiles` functions, as well as their CLI commands (`tokenize-files` and `count-files`):

- **New Input Options**:
  - Both `TokenizeFiles` and `GetNumTokenFiles` now support (see the sketch below):
    - A directory path.
    - A single file.
    - The original list of files.

- **Better Error Handling**:
  - Added checks to ensure non-text files are not processed, making the functions more reliable.

These changes add new features while keeping everything compatible with previous versions.
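
As noted above, a sketch of the three accepted input forms; the import path is an assumption:

```python
# Sketch only: import path is assumed; the three input forms come from the notes above.
import PyTokenCounter as ptc

total = ptc.GetNumTokenFiles("./src")                        # a directory path
total = ptc.GetNumTokenFiles("./README.md")                  # a single file
total = ptc.GetNumTokenFiles(["./README.md", "./cli.py"])    # the original list of files
```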
