Merge-tokenizers

Latest version: v0.0.6

0.0.6

This new version adds:

- The option to skip the `get_spans` step in `GreedyCoverage` aligners by passing spans already computed with `token_to_chars` from HuggingFace tokenizers. This results in faster execution of the algorithms.
- An updated benchmark, rerun to reflect the timing changes.
- An extended explanation of the `GreedyCoverage` aligners in the README.
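
To illustrate what a span-computation step does and why precomputed spans can replace it, here is a minimal sketch in plain Python. The function name `compute_spans` and its left-to-right search are assumptions made for illustration; they are not the library's `get_spans` implementation, and the exact argument used to pass precomputed spans to the aligners is not shown in these notes.

```python
def compute_spans(text, tokens):
    """Locate each token in `text`, scanning left to right.

    Hypothetical stand-in for a span-computation step: returns one
    (start, end) character span per token.
    """
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            # Token not found verbatim (e.g. a special token):
            # record an empty span at the current position.
            spans.append((cursor, cursor))
            continue
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = "tokenizers differ"
tokens = ["token", "izers", "differ"]
spans = compute_spans(text, tokens)
print(spans)
```

When a tokenizer already reports character offsets for its tokens, as HuggingFace fast tokenizers do through `token_to_chars`, a search like this is redundant, which is the saving the first item above describes.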

0.0.5

The `0.0.5` version of `merge-tokenizers` adds:

- A new fast, high-quality greedy algorithm based on text coverage and span overlapping. Two implementations are provided: `GreedyCoverageAligner` (C) and `PythonGreedyCoverageAligner` (Python).
- A fix for an import related to the `Alignment` class.
- An improved README.
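
The idea of aligning two tokenizations through overlapping character spans can be sketched as follows. This is a simplified illustration, not the `GreedyCoverageAligner` algorithm itself: the function name `align_by_overlap` is hypothetical, and the sketch assumes both span lists are sorted and non-overlapping within each tokenization.

```python
def align_by_overlap(spans_a, spans_b):
    """For each span in `spans_a`, list the indices of the spans in
    `spans_b` that share at least one character with it.

    Assumes both lists hold (start, end) pairs sorted by position.
    """
    alignment = []
    j = 0
    for start_a, end_a in spans_a:
        # Skip spans in b that end before this span in a begins.
        while j < len(spans_b) and spans_b[j][1] <= start_a:
            j += 1
        matched = []
        k = j
        # Collect every span in b that starts before this span in a ends.
        while k < len(spans_b) and spans_b[k][0] < end_a:
            matched.append(k)
            k += 1
        alignment.append(matched)
    return alignment


# "tokenizers differ" split two different ways:
spans_a = [(0, 5), (5, 10), (11, 17)]           # token | izers | differ
spans_b = [(0, 3), (3, 10), (11, 14), (14, 17)]  # tok | enizers | dif | fer
print(align_by_overlap(spans_a, spans_b))
# → [[0, 1], [1], [2, 3]]
```

Because both lists are scanned in order and the outer pointer never moves backwards, the sketch stays close to linear time for typical tokenizations.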

0.0.4

This is the first version of `merge-tokenizers`. It contains:

- Five aligners (DTW, Fast-DTW, greedy, word_ids, and Tamuhey) that can be used to align different tokenizations and merge token-level features.
- Examples and benchmark.
- Implementations of token distances.
- Heuristics to avoid useless computations.
- Base class to implement custom aligners.
- Brief documentation in the README.
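
As a rough illustration of how a DTW-based aligner and feature merging fit together, here is a self-contained sketch. It does not use the package's API: `token_distance`, `dtw_align`, and `merge_features` are hypothetical names, the distance is a toy substring heuristic, and averaging is only one possible way to merge features.

```python
def token_distance(a, b):
    """Toy token distance (an assumption for this sketch):
    0 for equal tokens, 0.5 if one contains the other, 1 otherwise."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0.0
    if a in b or b in a:
        return 0.5
    return 1.0


def dtw_align(tokens_a, tokens_b, distance=token_distance):
    """Classic dynamic time warping; returns the warping path as a
    list of (index_in_a, index_in_b) pairs."""
    n, m = len(tokens_a), len(tokens_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = distance(tokens_a[i - 1], tokens_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def merge_features(path, features_b, n_tokens_a):
    """Average the features of all b-tokens aligned to each a-token."""
    buckets = [[] for _ in range(n_tokens_a)]
    for i, j in path:
        buckets[i].append(features_b[j])
    return [sum(b) / len(b) if b else None for b in buckets]


path = dtw_align(["playing"], ["play", "ing"])
print(path)                                  # → [(0, 0), (0, 1)]
print(merge_features(path, [1.0, 3.0], 1))   # → [2.0]
```

The full cost matrix makes this quadratic in the number of tokens, which is the kind of cost that faster variants and pruning heuristics aim to reduce.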
