Chonkie

Latest version: v0.4.0

0.4.0

Highlights

Added the new `RecursiveChunker` that uses complex recursive rules to create structurally meaningful chunks, maintaining natural separations as much as possible. Try it out~
```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker(chunk_size=512)
chunks = chunker("Woah! Chonkie has its own recursive chunker now~ so cooool!")
```
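
To give a feel for what "recursive rules" can mean, here is a minimal sketch of the general idea, not Chonkie's actual implementation. The function name `recursive_chunk`, the separator order, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
def recursive_chunk(text, chunk_size=8, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer ones.

    Hypothetical sketch: word counts stand in for real token counts.
    """
    def count(s):
        return len(s.split())

    # Small enough already, or nothing left to split on: keep as one chunk.
    if count(text) <= chunk_size or not separators:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator so that joining the chunks rebuilds the text.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if count(piece) > chunk_size:
            # Too large for this level: flush, then split with finer rules.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_chunk(piece, chunk_size, finer))
        elif current and count(current + piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph here. It has two sentences.\n\n"
    "Second paragraph is a bit longer and keeps going for a while."
)
chunks = recursive_chunk(text, chunk_size=8)
```

In this sketch the first paragraph stays whole because it fits the budget, while the second falls through to word-level splitting. Natural separations are kept wherever the size limit allows.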


What's Changed
* Add initial support for Recursive Chunking (`RecursiveChunker`) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/107
* [FEAT] Add support for RecursiveChunking + minor fixes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/108
* [fix] Correct the start and end indices for TokenChunker in Batch mode (84) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/109
* [fix] Correct the start and end indices for TokenChunker in Batch mode (84) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/110
* [fix] 106: Missing last sentence in the SemanticChunker by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/112
* [fix] Add fix for 106: Reconstruction tests for SemanticChunker failing, missing last sentence by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/113
* [chore] Bump version to "v0.4.0" + minor change by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/114


**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.3.0...v0.4.0

0.3.0

Highlights

* Added `LateChunker` support! You can use `LateChunker` in the following manner:

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence",
    trust_remote_code=True,
)
```


* Added [Chonkie Discord](https://discord.gg/nMYNVyuB5Y) to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on [Twitter](https://x.com/ChonkieAI) and [Bluesky](https://bsky.app/profile/chonkieai.bsky.social) too!
* Bunch of bug fixes to improve chunkers' stability...
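
For readers new to late chunking, the sketch below shows the pooling step behind the general technique: the whole document is embedded at token level first, and each chunk's embedding is then the mean of the token embeddings inside its span. This is a toy illustration, not `LateChunker`'s code. The function name, the hand-written token vectors, and the span format are hypothetical.

```python
def late_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool contextual token embeddings over each chunk's token span.

    Hypothetical sketch: `chunk_spans` holds (start, end) token offsets,
    with `end` exclusive.
    """
    pooled = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled.append(
            [sum(vec[i] for vec in span) / len(span) for i in range(dim)]
        )
    return pooled


# Toy "contextual" token embeddings for a three-token document.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk_embeddings(token_embeddings, [(0, 2), (2, 3)])
```

Because pooling happens after the full document has been encoded, each chunk vector can carry context from text outside the chunk itself.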

What's Changed
* [Fix] 37: Incorrect indexing when repetition is present in the text by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/87
* [Fix] 88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by arpesenti in https://github.com/chonkie-ai/chonkie/pull/89
* [Fix] WordChunker chunk_batch fail by sky-2002 in https://github.com/chonkie-ai/chonkie/pull/90
* [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/96
* Add initial support for Late Chunking by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/97
* [FEAT] Add LateChunker by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/98
* [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/99
* Update version to 0.3.0 in pyproject.toml and __init__.py by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/100
* [fix] Add LateChunker support to chunker and module exports by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/101
* [fix] Docstrings in SemanticChunker should include **kwargs by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/102
* [Minor] Add Discord badge to README for community engagement by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/103

New Contributors
* arpesenti made their first contribution in https://github.com/chonkie-ai/chonkie/pull/89

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.2.2...v0.3.0

0.2.2

Highlights

* Added Token Estimate Validate Loops (TEVL) inside the SentenceChunker, giving speedups of up to ~5x at times.
* Added an `auto` thresholding mode for SemanticChunkers, removing the hard requirement for `similarity_threshold`. SemanticChunkers can now decide on their own threshold, based on the minimum and maximum.
* Added `OverlapRefinery` for adding overlap context to the chunks. The `chunk_overlap` parameter will be deprecated in a future release in favor of `OverlapRefinery`.
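
As a rough picture of what "adding overlap context" can look like, the sketch below attaches the tail of the previous chunk to each chunk as a separate `context` field. It only illustrates the idea and is not the `OverlapRefinery` API. The function name, the dictionary layout, and the word-based context size are assumptions.

```python
def add_overlap_context(chunks, context_words=3):
    """Attach the last few words of the previous chunk as context.

    Hypothetical sketch: words stand in for tokens, dicts for chunk objects.
    """
    refined = []
    for i, chunk in enumerate(chunks):
        context = ""
        if i > 0:
            # Take the trailing words of the preceding chunk.
            context = " ".join(chunks[i - 1].split()[-context_words:])
        refined.append({"text": chunk, "context": context})
    return refined


refined = add_overlap_context(
    ["the quick brown fox", "jumps over the lazy dog"], context_words=3
)
```

Keeping the overlap as a post-processing step leaves the chunk text itself untouched, which is one plausible reason to prefer a refinery over a `chunk_overlap` parameter.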

What's Changed
* [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/57
* [Fix] Add fix for 55 by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/58
* [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/60
* [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/62
* [Update] Change default embedding model in SemanticChunkers by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/63
* Add `min_chunk_size` to SDPMChunker + Lint codebase with ruff + minor changes by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/68
* Added automated testing using Github Actions by pratyushmittal in https://github.com/bhavnicksm/chonkie/pull/66
* Add support for automated testing with Github Actions by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/69
* [Fix] Allow for functions as token_counters in BaseChunkers by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/70
* Add TEVL to speed up sentence chunker by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/71
* Add TEVL to speed-up sentence chunking by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/72
* Update the docs path to docs.chonkie.ai by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/75
* [FEAT] Add BaseRefinery and OverlapRefinery support by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/77
* Add support for BaseRefinery and OverlapRefinery + minor changes by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/78
* [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/79
* [Fix] Unify dataclasses under a types.py for ease by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/80
* Expose the seperation delim for simple multilingual chunking by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/81
* Bump version to v0.2.2 for release by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/82

New Contributors
* pratyushmittal made their first contribution in https://github.com/bhavnicksm/chonkie/pull/66

**Full Changelog**: https://github.com/bhavnicksm/chonkie/compare/v0.2.1...v0.2.2

0.2.1.post1

Highlights

This patch allows `AutoEmbeddings` to default properly to `SentenceTransformerEmbeddings`, which was being bypassed in the previous release.

Furthermore, because of reconstructable splitting, numerous very short sentences were making it through to the SemanticChunker. To work around the issue, this fix introduces a `min_chunk_size` parameter, which sets the minimum number of tokens a chunk must contain. This resolves the failures in the tests.
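
One way a minimum size can be enforced is to merge short sentences forward until the threshold is met, as in the sketch below. This is an assumption about the general approach, not a copy of Chonkie's logic. The function name is hypothetical and words stand in for tokens.

```python
def merge_short_sentences(sentences, min_chunk_size=3):
    """Merge consecutive sentences until each group has enough words.

    Hypothetical sketch: words stand in for tokens.
    """
    merged = []
    buffer = ""
    for sentence in sentences:
        buffer += sentence
        if len(buffer.split()) >= min_chunk_size:
            merged.append(buffer)
            buffer = ""
    # A short leftover is folded into the last group so no text is lost.
    if buffer:
        if merged:
            merged[-1] += buffer
        else:
            merged.append(buffer)
    return merged


sentences = ["Hi. ", "Ok. ", "Sure. ", "This sentence is long enough. ", "Bye."]
merged = merge_short_sentences(sentences, min_chunk_size=3)
```

Concatenating the merged groups still yields the original text, so the reconstruction property that caused the short sentences in the first place is preserved.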

What's Changed
* [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/57
* [Fix] Add fix for 55 by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/58
* [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/60
* [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/62


**Full Changelog**: https://github.com/bhavnicksm/chonkie/compare/v0.2.1...v0.2.1.post1

0.2.1

Breaking Changes

* SemanticChunker no longer accepts SentenceTransformer models directly; instead, this release uses the `SentenceTransformerEmbeddings` class, which can take in a model directly. Future releases will add the functionality to auto-detect and create embeddings inside the `AutoEmbeddings` class.
* From this release onwards, the `semantic` optional installation depends by default on `Model2VecEmbeddings`, and hence on the `model2vec` Python package, for its size and speed benefits. `Model2Vec` uses static embeddings, which are good enough for the task of chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
* `SemanticChunker` and `SDPMChunker` now use the argument `chunk_size` instead of `max_chunk_size` for uniformity across the chunkers, but the internal representation remains the same.

What's Changed
* [BUG] Fix the start_index and end_index to point to character indices, not token indices by mrmps in https://github.com/bhavnicksm/chonkie/pull/29
* [DOCS] Fix typo for import tokenizer in quick start example by jasonacox in https://github.com/bhavnicksm/chonkie/pull/30
* Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/32
* Use `__slots__` instead of `slots=True` for python3.9 support by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/34
* Bump version to 0.2.0.post1 in pyproject.toml and __init__.py by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/35
* [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/44
* Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/45
* Add initial OpenAIEmbeddings support to Chonkie ✨ by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/46
* [DOCS] Add info about initial embeddings support and how to add custom embeddings by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/47
* [FEAT] - Add model2vec embedding models by sky-2002 in https://github.com/bhavnicksm/chonkie/pull/41
* [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/49
* [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/51
* [Fix] Token counts from Tokenizers and Transformers adding special tokens by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/52
* [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/53
* [Refactor] Optimize similarity calculation by using np.divide for imp… by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/54

New Contributors
* mrmps made their first contribution in https://github.com/bhavnicksm/chonkie/pull/29
* jasonacox made their first contribution in https://github.com/bhavnicksm/chonkie/pull/30
* sky-2002 made their first contribution in https://github.com/bhavnicksm/chonkie/pull/41

**Full Changelog**: https://github.com/bhavnicksm/chonkie/compare/v0.2.0...v0.2.1

0.2.0.post2

Highlights
This patch restores Python 3.9 support for dataclass slots. Earlier releases used `slots=True`, which only works from Python 3.10 onwards. Declaring `__slots__` explicitly works on Python 3.9 and continues to work on Python 3.10+.
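
The difference can be sketched with a small dataclass. The class below is illustrative only: the `Chunk` name and its `text` and `token_count` fields are assumptions, not Chonkie's actual type definitions.

```python
from dataclasses import dataclass


# On Python 3.10+ one could write @dataclass(slots=True) instead.
# Declaring __slots__ by hand also works on Python 3.9.
@dataclass
class Chunk:
    # Fields must not have default values here: a default would become a
    # class variable that conflicts with the slot of the same name.
    __slots__ = ["text", "start_index", "end_index", "token_count"]

    text: str
    start_index: int
    end_index: int
    token_count: int


chunk = Chunk(text="hello", start_index=0, end_index=5, token_count=1)
```

Instances of a slotted class carry no per-instance `__dict__`, so assigning an undeclared attribute such as `chunk.extra = 1` raises `AttributeError`.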

What's Changed
* Use `__slots__` instead of `slots=True` for python3.9 support by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/34
* Bump version to 0.2.0.post1 in pyproject.toml and __init__.py by bhavnicksm in https://github.com/bhavnicksm/chonkie/pull/35


**Full Changelog**: https://github.com/bhavnicksm/chonkie/compare/v0.2.0.post1...v0.2.0.post2
