Chonkie

Latest version: v0.5.1

0.5.2rc1

What's Changed
* [Feat] Default to Lazy Imports in classes to improve import time by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/186
* [Feat] Default to Lazy Imports for faster import time by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/187
* [Fix] Solve a few static type checking errors from mypy by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/189
* [FEAT] Addition of decimal point, emails and urls handling. Better abbreviation handling. by Udayk02 in https://github.com/chonkie-ai/chonkie/pull/192
* Update version to `v0.5.2rc1` for first release candidate + Add decimals, email and url support in the Chef by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/194

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.5.1...v0.5.2rc1

0.5.1

What's Changed
* [Tutorial] Add `include_delim="next"` for headings in tutorial by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/181
* [Fix] Missing dependency for tiktoken in `all` + Better error messages for `OpenAIEmbedding` imports by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/183
* [Chore] Bump version to v0.5.1 by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/184

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.5.0...v0.5.1

0.5.0

🚨 Breaking changes

* All chunkers except `TokenChunker` have their `tokenizer` argument renamed to `tokenizer_or_token_counter`, to show that the chunkers also accept callable token counters.
* A deprecation warning is now raised when `chunk_overlap>0`; use `OverlapRefinery` instead for its speed and flexibility.
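
To illustrate what the renamed argument is meant to convey, here is a minimal, self-contained sketch of how a single argument could accept either a tokenizer object or a plain counting function. It is not Chonkie's actual implementation: `CharTokenizer`, `resolve_counter` and `word_counter` are hypothetical names invented for this example.

```python
from typing import Any, Callable


class CharTokenizer:
    """Hypothetical tokenizer object exposing an `encode` method."""

    def encode(self, text: str) -> list:
        return list(text)  # one "token" per character


def resolve_counter(tokenizer_or_token_counter: Any) -> Callable[[str], int]:
    """Return a `text -> token count` function for either kind of input."""
    if hasattr(tokenizer_or_token_counter, "encode"):
        # Tokenizer-like object: count the tokens it produces.
        return lambda text: len(tokenizer_or_token_counter.encode(text))
    if callable(tokenizer_or_token_counter):
        # Plain function: assume it already returns an int.
        return tokenizer_or_token_counter
    raise TypeError("expected a tokenizer object or a callable token counter")


def word_counter(text: str) -> int:
    """A callable token counter: tokens are whitespace-separated words."""
    return len(text.split())


count_chars = resolve_counter(CharTokenizer())
count_words = resolve_counter(word_counter)
```

Either resolved function can then be used wherever a token count is needed, which is the flexibility the new argument name is meant to signal.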

✨ Highlights

* All chunkers now support a `return_type="texts"` parameter, which makes the chunker return a plain `List[str]` instead of `Chunk` objects. Use it when you don't need the metadata in the `Chunk` dataclass; it also saves a little memory.
* All chunkers support a `Callable` in their `tokenizer_or_token_counter` argument. This lets you pass functions defined like `def token_counter(text: str) -> int: ...` into the chunkers.
* All chunkers that use delimiters (e.g. `SentenceChunker`, `RecursiveChunker`, `LateChunker`) accept `include_delim="next"`, which puts the delimiter in the next chunk. This is useful for processing Markdown files properly.
* Added initial support for Chonkie's pre-processing classes (`Chef`), starting with `TextChef`, which can load and pre-process Text and Markdown files.
* All `Chunk` dataclasses have `to_dict` and `from_dict` methods, which let you convert `Chunk <--> Dict`. This is especially useful if you want to store chunks as JSON or JSONLines files.
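
The following is a rough, self-contained sketch of two of the ideas above: keeping a delimiter at the start of the next chunk (so a Markdown heading stays with its section) and round-tripping chunks through dicts into JSON Lines. It does not use Chonkie itself; the `Chunk` dataclass here is a simplified hypothetical stand-in, and `split_delim_next` is an illustrative helper, not a library function.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk: text plus character offsets."""
    text: str
    start_index: int
    end_index: int

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "Chunk":
        return cls(**data)


def split_delim_next(text: str, delim: str) -> List[Chunk]:
    """Split on `delim`, keeping each delimiter at the start of the next chunk."""
    chunks, start = [], 0
    pos = text.find(delim, 1)  # begin at 1 so a leading delimiter stays in place
    while pos != -1:
        chunks.append(Chunk(text[start:pos], start, pos))
        start = pos
        pos = text.find(delim, pos + 1)
    chunks.append(Chunk(text[start:], start, len(text)))
    return chunks


markdown = "# Intro\nChonkie chunks text.\n# Usage\nPick a chunker and call it."
chunks = split_delim_next(markdown, "\n#")

# Chunk -> dict -> JSON Lines text, one chunk per line.
jsonl = "\n".join(json.dumps(c.to_dict()) for c in chunks)

# JSON Lines text -> dict -> Chunk.
restored = [Chunk.from_dict(json.loads(line)) for line in jsonl.splitlines()]
```

Because the delimiter travels with the following chunk, concatenating the chunk texts reproduces the original document, and each heading opens the chunk it describes.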

What's Changed
* [FEAT] Support `return_type` as `texts` for direct text handling by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/146
* [FEAT] Support `return_type` with `texts` output type by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/147
* [FEAT] Add support for callable `token_counter` as input for rule-based Chunkers by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/155
* [DOCS] Benchmarking update by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/145
* [DOCS] Update Benchmarks - Include Wikipedia-100k and Wikipedia-500k run timings by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/156
* [Feat] Add `include_delim='next'` as an optional argument in SentenceChunker and RecursiveChunker by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/157
* Update Benchmarks + Remove numpy base dependency + Add `include_delim` in Chunkers by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/158
* [Fix] 151: Provide warning to user when `min_sentences_per_chunk` is not satisfied by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/159
* [Minor] Update Benchmarks + Add `include_delim="next"` + Fix 151 by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/160
* [Fix] Default to Tokenizers while AutoTikTokenizer issues are resolved by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/161
* Shift to HF Tokenizers as the default, until AutoTikTokenizer issues are resolved by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/162
* [Fix] Correct the `_split_text`/`_split_sentence` logic to give proper splits by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/164
* Add chonkbook by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/165
* Add ChonkBook by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/166
* [FEAT] Support Cohere Embeddings for SemanticChunker and SDPMChunker 118 by Udayk02 in https://github.com/chonkie-ai/chonkie/pull/130
* Update Example URL by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/167
* [Feat] Add `mode="recursive"` in `OverlapRefinery` for all methods by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/168
* Add CohereEmbeddings + `recursive` mode in OverlapRefinery + Initial support for master Tokenizer class by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/169
* Bump up version to `v0.5.0` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/170
* Add `to_dict` and `from_dict` to all Chonkie data-classes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/171
* Add tests for types.py + Make the tests pass :) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/172
* Add `to_dict` + `from_dict` to Chonkie dataclasses + Add `__repr__` to classes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/173
* [Minor] Remove `token_processor.py` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/174
* [Feat] Add initial support for Chef via `BaseChef` and `TextChef` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/175
* [Feat] Add initial support for Chefs via `BaseChef` and `TextChef` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/176
* [Feat] Switch to using `Chonkie.Tokenizer` for Chunkers, Refineries by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/178
* [Fix] Use default model in `AutoEmbeddings` if `Error: model not found` + bad `__repr__` for `SemanticSentence` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/179

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.4.1...v0.5.0

0.4.1

Highlights

* Now you can see a progress bar when chunking a lot of texts with batch chunking:

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker()

chunks = chunker([...], show_progress_bar=True)  # progress bar is enabled by default
```

🦛 choooooooooooooooooooonk 100% • 200/200 docs chunked [00:00<00:00, 229.65doc/s] 🌱

What's Changed
* Add CONTRIBUTING.md, update issue templates, CI, Codecov and more... by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/119
* [FEAT] Add TQDM to default installs + CONTRIBUTING.md + other minor updates by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/120
* [fix] CI: reports were not being uploaded to Codecov by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/121
* Update CONTRIBUTING.md with first issue hyperlink by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/122
* [FIX] Support class methods as `token_counter` objects for `CustomEmbeddings` (92) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/127
* [Fix] Add fix for 92: Support `class.method` as a Tokenizer for `CustomEmbedding` +. minor changes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/128
* [FIX] 116: Incorrect `start_index` when `chunk_overlap` is not 0 by Udayk02 in https://github.com/chonkie-ai/chonkie/pull/126
* [FIX] `start_index` incorrect when `chunk_overlap` is not 0 (116) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/132
* [FIX] Remove tests for Py3.8 — Incompatible for support by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/134
* [fix] High `chunk_overlap` causes last chunk to be entirely redundant by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/136
* [FIX] Handle edge case for RecursiveChunker (131) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/137
* [DOCS] Update readme intro to match docs. by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/135
* [FEAT] Add TQDM progress bars for `chunk_batch` + Update README.md by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/138
* Replace dead discord link with infinite lifetime by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/140
* [FIX] Minor fixes + Stylistic enhancements for TQDM and Multiprocessing by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/141
* [chore] Bump up the package version to v0.4.1 by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/143

New Contributors
* shreyashnigam made their first contribution in https://github.com/chonkie-ai/chonkie/pull/122
* Udayk02 made their first contribution in https://github.com/chonkie-ai/chonkie/pull/126

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.4.0...v0.4.1

0.4.0

Highlights

Added the new `RecursiveChunker` that uses complex recursive rules to create structurally meaningful chunks, maintaining natural separations as much as possible. Try it out~

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker(chunk_size=512)
chunks = chunker("Woah! Chonkie has it's own recursive chunker now~ so cooool!")
```

What's Changed
* Add initial support for Recursive Chunking (`RecursiveChunker`) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/107
* [FEAT] Add support for RecursiveChunking + minor fixes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/108
* [fix] Correct the start and end indices for TokenChunker in Batch mode (84) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/109
* [fix] Correct the start and end indices for TokenChunker in Batch mode (84) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/110
* [fix] 106: Missing last sentence in the SemanticChunker by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/112
* [fix] Add fix for 106: Reconstruction tests for SemanticChunker failing, missing last sentence by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/113
* [chore] Bump version to "v0.4.0" + minor change by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/114

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.3.0...v0.4.0

0.3.0

Highlights

* Added `LateChunker` support! You can use `LateChunker` in the following manner:

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence",
    trust_remote_code=True,
)
```

* Added [Chonkie Discord](https://discord.gg/nMYNVyuB5Y) to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on [Twitter](https://x.com/ChonkieAI) and [Bluesky](https://bsky.app/profile/chonkieai.bsky.social) too!
* Bunch of bug fixes to improve chunkers' stability...

What's Changed
* [Fix] 37: Incorrect indexing when repetition is present in the text by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/87
* [Fix] 88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by arpesenti in https://github.com/chonkie-ai/chonkie/pull/89
* [Fix] WordChunker chunk_batch fail by sky-2002 in https://github.com/chonkie-ai/chonkie/pull/90
* [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/96
* Add initial support for Late Chunking by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/97
* [FEAT] Add LateChunker by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/98
* [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/99
* Update version to 0.3.0 in pyproject.toml and __init__.py by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/100
* [fix] Add LateChunker support to chunker and module exports by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/101
* [fix] Docstrings in SemanticChunker should include **kwargs by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/102
* [Minor] Add Discord badge to README for community engagement by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/103

New Contributors
* arpesenti made their first contribution in https://github.com/chonkie-ai/chonkie/pull/89

**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.2.2...v0.3.0
