🚨 Breaking changes
* All chunkers except `TokenChunker` have their argument `tokenizer` renamed to `tokenizer_or_token_counter` to denote that the chunkers support callable token counters as well.
* A `DeprecationWarning` is now raised when `chunk_overlap > 0`; users are encouraged to use `OverlapRefinery` instead for its speed and flexibility.
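
As a rough illustration of the renamed argument, a callable token counter is any function that maps a string to an integer count. The whitespace-based counter below is an assumption made for illustration and is not something Chonkie ships; the commented-out lines sketch how it might be passed to a chunker and are not verified against the library.

```python
def token_counter(text: str) -> int:
    """Count tokens as whitespace-separated words (illustrative only)."""
    return len(text.split())


# Hypothetical usage, assuming chonkie is installed and the remaining
# constructor arguments can be left at their defaults:
# from chonkie import SentenceChunker
# chunker = SentenceChunker(tokenizer_or_token_counter=token_counter)

print(token_counter("Chonkie chunks text quickly"))  # 4
```

Any function with this shape should work the same way, so a model-specific counter can be swapped in without touching the rest of the pipeline.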
✨ Highlights
* All chunkers now support a `return_type="texts"` parameter, which makes the chunker return a plain `List[str]` instead of `Chunk` objects. Use it when you only need the texts and not the metadata available in the `Chunk` dataclass; it also saves a little memory.
* All chunkers accept a `Callable` for their `tokenizer_or_token_counter` argument, so you can pass in a function defined like `def token_counter(text: str) -> int: ...`.
* All chunkers that use delimiters (e.g. `SentenceChunker`, `RecursiveChunker`, `LateChunker`) accept `include_delim="next"`, which puts the delimiter in the next chunk. This is useful for processing Markdown files properly.
* Added initial support for Chonkie's pre-processing classes, the `Chef`s, starting with `TextChef`, which can load and pre-process text and Markdown files.
* All `Chunk` dataclasses have `to_dict` and `from_dict` methods for converting `Chunk <--> Dict`. This is especially useful if you want to store chunks as JSON or JSON Lines files.
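
The `to_dict`/`from_dict` round trip mentioned above can be sketched with a stand-in dataclass. `DemoChunk`, its field names, and the two helper functions are hypothetical and only mirror the general shape of a chunk; they are not Chonkie's actual classes or API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class DemoChunk:
    """Stand-in for a chunk dataclass; the field names are assumptions."""

    text: str
    start_index: int
    end_index: int
    token_count: int

    def to_dict(self) -> Dict:
        # Convert the dataclass fields into a plain dictionary.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict) -> "DemoChunk":
        # Rebuild the dataclass from a dictionary with matching keys.
        return cls(**data)


def dumps_jsonl(chunks: List[DemoChunk]) -> str:
    """Serialize chunks to a JSON Lines string, one object per line."""
    return "\n".join(json.dumps(chunk.to_dict()) for chunk in chunks)


def loads_jsonl(payload: str) -> List[DemoChunk]:
    """Rebuild chunks from a JSON Lines string, skipping empty lines."""
    return [
        DemoChunk.from_dict(json.loads(line))
        for line in payload.splitlines()
        if line
    ]


chunks = [DemoChunk("Hello world.", 0, 12, 2), DemoChunk(" Bye.", 12, 17, 1)]
restored = loads_jsonl(dumps_jsonl(chunks))
print(restored == chunks)  # True
```

With the real `Chunk` classes, the same pattern would presumably apply by calling their own `to_dict` and `from_dict` in place of the stand-in's.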
What's Changed
* [FEAT] Support `return_type` as `texts` for direct text handling by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/146
* [FEAT] Support `return_type` with `texts` output type by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/147
* [FEAT] Add support for callable `token_counter` as input for rule-based Chunkers by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/155
* [DOCS] Benchmarking update by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/145
* [DOCS] Update Benchmarks - Include Wikipedia-100k and Wikipedia-500k run timings by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/156
* [Feat] Add `include_delim='next'` as an optional argument in SentenceChunker and RecursiveChunker by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/157
* Update Benchmarks + Remove numpy base dependency + Add `include_delim` in Chunkers by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/158
* [Fix] 151: Provide warning to user when `min_sentences_per_chunk` is not satisfied by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/159
* [Minor] Update Benchmarks + Add `include_delim="next"` + Fix 151 by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/160
* [Fix] Default to Tokenizers while AutoTikTokenizer issues are resolved by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/161
* Shift to HF Tokenizers as the default, until AutoTikTokenizer issues are resolved by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/162
* [Fix] Correct the `_split_text`/`_split_sentence` logic to give proper splits by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/164
* Add chonkbook by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/165
* Add ChonkBook by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/166
* [FEAT] Support Cohere Embeddings for SemanticChunker and SDPMChunker 118 by Udayk02 in https://github.com/chonkie-ai/chonkie/pull/130
* Update Example URL by shreyashnigam in https://github.com/chonkie-ai/chonkie/pull/167
* [Feat] Add `mode="recursive"` in `OverlapRefinery` for all methods by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/168
* Add CohereEmbeddings + `recursive` mode in OverlapRefinery + Initial support for master Tokenizer class by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/169
* Bump up version to `v0.5.0` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/170
* Add `to_dict` and `from_dict` to all Chonkie data-classes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/171
* Add tests for types.py + Make the tests pass :) by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/172
* Add `to_dict` + `from_dict` to Chonkie dataclasses + Add `__repr__` to classes by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/173
* [Minor] Remove `token_processor.py` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/174
* [Feat] Add initial support for Chef via `BaseChef` and `TextChef` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/175
* [Feat] Add initial support for Chefs via `BaseChef` and `TextChef` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/176
* [Feat] Switch to using `Chonkie.Tokenizer` for Chunkers, Refineries by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/178
* [Fix] Use default model in `AutoEmbeddings` if `Error: model not found` + bad `__repr__` for `SemanticSentence` by bhavnicksm in https://github.com/chonkie-ai/chonkie/pull/179
**Full Changelog**: https://github.com/chonkie-ai/chonkie/compare/v0.4.1...v0.5.0