Bm25s

Latest version: v0.2.7

0.2.7pre2

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.7pre1...0.2.7pre2

0.2.7pre1

What's Changed
* Fix query filtering and vocabulary dict by xhluca in https://github.com/xhluca/bm25s/pull/96 and mossbee in https://github.com/xhluca/bm25s/pull/92

Notes

* The tokenizers' handling of the null token has changed. The null token is now added to the vocab first rather than last, because the previous approach was inconsistent with the general standard (the "" string should map to 0). The change is backward compatible in that tokenizers from before 0.2.7 and tokenizers from 0.2.7 onward both work with the retriever object, but expect the two to differ in behavior.
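
The ordering change described in the note above can be illustrated with a minimal sketch. The helper `build_vocab` and its `null_first` flag are hypothetical names made up for this example and are not part of the bm25s API; the sketch only shows how placing the null token first versus last changes the ID assigned to "".

```python
def build_vocab(tokens, null_token="", null_first=True):
    """Map each unique token to an integer ID.

    Hypothetical helper for illustration only, not the bm25s implementation.
    """
    # dict.fromkeys keeps first-seen order while dropping duplicates
    unique = [t for t in dict.fromkeys(tokens) if t != null_token]
    if null_first:
        ordered = [null_token] + unique  # null token gets ID 0
    else:
        ordered = unique + [null_token]  # null token gets the last ID
    return {tok: idx for idx, tok in enumerate(ordered)}


tokens = ["cat", "dog", "cat", "fish"]
new_style = build_vocab(tokens, null_first=True)
old_style = build_vocab(tokens, null_first=False)

print(new_style)  # the "" string maps to 0
print(old_style)  # the "" string maps to the last ID
```

Both mappings cover the same tokens; only the integer IDs differ, which is why vocabularies built under the two orderings should not be mixed.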

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.6...0.2.7

0.2.6

What's Changed
* Extending to Non-ASCII characters with corpora loading and saving by IssacXid in https://github.com/xhluca/bm25s/pull/93
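
As a rough illustration of what saving and loading non-ASCII text involves, here is a sketch using only the Python standard library. It is not taken from the pull request; the function names `save_vocab` and `load_vocab` and the JSON file layout are assumptions made for this example.

```python
import json
import os
import tempfile


def save_vocab(vocab, path):
    # Hypothetical helper: write UTF-8 and keep characters unescaped
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)


def load_vocab(path):
    # Hypothetical helper: read back with the same explicit encoding
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


vocab = {"": 0, "café": 1, "東京": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")
    save_vocab(vocab, path)
    restored = load_vocab(path)
    with open(path, encoding="utf-8") as f:
        raw = f.read()

print(restored == vocab)
```

Passing an explicit `encoding="utf-8"` avoids depending on the platform default encoding, and `ensure_ascii=False` keeps the stored file human-readable instead of filled with `\u` escapes.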


**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.5...0.2.6

0.2.5

What's Changed
* Update README.md by xhluca in https://github.com/xhluca/bm25s/pull/83
* Added support for saving and loading non ASCII chars in corpus and vocab by IssacXid in https://github.com/xhluca/bm25s/pull/86
* Update README.md by mrisher in https://github.com/xhluca/bm25s/pull/87

New Contributors
* IssacXid made their first contribution in https://github.com/xhluca/bm25s/pull/86
* mrisher made their first contribution in https://github.com/xhluca/bm25s/pull/87

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.4...0.2.5

0.2.4

What's Changed

* Fix crash tokenizing with empty word_to_id by mgraczyk in https://github.com/xhluca/bm25s/pull/72
* Create nltk_stemmer.py by aflip in https://github.com/xhluca/bm25s/pull/77

* https://github.com/xhluca/bm25s/commit/aa31a2321250180feb8b155fec1daafd40f56182: This commit improves the handling of unknown tokens during tokenization and retrieval, strengthens error handling, and improves logging for easier debugging.
- `bm25s/__init__.py`: Added checks in the `get_scores_from_ids` method to raise a `ValueError` if `max_token_id` exceeds the number of tokens in the index. Improved handling of empty queries in the `_get_top_k_results` method, which now returns zero scores for all documents.
- `bm25s/tokenization.py`: Fixed `streaming_tokenize` so that it correctly adds new tokens and updates `word_to_id`, `word_to_stem`, and `stem_to_sid`.
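
The two guards described above (rejecting out-of-range token IDs and returning zero scores for an empty query) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the function `get_scores`, the `postings` layout, and the 0-based ID convention (so any ID greater than or equal to the vocabulary size is out of range) are all hypothetical.

```python
def get_scores(query_ids, postings, num_docs, vocab_size):
    """Sum per-token scores for each document.

    Hypothetical sketch: postings maps token_id -> {doc_id: score}.
    """
    # Empty query: every document scores zero
    if not query_ids:
        return [0.0] * num_docs

    # Reject token IDs that the index has never seen (assumes 0-based IDs)
    max_token_id = max(query_ids)
    if max_token_id >= vocab_size:
        raise ValueError(
            f"Token ID {max_token_id} is out of range for an index "
            f"with {vocab_size} tokens"
        )

    scores = [0.0] * num_docs
    for token_id in query_ids:
        for doc_id, score in postings.get(token_id, {}).items():
            scores[doc_id] += score
    return scores


postings = {0: {0: 1.5}, 1: {0: 0.5, 1: 2.0}}

normal = get_scores([0, 1], postings, num_docs=2, vocab_size=2)
empty = get_scores([], postings, num_docs=2, vocab_size=2)

try:
    get_scores([5], postings, num_docs=2, vocab_size=2)
    raised = False
except ValueError:
    raised = True
```

Failing early with a `ValueError` gives a clearer message than the index error that would otherwise surface deep inside the scoring loop.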


New Contributors
* mgraczyk made their first contribution in https://github.com/xhluca/bm25s/pull/72
* aflip made their first contribution in https://github.com/xhluca/bm25s/pull/77

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.3...0.2.4

0.2.3

What's Changed
* PR https://github.com/xhluca/bm25s/pull/67 fixes issue #60
* More test cases for edge cases of `Tokenizer` class, such as when `update_vocab=True` in `return_as="ids"` mode, which leads to unseen new token IDs being passed to `retriever.retrieve`
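
To show why this edge case arises, the sketch below uses a toy tokenizer written for this example. `ToyTokenizer` is hypothetical and much simpler than the real `Tokenizer` class; it only mimics the idea that tokenizing a query with `update_vocab=True` can assign IDs that did not exist when the index was built.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer that returns token IDs."""

    def __init__(self):
        self.word_to_id = {}

    def tokenize(self, texts, update_vocab=True):
        all_ids = []
        for text in texts:
            ids = []
            for word in text.lower().split():
                if word not in self.word_to_id:
                    if not update_vocab:
                        continue  # skip words unknown to the vocab
                    # Grow the vocab: the new word gets the next free ID
                    self.word_to_id[word] = len(self.word_to_id)
                ids.append(self.word_to_id[word])
            all_ids.append(ids)
        return all_ids


tok = ToyTokenizer()
corpus_ids = tok.tokenize(["a cat", "a dog"])
indexed_vocab_size = len(tok.word_to_id)  # vocab size at indexing time

# Frozen vocab: the unseen word "zebra" is dropped from the query
frozen = tok.tokenize(["a zebra"], update_vocab=False)

# Growing vocab: "zebra" receives an ID the index knows nothing about
growing = tok.tokenize(["a zebra"], update_vocab=True)

print(frozen, growing, indexed_vocab_size)
```

In the growing case the largest query ID equals the vocabulary size recorded at indexing time, so a retriever built on the original index has no entry for it.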


**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.2...0.2.3
