bm25s

Latest version: v0.2.5


0.2.5

What's Changed
* Update README.md by xhluca in https://github.com/xhluca/bm25s/pull/83
* Added support for saving and loading non ASCII chars in corpus and vocab by IssacXid in https://github.com/xhluca/bm25s/pull/86
* Update README.md by mrisher in https://github.com/xhluca/bm25s/pull/87

New Contributors
* IssacXid made their first contribution in https://github.com/xhluca/bm25s/pull/86
* mrisher made their first contribution in https://github.com/xhluca/bm25s/pull/87

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.4...0.2.5

0.2.4

What's Changed

Fix crash tokenizing with empty word_to_id by mgraczyk in https://github.com/xhluca/bm25s/pull/72

Create nltk_stemmer.py by aflip in https://github.com/xhluca/bm25s/pull/77

https://github.com/xhluca/bm25s/commit/aa31a2321250180feb8b155fec1daafd40f56182: The commit primarily focused on improving the handling of unknown tokens during the tokenization and retrieval processes, enhancing error handling, and improving the logging mechanism for better debugging.
- `bm25s/__init__.py`: Added checks in the `get_scores_from_ids` method to raise a `ValueError` if `max_token_id` exceeds the number of tokens in the index. Enhanced handling of empty queries in the `_get_top_k_results` method by returning zero scores for all documents.
- `bm25s/tokenization.py`: Fixed the behavior of `streaming_tokenize` to correctly handle the addition of new tokens and the updating of `word_to_id`, `word_to_stem`, and `stem_to_sid`.


New Contributors
* mgraczyk made their first contribution in https://github.com/xhluca/bm25s/pull/72
* aflip made their first contribution in https://github.com/xhluca/bm25s/pull/77

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.3...0.2.4

0.2.3

What's Changed
* PR https://github.com/xhluca/bm25s/pull/67 fixes issue #60
* More test cases for edge cases of the `Tokenizer` class, such as when `update_vocab=True` in `return_as="ids"` mode, which leads to unseen new token IDs being passed to `retriever.retrieve` (sketched below)
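
A minimal sketch of that edge case, assuming the `return_as="tuple"` and `return_as="ids"` modes shown in `examples/tokenizer_class.py` (corpus strings are illustrative):

```python
import bm25s
from bm25s.tokenization import Tokenizer

corpus = ["a cat is a feline", "a dog is a canine"]
tokenizer = Tokenizer()

# Index using the tuple form (token IDs plus vocabulary)
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# The edge case: update_vocab=True in "ids" mode can emit IDs the index has
# never seen (here, "bird"), which retrieve must now handle gracefully
query_ids = tokenizer.tokenize(["a bird is a feline"], return_as="ids", update_vocab=True)
results = retriever.retrieve(query_ids, k=1)
```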


**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.2...0.2.3

0.2.2

- Improve README with example of memory usage optimization
- Add a `Results.merge` method allowing merging a list of results
- Make `get_max_memory_usage` compatible with mac os
- Add `BM25.load_scores` that allows loading only the scores of the object
- Add a `load_vocab` parameter to `BM25.load` (set to `True` by default), allowing you to skip loading the vocabulary (see the sketch below)
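
A minimal sketch combining these loading options (`my_index` is a hypothetical directory created by `BM25.save`):

```python
import bm25s

# Skip the vocabulary and memory-map the score matrix to reduce RAM usage
retriever = bm25s.BM25.load("my_index", load_vocab=False, mmap=True)
```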

PR: https://github.com/xhluca/bm25s/pull/63

**Full Changelog**: https://github.com/xhluca/bm25s/compare/0.2.1...0.2.2

0.2.1

- Add `Tokenizer.save_vocab` and `Tokenizer.load_vocab` methods to save/load vocabulary to a json file called `vocab.tokenizer.json` by default
- Add `Tokenizer.save_stopwords` and `Tokenizer.load_stopwords` methods to save/load stopwords to a json file called `stopwords.tokenizer.json` by default
- Add `TokenizerHF` class to allow saving/loading from huggingface hub
* New functions: `load_vocab_from_hub`, `save_vocab_to_hub`, `load_stopwords_from_hub`, `save_stopwords_to_hub`

> New tests and examples were added (see `examples/index_to_hf.py` and `examples/tokenizer_class.py`)
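
For example, a minimal save/load round trip for the vocabulary and stopwords, assuming `save_dir` is the directory argument as in the examples (the directory name is hypothetical; default filenames per the notes above):

```python
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer()
tokenizer.tokenize(["a cat is a feline", "a dog is a canine"])

# Writes vocab.tokenizer.json and stopwords.tokenizer.json by default
tokenizer.save_vocab(save_dir="my_tokenizer")
tokenizer.save_stopwords(save_dir="my_tokenizer")

# Restore into a fresh tokenizer
new_tokenizer = Tokenizer()
new_tokenizer.load_vocab(save_dir="my_tokenizer")
new_tokenizer.load_stopwords(save_dir="my_tokenizer")
```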

0.2.0

Numba JIT support

> See discussion here: https://github.com/xhluca/bm25s/discussions/46

The most important new feature of v0.2.0 is the addition of numba support, which only requires you to install the core requirements (with `pip install "bm25s[core]"`) or to run `pip install numba`.

Using numba results in a substantial speedup, so it is highly recommended if numba is available on your system (which should be the case in most setups). You can find a [benchmark here](https://github.com/xhluca/bm25-benchmarks?tab=readme-ov-file#queries-per-second).

Notably, by combining numba JIT-based scoring, numba-based top-k selection (which no longer relies on jax; see the discussion thread), and the new, faster `bm25s.tokenization.Tokenizer` (see below), we observe the following speedups (in queries per second) on a few benchmarks, in a single-threaded setting on Kaggle CPUs:

- MSMarco: 12.2 --> 39.18
- HotpotQA: 20.88 --> 47.16
- Fever: 20.19 --> 53.84
- NQ: 41.85 --> 109.47
- Quora: 272.04 --> 479.71
- NFCorpus: 1196.16 --> 5696.21

To enable it, simply do:

```python
import bm25s

# load corpus
# ...

retriever = bm25s.BM25(backend="numba")

# index and run retrieval
# ...
```
This is all you need to use the numba JIT when calling the `retriever.retrieve` method. Note, however, that the first run might be slower (due to JIT compilation), so you can warm up by passing a small query first. Here are more examples:
- [index_and_retrieve_with_numba.py](https://github.com/xhluca/bm25s/blob/main/examples/index_and_retrieve_with_numba.py)
- [retrieve_with_numba_hf.py](https://github.com/xhluca/bm25s/blob/main/examples/retrieve_with_numba_hf.py)
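
For reference, a compact end-to-end sketch of the flow above (toy corpus and query are illustrative):

```python
import bm25s

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]

corpus_tokens = bm25s.tokenize(corpus)

retriever = bm25s.BM25(backend="numba")
retriever.index(corpus_tokens)

# The first call triggers JIT compilation; a small warmup query hides that cost
query_tokens = bm25s.tokenize("does the cat purr")
docs, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
```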


New `bm25s.tokenization.Tokenizer` class

With v0.2.0, we are adding the `Tokenizer` class, which enhances the existing features of `bm25s.tokenize` and makes them more flexible. Notably, it enables generator mode (streaming with `yield`) and is much faster at tokenizing queries when you have an existing vocabulary. You can also specify your own splitter function, so tokenization is no longer locked to a regex pattern.
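
A short sketch of these two features (the splitter shown is illustrative):

```python
from bm25s.tokenization import Tokenizer

corpus = ["Hello world!", "BM25 is a ranking function."]

# Custom splitter: any callable taking a string and returning a list of tokens
tokenizer = Tokenizer(splitter=lambda text: text.lower().split())

# Generator mode: documents are tokenized lazily, one at a time
for doc_tokens in tokenizer.streaming_tokenize(corpus):
    print(doc_tokens)
```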

You can find more information here:
* [Readme section](https://github.com/xhluca/bm25s?tab=readme-ov-file#tokenization)
* [`examples/tokenizer_class.py`](https://github.com/xhluca/bm25s/blob/main/examples/tokenizer_class.py)
* Read the docstring with `help(bm25s.tokenization.Tokenizer)`


New stopwords

Stopwords for 10 additional languages (from NLTK) were added by bm777 in https://github.com/xhluca/bm25s/pull/33. The supported languages are now:

- English
- German
- Dutch
- French
- Spanish
- Portuguese
- Italian
- Russian
- Swedish
- Norwegian
- Chinese

New JSON backend

`orjson` is now supported as a JSON backend; it is faster than `ujson` and is actively maintained.

Weight mask

`BM25.retrieve` now supports a `weight_mask` array, which applies a weight (binary or float) to each retrieved document. This is useful, for example, if you want to use a binary mask to hide certain documents deemed irrelevant.
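
A minimal sketch (the corpus and mask values are illustrative; the `weight_mask` parameter is per the note above):

```python
import numpy as np
import bm25s

corpus = ["red apple", "green pear", "red cherry"]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# Binary mask: zero out document 1 so it can never be retrieved
weight_mask = np.array([1.0, 0.0, 1.0], dtype=np.float32)

query_tokens = bm25s.tokenize("red")
docs, scores = retriever.retrieve(query_tokens, k=2, weight_mask=weight_mask)
```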

Dependency Notes

- `orjson` replaces `ujson` as a core dependency
- `jax[cpu]` is no longer a `core` dependency, but a `selection` dependency. Be careful not to use `backend_selection='jax'` if you don't have it installed!
- `numba` is a new `core` dependency, allowing you to use `backend='numba'` directly when initializing a retriever.
- `pytrec_eval` is a new `evaluation` dependency, which is useful if you want to use the evaluation functions in `bm25s.utils.beir`, which are copied from the BEIR repository.

Advanced Numba

Alternative Usage (advanced)

Here's an example of how to leverage numba speedups using the alternative method of activating the numba scorer and choosing the `backend_selection` manually. This method is not recommended unless you specifically want more control over how the backend is activated.

```python
import os
import Stemmer

import bm25s.hf

def main(repo_name="xhluca/bm25s-fiqa-index"):
    queries = [
        "Is chemotherapy effective for treating cancer?",
        "Is Cardiac injury is common in critical cases of COVID-19?",
    ]

    retriever = bm25s.hf.BM25HF.load_from_hub(
        repo_name, load_corpus=False, mmap=False
    )

    # Tokenize the queries
    stemmer = Stemmer.Stemmer("english")
    queries_tokenized = bm25s.tokenize(queries, stemmer=stemmer)

    # Retrieve the top-k results
    retriever.activate_numba_scorer()
    results = retriever.retrieve(queries_tokenized, k=3, backend_selection="numba")

    # Show the first results
    result = results.documents[0]
    print(f"First score (# 1 result): {results.scores[0, 0]}")
    print(f"First result (# 1 result):\n{result[0]}")

if __name__ == "__main__":
    main()
```

Again, this method is only recommended if you want more control.

**WARNING:** it will not do well with multithreading. For the full example, see [retrieve_with_numba_advanced.py](https://github.com/xhluca/bm25s/blob/main/examples/retrieve_with_numba_advanced.py)
