BEIR

Latest version: v2.1.0


12.08.2021

I had fun speaking about BEIR and neural search at a recent OpenNLP event on benchmarking search.
If you are interested, the talk was recorded and is available below:

YouTube: https://www.youtube.com/watch?v=e9nNr4ugNAo
Slides: https://drive.google.com/file/d/1gghRVv6nWWmMZRqkYvuCO0HWOTEVnPNz/view?usp=sharing

3. Added splits for each dataset in the datasets table in the README
I plan to add the new, much larger msmarco-v2 version of the passage collection soon; it contains 138,364,198 passages (13.5 GB). The dataset contains two dev splits (``dev1``, ``dev2``). Adding splits is useful for incorporating datasets that don't follow the traditional convention of a single train, dev, and test split.
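
For illustration, a hypothetical sketch of how such a split could be loaded with the existing `GenericDataLoader`, assuming msmarco-v2 were packaged like the other BEIR datasets with `dev1.tsv`/`dev2.tsv` qrels files (the local path is made up):

```python
from beir.datasets.data_loader import GenericDataLoader

# Hypothetical: assumes a locally downloaded msmarco-v2 folder laid out like
# the other BEIR datasets, with qrels/dev1.tsv and qrels/dev2.tsv files.
data_path = "datasets/msmarco-v2"
corpus, queries, dev1_qrels = GenericDataLoader(data_folder=data_path).load(split="dev1")
```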

2.1.0

After a busy & hectic 2024, I'm back contributing to the BEIR repository! 🎉

I upgraded the repository from the outdated Python 3.6 to Python 3.9+. `sentence-transformers` has also improved and changed considerably since 2023, so it was necessary to update BEIR to support evaluation of the latest decoder-based embedding models on the BEIR datasets.

BEIR provides easy-to-use code snippets and examples in the `examples/` folder so that you can evaluate retrieval models without any issues, and you can configure every retrieval parameter, which helps improve reproducibility!
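
For context, a typical end-to-end evaluation looks like the snippet below (a sketch in the style of the `examples/` folder; the dataset URL and model name follow the standard BEIR quickstart and are not specific to this release):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load a small BEIR dataset (nfcorpus) for a quick sanity check
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Exact (brute-force) dense retrieval with a sentence-transformers bi-encoder
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```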

Evaluate the latest SoTA models, such as E5, NV-Embed, ModernBERT, etc.
1. Added `models.HuggingFace`, which can easily be used to evaluate E5 models, PEFT models fine-tuned with Tevatron (e.g., RepLLAMA), or any custom embedding model hosted on Hugging Face. It supports three pooling techniques: `mean`, `cls`, and `eos` pooling. To evaluate PEFT models, install PEFT using `pip install beir[peft]`.
```python
# Example for E5-Mistral-7B
# Check prompts: https://github.com/microsoft/unilm/blob/9c0f1ff7ca53431fe47d2637dfe253643d94185b/e5/utils.py
query_prompt = "Given a query on COVID-19, retrieve documents that answer the query"
passage_prompt = ""
dense_model = models.HuggingFace(
    model="intfloat/e5-mistral-7b-instruct",
    max_length=512,
    append_eos_token=True,  # add [EOS] token to the end of the input
    pooling="eos",  # end-of-sequence pooling
    normalize=True,
    prompts={"query": query_prompt, "passage": passage_prompt},
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
)

# Example with RepLLAMA (PEFT) model
query_prompt = "query: "
passage_prompt = "passage: "
dense_model = models.HuggingFace(
    model="meta-llama/Llama-2-7b-hf",
    peft_model_path="castorini/repllama-v1-7b-lora-passage",
    max_length=512,
    append_eos_token=True,  # add [EOS] token to the end of the input
    pooling="eos",
    normalize=True,
    prompts={"query": query_prompt, "passage": passage_prompt},
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
)
```

2. Updated `models.SentenceTransformer` to include prompts, prompt_names, and other recent features for LLM-based decoder models, e.g., to evaluate Stella, modernBERT-gte-base, etc. **Bonus**: All sentence-transformers models can be evaluated on multiple GPUs. Check out [evaluate_sbert_multi_gpu.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/dense/evaluate_sbert_multi_gpu.py)
```python
# Example for Stella 1.5B v5
dense_model = models.SentenceBERT(
    "NovaSearch/stella_en_1.5B_v5",
    max_length=512,
    prompt_names={"query": "s2p_query", "passage": None},
    trust_remote_code=True,
)

# Example for modernBERT GTE base
dense_model = models.SentenceBERT("Alibaba-NLP/gte-modernbert-base")
```

3. Added `models.NVEmbed` to evaluate the custom `nvidia/NV-Embed-v2` model using BEIR. Beware: currently you need to downgrade your `transformers` version to `4.47.1` for the model to work. See the discussion [here](https://huggingface.co/nvidia/NV-Embed-v2/discussions/36).
```python
# Check out the prompts for the NV-Embed-v2 model inside `instructions.json`:
# https://huggingface.co/nvidia/NV-Embed-v2/blob/main/instructions.json
trec_covid_prompt = "Given a query on COVID-19, retrieve documents that answer the query"

# Load the dense retriever model (NVEmbed)
dense_model = models.NVEmbed(
    "nvidia/NV-Embed-v2",
    max_length=512,
    normalize=True,
    prompts={"query": trec_covid_prompt, "passage": ""},
)
```

4. Added `models.LLM2Vec` to evaluate the custom bidirectional-attention embedding models provided in the LLM2Vec repository: https://github.com/McGill-NLP/llm2vec. Make sure you install LLM2Vec separately, or use `pip install beir[llm2vec]`.
```python
query_prompt = "Given a web search query, retrieve relevant passages that answer the query:"
dense_model = models.LLM2Vec(
    model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    max_length=512,
    pooling="mean",
    normalize=True,
    prompts={"query": query_prompt, "passage": None},
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
)
```

5. Removed a few old retrieval model examples such as USE-QA as they are sadly out of favour.

Util functions for easy saving of evaluation metrics and runfiles
You can now use two functions: `util.save_runfile()`, which saves the results as a TREC runfile (useful for evaluating the top-k retrieved documents for a given query), and `util.save_results()`, which saves your metrics: nDCG, MAP, recall, and precision (optionally MRR, recall_cap, and hole) into a JSON results file.

```python
import os
import pathlib

from beir import util

dataset = "nfcorpus"
...
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
mrr = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="mrr")

# If you want to save your results and runfile (useful for reranking)
results_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "results")
os.makedirs(results_dir, exist_ok=True)

# Save the evaluation runfile & results
util.save_runfile(os.path.join(results_dir, f"{dataset}.run.trec"), results)
util.save_results(os.path.join(results_dir, f"{dataset}.json"), ndcg, _map, recall, precision, mrr)
```


Python upgraded to 3.9, Installation, ruff & pyproject.toml
We upgraded Python to 3.9+ and updated the formatting across the codebase accordingly. We also added ruff as the Python linter and code formatter to help keep the codebase clean.

Merged old PRs
I changed the beir installation to include only three main dependencies: `sentence-transformers`, `datasets`, and `pytrec-eval-terrier`, as many users complained that `pytrec_eval` was not actively maintained and Windows users faced issues.

**PS: Next, I plan to add ColBERT evaluation, which is now easily supported in PyLate, as well as BM25S, etc.**

What's Changed
* Pull latest main branch into development by thakur-nandan in https://github.com/beir-cellar/beir/pull/153
* merge latest main into development by thakur-nandan in https://github.com/beir-cellar/beir/pull/190
* Support Multi-node evaluation by NouamaneTazi in https://github.com/beir-cellar/beir/pull/155
* replacing `pytrec_eval` with `pytrec-eval-terrier` by archersama in https://github.com/beir-cellar/beir/pull/175
* merge latest main into development by thakur-nandan in https://github.com/beir-cellar/beir/pull/191
* Merge development into main by thakur-nandan in https://github.com/beir-cellar/beir/pull/192

New Contributors
* archersama made their first contribution in https://github.com/beir-cellar/beir/pull/175

**Full Changelog**: https://github.com/beir-cellar/beir/compare/v2.0.0...v2.1.0

2.0.0

After a long, stale year with no changes, I've merged many pull requests and made changes to the BEIR code. You can find the latest changes below:

1. Heap Queue for keeping track of top-k documents when evaluating with dense retrieval.
Thanks to kwang2049, starting from v2.0.0, we include a heap queue for keeping track of top-k documents when using the `DenseRetrievalExactSearch` class module. This considerably reduces the RAM consumed, especially during the evaluation of large corpora such as MS MARCO or BIOASQ.

The logic for keeping track of elements while chunking the collection remains the same (see the sketch after this list):

- If the heap holds fewer than `k` items, push the item (i.e., the document) onto the heap.
- If the heap already holds `k` items and the item is larger than the smallest item in the heap, push the item onto the heap and then pop the smallest element.
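
To make the bookkeeping concrete, here is a minimal, self-contained sketch of that heap logic (illustrative names only, not the actual `DenseRetrievalExactSearch` internals):

```python
import heapq

def update_top_k(heap: list, k: int, score: float, doc_id: str) -> None:
    """Keep only the k highest-scoring (score, doc_id) pairs in a min-heap."""
    if len(heap) < k:
        heapq.heappush(heap, (score, doc_id))
    elif score > heap[0][0]:
        # heappushpop pushes the new item and pops the current smallest in one call
        heapq.heappushpop(heap, (score, doc_id))

# Stream scores chunk by chunk and keep the global top-3
heap: list = []
for doc_id, score in [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7)]:
    update_top_k(heap, k=3, score=score, doc_id=doc_id)
print(sorted(heap, reverse=True))  # [(0.9, 'd2'), (0.7, 'd4'), (0.5, 'd3')]
```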

2. Removed all major typing errors from the BEIR code.
We removed all typing errors from the BEIR code and implemented an abstract base class for search. The base class's `search` function takes the corpus, queries, and a top-k value, and returns the results as a mapping from `query_id` to the corresponding `doc_id` and `score`.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseSearch(ABC):

    @abstractmethod
    def search(self,
               corpus: Dict[str, Dict[str, str]],
               queries: Dict[str, str],
               top_k: int,
               **kwargs) -> Dict[str, Dict[str, float]]:
        pass
```

Example: [evaluate_sbert_multi_gpu.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/dense/evaluate_sbert_multi_gpu.py)
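
As a toy illustration of the interface, the subclass below scores documents by simple token overlap (the import path and all names here are assumptions for the sketch; a real implementation would wrap an actual retrieval model):

```python
from typing import Dict

from beir.retrieval.search import BaseSearch  # assumed import path


class TokenOverlapSearch(BaseSearch):
    """Toy BaseSearch implementation: scores documents by query-token overlap."""

    def search(self,
               corpus: Dict[str, Dict[str, str]],
               queries: Dict[str, str],
               top_k: int,
               **kwargs) -> Dict[str, Dict[str, float]]:
        results: Dict[str, Dict[str, float]] = {}
        for qid, query in queries.items():
            q_tokens = set(query.lower().split())
            scores = {
                doc_id: float(len(q_tokens & set(doc.get("text", "").lower().split())))
                for doc_id, doc in corpus.items()
            }
            # keep only the top_k highest-scoring documents for this query
            top_docs = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
            results[qid] = dict(top_docs)
        return results
```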

3. Updated Faiss Code to include GPU options.
I added the GPU option to the `FaissSearch` base class. Using the GPU can reduce latency immensely; however, it sometimes takes time to transfer the faiss index from CPU to GPU. Pass the `use_gpu=True` parameter to the `DenseRetrievalFaissSearch` class to use the GPU for faiss inference with PQ, PCA, or FlatIP search.
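
For example, a hedged sketch of a GPU-backed faiss evaluation (the constructor arguments other than `use_gpu=True`, such as `batch_size` and the model name, are assumptions based on the usual BEIR dense-retrieval setup):

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import FlatIPFaissSearch

# FlatIPFaissSearch builds on DenseRetrievalFaissSearch; per this release,
# use_gpu=True should move the faiss index from CPU to GPU before searching.
dense_model = models.SentenceBERT("msmarco-distilbert-base-tas-b")
faiss_search = FlatIPFaissSearch(dense_model, batch_size=128, use_gpu=True)
retriever = EvaluateRetrieval(faiss_search, score_function="dot")
```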

4. New publication -- Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard.
We have a new publication, where we describe our official leaderboard hosted on eval.ai and provide reproducible reference models on BEIR using the Pyserini Repository (https://github.com/castorini/pyserini).

Link to the arxiv version: https://arxiv.org/abs/2306.07471

If you use numbers from our leaderboard, please cite the following paper:

```
@misc{kamalloo2023resources,
      title={Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard},
      author={Ehsan Kamalloo and Nandan Thakur and Carlos Lassance and Xueguang Ma and Jheng-Hong Yang and Jimmy Lin},
      year={2023},
      eprint={2306.07471},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```

1.0.1

There have been multiple changes to the repository since the last version. You can find the latest changes below:

1. Brand New Wiki page for BEIR
Starting from v1.0.1, we have created a new Wiki page for the BEIR benchmark. We will keep it updated with the latest available datasets, examples of how to evaluate your models on BEIR, the leaderboard, etc. Correspondingly, we have shortened our README.md to display only the necessary information. For a full overview, one can view the **BEIR Wiki**.

You can view the BEIR Wiki here: [https://github.com/beir-cellar/beir/wiki](https://github.com/beir-cellar/beir/wiki).

2. Multi GPU evaluation with SBERT dense retrievers using Distributed Evaluation
Thanks to NouamaneTazi, we now support multi-GPU evaluation of SBERT models across all datasets in BEIR. This benefits evaluation on large datasets such as BioASQ, where encoding takes at least a day to complete on a single GPU. With access to multiple GPUs, one can now evaluate large datasets quickly, in contrast to the old single-GPU evaluation. The only caveat: running on multiple GPUs requires the ``evaluate`` library to be installed, which requires Python >= 3.7.

Example: [evaluate_sbert_multi_gpu.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/dense/evaluate_sbert_multi_gpu.py)

3. Hugging Face Data loader for BEIR dataset. Uploaded all datasets on HF.
We added Hugging Face data loaders for all the public BEIR datasets, so one can easily work with the BEIR datasets available on Hugging Face. We also made the corpus and queries (e.g. ``BeIR/fiqa``) and the qrels (``BeIR/fiqa-qrels``) for all public BEIR datasets available on Hugging Face. This means one no longer needs to download the datasets and keep them locally in RAM. Again, thanks to NouamaneTazi.

You can find all datasets here: [https://huggingface.co/BeIR](https://huggingface.co/BeIR)
Example: [evaluate_sbert_hf_loader.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/dense/evaluate_sbert_hf_loader.py)
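
As a rough sketch of working with these datasets directly via the `datasets` library (the configuration and split names below reflect how the BeIR datasets appear on the Hub and are assumptions, not the loader used in the linked example):

```python
from datasets import load_dataset

# Assumed layout of the BeIR/* datasets on the Hub: "corpus" and "queries"
# configurations per dataset, plus a separate *-qrels dataset with a test split.
corpus = load_dataset("BeIR/fiqa", "corpus", split="corpus")
queries = load_dataset("BeIR/fiqa", "queries", split="queries")
qrels = load_dataset("BeIR/fiqa-qrels", split="test")
```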

4. Added support for the T5 reranking model: monoT5 reranker
We added support for the monoT5 reranking model within BEIR. These are stronger (but more complex) rerankers that can be used to attain the best reranking performance currently on the BEIR benchmark.

Example: [evaluate_bm25_monot5_reranking.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/reranking/evaluate_bm25_monot5_reranking.py)
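
A hedged sketch of how the reranking step fits in (import paths and the model name are taken from memory of the linked example; the toy inputs stand in for a real first-stage BM25 retrieval):

```python
from beir.reranking import Rerank
from beir.reranking.models import MonoT5

# Toy inputs; in practice corpus, queries, and bm25_results come from a
# first-stage BM25 retrieval, as in the linked example.
corpus = {"d1": {"title": "", "text": "BEIR is a retrieval benchmark."},
          "d2": {"title": "", "text": "monoT5 reranks candidate passages."}}
queries = {"q1": "what does monoT5 do?"}
bm25_results = {"q1": {"d1": 2.1, "d2": 1.7}}

reranker = Rerank(MonoT5("castorini/monot5-base-msmarco"), batch_size=32)
rerank_results = reranker.rerank(corpus, queries, bm25_results, top_k=2)
```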

5. Fix: Add ``ignore_identical_ids`` with BEIR evaluation
Thanks to kwang2049, we added a check to ignore identical ids within the evaluation script. This caused issues particularly with the ArguAna and Quora datasets, where a document and a query can be identical (and share the same id). By default, we remove these ids and evaluate the dataset accordingly. With this fix, one can evaluate Quora and ArguAna and obtain accurate and reproducible nDCG@10 scores.
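
For reference, a small sketch of the flag in use (the exact keyword placement is an assumption; the toy qrels/results only illustrate a query sharing an id with a document):

```python
from beir.retrieval.evaluation import EvaluateRetrieval

# Toy qrels/results where query "q1" shares its id with a document, as can
# happen in ArguAna or Quora; ignore_identical_ids=True drops the self-match.
qrels = {"q1": {"q1": 1, "d2": 1}}
results = {"q1": {"q1": 0.99, "d2": 0.75, "d3": 0.10}}
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(
    qrels, results, k_values=[1, 3], ignore_identical_ids=True
)
```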

6. Added the HNSWSQ method to the faiss retrieval methods
We added support for the HNSWSQ faiss index method as a memory-compression-based technique for evaluation across the BEIR datasets.

7. Added the ``datasets`` library as a dependency within setup.py
To support the HF data loaders, we added the ``datasets`` library as a dependency in our ``setup.py``.

1.0.0

This is a major release since the last version v0.2.3.

1. New BEIR organization: moving forward, BEIR will be part of a collaboration
The BEIR benchmark has been shifted from [UKPLab](https://github.com/UKPLab/beir) to [beir-cellar](https://github.com/beir-cellar/beir). Moving forward, the BEIR benchmark will be actively maintained and developed with the help of UKPLab, castorini, and huggingface.

2. ColBERT model evaluation on the BEIR benchmark code released
The ColBERT model evaluation on the BEIR benchmark has been released. This code repository uses the original ColBERT repository for evaluation and training with a few tweaks.

Here is the repository for more details: https://github.com/NThakur20/beir-ColBERT

3. New Passage Expansion Model added: TILDE
Since DocT5query is compute-intensive and time-consuming for generation, we added a faster passage expansion model, TILDE (https://arxiv.org/abs/2108.08513), which expands documents with relevant keywords present within the BERT vocabulary. An easy example using TILDE is shown here: [passage_expansion_tilde.py](https://github.com/beir-cellar/beir/blob/main/examples/generation/passage_expansion_tilde.py)

4. Upcoming New work for Easy evaluation of Neural Sparse Retrieval Models
We are currently developing a new repository for easy evaluation of neural sparse models, including an inverted index implementation. This will enable unified evaluation of diverse neural sparse retrieval models such as uniCOIL, SPLADE, SPARTA, and DeepImpact.

An initial repository for this work and more details can be found here: https://github.com/NThakur20/sparse-retrieval.

5. Fixed breaking changes and reproducibility in Elasticsearch
Issue #58 reported problems with ES lexical-search reproducibility and with downloading the Elasticsearch client.

1. Added a ``sleep_for`` parameter in the ES code with a default value of 2 seconds. This forcefully sleeps after deleting the ES index and after indexing documents.
2. During bulk indexing (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html), there is a ``refresh`` parameter, which I have set to ``wait_for`` instead of the default ``false`` (see the sketch after this list). For more details, refer here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html.
3. Froze the ES version in beir to ``elasticsearch==7.9.1``, which helps us avoid the latest issues occurring with ES policies.
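
A minimal sketch of the second point using the elasticsearch Python client (the index name and documents are made up for illustration):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
docs = {"d1": "first passage", "d2": "second passage"}
actions = (
    {"_index": "beir-demo", "_id": doc_id, "_source": {"txt": text}}
    for doc_id, text in docs.items()
)
# refresh="wait_for" blocks until the indexed documents are visible to search,
# avoiding the race that made BM25 scores non-reproducible.
helpers.bulk(es, actions, refresh="wait_for")
```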

6. Temporary packages: TensorFlow
TensorFlow installation was causing issues while pip-installing beir. Only USE models were evaluated using TF, and they are currently not the most popular choice of models. Hence, we moved ``['tensorflow>=2.2.0', 'tensorflow-text', 'tensorflow-hub']`` into optional packages that can be installed separately in case a user wishes to evaluate the USE model or use TF for their own use case: ``pip install beir[tf]``.

7. Fixed breaking changes in sparse search in SparseRetrieval
As reported in issue #62, we have fixed a bug found in the sparse retrieval code used for evaluating SPARTA on the BEIR benchmark.

0.2.3

This is a small release update!

1. BEIR Benchmark paper accepted at NeurIPS 2021 (Datasets and Benchmark Track)
I'm quite thrilled to share that BEIR has been accepted at the NeurIPS 2021 conference. All reviewers gave positive reviews and found the benchmark useful for the community. More information can be found here: https://openreview.net/forum?id=wCu6T5xFjeJ.

2. New Multilingual datasets added within BEIR
New multilingual datasets have been added to the BEIR benchmark; BEIR now supports 10+ languages. We included mMARCO, the MSMARCO dataset translated into 8 languages (https://github.com/unicamp-dl/mMARCO), and Mr. TyDi, which contains train, development, and test data across 10 languages (https://github.com/castorini/mr.tydi). We hope to provide good and robust dense multilingual retrievers in the future.

3. Breaking change in Top-k accuracy now fixed
The top-k accuracy metric was mistakenly sorting the retrieved keys instead of the retriever model scores, which led to incorrect scores. This mistake was identified in issue #45, and the fix has now been merged.

4. Yannic Kilcher recognized BEIR as a helpful ML library
Yannic Kilcher recently mentioned the BEIR repository as a helpful library for benchmarking and evaluating diverse IR models and architectures. You can find more details in his latest ML News video on YouTube: https://www.youtube.com/watch?v=K3cmxn5znyU&t=1290s&ab_channel=YannicKilcher
