pip install sentence-transformers[onnx-gpu]==3.2.0
pip install sentence-transformers[onnx]==3.2.0
pip install sentence-transformers[openvino]==3.2.0
Faster ONNX and OpenVINO Backends for SentenceTransformer (2712)
Introducing a new `backend` keyword argument to the `SentenceTransformer` initialization, allowing values of `"torch"` (default), `"onnx"`, and `"openvino"`.
These come with new installations:
bash
pip install sentence-transformers[onnx-gpu]
or ONNX for CPU only:
pip install sentence-transformers[onnx]
or
pip install sentence-transformers[openvino]
It's as simple as:
python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
If you specify a `backend` and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to `model.push_to_hub` or `model.save_pretrained` into the same model repository or directory to avoid having to re-export the model every time.
All keyword arguments passed via `model_kwargs` will be passed on to [`ORTModel.from_pretrained`](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModel.from_pretrained) or [`OVBaseModel.from_pretrained`](https://huggingface.co/docs/optimum/intel/openvino/reference#optimum.intel.openvino.modeling_base.OVBaseModel.from_pretrained). The most useful arguments are:
* `provider`: (Only if `backend="onnx"`) ONNX Runtime provider to use for loading the model, e.g. `"CPUExecutionProvider"` . See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (E.g. `"CUDAExecutionProvider"`) will be used.
* `file_name`: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" and "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
* `export`: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
For example:
python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"all-MiniLM-L6-v2",
backend="onnx",
model_kwargs={
"file_name": "model_O3.onnx",
"provider": "CPUExecutionProvider",
}
)
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
Benchmarks
We ran [benchmarks](https://sbert.net/docs/sentence_transformer/usage/efficiency.html#benchmark) for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. Here are the findings:
<p float="left">
<img src="https://github.com/user-attachments/assets/d3f423ff-ad4e-4c91-9beb-8217a062a61d" width="45%" />
<img src="https://github.com/user-attachments/assets/3b9ae402-1127-4152-a925-70c3d626b27d" width="45%" />
</p>
These findings resulted in these recommendations:
![image](https://github.com/user-attachments/assets/0ace85c5-622b-471a-8e20-9331a1ae12c7)
For GPU, you can expect **2x speedup with fp16 at no cost**, and for CPU you can expect **~2.5x speedup at a cost of 0.4% accuracy**.
<details><summary>ONNX Optimization and Quantization</summary>
In addition to exporting default ONNX and OpenVINO models, we also introduce 2 helper methods for optimizing and quantizing ONNX models:
Optimization
[`export_optimized_onnx_model`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_optimized_onnx_model): This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options [here](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/optimization#optimizing-a-model-during-the-onnx-export). This function accepts:
* `model` A SentenceTransformer model loaded with `backend="onnx"`.
* `optimization_config`: ["O1", "O2", "O3", or "O4" from ๐ค Optimum](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/optimization) or a custom [`OptimizationConfig`](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.OptimizationConfig) instance.
* `model_name_or_path`: The directory or model repository where the optimized model will be saved.
* `push_to_hub`: Whether the push the exported model to the hub with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`.
* `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
* `file_suffix`: The suffix to add to the optimized model file name. Will use the `optimization_config` string or `"optimized"` if not set.
The usage is like this:
python
from sentence_transformers import SentenceTransformer, export_optimized_onnx_model
onnx_model = SentenceTransformer("BAAI/bge-large-en-v1.5", backend="onnx")
export_optimized_onnx_model(
model=onnx_model,
optimization_config="O4",
model_name_or_path="BAAI/bge-large-en-v1.5",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
python
from sentence_transformers import SentenceTransformer
pull_request_nr = 2 TODO: Update this to the number of your pull request
model = SentenceTransformer(
"BAAI/bge-large-en-v1.5",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
revision=f"refs/pr/{pull_request_nr}"
)
or when it gets merged:
python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"BAAI/bge-large-en-v1.5",
backend="onnx",
model_kwargs={"file_name": "onnx/model_O4.onnx"},
)
Quantization
[`export_dynamic_quantized_onnx_model`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_dynamic_quantized_onnx_model): This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts
* `model` A SentenceTransformer model loaded with `backend="onnx"`.
* `quantization_config`: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from [AutoQuantizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.AutoQuantizationConfig), or an [QuantizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.QuantizationConfig) instance.
* `model_name_or_path`: The directory or model repository where the optimized model will be saved.
* `push_to_hub`: Whether the push the exported model to the hub with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`.
* `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
* `file_suffix`: The suffix to add to the optimized model file name. Will use the `quantization_config` string or e.g. `"int8_quantized"` if not set.
The usage is like this:
python
from sentence_transformers import SentenceTransformer, export_quantized_onnx_model
onnx_model = SentenceTransformer("BAAI/bge-large-en-v1.5", backend="onnx")
export_quantized_onnx_model(
model=onnx_model,
quantization_config="avx512",
model_name_or_path="BAAI/bge-large-en-v1.5",
push_to_hub=True,
create_pr=True,
)
After which you can load the model with:
python
from sentence_transformers import SentenceTransformer
pull_request_nr = 2 TODO: Update this to the number of your pull request
model = SentenceTransformer(
"BAAI/bge-large-en-v1.5",
backend="onnx",
model_kwargs={"file_name": "onnx/model_qint8_avx512.onnx"},
revision=f"refs/pr/{pull_request_nr}"
)
or when it gets merged:
python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
"BAAI/bge-large-en-v1.5",
backend="onnx",
model_kwargs={"file_name": "onnx/model_qint8_avx512.onnx"},
)
</details>
Lightning-Fast Static Embeddings via Model2Vec (2961)
If ONNX or OpenVINO isn't fast enough for you yet, then perhaps you'll enjoy Static Embeddings. These embeddings are a bit akin to [GLoVe](https://nlp.stanford.edu/projects/glove/) or [Word2vec](https://en.wikipedia.org/wiki/Word2vec), i.e. they're bags of token embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings that don't require any neural networks.
However, these Static Embeddings are created in different ways. For example:
1. Distillation via the [Model2Vec](https://github.com/MinishLab/model2vec) technique. This projects allows you to distill any Sentence Transformer model into Static Embeddings. For example, distilling [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) resulted in a Static Embeddings Sentence Transformer model that reaches 87.5% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) on MTEB (+ PEARL & WordSim) and 97.4% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) on [various classification benchmarks](https://github.com/MinishLab/model2vec?tab=readme-ov-file#classification-and-speed-benchmarks).
You can initialize Static Embeddings via Model2Vec in two ways:
* [`from_model2vec`](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding.from_model2vec): You can load one of the pretrained [Model2Vec models](https://huggingface.co/models?library=model2vec):
python
note: `pip install model2vec` is needed, but not for inference
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
Initialize a Sentence Transformer model with a static embedding from a pretrained model2vec model
static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])
Encode some texts
queries = ["What is the capital of France?", "How many people live in the Netherlands?"]
documents = ["Paris is the capital of France", "The Netherlands has 17 million inhabitants"]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
Compute similarities
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
"""
tensor([[0.8170, 0.3843],
[0.3929, 0.5818]])
"""
* [`from_distillation`](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding.from_distillation): You can use the name of any Sentence Transformer model alongside some parameters (See [this docs](https://github.com/MinishLab/model2vec#distilling-a-model2vec-model) for more information) to perform the distillation yourself, without needing any dataset. On my device, this takes ~4s on a GPU and ~2 minutes on a CPU:
python
note: `pip install model2vec` is needed, but not for inference
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
Initialize a Sentence Transformer model with a static embedding by distilling via model2vec
static_embedding = StaticEmbedding.from_distillation(
"mixedbread-ai/mxbai-embed-large-v1",
device="cuda",
pca_dims=256,
apply_zipf=True,
)
model = SentenceTransformer(modules=[static_embedding])
Encode some texts
queries = ["What is the capital of France?", "How many people live in the Netherlands?"]
documents = ["Paris is the capital of France", "The Netherlands has 17 million inhabitants"]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
Compute similarities
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
"""
tensor([[0.8430, 0.3271],
[0.3213, 0.5861]])
"""
2. Random initialization: Although this initialization needs finetuning, finetuning a Sentence Transformers model backed by StaticEmbedding is extremely fast. For example, I was able to finetune [tomaarsen/static-bert-uncased-gooaq](https://huggingface.co/tomaarsen/static-bert-uncased-gooaq) with MatryoshkaLoss & MultipleNegativesRankingLoss on the entire (3 million pairs) [gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq) dataset in just 7 minutes. This model reaches a NDCG10 of 79.33 on a hold-out set of 10k samples from gooaq, whereas e.g. [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) reaches 85.01 NDCG10. In short, only 6.6% less performance for a model that's about 500x faster.
That's not a typo: I can compute embeddings for about 14000 [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) sentences from per second on *CPU*, compared to about ~24 with [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5), a.k.a. 625x faster.
> [!NOTE]
> You can `save_pretrained` and load these models like any other Sentence Transformer models, the `StaticEmbedding` initialization is only necessary when you're creating a *new* model.
> * Creation:
> python
> from sentence_transformers import SentenceTransformer
> from sentence_transformers.models import StaticEmbedding
>
> Initialize a Sentence Transformer model with a static embedding from a pretrained model2vec model
> static_embedding = StaticEmbedding.from_distillation(
> "mixedbread-ai/mxbai-embed-large-v1",
> device="cuda",
> pca_dims=256,
> apply_zipf=True,
> )
> model = SentenceTransformer(modules=[static_embedding])
> model.save_pretrained("static-mxbai-embed-large-v1")
> or
> model.push_to_hub("tomaarsen/static-mxbai-embed-large-v1")
>
> * Inference:
> python
> from sentence_transformers import SentenceTransformer
>
> Initialize a Sentence Transformer model with a static embedding
> model = SentenceTransformer("static-mxbai-embed-large-v1")
>
> model.encode([...])
>
Small changes
* The [`InformationRetrievalEvaluator`](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator) now accepts `query_prompt`, `query_prompt_name`, `corpus_prompt`, and `corpus_prompt_name` arguments, useful if your model requires specific prompts for queries and/or documents for the best performance. (2951)
* The [`mine_hard_negatives`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives) function now accepts `anchor_column_name` and `positive_column_name` for specifying which dataset columns will be used. If not specified, the first two columns are used, respectively. Additionally, the `min_score` parameter is added, ensuring that all mined negatives have a similarity score of at least `min_score` according to the chosen `SentenceTransformer` or `CrossEncoder` model. (2977)
* If you're using multiple evaluators during training via [SequentialEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sequentialevaluator), e.g. multiple evaluators for different Matryoshka dimensions, then the order is now preserved in the training logs in the model card. Previously, they were sorted by name, resulting in weird orderings (e.g. "gooaq-1024", "gooaq-128", "gooaq-256", "gooaq-32", "gooaq-512", "gooaq-64") (2963)
* [`CachedGISTEmbedLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss) has been improved to support multiple negatives per sample, i.e. the loss now accepts data in the `(anchor, positive, negative_1, โฆ, negative_n)` format. It is the third loss to support this format (see [docs](https://sbert.net/docs/sentence_transformer/loss_overview.html)):
![image](https://github.com/user-attachments/assets/758a2143-87f8-4d5e-9cf4-e887e50e8c73)
All changes
* [`fix`] Only save first module in root if "save_in_root" is specified. by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2957
* [`feat`] Add query prompts to Information Retrieval Evaluator by ArthurCamara in https://github.com/UKPLab/sentence-transformers/pull/2951
* [`model cards`] Keep evaluation order in training logs if there's multiple evaluators by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2963
* Add negatives in CachedGISTEmbedLoss by daegonYu in https://github.com/UKPLab/sentence-transformers/pull/2946
* [ENH] -- `CrossEncoder.rank` by it176131 in https://github.com/UKPLab/sentence-transformers/pull/2947
* [`feat`] Add lightning-fast StaticEmbedding module based on model2vec by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2961
* [`feat`] Add ONNX and OpenVINO backends by helena-intel and tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2712
* Refine mine_hard_negatives arguments by bakrianoo in https://github.com/UKPLab/sentence-transformers/pull/2977
New Contributors
* daegonYu made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2946
* it176131 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2947
* helena-intel made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2712
* bakrianoo made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2977
Special thanks to echarlaix for making the new backends possible due to some last-minute changes in `optimum` and `optimum-intel`.
**Full Changelog**: https://github.com/UKPLab/sentence-transformers/compare/v3.1.1...v3.2.0