Sentence-transformers

Latest version: v3.3.1


```python
>>> model.similarity(embeddings, embeddings)
tensor([[1.0000, 0.7235, 0.0290, 0.1309],
        [0.7235, 1.0000, 0.0613, 0.1129],
        [0.0290, 0.0613, 1.0000, 0.5027],
        [0.1309, 0.1129, 0.5027, 1.0000]])
>>> model.similarity_fn_name
"cosine"
>>> model.similarity_fn_name = "euclidean"
>>> model.similarity(embeddings, embeddings)
tensor([...])
```
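
For context, a minimal sketch of how such a similarity matrix could be produced; the model name and sentences here are illustrative, not from the original example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
sentences = [
    "The weather is lovely today.",
    "It is sunny outside.",
    "He drove to the stadium.",
    "She took the bus downtown.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)  # 4x4 similarity tensor
```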

Additionally, model authors can take advantage of keyword argument passthrough. By updating the `modules.json` file to include a list of `kwargs`, e.g.:
```json
[
    {
        "idx": 0,
        "name": "0",
        "path": "",
        "type": "custom_transformer.CustomTransformer",
        "kwargs": ["task_type"]
    },
    ...
]
```

then if a user provides the `task_type` keyword argument in `model.encode`, this value will be propagated to the `forward` of the custom module(s). This way, users can specify custom functionality on the fly at inference time (as well as at load time via the `model_kwargs` option when initializing a `SentenceTransformer` model).
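
As an illustration, here is a minimal sketch of what such a custom module might look like; the `CustomTransformer` class, the `"query"` task type, and the branching logic are hypothetical, not part of the library:

```python
# custom_transformer.py -- hypothetical custom module (sketch only)
from sentence_transformers.models import Transformer


class CustomTransformer(Transformer):
    def forward(self, features, task_type=None):
        # `task_type` arrives here because it is listed under "kwargs" in
        # modules.json and the user passed it to `model.encode(...)`.
        if task_type == "query":
            # Hypothetical branch: adjust behavior for query inputs here.
            pass
        return super().forward(features)
```

A user would then call, e.g., `model.encode(["some text"], task_type="query")` to route that value into the module's `forward`.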

Update dependency versions (#2757)
* Restrict `numpy<2.0.0` due to issues with `torch` and `numpy` interoperability on Windows.
* Increment the minimum `transformers` version to 4.38.0 and `huggingface-hub` to 0.19.3 to prevent a training crash related to the `prefetch_factor` option.

Smaller Highlights
Features
* Add `show_progress_bar` to [`encode_multi_process`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode_multi_process) (#2762)
* Add `revision` to [`push_to_hub`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.push_to_hub) (#2902)
* Add `cache_dir` and `config_args` to `CrossEncoder` (#2784)
* Warn users if they might be passing training/evaluation columns in the wrong order, leading to worse training performance (#2928)

Bug fixes
* Prevent a crash when encoding an empty list (#2759)
* Support training with `GISTEmbedLoss` with DataParallel (DP) and DistributedDataParallel (DDP) (#2772)
* Fix a bug in `GroupByLabelBatchSampler` that resulted in some data not being used in training (#2788)
* Prevent a crash if a `datasets` directory exists locally (#2859)
* Fix `Matryoshka2dLoss` not importing correctly (#2907)
* Resolve a niche training bug that occurred when combining multi-dataset training, the no-duplicates batch sampler, and `dataloader_drop_last=True` (#2877)
* Fix `torch_compile=True` not working in `SentenceTransformerTrainingArguments`: it should now work, enabling faster training (#2884)
* Fix `SoftmaxLoss` performing worse since v3.0, as a Linear layer was ignored by the optimizer (#2881)
* Fix `trainer.train(resume_from_checkpoint="...")` with custom models (i.e. `trust_remote_code`) (#2918)
* Fix the evaluation erroneously using the training batch size (#2847)
* Fix encoding when passing `model_kwargs={"torch_dtype": torch.float16}` with models that use Dense layers (#2889)

Documentation
* New [documentation for batch samplers](https://sbert.net/docs/package_reference/sentence_transformer/sampler.html) (#2921, various PRs by fpgmaas)
* New [documentation for custom modules and model structure](https://sbert.net/docs/sentence_transformer/usage/custom_models.html) (#2773)

All changes
* [Typing] make device optional by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2731
* [Spelling] Docs by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2733
* [Spelling] Codespell readme by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2736
* [Spelling] update examples by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2734
* [`versions`] Increment transformers/hf-hub versions to prevent training crash by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2757
* Typo fixed in examples/training/sts/training_stsbenchmark.py by akkefa in https://github.com/UKPLab/sentence-transformers/pull/2743
* spelling: code comment updates by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2735
* Update DenoisingAutoEncoderDataset.py by sophia8844 in https://github.com/UKPLab/sentence-transformers/pull/2747
* [`fix`] Prevent crash when encoding empty list by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2759
* Fix syntax warning (issue 2687) by wyattscarpenter in https://github.com/UKPLab/sentence-transformers/pull/2765
* [`feat`] Add show_progress_bar to encode_multi_process by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2762
* Typing overload by janrito in https://github.com/UKPLab/sentence-transformers/pull/2763
* [`fix`] Fix retokenization on DDP/DP with GIST losses by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2775
* Cast predict scores to float before converting to numpy by malteos in https://github.com/UKPLab/sentence-transformers/pull/2783
* Elasticsearch example: simplify setup by maxjakob in https://github.com/UKPLab/sentence-transformers/pull/2778
* [chore] Enable ruff rules `Warning (W)` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2789
* [fix] Add tests for 3.12 in cicd by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2785
* Allow inheriting the Transformer class by mokha in https://github.com/UKPLab/sentence-transformers/pull/2810
* [`feat`] Add hard negatives mining utility by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2768
* [chore] add test for NoDuplicatesBatchSampler by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2795
* [chore] Add test for RoundrobinBatchSampler by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2798
* [feat] Improve GroupByLabelBatchSampler by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2788
* [`chore`] Clean-up `.gitignore` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2799
* [chore] improve the use of ruff and pre-commit hooks by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2793
* [feat] Move from `setup.py` and `setup.cfg` to `pyproject.toml` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2786
* [chore] Add `pytest-cov` and add test coverage command to the Makefile by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2794
* Move `pytest` config to `pyproject.toml` and remove `pytest.ini` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2819
* [`fix`] Fix packages discovery in `pyproject.toml` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2825
* Fix `ruff` pre-commit hook. by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2826
* [`chore`] Enable `isort` with `ruff` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2828
* [`chore`] Enable ruff rules `UP006` and `UP007` to improve type hints. by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2830
* [`chore`] Enable ruff's pyupgrade (`UP`) ruleset by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2834
* update SoftmaxLoss arguments by KiLJ4EdeN in https://github.com/UKPLab/sentence-transformers/pull/2894
* [feat] Added revision to push_to_hub argument. by pesuchin in https://github.com/UKPLab/sentence-transformers/pull/2902
* Perform additional check for owner string in `is_<library>_available` functions by leblancfg in https://github.com/UKPLab/sentence-transformers/pull/2859
* [`style`] Replace Huggingface with Hugging Face by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2905
* Fix typo: "comuptation" -> "computation" by jeffwidman in https://github.com/UKPLab/sentence-transformers/pull/2909
* [`ci`] Attempt to fix CI disk space issues by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2906
* [`docs`] Fix typo and broken links in documentation by ZiyiXia in https://github.com/UKPLab/sentence-transformers/pull/2861
* Add MNSRL with GradCache by madhavthaker1 in https://github.com/UKPLab/sentence-transformers/pull/2879
* Fix 'module object is not callable' error in Matryoshka2dLoss by pesuchin in https://github.com/UKPLab/sentence-transformers/pull/2907
* [`chore`] Add unittests for `InformationRetrievalEvaluator` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2838
* [`fix`] Safely continue if ProportionalBatchSampler sub-batch sampler throws StopIteration by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2877
* [`fix`] Fix `torch_compile=True` by always inserting a wrapped model into the loss by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2884
* [`fix`] Fix SoftmaxLoss by initializing the optimizer over the loss(es) rather than the model by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2881
* [`fix`] Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. `trust_remote_code`) by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2918
* [`docs`] Heavily extend sampler documentation by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2921
* [`feat`] Add support for streaming datasets by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2792
* [`fix`] Change eval dataloader to use eval_batch_size by akashd-2 in https://github.com/UKPLab/sentence-transformers/pull/2847
* [`feat`] Add cache_dir support to CrossEncoder by RoyBA in https://github.com/UKPLab/sentence-transformers/pull/2784
* [`deprecation`] Push deprecation cycle for `use_auth_token` to v4 by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2926
* [`security`] Load weights only with torch.load & pytorch_model.bin by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2927
* [`feat`] Allow loading custom modules; encode kwargs passthrough to modules by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2773
* [`fix`] Add dtype cast for modules other than Transformer by ir2718 in https://github.com/UKPLab/sentence-transformers/pull/2889
* [`docs`] Move losses up in the package reference; they're more important by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2929
* [`feat`] Add column order warnings to the data collator by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2928

New Contributors
* akkefa made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2743
* sophia8844 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2747
* wyattscarpenter made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2765
* janrito made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2763
* malteos made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2783
* fpgmaas made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2789
* KiLJ4EdeN made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2894
* pesuchin made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2902
* leblancfg made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2859
* jeffwidman made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2909
* ZiyiXia made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2861
* madhavthaker1 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2879
* akashd-2 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2847
* RoyBA made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2784

Big thanks to fpgmaas for the large number of valuable contributions surrounding tests, CI, config files, and overall project health.

**Full Changelog**: https://github.com/UKPLab/sentence-transformers/compare/v3.0.1...v3.1.0

0.4.1

**Refactored Tokenization**
- Faster tokenization speed: training and inference now use batched tokenization, so all sentences in a batch are tokenized simultaneously.
- Using `SentencesDataset` is no longer needed for training. You can pass your training examples directly to the DataLoader:
```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```

- If you use a custom torch `Dataset` class: the dataset class must now return `InputExample` objects instead of tokenized texts.
- The `SentenceLabelDataset` class has been updated to the new tokenization flow: it now always returns two or more `InputExample`s with the same label.

**Asymmetric Models**
Added a new `models.Asym` class that allows different encoding of sentences based on a tag (e.g. *query* vs. *paragraph*). Minimal example:

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

base_model = 'bert-base-uncased'  # any Hugging Face transformer model
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
```

Your input examples have to look like this:
```python
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
```

Encoding (note: mixed inputs are not allowed):
```python
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
```


Inputs with the key 'QRY' will be passed through the `d1` dense layer, while inputs with the key 'DOC' will be passed through the `d2` dense layer.
More documentation on how to design asymmetric models will follow soon.


**New Namespace & Models for Cross-Encoder**
Cross-Encoders are now hosted at [https://huggingface.co/cross-encoder](https://huggingface.co/cross-encoder). Also, new [pre-trained models](https://www.sbert.net/docs/pretrained_cross-encoders.html) have been added for NLI & QNLI.

**Logging**
Log messages now use a custom logger from `logging`, thanks to PR #623. This allows you to configure which log messages you want to see from which components.
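
For example, a minimal sketch of how you might tune this with Python's standard `logging` module, assuming the package's loggers follow its module names:

```python
import logging

logging.basicConfig(level=logging.INFO)
# Only show WARNING and above from sentence-transformers components,
# without affecting the rest of the application's logging.
logging.getLogger("sentence_transformers").setLevel(logging.WARNING)
```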

**Unit tests**
A lot more unit tests have been added, which test the different components of the framework.

0.4.0

- Updated the dependencies so that the library works with Hugging Face Transformers version 4. Sentence-Transformers still works with transformers version 3, but upgrading to version 4 is recommended; future changes might break compatibility with version 3.
- New naming scheme for pre-trained models: models are now named `{task}-{transformer_model}`, so 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models remain available under their old names, but newer models will follow the updated naming scheme.
- New application example for [information retrieval and question answering retrieval](https://www.sbert.net/examples/applications/information-retrieval/README.html), together with respective pre-trained models.

0.3.9

This release only includes some smaller updates:
- The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1.
- As some parts and models require PyTorch >= 1.6.0, the requirement was raised to at least PyTorch 1.6.0. Most of the code and models will still work with older PyTorch versions.
- `model.encode()` previously kept the embeddings on the GPU, which required a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to the CPU once they are computed.
- The `CrossEncoder` class now accepts a `max_length` parameter to control the truncation of inputs.
- The `CrossEncoder.predict` method now has an `apply_softmax` parameter that applies softmax on top of a multi-class output.
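
Taken together, a minimal sketch of these two options; the model name here is illustrative:

```python
from sentence_transformers import CrossEncoder

# Truncate all inputs to at most 512 tokens.
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=512)

# For a multi-class model (e.g. NLI), apply_softmax=True turns the
# per-class logits into probabilities.
scores = model.predict(
    [('A man is eating food.', 'A man is eating a meal.')],
    apply_softmax=True,
)
```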
