As you can see, the similarity between the search query and the correct document is much higher than that of an unrelated document, despite the very small matryoshka dimension applied. Feel free to copy this script locally, modify the `matryoshka_dim`, and observe the difference in similarities.
**Note**: Although the embeddings can be truncated to smaller sizes, a Matryoshka model itself is no smaller, and its training and inference are no faster or more memory-efficient. Only the processing and storage of the resulting embeddings become faster and cheaper.
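For example, a minimal sketch of the truncation itself; the model path is a placeholder for any model trained with `MatryoshkaLoss`, and 64 is an arbitrary choice of dimension:

```python
import torch.nn.functional as F

from sentence_transformers import SentenceTransformer

matryoshka_dim = 64  # arbitrary truncation dimension for this sketch

# Placeholder path: any model trained with MatryoshkaLoss works here
model = SentenceTransformer("path/to/matryoshka-model")

embeddings = model.encode(["A sentence to embed"], convert_to_tensor=True)
# Keep only the first `matryoshka_dim` values of each embedding
truncated = embeddings[..., :matryoshka_dim]
# Re-normalize so that cosine similarities remain well-behaved
truncated = F.normalize(truncated, p=2, dim=-1)
print(truncated.shape)  # (1, 64)
```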
</details>
Extra information:
* [Matryoshka Embeddings Documentation](https://sbert.net/examples/training/matryoshka/README.html)
* [Matryoshka Embeddings Blogpost](https://huggingface.co/blog/matryoshka)
Example training scripts:
* [matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py)
* [matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli_reduced_dim.py)
* [matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)
### CoSENTLoss (#2454)
[CoSENTLoss](https://sbert.net/docs/package_reference/losses.html#cosentloss) was introduced by [Jianlin Su, 2022](https://kexue.fm/archives/8847) as a drop-in replacement for [CosineSimilarityLoss](https://sbert.net/docs/package_reference/losses.html#cosinesimilarityloss). Experiments have shown that it produces a stronger learning signal than `CosineSimilarityLoss`.
```python
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.readers import InputExample

model = SentenceTransformer('bert-base-uncased')
# Labels are similarity scores, e.g. in the [0, 1] range
train_examples = [
    InputExample(texts=['My first sentence', 'My second sentence'], label=1.0),
    InputExample(texts=['My third sentence', 'Unrelated sentence'], label=0.3),
]
train_batch_size = 16  # defined here so the snippet runs as-is
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CoSENTLoss(model=model)
```
You can update [training_stsbenchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py) by replacing `CosineSimilarityLoss` with `CoSENTLoss` to observe the improved performance.
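To actually run training with this loss, a minimal sketch using the `model.fit` API is shown below; the epoch and warmup values are illustrative, not tuned:

```python
# Illustrative hyperparameters; tune them for your own dataset
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```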
### AnglELoss (#2471)
[AnglELoss](https://sbert.net/docs/package_reference/losses.html#angleloss) was introduced in [Li and Li, 2023](https://arxiv.org/pdf/2309.12871.pdf). It is an adaptation of [CoSENTLoss](https://sbert.net/docs/package_reference/losses.html#cosentloss) and also acts as a strong drop-in replacement for `CosineSimilarityLoss`. Compared to `CoSENTLoss`, `AnglELoss` uses a different similarity function that aims to avoid vanishing gradients.
As with `CoSENTLoss`, you can use it just like `CosineSimilarityLoss`.
```python
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.readers import InputExample

model = SentenceTransformer('bert-base-uncased')
# Labels are similarity scores, e.g. in the [0, 1] range
train_examples = [
    InputExample(texts=['My first sentence', 'My second sentence'], label=1.0),
    InputExample(texts=['My third sentence', 'Unrelated sentence'], label=0.3),
]
train_batch_size = 16  # defined here so the snippet runs as-is
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.AnglELoss(model=model)
```
You can update [training_stsbenchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py) by replacing `CosineSimilarityLoss` with `AnglELoss` to observe the improved performance.
### Prompt Templates (#2477)
Some models require using specific text prompts to achieve optimal performance. For example, with [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) you should prefix all queries with `query: ` and all passages with `passage: `. Another example is [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5), which performs best for retrieval when the input texts are prefixed with `Represent this sentence for searching relevant passages: `.
Sentence Transformer models can now be initialized with `prompts` and `default_prompt_name` parameters:
* `prompts` is an optional argument that accepts a dictionary mapping prompt names to prompt texts. The named prompt will be prepended to the input text during inference. For example,
```python
model = SentenceTransformer(
    "intfloat/multilingual-e5-large",
    prompts={
        "classification": "Classify the following text: ",
        "retrieval": "Retrieve semantically similar text: ",
        "clustering": "Identify the topic or theme based on the text: ",
    },
)
```
or
```python
model.prompts = {
    "classification": "Classify the following text: ",
    "retrieval": "Retrieve semantically similar text: ",
    "clustering": "Identify the topic or theme based on the text: ",
}
```
* `default_prompt_name` is an optional argument that determines the default prompt to be used. It must correspond to a prompt name in `prompts`. If `None`, then no prompt is used by default. For example,
```python
model = SentenceTransformer(
    "intfloat/multilingual-e5-large",
    prompts={
        "classification": "Classify the following text: ",
        "retrieval": "Retrieve semantically similar text: ",
        "clustering": "Identify the topic or theme based on the text: ",
    },
    default_prompt_name="retrieval",
)
```
or
```python
model.default_prompt_name = "retrieval"
```
Both of these parameters can also be specified in the `config_sentence_transformers.json` file of a saved model. That way, you won't have to specify these options manually when loading. When you save a Sentence Transformer model, these options will be automatically saved as well.
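As a minimal sketch of that round trip (the local path `my-prompted-model` is just a placeholder):

```python
# Saving writes `prompts` and `default_prompt_name` into
# config_sentence_transformers.json next to the model weights
model.save("my-prompted-model")

# Reloading the saved model restores both options automatically
model = SentenceTransformer("my-prompted-model")
print(model.prompts, model.default_prompt_name)
```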
During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded:
1. Explicitly using the `prompt` option in `SentenceTransformer.encode`:
```python
embeddings = model.encode("How to bake a strawberry cake", prompt="Retrieve semantically similar text: ")
```
2. Explicitly using the `prompt_name` option in `SentenceTransformer.encode` by relying on the prompts loaded from a) initialization or b) the model config.
```python
embeddings = model.encode("How to bake a strawberry cake", prompt_name="retrieval")
```
3. If neither `prompt` nor `prompt_name` is specified in `SentenceTransformer.encode`, then the prompt specified by `default_prompt_name` will be applied. If it is `None`, then no prompt will be applied.
```python
embeddings = model.encode("How to bake a strawberry cake")
```
Extra information:
* [Prompt Templates Documentation](https://sbert.net/examples/applications/computing-embeddings/README.html#prompt-templates)
### Instructor support (#2477)
Some INSTRUCTOR models, such as [hkunlp/instructor-large](https://huggingface.co/hkunlp/instructor-large), are natively supported in Sentence Transformers. These models are special, as they are trained with instructions in mind. Notably, the primary difference between normal Sentence Transformer models and Instructor models is that the latter do not include the instructions themselves in the pooling step.
The following models work out of the box:
* [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base)
* [hkunlp/instructor-large](https://huggingface.co/hkunlp/instructor-large)
* [hkunlp/instructor-xl](https://huggingface.co/hkunlp/instructor-xl)
You can use these models like so:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("hkunlp/instructor-large")
embeddings = model.encode(
    [
        "Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity",
        "Comparison of Atmospheric Neutrino Flux Calculations at Low Energies",
        "Fermion Bags in the Massive Gross-Neveu Model",
        "QCD corrections to Associated t-tbar-H production at the Tevatron",
    ],
    prompt="Represent the Medicine sentence for clustering: ",
)
print(embeddings.shape)
# => (4, 768)
```
<details><summary>Information Retrieval usage</summary>
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("hkunlp/instructor-large")

query = "where is the food stored in a yam plant"
query_instruction = (
    "Represent the Wikipedia question for retrieving supporting documents: "
)
corpus = [
    'Yams are perennial herbaceous vines native to Africa, Asia, and the Americas and cultivated for the consumption of their starchy tubers in many temperate and tropical regions. The tubers themselves, also called "yams", come in a variety of forms owing to numerous cultivars and related species.',
    "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession",
    "Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.",
]
corpus_instruction = "Represent the Wikipedia document for retrieval: "

query_embedding = model.encode(query, prompt=query_instruction)
corpus_embeddings = model.encode(corpus, prompt=corpus_instruction)
similarities = cos_sim(query_embedding, corpus_embeddings)
print(similarities)
```
</details>