Sentence-transformers

Latest version: v3.0.0


3.0.0

Sentence Transformer training refactor (2449)
The v3.0 release centers on a major modernization of the training approach for `SentenceTransformer` models. Whereas training before v3.0 revolved around `InputExample`, `DataLoader` and `model.fit`, the new training approach relies on 5 new components. You can learn more about these components in our [Training and Finetuning Embedding Models with Sentence Transformers v3](https://huggingface.co/blog/train-sentence-transformers) blogpost. Additionally, you can read the new [Training Overview](https://sbert.net/docs/sentence_transformer/training_overview.html), check out the [Training Examples](https://sbert.net/docs/sentence_transformer/training/examples.html), or read this summary:

1. [Dataset](https://sbert.net/docs/sentence_transformer/training_overview.html#dataset)
A training [`Dataset`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) or [`DatasetDict`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict). This class is much better suited for sharing and efficient modification than lists or DataLoaders of `InputExample` instances. A `Dataset` can contain multiple text columns that will be fed in order to the corresponding loss function. So, if the loss expects (anchor, positive, negative) triplets, then your dataset should also have 3 columns; the names of these columns are irrelevant (see the construction sketch after this list). If there is a "label" or "score" column, it is treated separately and used as the labels during training.
A `DatasetDict` can be used to train with multiple datasets at once, e.g.:
```python
DatasetDict({
    multi_nli: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 392702
    })
    snli: Dataset({
        features: ['snli_premise', 'hypothesis', 'label'],
        num_rows: 549367
    })
    stsb: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5749
    })
})
```

When a `DatasetDict` is used, the `loss` parameter to the `SentenceTransformerTrainer` must also be a dictionary with these dataset keys, e.g.:
```python
{
    'multi_nli': SoftmaxLoss(...),
    'snli': SoftmaxLoss(...),
    'stsb': CosineSimilarityLoss(...),
}
```

2. [Loss Function](https://sbert.net/docs/sentence_transformer/training_overview.html#loss-function)
A loss function, or a dictionary of loss functions as described above. These loss functions do not require any changes compared to before this PR.
3. [Training Arguments](https://sbert.net/docs/sentence_transformer/training_overview.html#training-arguments)
A `SentenceTransformerTrainingArguments` instance, a subclass of the [`TrainingArguments`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) class from `transformers`. This powerful class controls the specific details of the training.
4. [Evaluator](https://sbert.net/docs/sentence_transformer/training_overview.html#evaluator)
An optional [`SentenceEvaluator`](https://sbert.net/docs/package_reference/evaluation.html) instance. Unlike before, models can now be evaluated both on an evaluation dataset with some loss function and/or a `SentenceEvaluator` instance.
5. [Trainer](https://sbert.net/docs/sentence_transformer/training_overview.html#trainer)
The new `SentenceTransformerTrainer` instance based on the `transformers` `Trainer`. This instance is provided with a `SentenceTransformer` model, a `SentenceTransformerTrainingArguments` instance, a `SentenceEvaluator`, a training and evaluation `Dataset`/`DatasetDict`, and a loss function/dictionary of loss functions. Most of these parameters are optional. Once provided, all you have to do is call `trainer.train()`.
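
As a minimal sketch of point 1, a triplet dataset can be constructed directly with `datasets.Dataset.from_dict`; the column names and sentences below are purely illustrative:

```python
from datasets import Dataset

# Column names are free-form: they are matched to the loss inputs by position,
# so a triplet loss simply expects three text columns in (anchor, positive, negative) order.
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
    "negative": ["It's quite rainy, sadly.", "She walked to the store."],
})
```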

Some of the major features that are now implemented include:
* Multi-GPU training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
* bf16 training support
* Loss logging
* Evaluation datasets + evaluation loss
* Improved callback support (built-in support for Weights and Biases, TensorBoard, CodeCarbon, etc., as well as custom callbacks)
* Gradient checkpointing
* Gradient accumulation
* Improved model card generation
* Warmup ratio
* Pushing to the Hugging Face Hub on every model checkpoint
* Resuming from a training checkpoint
* Hyperparameter Optimization
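
Several of these features are configured through `SentenceTransformerTrainingArguments`. The snippet below is a hedged sketch of what that can look like; the output directory and hyperparameter values are illustrative only, and because the class subclasses `transformers.TrainingArguments`, the flags shown are the standard ones from that class:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-base-all-nli",  # where checkpoints and the final model are written
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,                        # warmup ratio
    bf16=True,                               # bf16 training (requires supported hardware)
    gradient_accumulation_steps=2,           # gradient accumulation
    gradient_checkpointing=True,             # gradient checkpointing
    logging_steps=100,                       # loss logging frequency
)
```

The resulting `args` object is then passed to the `SentenceTransformerTrainer` via its `args` parameter.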

This script is a minimal example (no evaluator, no training arguments) of training [`mpnet-base`](https://huggingface.co/microsoft/mpnet-base) on a part of the [`all-nli` dataset](https://huggingface.co/datasets/sentence-transformers/all-nli) using [`MultipleNegativesRankingLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 1. Load a model to finetune
model = SentenceTransformer("microsoft/mpnet-base")

# 2. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/all-nli", "triplet")
train_dataset = dataset["train"].select(range(10_000))
eval_dataset = dataset["dev"].select(range(1_000))

# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# 4. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

# 5. Save the trained model
model.save_pretrained("models/mpnet-base-all-nli")
```
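
The example above intentionally skips the evaluator. As a hedged sketch of adding one, a `TripletEvaluator` built from the same dev split can be passed to the trainer; the evaluator name and the assumption that the dev split exposes `anchor`/`positive`/`negative` columns are illustrative:

```python
from sentence_transformers.evaluation import TripletEvaluator

# Build an evaluator from the (assumed) anchor/positive/negative columns of the dev split
dev_evaluator = TripletEvaluator(
    anchors=eval_dataset["anchor"],
    positives=eval_dataset["positive"],
    negatives=eval_dataset["negative"],
    name="all-nli-dev",
)

# It can be run on its own, or passed to the trainer alongside the evaluation dataset
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
```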


Additionally, trained models now automatically produce extensive model cards. Each of the following models was trained using a script from the [Training Examples](https://sbert.net/docs/sentence_transformer/training/examples.html), and their model cards were not manually edited in any way:
* [tomaarsen/mpnet-base-all-nli-triplet](https://huggingface.co/tomaarsen/mpnet-base-all-nli-triplet)
* [tomaarsen/stsb-distilbert-base-mnrl-cl-multi](https://huggingface.co/tomaarsen/stsb-distilbert-base-mnrl-cl-multi)
* [tomaarsen/distilroberta-base-paraphrases-multi](https://huggingface.co/tomaarsen/distilroberta-base-paraphrases-multi)

Prior to the Sentence Transformers v3 release, all models would be trained using the [`SentenceTransformer.fit`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.fit) method. Rather than deprecating this method, starting from v3.0 it uses the [`SentenceTransformerTrainer`](https://sbert.net/docs/package_reference/sentence_transformer/trainer.html#sentence_transformers.trainer.SentenceTransformerTrainer) behind the scenes. This means that your old training code should still work, and it even gains new features such as multi-GPU training, loss logging, etc. That said, the new training approach is much more powerful, so it is **recommended** to write new training scripts using the new approach.

Many of the old training scripts have been updated to use the new Trainer-based approach, but not all of them yet. Pull requests that help update the remaining scripts are welcome.

Similarity Score (2615, 2490)

Sentence Transformers v3.0 introduces two new useful methods:
* [similarity](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity)
* [similarity_pairwise](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity_pairwise)

and one property:
* [similarity_fn_name](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity_fn_name)

These can be used to calculate the similarity between embeddings, and to specify which similarity function should be used, for example:

```python
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("all-mpnet-base-v2")
>>> sentences = [
...     "The weather is so nice!",
...     "It's so sunny outside.",
...     "He's driving to the movie theater.",
...     "She's going to the cinema.",
... ]
>>> embeddings = model.encode(sentences, normalize_embeddings=True)
>>> model.similarity(embeddings, embeddings)
```
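
The call above returns a 4x4 similarity matrix computed with the model's default similarity function. As a hedged sketch, the function can be switched via the `similarity_fn_name` property, and `similarity_pairwise` compares embeddings element-wise; the accepted function names are listed in the linked documentation, and `"dot"` is used here only as an illustration:

```python
>>> model.similarity_fn_name = "dot"  # switch from the default similarity function
>>> model.similarity(embeddings, embeddings)                   # full 4x4 similarity matrix
>>> model.similarity_pairwise(embeddings[:2], embeddings[2:])  # element-wise: 2 scores
```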

2.7.0

New loss function: CachedGISTEmbedLoss (2592)
For a number of years, [`MultipleNegativesRankingLoss`](https://sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) (also known as SimCSE, InfoNCE, in-batch negatives loss) has been the state of the art in embedding model training. Notably, this loss function performs better with a larger batch size.

Recently, various improvements have been introduced:
1. [`CachedMultipleNegativesRankingLoss`](https://sbert.net/docs/package_reference/losses.html#cachedmultiplenegativesrankingloss) was introduced, which allows you to pick much higher batch sizes (e.g. 65536) with constant memory.
2. [`GISTEmbedLoss`](https://sbert.net/docs/package_reference/losses.html#gistembedloss) takes a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.

Now, JacksonCakes has combined these two approaches to produce the best of both worlds: [`CachedGISTEmbedLoss`](https://sbert.net/docs/package_reference/losses.html#cachedgistembedloss). This loss function allows for high batch sizes with constant memory usage, while also using a guide model to assist with the in-batch negative sample selection.

As can be seen in our [Loss Overview](https://sbert.net/docs/training/loss_overview.html), this loss should be used with `(anchor, positive)` pairs or `(anchor, positive, negative)` triplets, much like `MultipleNegativesRankingLoss`, `CachedMultipleNegativesRankingLoss`, and `GISTEmbedLoss`. In short, any example using those loss functions can be updated to use `CachedGISTEmbedLoss`! Feel free to experiment, e.g. with [this training script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v3.py).
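
As a hedged sketch of the drop-in usage (the guide model and `mini_batch_size` below are illustrative choices, not recommendations):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("microsoft/mpnet-base")
guide = SentenceTransformer("all-MiniLM-L6-v2")  # small guide model for in-batch negative selection

# mini_batch_size bounds memory usage; the effective (large) batch size is set by the dataloader
loss = CachedGISTEmbedLoss(model, guide, mini_batch_size=64)
```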

Automatic Matryoshka model truncation (2573)
Sentence Transformers v2.4.0 introduced Matryoshka models: models whose embeddings are still useful after truncation. Since then, [many](https://huggingface.co/BEE-spoke-data/bert-plus-L8-v1.0-syntheticSTS-4k) [useful](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) [Matryoshka](https://huggingface.co/NeuML/pubmedbert-base-embeddings-matryoshka) [models](https://huggingface.co/mixedbread-ai/mxbai-embed-2d-large-v1) have been trained.

As of this release, the truncation for these Matryoshka embedding models can be done automatically via a new `truncate_dim` constructor argument:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

matryoshka_dim = 64
model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True,
    truncate_dim=matryoshka_dim,
)

embeddings = model.encode(
    [
        "search_query: What is TSNE?",
        "search_document: t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.",
        "search_document: Amelia Mary Earhart was an American aviation pioneer and writer.",
    ]
)
print(embeddings.shape)
# => (3, 64)

similarities = cos_sim(embeddings[0], embeddings[1:])
```

2.6.1

This is a patch release to fix a bug in [`semantic_search_faiss`](https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss) and [`semantic_search_usearch`](https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch) that caused the scores to not correspond to the returned corpus indices. Additionally, you can now evaluate embedding models after quantizing their embeddings.

Precision support in EmbeddingSimilarityEvaluator
You can now pass `precision` to the `EmbeddingSimilarityEvaluator` to evaluate the performance after quantization:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
import datasets

model = SentenceTransformer("all-mpnet-base-v2")

stsb = datasets.load_dataset("mteb/stsbenchmark-sts", split="test")

print("Spearman correlation based on Cosine Similarity on the STS Benchmark test set:")
for precision in ["float32", "uint8", "int8", "ubinary", "binary"]:
    evaluator = EmbeddingSimilarityEvaluator(
        stsb["sentence1"],
        stsb["sentence2"],
        [score / 5 for score in stsb["score"]],
        main_similarity=SimilarityFunction.COSINE,
        name="sts-test",
        precision=precision,
    )
    print(precision, evaluator(model))
```



2.6.0

Embedding Quantization
Embeddings may be challenging to scale up, which leads to expensive solutions and high latencies. However, there is a new approach to counter this problem; it entails reducing the size of each of the individual values in the embedding: **Quantization**. Experiments on quantization have shown that we can maintain a large amount of performance while significantly speeding up computation and saving on memory, storage, and costs.

To be specific, using binary quantization can retain 96% of the retrieval performance while speeding up retrieval by **25x** and saving **32x** on memory and disk space. Do not underestimate this approach! Read more about Embedding Quantization in our extensive [blogpost](https://huggingface.co/blog/embedding-quantization).

Binary and Scalar Quantization
Two forms of quantization exist at this time: binary and scalar (int8). These quantize embedding values from `float32` to `binary` and `int8`, respectively. For binary quantization, you can use the following snippet:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
```
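
Building on the loaded model, a hedged sketch of the quantization step itself: embeddings can be quantized directly at encoding time via the `precision` argument, or after the fact with `quantize_embeddings` (the example sentences are illustrative):

```python
# 2a. Encode with binary quantization directly
binary_embeddings = model.encode(
    ["I am driving to the lake.", "It is a beautiful day."],
    precision="binary",
)

# 2b. Or encode at full precision and quantize afterwards
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
```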

2.5.1

This is a patch release to fix a bug in `CrossEncoder.rank` that caused the last value to be discarded when using the default `top_k=-1`.

`CrossEncoder.rank` patch:

```python
from sentence_transformers.cross_encoder import CrossEncoder

# Pre-trained cross encoder
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")

# We want to compute the similarity between the query sentence ...
query = "A man is eating pasta."

# ... and all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
```


```
Query: A man is eating pasta.
0.67	A man is eating food.
0.34	A man is eating a piece of bread.
0.08	A man is riding a horse.
0.07	A man is riding a white horse on an enclosed ground.
0.01	The girl is carrying a baby.
0.01	Two men pushed carts through the woods.
0.01	A monkey is playing drums.
0.01	A woman is playing violin.
0.01	A cheetah is running behind its prey.
```

Previously, the lowest-scoring document would be removed from the output.

All changes
* [`examples`] Update model repo_id in 2dMatryoshka example by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2515
* [`feat`] Add get_config_dict to new Matryoshka2dLoss & AdaptiveLayerLoss by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2516
* [`chore`] Update to ruff 0.3.0; update ruff.toml by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2517
* [`example`] Don't always normalize the embeddings in clustering example by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2520
* Fix CrossEncoder.rank default value for `top_k` by xenova in https://github.com/UKPLab/sentence-transformers/pull/2518

New Contributors
* xenova made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2518

**Full Changelog**: https://github.com/UKPLab/sentence-transformers/compare/v2.5.0...v2.5.1

2.5.0

2D Matryoshka & Adaptive Layer models (2506)
Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. [2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776) (2DMSE) revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs.

For example, using Sentence Transformers, you can train an Adaptive Layer model that can be sped up by 2x at a 15% reduction in performance, or 5x on GPU & 10x on CPU for a 20% reduction in performance. The 2DMSE paper highlights scenarios where this is superior to using a smaller model.

Training

Training with Adaptive Layer support is quite elementary: rather than applying some loss function on only the last layer, we also apply that same loss function on the pooled embeddings from previous layers. Additionally, we employ a KL-divergence loss that aims to make the embeddings of the non-last layers match those of the last layer. This can be seen as a fascinating form of [knowledge distillation](https://sbert.net/examples/training/distillation/README.html#knowledge-distillation), but with the last layer as the teacher model and the prior layers as the student models.

For example, with the 12-layer [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base), it will now be trained such that the model produces meaningful embeddings after each of the 12 layers.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, AdaptiveLayerLoss

model = SentenceTransformer("microsoft/mpnet-base")

base_loss = CoSENTLoss(model=model)
loss = AdaptiveLayerLoss(model=model, loss=base_loss)
```

* **Reference**: <a href="https://sbert.net/docs/package_reference/losses.html#adaptivelayerloss"><code>AdaptiveLayerLoss</code></a>

Additionally, this can be combined with the `MatryoshkaLoss` such that the resulting model can be reduced both in the number of layers, but also in the size of the output dimensions. See also the [Matryoshka Embeddings](https://sbert.net/examples/training/matryoshka/README.html) for more information on reducing output dimensions. In Sentence Transformers, the combination of these two losses is called `Matryoshka2dLoss`, and a shorthand is provided for simpler training.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, Matryoshka2dLoss

model = SentenceTransformer("microsoft/mpnet-base")

base_loss = CoSENTLoss(model=model)
loss = Matryoshka2dLoss(model=model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```


* **Reference**: <a href="https://sbert.net/docs/package_reference/losses.html#matryoshka2dloss"><code>Matryoshka2dLoss</code></a>

<details><summary>Performance Results</summary>

Results

Let's look at the performance that we may be able to expect from an Adaptive Layer embedding model versus a regular embedding model. For this experiment, I have trained two models:

* [tomaarsen/mpnet-base-nli-adaptive-layer](https://huggingface.co/tomaarsen/mpnet-base-nli-adaptive-layer): Trained by running [adaptive_layer_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/adaptive_layer/adaptive_layer_nli.py) with [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base).
* [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): A nearly identical model to the former, but using only `MultipleNegativesRankingLoss` rather than `AdaptiveLayerLoss` on top of `MultipleNegativesRankingLoss`. I also use [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) as the base model.

Both of these models were trained on the AllNLI dataset, which is a concatenation of the [SNLI](https://huggingface.co/datasets/snli) and [MultiNLI](https://huggingface.co/datasets/multi_nli) datasets. I have evaluated these models on the [STSBenchmark](https://huggingface.co/datasets/mteb/stsbenchmark-sts) test set using multiple different embedding dimensions. The results are plotted in the following figure:

![adaptive_layer_results](https://huggingface.co/tomaarsen/mpnet-base-nli-adaptive-layer/resolve/main/adaptive_layer_results.png)

The first figure shows that the Adaptive Layer model stays much more performant when reducing the number of layers in the model. This is also clearly shown in the second figure, which displays that 80% of the performance is preserved when the number of layers is reduced all the way to 1.

Lastly, the third figure shows the expected speedup ratio for GPU & CPU devices in my tests. As you can see, removing half of the layers results in roughly a 2x speedup, at a cost of ~15% performance on STSB (~86 -> ~75 Spearman correlation). When removing even more layers, the performance benefit gets larger for CPUs, and between 5x and 10x speedups are very feasible with a 20% loss in performance.

</details>

<details><summary>Inference</summary>

Inference

After a model has been trained using the Adaptive Layer loss, you can then truncate the model layers to your desired layer count. Note that this requires doing a bit of surgery on the model itself, and each model is structured a bit differently, so the steps are slightly different depending on the model.

First of all, we will load the model & access the underlying `transformers` model like so:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tomaarsen/mpnet-base-nli-adaptive-layer")

# We can access the underlying model with `model[0].auto_model`
print(model[0].auto_model)
```


```
MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): MPNetOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (relative_attention_bias): Embedding(32, 12)
  )
  (pooler): MPNetPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
```

This output will differ depending on the model. We will look for the repeated layers in the encoder. For this MPNet model, this is stored under `model[0].auto_model.encoder.layer`. Then we can slice the model to only keep the first few layers to speed up the model:

```python
new_num_layers = 3
model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:new_num_layers]
```


Then we can run inference with it using <a href="https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode"><code>SentenceTransformer.encode</code></a>.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("tomaarsen/mpnet-base-nli-adaptive-layer")
new_num_layers = 3
model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:new_num_layers]

embeddings = model.encode(
    [
        "The weather is so nice!",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
)
# Similarity of the first sentence with the other two
similarities = cos_sim(embeddings[0], embeddings[1:])
```
