Sentence-transformers

Latest version: v3.3.1

```python
tensor([[-0.0000, -0.7437, -1.3935, -1.3184],
        [-0.7437, -0.0000, -1.3702, -1.3320],
        [-1.3935, -1.3702, -0.0000, -0.9973],
        [-1.3184, -1.3320, -0.9973, -0.0000]])
```

Additionally, you can compute the similarity between pairs of embeddings, resulting in a 1-dimensional vector of similarities rather than a 2-dimensional matrix:
```python
>>> model = SentenceTransformer("all-mpnet-base-v2")
>>> sentences = [
...     "The weather is so nice!",
...     "It's so sunny outside.",
...     "He's driving to the movie theater.",
...     "She's going to the cinema.",
... ]
>>> embeddings = model.encode(sentences, normalize_embeddings=True)
>>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
```

1.2.1

Not secure
Final release of version 1: Makes v1 of sentence-transformers forward compatible with models from version 2 of sentence-transformers.

1.2.0

Not secure
Unsupervised Sentence Embedding Learning

New methods have been integrated to train sentence embedding models without labeled data. See [Unsupervised Learning](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning) for an overview of all existing methods.

New methods:
- **[CT](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/CT)**: Integration of [Semantic Re-Tuning With Contrastive Tension (CT)](https://openreview.net/pdf?id=Ov_sMNau-PF) to tune models without labeled data
- **[CT_In-Batch_Negatives](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/CT_In-Batch_Negatives)**: A modification of CT using in-batch negatives
- **[SimCSE](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/SimCSE)**: An unsupervised sentence embedding learning method by [Gao et al.](https://arxiv.org/abs/2104.08821)
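
As an illustration of the SimCSE idea, here is a minimal sketch along the lines of the example scripts: each sentence is paired with itself and dropout acts as the noise, so training can use `MultipleNegativesRankingLoss`. The base model and the placeholder sentences are illustrative assumptions, not part of the release.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a SentenceTransformer from a plain transformer plus mean pooling
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=64)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# SimCSE: every unlabeled sentence is its own positive pair; dropout provides the perturbation
sentences = ["Replace these with your own unlabeled sentences.", "Any in-domain text works."]
train_examples = [InputExample(texts=[s, s]) for s in sentences]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, show_progress_bar=True)
```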

Pre-Training Methods
- **[MLM](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/MLM):** An example script to run Masked Language Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. MLM also works well for domain transfer: you first train on your custom data, and then train with e.g. NLI or STS data. A minimal sketch is shown below.
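
A minimal MLM pre-training sketch using Hugging Face `transformers`; this is an assumption-laden outline rather than the exact example script, and `my_domain_corpus.txt` is a hypothetical text file with one sentence per line.

```python
from transformers import (
    AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling,
    LineByLineTextDataset, Trainer, TrainingArguments,
)

model_name = "distilbert-base-uncased"  # any masked-LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One sentence per line; the file name is a placeholder for your own corpus
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="my_domain_corpus.txt", block_size=256)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-output", num_train_epochs=1, per_device_train_batch_size=16),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("mlm-output")
tokenizer.save_pretrained("mlm-output")  # load "mlm-output" afterwards for supervised sentence-embedding training
```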


Training Examples
- **[Paraphrase Data](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/paraphrases):** In our paper [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) we have shown that training on paraphrase data is powerful. That folder provides collections of different paraphrase datasets and scripts to train on them.
- **[NLI with MultipleNegativesRankingLoss](https://www.sbert.net/examples/training/nli/README.html#multiplenegativesrankingloss)**: A dedicated example of how to use MultipleNegativesRankingLoss for training with NLI data, which leads to a significant performance boost.




New models
- **[New NLI & STS models](https://www.sbert.net/docs/pretrained_models.html#semantic-textual-similarity):** Following the [Paraphrase Data training example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/paraphrases) we published new models trained on NLI and NLI+STS data. Training code is available: [training_nli_v2.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py).

| Model-Name | STSb-test performance |
| --- | :---: |
| *Previous best models* | |
| nli-bert-large | 79.19 |
| stsb-roberta-large | 86.39 |
| *New v2 models* | |
| nli-mpnet-base-v2 | 86.53 |
| stsb-mpnet-base-v2 | 88.57 |

- **[New MS MARCO model for Semantic Search](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)**: [Hofstätter et al.](https://arxiv.org/abs/2104.06967) optimized the training procedure on the [MS MARCO dataset](https://www.sbert.net/examples/training/ms_marco/README.html). The resulting model is integrated as **msmarco-distilbert-base-tas-b** and improves the performance on the MS MARCO dataset from 33.13 to 34.43 MRR@10.

New Functions
- `SentenceTransformer.fit()` **Checkpoints**: The fit() method can now save checkpoints during training at a fixed number of steps (see the sketch after this list). [More info](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.fit)
- **Pooling-mode as string**: You can now pass the pooling mode to `models.Pooling()` as a string:
```python
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
```
Valid values are mean/max/cls.
- **[NoDuplicatesDataLoader](https://www.sbert.net/docs/package_reference/datasets.html#noduplicatesdataloader)**: When using [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss), one should avoid having duplicate sentences in the same batch. This data loader simplifies that task and ensures that no duplicate entries end up in the same batch (see the sketch below).
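
A minimal sketch combining both additions, the `NoDuplicatesDataLoader` and the new checkpointing arguments of `fit()`. The base model and the two toy training pairs are placeholders chosen for illustration.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

model = SentenceTransformer("nli-mpnet-base-v2")

# Toy (anchor, positive) pairs for MultipleNegativesRankingLoss
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."]),
    InputExample(texts=["A woman plays guitar.", "A woman is playing an instrument."]),
]

# Ensures that no duplicate sentences appear within one batch
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# fit() can now write intermediate checkpoints every `checkpoint_save_steps` training steps
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    checkpoint_path="checkpoints/",
    checkpoint_save_steps=500,
)
```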

1.1.0

Not secure
Unsupervised Sentence Embedding Learning
This release integrates methods that allow learning sentence embeddings without labeled data:
- **[TSDAE](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/TSDAE)**: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method was presented in our [recent paper](https://arxiv.org/abs/2104.06979) and achieves state-of-the-art performance for several tasks.
- **[GenQ](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/query_generation)**: GenQ uses a pre-trained T5 model to generate queries for a given passage. It was presented in our [recent BEIR paper](https://arxiv.org/abs/2104.08663) and works well for domain adaptation for [semantic search](https://www.sbert.net/examples/applications/semantic-search/README.html).
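
To make the GenQ step concrete, here is a sketch of synthetic query generation with a T5 query-generation checkpoint published alongside the BEIR work; the passage text is a placeholder.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
model = T5ForConditionalGeneration.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")

passage = "Python is a high-level, general-purpose programming language."
inputs = tokenizer(passage, return_tensors="pt")

# Sample a few synthetic queries for this passage; the resulting (query, passage) pairs
# can then be used to train a dense retriever, e.g. with MultipleNegativesRankingLoss
outputs = model.generate(**inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```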

New Models - SentenceTransformer
- [MSMARCO Dot-Product Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html): We trained models using the dot-product instead of cosine similarity as the similarity function. As shown in our [recent BEIR paper](https://arxiv.org/abs/2104.08663), models with cosine similarity prefer retrieving short documents, while models with dot-product prefer retrieving longer documents. Now you can choose whichever is most suitable for your task (see the sketch after this list).
- [MSMARCO MiniLM Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html): We uploaded models based on [MiniLM](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased): they use just 384 dimensions, are faster than previous models, and achieve nearly the same performance.
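
A minimal semantic-search sketch with a dot-product model; the model name is the TAS-B checkpoint mentioned earlier in these notes, and the query and documents are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# Dot-product model: score with util.dot_score instead of cosine similarity
model = SentenceTransformer("msmarco-distilbert-base-tas-b")

query_emb = model.encode("How many people live in London?")
doc_emb = model.encode([
    "Around 9 million people live in London.",
    "London is known for its museums and parks.",
])

scores = util.dot_score(query_emb, doc_emb)
print(scores)  # higher score = better match
```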

New Models - CrossEncoder
- [MSMARCO Re-ranking Models v2](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html): We trained new CrossEncoder re-ranking models on the MS MARCO dataset that are significantly faster and significantly better. They outperform BERT-large models in terms of accuracy while being 18 times faster. [Training code is available](https://www.sbert.net/examples/training/ms_marco/README.html#cross-encoder).

New Features
- You can now pass a `default_activation_function` to the CrossEncoder class, which is applied on top of the output logits generated by the model (see the sketch after this list).
- You can now pre-process images for the [CLIP Model](https://www.sbert.net/examples/applications/image-search/README.html). Soon I will release a tutorial on how to fine-tune the CLIP model with your data.
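
A sketch of the `default_activation_function` option; the cross-encoder checkpoint and the query/passage pair are illustrative.

```python
import torch
from sentence_transformers import CrossEncoder

# Apply a sigmoid on top of the raw logits so predict() returns scores in [0, 1]
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    default_activation_function=torch.nn.Sigmoid(),
)
scores = model.predict([
    ("How many people live in Berlin?", "Berlin has a population of around 3.6 million."),
])
print(scores)
```

And a sketch of encoding an image with the CLIP model; `dog_photo.jpg` is a hypothetical local file.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip_model = SentenceTransformer("clip-ViT-B-32")

# Images and texts are embedded into the same vector space
img_emb = clip_model.encode(Image.open("dog_photo.jpg"))
text_emb = clip_model.encode(["A photo of a dog", "A photo of a cat"])
print(util.pytorch_cos_sim(img_emb, text_emb))
```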

1.0.4

Not secure
Previously it was not possible to fine-tune and save the CLIPModel; this release fixes that. CLIPModel can now be saved like any other model by calling `model.save(path)`.

1.0.3

Not secure
