Sentence-transformers

Latest version: v3.3.1


2.2.1

Not secure
Version `0.8.1` of `huggingface_hub` introduced several changes that resulted in errors and warnings. This version of `sentence-transformers` fixes these issues.

Further, several improvements have been added / merged:
- `util.community_detection` was improved: 1) it now works in a batched mode to save memory, 2) overlapping clusters are no longer dropped entirely; instead, only the overlapping items are removed, 3) the parameter `init_max_size` was removed and replaced by a heuristic that estimates the maximum cluster size (a usage sketch follows this list)
- #1581: training dataset names can now be saved in the model card
- #1426: fixed the text summarization example
- #1487: recursive sentence-transformers models are now possible
- #1522: private models can now be loaded
- #1551: DataLoaders can now use multiple workers
- #1565: models are only checked on the hub if they are not found in the local cache; this fixes problems when connectivity is limited
- #1591: added an example showing how to stream-encode larger datasets
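
As a rough sketch of the batched `util.community_detection` (the model name, sentences, and parameter values here are illustrative placeholders, not taken from the release notes):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Similarities are now computed in batches internally to save memory
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=2)
for cluster in clusters:
    print([sentences[idx] for idx in cluster])
```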

2.2.0

Not secure
T5
You can now use the encoder from T5 to learn text embeddings. You can use it like any other transformer model:
```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```


See the [T5 benchmark results](https://www.sbert.net/docs/training/overview.html#best-transformer-model): the T5 encoder is not the best model for learning text embeddings. It requires quite a lot of training data and training steps. Other models perform much better, at least in the given experiment with 560k training triplets.

New Models
The models from the papers [Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models](https://arxiv.org/abs/2108.08877) and [Large Dual Encoders Are Generalizable Retrievers](https://arxiv.org/abs/2112.07899) have been added:
- [gtr-t5-base](https://huggingface.co/sentence-transformers/gtr-t5-base)
- [gtr-t5-large](https://huggingface.co/sentence-transformers/gtr-t5-large)
- [gtr-t5-xl](https://huggingface.co/sentence-transformers/gtr-t5-xl)
- [gtr-t5-xxl](https://huggingface.co/sentence-transformers/gtr-t5-xxl)
- [sentence-t5-base](https://huggingface.co/sentence-transformers/sentence-t5-base)
- [sentence-t5-large](https://huggingface.co/sentence-transformers/sentence-t5-large)
- [sentence-t5-xl](https://huggingface.co/sentence-transformers/sentence-t5-xl)
- [sentence-t5-xxl](https://huggingface.co/sentence-transformers/sentence-t5-xxl)

For benchmark results, see [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/gtr-t5-base)

Private Models

Thanks to #1406, you can now load private models from the hub:
```python
model = SentenceTransformer("your-username/your-model", use_auth_token=True)
```

2.1.0

Not secure
This is a smaller release with some new features.

MarginMSELoss
[MarginMSELoss](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MarginMSELoss.py) is a great method to train embedding models with the help of a cross-encoder model. The details are explained here: [MSMARCO - MarginMSE Training](https://www.sbert.net/examples/training/ms_marco/README.html#marginmse)

You pass your training data in the format:
```python
InputExample(
    texts=[query, positive, negative],
    label=cross_encoder.predict([query, positive]) - cross_encoder.predict([query, negative]),
)
```
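
For illustration, this is how such a training example might be built with a cross-encoder (the model name and texts below are placeholders, not from the release notes):

```python
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Placeholder cross-encoder and texts, purely for illustration
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "what is python"
positive = "Python is a high-level programming language."
negative = "Pythons are large, non-venomous snakes."

# The label is the margin between the cross-encoder scores of the positive and the negative passage
margin = float(
    cross_encoder.predict([(query, positive)])[0] - cross_encoder.predict([(query, negative)])[0]
)
example = InputExample(texts=[query, positive, negative], label=margin)
```

The bi-encoder is then trained with `losses.MarginMSELoss(model)` on a `DataLoader` of such examples.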

MultipleNegativesSymmetricRankingLoss
MultipleNegativesRankingLoss computes the loss in just one direction: find the correct answer for a given question.

[MultipleNegativesSymmetricRankingLoss](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesSymmetricRankingLoss.py) also computes the loss in the other direction: find the correct question for a given answer.
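
A minimal training sketch, assuming (question, answer) pairs; the model name and data below are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
train_examples = [
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
    InputExample(texts=["How many moons does Mars have?", "Mars has two moons, Phobos and Deimos."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Scores both directions: question -> answer and answer -> question
train_loss = losses.MultipleNegativesSymmetricRankingLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```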

Breaking Change: CLIPModel
The `CLIPModel` is now based on the CLIP implementation from the `transformers` library.

You can still load it like this:
```python
model = SentenceTransformer('clip-ViT-B-32')
```


Older sentence-transformers versions are no longer able to load and use the 'clip-ViT-B-32' model.


Added files on the hub are automatically downloaded
PR #1116 checks whether you have all files in your local cache or whether additional files have been added on the hub. If new files are found, they are downloaded automatically.

`SentenceTransformer.encode()` can return all values

When you set `output_value=None` for the `encode` method, all values (token_ids, token_embeddings, sentence_embedding) will be returned.
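
A minimal sketch (the model name and sentence are illustrative placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # placeholder model
# convert_to_numpy=False keeps the raw tensors, as the result is a dictionary rather than a single embedding
output = model.encode("Sentence embeddings are easy to compute.", output_value=None, convert_to_numpy=False)

# output holds all computed values for the sentence,
# e.g. the token ids, the token embeddings and the pooled sentence embedding
print(output.keys())
print(output["sentence_embedding"].shape)
```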

2.0.0

Not secure
Models hosted on the hub
All pre-trained models are now hosted on the [Huggingface Models hub](https://huggingface.co/models).

Our pre-trained models can be found here: [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)

You can also easily share your own sentence-transformers model on the hub so that other people can access it. Simply upload the folder and let people load it via:

```python
model = SentenceTransformer('[your_username]/[model_name]')
```


For more information, see: [Sentence Transformers in the Hugging Face Hub](https://huggingface.co/blog/sentence-transformers-in-the-hub)

Breaking changes

There should be no breaking changes. Old models can still be loaded from disk. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence-transformers, as the cache path has changed slightly.

Find sentence-transformer models on the Hub

You can filter the hub for sentence-transformers models: [https://huggingface.co/models?filter=sentence-transformers](https://huggingface.co/models?filter=sentence-transformers)

Add the `sentence-transformers` tag to your model card so that others can find your model.

Widget & Inference API
A widget was added to sentence-transformers models on the hub that lets you interact with the model directly on its page, for example:
https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

Further, models can now be used with the [Accelerated Inference API](https://api-inference.huggingface.co/docs/python/html/index.html): send your sentences to the API and get back the embeddings from the respective model.

Save Model to Hub

A new method was added to the `SentenceTransformer` class: `save_to_hub`.

Provide the model name and the model is saved on the hub.

Here you can find the explanation from transformers of how the hub works: [Model sharing and uploading](https://huggingface.co/transformers/model_sharing.html)
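
A minimal sketch (the model and repository names are placeholders; you need to be logged in to the Hugging Face Hub, e.g. via `huggingface-cli login`):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # placeholder model
# Uploads the model (including the automatically generated model card) to the hub
model.save_to_hub("my-new-model")
```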

Automatic Model Card

When you save a model with `save` or `save_to_hub`, a `README.md` (also known as model card) is automatically generated with basic information about the respective SentenceTransformer model.


New Models
- Several new sentence embedding models have been added, which are much better than the previous models: [Sentence Embedding Models](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models)
- Some new models for semantic search based on MS MARCO have been added: [MSMARCO Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)
- The training script for these MS MARCO models has been released as well: [Train MS MARCO Bi-Encoder v3](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder-v3.py)

2a. You can quantize directly while encoding, via the `precision` argument of `encode`:

```python
binary_embeddings = model.encode(
    ["I am driving to the lake.", "It is a beautiful day."],
    precision="binary",
)
```

2b. Or you can encode first and quantize the embeddings afterwards with `quantize_embeddings`:

```python
from sentence_transformers.quantization import quantize_embeddings

embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
```

References:
* [SentenceTransformer.encode](https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
* [quantize_embeddings](https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.quantize_embeddings)

GISTEmbedLoss

GISTEmbedLoss, as introduced in [Solatorio (2024)](https://arxiv.org/pdf/2402.16829.pdf), is a guided variant of the more standard in-batch negatives (`MultipleNegativesRankingLoss`) loss. Both loss functions are provided with a list of (anchor, positive) pairs, but while `MultipleNegativesRankingLoss` uses `anchor_i` and `positive_i` as positive pair and all `positive_j` with `i != j` as negative pairs, `GISTEmbedLoss` uses a second model to guide the in-batch negative sample selection.

This can be very useful, because it is plausible that `anchor_i` and `positive_j` are actually quite semantically similar. In this case, `GISTEmbedLoss` would not consider them a negative pair, while `MultipleNegativesRankingLoss` would. When finetuning MPNet-base on the AllNLI dataset, these are the Spearman correlations based on cosine similarity on the STS Benchmark development set (higher is better):

![Spearman correlation on the STS Benchmark dev set during training: MultipleNegativesRankingLoss vs. GISTEmbedLoss](https://github.com/UKPLab/sentence-transformers/assets/37621491/ae99e809-4cc9-4db3-8b00-94cc74d2fe3b)
The blue line is `MultipleNegativesRankingLoss`, whereas the grey line is `GISTEmbedLoss` with the small `all-MiniLM-L6-v2` as the guide model. Note that `all-MiniLM-L6-v2` by itself does not reach 88 Spearman correlation on this dataset, so this is really the effect of two models (`mpnet-base` and `all-MiniLM-L6-v2`) reaching a performance that they could not reach separately.
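
A minimal training sketch under these assumptions (the toy pairs below are placeholders; the model/guide combination mirrors the experiment described above):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("microsoft/mpnet-base")  # model being finetuned
guide = SentenceTransformer("all-MiniLM-L6-v2")      # small guide model for negative selection

train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a piece of bread."]),
    InputExample(texts=["The girl is carrying a baby.", "A woman holds an infant."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives are used, but pairs that the guide model considers too similar
# to the anchor are not treated as negatives
train_loss = losses.GISTEmbedLoss(model=model, guide=guide)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```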

Soft `save_to_hub` Deprecation
Most codebases that allow for pushing models to the [Hugging Face Hub](https://huggingface.co/) adopt a `push_to_hub` method instead of a `save_to_hub` method, and now Sentence Transformers will follow that convention. The [`push_to_hub`](https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.push_to_hub) method will now be the recommended approach, although `save_to_hub` will continue to exist for the time being: it will simply call `push_to_hub` internally.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

...

# Train the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
)

# Push the model to Hugging Face
model.push_to_hub("tomaarsen/mpnet-base-nli-stsb")
```


All changes
* Add GISTEmbedLoss by avsolatorio in https://github.com/UKPLab/sentence-transformers/pull/2535
* [`feat`] Add 'get_config_dict' method to GISTEmbedLoss for better model cards by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2543
* Enable saving modules as pytorch_model.bin by CKeibel in https://github.com/UKPLab/sentence-transformers/pull/2542
* [`deprecation`] Deprecate `save_to_hub` in favor of `push_to_hub`; add safe_serialization support to `push_to_hub` by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2544
* Fix SentenceTransformer encode documentation return type default (numpy vectors) by CKeibel in https://github.com/UKPLab/sentence-transformers/pull/2546
* [`docs`] Update return docstring of encode_multi_process by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2548
* [`feat`] Add binary & scalar embedding quantization support to Sentence Transformers by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2549

New Contributors
* avsolatorio made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2535
* CKeibel made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2542

**Full Changelog**: https://github.com/UKPLab/sentence-transformers/compare/v2.5.1...v2.6.0

