Sentence-transformers

Latest version: v3.3.1


```python
>>> model.similarity(embeddings, embeddings)
tensor([[1.0000, 0.7235, 0.0290, 0.1309],
        [0.7235, 1.0000, 0.0613, 0.1129],
        [0.0290, 0.0613, 1.0000, 0.5027],
        [0.1309, 0.1129, 0.5027, 1.0000]])
>>> model.similarity_fn_name
"cosine"
>>> model.similarity_fn_name = "euclidean"
>>> model.similarity(embeddings, embeddings)
tensor([...])
```
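
For context, a minimal sketch of how such a similarity matrix could be produced; the model name and sentences here are illustrative, not from the original example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
sentences = [
    "The weather is lovely today.",
    "It is sunny outside.",
    "He drove to the stadium.",
    "She took the bus downtown.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)  # 4x4 similarity tensor
```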

Additionally, model authors can take advantage of keyword argument passthrough. By updating the `modules.json` file to include a list of `kwargs`, e.g.:
```json
[
    {
        "idx": 0,
        "name": "0",
        "path": "",
        "type": "custom_transformer.CustomTransformer",
        "kwargs": ["task_type"]
    },
    ...
]
```

then if a user provides the `task_type` keyword argument in `model.encode`, this value will be propagated to the `forward` of the custom module(s). This way, users can specify custom functionality on the fly at inference time (as well as at load time via the `model_kwargs` option when initializing a `SentenceTransformer` model).
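
As an illustration, here is a minimal sketch of what such a custom module might look like; the `CustomTransformer` class, the `"query"` task type, and the branching logic are hypothetical, not part of the library:

```python
# custom_transformer.py -- hypothetical custom module (sketch only)
from sentence_transformers.models import Transformer


class CustomTransformer(Transformer):
    def forward(self, features, task_type=None):
        # `task_type` arrives here because it is listed under "kwargs" in
        # modules.json and the user passed it to `model.encode(...)`.
        if task_type == "query":
            # Hypothetical branch: adjust behavior for query inputs here.
            pass
        return super().forward(features)
```

A user would then call, e.g., `model.encode(["some text"], task_type="query")` to route that value into the module's `forward`.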

Update dependency versions (#2757)
* Restrict `numpy<2.0.0` due to issues with `torch` and `numpy` interoperability on Windows.
* Increment the minimum `transformers` version to 4.38.0 and `huggingface-hub` to 0.19.3 to prevent a training crash related to the `prefetch_factor` option.

Smaller Highlights
Features
* Add `show_progress_bar` to [`encode_multi_process`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode_multi_process) (#2762)
* Add `revision` to [`push_to_hub`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.push_to_hub) (#2902)
* Add `cache_dir` and `config_args` to `CrossEncoder` (#2784)
* Warn users if they might be passing training/evaluation columns in the wrong order, leading to worse training performance (#2928)

Bug fixes
* Prevent a crash when encoding an empty list (#2759)
* Support training with `GISTEmbedLoss` with DataParallel (DP) and DistributedDataParallel (DDP) (#2772)
* Fix a bug in `GroupByLabelBatchSampler` that resulted in some data not being used in training (#2788)
* Prevent a crash if a `datasets` directory exists locally (#2859)
* Fix `Matryoshka2dLoss` not importing correctly (#2907)
* Resolve a niche training bug that occurred when combining multi-dataset training, the no-duplicates batch sampler, and `dataloader_drop_last=True` (#2877)
* Fix `torch_compile=True` not working in `SentenceTransformerTrainingArguments`: it should now work, enabling faster training (#2884)
* Fix `SoftmaxLoss` performing worse since v3.0, as a Linear layer was ignored by the optimizer (#2881)
* Fix `trainer.train(resume_from_checkpoint="...")` with custom models (i.e. `trust_remote_code`) (#2918)
* Fix the evaluation erroneously using the training batch size (#2847)
* Fix encoding when passing `model_kwargs={"torch_dtype": torch.float16}` with models that use Dense layers (#2889)

Documentation
* New [documentation for batch samplers](https://sbert.net/docs/package_reference/sentence_transformer/sampler.html) (#2921, various PRs by fpgmaas)
* New [documentation for custom modules and model structure](https://sbert.net/docs/sentence_transformer/usage/custom_models.html) (#2773)

All changes
* [Typing] make device optional by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2731
* [Spelling] Docs by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2733
* [Spelling] Codespell readme by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2736
* [Spelling] update examples by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2734
* [`versions`] Increment transformers/hf-hub versions to prevent training crash by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2757
* Typo fixed in examples/training/sts/training_stsbenchmark.py by akkefa in https://github.com/UKPLab/sentence-transformers/pull/2743
* spelling: code comment updates by michaelfeil in https://github.com/UKPLab/sentence-transformers/pull/2735
* Update DenoisingAutoEncoderDataset.py by sophia8844 in https://github.com/UKPLab/sentence-transformers/pull/2747
* [`fix`] Prevent crash when encoding empty list by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2759
* Fix syntax warning (issue 2687) by wyattscarpenter in https://github.com/UKPLab/sentence-transformers/pull/2765
* [`feat`] Add show_progress_bar to encode_multi_process by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2762
* Typing overload by janrito in https://github.com/UKPLab/sentence-transformers/pull/2763
* [`fix`] Fix retokenization on DDP/DP with GIST losses by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2775
* Cast predict scores to float before converting to numpy by malteos in https://github.com/UKPLab/sentence-transformers/pull/2783
* Elasticsearch example: simplify setup by maxjakob in https://github.com/UKPLab/sentence-transformers/pull/2778
* [chore] Enable ruff rules `Warning (W)` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2789
* [fix] Add tests for 3.12 in cicd by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2785
* Allow inheriting the Transformer class by mokha in https://github.com/UKPLab/sentence-transformers/pull/2810
* [`feat`] Add hard negatives mining utility by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2768
* [chore] add test for NoDuplicatesBatchSampler by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2795
* [chore] Add test for RoundrobinBatchSampler by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2798
* [feat] Improve GroupByLabelBatchSampler by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2788
* [`chore`] Clean-up `.gitignore` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2799
* [chore] improve the use of ruff and pre-commit hooks by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2793
* [feat] Move from `setup.py` and `setup.cfg` to `pyproject.toml` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2786
* [chore] Add `pytest-cov` and add test coverage command to the Makefile by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2794
* Move `pytest` config to `pyproject.toml` and remove `pytest.ini` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2819
* [`fix`] Fix packages discovery in `pyproject.toml` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2825
* Fix `ruff` pre-commit hook. by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2826
* [`chore`] Enable `isort` with `ruff` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2828
* [`chore`] Enable ruff rules `UP006` and `UP007` to improve type hints. by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2830
* [`chore`] Enable ruff's pyupgrade (`UP`) ruleset by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2834
* update SoftmaxLoss arguments by KiLJ4EdeN in https://github.com/UKPLab/sentence-transformers/pull/2894
* [feat] Added revision to push_to_hub argument. by pesuchin in https://github.com/UKPLab/sentence-transformers/pull/2902
* Perform additional check for owner string in `is_<library>_available` functions by leblancfg in https://github.com/UKPLab/sentence-transformers/pull/2859
* [`style`] Replace Huggingface with Hugging Face by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2905
* Fix typo: "comuptation" -> "computation" by jeffwidman in https://github.com/UKPLab/sentence-transformers/pull/2909
* [`ci`] Attempt to fix CI disk space issues by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2906
* [`docs`] Fix typo and broken links in documentation by ZiyiXia in https://github.com/UKPLab/sentence-transformers/pull/2861
* Add MNSRL with GradCache by madhavthaker1 in https://github.com/UKPLab/sentence-transformers/pull/2879
* Fix 'module object is not callable' error in Matryoshka2dLoss by pesuchin in https://github.com/UKPLab/sentence-transformers/pull/2907
* [`chore`] Add unittests for `InformationRetrievalEvaluator` by fpgmaas in https://github.com/UKPLab/sentence-transformers/pull/2838
* [`fix`] Safely continue if ProportionalBatchSampler sub-batch sampler throws StopIteration by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2877
* [`fix`] Fix `torch_compile=True` by always inserting a wrapped model into the loss by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2884
* [`fix`] Fix SoftmaxLoss by initializing the optimizer over the loss(es) rather than the model by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2881
* [`fix`] Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. `trust_remote_code`) by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2918
* [`docs`] Heavily extend sampler documentation by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2921
* [`feat`] Add support for streaming datasets by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2792
* [`fix`] Change eval dataloader to use eval_batch_size by akashd-2 in https://github.com/UKPLab/sentence-transformers/pull/2847
* [`feat`] Add cache_dir support to CrossEncoder by RoyBA in https://github.com/UKPLab/sentence-transformers/pull/2784
* [`deprecation`] Push deprecation cycle for `use_auth_token` to v4 by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2926
* [`security`] Load weights only with torch.load & pytorch_model.bin by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2927
* [`feat`] Allow loading custom modules; encode kwargs passthrough to modules by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2773
* [`fix`] Add dtype cast for modules other than Transformer by ir2718 in https://github.com/UKPLab/sentence-transformers/pull/2889
* [`docs`] Move losses up in the package reference; they're more important by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2929
* [`feat`] Add column order warnings to the data collator by tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2928

New Contributors
* akkefa made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2743
* sophia8844 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2747
* wyattscarpenter made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2765
* janrito made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2763
* malteos made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2783
* fpgmaas made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2789
* KiLJ4EdeN made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2894
* pesuchin made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2902
* leblancfg made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2859
* jeffwidman made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2909
* ZiyiXia made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2861
* madhavthaker1 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2879
* akashd-2 made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2847
* RoyBA made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/2784

Big thanks to fpgmaas for the large number of valuable contributions surrounding tests, CI, config files, and overall project health.

**Full Changelog**: https://github.com/UKPLab/sentence-transformers/compare/v3.0.1...v3.1.0

0.4.1

**Refactored Tokenization**
- Faster tokenization speed: training and inference now use batched tokenization, so all sentences in a batch are tokenized simultaneously.
- Using `SentencesDataset` is no longer needed for training. You can pass your training examples directly to the DataLoader:
```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```

- If you use a custom torch `Dataset` class: the dataset class must now return `InputExample` objects instead of tokenized texts.
- The `SentenceLabelDataset` class has been updated to the new tokenization flow: it now always returns two or more `InputExample`s with the same label.

**Asymmetric Models**
Added a new `models.Asym` class that allows different encoding of sentences based on a tag (e.g. *query* vs. *paragraph*). Minimal example:

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

base_model = 'bert-base-uncased'  # any Hugging Face transformer model
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
```

Your input examples have to look like this:
```python
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
```

Encoding (note: mixed inputs are not allowed):
```python
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
```


Inputs with the key 'QRY' will be passed through the `d1` dense layer, while inputs with the key 'DOC' will be passed through the `d2` dense layer.
More documentation on how to design asymmetric models will follow soon.


**New Namespace & Models for Cross-Encoder**
Cross-Encoders are now hosted at [https://huggingface.co/cross-encoder](https://huggingface.co/cross-encoder). Also, new [pre-trained models](https://www.sbert.net/docs/pretrained_cross-encoders.html) have been added for NLI & QNLI.

**Logging**
Log messages now use a custom logger from `logging`, thanks to PR #623. This allows you to configure which log messages you want to see from which components.
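
For example, a minimal sketch of how you might tune this with Python's standard `logging` module, assuming the package's loggers follow its module names:

```python
import logging

logging.basicConfig(level=logging.INFO)
# Only show WARNING and above from sentence-transformers components,
# without affecting the rest of the application's logging.
logging.getLogger("sentence_transformers").setLevel(logging.WARNING)
```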

**Unit tests**
A lot more unit tests have been added, which test the different components of the framework.

0.4.0

- Updated the dependencies so that the library works with Hugging Face Transformers version 4. Sentence-Transformers still works with transformers version 3, but upgrading to version 4 is recommended; future changes might break compatibility with version 3.
- New naming scheme for pre-trained models: models are now named `{task}-{transformer_model}`, so 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models remain available under their old names, but newer models will follow the updated naming scheme.
- New application example for [information retrieval and question answering retrieval](https://www.sbert.net/examples/applications/information-retrieval/README.html), together with respective pre-trained models.

0.3.9

This release only includes some smaller updates:
- The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1.
- As some parts and models require PyTorch >= 1.6.0, the requirement was raised to at least PyTorch 1.6.0. Most of the code and models will still work with older PyTorch versions.
- `model.encode()` previously kept the embeddings on the GPU, which required a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to the CPU once they are computed.
- The `CrossEncoder` class now accepts a `max_length` parameter to control the truncation of inputs.
- The `CrossEncoder.predict` method now has an `apply_softmax` parameter that applies softmax on top of a multi-class output.
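
Taken together, a minimal sketch of these two options; the model name here is illustrative:

```python
from sentence_transformers import CrossEncoder

# Truncate all inputs to at most 512 tokens.
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=512)

# For a multi-class model (e.g. NLI), apply_softmax=True turns the
# per-class logits into probabilities.
scores = model.predict(
    [('A man is eating food.', 'A man is eating a meal.')],
    apply_softmax=True,
)
```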
