KeyBERT

Latest version: v0.8.4


0.8.0

**Highlights**

* Use `keybert.KeyLLM` to leverage LLMs for extracting keywords 🔥
* Use it either with or without candidate keywords generated through KeyBERT
* Efficient implementation by calculating embeddings and generating keywords for a subset of the documents
* Multiple LLMs are integrated: [OpenAI](https://maartengr.github.io/KeyBERT/guides/llms.html#openai), [Cohere](https://maartengr.github.io/KeyBERT/guides/llms.html#cohere), [LangChain](https://maartengr.github.io/KeyBERT/guides/llms.html#langchain), [🤗 Transformers](https://maartengr.github.io/KeyBERT/guides/llms.html#hugging-face-transformers), and [LiteLLM](https://maartengr.github.io/KeyBERT/guides/llms.html#litellm)

1. Create Keywords with [KeyLLM](https://maartengr.github.io/KeyBERT/guides/keyllm.html#1-create-keywords-with-keyllm)

A minimal method for keyword extraction with Large Language Models (LLM). There are a number of implementations that allow you to mix and match KeyBERT with KeyLLM. You could also choose to use KeyLLM without KeyBERT.

![keyllm](https://github.com/MaartenGr/KeyBERT/assets/25746895/22b443c4-cfba-4b34-a123-0392cb0f4479)


```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents)
```


2. Efficient [KeyLLM](https://maartengr.github.io/KeyBERT/guides/keyllm.html#4-efficient-keyllm)
If you have embeddings of your documents, you could use those to find documents that are most similar to one another. Those documents could then all receive the same keywords and only one of these documents will need to be passed to the LLM. This can make computation much faster as only a subset of documents will need to receive keywords.

![efficient](https://github.com/MaartenGr/KeyBERT/assets/25746895/ce834bd2-08c9-46cb-9b6e-faaabe2dcdb8)

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents, convert_to_tensor=True)

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.75)
```


3. Efficient [KeyLLM + KeyBERT](https://maartengr.github.io/KeyBERT/guides/keyllm.html#5-efficient-keyllm-keybert)
This is the best of both worlds. We use KeyBERT to generate a first pass of keywords and embeddings and give those to KeyLLM for a final pass. Again, the most similar documents will be clustered and they will all receive the same keywords. You can change this behavior with the threshold. A higher value will reduce the number of documents that are clustered and a lower value will increase the number of documents that are clustered.

![keybert_keyllm](https://github.com/MaartenGr/KeyBERT/assets/25746895/0b605de2-fb72-4c8d-bc96-175d59881acd)

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM, KeyBERT

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyBERT(llm=llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents)
```


See [here](https://maartengr.github.io/KeyBERT/guides/keyllm.html) for full documentation on use cases of `KeyLLM` and [here](https://maartengr.github.io/KeyBERT/guides/llms.html) for the implemented Large Language Models.

**Fixes**

* Enable Guided KeyBERT for seed keywords differing among docs by [shengbo-ma](https://github.com/shengbo-ma) in [#152](https://github.com/MaartenGr/KeyBERT/pull/152)

0.7.0

**Highlights**

* Cleaned up [documentation](https://maartengr.github.io/KeyBERT/guides/quickstart.html) and added several visual representations of the algorithm (excluding MMR / MaxSum)
* Added [functions](https://maartengr.github.io/KeyBERT/guides/quickstart.html#prepare-embeddings) to extract and pass word- and document embeddings which should make fine-tuning much faster

```python
from keybert import KeyBERT

kw_model = KeyBERT()

# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```


Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.

**Fixes**

* Redundant documentation was removed by [mabhay3420](https://github.com/mabhay3420) in [#123](https://github.com/MaartenGr/KeyBERT/pull/123)
* Fixed Gensim backend not working after v4 migration ([71](https://github.com/MaartenGr/KeyBERT/issues/71))
* Fixed `candidates` not working ([122](https://github.com/MaartenGr/KeyBERT/issues/122))

0.6.0

**Highlights**

* Major speedup, up to 2x to 5x when passing multiple documents (for MMR and MaxSum) compared to single documents
* Same results whether passing a single document or multiple documents
* MMR and MaxSum now work when passing a single document or multiple documents
* Improved documentation
* Added 🤗 Hugging Face Transformers

```python
from keybert import KeyBERT
from transformers.pipelines import pipeline

hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
```


* Highlighting support for Chinese texts
* Now uses the `CountVectorizer` for creating the tokens
* This should also improve the highlighting for most applications and higher n-grams

![image](https://user-images.githubusercontent.com/25746895/179488649-3c66403c-9620-4e12-a7a8-c2fab26b18fc.png)

**NOTE**: Although highlighting for Chinese texts is improved, since I am not familiar with the Chinese language there is a good chance it is not yet as optimized as for other languages. Any feedback with respect to this is highly appreciated!

**Fixes**

* Fix typo in ReadMe by [priyanshul-govil](https://github.com/priyanshul-govil) in [#117](https://github.com/MaartenGr/KeyBERT/pull/117)
* Add missing optional dependencies (gensim, use, and spacy) by [yusuke1997](https://github.com/yusuke1997) in [#114](https://github.com/MaartenGr/KeyBERT/pull/114)

0.5.1

* Added a [page](https://maartengr.github.io/KeyBERT/guides/countvectorizer.html) about leveraging `CountVectorizer` and `KeyphraseVectorizers`
* Shoutout to [TimSchopf](https://github.com/TimSchopf) for creating and optimizing the package!
* The `KeyphraseVectorizers` package can be found [here](https://github.com/TimSchopf/KeyphraseVectorizers)
* Fixed Max Sum Similarity returning incorrect similarities [92](https://github.com/MaartenGr/KeyBERT/issues/92)
* Thanks to [kunihik0](https://github.com/kunihik0) for the PR!
* Fixed out of bounds condition in MMR
* Thanks to [artmatsak](https://github.com/artmatsak) for the PR!
* Started styling with Flake8 and Black (which was long overdue)
* Added pre-commit to make following through a bit easier with styling

0.5.0

**Highlights**:

* Added Guided KeyBERT
* `kw_model.extract_keywords(doc, seed_keywords=seed_keywords)`
* Thanks to [zolekode](https://github.com/zolekode) for the inspiration!
* Use the newest all-* models from SBERT

**Miscellaneous**:

* Added instructions in the FAQ to extract keywords from Chinese documents

0.4.0

**Features**
* Use `paraphrase-MiniLM-L6-v2` as the default (great results!)
* Highlight the document with keywords:
* `keywords = kw_model.extract_keywords(doc, highlight=True)`

**Miscellaneous**
* Update Flair dependencies
* Added FAQ

![highlight](https://user-images.githubusercontent.com/25746895/123934835-f7751f00-d993-11eb-8d5b-01e759e388ae.png)
