BERTopic

Latest version: v0.16.2


0.16.2

*Release date: 12 May, 2024*

<h3><b>Fixes:</b></h3>

* Fix issue with zeroshot topic modeling missing outlier [1957](https://github.com/MaartenGr/BERTopic/issues/1957)
* Bump github actions versions by [afuetterer](https://github.com/afuetterer) in [#1941](https://github.com/MaartenGr/BERTopic/pull/1941)
* Drop support for python 3.7 by [afuetterer](https://github.com/afuetterer) in [#1949](https://github.com/MaartenGr/BERTopic/pull/1949)
* Add testing python 3.10+ in Github actions by [afuetterer](https://github.com/afuetterer) in [#1968](https://github.com/MaartenGr/BERTopic/pull/1968)
* Speed up fitting CountVectorizer by [dannywhuang](https://github.com/dannywhuang) in [#1938](https://github.com/MaartenGr/BERTopic/pull/1938)
* Fix `transform` when using cuML HDBSCAN by [beckernick](https://github.com/beckernick) in [#1960](https://github.com/MaartenGr/BERTopic/pull/1960)
* Fix wrong link in algorithm documentation by [naeyn](https://github.com/naeyn) in [#1970](https://github.com/MaartenGr/BERTopic/pull/1970)

0.16.1

*Release date: 21 April, 2024*

<h3><b>Highlights:</b></h3>

* Add Quantized [LLM Tutorial](https://colab.research.google.com/drive/1DdSHvVPJA3rmNfBWjCo2P1E9686xfxFx?usp=sharing)
* Add optional [datamapplot](https://github.com/TutteInstitute/datamapplot) visualization using `topic_model.visualize_document_datamap` by [lmcinnes](https://github.com/lmcinnes) in [#1750](https://github.com/MaartenGr/BERTopic/pull/1750)
* Migrated OpenAIBackend to openai>=1 by [peguerosdc](https://github.com/peguerosdc) in [#1724](https://github.com/MaartenGr/BERTopic/pull/1724)
* Add automatic height scaling and font resize by [ir2718](https://github.com/ir2718) in [#1863](https://github.com/MaartenGr/BERTopic/pull/1863)
* Use `[KEYWORDS]` tags with the LangChain representation model by [mcantimmy](https://github.com/mcantimmy) in [#1871](https://github.com/MaartenGr/BERTopic/pull/1871)


<h3><b>Fixes:</b></h3>

* Fixed issue with `.merge_models` seemingly skipping topic [1898](https://github.com/MaartenGr/BERTopic/issues/1898)
* Fixed Cohere client.embed TypeError [1904](https://github.com/MaartenGr/BERTopic/issues/1904)
* Fixed `AttributeError: 'TextGeneration' object has no attribute 'random_state'` [1870](https://github.com/MaartenGr/BERTopic/issues/1870)
* Fixed topic embeddings not properly updated if all outliers were removed [1838](https://github.com/MaartenGr/BERTopic/issues/1838)
* Fixed issue with representation models not properly merging [1762](https://github.com/MaartenGr/BERTopic/issues/1762)
* Fixed Embeddings not ordered correctly when using `.merge_models` [1804](https://github.com/MaartenGr/BERTopic/issues/1804)
* Fixed Outlier topic not in the 0th position when using zero-shot topic modeling causing prediction issues (amongst others) [1804](https://github.com/MaartenGr/BERTopic/issues/1804)
* Fixed Incorrect label in ZeroShot doc SVG [1732](https://github.com/MaartenGr/BERTopic/issues/1732)
* Fixed MultiModalBackend throws error with clip-ViT-B-32-multilingual-v1 [1670](https://github.com/MaartenGr/BERTopic/issues/1670)
* Fixed AuthenticationError while using OpenAI() [1678](https://github.com/MaartenGr/BERTopic/issues/1678)

* Update FAQ on Apple Silicon by [benz0li](https://github.com/benz0li) in [#1901](https://github.com/MaartenGr/BERTopic/pull/1901)
* Add documentation DataMapPlot + FAQ for running on Apple Silicon by [dkapitan](https://github.com/dkapitan) in [#1854](https://github.com/MaartenGr/BERTopic/pull/1854)
* Remove commas from pip install reference in readme by [luisoala](https://github.com/luisoala) in [#1850](https://github.com/MaartenGr/BERTopic/pull/1850)
* Spelling corrections by [joouha](https://github.com/joouha) in [#1801](https://github.com/MaartenGr/BERTopic/pull/1801)
* Replacing the deprecated `text-ada-001` model with the latest `text-embedding-3-small` from OpenAI by [atmb4u](https://github.com/atmb4u) in [#1800](https://github.com/MaartenGr/BERTopic/pull/1800)
* Prevent invalid empty input error when retrieving embeddings with openai backend by [liaoelton](https://github.com/liaoelton) in [#1827](https://github.com/MaartenGr/BERTopic/pull/1827)
* Remove spurious warning about missing embedding model by [sliedes](https://github.com/sliedes) in [#1774](https://github.com/MaartenGr/BERTopic/pull/1774)
* Fix type hint in ClassTfidfTransformer constructor by [snape](https://github.com/snape) in [#1803](https://github.com/MaartenGr/BERTopic/pull/1803)
* Fix typo and simplify wording in OnlineCountVectorizer docstring by [chrisji](https://github.com/chrisji) in [#1802](https://github.com/MaartenGr/BERTopic/pull/1802)
* Fixed warning when saving a topic model without an embedding model by [zilch42](https://github.com/zilch42) in [#1740](https://github.com/MaartenGr/BERTopic/pull/1740)
* Fix bug in `TextGeneration` by [manveersadhal](https://github.com/manveersadhal) in [#1726](https://github.com/MaartenGr/BERTopic/pull/1726)
* Fix an incorrect link to usecases.md by [nicholsonjf](https://github.com/nicholsonjf) in [#1731](https://github.com/MaartenGr/BERTopic/pull/1731)
* Prevent `model` argument being passed twice when using `generator_kwargs` in OpenAI by [ninavandiermen](https://github.com/ninavandiermen) in [#1733](https://github.com/MaartenGr/BERTopic/pull/1733)
* Several fixes to the docstrings by [arpadikuma](https://github.com/arpadikuma) in [#1719](https://github.com/MaartenGr/BERTopic/pull/1719)
* Remove unused `cluster_df` variable in `hierarchical_topics` by [shadiakiki1986](https://github.com/shadiakiki1986) in [#1701](https://github.com/MaartenGr/BERTopic/pull/1701)
* Removed redundant quotation mark by [LawrenceFulton](https://github.com/LawrenceFulton) in [#1695](https://github.com/MaartenGr/BERTopic/pull/1695)
* Fix typo in merge models docs by [zilch42](https://github.com/zilch42) in [#1660](https://github.com/MaartenGr/BERTopic/pull/1660)

0.16.0

*Release date: 26 November, 2023*

<h3><b>Highlights:</b></h3>

* Merge pre-trained BERTopic models with [**`.merge_models`**](https://maartengr.github.io/BERTopic/getting_started/merge/merge.html)
    * Combine models with different representations together!
    * Use this for *incremental/online topic modeling* to detect new incoming topics
    * First step towards *federated learning* with BERTopic
* [**Zero-shot**](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) Topic Modeling
    * Use a predefined list of topics to assign documents
    * If needed, allows for further exploration of undefined topics
* [**Seed (domain-specific) words**](https://maartengr.github.io/BERTopic/getting_started/seed_words/seed_words.html) with `ClassTfidfTransformer`
    * Make sure selected words are more likely to end up in the representation without influencing the clustering process
* Added params to [**truncate documents**](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#truncating-documents) to length when using LLMs
* Added [**LlamaCPP**](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#llamacpp) as a representation model
* LangChain: Support for **LCEL Runnables** by [joshuasundance-swca](https://github.com/joshuasundance-swca) in [#1586](https://github.com/MaartenGr/BERTopic/pull/1586)
* Added `topics` parameter to `.topics_over_time` to select a subset of documents and topics
* Documentation:
    * [Best practices Guide](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html)
    * [Llama 2 Tutorial](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#llama-2)
    * [Zephyr Tutorial](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#zephyr-mistral-7b)
    * Improved [embeddings guidance](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#sentence-transformers) (MTEB)
* Improved logging throughout the package
* Added support for **Cohere's Embed v3**:

```python
import cohere
from bertopic.backend import CohereBackend

# Create a Cohere client and wrap it in BERTopic's Cohere backend
client = cohere.Client("MY_API_KEY")
cohere_model = CohereBackend(
    client,
    embedding_model="embed-english-v3.0",
    embed_kwargs={"input_type": "clustering"}
)
```
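
The resulting backend can then be passed to BERTopic like any other embedding model. A minimal sketch, assuming the `cohere_model` defined above:

```python
from bertopic import BERTopic

topic_model = BERTopic(embedding_model=cohere_model)
```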


<h3><b>Fixes:</b></h3>

* Fixed n-gram Keywords need delimiting in OpenAI() [1546](https://github.com/MaartenGr/BERTopic/issues/1546)
* Fixed OpenAI v1.0 issues [1629](https://github.com/MaartenGr/BERTopic/issues/1629)
* Improved documentation/logging to address [1589](https://github.com/MaartenGr/BERTopic/issues/1589), [#1591](https://github.com/MaartenGr/BERTopic/issues/1591)
* Fixed engine support for Azure OpenAI embeddings [1577](https://github.com/MaartenGr/BERTopic/issues/1577)
* Fixed OpenAI Representation: KeyError: 'content' [1570](https://github.com/MaartenGr/BERTopic/issues/1570)
* Fixed Loading topic model with multiple topic aspects changes their format [1487](https://github.com/MaartenGr/BERTopic/issues/1487)
* Fix expired link in algorithm.md by [burugaria7](https://github.com/burugaria7) in [#1396](https://github.com/MaartenGr/BERTopic/pull/1396)
* Fix guided topic modeling in cuML's UMAP by [stevetracvc](https://github.com/stevetracvc) in [#1326](https://github.com/MaartenGr/BERTopic/pull/1326)
* OpenAI: Allow retrying on Service Unavailable errors by [agamble](https://github.com/agamble) in [#1407](https://github.com/MaartenGr/BERTopic/pull/1407)
* Fixed parameter naming for HDBSCAN in best practices by [rnckp](https://github.com/rnckp) in [#1408](https://github.com/MaartenGr/BERTopic/pull/1408)
* Fixed typo in tips_and_tricks.md by [aronnoordhoek](https://github.com/aronnoordhoek) in [#1446](https://github.com/MaartenGr/BERTopic/pull/1446)
* Fix typos in documentation by [bobchien](https://github.com/bobchien) in [#1481](https://github.com/MaartenGr/BERTopic/pull/1481)
* Fix IndexError when all outliers are removed by reduce_outliers by [Aratako](https://github.com/Aratako) in [#1466](https://github.com/MaartenGr/BERTopic/pull/1466)
* Fix TypeError on reduce_outliers "probabilities" by [ananaphasia](https://github.com/ananaphasia) in [#1501](https://github.com/MaartenGr/BERTopic/pull/1501)
* Add new line to fix markdown bullet point formatting by [saeedesmaili](https://github.com/saeedesmaili) in [#1519](https://github.com/MaartenGr/BERTopic/pull/1519)
* Update typo in topicrepresentation.md by [oliviercaron](https://github.com/oliviercaron) in [#1537](https://github.com/MaartenGr/BERTopic/pull/1537)
* Fix typo in FAQ by [sandijou](https://github.com/sandijou) in [#1542](https://github.com/MaartenGr/BERTopic/pull/1542)
* Fixed typos in best practices documentation by [poomkusa](https://github.com/poomkusa) in [#1557](https://github.com/MaartenGr/BERTopic/pull/1557)
* Correct TopicMapper doc example by [chrisji](https://github.com/chrisji) in [#1637](https://github.com/MaartenGr/BERTopic/pull/1637)
* Fix typing in hierarchical_topics by [dschwalm](https://github.com/dschwalm) in [#1364](https://github.com/MaartenGr/BERTopic/pull/1364)
* Fixed typing issue with treshold parameter in reduce_outliers by [dschwalm](https://github.com/dschwalm) in [#1380](https://github.com/MaartenGr/BERTopic/pull/1380)
* Fix several typos by [mertyyanik](https://github.com/mertyyanik) in [#1307](https://github.com/MaartenGr/BERTopic/pull/1307)
* Fix inconsistent naming by [rolanderdei](https://github.com/rolanderdei) in [#1073](https://github.com/MaartenGr/BERTopic/pull/1073)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/merge/merge.html">Merge Pre-trained BERTopic Models</a></b></h3>

The new `.merge_models` feature allows for any number of fitted BERTopic models to be merged. Doing so allows for a number of use cases:

* **Incremental topic modeling** -- Continuously merge models together to detect whether new topics have appeared
* **Federated Learning** - Train BERTopic models on different clients and combine them on a central server
* **Minimal compute** - We can essentially batch the training process into multiple instances to reduce compute
* **Different datasets** - When you have different datasets that you want to train separately on, for example with different languages, you can train each model separately and join them after training

To demonstrate merging different topic models with BERTopic, we use the ArXiv paper abstracts to see which topics they generally contain.

First, we train three separate models on different parts of the data:

```python
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]

# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)
```


Then, we can combine all three models into one with `.merge_models`:

```python
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
```

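The merged model behaves like any other fitted BERTopic model. As a quick sanity check, a minimal sketch assuming the `merged_model` from above:

```python
# Overview of all topics in the merged model
print(merged_model.get_topic_info())
```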

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html">Zero-shot Topic Modeling</a></b></h3>
Zero-shot topic modeling is a technique that allows you to find pre-defined topics in large amounts of documents. This method allows you to not only find those specific topics but also create new topics for documents that do not fit your predefined topics.
This allows for extensive flexibility as there are three scenarios to explore:

* No zero-shot topics were detected. This means that none of the documents fit the predefined topics and a regular BERTopic would be run.
* Only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
* Both zero-shot topics and clustered topics were detected. This means that some documents fit the predefined topics whereas others do not. For the latter, new topics were found.

![zeroshot](https://github.com/MaartenGr/BERTopic/assets/25746895/9cce6ee3-445f-440a-b93b-f8008578c839)

In order to use zero-shot BERTopic, we create a list of topics that we want to assign to our documents. However,
there may be several other topics that we know should be in the documents. The dataset that we use is a small subset of ArXiv papers.
We know the data and believe there to be at least the following topics: *clustering*, *topic modeling*, and *large language models*.
However, we are not sure whether other topics exist and want to explore those.

Using this feature is straightforward:

```python
from datasets import load_dataset

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)
```


When we run `topic_model.get_topic_info()`, we see something like this:

![zeroshot_output](https://github.com/MaartenGr/BERTopic/assets/25746895/1801e0a9-cda7-4d74-929f-e975fa67404b)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/seed_words/seed_words.html">Seed (Domain-specific) Words</a></b></h3>


When performing topic modeling, you are often faced with data that you are familiar with to a certain extent or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, short forms, acronyms, etc. For example, the *"TNM"* classification is a method for identifying the stage of most cancers. The word *"TNM"* is an abbreviation and might not be correctly captured in generic embedding models.

To make sure that certain domain-specific words are weighted higher and are more often used in topic representations, you can set any number of `seed_words` in the `bertopic.vectorizers.ClassTfidfTransformer`. To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic, were it to be found. Using the `ClassTfidfTransformer`, we can define those `seed_words` and also choose by how much their values are multiplied.

The full example is then as follows:

```python
from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Let's take a subset of ArXiv abstracts as the training data
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

# For illustration purposes, we make sure the output is fixed when running this code multiple times
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# We can choose any number of seed words for which we want their representation
# to be strengthened. We increase the importance of these words as we want them to be more
# likely to end up in the topic representations.
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior", "policies", "environment"],
    seed_multiplier=2
)

# We run the topic model with the seeded words
topic_model = BERTopic(
    umap_model=umap_model,
    min_topic_size=15,
    ctfidf_model=ctfidf_model,
).fit(abstracts)
```

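To verify that the seeded words indeed surface in the representations, you can inspect individual topics after fitting. A minimal sketch, assuming the model above found a reinforcement-learning topic (the topic id `0` here is illustrative):

```python
# Keywords and their c-TF-IDF weights for a topic of interest
print(topic_model.get_topic(0))
```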

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#truncating-documents">Truncate Documents in LLMs</a></b></h3>

When using LLMs with BERTopic, we can truncate the input documents in `[DOCUMENTS]` in order to reduce the number of tokens that we have in our input prompt. To do so, all text generation modules have two parameters that we can tweak:

* `doc_length` - The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed.
* `tokenizer` - The tokenizer used to split the document into tokens, which are counted against `doc_length`.
    * Options include `'char'`, `'whitespace'`, `'vectorizer'`, and a callable

This means that the definition of `doc_length` changes depending on what constitutes a token in the `tokenizer` parameter. If a token is a character, then `doc_length` refers to the maximum length in characters. If a token is a word, then `doc_length` refers to the maximum length in words.
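
For intuition, the interaction between `doc_length` and the built-in `'char'` and `'whitespace'` options behaves roughly like the sketch below. This is a simplified illustration, not BERTopic's internal implementation, and the helper name `truncate_document` is hypothetical:

```python
def truncate_document(document: str, doc_length: int, tokenizer: str) -> str:
    # With 'char', a token is a single character, so doc_length caps characters
    if tokenizer == "char":
        return document[:doc_length]
    # With 'whitespace', a token is a whitespace-separated word
    if tokenizer == "whitespace":
        return " ".join(document.split()[:doc_length])
    return document

print(truncate_document("a document about topic modeling", doc_length=3, tokenizer="whitespace"))
# -> "a document about"
```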

Let's illustrate this with an example. In the code below, we will use [`tiktoken`](https://github.com/openai/tiktoken) to count the number of tokens in each document and limit them to 100 tokens. All documents that have more than 100 tokens will be truncated.

We use `bertopic.representation.OpenAI` to represent our topics with nicely written labels. We specify that documents that we put in the prompt cannot exceed 100 tokens each. Since we will put 4 documents in the prompt, they will total roughly 400 tokens:

```python
import openai
import tiktoken
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Tokenizer
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Create your representation model
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(
    client,
    model="gpt-3.5-turbo",
    delay_in_seconds=2,
    chat=True,
    nr_docs=4,
    doc_length=100,
    tokenizer=tokenizer
)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```

0.15.0

*Release date: 29 May, 2023*

<h3><b>Highlights:</b></h3>

* [**Multimodal**](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) Topic Modeling
    * Train your topic modeling on text, images, or images and text!
    * Use the `bertopic.backend.MultiModalBackend` to embed images, text, both, or even caption images!
* [**Multi-Aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) Topic Modeling
    * Create multiple topic representations simultaneously
* Improved [**Serialization**](https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html) options
    * Push your model to the HuggingFace Hub with `.push_to_hf_hub`
    * Safer, smaller and more flexible serialization options with `safetensors`
    * Thanks to a great collaboration with HuggingFace and the authors of [BERTransfer](https://github.com/opinionscience/BERTransfer)!
* Added new embedding models
    * OpenAI: `bertopic.backend.OpenAIBackend`
    * Cohere: `bertopic.backend.CohereBackend`
* Added example of [summarizing topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#summarization) with OpenAI's GPT-models
* Added `nr_docs` and `diversity` parameters to OpenAI and Cohere representation models
* Use `custom_labels="Aspect1"` to use the aspect labels for visualizations instead
* Added cuML support for probability calculation in `.transform`
* Updated **topic embeddings**
    * Centroids by default and c-TF-IDF weighted embeddings for `partial_fit` and `.update_topics`
* Added `exponential_backoff` parameter to `OpenAI` model

<h3><b>Fixes:</b></h3>

* Fixed custom prompt not working in `TextGeneration` ([1142](https://github.com/MaartenGr/BERTopic/pull/1142))
* Add additional logic to handle cupy arrays by [metasyn](https://github.com/metasyn) in [#1179](https://github.com/MaartenGr/BERTopic/pull/1179)
* Fix hierarchy viz and handle any form of distance matrix by [elashrry](https://github.com/elashrry) in [#1173](https://github.com/MaartenGr/BERTopic/pull/1173)
* Updated languages list by [sam9111](https://github.com/sam9111) in [#1099](https://github.com/MaartenGr/BERTopic/pull/1099)
* Added level_scale argument to visualize_hierarchical_documents by [zilch42](https://github.com/zilch42) in [#1106](https://github.com/MaartenGr/BERTopic/pull/1106)
* Fix inconsistent naming by [rolanderdei](https://github.com/rolanderdei) in [#1073](https://github.com/MaartenGr/BERTopic/pull/1073)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal Topic Modeling</a></b></h3>

With v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that each document is expected to have an image and vice versa. Instagram pictures, for example, almost always have some description accompanying them.

<figure markdown>
![Image title](getting_started/multimodal/images_and_text.svg)
<figcaption></figcaption>
</figure>

In this example, we are going to use images from `flickr` that each have a caption associated with them:

```python
# NOTE: This requires the `datasets` package which you can
# install with `pip install datasets`
from datasets import load_dataset

ds = load_dataset("maderix/flickr_bw_rgb")
images = ds["train"]["image"]
docs = ds["train"]["caption"]
```


The `docs` variable contains the captions for each image in `images`. We can now use these variables to run our multimodal example:

```python
from bertopic import BERTopic
from bertopic.representation import VisualRepresentation

# Additional ways of representing a topic
visual_model = VisualRepresentation()

# Make sure to add the `visual_model` to a dictionary
representation_model = {
    "Visual_Aspect": visual_model,
}
topic_model = BERTopic(representation_model=representation_model, verbose=True)
```

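Before the image representations can be accessed, the model still needs to be fitted on both modalities. A minimal sketch, assuming the `docs` and `images` from the flickr example above and the `images` keyword of `.fit_transform` for multimodal modeling:

```python
# Fit on the captions while attaching the corresponding images
topics, probs = topic_model.fit_transform(docs, images=images)
```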

We can now access our image representations for each topic with `topic_model.topic_aspects_["Visual_Aspect"]`.
If you want an overview of the topic images together with their textual representations in Jupyter, you can run the following:

```python
import base64
from io import BytesIO

from IPython.display import HTML
from PIL import Image

def image_base64(im):
    # If a path is given, load the image from disk first
    if isinstance(im, str):
        im = Image.open(im)
    with BytesIO() as buffer:
        im.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()


def image_formatter(im):
    return f'<img src="data:image/jpeg;base64,{image_base64(im)}">'

# Extract dataframe
df = topic_model.get_topic_info().drop(["Representative_Docs", "Name"], axis=1)

# Visualize the images
HTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))
```


![images_and_text](https://github.com/MaartenGr/BERTopic/assets/25746895/3a741e2b-5810-4865-9664-0c6bb24ca3f9)


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect Topic Modeling</a></b></h3>

In this new release, we introduce `multi-aspect topic modeling`! During the `.fit` or `.fit_transform` stages, you can now get multiple representations of a single topic. In practice, it works by generating and storing all kinds of different topic representations (see image below).

<figure markdown>
![Image title](getting_started/multiaspect/multiaspect.svg)
<figcaption></figcaption>
</figure>

The approach is rather straightforward. We might want to represent our topics using a `PartOfSpeech` representation model but we might also want to try out `KeyBERTInspired` and compare those representation models. We can do this as follows:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.representation import PartOfSpeech
from bertopic.representation import MaximalMarginalRelevance
from sklearn.datasets import fetch_20newsgroups

# Documents to train on
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# The main representation of a topic
main_representation = KeyBERTInspired()

# Additional ways of representing a topic
aspect_model1 = PartOfSpeech("en_core_web_sm")
aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]

# Add all models together to be run in a single `fit`
representation_model = {
    "Main": main_representation,
    "Aspect1": aspect_model1,
    "Aspect2": aspect_model2
}
topic_model = BERTopic(representation_model=representation_model).fit(docs)
```


As shown above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
The main pipeline, which is used in most visualization options, is defined with the `"Main"` key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as `"Aspect1"` and `"Aspect2"`.

After we have fitted our model, we can access all representations with `topic_model.get_topic_info()`:

<img src="getting_started/multiaspect/table.PNG">
<br>

As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
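
For example, a minimal sketch of retrieving one aspect's representation for a single topic, assuming the model fitted above (the topic id `0` is illustrative):

```python
# Keywords of topic 0 according to the "Aspect1" (PartOfSpeech) pipeline
print(topic_model.topic_aspects_["Aspect1"][0])
```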


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html">Serialization</a></b></h3>

Saving, loading, and sharing a BERTopic model can be done in several ways. With this new release, it is now advised to go with `safetensors` as that allows for a small, safe, and fast method for saving your BERTopic model. However, other formats, such as `pickle` and PyTorch `.bin`, are also possible.

The methods are used as follows:

```python
topic_model = BERTopic().fit(my_docs)

# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 2 - pytorch
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 3 - pickle
topic_model.save("my_model", serialization="pickle")
```


Saving the topic model with `safetensors` or `pytorch` has a number of advantages:

* `safetensors` is a relatively **safe format**
* The resulting model can be **very small** (often < 20MB) since no sub-models need to be saved
* Although version control is important, there is a bit more **flexibility** with respect to specific versions of packages
* More easily used in **production**
* **Share** models with the HuggingFace Hub

<br><br>
<img src="getting_started/serialization/serialization.png">
<br><br>

The image above, based on a model trained on 100,000 documents, demonstrates the difference in size between `safetensors`, `pytorch`, and `pickle`. The difference can mostly be explained by the efficient saving procedure and by the fact that the clustering and dimensionality reduction models are not saved in safetensors/pytorch, since inference can be done based on the topic embeddings.
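
Loading a model saved with any of these methods goes through `BERTopic.load`. A minimal sketch, assuming the directory used in the saving example above:

```python
from bertopic import BERTopic

# Load the safetensors/pytorch model from its directory
loaded_model = BERTopic.load("path/to/my/model_dir")
```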




<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html#huggingFace-hub">HuggingFace Hub</a></b></h3>

When you have created a BERTopic model, you can easily share it with others through the HuggingFace Hub. First, you need to log in to your HuggingFace account:

```python
from huggingface_hub import login
login()
```


When you have logged in to your HuggingFace account, you can save and upload the model as follows:

```python
from bertopic import BERTopic

# Train model
topic_model = BERTopic().fit(my_docs)

# Push to HuggingFace Hub
topic_model.push_to_hf_hub(
    repo_id="MaartenGr/BERTopic_ArXiv",
    save_ctfidf=True
)

# Load from HuggingFace
loaded_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
```

0.14.1

*Release date: 2 March, 2023*

<h3><b>Highlights:</b></h3>

* Use [**ChatGPT**](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#chatgpt) to create topic representations!
* Added `delay_in_seconds` parameter to OpenAI and Cohere representation models for throttling the API
    * Setting this between 5 and 10 allows trial users to use the models more easily without hitting RateLimitErrors
* Fixed missing `title` param to visualization methods
* Fixed probabilities not correctly aligning ([1024](https://github.com/MaartenGr/BERTopic/issues/1024))
* Fix typo in textgenerator by [dkopljar27](https://github.com/dkopljar27) in [#1002](https://github.com/MaartenGr/BERTopic/pull/1002)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#chatgpt">ChatGPT</a></b></h3>

Within OpenAI's API, the ChatGPT models use a different API structure compared to the GPT-3 models.
In order to use ChatGPT with BERTopic, we need to define the model and make sure to set `chat=True`:

```python
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI(model="gpt-3.5-turbo", delay_in_seconds=10, chat=True)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


Prompting with ChatGPT is very satisfying and can be customized in BERTopic by using certain tags.
There are currently two tags, namely `"[KEYWORDS]"` and `"[DOCUMENTS]"`.
These tags indicate where in the prompt they are to be replaced with a topic's keywords and its top 4 most representative documents, respectively.
For example, if we have the following prompt:

```python
prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""
```


then that will be rendered as follows and passed to OpenAI's API:

```python
"""
I have a topic that contains the following documents:
- Our videos are also made possible by your support on patreon.co.
- If you want to help us make more videos, you can do so on patreon.com or get one of our posters from our shop.
- If you want to help us make more videos, you can do so there.
- And if you want to support us in our endeavor to survive in the world of online video, and make more videos, you can do so on patreon.com.

The topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""
```


!!! note
    Whenever you create a custom prompt, it is important to add

        Based on the information above, extract a short topic label in the following format:
        topic: <topic label>

    at the end of your prompt as BERTopic extracts everything that comes after `topic: `. Having
    said that, if `topic: ` is not in the output, then it will simply extract the entire response, so
    feel free to experiment with the prompts.
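
A custom prompt is passed through the `prompt` parameter of the representation model. A minimal sketch, assuming the `prompt` defined above:

```python
# Use the custom prompt instead of the default one
representation_model = OpenAI(model="gpt-3.5-turbo", prompt=prompt, chat=True)
topic_model = BERTopic(representation_model=representation_model)
```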

0.14.0

*Release date: 14 February, 2023*

<h3><b>Highlights:</b></h3>

* Fine-tune [topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with `bertopic.representation`
    * Diverse range of models, including KeyBERT, MMR, POS, Transformers, OpenAI, and more!
    * Create your own prompts for text generation models, like GPT3:
        * Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt to decide where the keywords and set of representative documents need to be inserted.
    * Chain models to perform fine-grained fine-tuning (see the sketch after this list)
    * Create and customize your representation model
* Improved the topic reduction technique when using `nr_topics=int`
* Added `title` parameters for all graphs ([800](https://github.com/MaartenGr/BERTopic/issues/800))
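
Chaining works by passing a list of representation models, which are applied in sequence. A minimal sketch in which KeyBERT-inspired keywords are first generated and then diversified with MMR:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Each model in the list fine-tunes the output of the previous one
representation_model = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=0.3)]
topic_model = BERTopic(representation_model=representation_model)
```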


<h3><b>Fixes:</b></h3>

* Improve documentation ([837](https://github.com/MaartenGr/BERTopic/issues/837), [#769](https://github.com/MaartenGr/BERTopic/issues/769), [#954](https://github.com/MaartenGr/BERTopic/issues/954), [#912](https://github.com/MaartenGr/BERTopic/issues/912), [#911](https://github.com/MaartenGr/BERTopic/issues/911))
* Bump pyyaml ([903](https://github.com/MaartenGr/BERTopic/issues/903))
* Fix large number of representative docs ([965](https://github.com/MaartenGr/BERTopic/issues/965))
* Prevent stochastic behavior in `.visualize_topics` ([952](https://github.com/MaartenGr/BERTopic/issues/952))
* Add custom labels parameter to `.visualize_topics` ([976](https://github.com/MaartenGr/BERTopic/issues/976))
* Fix cuML HDBSCAN type checks by [FelSiq](https://github.com/FelSiq) in [#981](https://github.com/MaartenGr/BERTopic/pull/981)

<h3><b>API Changes:</b></h3>

* The `diversity` parameter was removed in favor of `bertopic.representation.MaximalMarginalRelevance` (see the migration sketch below)
* The `representation_model` parameter was added to `bertopic.BERTopic`
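
For users migrating from earlier versions, diversity is now controlled through the representation model rather than a BERTopic parameter. A minimal migration sketch:

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

# Previously: BERTopic(diversity=0.3)
representation_model = MaximalMarginalRelevance(diversity=0.3)
topic_model = BERTopic(representation_model=representation_model)
```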

<br>

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired">Representation Models</a></b></h3>

Fine-tune the c-TF-IDF representation with a variety of models. Whether that is through a KeyBERT-Inspired model or GPT-3, the choice is up to you!

<iframe width="1200" height="500" src="https://user-images.githubusercontent.com/25746895/218417067-a81cc179-9055-49ba-a2b0-f2c1db535159.mp4" title="BERTopic Overview" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<br>


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired">KeyBERTInspired</a></b></h3>

The algorithm follows some principles of [KeyBERT](https://github.com/MaartenGr/KeyBERT) but does some optimization in order to speed up inference. Usage is straightforward:

![keybertinspired](https://user-images.githubusercontent.com/25746895/216336376-d2c4e5d6-6cf7-435c-904c-fc195aae7dcd.svg)

```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Create your representation model
representation_model = KeyBERTInspired()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![keybert](https://user-images.githubusercontent.com/25746895/218417161-bfd5980e-43c7-498a-904a-b6018ba58d45.svg)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#partofspeech">PartOfSpeech</a></b></h3>

Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech, as extracting noun phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part-of-speech filtering on a subset of keywords and documents that best represent a topic.

![partofspeech](https://user-images.githubusercontent.com/25746895/216336534-48ff400e-72e1-4c50-9030-414576bac01e.svg)


```python
from bertopic.representation import PartOfSpeech
from bertopic import BERTopic

# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![pos](https://user-images.githubusercontent.com/25746895/218417198-41c19b5c-251f-43c1-bfe2-0a480731565a.svg)


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#maximalmarginalrelevance">MaximalMarginalRelevance</a></b></h3>

When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
essentially represent the same information and are often redundant. We can use `MaximalMarginalRelevance` to improve the diversity of our candidate topics:

![mmr](https://user-images.githubusercontent.com/25746895/216336697-558f1409-8da3-4076-a21b-d87eec583ac7.svg)


```python
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic

# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![mmr (1)](https://user-images.githubusercontent.com/25746895/218417234-88b145e2-7293-43c0-888c-36abe469a48a.svg)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#zero-shot-classification">Zero-Shot Classification</a></b></h3>

To perform zero-shot classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.

We use it in BERTopic as follows:

```python
from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic

# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![zero](https://user-images.githubusercontent.com/25746895/218417276-dcef3519-acba-4792-8601-45dc7ed39488.svg)

<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#transformers">Text Generation: 🤗 Transformers</a></b></h3>

Nearly every week, there are new and improved models released on the 🤗 [Model Hub](https://huggingface.co/models) that, with some creativity, allow for
further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-shot classification. In BERTopic, wrappers around these
methods are created as a way to support whatever might be released in the future.

Using a GPT-like model from the 🤗 Model Hub is rather straightforward:

```python
from bertopic.representation import TextGeneration
from bertopic import BERTopic

# Create your representation model
representation_model = TextGeneration('gpt2')

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![hf](https://user-images.githubusercontent.com/25746895/218417310-2b0eabc7-296d-499d-888b-0ab48a65a2fb.svg)


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#cohere">Text Generation: Cohere</a></b></h3>

Instead of using a language model from 🤗 transformers, we can use external APIs that
do the work for you. Here, we can use [Cohere](https://docs.cohere.ai/) to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install cohere first:

```bash
pip install cohere
```


Then, get yourself an API key and use Cohere's API as follows:

```python
import cohere
from bertopic.representation import Cohere
from bertopic import BERTopic

# Create your representation model
co = cohere.Client(my_api_key)
representation_model = Cohere(co)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![cohere](https://user-images.githubusercontent.com/25746895/218417337-294cb52a-93c9-4fd5-b981-29b40e4f0c1e.svg)


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#openai">Text Generation: OpenAI</a></b></h3>

Instead of using a language model from 🤗 transformers, we can use external APIs that
do the work for you. Here, we can use [OpenAI](https://openai.com/api/) to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install openai first:

```bash
pip install openai
```


Then, get yourself an API key and use OpenAI's API as follows:

```python
import openai
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```


![openai](https://user-images.githubusercontent.com/25746895/218417357-cf8c0fab-4450-43d3-b4fd-219ed276d870.svg)


<h3><b><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#langchain">Text Generation: LangChain</a></b></h3>

[LangChain](https://github.com/hwchase17/langchain) is a package that helps users chain large language models.
In BERTopic, we can leverage this package in order to more efficiently combine external knowledge. Here, this
external knowledge is the most representative documents in each topic.

To use LangChain, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain,
like openai:

```bash
pip install langchain openai
```


Then, you can create your chain as follows:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=MY_API_KEY), chain_type="stuff")
```


Finally, you can pass the chain to BERTopic as follows:

```python
from bertopic import BERTopic
from bertopic.representation import LangChain

# Create your representation model
representation_model = LangChain(chain)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
