BERTopic

Latest version: v0.16.4


Page 3 of 6

0.9.4

*Release date: 14 December, 2021*

A number of fixes, documentation updates, and small features:

* Expose the `diversity` parameter
    * Use `BERTopic(diversity=0.1)` to control how diverse the words in a topic representation are (ranges from 0 to 1)
* Improve stability of topic reduction by only computing the cosine similarity within c-TF-IDF and not the topic embeddings
* Enforce that all IDF values in c-TF-IDF are positive ([#351](https://github.com/MaartenGr/BERTopic/issues/351))
* Improve stability of `.visualize_barchart()` and `.visualize_hierarchy()`
* Major [documentation](https://maartengr.github.io/BERTopic/) overhaul (mkdocs, tutorials, FAQ, images, etc.) ([#330](https://github.com/MaartenGr/BERTopic/issues/330))
* Drop support for Python 3.6 ([#333](https://github.com/MaartenGr/BERTopic/issues/333))
* Relax the plotly dependency ([#88](https://github.com/MaartenGr/BERTopic/issues/88))
* Additional logging for `.transform` ([#356](https://github.com/MaartenGr/BERTopic/issues/356))
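
The `diversity` parameter is based on Maximal Marginal Relevance (MMR): candidate words are picked by trading off relevance to the topic against similarity to words already selected. A minimal numpy sketch of the idea, assuming precomputed word and topic embeddings (function name and data are illustrative, not BERTopic's internal API):

```python
import numpy as np

def mmr(topic_embedding, word_embeddings, words, top_n=3, diversity=0.1):
    """Select top_n words that are relevant to the topic yet mutually diverse."""
    # Cosine similarity of every candidate word to the topic as a whole
    word_norms = np.linalg.norm(word_embeddings, axis=1)
    word_topic_sim = (word_embeddings @ topic_embedding) / (
        word_norms * np.linalg.norm(topic_embedding))
    # Pairwise cosine similarity between candidate words
    word_word_sim = (word_embeddings @ word_embeddings.T) / np.outer(word_norms, word_norms)

    # Start with the single most relevant word
    selected = [int(np.argmax(word_topic_sim))]
    while len(selected) < top_n:
        candidates = [i for i in range(len(words)) if i not in selected]
        # MMR score: relevance minus (weighted) max similarity to already chosen words
        scores = [(1 - diversity) * word_topic_sim[i]
                  - diversity * max(word_word_sim[i][j] for j in selected)
                  for i in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return [words[i] for i in selected]
```

With `diversity=0` this reduces to plain relevance ranking; larger values penalize near-duplicate words more heavily.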

0.9.3

*Release date: 17 October, 2021*

* Fix [#282](https://github.com/MaartenGr/BERTopic/issues/282)
    * As it turns out, the old implementation of topic mapping was still present in the `transform` function
* Fix [#285](https://github.com/MaartenGr/BERTopic/issues/285)
    * Fix retrieval of all representative docs
* Fix [#288](https://github.com/MaartenGr/BERTopic/issues/288)
    * A recent issue with the `pyyaml` package that surfaces in Google Colab

0.9.2

*Release date: 12 October, 2021*

A release focused on algorithmic optimization and fixing several issues:

**Highlights**:

* Update the non-multilingual paraphrase-* models to the all-* models due to improved [performance](https://www.sbert.net/docs/pretrained_models.html)
* Reduce necessary RAM in c-TF-IDF top 30 word [extraction](https://stackoverflow.com/questions/49207275/finding-the-top-n-values-in-a-row-of-a-scipy-sparse-matrix)

**Fixes**:

* Fix topic mapping
    * When reducing the number of topics, topics need to be mapped to the correct input/output, which had some issues in the previous version
    * A new class was created to track these mappings regardless of how many times they are executed
    * In other words, you can iteratively reduce the number of topics after training without needing to retrain the model
* Fix typo in embeddings page ([#200](https://github.com/MaartenGr/BERTopic/issues/200))
* Fix link in README ([#233](https://github.com/MaartenGr/BERTopic/issues/233))
* Fix documentation of `.visualize_term_rank()` ([#253](https://github.com/MaartenGr/BERTopic/issues/253))
* Fix getting the correct representative docs ([#258](https://github.com/MaartenGr/BERTopic/issues/258))
* Update the [memory FAQ](https://maartengr.github.io/BERTopic/faq.html#i-am-facing-memory-issues-help) with the [HDBSCAN PR](https://github.com/MaartenGr/BERTopic/issues/151)
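
The mapping tracker described above can be illustrated with a small class that composes successive old-label → new-label mappings, so repeated reductions stay consistent. This is a simplified sketch of the idea; the class and method names are illustrative, not BERTopic's actual implementation:

```python
class TopicMapper:
    """Track how topic labels map through successive topic reductions."""

    def __init__(self, topics):
        # Start with the identity mapping over the initially found topics
        self.mapping = {t: t for t in set(topics)}

    def add_mapping(self, new_map):
        """Compose one reduction step onto the accumulated mapping."""
        self.mapping = {orig: new_map.get(cur, cur)
                        for orig, cur in self.mapping.items()}

    def map(self, topics):
        """Translate original topic labels to their current labels."""
        return [self.mapping[t] for t in topics]
```

Because each reduction is composed onto the accumulated mapping, calling it any number of times still translates the original labels correctly:

```python
mapper = TopicMapper([0, 1, 2, -1])
mapper.add_mapping({2: 1})  # first reduction: topic 2 merged into 1
mapper.add_mapping({1: 0})  # second reduction: topic 1 merged into 0
mapper.map([0, 1, 2, -1])   # original labels now all resolve to 0 (or -1)
```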

0.9.1

*Release date: 1 September, 2021*

A release focused on fixing several issues:

**Fixes**:

* Fix TypeError when auto-reducing topics ([#210](https://github.com/MaartenGr/BERTopic/issues/210))
* Fix mapping representative docs when reducing topics ([#208](https://github.com/MaartenGr/BERTopic/issues/208))
* Fix visualization issues with probabilities ([#205](https://github.com/MaartenGr/BERTopic/issues/205))
* Fix missing `normalize_frequency` param in plots ([#213](https://github.com/MaartenGr/BERTopic/issues/213))

0.9.0

*Release date: 9 August, 2021*

**Highlights**:

* Implemented [**Guided BERTopic**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) -> use seeds to steer the topic modeling
* Get the most representative documents per topic: `topic_model.get_representative_docs(topic=1)`
    * This allows users to see which documents represent a topic well and to better understand the topics that were created
* Added a `normalize_frequency` parameter to `visualize_topics_per_class` and `visualize_topics_over_time` to better compare relative topic frequencies between topics
* Return flat probabilities by default; calculate the probabilities of all topics per document only if `calculate_probabilities` is True
* Added several FAQs
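
One way to think about representative documents is as the documents whose embeddings lie closest to their topic's centroid. A hedged numpy sketch of that idea follows; the selection strategy BERTopic actually uses may differ, and the function name is illustrative:

```python
import numpy as np

def representative_docs(doc_embeddings, topics, docs, topic, n=1):
    """Return the n docs closest (by cosine similarity) to the centroid of `topic`."""
    # Restrict to the documents assigned to this topic
    idx = [i for i, t in enumerate(topics) if t == topic]
    members = doc_embeddings[idx]
    centroid = members.mean(axis=0)
    # Cosine similarity of each member document to the topic centroid
    sims = (members @ centroid) / (
        np.linalg.norm(members, axis=1) * np.linalg.norm(centroid))
    best = np.argsort(sims)[::-1][:n]
    return [docs[idx[i]] for i in best]
```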

**Fixes**:

* Fix loading pre-trained BERTopic model
* Fix mapping of probabilities
* Fix [#190](https://github.com/MaartenGr/BERTopic/issues/190)


**Guided BERTopic**:

Guided BERTopic works in two ways:

First, we create an embedding for each seeded topic by joining its seed words and passing them through the document embedder.
These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label.
If a document is most similar to a seeded topic, it gets that topic's label.
If it is most similar to the average document embedding, it gets the -1 label.
These labels are then passed to UMAP to create a semi-supervised approach that should nudge topic creation toward the seeded topics.
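
The labeling in this first step can be sketched as follows, assuming document and seed embeddings are already computed (illustrative numpy, not BERTopic's exact code):

```python
import numpy as np

def seed_labels(doc_embeddings, seed_embeddings):
    """Assign each document the label of its most similar seed topic,
    or -1 when it is most similar to the average document embedding."""
    avg = doc_embeddings.mean(axis=0, keepdims=True)
    # Candidate targets: all seed-topic embeddings plus the average document
    targets = np.vstack([seed_embeddings, avg])
    # Cosine similarity between every document and every target
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = d @ t.T
    labels = sims.argmax(axis=1)
    # The last column is the average embedding -> outlier label -1
    labels[labels == len(seed_embeddings)] = -1
    return labels
```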

Second, we take all words in `seed_topic_list` and assign them a multiplier larger than 1.
Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing
the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an
irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to
remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant,
like taking the distribution of IDF values and its position into account when defining the multiplier.
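
This second step amounts to scaling the IDF values of seed words. A minimal sketch, where the function name and the multiplier value are illustrative rather than BERTopic's internals:

```python
import numpy as np

def boost_seed_idf(idf, vocab, seed_words, multiplier=1.2):
    """Multiply the IDF value of every seed word found in the vocabulary,
    nudging seeded words toward appearing in topic representations."""
    idf = idf.copy()
    for word in seed_words:
        if word in vocab:           # skip seed words outside the vocabulary
            idf[vocab[word]] *= multiplier
    return idf
```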

```python
from bertopic import BERTopic

seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                   ["acquisition", "procurement", "merge"],
                   ["exchange", "currency", "trading", "rate", "euro"],
                   ["grain", "wheat", "corn"],
                   ["coffee", "cocoa"],
                   ["natural", "gas", "oil", "fuel", "products", "petrol"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
```

0.8.1

*Release date: 8 June, 2021*

**Highlights**:

* Improved models:
    * For English documents the default is now `"paraphrase-MiniLM-L6-v2"`
    * For non-English or multilingual documents the default is now `"paraphrase-multilingual-MiniLM-L12-v2"`
    * Both models not only show great performance but are also much faster!
* Add interactive visualizations to the `plotting` API documentation

For better performance, please use the following models:

* English: `"paraphrase-mpnet-base-v2"`
* Non-English or multi-lingual: `"paraphrase-multilingual-mpnet-base-v2"`

**Fixes**:

* Improved unit testing for more stability
* Pin the `transformers` version for Flair


© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.