Bertopic

Latest version: v0.17.0

0.8.1

*Release date: 8 June, 2021*

**Highlights**:

* Improved models:
    * For English documents the default is now: `"paraphrase-MiniLM-L6-v2"`
    * For Non-English or multi-lingual documents the default is now: `"paraphrase-multilingual-MiniLM-L12-v2"`
    * Both models show not only great performance but are much faster!
* Add interactive visualizations to the `plotting` API documentation

For better performance, please use the following models:

* English: `"paraphrase-mpnet-base-v2"`
* Non-English or multi-lingual: `"paraphrase-multilingual-mpnet-base-v2"`

**Fixes**:

* Improved unit testing for more stability
* Set transformers version for Flair

0.8.0

*Release date: 31 May, 2021*

**Highlights**:

* Additional visualizations:
    * Topic Hierarchy: `topic_model.visualize_hierarchy()`
    * Topic Similarity Heatmap: `topic_model.visualize_heatmap()`
    * Topic Representation Barchart: `topic_model.visualize_barchart()`
    * Term Score Decline: `topic_model.visualize_term_rank()`
* Created `bertopic.plotting` library to easily extend visualizations
* Improved automatic topic reduction by using HDBSCAN to detect similar topics
* Sort topic ids by their frequency. -1 is the outlier class and typically contains the most documents. After that, 0 is the largest topic, 1 the second largest, etc.
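
The frequency-based ordering of topic ids can be illustrated with a small, library-free sketch. This is not BERTopic's internal code; the helper name `sort_topics_by_frequency` and the sample labels are hypothetical and only show the relabeling idea: the outlier label -1 is left alone and the remaining ids are renumbered from largest to smallest topic.

```python
from collections import Counter


def sort_topics_by_frequency(topics):
    """Relabel topics so 0 is the largest non-outlier topic, 1 the next, etc.

    Hypothetical helper for illustration; the outlier label -1 is kept as-is.
    """
    # Count every topic except the outlier class.
    counts = Counter(t for t in topics if t != -1)
    # most_common() orders by descending count; ties keep first-seen order.
    mapping = {old: new for new, (old, _) in enumerate(counts.most_common())}
    mapping[-1] = -1
    return [mapping[t] for t in topics]


# Made-up cluster labels: topic 5 has 3 documents, topic 2 has 2, topic 7 has 1.
topics = [5, -1, 2, 2, 5, 5, -1, 7, -1, -1]
relabeled = sort_topics_by_frequency(topics)
print(relabeled)
```

After relabeling, a lower non-negative id always means a topic with at least as many documents as any higher id.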

**Fixes**:

* Fix typos ([113](https://github.com/MaartenGr/BERTopic/pull/113), [117](https://github.com/MaartenGr/BERTopic/pull/117))
* Fix [121](https://github.com/MaartenGr/BERTopic/issues/121) by removing [these](https://github.com/MaartenGr/BERTopic/blob/5c6cf22776fafaaff728370781a5d33727d3dc8f/bertopic/_bertopic.py#L359-L360) two lines
* Fix mapping of topics after reduction (it now excludes 0) ([103](https://github.com/MaartenGr/BERTopic/issues/103))

0.7.0

*Release date: 26 April, 2021*

The two main features are **(semi-)supervised topic modeling**
and several **backends** to use instead of Flair and SentenceTransformers!

**Highlights**:

* (semi-)supervised topic modeling by leveraging supervised options in UMAP
    * `model.fit(docs, y=target_classes)`
* Backends:
    * Added Spacy, Gensim, USE (TFHub)
    * Use a different backend for document embeddings and word embeddings
    * Create your own backends with `bertopic.backend.BaseEmbedder`
    * Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) for an overview of all new backends
* Calculate and visualize topics per class
    * Calculate: `topics_per_class = topic_model.topics_per_class(docs, topics, classes)`
    * Visualize: `topic_model.visualize_topics_per_class(topics_per_class)`
* Several tutorials were updated and added:

| Name | Link |
|---|---|
| Topic Modeling with BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing) |
| (Custom) Embedding Models in BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing) |
| Advanced Customization in BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
| (semi-)Supervised Topic Modeling with BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing) |
| Dynamic Topic Modeling with Trump's Tweets | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing) |
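
To give a rough idea of what "topics per class" means in the highlights above, the sketch below counts how often each topic occurs within each class. It is a simplified illustration, not the library's implementation: the real `topics_per_class` is also expected to produce per-class topic words, which is omitted here, and the helper name `topic_frequency_per_class` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def topic_frequency_per_class(topics, classes):
    """Count topic occurrences within each class.

    Hypothetical helper: `topics` and `classes` are parallel lists with one
    entry per document.
    """
    per_class = defaultdict(Counter)
    for topic, cls in zip(topics, classes):
        per_class[cls][topic] += 1
    # Convert to plain dicts for easier inspection.
    return {cls: dict(counter) for cls, counter in per_class.items()}


# Made-up topic assignments and class labels for six documents.
topics = [0, 0, 1, -1, 1, 0]
classes = ["news", "sport", "news", "news", "sport", "news"]
freq = topic_frequency_per_class(topics, classes)
print(freq)
```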

**Fixes**:

* Fixed issues with the Torch requirement
* Prevent saving term frequency matrix in CTFIDF class
* Fixed DTM not working when reducing topics ([96](https://github.com/MaartenGr/BERTopic/issues/96))
* Moved visualization dependencies to base BERTopic
    * `pip install bertopic[visualization]` becomes `pip install bertopic`
* Allow precomputed embeddings in `bertopic.find_topics()` ([79](https://github.com/MaartenGr/BERTopic/issues/79)):

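```python
from bertopic import BERTopic

# my_embedding_model, docs, my_precomputed_embeddings, and search_term are placeholders
model = BERTopic(embedding_model=my_embedding_model)
model.fit(docs, my_precomputed_embeddings)
model.find_topics(search_term)
```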

0.6.0

*Release date: 1 March, 2021*

**Highlights**:

* DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation
    * `model.topics_over_time(docs, timestamps, global_tuning=True)`
* DTM: Option to evolve topics based on t-1 c-TF-IDF representation which results in evolving topics over time
    * Only uses topics at t-1 and skips evolution if there is a gap
    * `model.topics_over_time(docs, timestamps, evolution_tuning=True)`
* DTM: Function to visualize topics over time
    * `model.visualize_topics_over_time(topics_over_time)`
* DTM: Add binning of timestamps
    * `model.topics_over_time(docs, timestamps, nr_bins=10)`
* Add function to get general information about topics (id, frequency, name, etc.)
    * `get_topic_info()`
* Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
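
The c-TF-IDF change in the last item can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the exact weighting used here (term frequency normalized per topic, multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` the frequency of the term across all topics) is an assumption, and the function name `c_tf_idf` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(tokens_per_topic):
    """Score terms per topic with a class-based TF-IDF (illustrative sketch).

    tokens_per_topic: dict mapping a topic id to the list of tokens from all
    documents assigned to that topic.
    """
    term_counts = {t: Counter(tokens) for t, tokens in tokens_per_topic.items()}
    # A: average number of words per topic, the quantity named in the release
    # note (used in place of the number of documents).
    avg_words = sum(len(tokens) for tokens in tokens_per_topic.values()) / len(tokens_per_topic)
    # f_t: frequency of each term across all topics.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            # Assumed weighting: normalized tf times log(1 + A / f_t).
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores


# Toy data: two topics with a few tokens each.
tokens_per_topic = {
    0: ["cat", "cat", "dog"],
    1: ["dog", "car", "car", "car", "road"],
}
scores = c_tf_idf(tokens_per_topic)
for topic, term_scores in scores.items():
    print(topic, max(term_scores, key=term_scores.get))
```

Because `A` depends on the total word count rather than on how many documents there are, the scores do not shift when the same text is split into more or fewer documents.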

**Fixes**:

* Fixed `_map_probabilities()`, which did not take into account that the outlier class has no probability and which mutated the probabilities instead of copying them (63, 64)

0.5.0

*Release date: 8 February, 2021*

**Highlights**:

* Add `Flair` to allow for more (custom) token/document embeddings, including 🤗 transformers
* Option to use custom UMAP, HDBSCAN, and CountVectorizer
* Added `low_memory` parameter to reduce memory during computation
* Improved verbosity (shows progress bar)
* Return the figure of `visualize_topics()`
* Expose all parameters with a single function: `get_params()`

**Fixes**:

* To simplify the API, the parameters stop_words and n_neighbors were removed. These can still be used when a custom UMAP or CountVectorizer is used.
* Set `calculate_probabilities` to `False` by default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so probabilities are now only calculated when this is turned on manually.
* Use the newest version of `sentence-transformers` as it speeds up encoding significantly

0.4.2

*Release date: 10 January, 2021*

**Fixes**:

* Selecting `embedding_model` did not work when `language` was also used. This led to the user needing
to set `language` to None before being able to use `embedding_model`. Fixed by using `embedding_model` when
`language` is used (as a default parameter).
