My ambition for BERTopic is to make it the one-stop shop for topic modeling by allowing for significant flexibility and modularity.
That has been the goal for the last few years, and with the release of v0.16, I believe we are a BIG step closer to achieving it.
First, let’s take a small step back. What is BERTopic?
Well, BERTopic is a topic modeling framework that essentially allows users to create their own version of a topic model. With many variations of topic modeling implemented, the idea is that it should support almost any use case.
With v0.16, several features were implemented that I believe will take BERTopic to the next level, namely:
- Zero-Shot Topic Modeling
- Model Merging
- More Large Language Model (LLM) Support
In this tutorial, we will go through what these features are and for which use cases they could be helpful.
To start with, you can install BERTopic (with HF datasets) as follows:

```bash
pip install bertopic datasets
```
You can also follow along with the Google Colab Notebook to make sure everything works as intended.
Zero-shot techniques generally refer to making predictions without any labeled training examples: you know the target labels in advance, but none of your data has been assigned to them.
In BERTopic, we use Zero-shot Topic Modeling to find pre-defined topics in large collections of documents.
Imagine you have ArXiv abstracts about Machine Learning and you know that the topic “Large Language Models” is in there. With Zero-shot Topic Modeling, you can ask BERTopic to find all documents related to…