Topic modeling is a powerful technique in natural language processing for discovering latent topics in text data. In this tutorial, we focus on topic modeling with scikit-learn and Non-negative Matrix Factorization (NMF).
Non-negative Matrix Factorization (NMF) is a method for feature extraction and dimensionality reduction that decomposes a non-negative matrix into two lower-dimensional matrices representing the latent topics and their associated word distributions. NMF is particularly well suited to topic modeling because the non-negativity constraints on the decomposed matrices lead to easily interpretable results.
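Before applying NMF to real text, it can help to see the factorization on a tiny toy matrix. The sketch below (with made-up values) factors a small non-negative "document-term" matrix V into a document-topic matrix W and a topic-term matrix H, and checks that both factors are non-negative:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "document-term" matrix: 4 documents x 6 terms (hypothetical counts)
V = np.array([
    [3, 2, 0, 0, 1, 0],
    [4, 3, 0, 0, 0, 1],
    [0, 0, 5, 4, 1, 0],
    [0, 1, 4, 5, 0, 0],
], dtype=float)

# Factor V (4x6) into W (4x2, document-topic weights) and H (2x6, topic-term weights)
nmf = NMF(n_components=2, init='nndsvd', random_state=0)
W = nmf.fit_transform(V)
H = nmf.components_

print(W.shape, H.shape)  # (4, 2) (2, 6)
print(np.all(W >= 0) and np.all(H >= 0))  # True: both factors are non-negative
```

The product W @ H approximates V, and because every entry is non-negative, each row of H can be read directly as a weighting of terms for one topic.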
To get started with Topic Modeling using NMF in scikit-learn, we first need to install the necessary libraries. You can install scikit-learn using pip:
pip install scikit-learn
Next, we will import the required libraries and load the text dataset that we will be working with. For this tutorial, we will use the 20 Newsgroups dataset, which contains text data from 20 different newsgroups:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
# Load the 20 Newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
After loading the dataset, we need to preprocess the text data and convert it into a numerical format that can be used for topic modeling. We will use the TfidfVectorizer from scikit-learn to convert the text data into a TF-IDF matrix:
# Preprocess and vectorize the text data
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups_data.data)
Now that we have preprocessed the data, we can perform topic modeling using NMF. We will specify the number of topics that we want to extract and fit the NMF model to the TF-IDF matrix:
# Specify the number of topics
n_topics = 10
# Fit the NMF model to the TF-IDF matrix
nmf_model = NMF(n_components=n_topics, init='nndsvd', random_state=42)  # increase max_iter if a ConvergenceWarning appears
nmf_matrix = nmf_model.fit_transform(tfidf_matrix)
Once the NMF model has been fitted, we can extract the topics and their associated word distributions. We can also print the top words for each topic to interpret the results:
# Get the feature names from the TfidfVectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()
# Print the top words for each topic
for topic_idx, topic in enumerate(nmf_model.components_):
    print(f'Topic {topic_idx + 1}:')
    print([feature_names[i] for i in topic.argsort()[:-11:-1]])  # top 10 words
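The slicing trick `argsort()[:-11:-1]` can look cryptic; here it is on a tiny hypothetical topic-word row (the vocabulary and weights are made up for illustration), taking the top 3 instead of the top 10:

```python
import numpy as np

# Hypothetical topic-word weight row, like one row of nmf_model.components_
topic = np.array([0.1, 0.9, 0.0, 0.5, 0.3])
vocab = np.array(['apple', 'ball', 'cat', 'dog', 'egg'])

# argsort sorts indices by ascending weight; the reversed slice [:-4:-1]
# walks backwards from the end, yielding the 3 highest-weight indices
top3 = topic.argsort()[:-4:-1]
print(vocab[top3])  # ['ball' 'dog' 'egg']
```

In the tutorial loop, `[:-11:-1]` does the same thing with 10 indices, turning each topic's weight vector into its 10 most characteristic words.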
Finally, we can assign each document in the dataset to its most representative topic based on the NMF matrix:
# Get the dominant topic for each document
dominant_topic = nmf_matrix.argmax(axis=1)
# Attach the dominant topic of each document to the dataset Bunch (a dict-like object)
newsgroups_data['dominant_topic'] = dominant_topic
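The argmax step can be sanity-checked on a tiny hypothetical document-topic matrix, like a miniature version of the `nmf_matrix` returned by `fit_transform`:

```python
import numpy as np

# Hypothetical document-topic weights (3 documents x 2 topics)
W = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

# Each document's dominant topic is the column with the largest weight
dominant = W.argmax(axis=1)
print(dominant)  # [0 1 0]
```

Documents 1 and 3 lean toward topic 0, document 2 toward topic 1, which is exactly how `dominant_topic` labels the newsgroups documents above.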
In this tutorial, we have demonstrated how to perform topic modeling using scikit-learn and Non-negative Matrix Factorization (NMF). By following the steps outlined above, you can discover latent topics within text data and gain valuable insights from your text corpus.