Implementing LSA with Scikit Learn in a Simple Manner

Step 3: Building the Document-Term Matrix

Now that we have preprocessed our text data, we can proceed to build the document-term matrix. This matrix represents the frequency of each word in each document, allowing us to perform mathematical operations on the text data.

We will use the CountVectorizer class from Scikit Learn to build the document-term matrix. This class converts a collection of text documents into a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Initialize the CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the preprocessed text data to create the document-term matrix
dtm = vectorizer.fit_transform(preprocessed_text)

# Convert the sparse document-term matrix to a DataFrame for easier visualization
dtm_df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())

In the code snippet above, we import the CountVectorizer class from Scikit Learn along with pandas. We then initialize a CountVectorizer instance and call its fit_transform method on the preprocessed text to build the document-term matrix. Finally, we convert the sparse matrix to a DataFrame for easier inspection, as shown in the quick check below.
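
If you want to sanity-check the matrix before moving on, a quick inspection looks like the sketch below, assuming dtm, dtm_df, and vectorizer are the objects created in the snippet above.

# A quick look at the document-term matrix built above
print(dtm.shape)                           # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out())  # the vocabulary learned from the corpus
print(dtm_df.head())                       # token counts for the first few documents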

Step 4: Applying Latent Semantic Analysis

With the document-term matrix ready, we can now apply Latent Semantic Analysis (LSA) to identify hidden patterns in the text data. LSA is a dimensionality reduction technique that helps find the underlying structure in the data by decomposing the document-term matrix using singular value decomposition (SVD).

from sklearn.decomposition import TruncatedSVD

# Initialize the TruncatedSVD object with 2 latent components
lsa_model = TruncatedSVD(n_components=2)

# Fit the LSA model and project the documents into the 2-dimensional LSA space
lsa_matrix = lsa_model.fit_transform(dtm)

In the code snippet above, we import the TruncatedSVD class from Scikit Learn and initialize it with n_components=2, indicating that we want to reduce the document-term matrix to 2 dimensions. Calling fit_transform both fits the LSA model and returns each document's coordinates in that 2-dimensional LSA space. A quick way to see what the components represent is sketched below.
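
To interpret the components, it helps to look at the words that load most strongly on each one, along with how much variance each component explains. This is a minimal sketch, assuming vectorizer and lsa_model are the fitted objects from the snippets above.

import numpy as np

terms = vectorizer.get_feature_names_out()

# Proportion of variance in the document-term matrix captured by each component
print(lsa_model.explained_variance_ratio_)

# Top 5 words with the largest weights on each LSA component
for i, component in enumerate(lsa_model.components_):
    top_indices = np.argsort(component)[::-1][:5]
    top_terms = [terms[idx] for idx in top_indices]
    print(f"Component {i + 1}: {', '.join(top_terms)}")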

Step 5: Visualizing the Results

To visualize the results of LSA, we can plot the reduced dimensionality matrix and observe the clusters formed by the text data.

import matplotlib.pyplot as plt

# Create a scatter plot of the LSA matrix
plt.scatter(lsa_matrix[:, 0], lsa_matrix[:, 1])
plt.xlabel('LSA Component 1')
plt.ylabel('LSA Component 2')
plt.title('LSA Visualization')
plt.show()

In the code snippet above, we import the matplotlib library to create a scatter plot. We then use the scatter method to plot the LSA matrix with the x-axis representing the first LSA component and the y-axis representing the second LSA component. Finally, we add labels and a title to the plot before displaying it.
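
If it is hard to tell which document is which in the scatter plot, a small extension is to label each point with its document index. This is a sketch, assuming lsa_matrix is the reduced matrix produced above.

import matplotlib.pyplot as plt

plt.scatter(lsa_matrix[:, 0], lsa_matrix[:, 1])

# Label each point with the index of the document it represents
for i, (x, y) in enumerate(lsa_matrix):
    plt.annotate(str(i), (x, y))

plt.xlabel('LSA Component 1')
plt.ylabel('LSA Component 2')
plt.title('LSA Visualization with Document Indices')
plt.show()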

Conclusion

In this tutorial, we have implemented a simple version of Latent Semantic Analysis (LSA) using Scikit Learn. We preprocessed the text data, built the document-term matrix, applied LSA, and visualized the results. LSA can be a powerful tool for uncovering hidden patterns in text data and can be further customized and extended for more advanced natural language processing tasks.
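
One common extension, not covered in the steps above, is to weight the document-term matrix with TF-IDF before applying the SVD, which often downplays very frequent but uninformative words. A minimal sketch, assuming preprocessed_text is the cleaned document list from the earlier steps:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Build a TF-IDF weighted document-term matrix instead of raw counts
tfidf_vectorizer = TfidfVectorizer()
tfidf_dtm = tfidf_vectorizer.fit_transform(preprocessed_text)

# Apply LSA to the TF-IDF matrix exactly as before
lsa_model = TruncatedSVD(n_components=2)
lsa_matrix = lsa_model.fit_transform(tfidf_dtm)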

Comments
@sharme_
1 month ago

@4:01 – get_feature_names() got replaced with vectorizer.get_feature_names_out()

@WorldView660
1 month ago

@all Is it possible to build a model that predicts whether an essay is on topic or off topic, based on the essays and prompts we provide?

@akshitasood6455
1 month ago

How can I access the notebook given in the link above? It does not have the ".ipynb" extension.

@concert_music
1 month ago

Where is the variable body defined?

@sanjeevkumar-oc8wn
1 month ago

Informative.

@treyheaney9483
1 month ago

This is all very concise and clear, thank you!

@mitchellharding2423
1 month ago

For this exercise, you picked two topics (svd=TruncatedSVD(n_components=2)). Got it. For a more complex exercise, how would you come up with that number? What if I'm reviewing 500 documents? Should I pick 10 topics? 25? 50? Thanks for providing this content!