Step 3: Building the Document-Term Matrix
Now that we have preprocessed our text data, we can proceed to build the document-term matrix. This matrix represents the frequency of each word in each document, allowing us to perform mathematical operations on the text data.
We will use the CountVectorizer class from scikit-learn to build the document-term matrix. This class converts a collection of text documents into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Initialize the CountVectorizer object
vectorizer = CountVectorizer()
# Fit and transform the preprocessed text data to create the document-term matrix
dtm = vectorizer.fit_transform(preprocessed_text)
# Convert the sparse document-term matrix to a DataFrame for easier visualization
dtm_df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
In the code snippet above, we first import the CountVectorizer class from scikit-learn, along with pandas. We then initialize an instance of the class and call its fit_transform method to create the document-term matrix from the preprocessed text data. Finally, we convert the sparse matrix to a DataFrame for easier visualization.
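To see what this matrix actually looks like, here is a minimal sketch on a hypothetical three-document corpus (the documents and variable names below are illustrative, not part of our dataset):
# A toy corpus, used only to illustrate the shape of a document-term matrix
toy_docs = [
    "cats chase mice",
    "dogs chase cats",
    "mice eat cheese",
]
toy_vectorizer = CountVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_docs)
print(pd.DataFrame(toy_dtm.toarray(), columns=toy_vectorizer.get_feature_names_out()))
#    cats  chase  cheese  dogs  eat  mice
# 0     1      1       0     0    0     1
# 1     1      1       0     1    0     0
# 2     0      0       1     0    1     1
Each row is a document, each column is a vocabulary term, and each cell holds the raw count of that term in that document.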
Step 4: Applying Latent Semantic Analysis
With the document-term matrix ready, we can now apply Latent Semantic Analysis (LSA) to identify hidden patterns in the text data. LSA is a dimensionality reduction technique that helps find the underlying structure in the data by decomposing the document-term matrix using singular value decomposition (SVD).
from sklearn.decomposition import TruncatedSVD
# Initialize the TruncatedSVD object
lsa_model = TruncatedSVD(n_components=2)
# Fit the LSA model to the document-term matrix
lsa_matrix = lsa_model.fit_transform(dtm)
In the code snippet above, we import the TruncatedSVD class from scikit-learn and initialize an instance with n_components=2, indicating that we want to reduce the document-term matrix to two dimensions. We then fit the LSA model with the fit_transform method, which returns the documents' coordinates in the reduced space.
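Once the model is fitted, its attributes can help interpret the components. Here is a minimal sketch, assuming the vectorizer and lsa_model defined above, that prints the highest-weighted terms for each component and the fraction of variance each component retains:
import numpy as np
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(lsa_model.components_):
    # Take the five terms with the largest weights in this component
    top_terms = terms[np.argsort(component)[::-1][:5]]
    print(f"Component {i}: {', '.join(top_terms)}")
# Proportion of the original variance captured by each component
print(lsa_model.explained_variance_ratio_)
For larger corpora, explained_variance_ratio_ is also a practical guide when deciding how many components to keep.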
Step 5: Visualizing the Results
To visualize the results of LSA, we can plot the documents in the reduced two-dimensional space and look for clusters formed by the text data.
import matplotlib.pyplot as plt
# Create a scatter plot of the LSA matrix
plt.scatter(lsa_matrix[:, 0], lsa_matrix[:, 1])
plt.xlabel('LSA Component 1')
plt.ylabel('LSA Component 2')
plt.title('LSA Visualization')
plt.show()
In the code snippet above, we import the matplotlib plotting library and use the scatter function to plot the LSA matrix, with the x-axis representing the first LSA component and the y-axis representing the second. Finally, we add axis labels and a title before displaying the plot.
Conclusion
In this tutorial, we have implemented a minimal version of Latent Semantic Analysis (LSA) using scikit-learn. We preprocessed the text data, built the document-term matrix, applied LSA, and visualized the results. LSA can be a powerful tool for uncovering hidden patterns in text data and can be further customized and extended for more advanced natural language processing tasks.