Text classification is the process of categorizing textual data into different predefined categories or classes. It is a common task in natural language processing and machine learning, and it has various applications such as sentiment analysis, spam detection, and topic categorization.
In this tutorial, we will show you how to perform text classification using scikit-learn, a popular machine learning library in Python.
Step 1: Preprocess the text data
Before we can train a text classification model, we need to preprocess our text data. This involves several steps, including tokenization, stopword removal, and vectorization.
Tokenization is the process of dividing a text into individual words or tokens. We can use the CountVectorizer class from scikit-learn to tokenize our text data. By default, this class also converts all words to lowercase, drops punctuation, and keeps only tokens of two or more alphanumeric characters.
Stopword removal is the process of removing common words that do not carry much meaning, such as "the", "and", and "is". We can use the stop_words parameter of CountVectorizer (for example, stop_words='english') to remove stopwords.
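To see the effect of the stop_words parameter, this small sketch fits two vectorizers on the same toy sentences, one with and one without the built-in English stopword list, and compares their vocabularies. The sentences and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this illustration only.
docs = ["The cat sat on the mat.", "The dog chased the cat!"]

# One vectorizer keeps every token, the other filters English stopwords.
plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

# Tokens that the stopword list removed, and tokens that survived.
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
kept = sorted(filtered.vocabulary_)

print("removed:", removed)
print("kept:", kept)
```

Words such as "the" and "on" disappear from the vocabulary, while content words such as "cat" and "dog" remain.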
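Vectorization is the process of converting text data into numerical features that can be used by machine learning algorithms. We can use the TfidfVectorizer class from scikit-learn to convert our text data into TF-IDF (term frequency-inverse document frequency) vectors. TfidfVectorizer performs the same tokenization as CountVectorizer and accepts the same stop_words parameter, so it can handle all three preprocessing steps on its own.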
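The following sketch illustrates TF-IDF vectorization on a toy corpus. It also checks that, with default settings, TfidfVectorizer gives the same result as CountVectorizer followed by TfidfTransformer, and that each document vector is scaled to unit length. The corpus and variable names are invented for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, invented for this illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# One step: tokenize, count, and apply TF-IDF weighting.
tfidf = TfidfVectorizer().fit_transform(docs)

# Two steps: raw counts first, then TF-IDF weighting on top of them.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should produce the same matrix.
same = np.allclose(tfidf.toarray(), two_step.toarray())

# With the default L2 normalization, every row has length 1.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(same, row_norms)
```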
Step 2: Load the text data
For this tutorial, we will use the 20 Newsgroups dataset, which is a collection of newsgroup documents that are divided into 20 different categories. The fetch_20newsgroups function in scikit-learn’s datasets module downloads the dataset the first time it is called. To keep the example small, we load only four of the 20 categories.
from sklearn.datasets import fetch_20newsgroups
# Restrict the dataset to four of the 20 newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# Load the predefined train and test splits; random_state makes the shuffling reproducible
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
Step 3: Vectorize the text data
Next, we will use the TfidfVectorizer class to vectorize our text data.
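from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
# Learn the vocabulary and IDF weights from the training data only
X_train = vectorizer.fit_transform(newsgroups_train.data)
# Reuse the fitted vocabulary for the test data (no refitting)
X_test = vectorizer.transform(newsgroups_test.data)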
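The difference between fit_transform on the training data and transform on the test data can be sketched on a toy example. The documents and variable names below are hypothetical. The point is that the vocabulary is fixed by the training data, so words that appear only in the test data are ignored and both matrices have the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for this illustration only.
train_docs = ["graphics card renders images", "doctor treats patients"]
test_docs = ["doctor renders verdict"]  # "verdict" never appears in train_docs

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
toy_test = toy_vectorizer.transform(test_docs)        # reuses that vocabulary

# Same number of columns in both matrices.
print(toy_train.shape, toy_test.shape)

# The unseen word is not part of the vocabulary and contributes no feature.
unseen_in_vocab = "verdict" in toy_vectorizer.vocabulary_
print(unseen_in_vocab, toy_test.nnz)
```

Calling fit_transform on the test data instead would learn a different vocabulary, and the resulting features would no longer line up with what the model saw during training.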
Step 4: Train a text classification model
Now that we have processed our text data, we can train a text classification model. In this tutorial, we will use the LogisticRegression classifier from scikit-learn.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, newsgroups_train.target)
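Once a model is trained, new text has to pass through the same vectorizer before it can be classified. One way to keep the two together, sketched below on toy data, is scikit-learn's make_pipeline helper. The texts, labels, and variable names are invented for illustration, and using a pipeline is an optional variation, not something the steps above require.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this illustration only.
texts = [
    "the gpu renders 3d graphics",
    "rendering images with shaders and pixels",
    "the doctor prescribed medicine",
    "patients need medicine and treatment",
]
labels = ["graphics", "graphics", "medicine", "medicine"]

# The pipeline vectorizes the text and then feeds it to the classifier.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(texts, labels)

# Raw strings go in; the pipeline applies the fitted vectorizer automatically.
predicted = clf.predict(["new medicine for patients"])
print(predicted[0])
```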
Step 5: Evaluate the model
Finally, we can evaluate the performance of our text classification model on the test data.
from sklearn.metrics import accuracy_score
predictions = model.predict(X_test)
accuracy = accuracy_score(newsgroups_test.target, predictions)
print(f'Accuracy: {accuracy}')
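Accuracy is a single number, so it can hide which classes get confused with each other. As a hedged illustration, the sketch below computes accuracy, a confusion matrix, and a per-class report on a small set of made-up labels (y_true and y_pred are hypothetical, not results from the model above).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true and predicted labels for three classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Fraction of predictions that match the true label: 4 of 6 here.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Precision, recall, and F1 score for each class.
report = classification_report(y_true, y_pred, zero_division=0)

print(acc)
print(cm)
print(report)
```

The same three calls can be applied to newsgroups_test.target and predictions; passing target_names=newsgroups_test.target_names to classification_report labels the rows with the newsgroup names.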
Conclusion
In this tutorial, we have shown you how to perform text classification using scikit-learn: loading the data, converting it to TF-IDF vectors, training a logistic regression classifier, and measuring its accuracy on held-out test data. By following these steps, you can build a text classification model for your own text data. The accuracy you get will depend on your data, so experiment with different classifiers and hyperparameters to improve the performance of your model.
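One possible way to experiment with hyperparameters is a cross-validated grid search. The sketch below uses a toy dataset so that it runs quickly. The texts, the labels, and the grid values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy dataset, invented for this illustration: 0 = graphics, 1 = medicine.
texts = [
    "gpu renders 3d graphics", "shaders draw pixels on screen",
    "image rendering with polygons", "graphics card draws images",
    "3d models and textures", "pixels and image formats",
    "doctor prescribed medicine", "patients need treatment",
    "medicine for the disease", "the doctor treats patients",
    "symptoms of the disease", "treatment with new medicine",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: unigrams vs. unigrams plus bigrams, and three
# regularization strengths for the classifier.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over all 6 parameter combinations.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

To try a different classifier, replace the "clf" step (for example with MultinomialNB or LinearSVC from scikit-learn) and adjust the grid to that classifier's parameters.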
Download the dataset from here 👇🏻
https://www.kaggle.com/datasets/tanishqdublish/text-classification-documentation