Practical Scikit-learn Tutorial: Building a Bag-of-Words Model and Conducting Sentiment Analysis

In this tutorial, we will explore how to perform sentiment analysis using the bag-of-words model in Scikit-learn. Sentiment analysis is the process of determining the attitude or emotion expressed in a piece of text. The bag-of-words model is a simple and effective technique that represents each piece of text as a vector of word counts, which machine learning models can work with directly.
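
To make the idea concrete, here is a minimal, self-contained sketch of what a bag-of-words representation looks like for two toy sentences (the sentences are just illustrative examples):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy sentences to illustrate the bag-of-words representation
docs = ['the movie was great', 'the movie was terrible']

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# Each row is a document, each column a vocabulary word, each value a count
print(vectorizer.get_feature_names_out())  # ['great' 'movie' 'terrible' 'the' 'was']
print(bow.toarray())  # [[1 1 0 1 1], [0 1 1 1 1]]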

Before we dive into the implementation, let’s install the required libraries:

  1. Scikit-learn: A popular machine learning library for Python that provides tools for data transformation, modeling, and evaluation.
  2. nltk: A natural language processing library for Python that provides tools for text processing such as tokenization and stemming.

You can install these libraries using pip by running the following commands:

pip install scikit-learn
pip install nltk

Now that we have the required libraries installed, let’s start by importing the necessary modules in Python:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import nltk

# Download the tokenizer models used by nltk.word_tokenize
nltk.download('punkt')

Next, let’s load the dataset that we will be using for sentiment analysis. For this tutorial, we will be using the IMDB movie review dataset, which contains reviews labeled as positive or negative. The code below assumes the CSV has a review column with the text of each review and a label column with its sentiment.

data = pd.read_csv('imdb_reviews.csv')

Let’s take a look at the first few rows of the dataset to understand its structure:

print(data.head())
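
Before going further, it is also worth checking how the two classes are balanced. As a quick optional check (assuming the label column used later in this tutorial), you might run:

# Count how many reviews carry each sentiment label
print(data['label'].value_counts())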

Now, let’s preprocess the text data by tokenizing the reviews with NLTK and converting them into a bag-of-words representation using the CountVectorizer class from Scikit-learn. Note that CountVectorizer tokenizes text internally, so the explicit NLTK step is shown here mainly to illustrate tokenization; the tokens are joined back into strings before vectorization:

# Tokenize the reviews with NLTK
data['tokens'] = data['review'].apply(nltk.word_tokenize)

# Join the tokens back into strings and convert them into a bag-of-words matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['tokens'].apply(' '.join))

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, data['label'], test_size=0.2, random_state=42)
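
It can also help to sanity-check what the vectorizer has learned. The following optional lines (using the vectorizer and X created above) print the shape of the document-term matrix and a few vocabulary terms:

# One row per review, one column per vocabulary word
print(X.shape)

# The first few learned vocabulary terms (in alphabetical order)
print(vectorizer.get_feature_names_out()[:10])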

Now that we have preprocessed the data, let’s train a Naive Bayes classifier on the bag-of-words representation of the reviews and evaluate its performance:

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions on the test set
predictions = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
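
Accuracy alone can hide how the model behaves on each class. As an optional extra step using standard Scikit-learn metrics, you can also print per-class precision and recall along with a confusion matrix:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, predictions))

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, predictions))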

Finally, let’s test the model with a sample review and see how it predicts the sentiment:

review = 'This movie was amazing! I loved every minute of it.'
review_tokens = nltk.word_tokenize(review)
review_vector = vectorizer.transform([' '.join(review_tokens)])
prediction = classifier.predict(review_vector)

print(f'Review: {review}')
print(f'Prediction: {prediction[0]}')

That’s it! In this tutorial, we have learned how to perform sentiment analysis using the bag-of-words model in Scikit-learn. You can further experiment with different classifiers, hyperparameters, and preprocessing techniques to improve the performance of the sentiment analysis model.
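
As one possible starting point for that experimentation, here is a sketch that swaps in TF-IDF features and a logistic regression classifier, wrapped in a Pipeline so that vectorization and classification are fitted together. It reuses the same review and label columns assumed above; adjust them to match your own data:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the raw text this time; the pipeline handles vectorization internally
X_train_text, X_test_text, y_train, y_test = train_test_split(
    data['review'], data['label'], test_size=0.2, random_state=42
)

# TF-IDF weighting with logistic regression is a common alternative to counts with Naive Bayes
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train_text, y_train)
print(f'Accuracy: {accuracy_score(y_test, pipeline.predict(X_test_text))}')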