Chapter 5.2: Introduction to Data Handling in Machine Learning using Scikit-Learn


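How data is loaded, split, and scaled has a direct effect on how accurate and reliable a machine learning model turns out to be. In this tutorial, we will walk through the basics of data handling with scikit-learn, a popular machine learning library for Python: loading a dataset, splitting it, standardizing the features, training and evaluating a model, and saving it to disk.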

First, let’s start by importing the necessary libraries:

import numpy as np
import pandas as pd
from sklearn import datasets

Next, we will load a dataset to work with. For this tutorial, we will use the famous Iris dataset that comes included with scikit-learn:

iris = datasets.load_iris()
X = iris.data
y = iris.target

The X variable holds the feature matrix (one row per flower, one column per measurement), while the y variable holds the integer class labels. We can explore the dataset further by checking its shape and the unique values in the target labels:

print("Dataset shape:", X.shape)  # (150, 4): 150 samples, 4 features
print("Unique labels:", np.unique(y))  # [0 1 2]: three iris species
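Since pandas is already imported, one convenient way to inspect the data is to wrap it in a DataFrame. The following is only a sketch of that idea; the column name `species` is our own choice for illustration and is not part of the dataset.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# One column per feature, using the feature names shipped with the dataset.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# "species" is a hypothetical column name chosen for this example.
# Indexing target_names with the integer labels maps 0/1/2 to readable names.
df["species"] = iris.target_names[iris.target]

print(df.head())
print(df["species"].value_counts())  # number of samples per class
```

Readable column names and class names make it easier to tell which feature is which when plotting or filtering later on.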

Now that we have loaded the dataset, let’s prepare it for modeling. The first step is to split the dataset into training and testing sets, so that the model can later be evaluated on data it has not seen during training. We can do this using scikit-learn’s train_test_split function:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

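The test_size parameter specifies the fraction of the dataset held out for testing; with test_size=0.2, 30 of the 150 samples go to the test set and the remaining 120 are used for training. The random_state parameter seeds the random shuffle that precedes the split, so the same split is produced on every run.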
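A possible variation, not used in the rest of this tutorial, is to pass the optional stratify argument so that each class keeps the same proportion in both subsets. A minimal sketch, assuming we want a stratified split of the same size:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y asks train_test_split to preserve the class proportions of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Test samples per class:", np.bincount(y_test))
```

Stratification matters most for small or imbalanced datasets, where a purely random split can leave a class underrepresented in the test set.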

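After splitting the dataset, we can move on to standardizing the features. Standardization is a common preprocessing step that rescales each feature to have a mean of 0 and a standard deviation of 1. We can achieve this using scikit-learn’s StandardScaler. The scaler is fitted on the training set only, and the test set is transformed with the statistics learned from the training set, so that no information from the test set leaks into preprocessing: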

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
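To see what the scaler actually does, we can look at the statistics it learned and check the scaled arrays. This is a small self-contained sketch that repeats the split from above so it can be run on its own:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print("Learned means:", scaler.mean_)
print("Learned stds: ", scaler.scale_)

# The training features are centered and scaled exactly (up to rounding error).
print("Train mean after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Train std after scaling: ", X_train_scaled.std(axis=0).round(6))

# The test features are only approximately centered, because they were
# scaled with the training statistics rather than their own.
print("Test mean after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The test set means are not expected to be exactly 0; that is a consequence of reusing the training statistics, not a bug.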

Now that we have preprocessed the data, we can move on to fitting a machine learning model. For this tutorial, let’s use a simple logistic regression model:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
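Once fitted, the model exposes a few attributes that are worth a quick look. The sketch below repeats the earlier steps so it runs on its own, then prints the learned class labels, the shape of the weight matrix, and the predicted class probabilities for the first three test samples:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

print("Classes:", model.classes_)             # labels seen during training
print("Coefficient shape:", model.coef_.shape)  # (number of classes, number of features)

# predict_proba returns one probability per class for each sample.
proba = model.predict_proba(X_test_scaled[:3])
print(proba.round(3))
```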

Once the model is trained, we can make predictions on the test set and evaluate its performance:

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)  # fraction of test samples classified correctly
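Accuracy is a single number and hides which classes get confused with each other. As a complement, and only as a sketch of one option, scikit-learn’s confusion_matrix and classification_report give a per-class view of the same predictions:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 score for each class.
print(classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=list(iris.target_names)
))

# The diagonal of the confusion matrix counts the correct predictions,
# so its share of the total equals the accuracy.
accuracy_from_cm = np.trace(cm) / cm.sum()
print("Accuracy from confusion matrix:", accuracy_from_cm)
```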

Finally, we can save and load the trained model using the joblib library:

import joblib

joblib.dump(model, 'model.pkl')
loaded_model = joblib.load('model.pkl')
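Note that the file saved above contains only the classifier, not the fitted scaler, and new data has to be scaled with the same statistics before prediction. One way to keep the two together, shown here as a sketch rather than the only approach, is to wrap them in a scikit-learn Pipeline and save that instead. The file name iris_pipeline.joblib and the temporary directory are choices made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline scales the data and then applies the classifier,
# so it accepts raw (unscaled) features in fit and predict.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# Save to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
same = np.array_equal(pipeline.predict(X_test), loaded_pipeline.predict(X_test))
print("Predictions match after reload:", same)
```

With this setup a single file holds everything needed to go from raw features to predictions.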

In this tutorial, we covered the basics of data handling in machine learning with scikit-learn: loading a dataset, splitting it into training and testing sets, standardizing the features, training and evaluating a model, and saving the trained model to disk. These steps form the backbone of most scikit-learn workflows.

