In machine learning, data handling plays a crucial role in building accurate and reliable models. In this tutorial, we will explore the basics of data handling using scikit-learn, a popular machine learning library in Python.
First, we import the necessary libraries:
import numpy as np
import pandas as pd
from sklearn import datasets
Next, we will load a dataset to work with. For this tutorial, we will use the famous Iris dataset that comes included with scikit-learn:
iris = datasets.load_iris()
X = iris.data
y = iris.target
The X variable contains the features of the dataset, while the y variable contains the target labels. We can further explore the dataset by checking its shape and the unique values in the target labels:
print("Dataset shape: ", X.shape)
print("Unique labels: ", np.unique(y))
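The arrays above carry no column names, so it can help to view the same data as a pandas DataFrame. The following is a minimal sketch of one way to do that; the names df and species are our own choices for illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Put the features in a DataFrame so each column carries its feature name.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Translate the integer labels (0, 1, 2) into species names.
# "species" is a column name chosen for this sketch.
df["species"] = iris.target_names[iris.target]

print(df.head())
print(df["species"].value_counts())
```

The value counts show how many samples belong to each class, which is a quick check for class imbalance before splitting the data.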
Now that we have loaded the dataset, let’s move on to data preprocessing. One common preprocessing step is to split the dataset into training and testing sets. We can do this using scikit-learn’s train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The test_size parameter specifies the fraction of the dataset that should be used for testing (here 0.2, or 20% of the samples). The random_state parameter ensures reproducibility by setting a seed for the random number generator that shuffles the data before splitting.
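As an optional variation, train_test_split also accepts a stratify argument that keeps the class proportions the same in both subsets. The sketch below is not part of the split used in the rest of this tutorial; it only shows the effect of the parameters on the resulting shapes and class counts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y is an addition for this sketch: it preserves the class
# proportions of y in both the training and the testing subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Test class counts:", np.bincount(y_test))
```

With 150 samples and test_size=0.2, 30 samples are held out for testing. Because Iris has 50 samples per class, the stratified split places 10 samples of each class in the test set.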
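After splitting the dataset, we can move on to standardizing the features. Standardization is a common preprocessing step that scales the features to have a mean of 0 and a standard deviation of 1. The scaler should be fitted on the training set only and then applied unchanged to the test set, so that no statistics from the test data leak into training. We can achieve this using scikit-learn’s StandardScaler: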
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
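To see what the scaler actually did, we can check the column statistics after the transformation. This is a small sanity check rather than a required step; it repeats the split from above so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Training columns end up with mean 0 and standard deviation 1.
print("Train means:", X_train_scaled.mean(axis=0).round(3))
print("Train stds: ", X_train_scaled.std(axis=0).round(3))

# Test columns are only approximately standardized, because they were
# scaled with the training statistics rather than their own.
print("Test means: ", X_test_scaled.mean(axis=0).round(3))

# transform is equivalent to (x - mean_) / scale_ with the fitted attributes.
manual = (X_test - scaler.mean_) / scaler.scale_
print("Matches manual formula:", np.allclose(manual, X_test_scaled))
```

The test-set means are expected to deviate slightly from 0. That is the intended behavior, since the test data must be treated as unseen.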
Now that we have preprocessed the data, we can move on to fitting a machine learning model. For this tutorial, let’s use a simple logistic regression model:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
Once the model is trained, we can make predictions on the test set and evaluate its performance:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)
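Accuracy is a single number and hides which classes get confused with each other. As a possible next step, the sketch below repeats the steps above and prints a confusion matrix using scikit-learn's confusion_matrix; the variable name cm is our own.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Trace / total:", np.trace(cm) / cm.sum())
```

Off-diagonal entries show which pairs of classes the model mixes up. Keep in mind that the test set holds only 30 samples, so a single misclassification changes the accuracy by more than three percentage points.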
Finally, we can save and load the trained model using the joblib library:
import joblib
joblib.dump(model, 'model.pkl')
loaded_model = joblib.load('model.pkl')
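Saving only the model leaves out the fitted scaler, which is needed to preprocess new data in the same way. One option, shown here as a sketch rather than the only approach, is to combine both steps with make_pipeline and save the pipeline as a single object. The file name iris_pipeline.joblib and the temporary directory are assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline scales the data and then fits the model, so the raw
# (unscaled) features can be passed to fit and predict.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# A temporary directory keeps this sketch from leaving files behind;
# the file name is hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_pipeline.joblib")
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The loaded pipeline should reproduce the original predictions.
same = np.array_equal(pipeline.predict(X_test), loaded_pipeline.predict(X_test))
print("Predictions identical after reload:", same)
```

Files written by joblib use Python's pickle format, so they should only be loaded from sources you trust and, ideally, with the same scikit-learn version that created them.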
In this tutorial, we covered the basics of data handling in machine learning with scikit-learn: loading a dataset, splitting and standardizing it, training and evaluating a model, and saving the trained model to disk. These steps form a starting point for building machine learning models with scikit-learn.
Hi, I am trying to create a scatter plot of only the two features that give the best result and disregard the rest. I am struggling to understand which features those are and how I should include them in my plots. Could you please give me a hint on that?
Hi Sebastian. Let me first thank you for your tremendous efforts in sharing your knowledge with the whole world. It is really appreciated. I think you forgot to declare y in y[train_ind], which is needed to run the code. It is clearly the output label data, but it is not explicitly defined in the code. Regards,
Hi Sebastian, for the coding part, I think you should do some live coding. People who are brand new or not yet familiar with scikit-learn could get the hang of how each line works and what it outputs. Your teaching content is thorough though ☺.
Subtitles plz