Dividing Data in Python Using scikit-learn and train_test_split()

Posted by Alfalfa, August 14, 2024

Splitting a dataset is a routine step in machine learning projects. Holding out part of the data as a testing set lets us measure how a model performs on samples it never saw during training, which gives a more honest estimate than scoring it on the training data. In this tutorial, we will learn how to split datasets using the train_test_split() function from the scikit-learn library in Python.

Install scikit-learn:
Before we start splitting datasets, we need to make sure that we have scikit-learn installed. If you don’t have it installed already, you can install it using pip:

pip install scikit-learn
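
To confirm the installation worked, one quick check is to import the package and print its version. This is only a minimal sketch; the version string you see depends on your environment.

```python
import sklearn

# If this import succeeds, scikit-learn is installed in the active environment
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```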

Import the necessary libraries:
We need to import the required libraries before we can split the dataset. We will be using the train_test_split() function from the scikit-learn library to split the dataset into training and testing sets.

from sklearn.model_selection import train_test_split

Load the dataset:
To illustrate how to split datasets, let’s load a sample dataset using scikit-learn. In this tutorial, we will use the famous Iris dataset. The Iris dataset contains 150 samples of iris flowers, each with four features (sepal length, sepal width, petal length, petal width) and a target variable (species).

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
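
Before splitting, it can help to look at what was loaded. The sketch below prints the array shapes along with the feature and class names that load_iris() returns, so we can match them against the description above.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

# X holds one row per flower and one column per feature
print("Features shape:", X.shape)
# y holds one integer class label per flower
print("Target shape:", y.shape)
print("Feature names:", iris.feature_names)
print("Class names:", list(iris.target_names))
```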

Split the dataset:
Now that we have loaded the dataset, we can split it into training and testing sets using the train_test_split() function. The function takes the input data X, target data y, and the test size as parameters. The test size specifies the proportion of the dataset to include in the testing set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

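In this example, we are splitting the dataset into 80% training data and 20% testing data. By default the samples are shuffled before they are split; the random_state parameter fixes the seed for that shuffle, so the same split is produced every time the code runs.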
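
For classification problems, train_test_split() also accepts a stratify parameter that keeps the class proportions the same in both splits. The sketch below is an optional variation on the call above, not part of the original walkthrough; the `_s` suffixed variable names are made up here so they do not overwrite the split used in the rest of the tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for the same class proportions in the train and test splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each split
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Because Iris has 50 samples of each species, the stratified split places 40 of each class in the training set and 10 of each in the testing set.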

Check the shape of the splits:
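After splitting the dataset, it is good practice to check the shape of the training and testing sets to verify that the split worked as intended. With 150 samples and test_size=0.2, we expect 120 training samples and 30 testing samples, each with four features.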

print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
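
The same check can be written as assertions, which fail loudly if the split ever stops matching what we expect. This is a small self-contained sketch that repeats the load and split steps so it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 20% of 150 samples is 30 test rows, leaving 120 training rows
assert X_train.shape == (120, 4) and y_train.shape == (120,)
assert X_test.shape == (30, 4) and y_test.shape == (30,)

# Every sample should land in exactly one of the two splits
assert len(X_train) + len(X_test) == len(X)
print("Split sizes look correct")
```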

Train a machine learning model:
Now that we have split the dataset into training and testing sets, we can train a machine learning model on the training data and evaluate its performance on the testing data. For example, we can train a Support Vector Machine (SVM) classifier on the Iris dataset.

from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train, y_train)
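
One reason to split before doing anything else is that any preprocessing should be learned from the training data only. As an optional illustration that goes slightly beyond the steps above, the sketch below wraps a StandardScaler and an SVC in a pipeline; `scaled_clf` and `scaled_score` are names chosen here for the example, and adding scaling is our assumption, not something the tutorial requires.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() computes the scaler's mean and variance from X_train only,
# then trains the SVC on the scaled training data
scaled_clf = make_pipeline(StandardScaler(), SVC())
scaled_clf.fit(X_train, y_train)

# score() reuses the training-set scaling on X_test before predicting
scaled_score = scaled_clf.score(X_test, y_test)
print("Test accuracy with scaling:", scaled_score)
```

Keeping the scaler inside the pipeline means the testing set never influences the statistics used for scaling.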

Evaluate the model:
After training the model, we can evaluate its performance on the testing data. We can use metrics such as accuracy, precision, recall, or F1 score to evaluate the model’s performance.

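from sklearn.metrics import accuracy_score

# Predict labels for the held-out testing samples
y_pred = clf.predict(X_test)

# Fraction of testing samples whose predicted label matches the true label
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)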
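
Accuracy is only one of the metrics mentioned above. As a sketch of how the others could be obtained, classification_report() prints precision, recall, and F1 score per class, and f1_score() returns a single averaged value. The choice of macro averaging here is an assumption made for the example; other averaging modes may suit your data better.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = SVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, and F1 score on the testing set
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)

# Macro averaging gives each class equal weight regardless of its size
macro_f1 = f1_score(y_test, y_pred, average="macro")
print("Macro F1:", macro_f1)
```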

That’s it! In this tutorial, we learned how to split datasets in Python using the train_test_split() function from the scikit-learn library. Splitting datasets is an essential step in machine learning projects to ensure the model’s performance is evaluated accurately. Now you can apply this knowledge to your own machine learning projects and split datasets for training and testing purposes.


5 Comments


@hanan8114

1 month ago

Week 2, day 4

@IvanStar96

1 month ago

How can I split time series data? (e.g. first 8 years observed and last 2 ones)

@osielvivar2552

1 month ago

Good video, explanation + example , nice

@jsudp3

1 month ago

Thank you for such an excellent tutorial!

@AtlantaTerry

1 month ago

Thank you for taking the time to create this tutorial.
