Dividing Data in Python Using scikit-learn and train_test_split()


Splitting a dataset is a common task in machine learning projects. A model should be evaluated on data it did not see during training, so the data is split into a training set and a testing set. In this tutorial, we will learn how to do this with the train_test_split() function from the scikit-learn library in Python.

  1. Install scikit-learn:
    Before we start splitting datasets, we need to make sure that we have scikit-learn installed. If you don’t have it installed already, you can install it using pip:
pip install scikit-learn
  2. Import the necessary libraries:
    Before we can split the dataset, we need to import train_test_split() from scikit-learn’s model_selection module. This is the function that splits the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
  3. Load the dataset:
    To illustrate how to split datasets, let’s load a sample dataset using scikit-learn. In this tutorial, we will use the famous Iris dataset. The Iris dataset contains 150 samples of iris flowers, each with four features (sepal length, sepal width, petal length, petal width) and a target variable (species).
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
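As a quick sanity check before splitting, the loaded data can be inspected to confirm it matches the description above. This is only a minimal sketch; the attribute names feature_names and target_names come from the object returned by load_iris().

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# X holds one row per flower and one column per feature
print("Feature matrix shape:", X.shape)
# y holds one integer class label per flower
print("Target vector shape:", y.shape)
print("Features:", iris.feature_names)
print("Species:", list(iris.target_names))
```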
  4. Split the dataset:
    Now that the dataset is loaded, we can split it into training and testing sets with train_test_split(). The function takes the input data X, the target data y, and a test_size parameter, and returns the four arrays X_train, X_test, y_train, and y_test in that order. When test_size is a float between 0 and 1, it specifies the proportion of the dataset to include in the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

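In this example, test_size=0.2 splits the dataset into 80% training data and 20% testing data, which for the 150 Iris samples means 120 training samples and 30 testing samples. The random_state parameter seeds the shuffling applied before the split, so passing the same value reproduces the same split on every run.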
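To make the effect of random_state concrete, here is a small sketch that runs the same split twice with the same seed and once with a different seed. The variable names with _a, _b, and _c suffixes are just illustrative and not part of the tutorial's main code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A third split with a different seed
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same seed: the testing sets are identical, row for row
same_seed_equal = np.array_equal(X_test_a, X_test_b)
# Different seed: the rows are shuffled differently, so the sets will almost surely differ
different_seed_equal = np.array_equal(X_test_a, X_test_c)

print("Same seed gives the same split:", same_seed_equal)
print("Different seed gives the same split:", different_seed_equal)
```

If random_state is left out, each run shuffles differently, which makes results harder to compare between runs.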

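  5. Check the shape of the splits:
    After splitting the dataset, it is good practice to check the shapes of the training and testing sets to verify the split. With the settings above, the training set should have shapes (120, 4) and (120,), and the testing set (30, 4) and (30,).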
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
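Besides the shapes, it can be useful to check how many samples of each class ended up in each set. The sketch below goes slightly beyond the tutorial's code: it passes the optional stratify parameter of train_test_split(), which was not used above, so that each species keeps the same proportion in both sets. The counting with np.bincount is just one way to do the check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count the samples per class (labels are the integers 0, 1, 2)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Training samples per class:", train_counts)
print("Testing samples per class:", test_counts)
```

Iris has 50 samples per species, so a stratified 80/20 split puts 40 of each species in the training set and 10 of each in the testing set. Without stratify, the counts per class can vary from split to split.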
  6. Train a machine learning model:
    Now that we have split the dataset into training and testing sets, we can train a machine learning model on the training data and evaluate its performance on the testing data. For example, we can train a Support Vector Machine (SVM) classifier on the Iris dataset.
from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train, y_train)
  7. Evaluate the model:
    After training the model, we can evaluate its performance on the testing data. We can use metrics such as accuracy, precision, recall, or F1 score to evaluate the model’s performance.
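from sklearn.metrics import accuracy_score

# Predict labels for the held-out testing data
y_pred = clf.predict(X_test)

# Fraction of testing samples that were classified correctly
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)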
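The other metrics mentioned in this step can be computed in much the same way. Here is a minimal, self-contained sketch using precision_score, recall_score, and f1_score from sklearn.metrics. Because Iris has three classes, these functions need an averaging strategy; average="macro" (the unweighted mean over classes) is an assumption made here for illustration, and other choices such as "weighted" are equally valid.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same model as in the tutorial: an SVM classifier with default settings
clf = SVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Macro averaging: compute the metric per class, then take the plain mean
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
f1 = f1_score(y_test, y_pred, average="macro")

print("Precision (macro):", precision)
print("Recall (macro):", recall)
print("F1 score (macro):", f1)
```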

That’s it! In this tutorial, we learned how to split a dataset in Python using the train_test_split() function from scikit-learn, then trained a classifier on the training set and evaluated it on the testing set. Holding out a testing set is an essential step for getting an honest estimate of a model’s performance, and the same approach applies to your own machine learning projects.
