Scikit-Learn is a popular machine learning library in Python that provides a wide range of tools for data preprocessing, model building, and evaluation. One of the key features of Scikit-Learn is the Pipeline class, which enables users to streamline the process of data preprocessing and model training by chaining multiple preprocessing steps together in a single pipeline. In this tutorial, we will cover how to use Scikit-Learn pipelines for data preprocessing in Python.
Getting Started
To follow along with this tutorial, make sure you have Scikit-Learn installed on your machine. You can install it using pip by running the following command:
pip install scikit-learn
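If you want to confirm the installation (and see which version you are running), a quick check like the following should work:
python -c "import sklearn; print(sklearn.__version__)"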
Understanding Pipelines
In Scikit-Learn, a pipeline is a sequence of steps that are applied to the data in a specific order. A typical pipeline consists of two kinds of components:
- Data preprocessing (transformer) steps, such as scaling, encoding categorical variables, and imputing missing values.
- A final machine learning model (estimator).
You then train and evaluate the whole pipeline as a single object.
By using a pipeline, you can automate the process of preparing your data for modeling and avoid data leakage: when you call fit, the preprocessing steps are fitted on the training data only, and those fitted transformations are then reused when you transform or predict on the test data.
Creating a Pipeline
To create a pipeline in Scikit-Learn, you can use the Pipeline class from the sklearn.pipeline module. Let’s walk through an example of how to create a simple pipeline that includes scaling the data and fitting a logistic regression model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Define the steps in the pipeline
steps = [('scaler', StandardScaler()), ('logistic', LogisticRegression())]
# Create the pipeline
pipeline = Pipeline(steps)
In this example, we define two steps in the pipeline: scaling the data using StandardScaler and fitting a logistic regression model. The steps are passed as a list of tuples to the Pipeline constructor, where the first element of each tuple is a string identifier for the step and the second element is the actual preprocessing or modeling step.
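If you prefer not to name the steps yourself, Scikit-Learn also provides a make_pipeline helper in sklearn.pipeline that builds the same kind of object but generates the step identifiers from the class names. A minimal sketch of an equivalent pipeline:
from sklearn.pipeline import make_pipeline
# Step names are derived automatically, e.g. 'standardscaler' and 'logisticregression'
auto_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
print(auto_pipeline.named_steps)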
Using a Pipeline
Once you have defined your pipeline, you can use it to preprocess your data and train your model in a single step. Here’s an example of how to fit the pipeline to some data and make predictions:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the Iris dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
# Make predictions on the test data
predictions = pipeline.predict(X_test)
In this example, we load the Iris dataset, split it into training and testing sets, fit the pipeline to the training data, and make predictions on the test data. The pipeline automatically applies the preprocessing steps to the input data before passing it to the model for training and prediction.
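Because the fitted pipeline behaves like any other estimator, you can also evaluate it directly; for classifiers, the score method returns the mean accuracy on the given data:
# Evaluate the pipeline on the test set (preprocessing is applied automatically)
accuracy = pipeline.score(X_test, y_test)
print('Test accuracy:', accuracy)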
Adding More Steps to the Pipeline
You can easily add more preprocessing steps to the pipeline by extending the list of tuples passed to the Pipeline constructor. For example, you can add steps for encoding categorical variables and imputing missing values:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Define the steps in the pipeline
steps = [
('imputer', SimpleImputer(strategy='mean')),
('encoder', OneHotEncoder()),
('scaler', StandardScaler()),
('logistic', LogisticRegression())
]
# Create the pipeline
pipeline = Pipeline(steps)
In this updated pipeline, we first impute missing values with the mean, then encode categorical variables using OneHotEncoder, scale the data with StandardScaler, and fit a logistic regression model. Note that a pipeline applies every step to every column, so chaining the steps this way only makes sense if the same transformations suit all of your features; on datasets that mix numeric and categorical columns, you would typically route each group through its own preprocessing, as sketched below.
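The sketch below assumes hypothetical column names (numeric_cols and categorical_cols) from your own DataFrame; it wraps a small pipeline for each column group in a ColumnTransformer and uses that combined preprocessor as the first step of the overall pipeline:
from sklearn.compose import ColumnTransformer
# Hypothetical column lists for a DataFrame with mixed types
numeric_cols = ['age', 'income']
categorical_cols = ['city', 'gender']
# Numeric columns: impute with the mean, then scale
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# Categorical columns: impute with the most frequent value, then one-hot encode
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Route each column group through its own preprocessing
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)
])
# The combined preprocessor becomes a single step in the overall pipeline
mixed_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('logistic', LogisticRegression())
])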
Cross-Validation with Pipelines
You can also use pipelines in conjunction with cross-validation to evaluate the performance of your model on different subsets of the data. Here’s an example of how to perform 5-fold cross-validation with the simple scaler-and-logistic-regression pipeline (the Iris features are all numeric and have no missing values, so the imputation and encoding steps are not needed):
from sklearn.model_selection import cross_val_score
# Re-create the simple pipeline (Iris has only numeric features and no missing values)
pipeline = Pipeline([('scaler', StandardScaler()), ('logistic', LogisticRegression())])
# Perform 5-fold cross-validation
scores = cross_val_score(pipeline, data.data, data.target, cv=5)
print('Cross-validation scores:', scores)
In this example, we use the cross_val_score function to evaluate the pipeline on the entire Iris dataset using 5-fold cross-validation. The function returns an array of scores, one per fold, which you can use to assess the model’s performance. Because the whole pipeline is refit inside each fold, the preprocessing is learned only from that fold’s training data, so the evaluation remains free of leakage.
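The individual fold scores are often summarized by their mean and standard deviation, which gives a quick sense of both average performance and its variability:
# Summarize the five fold scores
print('Mean accuracy:', scores.mean())
print('Standard deviation:', scores.std())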
Conclusion
In this tutorial, we have covered the basics of using Scikit-Learn pipelines for data preprocessing in Python. Pipelines are a powerful tool for streamlining the process of preparing your data for machine learning models and can help you avoid common pitfalls such as data leakage and code duplication. By chaining together multiple preprocessing steps in a single pipeline, you can create a robust and efficient workflow for building and evaluating machine learning models. I hope this tutorial has provided you with a solid foundation for using pipelines in your own machine learning projects.