L05: Introduction to Scikit-learn Transformer API for Machine Learning

In this tutorial, we will explore the Scikit-learn Transformer API and learn how to create custom transformers and use them in a machine learning pipeline. The Transformer API is an essential part of Scikit-learn that lets us preprocess and transform data before feeding it into a machine learning model.

The Transformer API in Scikit-learn is based on the concept of transformers, which are classes that implement the fit and transform methods. The fit method is used to learn the parameters of the transformer from the training data, while the transform method is used to apply the learned transformation to new data.
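For example, Scikit-learn's built-in StandardScaler follows exactly this protocol (the toy array below is made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(X)                    # learn the mean and standard deviation of each column
X_scaled = scaler.transform(X)   # apply the learned standardization to the data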

  1. Creating a Custom Transformer:
    To create a custom transformer in Scikit-learn, we first need to create a class that inherits from the base classes BaseEstimator and TransformerMixin. These base classes provide the necessary methods for fitting and transforming data.

Here is an example of a custom transformer that scales the input data using the MinMaxScaler from Scikit-learn:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the per-feature minimum and maximum from the training data.
        # Following Scikit-learn convention, attributes learned during fit
        # end with an underscore, and fit returns self to allow chaining.
        self.scaler_ = MinMaxScaler().fit(X)
        return self

    def transform(self, X):
        # Apply the scaling learned during fit to new data
        return self.scaler_.transform(X)

In this example, the CustomScaler class implements the fit and transform methods by wrapping the functionality of the MinMaxScaler class. Inheriting from TransformerMixin also gives us a fit_transform method for free, and BaseEstimator provides get_params and set_params, which Scikit-learn needs for tasks like grid search.
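We can try the transformer on a small made-up array (the numbers below are purely illustrative):

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = CustomScaler()
X_scaled = scaler.fit_transform(X)  # fit_transform comes from TransformerMixin
print(X_scaled)                     # every column is now scaled to the [0, 1] range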

  2. Using a Custom Transformer in a Pipeline:
    Once we have created a custom transformer, we can use it in a machine learning pipeline in Scikit-learn. A pipeline is a sequence of transformers followed by an estimator, and it allows us to automate the preprocessing and modeling steps in a machine learning workflow.

Here is an example of using the custom scaler in a pipeline with a linear regression model:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('scaler', CustomScaler()),        # step 1: scale the input features
    ('regressor', LinearRegression())  # step 2: fit the model on the scaled data
])

In this example, the pipeline consists of two steps: first, the data is scaled using the custom scaler, and then it is passed to the linear regression model for training.
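The names given to each step also let us retrieve the individual components later, which is handy for inspection:

# Access individual steps by the names used when building the pipeline
print(pipeline.named_steps['scaler'])
print(pipeline.named_steps['regressor'])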

  3. Bringing it All Together:
    To bring everything together, we can now fit the pipeline to the training data and make predictions on new data. Here is an example of fitting the pipeline to the training data and evaluating its performance on a test set:
X_train = ... # training features
y_train = ... # training labels
X_test = ... # test features
y_test = ... # test labels

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

# Evaluate the performance of the pipeline
score = pipeline.score(X_test, y_test)
print(f'Pipeline score: {score}')

In this example, we first fit the pipeline to the training data using the fit method. Then, we make predictions on the test data using the predict method. Finally, we evaluate the performance of the pipeline using the score method, which for a linear regression model returns the coefficient of determination (R²) on the test set.
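For readers who want a runnable end-to-end version, here is the same workflow sketched with synthetic regression data (the dataset and its parameters are made up purely for illustration):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate a synthetic regression problem (illustrative values only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print(f'Pipeline R^2 score: {pipeline.score(X_test, y_test):.3f}')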

In conclusion, the Scikit-learn Transformer API is a powerful tool for preprocessing and transforming data in machine learning pipelines. By creating custom transformers and using them in pipelines, we can automate the preprocessing steps and keep our workflows clean, reproducible, and less error-prone. I hope this tutorial has given you a solid understanding of the Transformer API in Scikit-learn and how to leverage it in your own machine learning workflows.
