In this tutorial, we will delve into the Scikit-learn Transformer API to understand how to create custom transformers and use them in a machine learning pipeline. The Transformer API is an essential part of Scikit-learn that allows us to preprocess and transform data before feeding it into a machine learning model.
The Transformer API in Scikit-learn is based on the concept of transformers, which are classes that implement the fit and transform methods. The fit method is used to learn the parameters of the transformer from the training data, while the transform method is used to apply the learned transformation to new data.
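To make this contract concrete, here is a minimal sketch using Scikit-learn's built-in MinMaxScaler, with small made-up arrays purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [9.0]])  # illustrative training data
X_new = np.array([[3.0]])                  # new data to transform

scaler = MinMaxScaler()
scaler.fit(X_train)             # learns the per-feature min (1.0) and max (9.0)
print(scaler.transform(X_new))  # applies the learned scaling: (3 - 1) / (9 - 1) = [[0.25]]

Note that fit only looks at the training data; transform can then be applied to any data using the parameters learned there.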
- Creating a Custom Transformer:
To create a custom transformer in Scikit-learn, we first need to create a class that inherits from the base classes BaseEstimator and TransformerMixin. BaseEstimator supplies the get_params and set_params methods used throughout Scikit-learn, while TransformerMixin automatically provides a fit_transform method built from the fit and transform methods we write ourselves.
Here is an example of a custom transformer that scales the input data using the MinMaxScaler from Scikit-learn:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Wrap a MinMaxScaler instance that does the actual work
        self.scaler = MinMaxScaler()

    def fit(self, X, y=None):
        # Learn the per-feature minimum and maximum from the training data
        self.scaler.fit(X)
        # Returning self allows method chaining, e.g. fit(...).transform(...)
        return self

    def transform(self, X):
        # Apply the scaling learned in fit to new data
        return self.scaler.transform(X)
In this example, the CustomScaler class implements the fit and transform methods by wrapping the functionality of the MinMaxScaler class.
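Because CustomScaler inherits from TransformerMixin, it also gets fit_transform for free. A quick usage sketch, with a small made-up array for illustration:

import numpy as np

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])  # illustrative data

scaler = CustomScaler()
X_scaled = scaler.fit_transform(X)  # fit_transform is provided by TransformerMixin
print(X_scaled)                     # each column is now scaled to the [0, 1] range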
- Using a Custom Transformer in a Pipeline:
Once we have created a custom transformer, we can use it in a machine learning pipeline in Scikit-learn. A pipeline is a sequence of transformers followed by an estimator, and it allows us to automate the preprocessing and modeling steps in a machine learning workflow.
Here is an example of using the custom scaler in a pipeline with a linear regression model:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Each step is a (name, estimator) pair; all steps but the last must be transformers
pipeline = Pipeline([
    ('scaler', CustomScaler()),
    ('regressor', LinearRegression())
])
In this example, the pipeline consists of two steps: first, the data is scaled using the custom scaler, and then it is passed to the linear regression model for training.
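Under the hood, pipeline.fit calls fit_transform on each transformer in order and fits the final estimator on the result, while pipeline.predict calls transform on each transformer before predicting. Here is a simplified sketch of the equivalent manual code, where X, y, and X_new stand in for training data and new data:

# Roughly what pipeline.fit(X, y) does, simplified
scaler = CustomScaler()
X_scaled = scaler.fit_transform(X, y)            # step 1: fit and apply the scaler
regressor = LinearRegression().fit(X_scaled, y)  # step 2: fit the final estimator

# Roughly what pipeline.predict(X_new) does
predictions = regressor.predict(scaler.transform(X_new))

The advantage of the Pipeline object is that this chaining happens automatically and consistently, so the scaler is only ever fitted on training data.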
- Bringing it All Together:
To bring everything together, we can now fit the pipeline to the training data and make predictions on new data. Here is an example of fitting the pipeline to the training data and evaluating its performance on a test set:
X_train = ... # training features
y_train = ... # training labels
X_test = ... # test features
y_test = ... # test labels
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
# Evaluate the performance of the pipeline
score = pipeline.score(X_test, y_test)
print(f'Pipeline score: {score}')
In this example, we first fit the pipeline to the training data using the fit method. Then, we make predictions on the test data using the predict method. Finally, we evaluate the performance of the pipeline using the score method, which delegates to the final estimator and, for LinearRegression, returns the R² of the predictions.
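For a fully runnable version of this skeleton, here is the same workflow on synthetic data, reusing the pipeline defined above; the dataset generated with make_regression is purely for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)           # scales the features, then fits the regression
predictions = pipeline.predict(X_test)   # scales the test features, then predicts
score = pipeline.score(X_test, y_test)   # R² of the final estimator
print(f'Pipeline score: {score:.3f}')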
In conclusion, the Scikit-learn Transformer API is a powerful tool for preprocessing and transforming data in machine learning pipelines. By creating custom transformers and using them in pipelines, we can automate the preprocessing steps, keep our workflows reproducible, and avoid leaking information from the test data into preprocessing. I hope this tutorial has provided you with a solid understanding of the Transformer API in Scikit-learn and how to leverage it in your machine learning workflows.