Python Feature Scaling in scikit-learn (Normalization vs Standardization)
Feature scaling is an important step in the data preprocessing phase of machine learning: it brings the independent variables (features) of a dataset onto comparable ranges through normalization or standardization.
In this article, we will discuss two feature scaling techniques, normalization and standardization, using the popular Python library scikit-learn.
Normalization
Normalization rescales each feature to the range [0, 1] by mapping every value x to (x - min) / (max - min), where min and max are computed per feature. It is useful when the features have different units or scales. In scikit-learn, you can use the MinMaxScaler to perform normalization on the dataset.
Let’s take a look at an example:
# Import necessary libraries
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Example data (made up for illustration): two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
# Create an instance of MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the dataset; each column is scaled to [0, 1] independently
X_normalized = scaler.fit_transform(X)
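With the example X above, each column is mapped onto [0, 1] independently: both columns become [0, 0.5, 1]. If you later need the original units back, for example to report results, MinMaxScaler can invert the transform:
# Print the scaled values and recover the original ones
print(X_normalized)
X_original = scaler.inverse_transform(X_normalized)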
Standardization
Standardization transforms each feature to have a mean of 0 and a standard deviation of 1 by mapping every value x to (x - mean) / std, where mean and std are computed per feature. It is useful when features have very different means and spreads, or when an algorithm expects zero-centered input. In scikit-learn, you can use the StandardScaler to perform standardization on the dataset.
Here's an example of standardization using scikit-learn:
# Import necessary libraries
import numpy as np
from sklearn.preprocessing import StandardScaler
# Example data (made up for illustration): two features with different means and spreads
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
# Create an instance of StandardScaler
scaler = StandardScaler()
# Fit and transform the dataset; each column gets mean 0 and standard deviation 1
X_standardized = scaler.fit_transform(X)
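Each column of the result now has, up to floating-point error, a mean of 0 and a standard deviation of 1, which you can verify directly:
# Check the per-column mean and standard deviation of the scaled data
print(X_standardized.mean(axis=0))  # approximately [0. 0.]
print(X_standardized.std(axis=0))   # approximately [1. 1.]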
Choosing between Normalization and Standardization
Whether to use normalization or standardization depends on the dataset and the machine learning algorithm being used. Generally, standardization is less sensitive to outliers than min-max normalization, because a single extreme value directly sets the minimum or maximum used for scaling but only partially shifts the mean and standard deviation (see the sketch below). Standardization is often recommended for algorithms that assume zero-mean, unit-variance features, such as support vector machines and logistic regression. Normalization is often recommended for algorithms that expect input features on a common bounded scale, such as k-nearest neighbors and artificial neural networks.
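To make the outlier point concrete, here is a minimal sketch (the data values are invented for illustration): a single extreme value squeezes the min-max-normalized values of the ordinary points together, while standardization leaves them more clearly separated.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# One feature with four ordinary values and one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
# Min-max normalization: the outlier defines the maximum, so the four
# ordinary values are compressed into roughly [0, 0.03]
print(MinMaxScaler().fit_transform(X).ravel())
# Standardization: the outlier still shifts the mean and inflates the
# standard deviation, but the ordinary values remain better separated
print(StandardScaler().fit_transform(X).ravel())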
In conclusion, feature scaling in scikit-learn can be achieved using normalization (MinMaxScaler) or standardization (StandardScaler). Understanding when and how to use these techniques is important for building effective machine learning models.
Could you also explain how the choice of feature_range affects the output? I am trying to understand when it should be (0, 5) and when it should be (0, 10), and how you interpret the output in each case. Also, I am wondering: you are applying the scalers to the whole dataset, but what if you have a regression task (predicting an actual number)? If you apply scalers to all columns, then your targets also change.
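Two notes on those questions, with a small sketch (the data and variable names are invented for illustration). First, feature_range=(a, b) only changes the interval each column is mapped onto: the column minimum becomes a and the maximum becomes b, so (0, 5) and (0, 10) preserve exactly the same relative ordering and spacing, and the choice usually just reflects what a downstream algorithm or convention expects. Second, for a regression task you would typically fit a scaler on the feature columns only; if you also scale the target, use a separate scaler so you can map predictions back to original units:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Hypothetical regression data: a feature matrix X and a target column y
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y = np.array([[100.0], [200.0], [300.0]])
# feature_range only sets the output interval: here each column maps onto [0, 5]
x_scaler = MinMaxScaler(feature_range=(0, 5))
X_scaled = x_scaler.fit_transform(X)
# Scale the target with its own scaler so it can be inverted independently
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y)
# After a model predicts in the scaled space, map predictions back:
# y_pred = y_scaler.inverse_transform(y_pred_scaled)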