Python Machine Learning (Scikit-Learn) One Hot Encoder

When working with categorical data in machine learning, it is often necessary to encode the categories into numerical values that can be used as input for a machine learning model. One popular method for doing this is the One Hot Encoder, which is available in the Scikit-Learn library for Python.

What is One Hot Encoding?

One Hot Encoding converts a categorical variable into a set of binary columns, one per category, so that it can be passed to ML algorithms that expect numerical input. Each row gets a 1 in the column for its category and a 0 everywhere else, so one-hot encoded data contains a lot of zeroes and few ones. Because every distinct category adds a column, one-hot encoding increases the dimensionality of the dataset rather than reducing it, and a variable with many categories can make the problem very large. This is why the encoded result is usually stored as a sparse matrix.
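To make the idea concrete, here is a minimal hand-written sketch in plain Python, with no library involved. The variable names are made up for illustration, and this is not how Scikit-Learn implements it internally; it only shows what the encoded rows look like.

```python
# Hand-rolled one-hot encoding, for illustration only.
values = ['A', 'B', 'A', 'C']

# The sorted unique categories define the column order.
categories = sorted(set(values))

# One row per value: 1 in the column matching its category, 0 elsewhere.
one_hot = [[1 if value == category else 0 for category in categories]
           for value in values]

for value, row in zip(values, one_hot):
    print(value, row)
# A [1, 0, 0]
# B [0, 1, 0]
# A [1, 0, 0]
# C [0, 0, 1]
```

Three distinct categories produce three columns, so the single original column has become three.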

Using One Hot Encoder in Python

In Python, we can use the OneHotEncoder from the Scikit-Learn library to encode categorical variables into a one-hot encoded representation. First, we import the necessary libraries:


import pandas as pd
from sklearn.preprocessing import OneHotEncoder

Next, we can create a Pandas DataFrame with our categorical data:


data = {'category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)

Then, we can create an instance of the OneHotEncoder and fit it to our data:


encoder = OneHotEncoder()
encoder.fit(df[['category']])
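
After fitting, it can help to check which categories the encoder has learned. The sketch below repeats the setup so that it runs on its own, and reads the fitted `categories_` attribute, which holds one array of categories per input column. The variable name `learned` is just for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'category': ['A', 'B', 'A', 'C']})

encoder = OneHotEncoder()
encoder.fit(df[['category']])

# categories_ is a list with one array per input column;
# we passed a single column, so we look at index 0.
learned = encoder.categories_[0].tolist()
print(learned)  # ['A', 'B', 'C']
```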

Finally, we can use the transform method to transform our categorical data into a one-hot encoded representation. By default, transform returns a sparse matrix, so we call toarray() to get a regular NumPy array:


encoded_data = encoder.transform(df[['category']]).toarray()

Conclusion

One Hot Encoder is a useful tool for handling categorical data in machine learning, and it is easy to use in Python with the Scikit-Learn library. By using One Hot Encoder, we can encode our categorical variables into a numerical format that machine learning models can accept as input, without imposing an artificial order on the categories.

6 Comments
@RyanNolanData
10 months ago

Data Code:

d = {'sales': [100000, 222000, 1000000, 522000, 111111, 222222, 1111111, 20000, 75000, 90000, 1000000, 10000],
     'city': ['Tampa', 'Tampa', 'Orlando', 'Jacksonville', 'Miami', 'Jacksonville', 'Miami', 'Miami', 'Orlando', 'Orlando', 'Orlando', 'Orlando'],
     'size': ['Small', 'Medium', 'Large', 'Large', 'Small', 'Medium', 'Large', 'Small', 'Medium', 'Medium', 'Medium', 'Small']}

@alonzoslim
10 months ago

This is a great video. Explained in a manner that a newbie like myself can understand. Thank you.

A question: What if the dataset contains multiple categorical variables (as well as numerical), and they are all required as input to make a prediction. How can one go about it?

@AhmedIbrahim-xz8bt
10 months ago

Exemplary and easy-to-follow explanation. Thanks, sir.

@juanDoAs
10 months ago

Trying your code I get this error: "AttributeError: 'OneHotEncoder' object has no attribute 'set_output'". Any idea why this is?

@swativarsha68
10 months ago

Learnt a lot! Thanks!!

@onurbltc
10 months ago

Great video!