One Hot Encoder with Python Machine Learning (Scikit-Learn)
When working with categorical data in machine learning, it is often necessary to encode the categories into numerical values that can be used as input for a machine learning model. One popular method for doing this is the One Hot Encoder, which is available in the Scikit-Learn library for Python.
What is One Hot Encoding?
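One Hot Encoding is a process of converting a categorical variable into a set of binary columns, one per category, so that it can be provided to ML algorithms that expect numeric input. Each row gets a 1 in the column for its own category and a 0 in every other column, so one hot encoded data contains a lot of zeroes, with few ones. Because every distinct category adds a column, one-hot encoding increases the dimensionality of the dataset rather than reducing it, and a variable with many categories can cause the problem to become very large. This is why Scikit-Learn's OneHotEncoder returns a sparse matrix by default.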
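To make the idea concrete before turning to Scikit-Learn, here is a minimal hand-rolled sketch in plain Python. The toy list and the variable names are chosen only for illustration; in practice a library encoder would normally be used instead.

```python
# Hand-rolled one-hot encoding of a toy list (illustration only).
values = ['A', 'B', 'A', 'C']

# The sorted unique categories define the column order.
categories = sorted(set(values))

# One row per value: 1 in the column of its category, 0 elsewhere.
one_hot = [[1 if value == category else 0 for category in categories]
           for value in values]

print(categories)
for row in one_hot:
    print(row)
```

Three distinct categories produce three columns, and each row contains exactly one 1.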
Using One Hot Encoder in Python
In Python, we can use the OneHotEncoder from the Scikit-Learn library to encode categorical variables into a one-hot encoded representation. First, we import the necessary libraries:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
Next, we can create a Pandas DataFrame with our categorical data:
data = {'category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)
Then, we can create an instance of the OneHotEncoder and fit it to our data:
encoder = OneHotEncoder()
encoder.fit(df[['category']])
Finally, we can use the transform method to transform our categorical data into a one-hot encoded representation:
encoded_data = encoder.transform(df[['category']]).toarray()
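Putting the steps above together, the following sketch runs the whole example end to end and wraps the result in a labelled DataFrame. The column names are built from encoder.categories_ here, and the name encoded_df is an assumption made for illustration. Recent Scikit-Learn versions also provide get_feature_names_out for the same purpose.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the walkthrough above.
df = pd.DataFrame({'category': ['A', 'B', 'A', 'C']})

# Fit and transform in one step; the result is sparse, so convert it
# to a dense NumPy array for inspection.
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['category']]).toarray()

# categories_ holds the categories learned for each input column.
columns = ['category_' + str(c) for c in encoder.categories_[0]]

# Wrap the array in a DataFrame so each column is labelled.
encoded_df = pd.DataFrame(encoded_data, columns=columns, index=df.index)
print(encoded_df)
```

Each of the four rows has a single 1.0 in the column matching its original category.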
Conclusion
One Hot Encoder is a useful tool for handling categorical data in machine learning, and it is easy to use in Python with the Scikit-Learn library. By using One Hot Encoder, we can encode our categorical variables into a numeric format that machine learning models can accept as input. The trade-off is one extra column per category, so it works best when the number of distinct categories is modest.
Data Code:
d = {'sales': [100000, 222000, 1000000, 522000, 111111, 222222, 1111111, 20000, 75000, 90000, 1000000, 10000],
     'city': ['Tampa', 'Tampa', 'Orlando', 'Jacksonville', 'Miami', 'Jacksonville', 'Miami', 'Miami', 'Orlando', 'Orlando', 'Orlando', 'Orlando'],
     'size': ['Small', 'Medium', 'Large', 'Large', 'Small', 'Medium', 'Large', 'Small', 'Medium', 'Medium', 'Medium', 'Small']}
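This dataset mixes two categorical columns (city and size) with a numeric one (sales). One possible way to encode only the categorical columns while leaving sales untouched is Scikit-Learn's ColumnTransformer. This is a sketch under that assumption, not necessarily the approach taken in the video; the names df, transformer and the step label 'onehot' are chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

d = {'sales': [100000, 222000, 1000000, 522000, 111111, 222222,
               1111111, 20000, 75000, 90000, 1000000, 10000],
     'city': ['Tampa', 'Tampa', 'Orlando', 'Jacksonville', 'Miami',
              'Jacksonville', 'Miami', 'Miami', 'Orlando', 'Orlando',
              'Orlando', 'Orlando'],
     'size': ['Small', 'Medium', 'Large', 'Large', 'Small', 'Medium',
              'Large', 'Small', 'Medium', 'Medium', 'Medium', 'Small']}
df = pd.DataFrame(d)

# One-hot encode 'city' and 'size'; pass the remaining column ('sales')
# through unchanged. sparse_threshold=0 forces a dense array as output.
transformer = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), ['city', 'size'])],
    remainder='passthrough',
    sparse_threshold=0,
)
encoded = transformer.fit_transform(df)

# 4 city columns + 3 size columns + 1 passthrough column = 8 columns.
print(encoded.shape)
```

The encoded columns come first, in the order the categories were learned (alphabetical here), followed by the passthrough sales column.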
This is a great video. Explained in a manner that a newbie like myself can understand. Thank you.
A question: What if the dataset contains multiple categorical variables (as well as numerical), and they are all required as input to make a prediction. How can one go about it?
Exemplary and easy-to-follow explanation. Thanks, sir.
Trying your code I get this error: 'AttributeError: 'OneHotEncoder' object has no attribute 'set_output''. Any idea why this is?
Learnt a lot! Thanks!!
Great video!