In this tutorial, we will continue our exploration of the K-Nearest Neighbors algorithm using the Iris Dataset. We will build upon the knowledge gained in the previous parts of this series and delve deeper into the practical application of the algorithm.
Part Four: Feature Scaling and Model Evaluation
In the previous parts of this series, we have successfully implemented the K-Nearest Neighbors algorithm on the Iris Dataset and made predictions on new data points. However, there are still some important steps we need to take to ensure that our model is robust and accurate.
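The code in this part assumes the variables X_train, X_test, y_train, y_test and the knn classifier carried over from the earlier parts. If you are jumping in here, a minimal sketch of that setup might look like the following (the 80/20 split, the random_state, and n_neighbors=5 are assumptions, not necessarily the exact values used earlier):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the Iris Dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# K-Nearest Neighbors classifier as set up in the earlier parts (assumed k=5)
knn = KNeighborsClassifier(n_neighbors=5)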
One crucial step in the machine learning process is feature scaling. Feature scaling is the process of normalizing the range of the independent variables, or features, of our data. This matters for K-Nearest Neighbors because the algorithm classifies points based on distances between their feature values. If the features are not scaled, those with larger ranges dominate the distance calculation, so the nearest neighbors end up being chosen almost entirely by those features.
To perform feature scaling in Python, we can use the StandardScaler class from the scikit-learn library. Let’s start by importing the necessary libraries and initializing the StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Next, we need to fit the scaler to our training data and transform both the training and test data using the scaler:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
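Because scaling changes the distances the model computes, the classifier should be fitted on the scaled training data before we evaluate it. Assuming knn is the KNeighborsClassifier instance from the earlier parts, that refit is a single call:
# Refit the classifier on the scaled training features
knn.fit(X_train_scaled, y_train)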
Now that we have scaled our features, we can proceed to evaluate the performance of our K-Nearest Neighbors model. One common metric for evaluating classification models is accuracy, which measures the fraction of correctly classified data points. We can calculate the accuracy of our model using the accuracy_score function from the scikit-learn library:
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
This will give us the accuracy of our model on the test set. However, accuracy alone may not provide a complete picture of the model’s performance. We can also calculate other metrics such as precision, recall, and F1 score to evaluate the model further. These metrics can be calculated using the classification_report function from scikit-learn:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
This will provide a detailed report of precision, recall, F1 score, and support for each class in our dataset.
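As a quick refresher: for each class, precision is the fraction of points predicted as that class that truly belong to it, recall is the fraction of points of that class that were correctly identified, and the F1 score is the harmonic mean of the two, 2 × (precision × recall) / (precision + recall). Support is simply the number of test samples in each class.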
In addition to evaluating our model’s performance, we can also optimize the hyperparameters of the K-Nearest Neighbors algorithm to improve its accuracy. Some of the hyperparameters that can be tuned include the number of neighbors (n_neighbors), the distance metric (metric), and the weight function (weights). We can use tools like GridSearchCV to perform hyperparameter tuning:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': range(1, 21), 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']}
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print(grid_search.best_params_)
Once the search has identified the best hyperparameters, GridSearchCV (with its default refit=True) has already retrained the model on the full training set using them, so we can evaluate the tuned model directly on the test set.
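The refitted model is exposed as best_estimator_. A short sketch of evaluating it on the held-out test set, reusing the accuracy_score function imported earlier:
# The best model is already refit on all of the scaled training data
best_knn = grid_search.best_estimator_
# Evaluate the tuned model on the scaled test set
y_pred_tuned = best_knn.predict(X_test_scaled)
print(f'Tuned accuracy: {accuracy_score(y_test, y_pred_tuned)}')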
In this tutorial, we have learned how to perform feature scaling, evaluate the performance of our K-Nearest Neighbors model, and optimize its hyperparameters. By following these steps, we can build a robust and accurate machine learning model using the K-Nearest Neighbors algorithm on the Iris Dataset.