K Nearest Neighbors (KNN) Machine Learning Algorithm Tutorial using Python Scikit-Learn

In this tutorial, we will learn about the K Nearest Neighbors (KNN) algorithm, a popular method for both classification and regression tasks. We will use the Python programming language and the Scikit-Learn library to implement it.

K Nearest Neighbors (KNN) is a simple and intuitive algorithm that stores all available cases and classifies new cases based on a similarity measure. To make a prediction for a given data point, it finds the K most similar instances in the training dataset and assigns a label (or, for regression, a value) based on these K nearest neighbors.

KNN is a lazy learning algorithm: it does not build a model at training time but instead memorizes the training instances. This makes prediction computationally expensive, especially for large datasets, because it requires calculating the distance between the new data point and every training instance.
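To make this concrete, here is a minimal sketch of KNN's core logic in plain NumPy (a simplified illustration, not Scikit-Learn's actual implementation): compute the Euclidean distance from the new point to every training point, then take a majority vote among the K closest.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]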

Let’s start by installing the required libraries and loading a dataset to demonstrate the KNN algorithm:

Step 1: Install the required libraries.

!pip install numpy
!pip install pandas
!pip install scikit-learn

Step 2: Import the required libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Step 3: Load and preprocess the dataset.
For this tutorial, we will use the famous Iris dataset, which contains 150 samples of Iris flowers, each with four features (sepal length, sepal width, petal length, petal width) and a target variable (species: setosa, versicolor, virginica).

# Load the Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
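Since pandas was imported in Step 2, you can optionally wrap the features in a DataFrame to inspect the data:

# Optional: view the features and target as a pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target
print(df.head())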

Step 4: Feature scaling.
Because KNN relies on distance calculations, features with larger numeric ranges would otherwise dominate the distance. We therefore scale the features to a mean of 0 and a standard deviation of 1 using the StandardScaler. Note that the scaler is fitted on the training data only and then applied to the test data, which avoids leaking information from the test set.

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
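As a quick sanity check, the scaled training features should now have means close to 0 and standard deviations close to 1 (the test-set statistics will only be approximately so, since the scaler was fitted on the training data):

# Sanity check: means ~0 and standard deviations ~1
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))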

Step 5: Train and evaluate the KNN algorithm.
Now we can train the KNN classifier using the training data and evaluate its performance on the testing data.

# Initialize the KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN classifier on the training data
knn.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = knn.predict(X_test)

# Calculate the accuracy of the KNN classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Step 6: Tune the hyperparameters.
The KNN algorithm has a hyperparameter K, the number of neighbors considered when making a prediction. Small values of K make the classifier sensitive to noise, while large values smooth the decision boundary, so tuning K can improve performance. As a first step, we can simply try a different value (a more systematic search is sketched after this block).

# Initialize the KNN classifier with K=5
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the KNN classifier on the training data
knn.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = knn.predict(X_test)

# Calculate the accuracy of the KNN classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

By following these steps, you have successfully implemented the K Nearest Neighbors (KNN) algorithm in Python using the Scikit-Learn library. KNN is a versatile algorithm that can be used for both classification and regression tasks, making it a valuable tool in a data scientist’s toolkit. Happy coding!
