Performing KMeans Clustering Analysis on the Iris Data Set using Scikit Learn

In this tutorial, we will be covering how to perform KMeans clustering analysis using the Scikit Learn library on the famous Iris dataset. The Iris dataset is a well-known dataset in the field of machine learning and consists of 150 iris flower samples, 50 for each of three species: setosa, versicolor, and virginica.

KMeans clustering is a popular unsupervised machine learning algorithm that partitions the data into k clusters based on the features of each data point. The algorithm works by randomly initializing k centroids, assigning each data point to the closest centroid, recalculating the centroids based on the mean of the data points in each cluster, and repeating this process until convergence is reached.
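To make those mechanics concrete, here is a minimal from-scratch sketch of that loop in NumPy. This is an illustration of the idea only, not what Scikit Learn actually runs internally (its implementation adds smarter initialization and many optimizations), and the helper name simple_kmeans is our own:

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize by picking k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (this sketch assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # convergence: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids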

Let’s start by importing the necessary libraries and the Iris dataset from Scikit Learn.

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

Now that we have imported the necessary libraries and loaded the Iris dataset, we can move on to the KMeans clustering analysis. We will first determine a reasonable number of clusters using the "elbow method," which involves plotting the within-cluster sum of squares (WCSS, exposed as inertia_ in Scikit Learn) for different values of k and selecting the value of k where the rate of decrease in WCSS starts to slow down.

# Fit KMeans for k = 1 through 10 and record the WCSS (inertia) for each
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the elbow method graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

From the elbow method graph, we can see that a good number of clusters for the Iris dataset is 3: the WCSS drops steeply up to k = 3, and the rate of decrease slows noticeably after that point.
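Before committing to k = 3, an optional sanity check is the silhouette score from sklearn.metrics, which measures how well separated the clusters are (higher is better). A short sketch, starting at k = 2 because the silhouette score is undefined for a single cluster:

from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))

Be aware that different criteria can disagree: on the raw Iris features the silhouette score tends to favor k = 2, while the elbow and the three known species both point to k = 3. Let's proceed with clustering the data into 3 clusters using the KMeans algorithm.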

# Cluster the data into 3 clusters and get a cluster label for each sample
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
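Although KMeans never looks at the true labels, we did load them into y earlier, so as an optional check we can cross-tabulate the cluster assignments against the actual species using the pandas library we imported. Keep in mind that cluster numbers are arbitrary, so cluster 0 does not necessarily correspond to setosa:

# Compare cluster assignments with the true species (inspection only;
# the clustering itself never used y)
print(pd.crosstab(pd.Series(y_kmeans, name='cluster'),
                  pd.Series(iris.target_names[y], name='species')))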

Now that we have clustered the data, we can visualize the clusters by plotting the data points using the first two features (sepal length and sepal width) colored by the clusters assigned by KMeans.

# Plot each cluster's points using the first two features
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')

# Plot the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='yellow', label='Centroids')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()
plt.show()

In the plot, you can see that the data points are divided into three distinct clusters, with the centroids shown as yellow dots. This visualization helps us understand how the KMeans algorithm has partitioned the Iris dataset along the sepal length and sepal width features.
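One caveat: the clustering was computed on all four features, but the plot shows only two of them, so a few points can look misplaced in this 2-D view even though they are close to their centroid in the full four-dimensional space. If you want a single 2-D picture that reflects all four features, one common option is to project the data with PCA first. A brief sketch using sklearn.decomposition:

from sklearn.decomposition import PCA

# Project the four-dimensional feature space onto its first two principal components
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_kmeans, cmap='viridis', s=60)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('KMeans clusters in PCA space')
plt.show()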

In conclusion, we have successfully performed KMeans clustering analysis on the Iris dataset using Scikit Learn: we chose the number of clusters with the elbow method, clustered the data into 3 clusters, and visualized the result. KMeans is a simple but powerful algorithm for unsupervised analysis and can be applied to a wide range of real-world problems.
