Implementing K-Means Clustering in Python from Scratch (with Mathematical Explanation)

K-means clustering is a popular unsupervised learning algorithm that groups data points into a specified number of clusters. In this article, we will explore how to implement K-means clustering from scratch in Python, along with the mathematical concepts behind it.

Understanding K-Means Clustering Algorithm

The K-means clustering algorithm works by partitioning a dataset into K clusters, where each cluster is represented by its centroid. The algorithm iteratively assigns data points to the nearest centroid and recalculates the centroids based on the newly assigned data points. This process continues until the centroids no longer change, or until a predefined number of iterations is reached.
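
The mathematical idea behind this loop can be stated compactly. As a sketch in standard notation (the symbols x_i for a data point, \mu_k for the centroid of cluster k, C_k for the set of points in cluster k, and J for the objective are introduced here for illustration), K-means tries to make the within-cluster sum of squared Euclidean distances small, alternating between an assignment step and an update step:

```latex
% Objective: within-cluster sum of squared distances
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

% Assignment step: each point goes to its nearest centroid
c_i = \arg\min_{k \in \{1, \dots, K\}} \lVert x_i - \mu_k \rVert^2

% Update step: each centroid becomes the mean of its assigned points
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

Neither step can increase J, which is why the procedure settles down, although the result may be a local rather than a global minimum and can depend on the initial centroids.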

Steps of K-Means Clustering

  1. Initialize K centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Calculate the new centroids based on the data points assigned to each cluster.
  4. Repeat steps 2 and 3 until the centroids do not change significantly or a maximum number of iterations is reached.
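
To make steps 2 and 3 concrete, here is a minimal sketch of a single iteration on four hand-picked points. The data and the starting centroids are illustrative assumptions, chosen so the result can be checked by eye; in the real algorithm the starting centroids are random.

```python
import numpy as np

# Four 2-D points forming two obvious groups (illustrative data)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Step 1: starting centroids, chosen by hand here rather than randomly
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])

# Step 2: distance from every point to every centroid, shape (4, 2),
# then pick the index of the closest centroid for each point
distances = np.linalg.norm(points[:, np.newaxis] - centroids, axis=2)
labels = np.argmin(distances, axis=1)

# Step 3: each new centroid is the mean of the points assigned to it
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)         # labels: [0 0 1 1]
print(new_centroids)  # cluster means: (0, 1) and (10, 1)
```

After this one pass the centroids have moved to the middle of each group; a second pass would leave them unchanged, which is the stopping condition in step 4.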

Implementing K-Means Clustering in Python

Now, let’s take a look at how we can implement the K-means clustering algorithm from scratch in Python.

```python
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate random data points
np.random.seed(0)
X = np.random.rand(100, 2)

# Define the K-means clustering function
def k_means_clustering(X, K, max_iterations):
    # Initialize centroids by picking K distinct data points at random
    centroids = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(max_iterations):
        # Assign each data point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        cluster_assignments = np.argmin(distances, axis=1)
        # Calculate the new centroids; keep the old one if a cluster is empty
        new_centroids = np.array([
            X[cluster_assignments == k].mean(axis=0)
            if np.any(cluster_assignments == k) else centroids[k]
            for k in range(K)
        ])
        # Check for convergence (centroids no longer change significantly)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, cluster_assignments

# Perform K-means clustering with K=3
K = 3
max_iterations = 100
centroids, cluster_assignments = k_means_clustering(X, K, max_iterations)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=cluster_assignments)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')
plt.show()
```
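
One way to check a clustering result numerically is the within-cluster sum of squared distances, the quantity that K-means tries to make small. A minimal sketch follows; the helper name `within_cluster_sum_of_squares` and the tiny dataset are hypothetical, chosen for illustration, and are not part of the implementation above.

```python
import numpy as np

def within_cluster_sum_of_squares(X, centroids, cluster_assignments):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    # centroids[cluster_assignments] selects, for each point, the centroid of its cluster
    diffs = X - centroids[cluster_assignments]
    return float(np.sum(diffs ** 2))

# Illustrative data: two clusters of two points each
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids_demo = np.array([[0.0, 1.0], [10.0, 1.0]])
assignments_demo = np.array([0, 0, 1, 1])

# Every point lies at distance 1 from its centroid, so the total is 4.0
print(within_cluster_sum_of_squares(X_demo, centroids_demo, assignments_demo))  # 4.0
```

Passing the `centroids` and `cluster_assignments` returned by `k_means_clustering` gives the value for the random dataset. Comparing this value across several choices of K is one common, if informal, way to judge how many clusters to use.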

Conclusion

In this article, we have learned about the K-means clustering algorithm and how to implement it from scratch in Python. This algorithm is widely used in various fields such as data mining, image segmentation, and customer segmentation. Understanding the mathematical concepts and implementing the algorithm from scratch can provide a deeper insight into how it works and how to customize it for specific applications.

@DarwinRCAPNWattersonIII
10 months ago

When did Viktor Krum start teaching Python?! /s

Very informative video btw, thank you so much!!

@bijayamanandhar3890
10 months ago

It's a great tutorial. Besides everything else, I just didn't understand why and how it was assumed that there are 3 centroids for the example dataset, given that the dataset has no labels (unsupervised). I'd appreciate it if you could elaborate. Thanks.

@AlexandLupand
10 months ago

I'm glad I found this tutorial!

@Larzsolice
10 months ago

I take random points from my data as initial centroids; it takes fewer computations, since you only need a set of random integers for the indices.

@pitaeata8493
10 months ago

this is great, thank you. it feels good to understand something and be a little closer to understanding machine learning or how to use it properly.

@aravindputtapaka5147
10 months ago

I want Python code to convert a handwritten image into plain text accurately. I have tried, but I didn't get it working. Could you try it and show me, sir? Please respond to this comment, because I have been searching for this very curiously…

@tcgvsocg1458
10 months ago

really interesting

@user-gk5bz6vd9j
10 months ago

Hi, I am getting this error. Can you tell me how to solve it?

ValueError: 'c' argument has 200 elements, which is inconsistent with 'x' and 'y' with size 100.

@philtoa334
10 months ago

Thx_.