K-Means Clustering From Scratch in Python (Mathematical)
K-means clustering is a popular unsupervised learning algorithm that partitions data points into a predefined number of clusters. In this article, we will explore how to implement K-means clustering from scratch in Python, along with the mathematical concepts behind it.
Understanding K-Means Clustering Algorithm
The K-means clustering algorithm works by partitioning a dataset into K clusters, where each cluster is represented by its centroid. The algorithm iteratively assigns data points to the nearest centroid and recalculates the centroids based on the newly assigned data points. This process continues until the centroids no longer change, or until a predefined number of iterations is reached.
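The procedure described above can also be written as an optimization problem. The block below is a sketch in standard notation; the symbols x_i (a data point), C_k (the set of points assigned to cluster k), and \mu_k (the centroid of cluster k) are introduced here for illustration and do not appear elsewhere in this article.

```latex
% Objective: within-cluster sum of squared distances
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

% Assignment step: each point goes to its nearest centroid
c_i = \arg\min_{k \in \{1,\dots,K\}} \lVert x_i - \mu_k \rVert^2

% Update step: each centroid becomes the mean of its assigned points
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

Neither step can increase J, which is why the iteration settles down. The result is in general a local minimum that depends on the initial centroids, not necessarily the best possible partition.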
Steps of K-Means Clustering
1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid.
3. Calculate the new centroids based on the data points assigned to each cluster.
4. Repeat steps 2 and 3 until the centroids do not change significantly or a maximum number of iterations is reached.
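To make the assignment and update steps concrete, here is a minimal worked example of a single pass. The four points and the two starting centroids are hand-picked for illustration (they are assumptions, not data from this article), so the arithmetic is easy to check by eye.

```python
import numpy as np

# Four hand-picked 2-D points and two hypothetical starting centroids
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[1.0, 1.0], [9.0, 1.0]])

# Assignment step: distance from every point to every centroid, shape (4, 2)
distances = np.linalg.norm(points[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
labels = distances.argmin(axis=1)  # index of the nearest centroid per point

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)         # the two left points get label 0, the two right points label 1
print(new_centroids)  # rows (0, 1) and (10, 1)
```

After this one pass the centroids sit at the middle of each pair of points, and a second pass would leave them unchanged, which is the stopping condition.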
Implementing K-Means Clustering in Python
Now, let’s take a look at how we can implement the K-means clustering algorithm from scratch in Python.
```python
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate random data points
np.random.seed(0)
X = np.random.rand(100, 2)


# Define the K-means clustering function
def k_means_clustering(X, K, max_iterations):
    # Initialize centroids by picking K distinct data points at random
    centroids = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(max_iterations):
        # Assign each data point to the nearest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        cluster_assignments = np.argmin(distances, axis=1)
        # Calculate the new centroids; keep the old one if a cluster is empty
        new_centroids = np.array([
            X[cluster_assignments == k].mean(axis=0)
            if np.any(cluster_assignments == k) else centroids[k]
            for k in range(K)
        ])
        # Check for convergence (centroids have stopped moving)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, cluster_assignments


# Perform K-means clustering with K=3
K = 3
max_iterations = 100
centroids, cluster_assignments = k_means_clustering(X, K, max_iterations)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=cluster_assignments)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')
plt.show()
```
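The value K=3 above is a choice made by the user, not something the algorithm discovers. One common heuristic for comparing choices is to compute the within-cluster sum of squares (the quantity K-means minimizes) for several values of K and look for the point where further increases stop helping much. The sketch below is an illustration under assumptions: the helper names `within_cluster_ss` and `run_k_means` are hypothetical, and the compact loop is repeated only so that the block runs on its own.

```python
import numpy as np


def within_cluster_ss(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))


def run_k_means(X, K, max_iterations=100, seed=0):
    """Compact K-means loop, included only so this sketch is self-contained."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iterations):
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = distances.argmin(axis=1)
        # Keep the previous centroid if a cluster ends up with no points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Uniform random data, similar in spirit to the example above
rng = np.random.default_rng(0)
X = rng.random((100, 2))

scores = {}
for K in range(1, 7):
    centroids, labels = run_k_means(X, K)
    scores[K] = within_cluster_ss(X, centroids, labels)
    print(K, round(scores[K], 3))
```

The score usually drops as K grows, so the smallest score alone is not a useful criterion; what matters is where the improvement flattens out. On uniformly random data like this there is no natural cluster count, so no sharp bend should be expected.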
Conclusion
In this article, we have learned about the K-means clustering algorithm and how to implement it from scratch in Python. This algorithm is widely used in various fields such as data mining, image segmentation, and customer segmentation. Understanding the mathematical concepts and implementing the algorithm from scratch can provide a deeper insight into how it works and how to customize it for specific applications.
When did Viktor Krum start teaching Python?! /s
Very informative video btw, thank you so much!!
It's a great tutorial. One thing I didn't understand is why and how the example dataset was assumed to have 3 centroids, given that the dataset has no labels (unsupervised). I'd appreciate it if you could elaborate. Thanks,
I'm glad I found this tutorial!
I take random points from my data as initial centroids. It means fewer computations, since you only need a set of random integers for the indices.
this is great, thank you. it feels good to understand something and be a little closer to understanding machine learning or how to use it properly.
I want Python code to convert a handwritten image into plain text accurately. I have tried but couldn't get it to work. Could you try it and show me, sir? Please respond to this comment, because I have been searching for this very curiously…
really interesting
hi i am getting this error can you tell how to solve it
ValueError: 'c' argument has 200 elements, which is inconsistent with 'x' and 'y' with size 100.
Thx_.