A Guide to Unsupervised Clustering using K-Means

K-means is a popular unsupervised machine learning algorithm for clustering. The main goal of the K-means algorithm is to divide a set of n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

The algorithm works by first randomly selecting k centroids, which represent the center of each cluster. Then, each data point is assigned to the cluster whose centroid is closest to it. Once all the data points have been assigned to a cluster, the centroids are recalculated as the mean of all the data points in each cluster. This process of reassigning data points to clusters and recalculating the centroids is repeated until the cluster assignments no longer change or a maximum number of iterations is reached.
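The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration only, not scikit-learn's implementation: the function name kmeans_sketch, its parameters, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Illustrative data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 10.0, 20.0)])
labels, centroids = kmeans_sketch(X, k=3)
print(centroids)
```

Because the starting centroids are chosen at random, a different seed may lead to a different final clustering.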

K-means is often used for tasks such as customer segmentation, image compression, and image segmentation. It is simple and efficient, but it has limitations: the number of clusters must be specified in advance, the result can depend on the initial centroid positions, and it does not work well with non-globular clusters or clusters with different densities. In this blog post, we will show you how to perform K-means clustering in Python using scikit-learn.

First, we will import the necessary libraries.

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

Next, we will generate some sample data to work with. In this example, we will generate two-dimensional data points that cluster around three distinct centers.

# generate sample data
np.random.seed(0)
X = np.random.randn(150, 2)
X[:50] += 3
X[50:100] += 6
X[100:150] += 9

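Now we will create a KMeans object with the number of clusters set to 3 and fit the model to the data. Passing random_state makes the run reproducible, and n_init=10 repeats the algorithm from ten different initializations and keeps the best result.

# create k-means model
# n_init=10: try ten initializations and keep the lowest-inertia result
# random_state=0: fix the seed so the result is reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)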

Finally, we will use the predict method to obtain the cluster label of each data point and visualize the results with a scatter plot.

# predict cluster labels
y_pred = kmeans.predict(X)

# plot the data points colored by their cluster label
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
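Beyond the plot, the fitted model can be inspected directly. The sketch below rebuilds the same sample data and prints the learned centroids (cluster_centers_) and the within-cluster sum of squared distances (inertia_). The n_init and random_state values are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic data as in the walkthrough: offsets of 3, 6 and 9.
np.random.seed(0)
X = np.random.randn(150, 2)
X[:50] += 3
X[50:100] += 6
X[100:150] += 9

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# cluster_centers_ has shape (n_clusters, n_features); sort rows by the
# first coordinate so the output order does not depend on label numbering.
centers = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_[:, 0])]
print(centers)          # should lie close to the offsets used to build X
print(kmeans.inertia_)  # within-cluster sum of squared distances
```

Comparing the centroids with the offsets used to generate the data is a quick sanity check that is only possible here because the data is synthetic.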

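In this example, we grouped the data points into three clusters using the K-means algorithm. Because the result depends on the initial centroids, it is advisable to run the algorithm several times with different initializations and keep the run with the lowest within-cluster sum of squared distances. In scikit-learn, the n_init parameter of KMeans controls how many initializations are tried.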
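One way to see the effect of initialization is to run K-means once per seed with a single random start and compare the resulting inertia values. This is a sketch under assumptions: the seeds and the use of init="random" are arbitrary choices for illustration, and on data this well separated the runs may all agree.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic data as in the walkthrough.
np.random.seed(0)
X = np.random.randn(150, 2)
X[:50] += 3
X[50:100] += 6
X[100:150] += 9

# One run per seed, each with a single purely random initialization.
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)
    print(f"seed={seed}  inertia={km.inertia_:.2f}")

# n_init=10 runs ten initializations internally and keeps the best one.
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(f"best of 10: {best.inertia_:.2f}")
```

A large spread in the per-seed inertia values would suggest that single runs are getting stuck in poor local optima, which is when multiple initializations matter most.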

In practice, you will often apply KMeans to real-world data with more than two features. Because K-means relies on distances between points, features measured on larger scales can dominate the result, so you may need to preprocess the data, for example by standardizing or normalizing it, before applying KMeans.
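The following sketch illustrates why scaling can matter. The data is hypothetical: two groups differ only in the first feature, while the second feature is pure noise on a much larger scale. StandardScaler is one common choice of preprocessing, and the agreement helper is defined only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 separates the two groups (small scale),
# feature 1 is uninformative noise on a much larger scale.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)
feature_0 = rng.normal(loc=group * 10.0, scale=1.0)
feature_1 = rng.normal(loc=0.0, scale=1000.0, size=200)
X = np.column_stack([feature_0, feature_1])

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

def agreement(labels, truth):
    """Fraction of points matching the true groups, up to label swapping."""
    acc = np.mean(labels == truth)
    return max(acc, 1 - acc)

print("raw:   ", agreement(labels_raw, group))
print("scaled:", agreement(labels_scaled, group))
```

On the raw features, the large-scale noise column dominates the distance computation, so the clusters would be expected to follow that column and not the true groups. After standardization both features contribute comparably.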

This is just a basic example of how K-means works. You can experiment with different parameters and techniques to better fit your dataset.
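One parameter worth experimenting with is the number of clusters. A possible approach, shown here as a sketch and not as the only method, is to fit K-means for several values of k and compare the silhouette score from sklearn.metrics, where higher values indicate better-separated clusters. The range of k values tried is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same synthetic data as in the walkthrough.
np.random.seed(0)
X = np.random.randn(150, 2)
X[:50] += 3
X[50:100] += 6
X[100:150] += 9

# Fit one model per candidate k and record its silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}  silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("highest silhouette at k =", best_k)
```

The silhouette score is a heuristic, so it is worth checking its suggestion against a plot of the data or domain knowledge before settling on a value of k.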
