Understanding K-Means Clustering
This article will introduce you to Unsupervised Learning and will help you gain a proper conceptual understanding of K-Means Clustering.
Have you ever noticed how systemic books in the library are arranged? Like the maths section containing all maths books and references, the physics section consists of all physics references, and similarly, all the other subjects’ sections contain their books. We can say that all the books are grouped based on their subjects and arranged in the self.
Let me take another example, you might have visited the supermarket and observed how different items are arranged in different sections. Like you will never find groceries kept between clothes or grains kept between laundry material. Every item is perfectly arranged in its specific section.
These two are real-life examples of how items can be grouped or clustered based on a certain similarity.
So, from here I can conclude that clustering means the division of items into certain categories/sections or grouping them based on certain similarities.
Before we go further, let me define the name of this algorithm first.
What is K-means Clustering?
K-means algorithm is the unsupervised machine learning algorithm in which whole data is divided into K number of clusters. Every cluster has its centroid which is calculated by averaging the data points of that cluster. But what are the criteria of clustering?
- Inertia: It is the measure of intra-cluster distances, which means how far away the datapoint is concerning its centroid. This indicates that data points in the same cluster should be well matched and similar to each other. For better clustering, the inertia value should be minimum. In contrast, if the inertia value is high, that means data points in the cluster are not similar to each other. This indicates a simple concept that for a given datapoint intra-cluster distance should always be less than inter-cluster distance.
How does K – means algorithm works?
- Initialize ‘ K’ and centroid values.
- Assign data points to the closest clusters, by calculating the Euclidean distance.
- When the clusters are formed, recompute their centroid values by calculating the average of data points.
- Repeat steps 2 & 3 until all the clusters are stable.
Cool, isn’t it? Let’s visualize these steps for better understanding.
Suppose we have a dataset, Let’s start with plotting the data points.
As we can see the given data points can be roughly divided into three clusters. Let our K value be 3. Now, let’s initialize the centroids randomly.
As we have marked our centroid values. The next step is to assign the data points to the nearest cluster. This is done by calculating the distance between the centroids and data points. Euclidean distance is a better choice when the need is to find the shortest distance.
Euclidean distance can be defined as :
The below picture shows how we can use Euclidean distance and assign clusters to the data points.
Similarly, Euclidean distance is calculated for every data point. After the formation of clusters, the next step is to update or recompute the centroid values. This is done by taking the Euclideanaverage or mean value of the data points in a particular cluster.
After updating the centroids, we need to repeat the last two steps. This is because since our centroid value has changed, there might be changes in the Euclidean distance calculated. Therefore, some data points might change their clusters as we recompute the euclidean distance and update the clusters.
This seems like this isn’t a 4 step process. It might take certain iterations to form stable clusters. So, how do we know when to stop performing these steps.
- We can stop our training when even after many iterations our centroids are stable, that is they are the same and fixed.
- We can stop our training when even after many iterations our data points remain in the same cluster. That they do not change their cluster anymore.
- We can stop our training when the fixed or maximum number of iterations is reached. This is because an insufficient number of iterations might give us poor results and hence unstable clusters.
After the formation of stable clusters. Let’s visualize how our data points look.
Looks great, isn’t it? I guess reading this blog till now, one question would have crossed your mind. If I am not wrong, the question is “How can we choose k value in case of real problems, where the dataset consists of thousands of data points?’’
Alright, don’t stress yourself. I will explain everything to you in detail.
How to choose the optimal ‘K’?
‘K’ in K-means clustering has an important role. Wrong k value might result in wrong or unstable clusters. So, choosing the optimal value of k is a tough task.
Two methods can be used for choosing the right value of K.
- Elbow method
- The Silhouette Method
Let’s understand them one by one.
In the elbow method, we plot WSS error against different values of K.
WSS error, isn’t it a new term for you? Let’s crack this.
WSS error is the Within-cluster sum of squared error.
- Take a cluster(say cluster 1); Find the distance between a data point and its centroid(within-cluster distance)
- Do this for every data point in that cluster
- Sum the values, Consider this summation as WSS1
- Find such summation for every cluster.
- Finally, sum up these values to get the WSS error.
Refer to the diagram for a, more clear understanding.
After finding the WSS error, plot it against the different values of K.
But why did we call it an elbow method?
This is because when we will plot WSS errors against different values of K. We will see the graph in the shape of an elbow. The value of K where we can see the clear arm will be our optimal K.
So, as in the above figure, our optimal k value is 4.
But sometimes we might not get a clear elbow, by this method. At that time, we can use the Silhouette Method.
The silhouette method can be considered as a better technique for finding an optimal value of K or simply a validation technique for clustering algorithms.
In this method, we need to find the average intra-cluster distance and the minimum average inter-cluster distance.
Formula for the silhouette coefficient can be written as :
s(i) =( b(i) – a(i) ) / max(b(i),a(i))
Let’s understand this formula, with the help of a diagram
The range of silhouette coefficient is (-1,1)
If the coefficient value is closer to 1, that means the point is similar to all other points in the same cluster.
If the coefficient value is closer to -1, that means the point is highly dissimilar with others of its cluster and assigned in the wrong cluster
After calculation of the silhouette coefficient, we can plot the silhouette score against different k values.
The value of K at which there is a peak in the graph is considered as optimal k.
It can be visualized as :
I hope you have understood these two methods. There are even more such methods to find the optimal value of k in clustering algorithms. But these two are most commonly used. You can use the elbow method to find the optimal k, but if the result is not clear from this method then you can try to validate it with the silhouette method.
Now as we have learned the whole concept behind k-means clustering. Let’s learn how to implement this through python’s in-built libraries.
I am taking a very simple dataset here, with a few rows and columns, for better understanding. So, let’s get started. You can find the dataset here.
Scaling the data
Visualizing the data
Using Elbow method to find optimal ‘k’
Fit the K-means model
Visualizing the final clusters
So this is how the K-means clustering algorithm works.
In this blog, we have learned how the K-means algorithm works, what is the clustering criteria for this, how to choose the optimal K, how to validate the K(silhouette method), and at last we have learned how to implement this algorithm with the help of in-built libraries.
I have used a very simple dataset, remember real problems will never be like this. Problem dataset may contain many columns and rows. There you will be required to choose features that are relevant in finding the final predictions. This was just an example to make you understand how things work here. So, try to build your own k-means model with a different dataset.
Thanks for reading!