The Ultimate Guide to Clustering in Machine Learning
A Quick Review Guide That Explains Clustering, an Unsupervised Machine Learning Technique, Along with Some of the Most Widely Used Clustering Algorithms, All in Under 20 Minutes.
When it comes to solving real-world problems via Machine Learning, a lot of the problems involve data that is not labeled.
This means that the data doesn’t have a target variable (say, a price, or the class to which a data instance belongs) that our model aims to predict. Rather, for such problems, the aim is generally to group the data instances into different categories.
The primary objective of this grouping operation is to uncover hidden trends or relationships within the data that might not be directly visible to the naked eye. Such problems where we work with unlabeled data are known as unsupervised machine learning problems.
You’ll get a better gist of this with the help of an example.
Let’s consider a hypothetical situation where you are working with a dataset that contains a store’s sales data: the items sold in each sale, the profit/loss made per sale, along with basic information about the customer, such as their age and gender.
Now, your job is to extract insights from within the data regarding the different customer purchase patterns, for example, what age group buys a certain kind of product more, which age group is spending more on their purchases, what flavor of a soft drink sells more, etc.
Such a problem, where we divide a customer base into several groups based on factors such as age, gender, spending habits, etc., is known as Customer Segmentation, and it is, in fact, one of the most popular unsupervised machine learning problems. As you can see, there is no fixed target variable that our model is trying to predict.
In such problems, the more traditional machine learning methods such as classification or regression are rendered useless, since the data doesn’t have a target variable that these algorithms can predict. So, another category of machine learning algorithms comes to the rescue: clustering.
Clustering is an unsupervised machine learning technique where data points are clustered together into different groups based on the similarity of their features. These groups are known as clusters.
The underlying principle of clustering algorithms is that points that share many similarities end up close together in the same cluster, while points that are highly dissimilar in terms of their features end up far apart in separate clusters.
So, how exactly does this clustering of data into separate groups benefit us? Well, once the clustering process is completed, the individual clusters can be analyzed to develop useful insights from the data. For example, one can determine which features the data points within a cluster share, or which features tend to drive two data instances apart into different clusters.
This is where clustering deviates from techniques like regression and classification. The aim is not to predict a quantity or a class. Rather, clustering is more of an exploratory tool that can help uncover trends and relationships within the data.
Now that we know what clustering is, let us have a look at some of the most commonly used clustering algorithms in machine learning. We will understand each of these clustering algorithms one by one while analyzing their strength and weaknesses. We will also have a look at the implementation of these algorithms in Python using the popular PyData library Scikit-Learn.
So, let’s get started.
K-Means Clustering is one of the simplest clustering algorithms to implement. The clustering process involves randomly initializing ‘k’ centroid points and then grouping the data instances into clusters based on their proximity to these centroids: each data instance is assigned to the cluster of the centroid closest to it. The algorithm then keeps optimizing the positions of the centroids until either convergence is achieved, i.e., the data points are grouped in such a way that the loss is at a minimum, or the model exhausts its maximum number of training iterations (generally a hyperparameter set by the researcher).
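The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not a production implementation: centroids are initialized by sampling data points, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """A bare-bones k-means: random centroid init, then an assign/update loop."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that the loop may stop at a local minimum of the loss; which minimum it finds depends on the random initialization, which is exactly the reproducibility issue discussed below.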
Now that we have a rough idea regarding how the K-Means Clustering algorithm works, let us have a look at the advantages and the disadvantages of this algorithm.
- The K-Means algorithm is relatively easy to implement compared to some of the other, more complex clustering algorithms.
- The algorithm is highly scalable, i.e., it can handle a large number of data instances, which makes it an ideal choice for very large datasets.
- The algorithm is highly adaptable: it easily adjusts to new examples and variations within the data.
- The K-Means algorithm is guaranteed to converge (though not necessarily to the globally optimal clustering).
- One major drawback of the k-means algorithm is that the value of the ‘k’ hyperparameter needs to be set manually. This means an assumption must be made about the number of groups/clusters present in the data, and the results may vary vastly based on the number of clusters the researcher assumes.
- Convergence depends on the initialization of the k random centroid points: the final converged centroid locations may vary depending on the starting locations at which the centroids were initialized. Thus, the results are often non-reproducible unless the initialization is seeded.
- The K-means algorithm is very sensitive to outliers. Outliers within the data may cause the centroids to converge away from their ideal positions, or, in the worst case, the outliers may end up forming a separate cluster of their own. Hence, to use k-means clustering, the outliers in the data need to be dealt with first, which can be a time-consuming and hence expensive process.
- The K-means algorithm doesn’t work well with high-dimensional data, since distance measures become less informative as the number of dimensions grows.
Now that we know the advantages and disadvantages of the k-means clustering algorithm, let us have a look at how to implement a k-means clustering machine learning model using Python and Scikit-Learn.
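A minimal sketch using Scikit-Learn’s `KMeans` class on a synthetic dataset. The dataset and the hyperparameter values here are purely illustrative; on real data you would tune `n_clusters` yourself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A toy dataset with 3 well-separated groups of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the 'k' hyperparameter discussed above; n_init re-runs the
# algorithm with several different centroid initializations and keeps the
# best result, which mitigates the sensitivity to initialization
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels[:10])             # cluster index for each of the first 10 points
print(model.cluster_centers_)  # final centroid coordinates
```

Setting `random_state` makes the run reproducible, addressing the non-reproducibility drawback listed earlier.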
One of the most powerful clustering algorithms, Density-Based Spatial Clustering of Applications with Noise, commonly known as DBSCAN, has several advantages over regular algorithms like k-means clustering.
One notable advantage that DBSCAN grants is that, unlike k-means clustering, it does not assume circular clusters. Rather, the data points can be grouped into clusters of arbitrary shapes. The clustering process is as follows:
The algorithm starts with an arbitrary, unvisited data point. The neighborhood of this point is analyzed via a distance measure (typically, all points within a given radius count as neighbors).
If there are enough points in the vicinity of this data point (at least a minimum number, set as a hyperparameter), then this point and all of its neighboring points are assigned to a cluster.
In case there aren’t enough data points in the vicinity of this selected data point, it is categorized as an outlier.
This process of selecting points and growing clusters continues until every point has either been assigned to a cluster or been marked as an outlier.
Now that we have a rough idea regarding how DBSCAN works, let us have a look at its advantages and disadvantages.
- One primary advantage of the DBSCAN algorithm is that, unlike K-means clustering, it does not require the number of clusters to be fixed in advance. Rather, the algorithm itself determines the number of clusters present in the data.
- DBSCAN is very robust to outliers within the data. This makes it an ideal choice for clustering when the data has too many outliers and noise.
- Unlike K-means, which assumes roughly circular clusters, DBSCAN allows clusters of arbitrary shape, which suits real-world data well.
- DBSCAN is ideal for large-scale datasets. However, when the data is sparse, i.e., when there aren’t many data instances, DBSCAN tends to perform poorly.
- DBSCAN does not give very promising results when the dataset contains clusters of varying densities, since a single neighborhood-radius setting cannot fit all of the densities at once.
Finally, let us have a look at the implementation of the DBSCAN algorithm in Python.
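Here is a minimal sketch using Scikit-Learn’s `DBSCAN` class on the classic two-moons dataset, whose crescent-shaped clusters k-means cannot separate cleanly. The `eps` and `min_samples` values are illustrative and would need tuning on other data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-circular clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the minimum number of
# neighbors a point needs to seed a cluster
model = DBSCAN(eps=0.3, min_samples=5)
labels = model.fit_predict(X)

# DBSCAN decides the number of clusters itself; points labeled -1 are outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Note that there is no `n_clusters` parameter here: the algorithm discovers both moons on its own, in line with the first advantage listed above.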
Hierarchical clustering, or hierarchical agglomerative clustering (HAC), is another popular clustering algorithm. The way this algorithm works differs slightly from the other two we saw earlier. HAC works in the following way.
The algorithm initiates the clustering process by treating every single data instance as a separate cluster. Then, based on the similarity between these initially assigned clusters, the algorithm recursively merges the two closest clusters until all the individual clusters have been merged into a single cluster. Hierarchical clustering is especially useful when it comes to extracting relationship patterns from within the data.
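The merge hierarchy described above can be inspected directly with SciPy’s hierarchical-clustering utilities. A small sketch on six hand-picked points (the points and the choice of ‘ward’ linkage are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two obvious groups
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])

# Build the full merge hierarchy; 'ward' linkage merges the pair of
# clusters that least increases the within-cluster variance
Z = linkage(X, method="ward")

# Each row of Z records one merge: (cluster i, cluster j, distance, new size)
print(Z)

# Cut the hierarchy to obtain a flat assignment into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` is the “relationship pattern” the paragraph above refers to: it records the entire sequence of merges, which can also be visualized as a dendrogram.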
Now that we have a rough idea regarding how HAC works, let us have a look at the advantages and the disadvantages associated with the algorithm.
- Unlike the other clustering algorithms, the HAC algorithm unravels a hierarchy within the data, which is more insightful than a flat set of clusters.
- Generally, the algorithm doesn’t need any prior information or assumption regarding the number of clusters.
- HAC is relatively easy to implement.
- Merges in HAC are greedy and irreversible: once two clusters have been merged, the decision cannot be revisited, so an early bad merge propagates through the entire hierarchy.
- With a time complexity of O(n² log n), the algorithm is comparatively slow and doesn’t scale well to large datasets.
- The HAC algorithm is sensitive to outliers. Thus, if your dataset has a large number of outliers, then, in that case, the performance of the model might degrade considerably.
Now, let us have a look at the implementation of a HAC model using Python and Scikit-Learn.
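A minimal sketch using Scikit-Learn’s `AgglomerativeClustering` class on synthetic data (the dataset and parameter values here are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# A toy dataset with 3 well-separated groups of points
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Starting from one cluster per point, 'ward' linkage repeatedly merges the
# pair of clusters that least increases the within-cluster variance,
# stopping once n_clusters remain
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

print(labels[:10])  # cluster index for each of the first 10 points
```

Here `n_clusters` simply tells Scikit-Learn where to cut the hierarchy; leaving it out and setting a `distance_threshold` instead lets the data determine the number of clusters.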
With this, we reach the end of our clustering guide. To sum things up, first, we understood what clustering is, followed by a quick analysis of some of the classic clustering algorithms used in machine learning.
If you want to continue this learning journey, we have a similar quick-review guide for Classification in Machine Learning.