Machine Learning Algorithms: K-Nearest Neighbours Detailed Explanation
KNN is one of the basic supervised learning algorithms, used mainly for classification. By the end of this article, you will be able to start applying KNN to various datasets.
I guess most of you have already used Netflix. Do you remember how Netflix suggests movies and series you might be interested in? These suggestions might be similar in genre, actors, or some other factor to the movies and series you have watched earlier. This type of suggestion system is called a recommendation system.
A recommendation system automatically suggests items you may like based on your earlier activity or purchases. This is not specific to Netflix; many companies use recommendation systems in their websites and apps to recommend books, movies, shopping items, or other products to users.
You might be wondering how these recommendation systems are created. One simple way to build this kind of recommendation system is with a machine learning algorithm called K-Nearest Neighbors.
Cool, now you know one of the ideas behind recommendations like the ones Netflix shows you. So let’s study more about this simple yet interesting algorithm.
Concept behind K-NN
Suppose there are two groups in your class. One group consists of studious, sincere frontbenchers. The other consists of naughty, fun-loving backbenchers. Now two new students, “X” and “Y”, join your class. “X” always stays with the studious group, while “Y” stays with the backbenchers. All of a sudden, your principal asks your class teacher for a report on the new students. Since the teacher has not had enough time to get to know them, this is a hard question for her to answer. So she describes the new students with reference to the students they spend time with. The teacher reports, “X is a studious student and pays attention in class. Y is a fun-loving and naughty student, always ready for co-curricular activities.”
Okay, if you got what I am trying to say through this example, then congrats, you have learned the concept behind KNN. To conclude, the K-NN algorithm predicts the value of a data point by looking at its nearest neighbors.
You might be wondering, is this algorithm really this easy? The answer is yes! This is the core concept behind K-NN.
Don’t worry, I will explain the technical definition too.
What is K-NN?
K-Nearest Neighbors (K-NN) is a supervised machine learning algorithm that takes a bunch of labeled points and uses them to label other points. It classifies cases based on their similarity to other cases.
The K-NN algorithm is mostly used for classification problems, although it can also be used for regression.
The ‘k’ in KNN is a parameter that refers to the number of nearest neighbors that take part in the majority vote. For example, if k=5, the 5 nearest neighbors are considered while predicting.
Let me explain to you the two properties of K-NN, which will help to define this algorithm easily:
- Lazy learning algorithm: unlike most other machine learning algorithms, K-NN has no training phase (or only a minimal one). It simply stores the training dataset and makes predictions from it directly.
- Non-parametric learning algorithm: K-NN does not assume anything about the underlying distribution of the data. Predictions are based on the similarity between a new data point and its k nearest neighbors.
How does K-NN work?
Below are a few steps which explain the working of K-NN:
- Load the training and testing data.
- Choose the integer value of ‘k’ (the number of nearest neighbors to be considered).
- Calculate the distance between the test data point and each row of the training data.
- Sort the rows in ascending order of distance.
- Choose the top K rows from the sorted array (the ones with the shortest distances).
- Assign the new data point to the category or class that has the most neighbors among these K rows. In the case of regression, the mean of these k labels is used instead.
- Your K-NN model is ready!
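The steps above can be sketched in plain Python as follows. This is only a minimal illustration, not an optimized implementation: the function names `k_nearest_labels`, `knn_classify` and `knn_regress` and the toy points are made up for this example, and it assumes Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


def k_nearest_labels(train_points, train_labels, query, k):
    """Return the labels of the k training points closest to `query`."""
    # Euclidean distance from the query to every training row
    pairs = [(math.dist(p, query), label) for p, label in zip(train_points, train_labels)]
    # Sort ascending by distance and keep the k closest rows
    pairs.sort(key=lambda pair: pair[0])
    return [label for _, label in pairs[:k]]


def knn_classify(train_points, train_labels, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(k_nearest_labels(train_points, train_labels, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_points, train_values, query, k=3):
    """Regression: mean of the k nearest target values."""
    nearest = k_nearest_labels(train_points, train_values, query, k)
    return sum(nearest) / len(nearest)


# Toy data, invented purely for illustration
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["blue", "blue", "blue", "green", "green", "green"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(points, labels, (2.0, 2.0), k=3))  # blue
print(knn_classify(points, labels, (6.0, 6.5), k=3))  # green
print(knn_regress(points, values, (2.0, 2.0), k=3))   # 2.0
```

Note that nothing is "trained" here: every prediction scans the whole training set, which is exactly why K-NN is called a lazy learner.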
Let’s understand this with the help of a diagram:
Suppose we have a dataset that has two categories.
Category 1 = blue points, Category 2 = green points.
We want to predict which category the new data point (red) belongs to.
This can be done as follows:
- We will plot all the data points of the dataset & also the new data point.
- Let me choose k = 5 nearest neighbors.
- We will now look for the 5 points nearest to our new data point.
- We will note down which categories these nearest points belong to.
From the diagram, it is clear that red’s nearest neighbors consist of 2 blue (category 1) and 3 green (category 2) points.
Most of the neighbors belong to category 2, so we will assign the red point to category 2.
So, these are the steps we need to work through. You might have a few questions by now, such as: how do you choose the k value? How do you calculate the distance? Let’s answer these questions one by one.
How to choose the ‘K’ value?
The performance of your model depends on the “K” value you choose, so this is the most important part of the K-NN algorithm.
- If your k value is too small, the model becomes sensitive to noise and there is a chance of overfitting.
- If your k value is too high, the model smooths over the class boundaries and there is a chance of underfitting.
- ‘K’ should preferably be odd, because classes are assigned by “majority voting” and an even value of ‘K’ can end in a tie (for example, when there are two classes).
How to choose the ‘K’ value by considering the above points?
- Choose a range for k: the minimum value of k can be 1 and the maximum can be the number of data points in the dataset.
But very large values of k rarely help and are expensive to evaluate, so we generally take a range like (1, 40) or (1, 50). There is no fixed rule; you can choose a range of your choice.
- For each value of K in this range, we will fit our KNN model.
- We will calculate the accuracy or error corresponding to each K value and plot it.
- The K value that gives the best accuracy, the most stable results, or the minimum error is the one we finally choose.
- Note: We will use the cross-validation technique for the previous two steps. Cross-validation gives us the accuracy of the model for each K, and we can find the error by subtracting the accuracy from 1.
Moving further, let’s jump to the next question.
How to calculate the distance?
The distance can be calculated using any of the below-mentioned methods:
- Euclidean distance
- Manhattan distance
- Hamming distance
- Minkowski distance
Euclidean distance is the most commonly used of these and usually works well for continuous features, so I am only going to explain Euclidean distance in this blog.
Euclidean distance is calculated by taking the square root of the sum of the squared differences between the coordinates of two data points.
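Written out for two points p and q with n features, the Euclidean distance is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2). Here is a small sketch of that formula in Python, with Manhattan distance next to it for comparison. The function names and the two sample points are just illustrative choices.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0  -> sqrt(3^2 + 4^2)
print(manhattan(p, q))  # 7    -> 3 + 4
```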
Now let us see some advantages and disadvantages of the K-NN algorithm.
Advantages
- This algorithm is easy to implement.
- This algorithm does not need a training phase.
- Since there is no training phase, new data points can be added at any time without retraining the model.
Disadvantages
- For large datasets, the computational cost is high, as the distance to so many data points has to be calculated for every prediction.
- Feature scaling is needed to get correct predictions, because features with larger ranges otherwise dominate the distance.
- Determining the K value can sometimes be a complex process.
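To see why feature scaling matters, here is a small sketch with made-up (age, income) rows. Without scaling, income dominates the distance simply because its numbers are bigger. After min-max scaling, both features contribute equally and the nearest neighbor changes. The helper names `min_max_scale` and `nearest_index` are hypothetical, and the code assumes every column has at least two distinct values (otherwise the scaling divides by zero).

```python
import math


def min_max_scale(rows):
    """Rescale every column to [0, 1]; also return each column's (min, max)."""
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    scaled = [
        tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(row, bounds))
        for row in rows
    ]
    return scaled, bounds


def nearest_index(rows, query):
    """Index of the row with the smallest Euclidean distance to `query`."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], query))


# Hypothetical (age in years, income in dollars) rows
people = [(25, 50000), (60, 50500), (30, 90000), (55, 20000)]
query = (26, 50400)

# Unscaled: the 60-year-old wins because the incomes are close
print(nearest_index(people, query))  # 1

# Scaled: the 25-year-old wins because age now counts too
scaled_people, bounds = min_max_scale(people)
scaled_query = tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(query, bounds))
print(nearest_index(scaled_people, scaled_query))  # 0
```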
Now that we are done with the conceptual part, let’s see how this algorithm is implemented in Python. For this, we will take the very common “Iris” dataset and build a K-NN model.
Python has libraries, such as scikit-learn, that make it easy to use these kinds of algorithms. That means we need not write the distance calculations between data points ourselves. These libraries already contain the methods and functions required for the algorithm. We just need to fit our model on the dataset and find the optimal value of ‘K’.
Dividing data into features and labels
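A minimal sketch of this step, assuming scikit-learn is installed. The 80/20 split, the `random_state` value and the variable names are my own choices for illustration, not necessarily the ones used originally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data    # features: sepal and petal length/width, shape (150, 4)
y = iris.target  # labels: 0, 1, 2 for the three species, shape (150,)

# Hold out 20% of the rows for testing (split ratio and seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```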
Fit the KNN model
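Fitting could look roughly like this with scikit-learn's `KNeighborsClassifier`. The value k = 5 is only an arbitrary starting point here (we tune it in the next step), and the split settings are assumptions made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# k = 5 is just a starting guess, not a tuned value
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "fitting" mostly stores the training data

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy with k=5: {accuracy:.3f}")
```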
Using cross-validation to tune parameters
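One way to do this is with `cross_val_score`, trying every k in a range and recording the error (1 minus the mean accuracy). The range 1 to 40, the 10 folds and the split settings below are assumptions for this sketch, so the best k you get may differ from the one reported in this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

k_values = list(range(1, 41))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validation on the training data only
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="accuracy")
    errors.append(1 - scores.mean())

# The k with the lowest cross-validated error
best_k = k_values[errors.index(min(errors))]
print(f"Lowest cross-validated error {min(errors):.3f} at k = {best_k}")
```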
Plotting to get optimal value of K
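A possible way to draw the error curve with matplotlib is sketched below. It recomputes the cross-validated errors so that the snippet runs on its own. The k range, fold count, split settings and axis labels are assumptions made for this sketch.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Cross-validated error (1 - mean accuracy) for each candidate k
k_values = list(range(1, 41))
errors = [
    1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=10).mean()
    for k in k_values
]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(k_values, errors, marker="o")
ax.set_xlabel("Value of K")
ax.set_ylabel("Cross-validated error (1 - accuracy)")
ax.set_title("Error rate vs. K")
plt.show()
```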
From the above graph we can see that the error is lowest at k = 11. We can also get this value directly by adding two lines to the above code.
Re-evaluating the model
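Here is a sketch of refitting with the chosen value and checking it on the held-out test set. It uses k = 11 as reported in this article; the split settings and the sample flower measurements are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Refit with the k chosen from the error curve
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy with k=11: {accuracy:.3f}")

# Predict a new flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[knn.predict(new_flower)[0]]
print(predicted)  # setosa
```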
Now our KNN model is ready. We can use it for predicting new data points.
So, this is the end of the blog. I tried to simplify the content as much as I could. Try building your KNN model with the steps and tips I suggested above.
Thanks for reading!