Random Forests in Machine Learning: A Detailed Explanation
Random forest is a popular and easy-to-use machine learning algorithm based on ensemble learning (combining multiple classifiers into a single, more effective model). In this article, you will learn how the algorithm works, how it compares to other algorithms, and how to implement it.
What is Random Forest in Machine Learning?
Random forest is a supervised machine learning algorithm that can solve both classification and regression problems, though it is most often used for classification. It is called a random forest because it combines multiple decision trees into a “forest” and feeds each tree a random subset of features and samples from the provided dataset. Instead of depending on a single decision tree, the random forest collects predictions from all the trees and selects the final outcome through voting.
Now, why do we prefer random forests over decision trees? Individual trees are prone to overfitting; a random forest reduces this problem by aggregating the predictions of many trees.
Note: Increasing the number of trees in the forest generally stabilizes predictions and reduces overfitting, but the accuracy gain levels off after a point; beyond that, more trees mainly cost extra training time.
With the basic idea of random forests in place, let’s dive in and understand how the algorithm works.
How Random Forest Works
We can understand how a random forest works through the following steps:
- Draw random samples (with replacement) from the provided dataset.
- Build a decision tree for each sample, then obtain a prediction from every tree.
- Hold a vote over the predicted results.
- In the end, the algorithm chooses the prediction with the majority of votes.
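The steps above can be sketched in a few lines of Python. This is a minimal, hand-rolled illustration of bagging with majority voting on synthetic data (the sample count and tree count are arbitrary choices), not scikit-learn's optimized implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for a binary classification problem
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # grow 25 trees, each on a bootstrap sample
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote: each tree predicts, and the most common class wins
votes = np.stack([t.predict(X) for t in trees])        # shape (25, 500)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority for 0/1 labels
print(forest_pred[:10])
```

Drawing a random subset of features at each split (`max_features="sqrt"`) is what decorrelates the trees and makes the averaged vote more robust than any single tree.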
Python has various libraries built specifically for implementing machine learning algorithms; by importing them, we avoid writing the complex parts ourselves. Now, let’s discuss the problem on which we will apply the random forest algorithm. We will build a random forest classifier using the Pima Indians diabetes dataset. The task is to predict the onset of diabetes within 5 years based on the provided dataset. It is a classification problem.
Our goal is to build and analyze a model on the Pima Indians diabetes dataset that predicts whether a given person is likely to develop diabetes.
Let’s import the packages that will be helpful to load the dataset and create a random forest classifier.
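A typical set of imports for this workflow looks like the following (the exact list is a reasonable assumption, since any pandas/scikit-learn tutorial of this shape needs roughly these pieces):

```python
# Core packages for loading the data and building the classifier
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```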
The provided dataset has 8 input features and 1 target feature (Outcome).
Now it’s time to split the dataset into the independent features and the target. Here, our target feature is Outcome.
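A sketch of this step is below. Since the CSV isn't bundled here, the code builds a small stand-in DataFrame with the real Pima column names; in practice you would load your local copy instead (the `diabetes.csv` filename is an assumption):

```python
import numpy as np
import pandas as pd

# In practice: df = pd.read_csv("diabetes.csv")  # hypothetical filename
# Stand-in frame with the Pima column layout: 8 inputs + 1 target
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((768, 8)), columns=cols)
df["Outcome"] = rng.integers(0, 2, size=768)

# Independent features: every column except the target
X = df.drop(columns="Outcome")
y = df["Outcome"]
print(X.shape, y.shape)
```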
Feature scaling standardizes the input features to a common range so that features with large magnitudes don’t dominate the model. It can be done with StandardScaler from scikit-learn. Note that tree-based models such as random forests are largely insensitive to feature scaling, so this step is optional here; it is harmless, though, and a good habit for pipelines that mix model types.
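A minimal scaling sketch on stand-in data: after `fit_transform`, every column has mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((768, 8)) * 100  # stand-in features on a large scale

# Fit the scaler and transform: each column gets mean 0 and variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~0 for every column
```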
Split the dataset into training and testing data
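The split is one call to `train_test_split`; the 25% test fraction and the random seed below are illustrative choices, not values prescribed by the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((768, 8))          # stand-in features
y = rng.integers(0, 2, size=768)  # stand-in binary target

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)
```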
Building the Random Forest Classifier
Now let’s create our random forest classifier and train it on the training set. We can specify the number of trees in the forest with the n_estimators parameter.
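Creating and fitting the classifier looks like this (100 trees is scikit-learn's default for `n_estimators`; the training data here is a synthetic stand-in):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_train = rng.random((576, 8))          # stand-in training features
y_train = rng.integers(0, 2, size=576)  # stand-in training labels

# n_estimators sets the number of trees in the forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(len(clf.estimators_))  # number of fitted trees
```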
The output above shows the parameters the classifier used during training.
Prediction on the test data
Since the training part is done, let’s check how well our model performs on the test data.
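Evaluation is a `predict` followed by `accuracy_score`. The end-to-end sketch below uses a synthetic dataset with real signal (via `make_classification`) so the accuracy number is meaningful; expect different figures on the actual Pima data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with learnable structure
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Score the model on the held-out rows only
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```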
We get an accuracy above 75%, which is reasonable. However, we can improve it further by keeping only the features that contribute most to predicting the target variable (Outcome).
Finding Important Features
One of the most useful qualities of the random forest is that it reports the importance of each feature for its predictions. By measuring feature importance, we can easily identify features that contribute little and drop them.
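A fitted `RandomForestClassifier` exposes this through its `feature_importances_` attribute; pairing it with the column names makes it readable. The data and names below are a stand-in with the Pima column layout:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data, labeled with the Pima column names
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X, y = make_classification(n_samples=768, n_features=8, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ is normalized: the values sum to 1.0
importances = pd.Series(clf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```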
Visualizing the importances makes it easier to see which features drive the predictions.
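One common way to plot them is a horizontal bar chart with matplotlib (the figure filename below is arbitrary, and the data is again a synthetic stand-in):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Sort ascending so the most important feature lands at the top of the chart
importances = pd.Series(clf.feature_importances_, index=cols).sort_values()
fig, ax = plt.subplots(figsize=(6, 4))
importances.plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("Random forest feature importances")
fig.tight_layout()
out = Path("feature_importances.png")
fig.savefig(out)
```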
From the figure above, we can see that Insulin contributes little to the prediction, so we can drop that feature and train our classifier again to improve the model’s performance.
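A sketch of the drop-and-retrain step, on the same synthetic stand-in: it removes whichever column the fitted forest ranks least important (for the real Pima data, the article identifies Insulin) and fits again. On synthetic or real data alike, the change in accuracy should be verified rather than assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
base_acc = accuracy_score(y_test, clf.predict(X_test))

# Drop the least important column (Insulin, in the article's run)
weakest = np.argmin(clf.feature_importances_)
X_train_r = np.delete(X_train, weakest, axis=1)
X_test_r = np.delete(X_test, weakest, axis=1)

clf_r = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train_r, y_train)
reduced_acc = accuracy_score(y_test, clf_r.predict(X_test_r))
print(f"all features: {base_acc:.3f}  reduced: {reduced_acc:.3f}")
```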
We can see that the model’s accuracy increased after removing the less important feature. In general, it is worth checking feature importances and pruning weak features, though the gain should always be verified on held-out data.
Application of Random Forest
Random forest is an easy-to-use algorithm widely applied in banking, marketing, and medicine.
Advantages of Random forest
The algorithm has the following advantages:
- It can solve both regression and classification problems.
- It helps mitigate overfitting by averaging the predictions of many decision trees.
- It handles large datasets well.
This is it for this article. Don’t forget to upvote if you have enjoyed and learned something new from this article.
Thanks for reading!