Complete Analysis of Gradient Descent Algorithm

In this article, I will give a complete analysis of the gradient descent algorithm. Gradient descent is an optimization algorithm used to minimize the loss function in many machine learning algorithms.

I discussed the assumptions of logistic regression and cross-entropy loss in my previous articles.

Gradient descent is used when training machine learning models, can be combined with almost any algorithm, and is easy to understand and implement. It is essentially used for tuning the parameters of the learning model. For a data scientist, it is important to have a solid understanding of the gradient descent algorithm. Let us discuss it in more detail.

What is the Gradient Descent algorithm?

The gradient descent algorithm is an optimization algorithm used to reduce the cost function. The function to be minimized is called the objective function; here, it is the loss function that is optimized (minimized).

Gradient descent is used to find the optimal values of the parameters/weights that minimize the loss function. It assumes a convex function and tweaks its parameters iteratively to drive the given function to its local minimum.

The loss function, simply put, measures something like the squared difference between actual values and predicted values. To reduce the objective function, we look for the optimal values of its parameters.

The gradient determines to what extent the output of a function changes when we change the inputs. It measures the change in all weights with respect to the change in error. We can also think of the gradient as the slope of a function.

The higher the gradient, the steeper the slope, and the faster the model can learn. If the slope is zero, the model stops learning. Mathematically, the gradient is the partial derivative of the function with respect to its inputs.
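As a quick sanity check, the partial derivatives of a simple function can be approximated numerically. The function below is just an illustrative example, not one from the article:

```python
# For f(x, y) = x**2 + 3*y, the gradient is (df/dx, df/dy) = (2*x, 3),
# so at the point (2, 1) it should be close to (4, 3).

def f(x, y):
    return x**2 + 3*y

def numerical_gradient(f, x, y, h=1e-6):
    # Central-difference approximation of each partial derivative.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

dfdx, dfdy = numerical_gradient(f, 2.0, 1.0)
print(round(dfdx, 4), round(dfdy, 4))  # close to 4.0 and 3.0
```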

Imagine a blindfolded man who wants to reach the top of a hill with as few steps as possible. He may start climbing the hill by taking really big steps in the steepest direction, which works as long as he is not near the peak.

As he gets closer to the peak, his steps will get smaller and smaller to prevent overshooting it. This process can be explained mathematically using the gradient.

Instead of climbing up a hill, think of gradient descent as walking down to the bottom of a valley. This is a more useful analogy because gradient descent is a minimization algorithm that minimizes a given objective function.

The equation below describes what gradient descent does:

b = a − γ ∇f(a)

b denotes the next position to move to, whereas a represents the present position. The minus sign reflects the minimization aspect of gradient descent. The gamma in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) points in the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.

This formula gives us the next position to go to, which is in the direction of steepest descent.
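Plugging illustrative numbers into this update makes it concrete. The function, learning rate, and starting point below are all assumptions for the sketch:

```python
# One gradient-descent step b = a - gamma * grad_f(a) for f(a) = a**2,
# whose gradient is 2*a. All values are illustrative.

def grad_f(a):
    return 2 * a              # derivative of f(a) = a**2

a = 3.0                       # present position
gamma = 0.1                   # weighting factor (learning rate)
b = a - gamma * grad_f(a)     # next position
print(b)                      # 3.0 - 0.1 * 6.0 = 2.4
```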

The following two points simply explain the working of gradient descent:

• Initialize the parameters with some random values.
• Keep altering these values iteratively so that they minimize the objective function.

Usually, we tend to look for a closed-form formula that gives us the optimal values of the parameters. This algorithm instead finds the optimal values by itself, iteratively.
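The two steps above can be sketched in a few lines of Python. The toy one-parameter function and learning rate are illustrative choices, not part of the article:

```python
# Minimize f(w) = (w - 4)**2 with plain gradient descent:
# start from a random value, then repeatedly step against the gradient.
import random

def grad(w):
    return 2 * (w - 4)            # derivative of (w - 4)**2

w = random.uniform(-10, 10)       # step 1: random initialization
lr = 0.1                          # learning rate
for _ in range(200):              # step 2: iterative updates
    w -= lr * grad(w)

print(round(w, 4))                # converges to the minimizer w = 4
```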

There are different types of gradient descent algorithms. Let us discuss them one by one.

• Batch Gradient Descent: This is the primary, most basic type of gradient descent, in which we use the complete dataset available to calculate the gradient of the cost function.

Also, it is unmanageable for datasets that don't fit in memory. After initializing the parameters with random values, we determine the gradient of the cost function using the standard update rule over all samples at once:

θ = θ − (α/m) · Σ ∇J(θ; xᵢ, yᵢ)

where m is the number of training samples and α is the learning rate.
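As a sketch, batch gradient descent for a simple linear regression with mean squared error might look like this. The data, learning rate, and iteration count are illustrative assumptions:

```python
# Batch gradient descent for simple linear regression (assumed MSE loss):
# every update uses all m training samples.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated by y = 2x + 1
m = len(xs)

w, b = 0.0, 0.0
lr = 0.05
for _ in range(5000):
    # Gradients of (1/m) * sum((w*x + b - y)**2) over the full dataset.
    grad_w = (2 / m) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / m) * sum((w * x + b - y) for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))    # approaches w = 2, b = 1
```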

• Stochastic Gradient Descent: This is a modified version of batch gradient descent that processes one training sample per iteration. That makes each iteration much faster than in batch gradient descent. However, if the number of training samples is large, the number of iterations becomes quite large as well, which can add extra overhead for the system.
• Mini-Batch Gradient Descent: In the mini-batch algorithm, instead of using the entire dataset, in every iteration we use a small batch of training examples, termed a mini-batch, to calculate the gradient of the cost function. Commonly, the mini-batch size ranges between 50 and 256, but it can vary according to the application.

Hence, this algorithm:

• reduces the variance of the parameter updates, which leads to more stable convergence.
• can make use of highly optimized matrix operations, which makes computing the gradient very efficient.
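A corresponding mini-batch sketch for the same kind of toy linear regression follows. The batch size, learning rate, and synthetic data are again illustrative assumptions:

```python
# Mini-batch gradient descent: each update uses a small random batch
# instead of the full dataset.
import random

random.seed(0)
xs = [i / 100 for i in range(100)]
ys = [2 * x + 1 for x in xs]            # synthetic data on the line y = 2x + 1
data = list(zip(xs, ys))

w, b = 0.0, 0.0
lr = 0.3
batch_size = 10                          # illustrative mini-batch size
for _ in range(2000):
    batch = random.sample(data, batch_size)   # draw one mini-batch
    grad_w = (2 / batch_size) * sum((w * x + b - y) * x for x, y in batch)
    grad_b = (2 / batch_size) * sum((w * x + b - y) for x, y in batch)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))          # approaches w = 2, b = 1
```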