Artificial Neural Networks (ANN): A Detailed Explanation
Today I will be discussing Artificial Neural Networks. Our main focus will be on learning the architecture and how it works, and we will also look a little at how the calculations are done behind the scenes.
Before starting the discussion, let me point out that RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), Radial Basis Function Neural Network, and all other neural networks are types of Artificial Neural Network, or ANN. The points mentioned below are common to the training process of most neural network architectures, so once you go through these points and their details, you will have a general idea of how almost all neural network models get trained. But to get a detailed idea of how each model like RNN, CNN, and others works, you have to wait for my next article.
Following are the points that are common to almost every neural network model:
- The main aim of every neural network model is to reduce the loss.
- Weight updates happen in almost all neural network models using the back-propagation technique.
- There is a concept of learning rate, epochs, and batch size in every neural network model. These are hyperparameters used during the training process, and they control how the weights get optimized.
There are many important things that we need to cover in this article, so let’s get started. Below is the image of a simple neural network model. Let’s call this image figure 1, for future reference.
The above image is an example of a simple neural network with an input layer with 3 features on the left (there is always exactly one input layer), one hidden layer with 4 neurons/units in the middle, and an output layer with one neuron on the right. If more than one hidden layer is used in the model, then it is called a “Deep Neural Network”, otherwise it is simply a neural network with one hidden layer.
How does an ANN Model Work?
Now let’s discuss in detail how an Artificial Neural Network model works.
As I said earlier, the main aim of every neural network model is to reduce the loss/error. So, before training the model, we need to decide on a loss function with which the loss will be calculated, and the choice of loss function depends on the problem we are trying to solve. Also, we say “loss” when we calculate the error for a single training example, whereas the error over a batch (a subset of the training examples) or the whole training set is called the “cost”, and the function is then known as the cost function.
Some examples of loss functions are: Binary Cross-Entropy, mainly used for binary classification problems; Mean Squared Error Loss, used for regression problems; and Multi-Class Cross-Entropy Loss, used for multi-class classification problems (where each example belongs to exactly one of several classes). There are other loss functions too, but we will discuss loss functions in detail in some other article of mine, or if you are curious enough to know the details of each of the loss functions, then you can refer to other machine learning articles on this platform.
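To make the three loss functions named above a little more concrete, here is a minimal sketch in plain Python. The function names and the sample numbers are my own assumptions for illustration, and real frameworks ship their own, more careful versions of these losses.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; a common loss for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true holds 0/1 labels, y_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never happens
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_class, pred_probs, eps=1e-12):
    """Loss for one example: true_class is the index of the correct class,
    pred_probs is the predicted probability for every class."""
    return -math.log(max(pred_probs[true_class], eps))

# Hypothetical labels and predictions, only for illustration.
mse = mean_squared_error([3.0, 5.0], [2.5, 5.5])
bce = binary_cross_entropy([1, 0], [0.9, 0.2])
cce = categorical_cross_entropy(2, [0.1, 0.2, 0.7])
print(mse, bce, cce)
```

In all three cases a prediction closer to the label gives a smaller number, which is exactly what the training process tries to achieve.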
Below is the image which will help us in visualizing a loss function and its minimum point. Let’s call this figure 2 for future reference.
In the above figure, the y-axis J(w) is the value of the loss function, the x-axis is the weight w, and the curve is the loss function curve. J(w min) is the lowest point of that curve. As we know, the objective of every neural network model is to find the global minimum of the loss/cost function, as this is where the loss/cost is smallest. Now the question is how to reach this point.
There are weights associated with each connection of a neuron in each layer, be it the connections coming out of the input layer, the connections between hidden layers, or the connections from the last hidden layer to the output layer. The process starts by assigning random weights to these connections. The below image with weights will make this clear. Let’s call this image figure 3 for future reference.
In the above image, we can see a neural network with weights (consider these as random weight values). The process starts by assigning random weights to each of the connections, and then training of the neural network starts with these weights. The training can be done on the entire training set at once, on a single training example at a time, or on a small subset of the training examples at a time. As the training starts with random weights, the model is bound to make many errors at first, so the objective is to reach optimized weights for which the error is minimized. How it all happens is described below. Before proceeding, we need to know a few terms:
- Epoch – One epoch is one complete pass of the entire training dataset through the neural network model.
- Batch – Batch is nothing but a small subset of the whole data. So suppose for a data of size 1024 rows, batch size of 128 has been used, so for 1 epoch, 8 iterations are needed to pass the complete data through the neural network.
- Learning rate – The learning rate can be understood from fig. 2: it controls how big a step the weights take towards the minimum in each update, and therefore how quickly the loss goes down. In the figure, the arrows pointing in the downward direction represent these steps. If we take a larger learning rate, the arrows will be large, and the model may overshoot the minimum and fail to converge. If we take a small value, the arrows will be small, and the model will converge eventually, but it will take a longer time. It is a very important hyperparameter, and we will discuss it in detail in some of my other articles.
- Activation Function – Activation function converts the output of the previously processed layer to a form that decides whether a neuron should be activated or not. Also, it is known to add non-linearity to the model, which is helpful for the model to learn a complex function.
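The batch arithmetic and the effect of the learning rate from the terms above can be checked with a small sketch. The toy loss J(w) = w**2 and the three learning rates are my own assumptions, chosen only because the minimum (w = 0) is easy to see. This is not how a real network's loss looks.

```python
import math

# Iterations (weight updates) needed for one epoch in the example above:
# 1024 rows with a batch size of 128.
n_rows, batch_size = 1024, 128
iterations_per_epoch = math.ceil(n_rows / batch_size)

def gradient_descent(learning_rate, w_start=1.0, steps=20):
    """Minimize the toy loss J(w) = w**2, whose gradient is 2*w."""
    w = w_start
    for _ in range(steps):
        grad = 2 * w
        w = w - learning_rate * grad  # step against the gradient
    return w

w_small = gradient_descent(0.01)   # small steps: moves towards 0, but slowly
w_medium = gradient_descent(0.1)   # moderate steps: gets close to 0 in 20 steps
w_large = gradient_descent(1.1)    # too large: every step overshoots and w grows

print(iterations_per_epoch, w_small, w_medium, w_large)
```

With the large learning rate the weight ends up further from the minimum than where it started, which is the "not able to converge" case described above.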
As described earlier, the process starts by assigning random weights to each connection of the neurons. The first layer is the input layer (the number of neurons in the input layer = number of features or variables), and the input data points get multiplied with the randomly assigned weights on its outgoing connections. These weighted inputs then reach the next layer, which is a hidden layer, where they are summed and a bias term is added.
So, the equation for a neuron in the hidden layer would become (we will try to understand the equation with the help of figure 3)
h1 = w1*x1 + w4*x2 + w7*x3 + b
Here I have used the same symbols as in figure 3, but I didn’t use the values, because the values depend on the data we are working on. For example, if you are working on a loan default dataset, then the features would be things like the salary of the person, their age, etc., and those would be the values passed in from the input layer. In the above equation, I have shown the input to the hidden layer for just one neuron. The same thing happens for all the other neurons of the hidden layer.
‘b’ is a bias term that gets added every time the weighted outputs of one layer reach the next layer, which here is the hidden layer. So, the new input to a hidden neuron is the weighted sum of the inputs plus the bias term. This net value is then passed through an “activation function” in the same neuron (which is very briefly described above, but we will discuss it in detail in some other article; the type of activation function we use depends on the type of problem we are solving). The output of the activation function is then passed to the next layer, which could be another hidden layer, or, if there are no more hidden layers, the output layer, and the same process repeats there too.
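The calculation for the single hidden neuron h1 can be sketched in a few lines of Python. The input values, the weights and the bias below are made-up numbers, and sigmoid is only one possible choice of activation function.

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature values and weights, named as in the equation for h1.
x1, x2, x3 = 0.5, -1.0, 2.0
w1, w4, w7 = 0.1, 0.4, -0.2
b = 0.3

# Weighted sum of the inputs plus the bias: the net input to the neuron.
h1_in = w1 * x1 + w4 * x2 + w7 * x3 + b

# The net input passed through the activation function: the neuron's output.
h1_out = sigmoid(h1_in)

print(h1_in, h1_out)
```

The value h1_out is what this neuron would pass on to the next layer.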
So, this way we finally get an output at the output layer (remember that an activation function is used in all the hidden layers and the output layer, but not in the input layer). The output is then compared with the original labels, and the error is calculated with the loss function that we chose at the beginning while creating the model using a deep learning framework like TensorFlow, PyTorch, or any other.
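Putting the steps so far together, a complete forward pass followed by the loss calculation could look like the sketch below. I am assuming a 3-3-1 network (three inputs, three hidden neurons, one output), sigmoid in both layers, the squared error loss ½ * (y – output)^2, and made-up weights. The helper name dense is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One layer: weights[j][i] connects input i to neuron j.
    Each neuron sums its weighted inputs, adds its bias, applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical single training example and label.
x = [0.5, -1.0, 2.0]
y = 1.0

# Hypothetical "random" weights and biases.
W_hidden = [[0.1, 0.4, -0.2],
            [0.3, -0.1, 0.2],
            [-0.5, 0.2, 0.1]]
b_hidden = [0.3, 0.0, -0.1]
W_out = [[0.7, -0.3, 0.5]]
b_out = [0.1]

hidden_out = dense(x, W_hidden, b_hidden)        # input layer -> hidden layer
o1_out = dense(hidden_out, W_out, b_out)[0]      # hidden layer -> output layer
loss = 0.5 * (y - o1_out) ** 2                   # compare output with the label

print(hidden_out, o1_out, loss)
```

With random weights the output is usually far from the label, so the loss is large at this stage.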
As discussed earlier, this error could be for a single data point/row, for a batch of data points, or for the entire dataset all at once. But this is only the initial stage. The next stage is updating the weights. Once the error is calculated for any of the above three cases, the weights get updated using a process known as “back-propagation”.
But before we start discussing back-propagation, we need to know the formula used for updating the weights of the model. The weights get updated using the formula given below, and this formula is the starting point of back-propagation.
W(new) = W(old) - n * d(Loss)/dW(old)
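The update formula can be written directly as a tiny function. The name update_weight and the numbers are assumptions for illustration only.

```python
def update_weight(w_old, grad, learning_rate):
    """W(new) = W(old) - n * d(Loss)/dW(old), with n as the learning rate."""
    return w_old - learning_rate * grad

# A positive gradient means the loss grows with the weight, so the weight is reduced.
w_down = update_weight(w_old=0.5, grad=0.2, learning_rate=0.1)

# A negative gradient means the loss shrinks as the weight grows, so the weight is increased.
w_up = update_weight(w_old=0.5, grad=-0.2, learning_rate=0.1)

print(w_down, w_up)
```

In both cases the weight moves in the direction that lowers the loss, and the learning rate decides how far it moves.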
Back-propagation is basically the way the derivative or “gradient” in the above formula is calculated for every weight, by applying the chain rule backwards from the output layer. Let’s understand this for one weight, and then it will be clear for the other weights too. We will use the same figure 3. Let’s suppose we are updating “w10”.
So, updated w10 = w10 (old) - n * d(Loss)/dw10 (old), using the same formula as above, where n is the “Learning Rate”. Here we will mainly focus on the derivative or gradient part only.
So for w10, [d(Loss)/dw10] will be written as [d(Loss)/d(O1out)] * [d(O1out)/d(O1in)] * [d(O1in)/dw10].
The reason for splitting it this way is that, mathematically, there is no direct relationship between the loss and the weight (w10). But the loss has a direct relationship with the output of the output layer (O1out), as the loss is calculated from the output of the model. That output in turn has a direct relationship with the input to the output neuron (O1in), as the output is the result of the activation function applied to that input.
And finally, the input to the output neuron has a direct relationship with the weight, as the outputs of the hidden layer get multiplied with the weights to form it. So this way a chain of connections has been established from the loss back to w10.
Now let’s find the mathematical relationships:
d(Loss)/d(O1out) = d[½ * (y - O1out)^2]/d(O1out) = -(y - O1out), so based on the values of the label y and the output, we can calculate it.
d(O1out)/d(O1in) = d[1/(1 + e^(-O1in))]/d(O1in) = O1out * (1 - O1out), here we are using sigmoid as the activation function.
d(O1in)/dw10 = d[(w10*h1out) + (w11*h2out) + (w12*h3out)]/dw10 = h1out, as only the first term depends on w10.
This way they are calculated and weights get updated. Here it is shown just for one weight, which is w10, other weights get updated in the same way.
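The chain rule for w10 can be verified numerically with a short sketch. The hidden layer outputs, the weights, the label and the learning rate are made-up values. The loss is the squared error ½ * (y – O1out)^2 with a sigmoid output, and the finite-difference check at the end is my own addition to confirm that the three factors multiply to the right gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w10, w11, w12, h_out, y):
    """Output layer only: returns the output O1out and the squared error loss."""
    o1_in = w10 * h_out[0] + w11 * h_out[1] + w12 * h_out[2]
    o1_out = sigmoid(o1_in)
    return o1_out, 0.5 * (y - o1_out) ** 2

# Hypothetical hidden layer outputs, weights, label and learning rate.
h_out = [0.6, 0.4, 0.9]
w10, w11, w12 = 0.7, -0.3, 0.5
y, lr = 1.0, 0.5

o1_out, loss = forward(w10, w11, w12, h_out, y)

# The three factors of the chain rule.
dloss_dout = -(y - o1_out)             # d(Loss)/d(O1out)
dout_din = o1_out * (1 - o1_out)       # d(O1out)/d(O1in) for sigmoid
din_dw10 = h_out[0]                    # d(O1in)/dw10 = h1out
grad_w10 = dloss_dout * dout_din * din_dw10

# Numerical check: nudge w10 slightly in both directions and compare the losses.
eps = 1e-6
loss_plus = forward(w10 + eps, w11, w12, h_out, y)[1]
loss_minus = forward(w10 - eps, w11, w12, h_out, y)[1]
numeric_grad = (loss_plus - loss_minus) / (2 * eps)

# Weight update with the gradient from back-propagation.
w10_new = w10 - lr * grad_w10
new_loss = forward(w10_new, w11, w12, h_out, y)[1]

print(grad_w10, numeric_grad, loss, new_loss)
```

The two gradients agree up to rounding, and the loss after the update is smaller than before, which is the whole point of the update.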
A lot of things happened above, so there is a high chance that readers might get confused. So let’s summarize:
- The objective of every neural network model is to minimize the error, and the way it is done is by repeatedly updating the weights during training.
- If the whole dataset is passed all at once, then the weights get updated once per epoch. If the dataset is passed to the neural network batch-wise, then the weights get updated after each batch. Similarly, if a single data point is passed through the network each time, then the weights get updated after each data point.
- And finally, let’s summarize how the “feed-forward” process happens, which goes from the input layer to the output layer. Data points are passed through the input layer, they get multiplied with the weights on the way to the next layer, and in the next layer, which is the hidden layer, a bias term is added and the result is passed through an activation function.
- The output of the activation function is then passed to the next layer, which could be another hidden layer, or the output layer if we are using only one hidden layer. This way we get an output at the output layer, the error is calculated from that output, and using back-propagation the model updates the weights to get optimized weights. Finally, these optimized weights are used on new data to make predictions.
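The three ways of passing the data (whole dataset, batches, single data points) differ only in how often the weights get updated. Here is a minimal sketch of a training loop that shows this. To keep it short I am training a single sigmoid neuron instead of a full network, on a tiny made-up dataset (the logical OR of two inputs) with the squared error loss. The function names, the learning rate and the number of epochs are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def cost(data, weights, bias):
    """Average squared error loss over the whole dataset."""
    return sum(0.5 * (y - predict(weights, bias, x)) ** 2 for x, y in data) / len(data)

def train(data, batch_size, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in data[0][0]]  # random starting weights
    bias = 0.0
    updates = 0
    for _ in range(epochs):                                 # one epoch = one full pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = [0.0] * len(weights)
            grad_b = 0.0
            for x, y in batch:
                out = predict(weights, bias, x)             # feed-forward
                delta = -(y - out) * out * (1 - out)        # d(Loss)/d(net input)
                for i, xi in enumerate(x):
                    grad_w[i] += delta * xi / len(batch)    # average over the batch
                grad_b += delta / len(batch)
            # One weight update per batch.
            weights = [w - lr * g for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b
            updates += 1
    return weights, bias, updates

# Hypothetical dataset: two features, label is 1 if either feature is 1.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

results = {}
for name, batch_size in [("full batch", 4), ("mini-batch", 2), ("single example", 1)]:
    w, b, updates = train(data, batch_size)
    results[name] = (updates, cost(data, w, b))
    print(name, results[name])
```

With 4 data points and 200 epochs, the full-batch run makes one update per epoch, the mini-batch run two, and the single-example run four, while all three end with a lower cost than a model that always outputs 0.5.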
So this is basically what happens in an Artificial Neural Network. But this is not all. Before celebrating our success after creating the model, we need to check whether our model is over-fitting, and if it is, we need to change some hyperparameter values so that it doesn’t overfit. We will discuss those issues in my next article, but for overview purposes, this is enough. If you have any feedback, please comment in the section below. Very soon I’ll be coming up with new topics, and I will try my best to cover those in full detail too, as I did here in this article.
About the Author:
I am a Data Science Practitioner, and I have worked on a number of projects related to Machine Learning, Deep Learning, and Natural Language Processing. I finished my PGDM in Business Analytics & Finance in the year 2020, and from my college days this subject took my interest, as I was a technology enthusiast from the very beginning. My recent focus is on working in the direction of Natural Language Processing, learning the subject more and contributing in this direction as much as possible.