Pre-Trained models used for Computer Vision

These are some of the most common pre-trained models used in computer vision problems to provide a good starting point for the problem at hand.
Image source: Bernard Hermant on Unsplash


Computer vision enables a computer to understand and analyze images, and to use them to make predictions or detect particular objects within them.

What are Pre-trained Models

Pre-trained models are deep learning models that someone has already trained, usually on a large dataset, to solve a particular problem. They are useful when the problem at hand is similar to the one the model was originally trained on. When the two are similar enough, we can simply use the pre-trained model as the starting point for our own deep learning model instead of training from scratch.

A pre-trained model plays a crucial role in transfer learning. 

Transfer Learning

Transfer learning is the process of using a pre-trained model as the starting point for a new task. It is common in computer vision problems such as face detection, which would otherwise require training a model on a large dataset. By leveraging a pre-trained model as the base, we can reuse what it has already learned and train only the parts specific to the new problem.
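The idea of freezing a base and training only a new head can be shown with a toy, framework-free sketch. Everything here is made up for illustration: `FROZEN_BASE` is a hypothetical stand-in for a real pre-trained network, and the four data points are invented. Only the small logistic-regression head is trained.

```python
import math

# Hypothetical frozen base. In a real project this would be a network such as
# VGG-16 with pre-trained weights; here two fixed rows stand in for it.
FROZEN_BASE = ((1.0, -1.0), (0.5, 0.5))


def extract_features(x):
    """Run the frozen base: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_BASE]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(inputs, labels, epochs=500, lr=0.1):
    """Fit a logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            feats = extract_features(x)
            score = sum(w * f for w, f in zip(weights, feats)) + bias
            error = sigmoid(score) - y
            # Only the head's parameters are updated; FROZEN_BASE is untouched.
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    """Return the predicted class (0 or 1) for one input."""
    feats = extract_features(x)
    score = sum(w * f for w, f in zip(weights, feats)) + bias
    return int(sigmoid(score) >= 0.5)


# Tiny made-up dataset for the new task.
train_x = [(2.0, 0.0), (3.0, 1.0), (0.0, 2.0), (1.0, 3.0)]
train_y = [1, 1, 0, 0]
head_w, head_b = train_head(train_x, train_y)
predictions = [predict(x, head_w, head_b) for x in train_x]
print(predictions)
```

In a deep learning framework the same pattern is usually expressed by loading a model with pre-trained weights, marking its layers as non-trainable, and adding a new output layer for the target classes.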

Different types of Pre-trained models

Several pre-trained models are commonly used to solve computer vision problems. Two of them, VGG-16 and the Inception network, are described below.


VGG-16

VGG-16 is a CNN model proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford. It was entered in the ImageNet competition in 2014, where it made waves by reaching a top-5 test accuracy of 92.7%. ImageNet as a whole contains more than 14 million images, and the competition task uses 1000 classes.


The VGG-16 architecture is a stack of convolutional layers followed by fully connected layers. The input is an image of shape (224, 224, 3). The first two layers are convolutional layers with 64 filters each, using 3×3 filters and same padding. They are followed by a max-pooling layer with a stride of (2, 2). The next block has two convolutional layers with 128 filters of size (3, 3), again followed by max pooling, and the block after that has three convolutional layers with 256 filters of size (3, 3) and another max-pooling layer.

Next come two sets of three convolutional layers, each set followed by a max-pooling layer. Both sets use 512 filters of size (3, 3) with same padding. Every max-pooling layer in the network uses a 2×2 window with a stride of (2, 2), so each one halves the height and width of the feature map.

After the last pooling layer, we obtain a feature map of size (7, 7, 512), which is flattened into a (1, 25088) feature vector. Three fully connected layers follow: the first takes the flattened vector and outputs a (1, 4096) vector, the second also outputs (1, 4096), and the third outputs one score per class, (1, 1000), which a softmax turns into class probabilities. Top-5 accuracy counts a prediction as correct when the true class is among the five highest-scoring classes. The hidden layers use the ReLU activation function, which is cheaper to compute than saturating functions such as sigmoid or tanh.
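The shapes above can be checked with a few lines of arithmetic. The sketch below is not a trainable model; it only traces how the feature-map size changes through the five convolutional blocks and counts the parameters, assuming 3×3 convolutions with same padding and 2×2 max pooling with stride 2. The names `VGG16_BLOCKS`, `FC_UNITS`, and `trace_vgg16` are made up for this example.

```python
# VGG-16 configuration: (number of 3x3 conv layers, filters) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_UNITS = [4096, 4096, 1000]


def trace_vgg16(height=224, width=224, channels=3):
    """Return (shape before flattening, flattened size, parameter count)."""
    params = 0
    for num_convs, filters in VGG16_BLOCKS:
        for _ in range(num_convs):
            # A 3x3 kernel with same padding keeps height and width unchanged.
            params += 3 * 3 * channels * filters + filters  # weights + biases
            channels = filters
        # 2x2 max pooling with stride 2 halves height and width.
        height, width = height // 2, width // 2
    flattened = height * width * channels
    fan_in = flattened
    for units in FC_UNITS:
        params += fan_in * units + units  # weights + biases
        fan_in = units
    return (height, width, channels), flattened, params


shape, flattened, total_params = trace_vgg16()
print(shape, flattened, total_params)  # (7, 7, 512) 25088 138357544
```

Most of the roughly 138 million parameters sit in the first fully connected layer (25088 × 4096 weights), which is one reason the convolutional part is often reused on its own as a feature extractor.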

Inception Network

The Inception model was designed to improve speed and efficiency. An Inception module applies several convolutions to the same input in parallel and concatenates their results into a single output. That is, it runs convolutions with different filter sizes, such as 1×1, 3×3, and 5×5, alongside a pooling operation, stacks the results together, and leaves it to the model to learn which configuration works best.

Inception models rely heavily on 1×1 convolutions. A typical Inception module consists of 1×1, 3×3, and 5×5 convolutions together with a 3×3 max-pooling layer, and the outputs of all branches are stacked to form the module's output. The 1×1 convolutions are placed before the larger filters to reduce the number of channels and keep the computation manageable. The main idea is to let the network capture features at several scales at once.
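To see why the 1×1 convolutions matter, the sketch below counts output channels and weights for a single Inception module, with and without 1×1 reduction in front of the larger filters. The function names are made up for this example, and the filter counts (192 input channels; 64, 96, 128, 16, 32, 32) are the values I recall for the first Inception module of GoogLeNet, so treat them as illustrative assumptions. Biases are ignored.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def inception_module(in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
    """Return (output channels, weight count) for one Inception module."""
    weights = (
        conv_weights(1, in_ch, c1)             # 1x1 branch
        + conv_weights(1, in_ch, c3_reduce)    # 1x1 reduction before 3x3
        + conv_weights(3, c3_reduce, c3)       # 3x3 branch
        + conv_weights(1, in_ch, c5_reduce)    # 1x1 reduction before 5x5
        + conv_weights(5, c5_reduce, c5)       # 5x5 branch
        + conv_weights(1, in_ch, pool_proj)    # 1x1 after 3x3 max pooling
    )
    # All branches keep the same height and width, so their outputs are
    # concatenated along the channel axis.
    return c1 + c3 + c5 + pool_proj, weights


def naive_weights(in_ch, c1, c3, c5):
    """Same filters applied directly to the input, without 1x1 reduction."""
    return (
        conv_weights(1, in_ch, c1)
        + conv_weights(3, in_ch, c3)
        + conv_weights(5, in_ch, c5)
    )


out_channels, reduced = inception_module(192, 64, 96, 128, 16, 32, 32)
naive = naive_weights(192, 64, 128, 32)
print(out_channels, reduced, naive)  # 256 163328 387072
```

With these numbers, the reduced module needs fewer than half the weights of the naive one, even though it includes an extra projection after the pooling branch.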

The Inception network has gone through several versions:

Inception V1

Inception V1 is also known as GoogLeNet. This version relies on 1×1 convolutions for dimensionality reduction. It has 9 Inception modules stacked together. The network is 22 layers deep, or 27 layers when the pooling layers are counted. It uses global average pooling at the very end of the network, before the final classifier. To counter the vanishing gradient problem, it attaches auxiliary classifiers to intermediate layers during training. This model won the classification task of the ImageNet competition in 2014.
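Global average pooling is simple enough to write out by hand. The sketch below uses plain nested lists and two made-up 2×2 channels; `global_average_pool` is a name chosen for this example, not a library function.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W grid down to a single number.

    feature_maps is a list of channels, and each channel is a list of rows.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two made-up 2x2 channels.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
pooled = global_average_pool(maps)
print(pooled)  # [4.0, 1.0]
```

Because each channel collapses to one value, the vector passed to the classifier has one entry per channel. Compare this with flattening a (7, 7, 512) map into 25088 values, as VGG-16 does, which makes the following fully connected layer far larger.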

Problems present in V1

This version of Inception used 1×1, 3×3, and 5×5 convolutions. The 5×5 convolutions are computationally expensive, and reducing the dimensions of the representation too aggressively causes information loss (a representational bottleneck), which can lower accuracy. Later versions address this by factorizing convolutions. One caveat is that breaking a 3×3 convolution into 1×3 and 3×1 convolutions does not work well in the early layers of the network, where the feature maps are still large.

Inception V2

In version 2 of the Inception network, each 5×5 convolution was replaced with two stacked 3×3 convolutions, which cover the same receptive field with fewer parameters and so increase computational speed. This version also factorizes an n×n convolution into a 1×n convolution followed by an n×1 convolution, for example a 3×3 into a 1×3 and a 3×1.

With these changes, the authors were able to reduce the representational bottleneck in the model and so limit the information loss.
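The savings from replacing 5×5 convolutions and splitting 3×3 convolutions can be worked out directly. This sketch counts weights per input/output channel pair and the receptive field of each stack, assuming stride 1 and the same number of channels throughout. The helper names are made up for this example.

```python
def kernel_weights(shapes):
    """Weights per input/output channel pair for a stack of kernels."""
    return sum(h * w for h, w in shapes)


def receptive_field(shapes):
    """Receptive field (height, width) of stacked stride-1 convolutions."""
    height = 1 + sum(h - 1 for h, _ in shapes)
    width = 1 + sum(w - 1 for _, w in shapes)
    return height, width


stacks = {
    "5x5": [(5, 5)],
    "3x3 + 3x3": [(3, 3), (3, 3)],
    "3x3": [(3, 3)],
    "1x3 + 3x1": [(1, 3), (3, 1)],
}

for name, stack in stacks.items():
    print(name, kernel_weights(stack), receptive_field(stack))
# 5x5 25 (5, 5)
# 3x3 + 3x3 18 (5, 5)
# 3x3 9 (3, 3)
# 1x3 + 3x1 6 (3, 3)
```

Two 3×3 layers see the same 5×5 region with 18 weights instead of 25, and the 1×3 plus 3×1 pair sees the same 3×3 region with 6 weights instead of 9.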

Inception V3

Inception V3 builds on V2 with small yet significant changes such as:

● Batch Normalization, which was also applied to the fully connected layer of the auxiliary classifier.

● The RMSProp optimizer.

● A factorized 7×7 convolutional layer.

● Label Smoothing Regularization, which regularizes the classifier by softening the training targets so that the model does not become overconfident.

Together, these changes improved the model's accuracy.
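Label smoothing is easy to illustrate on a single target. In the sketch below, a one-hot label is mixed with a uniform distribution over the classes. The function name `smooth_labels` and the 4-class example are made up; the value epsilon = 0.1 is, as far as I recall, the one used for Inception V3, so treat it as an assumption.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]


# The true class keeps most of the probability mass; every other class
# receives a small, equal share.
target = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
print(target)
```

The smoothed target still sums to 1, but the model is no longer pushed to assign probability 1 to the true class, which discourages very large logits.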

