# Principal Component Analysis (PCA) in Machine Learning

##### A useful unsupervised learning algorithm for dimensionality reduction. By the end of this article you should have a clear conceptual understanding of PCA.

**What is PCA?**

Principal component analysis (PCA) is an unsupervised learning algorithm in machine learning. It is a way of identifying patterns in data and highlighting the similarities and differences within it.

Patterns and similarities can be difficult to find in high-dimensional data (a data set with a large number of features), because we cannot plot more than two or three dimensions directly. This is where principal component analysis comes to our rescue.

PCA is a statistical procedure that converts a set of correlated features into a set of linearly uncorrelated features called principal components. Its major use is dimensionality (feature) reduction: it removes redundant (repeated or dependent) features while losing as little information as possible. PCA is also widely used in image compression.

**Need**

Generalization becomes more difficult as the dimensionality of the training data set increases. PCA reduces the dimensionality and, with it, the computational cost. PCA is commonly used for:

- Removing noise from the data set
- Image compression
- Data visualization and interpretation
- Visualizing the relationships between populations

PCA uses eigenvectors and eigenvalues in its computation, so before going through the procedure let’s get some clarity about those terms.

**Eigenvectors and Eigenvalues**

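In linear algebra, an eigenvector of a linear transformation is a non-zero vector whose direction does not change when the transformation is applied to it; it is only scaled by a scalar value. That scalar is known as the eigenvalue and is denoted by lambda (λ). For a matrix A and an eigenvector v, this means Av = λv.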
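
As a quick check of this definition, here is a minimal NumPy sketch. The matrix `A` is an arbitrary illustrative example, not something taken from this article. For each eigenpair it verifies that applying `A` to the eigenvector gives the same result as multiplying the eigenvector by its eigenvalue.

```python
import numpy as np

# Illustrative 2x2 matrix (values chosen for this example, not from the article)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose COLUMNS are eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]
    # Applying A only rescales v by lam; the direction of v is unchanged
    holds = np.allclose(A @ v, lam * v)
    print(f"lambda = {lam:.1f}, A @ v == lambda * v: {holds}")
```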

Eigenvectors are only defined for square matrices, but the converse does not hold: not every square matrix has real eigenvectors (a 2×2 rotation by 90°, for example, has none). An n×n matrix has at most n linearly independent eigenvectors. A symmetric n×n matrix, such as the covariance matrix used in PCA, always has a full set of n real eigenvectors.

The eigenvectors of a symmetric matrix, such as a covariance matrix, can always be chosen to be perpendicular (orthogonal) to each other, irrespective of the number of dimensions. This is not true of square matrices in general. The property is crucial because we can represent the information (data) in terms of these orthogonal eigenvectors, without using the original coordinate axes. In PCA we scale each eigenvector to have length one: only the direction of an eigenvector matters, not its length, so dividing the vector by its length gives a standard form.
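
The sketch below illustrates both points for a symmetric matrix, which is the case that matters for PCA because covariance matrices are symmetric. The matrix `S` and the vector `v` are made-up examples. `np.linalg.eigh` is NumPy's routine for symmetric matrices and returns unit-length, mutually orthogonal eigenvectors.

```python
import numpy as np

# Illustrative symmetric 3x3 matrix (made-up values, not from the article)
S = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

# eigh is intended for symmetric matrices; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(S)

# Dot products between all pairs of eigenvectors: identity matrix if orthonormal
gram = eigenvectors.T @ eigenvectors
print("Orthogonal:", np.allclose(gram, np.eye(3)))

# Length of every eigenvector (column)
lengths = np.linalg.norm(eigenvectors, axis=0)
print("Unit length:", np.allclose(lengths, 1.0))

# Rescaling an arbitrary vector to length one: divide it by its own length
v = np.array([3.0, 4.0])
unit_v = v / np.linalg.norm(v)
print(unit_v)
```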

**PCA Implementation**

**Libraries:** Required libraries need to be imported.

**Data Collection:** We can import a data set directly from the internet or use data we have gathered ourselves. Here we consider only two dimensions so that we can plot the data in a coordinate system. If the data set is imported, split it into two components, X and Y, for easy analysis. With more dimensions the same procedure applies and PCA performs the dimensionality reduction.

**Subtract the Mean:** The mean is the average across each dimension. For PCA to work properly, we subtract the mean from each data point: x-mean is subtracted from all X components and y-mean from all Y components. This produces a zero-mean data set. This process is more precisely called mean centering and is referred to here as **normalization**.

Centering is important in PCA because the covariance matrix and the projected data are both computed from deviations from the mean. If we also divide each variable by its standard deviation (standardization), all variables have the same standard deviation and therefore the same weight when PCA finds the relevant axes. This matters when the features are measured on different scales. If the data is split into training and testing sets, split before normalizing and compute the mean from the training set only (in our case there is no need to split).

This is for normalizing the two-dimensional data.
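
A minimal sketch of this step with NumPy might look as follows. The `x` and `y` values are a small toy data set assumed for illustration; they are not the data used in this article. The optional standardization line is shown only to contrast it with plain mean subtraction.

```python
import numpy as np

# Toy two-dimensional data (assumed for illustration only)
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# One row per data point, one column per dimension: shape (10, 2)
data = np.column_stack((x, y))

# Mean of each dimension (column)
mean = data.mean(axis=0)

# Subtract x-mean from every X value and y-mean from every Y value
centered = data - mean

# Optional: also divide by the sample standard deviation of each column
# so that every dimension has the same spread
standardized = centered / data.std(axis=0, ddof=1)

print("column means after centering:", centered.mean(axis=0))
```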

**Co-variance matrix:** Covariance measures how two dimensions vary together, so it is always measured between two dimensions. If we have three dimensions a, b, c, we can measure the covariance between a and b, b and c, and a and c. In PCA we calculate the covariance for every pair of dimensions and place the values in a covariance matrix. For n dimensions this matrix is n×n and symmetric, with the variance of each dimension on its diagonal.

**What does co-variance tell us?**

Let us assume we have data on students' studying hours (S) and each student's exam result (R), and we compute the covariance between S and R. The exact value of cov(S, R) matters less than its sign. If it is positive, the features are positively correlated: both dimensions increase together, so more studying hours go with better results. If it is negative, the features are negatively correlated: as one dimension increases, the other decreases. A value close to zero indicates no linear relationship between the two.
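
To make the sign interpretation concrete, here is a small sketch using hypothetical studying hours and exam results; the numbers are invented for this example. It computes the covariance both by hand, using the sample formula cov(S, R) = Σ (Sᵢ − mean(S)) (Rᵢ − mean(R)) / (n − 1), and with `np.cov`, which returns the full covariance matrix.

```python
import numpy as np

# Hypothetical data: studying hours (S) and exam results (R) for six students
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
results = np.array([35.0, 45.0, 50.0, 62.0, 70.0, 81.0])

# Sample covariance computed directly from the definition
n = len(hours)
manual_cov = np.sum((hours - hours.mean()) * (results - results.mean())) / (n - 1)

# np.cov treats each argument as one variable and returns a 2x2 matrix:
# [[var(S),    cov(S, R)],
#  [cov(R, S), var(R)   ]]
cov_matrix = np.cov(hours, results)
cov_sr = cov_matrix[0, 1]

sign = "positive" if cov_sr > 0 else "negative or zero"
print(f"cov(S, R) is {sign}")
```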

**Calculating Eigenvectors and values:** We can calculate eigenvectors and eigenvalues because the covariance matrix is square. They give us information about the patterns within the data. In the figure below we have plotted the normalized data along with the eigenvectors. Because the covariance matrix is symmetric, the two eigenvectors are orthogonal. One eigenvector runs through the middle of the data points, like a line of best fit, and shows the direction in which the data varies most. The other one gives us the second, less important pattern. By using the eigenvectors of the covariance matrix, we can extract the patterns within the data.

This is the plot of normalized data and eigenvectors. Dotted lines represent eigenvectors.
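
The calculation itself can be sketched in a few lines. The data below is a toy set assumed for illustration, not the data plotted in this article, and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

# Toy two-dimensional data (assumed for illustration only)
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack((x, y))

# Mean-centre the data, then build the 2x2 covariance matrix
centered = data - data.mean(axis=0)
cov_matrix = np.cov(centered, rowvar=False)  # columns are the dimensions

# eigh returns eigenvalues in ascending order; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# The two eigenvectors are orthogonal: their dot product is (numerically) zero
dot = eigenvectors[:, 0] @ eigenvectors[:, 1]

print("eigenvalues:", eigenvalues)
print("eigenvectors (columns):")
print(eigenvectors)
```

The eigenvector paired with the larger eigenvalue is the direction along which the toy data varies most.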

**Feature Reduction**

Here comes the dimensionality reduction. After finding the eigenvectors and their corresponding eigenvalues, we sort them by eigenvalue in descending order. A data set with n features gives n eigenvector and eigenvalue pairs. Each eigenvector is a direction that combines the original features; it does not belong to a single feature.

The eigenvector with the highest eigenvalue is the principal component of the data set. It indicates the most prominent relationship between the features in the data. The eigenvectors with the lowest eigenvalues are then dropped. Each eigenvalue measures how much of the data's variance lies along its eigenvector, so little information is lost when the dropped eigenvalues are small.

If there were n features initially, we have now reduced them to p components. We then form a feature vector from these p components, i.e., a matrix whose columns are the p chosen eigenvectors.
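
The sorting and selection can be sketched as follows. The data is a toy set assumed for illustration, and `p = 1` is an arbitrary choice for this example. The ratio of each eigenvalue to the sum of all eigenvalues is a common way to judge how much variance each component keeps.

```python
import numpy as np

# Toy two-dimensional data (assumed for illustration only)
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack((x, y))

centered = data - data.mean(axis=0)
cov_matrix = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort eigenvalues in descending order and reorder the eigenvector columns to match
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# Keep the p components with the largest eigenvalues (p = 1 chosen for this example)
p = 1
feature_vector = eigenvectors[:, :p]  # shape (n_features, p)

print("explained variance ratio:", explained_ratio)
print("feature vector shape:", feature_vector.shape)
```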

**Deriving new data set:** After forming a feature vector from the chosen eigenvectors, we now derive the new data set with reduced features.

Final Data = (Transpose of Feature Vector) × (Row Data Adjust)

The row feature vector is the transpose of the matrix we formed from the chosen eigenvectors, so each of its rows is an eigenvector. Row data adjust is the transpose of the matrix of normalized (mean-subtracted) data, so each of its columns is a data point. This transpose and matrix multiplication expresses the data in terms of the chosen eigenvectors instead of the original axes: we get the same data points, but with reduced features and changed axes.

This shows the original data plotted with eigenvectors as the axes.
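
A minimal sketch of this multiplication is given below, again on a toy data set assumed for illustration. The variable names `row_feature_vector` and `row_data_adjust` are chosen here to mirror the terms in the formula. The last lines show that keeping all components and reversing the transformation recovers the original data.

```python
import numpy as np

# Toy two-dimensional data (assumed for illustration only)
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack((x, y))

mean = data.mean(axis=0)
centered = data - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# Sort components by descending eigenvalue and keep the top p = 1
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
feature_vector = eigenvectors[:, :1]          # shape (2, 1)

# Final Data = (Transpose of Feature Vector) x (Row Data Adjust)
row_feature_vector = feature_vector.T         # shape (1, 2): eigenvectors as rows
row_data_adjust = centered.T                  # shape (2, 10): data points as columns
final_data = row_feature_vector @ row_data_adjust   # shape (1, 10)

# Keeping ALL components loses nothing: the transformation can be reversed exactly
full_data = eigenvectors.T @ row_data_adjust
recovered = (eigenvectors @ full_data).T + mean

print("reduced data shape:", final_data.shape)
print("original recovered from all components:", np.allclose(recovered, data))
```

With only the first component kept, each data point is described by a single number: its position along the principal component.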

**Conclusion**

We transform our data to bring out the patterns within it. These patterns are the eigenvectors of the covariance matrix, which describe the directions along which the features vary together. Working with the transformed, reduced data can make later tasks such as classification easier.

Thanks for reading!
