Principal Component Analysis (PCA) From Scratch

Aayushma Pant
5 min read · Jun 1, 2021


High-dimensional data tends to produce sparse matrices and is much harder to compute with, analyze, and visualize. It is therefore important to reduce the number of dimensions in order to remove redundant information, improve visualization, and uncover hidden correlations. One such dimensionality-reduction method is PCA.

Principal component analysis, also referred to as the Karhunen-Loève (K-L) method, is a technique for reducing the dimensionality of data without losing much of the information it carries. It searches for k n-dimensional orthogonal vectors that can best represent the data, where k ≤ n.

The basic procedure for PCA determination is as follows:

1. Normalizing the Data

The first stage is standardization of the data. The input data is normalized so that every attribute falls within the same range and no attribute dominates simply because it is measured on a larger scale. Normalization can be done in several ways, such as min-max scaling, standard scaling, or zero-mean normalization (subtracting the mean from the data); a minimal sketch comparing these options is given below.
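As an illustration (an addition to the original post, using hypothetical values), here is a minimal numpy sketch of the three normalization options on a small 1-D array x:

# A minimal sketch of the normalization options mentioned above (hypothetical example data)
import numpy as np
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3])        # hypothetical raw measurements
x_minmax = (x - x.min()) / (x.max() - x.min())      # min-max scaling -> values in [0, 1]
x_standard = (x - x.mean()) / x.std()               # standard scaling -> zero mean, unit variance
x_zero_mean = x - x.mean()                          # zero-mean normalization (the option used in this article)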

To understand this more deeply, let's dive into an example. Suppose there are two variables, X1 and X2, each with 12 samples, so the data has 2 dimensions. Our task is to locate the principal components of this data by following the steps below.

Subtracting the mean:

# Importing necessary modules
import numpy as np
import matplotlib.pyplot as plt

# x1 and x2 are 1-D numpy arrays holding the 12 samples of X1 and X2 (from the data table above)
x1_norm = x1 - x1.mean()  # X1' normalized from X1 (zero mean)
x2_norm = x2 - x2.mean()  # X2' normalized from X2 (zero mean)
Raw data converted into zero-mean normalized data

Plotting the data with matplotlib (the dots are our data points); a minimal plotting sketch is given below the figure.

Visualization of raw data and zero-mean normalized data
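The original post shows this plot as an image; a minimal matplotlib sketch that reproduces it, assuming x1, x2, x1_norm, and x2_norm are defined as above, could look like this:

# Plotting sketch (assumes x1, x2, x1_norm, x2_norm and the imports from above)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x1, x2)                          # raw data
ax1.set_title('Raw data'); ax1.set_xlabel('X1'); ax1.set_ylabel('X2')
ax2.scatter(x1_norm, x2_norm)                # zero-mean normalized data
ax2.axhline(0, linestyle='--', linewidth=0.5); ax2.axvline(0, linestyle='--', linewidth=0.5)
ax2.set_title('Zero-mean normalized data'); ax2.set_xlabel("X1'"); ax2.set_ylabel("X2'")
plt.show()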

2. Calculation of Covariance Matrix

The covariance matrix Sx is a square, symmetric matrix of dimension m*m, given by:

Sx = (1 / (n − 1)) · X·Xᵀ

where X is the m*n matrix of zero-mean data (each column is one sample of dimension m*1) and n is the number of samples.

The diagonal entries of Sx are the variances of each attribute, while the off-diagonal entries are the covariances between attributes. Our goal is to minimize the redundant off-diagonal elements (covariance) and maximize the diagonal elements (variance).

X_matrix = np.array((x1_norm, x2_norm))  # shape (2, 12): each row is one zero-mean variable
Sx = (1 / (X_matrix.shape[1] - 1)) * np.matmul(X_matrix, np.transpose(X_matrix))
Output
Sx = array([[  3.51818182, -13.52454545],
            [-13.52454545,  55.58878788]])
(Since the off-diagonal elements of Sx are negative, X1 and X2 vary in opposite directions: as one increases, the other tends to decrease.)
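As a quick sanity check (an addition, not part of the original walkthrough), numpy's built-in np.cov should produce the same matrix, since it also divides by n − 1 by default:

# Sanity check: np.cov of the zero-mean rows gives the same covariance matrix
Sx_check = np.cov(X_matrix)          # equivalently np.cov(x1_norm, x2_norm)
print(np.allclose(Sx, Sx_check))     # expected: True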

3. Eigenvalues and Eigenvectors of the Covariance Matrix

Before exploring the concept of eigenvectors, let's recall the goal of PCA. As described earlier, PCA computes an orthonormal matrix P such that the transformed data is Y = PX, where the rows of P are the principal components of the input X. P is chosen so that it diagonalizes Sy, the covariance matrix of Y: the redundant off-diagonal covariances vanish and only the diagonal variances are preserved. So our output becomes:

Sy = (1 / (n − 1)) · Y·Yᵀ = P·Sx·Pᵀ, which is a diagonal matrix.

To achieve this goal, the eigenvectors of Sx must be calculated. In linear algebra, an eigenvector of a linear transformation is a non-zero vector that is only scaled by a scalar factor (not rotated) when the transformation is applied to it. The corresponding eigenvalue, often denoted λ, is that scale factor. For a 2*2 matrix the eigenvalues have a closed form, λ = m ± √(m² − p), where m is half the trace of Sx and p is its determinant; this is what the code below computes:

# Eigenvalue calculation (closed form for a 2x2 matrix)
m = (Sx[0][0] + Sx[1][1]) / 2                      # half the trace of Sx
p = Sx[0][0] * Sx[1][1] - Sx[1][0] * Sx[0][1]      # determinant of Sx
lambda1 = m + (m**2 - p)**(1/2)
lambda2 = m - (m**2 - p)**(1/2)
Output
lambda1 = 58.89203173874124
lambda2 = 0.21493795822846806

The eigenvectors are obtained by solving (Sx − λI)v = 0 and fixing the second component of v to 1 (y = 1).

# Eigenvector calculation: from (Sx - lambda*I)v = 0 with v = (x, 1), so x = Sx[0][1] / (lambda - Sx[0][0])
V1 = np.array((Sx[0][1] / (lambda1 - Sx[0][0]), 1))
V2 = np.array((Sx[0][1] / (lambda2 - Sx[0][0]), 1))
Output
eigen_vector V1 = [-0.24424066  1.        ]
eigen_vector V2 = [ 4.09432244  1.        ]
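As another sanity check (an addition to the original post), np.linalg.eig returns the same eigenvalues; its eigenvectors are unit-length, so they differ from V1 and V2 only by a scale factor:

# Sanity check with np.linalg.eig (assumes Sx, lambda1, lambda2 from above)
eig_vals, eig_vecs = np.linalg.eig(Sx)
print(eig_vals)     # the same values as lambda1 and lambda2 (possibly in a different order)
print(eig_vecs)     # columns are unit-length eigenvectors, proportional to V1 and V2
print(np.allclose(np.sort(eig_vals), np.sort([lambda2, lambda1])))   # expected: True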

Plotting the two eigenvectors, V1 (green dotted line) and V2 (red dotted line), over the normalized data: V1 appears to fit the data much better than V2.
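The plot appears as an image in the original post; a minimal matplotlib sketch of it, assuming the variables defined above, might be:

# Plotting sketch: data with the two eigenvector directions (assumes x1_norm, x2_norm, V1, V2)
t = np.linspace(-10, 10, 100)
plt.scatter(x1_norm, x2_norm)
plt.plot(t * V1[0], t * V1[1], 'g--', label='V1 direction')
plt.plot(t * V2[0], t * V2[1], 'r--', label='V2 direction')
plt.xlabel("X1'"); plt.ylabel("X2'"); plt.legend()
plt.show()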

4. Eigenvectors with the Largest Eigenvalues

The proportion of variance explained by each eigenvector is determined from its eigenvalue. To do this, the eigenvalues are sorted in descending order, λ1 ≥ λ2 ≥ … ≥ λm. The proportion of variance explained by the first r principal components is then:

(λ1 + λ2 + … + λr) / (λ1 + λ2 + … + λm)

This is only helpful when the dimensions are highly correlated and r is much smaller than m.

# Proportion of variance explained by each component
k = lambda1 / (lambda1 + lambda2)
print(str(round(k*100, 2)) + '% variance is explained by v1')
print(str(round((1-k)*100, 2)) + '% variance is explained by v2')
Output
99.64% variance is explained by v1
0.36% variance is explained by v2
The variance explained by the two principal components
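For data with more than two dimensions the same idea generalizes; here is a small sketch (an addition, not from the original post) that sorts the eigenvalues in descending order and reports the cumulative proportion of variance explained by the first r components:

# Cumulative proportion of variance explained by the first r components
def explained_variance_ratio(eigenvalues):
    vals = np.sort(np.asarray(eigenvalues))[::-1]    # sort eigenvalues in descending order
    return np.cumsum(vals) / np.sum(vals)            # cumulative proportion for r = 1, 2, ..., m

print(explained_variance_ratio([lambda1, lambda2]))  # expected: approximately [0.9964, 1.0]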

The principal components obtained from the eigenvectors are computed as:

P = np.array((V1, V2))        # rows of P are the eigenvectors V1 and V2
Y = np.matmul(P, X_matrix)    # Y has two rows: the scores along V1 and V2
Y1 and Y2: the two principal components derived from the V1 and V2 eigenvectors, respectively
Visualization of choosing both components and finding the value of Y (no dimension reduction)

Hence, the above visualization confirms that V1 (the first principal component) is the best fit.

5. Dimension Reduction Using the Principal Component

Finally, we take V1 as our principal component and project the data onto it to obtain the reduced one-dimensional representation.

P = np.array((V1))                                             # keep only the dominant eigenvector V1 (note: V1 is not unit-normalized)
Y = np.matmul(P, X_matrix)                                     # project the data onto V1 -> one value per sample
Sy = (1 / (Y.shape[0] - 1)) * np.matmul(Y, np.transpose(Y))    # variance of the projected data
Output
Sy = 62.40514745166816
(Sy equals λ1·|V1|² here; if V1 were first scaled to unit length, Sy would equal λ1 ≈ 58.89.)
One-dimensional data plot with about 99% of the variance explained.
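As a final sanity check (an addition to the original walkthrough), scikit-learn's PCA should report essentially the same explained variance, assuming x1 and x2 are the raw sample arrays used above:

# Sanity check with scikit-learn (assumes x1, x2 are the raw 1-D sample arrays)
from sklearn.decomposition import PCA
data = np.column_stack((x1, x2))       # shape (12, 2): rows are samples, columns are X1 and X2
pca = PCA(n_components=1)
Y_sklearn = pca.fit_transform(data)    # one principal-component score per sample
print(pca.explained_variance_ratio_)   # expected: approximately [0.9964]
print(pca.explained_variance_)         # expected: approximately [58.89], i.e. lambda1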

This is how dimensionality reduction is carried out using principal component analysis.
