# Principal Component Analysis (PCA) From Scratch

High-dimensional data tends to produce sparse matrices and is much harder to compute with, analyze, and visualize. It is therefore important to reduce the number of dimensions to minimize redundant information, improve visualization, and uncover hidden correlations. One such dimensionality-reduction method is PCA.

Principal component analysis, also referred to as the Karhunen-Loeve (K-L) method, is a technique for reducing the dimensionality of data without losing much of the information it contains. It searches for k n-dimensional orthogonal vectors that can best represent the data, where k ≤ n.

The basic procedure for PCA determination is as follows:

1. **Normalizing the Data**

The first stage is standardization of the data. The input data is normalized so that each attribute falls within the same range and no attribute dominates merely because of its larger scale. Normalization can be done in several ways, such as min-max scaling, standard scaling, and zero-mean normalization (subtracting the mean from the data).
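As a quick sketch of these options (the variable name and the four sample values below are made up for illustration, since the article's data set is not listed):

```python
import numpy as np

# Hypothetical attribute values for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0])

x_minmax = (x - x.min()) / (x.max() - x.min())  # min-max scaling: range becomes [0, 1]
x_standard = (x - x.mean()) / x.std()           # standard scaling: zero mean, unit variance
x_zeromean = x - x.mean()                       # zero-mean normalization: mean subtraction only
```

The worked example below uses the last of these, plain mean subtraction.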

To understand this more deeply, let's work through an example. Suppose there are two variables, X1 and X2, each with 12 samples. Our task is to locate the principal components of this data by following the steps below.

Subtracting the mean:

```python
# Importing necessary modules
import numpy as np
import matplotlib.pyplot as plt

x1_norm = x1 - x1.mean()  # X1' normalized from X1
x2_norm = x2 - x2.mean()  # X2' normalized from X2
```

Plotting the data with matplotlib; in the resulting figure, the dots are our data points.
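A minimal sketch of that plotting step; the `x1` and `x2` arrays here are placeholders, since the article's 12 samples are not listed:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: two correlated variables, 12 samples each
x1 = np.arange(1.0, 13.0)
x2 = -3.5 * x1 + np.linspace(-0.5, 0.5, 12)

x1_norm = x1 - x1.mean()
x2_norm = x2 - x2.mean()

fig, ax = plt.subplots()
ax.scatter(x1_norm, x2_norm)  # each dot is one (X1', X2') sample
ax.set_xlabel("X1'")
ax.set_ylabel("X2'")
fig.savefig("normalized_data.png")
```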

2. **Calculating the Covariance Matrix**

The covariance matrix $S_X$ is a square symmetric matrix of dimension $m \times m$, given (for $n$ samples) as:

$$S_X = \frac{1}{n-1} X X^T$$

The diagonal elements of Sx are the variances and the off-diagonal elements are the covariances. Our goal is to minimize the redundant off-diagonal elements and maximize the variances (the diagonal elements).

```python
X_matrix = np.array((x1_norm, x2_norm))
Sx = (1 / (X_matrix.shape[1] - 1)) * np.matmul(X_matrix, np.transpose(X_matrix))
```

Output:

```python
Sx = array([[  3.51818182, -13.52454545],
            [-13.52454545,  55.58878788]])
```

(Since the off-diagonal elements of Sx are negative, X1 and X2 tend to move in opposite directions: as one increases, the other decreases.)
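As a sanity check, the manual formula can be compared against NumPy's built-in covariance estimator, which uses the same n − 1 denominator by default. The data here is randomly generated for illustration, since the article's samples are not listed:

```python
import numpy as np

# Randomly generated stand-in data: 2 variables, 12 samples
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 12))
X = X - X.mean(axis=1, keepdims=True)  # subtract each row's mean

# Manual covariance, as in the article
Sx_manual = (1 / (X.shape[1] - 1)) * np.matmul(X, np.transpose(X))

# np.cov treats rows as variables and divides by n - 1 by default
Sx_numpy = np.cov(X)
print(np.allclose(Sx_manual, Sx_numpy))  # True
```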

3. **Eigenvalues and Eigenvectors of the Covariance Matrix**

Before exploring the concept of eigenvectors, let's restate the goal of PCA. As described earlier, PCA computes an orthonormal matrix P such that the output becomes Y = PX, where the rows of P are the principal components of the input X. P is chosen so that it diagonalizes $S_Y$: the off-diagonal covariances vanish while the variances are preserved on the diagonal. So our output satisfies:

$$S_Y = \frac{1}{n-1} Y Y^T = P S_X P^T$$

with $S_Y$ diagonal.

To achieve this goal, the eigenvectors must be calculated. In linear algebra, an eigenvector of a linear transformation is a non-null vector that changes only by a scalar factor when the transformation is applied to it. The corresponding eigenvalue, often denoted λ, is that scale factor. For a 2×2 matrix, the eigenvalues can be computed in closed form from the half-trace m and the determinant p:

```python
# eigenvalue calculation (closed form for a 2x2 matrix)
m = (Sx[0][0] + Sx[1][1]) / 2                    # half the trace
p = Sx[0][0] * Sx[1][1] - Sx[1][0] * Sx[0][1]    # determinant
lambda1 = m + (m**2 - p)**(1/2)
lambda2 = m - (m**2 - p)**(1/2)
```

Output:

```python
lambda1 = 58.89203173874124
lambda2 = 0.21493795822846806
```

Calculating each eigenvector by fixing its second component at y = 1:

```python
# eigenvector calculation
V1 = np.array((Sx[0][1] / (lambda1 - Sx[0][0]), 1))
V2 = np.array((Sx[0][1] / (lambda2 - Sx[0][0]), 1))
```

Output:

```python
V1 = [-0.24424066  1.        ]
V2 = [ 4.09432244  1.        ]
```
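These closed-form results can be cross-checked with `np.linalg.eigh`, which handles symmetric matrices and returns eigenvalues in ascending order (the Sx entries below are copied from the covariance output above):

```python
import numpy as np

Sx = np.array([[  3.51818182, -13.52454545],
               [-13.52454545,  55.58878788]])

# Eigen-decomposition of a symmetric matrix
eigvals, eigvecs = np.linalg.eigh(Sx)
print(eigvals)  # approximately [0.21493796, 58.89203174]

# The unit eigenvector of the largest eigenvalue points along V1
V1 = np.array([-0.24424066, 1.0])
alignment = abs(np.dot(eigvecs[:, 1], V1 / np.linalg.norm(V1)))
print(np.isclose(alignment, 1.0))  # True (up to sign)
```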

Plotting the two eigenvectors, V1 (green dotted line) and V2 (red dotted line), shows that V1 fits the data far better than V2.

4. **Eigenvectors with the Highest Eigenvalues**

The variance explained by each eigenvector is given by its eigenvalue. To rank them, the eigenvalues are sorted in descending order. The proportion of variance explained by the first r principal components is:

$$\frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{m} \lambda_i}$$

This is only helpful when the dimensions are highly correlated and r is much smaller than m.

```python
k = lambda1 / (lambda1 + lambda2)
print(str(round(k * 100, 2)) + '% variance is explained by v1')
print(str(round((1 - k) * 100, 2)) + '% variance is explained by v2')
```

Output:

```
99.64% variance is explained by v1
0.36% variance is explained by v2
```

The principal component matrix built from the eigenvectors is then:

```python
P = np.array((V1, V2))
Y = np.matmul(P, X_matrix)
```

Hence, from the above visualization, it is confirmed that the V1 vector (the first principal component) is the best fit.

5. **Dimension Reduction Using the Principal Component**

Finally, we take the V1 vector as our principal component and project the data onto it.

```python
P = np.array(V1)  # projection vector (note: V1 is not unit length)
Y = np.matmul(P, X_matrix)
Sy = (1 / (Y.shape[0] - 1)) * np.matmul(Y, np.transpose(Y))
```

Output:

```python
Sy = 62.40514745166816
```

Because V1 was not scaled to unit length, Sy here equals lambda1 multiplied by the squared norm of V1; normalizing V1 first would give Sy = lambda1 ≈ 58.89.

This is how dimensionality reduction is carried out using the principal components.
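Putting it all together, here is a compact end-to-end sketch using a unit-length principal component; the data is randomly generated for illustration, since the article's samples are not listed. With a unit eigenvector, the projected variance Sy comes out exactly equal to lambda1:

```python
import numpy as np

# Randomly generated stand-in data: two correlated variables, 12 samples
rng = np.random.default_rng(1)
x1 = rng.normal(size=12)
x2 = -3.5 * x1 + rng.normal(scale=0.3, size=12)

# Step 1: subtract the mean
X = np.array((x1 - x1.mean(), x2 - x2.mean()))

# Step 2: covariance matrix
Sx = (1 / (X.shape[1] - 1)) * np.matmul(X, np.transpose(X))

# Step 3: eigen-decomposition (eigh returns unit eigenvectors, ascending order)
eigvals, eigvecs = np.linalg.eigh(Sx)

# Steps 4-5: project onto the unit eigenvector with the largest eigenvalue
v1 = eigvecs[:, -1]
Y = v1 @ X
Sy = (1 / (Y.shape[0] - 1)) * (Y @ Y)
print(np.isclose(Sy, eigvals[-1]))  # True
```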

Check out my article on Quantum Computing For Machine Learning.