PCA: Principal Component Analysis

Principal component analysis (PCA) is a widely used dimensionality reduction technique.

When the data has many dimensions, it is difficult to find which dimensions are responsible for the results. Let us consider a scenario in which you have an army of 10,000 soldiers. How can we determine whether the army will win a battle or not?

In order to solve this problem, we check the strength of the army by running a practice drill. In the drill, we order the army to fight against its own transposed team; this gives us a covariance matrix. From the result of that fight, we determine the strengths and weaknesses of the team and assign a percentile to each soldier. The new ranks based on this fight are the principal components.
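
To make the analogy concrete, here is a minimal sketch with made-up numbers: multiplying the centered data with its own transpose gives the covariance matrix, and its eigendecomposition gives the "new ranks" (the principal components).

import numpy as np

# made-up data: 5 soldiers, 3 measured attributes (e.g. height, weight, stamina)
data = np.array([[170., 65., 30.],
                 [180., 80., 25.],
                 [165., 60., 35.],
                 [175., 75., 28.],
                 [160., 55., 40.]])

centered = data - data.mean(axis=0)              # remove the mean of each attribute
cov = centered.T @ centered / (len(data) - 1)    # "fight with the transposed team" -> covariance matrix

# eigenvectors are the new directions (principal components), eigenvalues are their strengths
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)                               # the largest values mark the most informative directions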

After the drill we discover new dimensions, as shown in the figure below: for example, soldiers who travel a lot perform better and last longer, while those who drink tend to fail midway. Each such pattern is one of the new dimensions.

Now let's check the performance of PCA on the MNIST digits data set (here, the 8×8 digits data set bundled with scikit-learn).

Import libraries

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from scipy.linalg import eigh
import pandas as pd
import seaborn as sn
mnist=load_digits()

#Load the data in X and Y

X=mnist.images
Y=mnist.target

# flatten each 8x8 image into a 64-dimensional feature vector
x=X.reshape(1797,64)

Plot a few samples.

num_row = 2
num_col = 5

# plot the first 10 digit images
fig, axes = plt.subplots(num_row, num_col, figsize=(1.5,2))
for i in range(10):
    ax = axes[i//num_col, i%num_col]
    ax.imshow(X[i].reshape(8,8))
    ax.set_title('Label: {}'.format(Y[i]))
plt.tight_layout()
plt.show()

Apply PCA with 30 components and visualize the transformed data.

pca = PCA(n_components=30)
principalComponents = pca.fit_transform(x)
num_row = 2
num_col = 5
fig, axes = plt.subplots(num_row, num_col, figsize=(1.5,2))
for i in range(10):
    ax = axes[i//num_col, i%num_col]
    # the 30 components are reshaped into a 6x5 grid purely for visualization
    ax.imshow(principalComponents[i].reshape(6,5))
    ax.set_title('Label: {}'.format(Y[i]))
plt.tight_layout()
plt.show()
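
The 6×5 images above are just the 30 PCA coefficients arranged in a grid. To see how much of each digit those 30 numbers actually preserve, we can map them back to 64 dimensions with inverse_transform. A small sketch reusing the pca object fitted above:

# reconstruct the 64-dimensional images from their 30 PCA components
reconstructed = pca.inverse_transform(principalComponents)

fig, axes = plt.subplots(num_row, num_col, figsize=(1.5,2))
for i in range(10):
    ax = axes[i//num_col, i%num_col]
    ax.imshow(reconstructed[i].reshape(8,8))
    ax.set_title('Label: {}'.format(Y[i]))
plt.tight_layout()
plt.show()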

But how do we decide how many components to keep? That depends on the amount of variance explained by the principal components.

pca = PCA().fit(mnist.images.reshape(1797,64))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
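
Instead of eyeballing the elbow in this curve, we can also let scikit-learn pick the number of components for a target fraction of variance. A small sketch, assuming we want to keep roughly 95% of the variance:

# keep enough components to explain about 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver='full')
x_95 = pca_95.fit_transform(x)
print("components needed for 95% variance:", pca_95.n_components_)

# the same answer read off the cumulative curve above
cum_var = np.cumsum(PCA().fit(x).explained_variance_ratio_)
print("first index reaching 95%:", np.argmax(cum_var >= 0.95) + 1)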

Now apply logistic regression on the original data and on the PCA-transformed data and compare the performance.

pca = PCA(n_components=30)
principalComponents = pca.fit_transform(x)

X_train,X_test,y_train,y_test=train_test_split(x,Y,test_size=0.30,random_state=0)
XPCA_train,XPCA_test,yPCA_train,yPCA_test=train_test_split(principalComponents,Y,test_size=0.30,random_state=0)

model = LogisticRegression()

# fit and evaluate on the original 64-dimensional data
model.fit(X_train,y_train)
y_pred=model.predict(X_test)

# fit and evaluate on the 30-dimensional PCA data
model.fit(XPCA_train,yPCA_train)
yPCA_pred=model.predict(XPCA_test)

print("Accuracy of model on normal data :",metrics.accuracy_score(y_test, y_pred))
print("Accuracy of model on PCA data :",metrics.accuracy_score(yPCA_test, yPCA_pred))
Accuracy of model on normal data : 0.9518518518518518
Accuracy of model on PCA data : 0.9555555555555556
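
The two accuracies are close because the 30 components already capture most of the variance in the pixels. A quick check (exact numbers and timings will vary from run to run and machine to machine):

# fraction of the total variance captured by the 30 components used above
print("variance retained by 30 components:", pca.explained_variance_ratio_.sum())

# training on 30 features is also cheaper than on 64
import time
start = time.perf_counter()
LogisticRegression().fit(X_train, y_train)
print("fit time on 64 features:", time.perf_counter() - start)
start = time.perf_counter()
LogisticRegression().fit(XPCA_train, yPCA_train)
print("fit time on 30 features:", time.perf_counter() - start)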

Now, let us understand how this PCA works.

The story begins with Eigenvalue and Eigenvector. In linear algebra, an eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled.
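
A quick numerical check of this definition on a small symmetric matrix (the numbers are just an illustration): applying the matrix to an eigenvector only rescales it by the eigenvalue.

A = np.array([[2., 1.],
              [1., 2.]])

eigenvalues, eigenvectors = np.linalg.eigh(A)
v = eigenvectors[:, 1]       # eigenvector belonging to the largest eigenvalue (3.0)
lam = eigenvalues[1]

print(A @ v)                 # transforming the eigenvector ...
print(lam * v)               # ... gives the same vector scaled by its eigenvalue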

Apply PCA on the whole MNIST data manually, starting with standardization of the data.

standardized_data = StandardScaler().fit_transform(x)

sample_data = standardized_data

# matrix multiplication using numpy: for standardized (mean-centered) data,
# the product of the data matrix with its transpose is proportional to the covariance matrix
covar_matrix = np.matmul(sample_data.T , sample_data)

print ( "The shape of co-variance matrix = ", covar_matrix.shape)
The shape of co-variance matrix =  (64, 64) 
df=pd.DataFrame(covar_matrix)
sn.heatmap(df.corr())
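
Since StandardScaler has already centered each feature, this matrix is just the sample covariance scaled by n - 1; a quick sanity check using the variables defined above:

# for mean-centered data, X.T @ X equals (n - 1) times the sample covariance
n = sample_data.shape[0]
print(np.allclose(covar_matrix, np.cov(sample_data, rowvar=False) * (n - 1)))   # expected: True
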
# eigh returns eigenvalues in ascending order; indices 62 and 63 are the two largest of the 64
values, vectors = eigh(covar_matrix, subset_by_index=[62, 63])
print("Shape of eigen vectors = ",vectors.shape)
vectors = vectors.T
print("Updated shape of eigen vectors = ",vectors.shape)
Shape of eigen vectors =  (64, 2)
Updated shape of eigen vectors =  (2, 64)
new_coordinates = np.matmul(vectors, sample_data.T)
print (" resultanat new data points' shape ", vectors.shape, "X", sample_data.T.shape," = ", new_coordinates.shape)

resultant new data points' shape (2, 64) X (64, 1797) = (2, 1797)
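
As a cross-check, scikit-learn's PCA on the same standardized data should give the same projections up to sign and component order (remember that eigh returned the directions in ascending order of eigenvalue). A rough sketch:

# cross-check the manual projection against scikit-learn (signs/order may differ)
sk_proj = PCA(n_components=2).fit_transform(standardized_data)
manual = new_coordinates.T          # shape (1797, 2); column 1 is the largest-eigenvalue direction
print(np.abs(np.corrcoef(manual[:, 1], sk_proj[:, 0])[0, 1]))   # should be ~1.0
print(np.abs(np.corrcoef(manual[:, 0], sk_proj[:, 1])[0, 1]))   # should be ~1.0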

# append the labels and build a dataframe for plotting
new_coordinates = np.vstack((new_coordinates, Y)).T
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())
   1st_principal  2nd_principal  label
0       0.954502       1.914214    0.0
1      -0.924636       0.588980    1.0
2       0.317189       1.302039    2.0
3       0.868772      -3.020770    3.0
4       1.093480       4.528949    4.0
pca.n_components = 3
pca_data = pca.fit_transform(x)
print("shape of pca_reduced = ", pca_data.shape)
pca_data = np.vstack((pca_data.T, Y)).T
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "3rd_principal", "label"))

# 2-D scatter of the first two principal components, colored by digit label
sn.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

To see all three components together, let us build a 3-D plot.

df = pd.DataFrame(pca_data,columns=['x','y','z','label'])
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection (needed on older matplotlib versions)

df['label']=pd.Categorical(df['label'])
my_color=df['label'].cat.codes
fig = plt.figure(figsize=(15,8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['x'], df['y'], df['z'],  c=my_color, cmap="Set2_r", s=5)

# draw the three coordinate axes through the origin
xAxisLine = ((min(df['x']), max(df['x'])), (0, 0), (0,0))
ax.plot(xAxisLine[0], xAxisLine[1], xAxisLine[2], 'r')
yAxisLine = ((0, 0), (min(df['y']), max(df['y'])), (0,0))
ax.plot(yAxisLine[0], yAxisLine[1], yAxisLine[2], 'g')
zAxisLine = ((0, 0), (0,0), (min(df['z']), max(df['z'])))
ax.plot(zAxisLine[0], zAxisLine[1], zAxisLine[2], 'b')

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("PCA on the MNIST data set")
plt.show()
