Principal component analysis (PCA) is a dimensionality reduction technique.
When data has many dimensions, it is difficult to tell which of them are responsible for the results. Consider a scenario in which you have an army of 10,000 soldiers. How can we determine whether the army will win a battle or not?
To answer this, we check the strength of the army with a practice drill: we order the army to fight against its own transposed team, just as a covariance matrix is built by multiplying the data with its own transpose. From the results of that fight we determine the strengths and weaknesses of the team and assign a percentile to each soldier. These new ratings derived from the drill are the principal components.
After the drill we discover new dimensions, as shown in the figure below: for example, soldiers who travel a lot perform better and last longer, while those who drink tend to fail midway. Each such trait is a new dimension along which the soldiers vary.
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/pca-diag.png)
Now let’s check the performance of PCA on the MNIST-style digits dataset that ships with scikit-learn (`load_digits`, 8×8 images).
Import the libraries:
```python
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from scipy.linalg import eigh
import pandas as pd
import seaborn as sn
```
```python
# Load the data into X and Y
mnist = load_digits()
X = mnist.images
Y = mnist.target

# Flatten each 8x8 image into a 64-dimensional row vector
x = X.reshape(1797, 64)
```
Plot a few of the digits:
```python
# Plot the first ten images in a 2x5 grid
num_row = 2
num_col = 5
fig, axes = plt.subplots(num_row, num_col, figsize=(1.5, 2))
for i in range(10):
    ax = axes[i // num_col, i % num_col]
    ax.imshow(X[i].reshape(8, 8))
    ax.set_title('Label: {}'.format(Y[i]))
plt.tight_layout()
plt.show()
```
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/image-12.png)
Apply PCA to reduce the data to 30 dimensions and visualize the transformed samples:
```python
# Project the 64-dim data onto the top 30 principal components
pca = PCA(n_components=30)
principalComponents = pca.fit_transform(x)

# Visualize the 30 component values of each digit as a 6x5 grid
num_row = 2
num_col = 5
fig, axes = plt.subplots(num_row, num_col, figsize=(1.5, 2))
for i in range(10):
    ax = axes[i // num_col, i % num_col]
    ax.imshow(principalComponents[i].reshape(6, 5))
    ax.set_title('Label: {}'.format(Y[i]))
plt.tight_layout()
plt.show()
```
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/image-13.png)
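These 30-dimensional codes are hard to interpret visually. As a quick extra check (not part of the original walkthrough), we can map them back into the 64-pixel space with `inverse_transform` and see how much of each digit 30 components preserve:

```python
# Reconstruct the 8x8 images from the 30 retained components
reconstructed = pca.inverse_transform(principalComponents)

fig, axes = plt.subplots(num_row, num_col, figsize=(1.5, 2))
for i in range(10):
    ax = axes[i // num_col, i % num_col]
    ax.imshow(reconstructed[i].reshape(8, 8))
    ax.set_title('Label: {}'.format(Y[i]))
plt.tight_layout()
plt.show()
```

The digits should still be recognizable, since most of the variance survives the projection.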
But how do we choose how many components to keep? It depends on the amount of variance explained by the principal components.
```python
# Cumulative variance explained as components are added
pca = PCA().fit(mnist.images.reshape(1797, 64))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
```
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/image-14.png)
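The curve flattens well before 64 components, which is why 30 was a reasonable pick above. scikit-learn can also pick the count automatically: passing a float between 0 and 1 as `n_components` keeps just enough components to reach that fraction of explained variance. A minimal sketch (the 0.95 threshold is an arbitrary choice for illustration):

```python
# Keep just enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
x_95 = pca_95.fit_transform(x)
print("Components needed for 95% variance:", pca_95.n_components_)
```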
Now apply logistic regression to both the raw data and the PCA-transformed data and compare the performance:
```python
pca = PCA(n_components=30)
principalComponents = pca.fit_transform(x)

# Split the raw data and the PCA data the same way
X_train, X_test, y_train, y_test = train_test_split(x, Y, test_size=0.30, random_state=0)
XPCA_train, XPCA_test, yPCA_train, yPCA_test = train_test_split(principalComponents, Y, test_size=0.30, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

model.fit(XPCA_train, yPCA_train)
yPCA_pred = model.predict(XPCA_test)

print("Accuracy of model on normal data :", metrics.accuracy_score(y_test, y_pred))
print("Accuracy of model on PCA data :", metrics.accuracy_score(yPCA_test, yPCA_pred))
```
```
Accuracy of model on normal data : 0.9518518518518518
Accuracy of model on PCA data : 0.9555555555555556
```
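Note that above we fit the PCA on the full dataset before splitting, which leaks a little information from the test set into the transform. A common way to avoid this is to chain the steps with scikit-learn's `Pipeline`, so that scaling and PCA are fit on the training split only; a minimal sketch (the step names are arbitrary, and `max_iter` is raised only to avoid convergence warnings):

```python
from sklearn.pipeline import Pipeline

# Scale, project onto 30 components, then classify -- all fit on train only
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=30)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("Pipeline accuracy:", pipe.score(X_test, y_test))
```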
Now, let us understand how PCA actually works.
The story begins with Eigenvalue and Eigenvector. In linear algebra, an eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled.
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/eigen-vector.png)
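A quick numerical check of this definition (a toy symmetric matrix chosen only for illustration):

```python
# Verify that A @ v equals lambda * v for a small symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eig_vals, eig_vecs = np.linalg.eigh(A)  # eigenvalues in ascending order: 1, 3
v = eig_vecs[:, 1]                      # eigenvector for the eigenvalue 3
print(np.allclose(A @ v, eig_vals[1] * v))  # True: A only rescales v
```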
Now apply PCA step by step to the whole digits data, starting with standardization:
```python
standardized_data = StandardScaler().fit_transform(x)
sample_data = standardized_data

# Covariance matrix of the standardized data (dividing by n - 1)
covar_matrix = np.matmul(sample_data.T, sample_data) / (sample_data.shape[0] - 1)
print("The shape of co-variance matrix =", covar_matrix.shape)
```
```
The shape of co-variance matrix = (64, 64)
```
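As a sanity check (an addition to the original walkthrough), this hand-rolled matrix should match NumPy's built-in covariance routine, since `StandardScaler` has already centered each feature:

```python
# np.cov expects variables in rows, hence the transpose
print(np.allclose(covar_matrix, np.cov(sample_data.T)))  # True
```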
```python
# Heatmap of the correlations between the covariance matrix's columns
df = pd.DataFrame(covar_matrix)
sn.heatmap(df.corr())
```
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/image-15.png)
```python
# eigh returns eigenvalues in ascending order; indices 62 and 63 select
# the two largest of the 64 (subset_by_index replaces the deprecated
# eigvals=(62, 63) keyword of scipy.linalg.eigh)
values, vectors = eigh(covar_matrix, subset_by_index=[62, 63])
print("Shape of eigen vectors =", vectors.shape)
vectors = vectors.T
print("Updated shape of eigen vectors =", vectors.shape)
```
```
Shape of eigen vectors = (64, 2)
Updated shape of eigen vectors = (2, 64)
```
```python
# Project the data onto the two eigenvectors: (2, 64) x (64, 1797) -> (2, 1797)
new_coordinates = np.matmul(vectors, sample_data.T)
print("resultant new data points' shape", vectors.shape, "X", sample_data.T.shape, "=", new_coordinates.shape)
```
```
resultant new data points' shape (2, 64) X (64, 1797) = (2, 1797)
```
```python
# Append the labels and put everything into a dataframe
new_coordinates = np.vstack((new_coordinates, Y)).T
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())
```
```
   1st_principal  2nd_principal  label
0       0.954502       1.914214    0.0
1      -0.924636       0.588980    1.0
2       0.317189       1.302039    2.0
3       0.868772      -3.020770    3.0
4       1.093480       4.528949    4.0
```
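As a cross-check (an addition, not part of the original walkthrough), the eigenvectors we extracted by hand should match scikit-learn's PCA components up to sign, which is arbitrary for eigenvectors. Note that `eigh` returned the eigenvalues in ascending order, so the second row of `vectors` corresponds to the top component:

```python
# Compare hand-rolled eigenvectors with sklearn's PCA (signs may flip)
pca_check = PCA(n_components=2).fit(sample_data)
print(np.allclose(np.abs(vectors[1]), np.abs(pca_check.components_[0])))  # True
print(np.allclose(np.abs(vectors[0]), np.abs(pca_check.components_[1])))  # True
```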
```python
# Reduce to 3 components with sklearn and plot the first two
pca.n_components = 3
pca_data = pca.fit_transform(x)
print("shape of pca_reduced =", pca_data.shape)

pca_data = np.vstack((pca_data.T, Y)).T
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "3rd_principal", "label"))

sn.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
```
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/image-16.png)
To look at the same data in a 3-D plot:
```python
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

df = pd.DataFrame(pca_data, columns=['x', 'y', 'z', 'label'])
df['label'] = pd.Categorical(df['label'])
my_color = df['label'].cat.codes

fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['x'], df['y'], df['z'], c=my_color, cmap="Set2_r", s=5)

# Draw the three coordinate axes through the origin
xAxisLine = ((min(df['x']), max(df['x'])), (0, 0), (0, 0))
ax.plot(xAxisLine[0], xAxisLine[1], xAxisLine[2], 'r')
yAxisLine = ((0, 0), (min(df['y']), max(df['y'])), (0, 0))
ax.plot(yAxisLine[0], yAxisLine[1], yAxisLine[2], 'g')
zAxisLine = ((0, 0), (0, 0), (min(df['z']), max(df['z'])))
ax.plot(zAxisLine[0], zAxisLine[1], zAxisLine[2], 'b')

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("PCA on the digits dataset")
plt.show()
```
![](http://www.machineintellegence.com/wp-content/uploads/2020/05/image-17.png)