Principal Component Analysis - PCA
Introduction
Principal component analysis (PCA) finds the main components of the input data: the directions onto which we can project the data points without losing much of the information. The information preserved is measured indirectly by the amount of variance retained. The first step in PCA is to subtract the mean from the input; sklearn does this automatically. We then choose the number of dimensions that retains, say, at least 90-95% of the variance. How? Imagine all the data points plotted on the axes, with the origin moved to the mean (which is now zero). The principal directions of the data are then the eigenvectors of the covariance matrix, with the top eigenvector being the most important component (direction). Let \(\lambda_1, \lambda_2, \ldots, \lambda_n\) be the corresponding eigenvalues, sorted in decreasing order. If we retain k principal components, the fraction of variance retained is \(\frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{n}\lambda_j}\), so we can compute how many dimensions are needed to retain a given amount of variance. When we recover the images after PCA they come out noisy, so PCA can also be seen as a (lossy) compression scheme. PCA can be applied to any high-dimensional dataset.
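As a minimal sketch of the same procedure in plain NumPy (the toy matrix X below is made up for illustration): center the data, take the eigendecomposition of the covariance matrix, and read the retained-variance fraction off the sorted eigenvalues.
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy data: 100 samples, 5 features
X_centered = X - X.mean(axis=0)                # subtract the mean (sklearn does this for us)
cov = np.cov(X_centered, rowvar=False)         # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(eigvals)[::-1]              # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
retained = eigvals[:k].sum() / eigvals.sum()   # sum_{j<=k} lambda_j / sum_j lambda_j
X_proj = X_centered @ eigvecs[:, :k]           # project onto the top-k principal directions
print("variance retained with k = {}: {:.2%}".format(k, retained))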
Code example
Let’s consider the Yale face dataset. To retain 95% of the variance, we keep 68 principal components. When we run a random forest on the raw X, the accuracy is 42%, which is low due to the small number of images. Plotting the feature importances shows the model focusing on the eyes, nose and chin. After PCA, the accuracy drops to 33% (about 9 percentage points lower), so the small details on the faces account for roughly a quarter of the predictive ability. When we restore the images from the 68 components and run the random forest again, the accuracy rises back to 39%. The feature-importance heatmap shows that after PCA the algorithm only glances over the picture, relying mostly on lighting and complexion.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from matplotlib import image
import seaborn as sns
import pandas as pd
import numpy as np
import os
PATH='./yalefaces/'
image_paths=next(os.walk(PATH))[2]
X =[]
y=[]
for i in range(165):
    image_name = image_paths[i]
    image = plt.imread(PATH + image_name)
    X.append(image.flatten())        # flatten each 231x195 image into a 45045-vector
    y.append(image_name[10:-5])      # label (expression/lighting) comes from the file name
plt.imshow(image)                    # show the last image read
plt.show()
print(image.shape)
X=np.array(X)
print(X.shape)
pd.Series(y).value_counts()
(231, 195)
(165, 45045)
glasses 15
noglasses 15
centerlight 15
happy 15
leftlight 15
normal 15
rightlight 15
sleepy 15
wink 15
sad 15
surprised 15
dtype: int64
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42, oob_score=True)
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
heatmap_image = rnd_clf.feature_importances_.reshape(231,195)
plt.imshow(heatmap_image, cmap="hot")
cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(),
rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['not important', 'very important'], fontsize=14)
plt.axis("off")
plt.show()
Accuracy: 42.42%
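Since the forest was fit with oob_score=True, the out-of-bag estimate gives a quick sanity check on the test accuracy without a separate validation split (a brief sketch reusing the classifier fit above):
print("OOB score: {:.2f}%".format(rnd_clf.oob_score_ * 100))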
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
# to retain 95% of the variance, PCA keeps 68 principal components
# on X_pca, the accuracy drops to 33% (about 9 percentage points lower)
# i.e. the small details account for roughly a quarter of the predictive ability
len(pca.explained_variance_ratio_)
68
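Passing a float to n_components lets sklearn choose k for us; the same number can be recovered by hand from the cumulative explained-variance curve. A sketch, assuming we are willing to fit a full PCA (pca_full below) on the same X:
pca_full = PCA().fit(X)                                    # keep all min(n_samples, n_features) components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)     # cumulative variance retained
k = np.argmax(cumvar >= 0.95) + 1                          # first k reaching 95%
print(k, cumvar[k - 1])                                    # should agree with n_components=0.95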
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=2)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42, oob_score=True)
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
cf = confusion_matrix(y_test, y_pred)
plt.figure()
sns.heatmap(cf, annot=True)
plt.xlabel('Prediction')
plt.ylabel('Target')
plt.title('Confusion Matrix')
Accuracy: 33.33%
# Here are the recovered images, we can see PCA as a compression method:
X_recovered = pca.inverse_transform(X_pca)
for i in range(3):
    plt.imshow(X_recovered[i].reshape(231, 195))
    plt.show()
# if we run the random forest on the restored dataset,
# the accuracy rises back to 39%
X_tilde = []
for i in range(165):
    X_tilde.append(X_recovered[i])
X_tilde = np.array(X_tilde)
X_tilde.shape
(165, 45045)
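One way to quantify what the compression throws away (a rough sketch using the arrays already defined above) is the reconstruction error between the original and restored images:
mse = np.mean((X - X_recovered) ** 2)                             # mean squared pixel error
rel_err = np.linalg.norm(X - X_recovered) / np.linalg.norm(X)     # relative error over the dataset
print("reconstruction MSE: {:.2f}, relative error: {:.2%}".format(mse, rel_err))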
X_train, X_test, y_train, y_test = train_test_split(X_tilde, y, test_size=0.2, random_state=2)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42, oob_score=True)
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
heatmap_image = rnd_clf.feature_importances_.reshape(231,195)
plt.imshow(heatmap_image, cmap="hot")
cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(),
rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['not important', 'very important'], fontsize=14)
plt.axis("off")
plt.show()
# with PCA, the algorithm only glances over the images,
# focusing mostly on the lighting and the complexion
Accuracy: 39.39%
def plot_explained_variance(pca):
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_
    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(round(vals[i]*100, 1))[:3])), (ind[i]+0.2, vals[i]),
                    va="bottom",
                    ha="center",
                    fontsize=12)
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=1, length=6)
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')
plot_explained_variance(pca)
# here are some main components:
def plot_component(pca, comp):
    if comp < len(pca.components_):                                    # valid component index
        mat_data = np.asmatrix(pca.components_[comp]).reshape(231, 195)  # reshape the component into an image
        plt.imshow(mat_data)   # plot the component
        plt.xticks([])         # remove numbered labels on the x-axis
        plt.yticks([])         # remove numbered labels on the y-axis
        plt.show()
plot_component(pca, 0)
plot_component(pca, 1)
plot_component(pca, 2)