Principal Components


Principal Components Analysis (PCA) is an unsupervised learning technique for data reduction that was invented more than 100 years ago. When a data set contains a large number of features, some of them may be highly correlated. These highly correlated variables can essentially be merged, which reduces the number of dimensions needed to represent the data. The resulting dimensions are called the principal components.

This data reduction enables the use of techniques that are better suited to a small number of dimensions, such as visualisation, regression and clustering. Visualisation in particular is only practical in two or three dimensions.

Interpreting Principal Components

With PCA, a data space described by P features can be represented using at most P principal components. The first principal component captures as much of the variation in the data set as possible; the second captures as much of the remaining variance as possible, subject to the constraint that it is uncorrelated with (orthogonal to) the first. This process continues until all of the variance has been captured by the principal components.
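As a small illustration of these properties, the sketch below runs R's built-in prcomp() function on synthetic data. The data and all variable names are made up for this example. It checks that the component variances come out in non-increasing order and that the component scores are uncorrelated with each other.

```r
# Synthetic data (an assumption for illustration):
# x2 is strongly correlated with x1, x3 is independent noise
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
toyMat <- cbind(x1, x2, x3)

# PCA on centred and scaled data
toyPca <- prcomp(toyMat, center = TRUE, scale. = TRUE)

# Variance captured by each component, largest first
pcVar <- toyPca$sdev^2
print(pcVar)

# Correlation between component scores:
# the off-diagonal entries are zero up to rounding error
scoreCor <- cor(toyPca$x)
print(round(scoreCor, 10))
```

Because x1 and x2 carry nearly the same information, most of the variance should land in the first two components, with very little left for the third.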

The first principal component (x axis) represents more variance than the second principal component (y axis), which is why the graph is wider than it is tall.

How many Principal Components to use

As more principal components are used, more of the variance of the original data set is captured, but each additional component adds less than the one before it. If the original data set contains features that are highly correlated, then the number of principal components needed to represent 95% of the variance may be quite small.
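One way to pick the number of components is to compute the cumulative proportion of variance and take the smallest count that reaches the target. The sketch below does this for a 95% target on synthetic data in which ten features are driven by only two underlying signals. The data, the mixing weights and the variable names are assumptions for illustration only.

```r
# Synthetic data (an assumption for illustration):
# 10 features built from 2 underlying signals plus a little noise
set.seed(1)
n       <- 300
signals <- matrix(rnorm(n * 2), ncol = 2)
mixing  <- rbind(seq(1, 0.1, length.out = 10),
                 seq(0.1, 1, length.out = 10))
corrMat <- signals %*% mixing + matrix(rnorm(n * 10, sd = 0.1), ncol = 10)

corrPca <- prcomp(corrMat, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance captured by the first 1, 2, ... components
cumVar <- cumsum(corrPca$sdev^2) / sum(corrPca$sdev^2)
print(round(cumVar, 3))

# Smallest number of components whose cumulative proportion reaches 95%
kNeeded <- which(cumVar >= 0.95)[1]
print(kNeeded)
```

Since the ten features here are highly correlated by construction, far fewer than ten components should be needed to reach the target.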

Implementation in R

In R, the prcomp() function in the stats package provides an easy implementation of PCA. However, the following uses the svd() (Singular Value Decomposition) function instead, applied to the covariance matrix, which must therefore be computed first.
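For comparison, a minimal prcomp() call might look like the sketch below. It uses the USArrests data set that ships with base R rather than the data used in this post, so the numbers are not comparable with the rest of the walkthrough.

```r
# USArrests ships with base R (datasets package): 50 states x 4 numeric features
arrestsPca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings: one column per principal component
print(arrestsPca$rotation)

# Data expressed in the first two principal components (50 x 2)
arrestsReduced <- arrestsPca$x[, 1:2]
print(dim(arrestsReduced))

# Standard deviation and proportion of variance for each component
print(summary(arrestsPca))
```

The rotation element holds the component directions, x holds the projected data, and sdev holds the standard deviation of each component.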

Calculating Covariance

To calculate the covariance matrix for all variables, the data is first normalised and then the matrix expression t(X) X / n is evaluated, where X is the normalised data matrix with n rows, as implemented in R below:

# scale each column to mean = 0, sd = 1. ssDataMat = 7352 x 561
ssDataMat <- scale(ssDataMat)
# calculate covariance matrix. ssCov = 561 x 561
# (divides by n; cov() would divide by n - 1, a negligible difference for 7352 rows)
ssCov <- (t(ssDataMat) %*% ssDataMat) / nrow(ssDataMat)
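The result can be sanity-checked against R's built-in cor() function. The sketch below uses a small random matrix as a hypothetical stand-in for ssDataMat. Note that scale() standardises with the sample standard deviation (n - 1 denominator), so dividing by n - 1 reproduces cor() exactly, while dividing by n gives the same matrix multiplied by the constant (n - 1) / n. That constant changes neither the eigenvectors nor the proportions of variance.

```r
# Hypothetical stand-in for ssDataMat: 200 rows x 5 features
set.seed(7)
demoMat <- matrix(rnorm(200 * 5), ncol = 5)
demoMat[, 2] <- demoMat[, 1] + 0.5 * demoMat[, 2]  # introduce some correlation

n <- nrow(demoMat)
scaledMat <- scale(demoMat)  # centre each column, divide by its sample sd

# Matrix expression from the text, dividing by n
covByN <- (t(scaledMat) %*% scaledMat) / n
# Same expression dividing by n - 1, which is what cov() and cor() use
covByNm1 <- crossprod(scaledMat) / (n - 1)

# Dividing by n - 1 reproduces the correlation matrix of the raw data
print(all.equal(covByNm1, cor(demoMat), check.attributes = FALSE))
# The two versions differ only by the constant factor (n - 1) / n
print(all.equal(covByN, covByNm1 * (n - 1) / n))
```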

Creating the Principal Components

The covariance matrix is symmetric and positive semi-definite, which means that svd() factors it into its eigenvectors and eigenvalues. The columns of the eigenvector matrix define the principal components.
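This equivalence can be checked by comparing svd() with R's eigen() function on a small covariance matrix. The sketch below uses random data and made-up variable names, so it is only an illustration of the property and not part of the original analysis.

```r
# Small symmetric positive semi-definite matrix built from random data
set.seed(3)
demoData <- scale(matrix(rnorm(100 * 4), ncol = 4))
demoCov  <- crossprod(demoData) / nrow(demoData)

demoSvd   <- svd(demoCov)
demoEigen <- eigen(demoCov, symmetric = TRUE)

# Singular values equal the eigenvalues, both sorted in descending order
print(all.equal(demoSvd$d, demoEigen$values))

# Eigenvectors match up to sign, so compare absolute values
print(all.equal(abs(demoSvd$u), abs(demoEigen$vectors)))

# Columns of u are orthonormal: t(u) %*% u is the identity matrix
print(all.equal(crossprod(demoSvd$u), diag(4)))
```

The sign of each eigenvector is arbitrary, which is why the comparison uses absolute values.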

Summarising the data using the first two (or n) principal components is easily achieved by projecting it onto the first two (or n) eigenvector columns. The svd() function returns a named list whose u component holds the eigenvectors. The code below shows how to represent the data using two principal components:

# singular value decomposition of the covariance matrix
svd <- svd(ssCov)
# construct PC reduction using first k components
k <- 2
# in this case reduce dimensions from 561 to 2
ssReduction <- ssDataMat %*% svd$u[, 1:k]
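As a cross-check, the projection obtained this way should agree with the scores returned by prcomp(), apart from the sign of each component. The sketch below verifies this on a small random matrix that stands in for ssDataMat; the data and variable names are hypothetical.

```r
# Hypothetical stand-in for ssDataMat: 150 rows x 6 features
set.seed(11)
rawMat <- matrix(rnorm(150 * 6), ncol = 6)
rawMat[, 2] <- rawMat[, 1] + 0.2 * rawMat[, 2]  # strongly correlated pair
rawMat[, 4] <- rawMat[, 3] - 1.0 * rawMat[, 4]  # moderately correlated pair

# Covariance-plus-svd() route described above
scaledRaw <- scale(rawMat)
covRaw    <- (t(scaledRaw) %*% scaledRaw) / nrow(scaledRaw)
covSvd    <- svd(covRaw)
k         <- 2
viaSvd    <- scaledRaw %*% covSvd$u[, 1:k]

# prcomp() route on the same data
viaPrcomp <- prcomp(rawMat, center = TRUE, scale. = TRUE)$x[, 1:k]

# The sign of each component is arbitrary, so compare absolute values
print(all.equal(abs(viaSvd), abs(viaPrcomp), check.attributes = FALSE))
```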

Calculating the variance of K principal components

The svd() function returns the eigenvalues in its d component, in descending order. The proportion of variance explained by the first K principal components is obtained by dividing the sum of the first K eigenvalues by the sum of all eigenvalues. This may be easier to understand from the following code, where K = 2.

# calculate proportion of total variance explained
# 1. get the eigenvalues
eigen <- svd$d
# 2. proportion explained by the first two eigenvalues
firstTwo <- sum(eigen[1:2]) / sum(eigen)
sprintf("Variance explained by first two components: %s", firstTwo)

GitHub Code and Data

The source code and data are available on GitHub.