Principal Components Analysis (PCA) was invented more than 100 years ago and is an unsupervised data reduction technique. When a data set contains a large number of features, some of them may be highly correlated. It is then possible to merge these highly correlated variables and, as a consequence, reduce the number of dimensions used to represent the data. The resulting dimensions are called the principal components.
This data reduction can enable the use of techniques that are better suited to a small number of dimensions, such as visualisation, regression and clustering. Visualisation in particular is only practical in two or three dimensions.
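To make the idea of merging correlated features concrete, here is a minimal sketch on simulated data. The variables x1, x2 and x3 are made up for illustration and are not part of the data set used later in this article. Because x2 is almost a copy of x1, two components should capture nearly all of the variance of the three features.

```r
# Illustration only: simulated data, not the data set used in this article.
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is almost a copy of x1, so the two are highly correlated
x3 <- rnorm(n)                  # x3 is unrelated to the other two

simData <- scale(cbind(x1, x2, x3))
pca <- prcomp(simData)

# Proportion of total variance captured by each principal component
propVar <- pca$sdev^2 / sum(pca$sdev^2)
print(round(propVar, 3))

# Loadings show which original features each component combines
print(round(pca$rotation, 2))
```

The loadings printed at the end indicate which original features contribute to each component.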
Interpreting Principal Components
A data space represented by P features can be represented using at most P principal components. The first principal component captures as much of the variation in the data set as possible; the second captures as much of the remaining variance as possible, subject to the constraint that it is uncorrelated with (orthogonal to) the first. This process continues until all the variance has been captured by the principal components.
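The two properties described above (variance captured in decreasing order, and components that are uncorrelated with each other) can be checked directly. The sketch below uses R's built-in iris measurements purely as a stand-in; that choice of data set is an assumption for illustration.

```r
# Illustration only: the built-in iris measurements stand in for real data.
X <- scale(as.matrix(iris[, 1:4]))   # centre and scale the 4 numeric features

pca <- prcomp(X)                     # stats::prcomp performs the PCA
scores <- pca$x                      # the data expressed in principal components

# Variance captured by each component, largest first
pcVariance <- pca$sdev^2
print(round(pcVariance, 3))

# Correlations between different components are zero (up to rounding)
print(round(cor(scores), 3))
```

With P = 4 features there are at most 4 components, and the correlation matrix of the scores is the identity matrix up to rounding.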
How many Principal Components to use
As more principal components are used, more of the variance of the original data set is captured, but each additional component adds less than the one before it. If the original data set contains features that are highly correlated, then the number of principal components needed to capture 95% of the variance may be quite small.
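One way to pick the number of components is to accumulate the proportion of variance and stop at the first component that crosses a threshold. The sketch below assumes a 95% threshold and uses the built-in mtcars data, whose features are fairly correlated, as a stand-in.

```r
# Illustration only: mtcars stands in for a data set with correlated features.
X <- scale(as.matrix(mtcars))

pca <- prcomp(X)

# Proportion of variance per component and its running total
propVar <- pca$sdev^2 / sum(pca$sdev^2)
cumVar <- cumsum(propVar)
print(round(cumVar, 3))

# Smallest number of components whose cumulative variance reaches the threshold
threshold <- 0.95
kNeeded <- which(cumVar >= threshold)[1]
cat("Components needed for", threshold * 100, "% of variance:", kNeeded, "\n")
```

The increments in cumVar shrink from one component to the next, which is the decreasing rate described above.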
Implementation in R
Using R, the prcomp() function in the stats package enables easy implementation of PCA. However, the following will instead apply svd() (Singular Value Decomposition) to the covariance matrix of the data, so the covariance matrix has to be computed first.
To calculate the covariance matrix for all variables, the data is first normalised and then the matrix expression t(X) %*% X / n is evaluated, where X is the normalised data matrix and n is its number of rows. This is implemented in R below:
```r
# scale each column to mean = 0, sd = 1. ssDataMat = 7352 x 561
ssDataMat <- scale(ssDataMat)

# calculate covariance matrix. ssCov = 561 x 561
ssCov <- (t(ssDataMat) %*% ssDataMat) / nrow(ssDataMat)
```
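As a sanity check on the matrix expression, it can be compared with R's built-in cov(). The sketch below uses the iris measurements as a stand-in. Note that cov() divides by n - 1 while the expression above divides by n, so the two differ only by the constant factor (n - 1) / n, which does not change the eigenvectors or the proportions of variance.

```r
# Illustration only: iris stands in for the article's data matrix.
X <- scale(as.matrix(iris[, 1:4]))
n <- nrow(X)

# Covariance via the matrix expression, dividing by n
manualCov <- (t(X) %*% X) / n

# cov() divides by n - 1, so rescale it before comparing
builtinCov <- cov(X) * (n - 1) / n

print(all.equal(manualCov, builtinCov, check.attributes = FALSE))
```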
Creating the Principal Components
The covariance matrix is symmetric and positive semi-definite, which means that its singular value decomposition coincides with its eigendecomposition: svd() factors the matrix into its eigenvectors and eigenvalues. The eigenvectors, stored as columns, give the directions of the principal components.
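The claim that svd() recovers eigenvectors and eigenvalues for a covariance matrix can be checked against eigen(). This is a small sketch on the iris measurements, used here only as a stand-in; eigenvectors are compared in absolute value because each column is only defined up to its sign.

```r
# Illustration only: iris stands in for the article's data matrix.
X <- scale(as.matrix(iris[, 1:4]))
covMat <- cov(X)

decompSvd <- svd(covMat)
decompEig <- eigen(covMat, symmetric = TRUE)

# The singular values of a covariance matrix equal its eigenvalues
print(all.equal(decompSvd$d, decompEig$values))

# The columns of u match the eigenvectors up to a sign flip per column
print(all.equal(abs(decompSvd$u), abs(decompEig$vectors)))
```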
To summarise the data using the first two (or n) principal components, project it onto the first two (or n) eigenvector columns. The svd() function returns a list, and its u component holds the eigenvectors. The code below shows how to represent the data using two principal components:
```r
# singular value decomposition of the covariance matrix
svd <- svd(ssCov)

# construct PC reduction using first k components
k <- 2  # in this case reduce dimensions from 561 to 2
ssReduction <- ssDataMat %*% svd$u[, 1:k]
```
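Since prcomp() was mentioned earlier as the easy route, it is worth checking that the manual projection gives the same reduced data. The sketch below repeats the steps on the iris measurements (a stand-in, not the article's data) and compares the result with prcomp(). The comparison uses absolute values because the sign of each component is arbitrary.

```r
# Illustration only: iris stands in for the article's data matrix.
X <- scale(as.matrix(iris[, 1:4]))
k <- 2

# Manual route: covariance matrix, svd, then projection onto k eigenvectors
covMat <- (t(X) %*% X) / nrow(X)
decomp <- svd(covMat)
manualReduction <- X %*% decomp$u[, 1:k]

# Built-in route: prcomp() returns the projected data in its x component
prcompReduction <- prcomp(X)$x[, 1:k]

print(dim(manualReduction))
print(all.equal(abs(manualReduction), abs(prcompReduction),
                check.attributes = FALSE))
```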
Calculating the variance of K principal components
The svd() function returns the eigenvalues in its d component, in descending order. The proportion of variance explained by the first K principal components is the sum of the first K eigenvalues divided by the sum of all eigenvalues. The following code shows this for K = 2.
```r
# calculate total variance explained
# 1. get the eigenvalues
eigen <- svd$d

# proportion explained by first two eigenvalues
firstTwo <- sum(eigen[1:2]) / sum(eigen)
sprintf("Variance explained by first two components: %s", firstTwo)
```
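The same calculation generalises to any K. Below is a minimal sketch of a helper, hypothetically named varianceExplained(), which is not part of the original code; it is applied here to the iris measurements as a stand-in.

```r
# Hypothetical helper (not from the original code): proportion of variance
# explained by the first k components, given eigenvalues in descending order.
varianceExplained <- function(eigenvalues, k) {
  stopifnot(k >= 1, k <= length(eigenvalues))
  sum(eigenvalues[1:k]) / sum(eigenvalues)
}

# Illustration only: iris stands in for the article's data matrix.
X <- scale(as.matrix(iris[, 1:4]))
eigenvalues <- svd(cov(X))$d

for (k in seq_along(eigenvalues)) {
  cat(sprintf("K = %d: %.3f\n", k, varianceExplained(eigenvalues, k)))
}
```

Using all components always gives a proportion of 1, which is a quick way to confirm the eigenvalues were extracted correctly.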
GitHub Code and Data
The source code and data are available on GitHub.