Random Forests

Random Forests blend together the results of many classification or regression trees (CART models). A forest is a set of trees, and in this sense a Random Forest is simply a collection of different CART models whose predictions are combined.

CART models offer more interpretable output than Random Forests; however, the relative importance of each predictor variable can still be displayed. The graph above shows the results of 1000 different Random Forest models for predicting survival on the Titanic, with the relative importance of each predictor variable displayed. The output is grouped into 3 clusters in order to show the variation between the different models.

Ensemble Methods

A predictive model converts a set of inputs into a prediction.  If a small change in the value of one (or more) of the inputs produces a large change in the value of the prediction, then the model has high variance.  Models with high variance may perform well on a training set of observations but are unlikely to generalise to making accurate predictions on a different set of data.

Ensemble-based methods can reduce prediction variance by averaging the results of a collection of different models. The less correlated these individual models are with each other, the lower the variance of the ensemble constructed from them.
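
As a rough illustration (not from the original analysis), the R snippet below simulates averaging the predictions of 25 independent, equally noisy models; the variance of the ensemble average is far smaller than the variance of any single model.

#a hedged sketch: simulate 25 uncorrelated "model" predictions of a true value of 10
set.seed(42)
n.models <- 25
#each column holds one model's predictions over 1000 repetitions
single <- replicate(n.models, 10 + rnorm(1000, sd = 2))
ensemble <- rowMeans(single)
var(single[, 1])   #variance of an individual model
var(ensemble)      #roughly 1/25th of the individual variance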

How to derive a prediction from many trees?

In the case of a CART model, the prediction is derived by comparing the value of the predictor variables at each successive decision branch and then choosing the appropriate alternative.  In the case of many trees (i.e. a forest) the same approach is taken, but many predictions are obtained – one for each tree in the forest.  In the case of regression, where the prediction is a real number, the result for the forest is the average of the tree predictions.  In the case where the prediction is a discrete classification, the result for the forest is the majority prediction.
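
A small, purely illustrative R snippet (the values are made up) showing how the two cases are combined:

#regression: the forest prediction is the average of the tree predictions
tree.preds.reg <- c(21.5, 19.8, 23.1, 20.4)
mean(tree.preds.reg)
#classification: the forest prediction is the majority vote of the trees
tree.preds.class <- c("survived", "died", "survived", "survived")
names(which.max(table(tree.preds.class)))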

How to create multiple trees?

Multiple trees are created by repeatedly sampling the set of observations with replacement. For example, if there are 100 observations and a sample of 100 is drawn with replacement, then there will be some duplicate values and some values from the original set that are not included.  This means the sample is slightly different to the original set and this difference means that the sample is not perfectly correlated with the original set of observations.  A tree is grown from this new set of sampled observations. Another sample is created and another tree is grown. This process continues until a specified number of trees have been created.
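
The sampling step itself is simple to illustrate in R; the snippet below (not from the original code) draws a bootstrap sample of 100 observations and shows that some values are duplicated while others are left out.

set.seed(1)
obs <- 1:100
boot <- sample(obs, size = 100, replace = TRUE)
sum(duplicated(boot))        #some observations appear more than once
length(setdiff(obs, boot))   #and some do not appear at all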

The diagram above assumes five observations [1, 2, 3, 4, 5] and three predictor variables [A, B, C].  It shows the construction of four simple (but different) trees.  The observations have been sampled with replacement which means that some observations can occur more than once.  In the scenario above, two predictor variables are used to grow each tree, rather than using the entire set of predictor variables.

Bagged Forests Versus Random Forests

With a bagged forest, all of the predictor variables are applied to each tree that is created.  For example, if there are 10 predictor variables and 200 trees are created, then, using bagging, all 10 predictor variables are applied to each of the 200 trees.

Random Forests apply a different technique.  Rather than applying all predictor variables, a random subset of n predictors is drawn and applied to each tree in the forest.  For example, if there are 10 predictors and n = 3, then predictors 1, 7 and 9 may be applied to tree 1 while predictors 3, 2 and 9 may be applied to tree 2.  This process adds additional randomness to each tree, which helps to reduce the overall prediction variance.
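
With the randomForest package this difference comes down to the mtry argument (note that randomForest() actually re-draws the mtry predictors at each split rather than once per tree). The sketch below is illustrative only and assumes a data frame trainData with a survived factor, as in the later examples.

library(randomForest)
p <- ncol(trainData) - 1                                          #number of predictor variables
bagged  <- randomForest(survived ~ ., data = trainData, mtry = p, ntree = 200)  #bagged forest
rforest <- randomForest(survived ~ ., data = trainData, mtry = 3, ntree = 200)  #random forest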

Ada Boosting

Adaptive Boosting (Ada Boosting) generates trees whose structure (unlike Random Forests) depends on the trees grown previously.  Ada Boosting generates an initial model and then calculates the residuals of this model.  A tree is then grown to fit the residuals, and a portion of this residual-based tree (governed by the shrinkage parameter) is added to the existing model.  This process continues for a specified number of iterations.
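
The stagewise process described above can be sketched by hand with small rpart trees.  This is only an illustration of the residual-fitting loop (closer in spirit to gradient boosting) rather than the ada package's actual implementation, and the function name is made up.

library(rpart)
boost.sketch <- function(x, y, n.iter = 50, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))              #initial model: the overall mean
  for (i in seq_len(n.iter)) {
    resid <- y - pred                          #residuals of the current model
    fit <- rpart(resid ~ ., data = data.frame(x, resid),
                 control = rpart.control(maxdepth = 2))
    #add a shrunken portion of the residual-based tree to the model
    pred <- pred + shrinkage * predict(fit, newdata = data.frame(x))
  }
  pred
}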

Interpreting a Tree Compared to a Forest

Forests can produce more powerful predictive models than individual trees; however, a forest is difficult to interpret.  If the resultant forest-based model is created from hundreds or thousands of trees, then visualising the result can be impossible.  R's randomForest package and the party package have functions that can report the relative importance of each predictor, but this is less expressive than the output from a CART model.

Implementation in R

To implement Random Forests, R’s randomForest and party packages were used.  The party package uses conditional inference trees which may improve results when the predictor variables are correlated.  The ada package was used to implement the Adaptive Boosting algorithm.
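
As a hedged sketch of the party alternative (not the author's exact code), cforest() grows a forest of conditional inference trees and varimp() reports predictor importance; the data and parameter values below are assumed for illustration.

library(party)
cf <- cforest(survived ~ ., data = trainData,
              controls = cforest_unbiased(ntree = 500, mtry = 3))
varimp(cf)    #set conditional = TRUE to adjust for correlated predictors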

Using the randomForest Package

The two main parameters tuned for the randomForest() function are the number of trees (ntree) and the number of predictor variables sampled for each tree (mtry).  The data set used in the code below has 9 variables; using cross-validation, the number of variables giving the highest prediction accuracy was found to be 3.

library(randomForest)
#mTitanicAll contains 1309 rows with 9 variables
#split this up into a training and test set
#(the size of the split is assumed here; the original line was garbled)
trainSet <- sample(1:nrow(mTitanicAll), floor(nrow(mTitanicAll) / 2))
#create a 2000 tree random forest, taking samples of 3 predictors
forest.survive <- randomForest(survived ~ ., data = mTitanicAll,
 mtry = 3, importance = TRUE, subset = trainSet, ntree = 2000)
#make predictions on the test set
survive.Predict <- predict(forest.survive, newdata = mTitanicAll[-trainSet, ])
#create and then display confusion matrix
tab <- table(survive.Predict, mTitanicAll[-trainSet, "survived"])
tab
#list importance of the variables
importance(forest.survive, type = 1)

Ada Boost & the Caret Package

The Ada Boost algorithm is implemented by the ada package and uses 3 parameters: the depth of the trees (maxdepth), the shrinkage parameter (nu) and the number of iterations (iter).  The caret package implements a series of functions that assist in the model training process.  In this example, 3 different values were used for each of the 3 parameters, resulting in 3 x 3 x 3 = 27 different models.

The caret package by default implements a K-fold cross-validation process, where K = 10.  This process was repeated twice, so 2 x 10 = 20 different cross-validation resamples were applied to each model.  As there were 27 different models, 27 x 20 = 540 different model fits were performed.  The following took 17 minutes to run.

library(caret)
#data has been previously split into two sets:
#testData and trainData
# 3 * 3 * 3 = 27 parameter combinations
#10-fold cross-validation repeated 2 times = 20 resamples per model,
#so 27 * 20 = 540 model fits in total
ada.simple <- expand.grid(maxdepth = c(2,4,6), nu = c(0.01, 0.1, 1),
 iter = c(20, 50, 80))
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 2,
 summaryFunction = twoClassSummary, classProbs = TRUE)
ada.tune <- train(survived ~ ., data = trainData,
 method = "ada", metric = "ROC",
 tuneGrid = ada.simple, trControl = cv.ctrl)
#took 17 minutes to run

#plot the results
plot(ada.tune)
#create predictions
predClass <- predict(ada.tune, newdata = testData)

GitHub Code and Data

The code and data are available from GitHub.
