Ridge Regression

Ridge regression is a method used to produce simpler but often more accurate regression models. It is a form of ‘regularisation’.

Model selection

When a function is fitted to training data, there is a risk that an overly flexible function will fit the training data very well but will not generalise to unseen test data. For example, it may be possible to fit the training data perfectly using an n-degree polynomial, but such a model is unlikely to perform well on test data. This phenomenon is called overfitting.

Graph showing different under and overfit models

The graph above uses simulated data. A function was used to create the grey dotted line. Some noise was added to this function and a random sample of 80 observations was obtained. This sample was partitioned into 40 training and 40 test cases; the training set is displayed above as the white dots. An 18-degree polynomial was then fitted to the training data. The value of lambda was then varied, producing the 7 coloured curves displayed above. The glmnet package provides a facility (cv.glmnet) to return the value of lambda that minimises the cross-validated error; this is the yellow curve. The blue curves have lambda values greater than the optimum lambda and underfit the data. The red curves have lambda values less than the optimum lambda and overfit the data. A lambda value of 0 corresponds to ordinary OLS regression; an extremely high value of lambda shrinks the fit to a horizontal line at the intercept, which equals the mean of Y.
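As a rough sketch of how curves like these can be produced (the data-generating function and names below are illustrative, not the exact code behind the graph), a single call to glmnet over an explicit grid of lambda values yields one fitted curve per lambda:

library(glmnet)
set.seed(1)
x      <- sort(runif(80, -2, 2))
yTrue  <- sin(2 * x)                    # stand-in for the grey dotted curve
yNoisy <- yTrue + rnorm(80, sd = 0.3)
# 18-degree polynomial features; glmnet expects a plain numeric matrix
matPoly <- as.matrix(poly(x, 18))
# alpha = 0 selects the ridge penalty; each lambda gives one set of coefficients
fit <- glmnet(matPoly, yNoisy, alpha = 0, lambda = c(10, 1, 0.1, 0.01, 0))
# Each column of 'curves' is the fitted curve for one value of lambda
curves <- predict(fit, newx = matPoly)
matplot(x, curves, type = "l", lty = 1)
lines(x, yTrue, lty = 3, col = "grey")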

Overfitting

Overfitting is a particular risk for models that have a large number of features relative to the number of observations. One solution is to create a simpler model by using only a subset of the original features. However, with p features the number of candidate models is 2^p, so a brute-force approach to subset selection is only feasible when the number of features is small.
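As a quick illustration of that growth (an aside, not part of the original example):

# The number of candidate subsets doubles with every extra feature
2^10    #         1,024 candidate models for 10 features
2^30    # 1,073,741,824 candidate models for 30 features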

Model selection using regularisation

As an alternative to generating different subsets from the candidate features, regularisation keeps all the features but reduces the magnitude of the coefficients. This leads to functions which are smoother and simpler. Such functions are less prone to overfitting.

Ridge Regression

Ridge regression is a type of regularisation. It seeks to minimise the following quantity:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

The summation term on the left is the usual RSS error measure in a linear regression context, while the expression on the right is a penalty term that increases as the coefficients become larger. The value of lambda determines the importance of this penalty term. When lambda is zero, the result will be the same as conventional regression; when the value of lambda is large, the coefficients will approach zero.
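A quick way to see both extremes is a sketch with simulated data (not the post's own example): at s = 0 the glmnet coefficients are essentially those of lm, and at a very large lambda the fit collapses towards the intercept, i.e. the mean of y.

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- drop(X %*% c(3, -2, 0.5, 0, 1)) + rnorm(100)
fit <- glmnet(X, y, alpha = 0, lambda = c(1e4, 1, 0))
# lambda = 0: essentially ordinary least squares
cbind(ridge = as.vector(coef(fit, s = 0)), ols = coef(lm(y ~ X)))
# Very large lambda: coefficients shrunk towards zero, intercept close to mean(y)
coef(fit, s = 1e4)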

Scaling the data

The formula above shows that the ridge regression penalty groups all the coefficients together. This penalty component is shown below:

$$\lambda\sum_{j=1}^{p}\beta_j^2$$

The value produced by the penalty component is dependent on the scale of the underlying data. For example, if an income variable is measured in whole dollars (e.g. $47,000), its coefficient will be 1,000 times smaller than if the variable were coded in thousands of dollars (e.g. $47K). As a consequence, the penalty term would be smaller when the variable is coded in whole dollars, even though the underlying relationship is the same.

To resolve this problem, the data is scaled to a common variance before applying the ridge algorithm.  This ensures that the penalty term is applied equally to all variables irrespective of the units in which they are measured.
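For example, the base R scale() function standardises each column to zero mean and unit variance, so the two income codings above become identical after scaling; note also that the glmnet function standardises the predictors internally by default (its standardize argument is TRUE).

incomeDollars   <- c(47000, 52000, 61000, 38000)
incomeThousands <- incomeDollars / 1000
# After standardising, the two codings produce exactly the same column
scale(incomeDollars)
scale(incomeThousands)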

Selecting a value of lambda

A cross-validation approach is used to select the best value for lambda. This involves splitting the data into a training and a test set. A model is fitted to the training set with a specific value of lambda. Once values for the coefficients have been determined, the predictive accuracy of the model is assessed by applying it to the test set. This process is repeated for different values of lambda, and the value that gives the highest accuracy on the test set is selected.
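A minimal sketch of that search, using a single train/test split and a grid of lambda values (illustrative data and names; the implementation below uses cv.glmnet instead):

library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- drop(X %*% rnorm(10)) + rnorm(200)
trainIdx <- sample(200, 100)
# Fit ridge models on the training half over a grid of lambda values
lambdaGrid <- 10^seq(3, -3, length.out = 50)
fit <- glmnet(X[trainIdx, ], y[trainIdx], alpha = 0, lambda = lambdaGrid)
# Test-set MSE for every lambda; keep the lambda with the lowest error
preds   <- predict(fit, newx = X[-trainIdx, ], s = lambdaGrid)
testMSE <- colMeans((preds - y[-trainIdx])^2)
bestLambda <- lambdaGrid[which.min(testMSE)]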

Test training graph for ridge regression

The graph above plots training and test error measured by Mean Squared Error as the value of lambda is varied. Note that the x-axis is sorted in descending order: higher values of lambda produce less flexible functions and higher training errors. Lower values of lambda produce the most flexible functions; these fit the training data more closely and therefore produce lower training errors, but they may not generalise to test data. The optimal value of lambda is the one that produces the lowest test error.

Implementation in R

To demonstrate ridge regression, the glmnet package was used. Its cv.glmnet function implements k-fold cross-validation with k equal to 10 by default. The data is simulated and was partitioned into a training and a test set. The code below shows the model being fitted to the training set using the glmnet function and predictions being made using the predict function. The glmnet package does not use R formulas, so a feature matrix is created using the model.matrix function.

# Load the glmnet package (provides glmnet and cv.glmnet)
library(glmnet)
# glmnet does not use the R formula interface, so a feature matrix is needed.
# The intercept column is not required, so drop it with [, -1]
# nrow(dfSample) x polyDegree matrix
matX <- model.matrix(yNoisy ~ poly(x, polyDegree), dfSample)[, -1]
# nrow(dfTruth) x polyDegree matrix
matxAll <- model.matrix(yNoisy ~ poly(x, polyDegree), dfTruth)[, -1]
# Train the model. alpha = 0 ==> ridge; alpha = 1 ==> lasso
ridgeModel <- glmnet(matX[vctTrain, ], dfSample[vctTrain, "yNoisy"], alpha = 0)
# The next two lines are not strictly needed; they were used during development
# to find the best value of lambda. cv.glmnet uses 10-fold cross-validation by default.
cv.out <- cv.glmnet(matX[vctTrain, ], dfSample[vctTrain, "yNoisy"], alpha = 0)
bestLambda <- cv.out$lambda.min
# plot(cv.out) displays the cross-validation error against log(lambda)
# Predict values. The parameter "s" is the value of lambda; s = 0 corresponds to OLS
ridgePredictTest <- predict(ridgeModel, s = 0, newx = matX[vctTest, ])
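To score the model on the held-out cases with the cross-validated lambda rather than s = 0, something along these lines could follow (a sketch using the same objects as above):

# Predict with the cross-validated lambda and measure the test-set error
ridgePredictBest <- predict(ridgeModel, s = bestLambda, newx = matX[vctTest, ])
testMSE <- mean((ridgePredictBest - dfSample[vctTest, "yNoisy"])^2)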

When to use Ridge Regression

Ridge regression may produce the best results when there is a large number of features. In cases where the number of features is greater than the number of observations, the matrix used in the normal equations is not invertible. Adding the ridge penalty makes this matrix invertible.
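In matrix notation (a standard result, stated here for reference rather than taken from the post), the ridge coefficients solve a modified set of normal equations:

$$\hat{\beta}_{\text{ridge}} = (X^{T}X + \lambda I)^{-1}X^{T}y$$

For any lambda greater than zero, the matrix $X^{T}X + \lambda I$ is positive definite and therefore invertible, even when the number of features exceeds the number of observations.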

Ridge Regression versus Lasso Regression

Ridge regression with large lambda values gradually shrinks the coefficients towards zero; they approach zero asymptotically but are never set exactly to zero. In contrast, lasso regression produces a hard cutoff in which coefficients can be reduced to exactly zero. Lasso regression can therefore produce simpler models with fewer non-zero coefficients. To implement lasso regression using glmnet, set the parameter ‘alpha’ to 1 when calling the glmnet function.
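As a sketch (with simulated data, not the post's example), a lasso fit follows the same pattern as the ridge code above with alpha set to 1, and its coefficient vector typically contains exact zeros:

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 20), ncol = 20)
y <- drop(X %*% c(rep(2, 3), rep(0, 17))) + rnorm(100)
# alpha = 1 selects the lasso penalty
lassoCV <- cv.glmnet(X, y, alpha = 1)
coef(lassoCV, s = "lambda.min")   # many coefficients are exactly zero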

Get the code from Github

You can get all the code for this example from here.