Ridge regression is a method for producing simpler but more accurate regression models. It is a form of ‘regularisation’.
When a function is fitted to training data, there is a risk that an overly flexible function will fit the training data very well but will not generalise to unseen test data. For example, it may be possible to fit the training data perfectly with an n-degree polynomial, but this is unlikely to perform well on test data. This phenomenon is called overfitting.
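The overfitting described above is easy to demonstrate. The following is an illustrative Python sketch (not from the original post; the data is made up): a degree-9 polynomial fitted to 10 noisy points reproduces the training data almost exactly, yet misses fresh points from the same process.

```python
# Illustrative sketch: a polynomial with as many coefficients as data points
# interpolates the training set, but the wiggly fit generalises poorly.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 10)

# Degree-9 polynomial: 10 coefficients for 10 points => (near) exact fit.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# Fresh test points drawn from the same underlying curve.
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.1, 10)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_err, test_err)  # training error near zero; test error much larger
```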
This is a particular risk for models that have a large number of features relative to the number of observations. One solution is to create a simpler model by using only a subset of the original features. However, with P features the number of candidate models is 2^P, so a brute-force approach to subset selection is only feasible when the number of features is small.
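The subset count can be checked directly. This small Python snippet (illustrative only) enumerates every subset of a 10-feature set and confirms the 2^P figure:

```python
# Illustrative check: with P features there are 2**P candidate subsets
# (including the empty model), so exhaustive subset selection blows up fast.
from itertools import combinations

P = 10
n_subsets = sum(1 for k in range(P + 1) for _ in combinations(range(P), k))
print(n_subsets, 2 ** P)  # → 1024 1024

print(2 ** 20)  # → 1048576  (already over a million models at P = 20)
```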
Model selection using regularisation
As an alternative to generating different subsets from the candidate features, regularisation keeps all the features but reduces the magnitude of the coefficients. This leads to functions which are smoother and simpler. Such functions are less prone to overfitting.
Ridge regression is a type of regularisation. It seeks to minimise the following expression:
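The expression referenced here appears to have been an image in the original post. In standard notation, the ridge objective is:

```latex
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}
\;+\; \lambda \sum_{j=1}^{p}\beta_j^{2}
```

The first term is the residual sum of squares (RSS) and the second is the ridge penalty, matching the description in the next paragraph.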
The summation term on the left is the usual RSS error measure from linear regression, while the expression on the right is a penalty term that increases as the coefficients grow larger. The value of lambda determines the importance of this penalty term: when lambda is zero, the result is the same as conventional least-squares regression; as lambda becomes large, the coefficients approach zero.
Scaling the data
The formula above shows that the ridge regression penalty has grouped all the coefficients together. This penalty component is shown below:
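The penalty component referenced here (likely lost with the original image) is, in standard notation:

```latex
\lambda \sum_{j=1}^{p} \beta_j^{2}
```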
The value produced by the penalty component is dependent on the scale of the underlying data. For example, if an income variable is measured in whole dollars (i.e. $47,000), its coefficient will be 1000 times smaller than if the variable were coded in thousands of dollars (i.e. $47K). As a consequence, the penalty term would be lower when the variable is coded in whole dollars, even though the underlying model is the same.
To resolve this problem, the data is scaled to a common variance before applying the ridge algorithm. This ensures that the penalty term is applied equally to all variables irrespective of the units in which they are measured.
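The effect of scaling can be sketched with a one-feature ridge closed form. The following Python snippet is illustrative only (the data and variable names are made up, not from the original post): the same penalty produces coefficients roughly 1000x apart for dollars versus thousands of dollars, but once each version is scaled to unit variance the coefficients agree.

```python
# Illustrative sketch: the ridge penalty treats a coefficient on "income in
# dollars" very differently from one on "income in thousands", unless the
# predictor is first scaled to a common variance.
import numpy as np

rng = np.random.default_rng(1)
income_k = rng.uniform(20, 100, 200)        # income in thousands of dollars
income_d = income_k * 1000                  # the same variable in whole dollars
y = 0.5 * income_k + rng.normal(0, 1, 200)  # response

def ridge_coef(x, y, lam):
    # One-feature ridge closed form: beta = sum(x*y) / (sum(x^2) + lambda)
    return (x @ y) / (x @ x + lam)

b_k = ridge_coef(income_k, y, lam=10.0)
b_d = ridge_coef(income_d, y, lam=10.0)
print(b_k / b_d)  # ≈ 1000: the dollar-scale coefficient is ~1000x smaller

# After scaling each version to unit variance, the coefficients agree.
z_k = income_k / income_k.std()
z_d = income_d / income_d.std()
print(ridge_coef(z_k, y, lam=10.0), ridge_coef(z_d, y, lam=10.0))
```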
Selecting a value of lambda
A cross-validation approach is used to select the best value of lambda. This involves splitting the data into a training set and a test set. A model is fitted to the training set with a specific value of lambda. Once values for the coefficients have been determined, the predictive accuracy of the model is measured by applying it to the test set. This process is repeated for different values of lambda, and the value that gives the highest accuracy on the test set is selected.
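The selection loop above can be sketched in a few lines. This Python version uses simulated data and a single train/test split (a simplification of full k-fold cross-validation; all names are hypothetical):

```python
# Illustrative sketch: fit ridge on a training split for several lambdas and
# keep the lambda with the lowest error on the held-out split.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(0, 1.0, n)

X_tr, X_te = X[:70], X[70:]
y_tr, y_te = y[:70], y[70:]

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X'X + lambda*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

best_lam, best_err = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    beta = ridge_fit(X_tr, y_tr, lam)
    err = np.mean((X_te @ beta - y_te) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam, best_err)
```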
Implementation in R
To demonstrate ridge regression in R, the glmnet package was used. Its cv.glmnet function implements k-fold cross-validation with k equal to 10 by default. The data was simulated and partitioned into a training set and a test set. The code below shows the model being fitted to the training set using the glmnet function and predictions being made with the predict function. Because glmnet does not accept R formulas, a feature matrix is first created using the model.matrix function.
library(glmnet)

# glmnet does not use the R formula interface, so a feature matrix is needed.
# We don't need the intercept column, so drop it, i.e. [,-1]
# nrow(dfSample) x polyDegree matrix
matX <- model.matrix(yNoisy ~ poly(x, polyDegree), dfSample)[,-1]
# nrow(dfTruth) x polyDegree matrix
matxAll <- model.matrix(yNoisy ~ poly(x, polyDegree), dfTruth)[,-1]

# Train our model. alpha = 0 ==> ridge; alpha = 1 ==> lasso
ridgeModel <- glmnet(matX[vctTrain,], dfSample[vctTrain, "yNoisy"], alpha = 0)

# These two lines are not strictly needed; they were used during development
# to find the best lambda value. cv.glmnet uses 10-fold cross-validation by default.
cv.out <- cv.glmnet(matX[vctTrain,], dfSample[vctTrain, "yNoisy"], alpha = 0)
bestLambda <- cv.out$lambda.min
# plot(cv.out) looks pretty.

# This is how we predict values. The parameter "s" is equal to lambda.
ridgePredictTest <- predict(ridgeModel, s = 0, newx = matX[vctTest,])
When to use Ridge Regression
Ridge regression may produce the best results when there are a large number of features. When the number of features is greater than the number of observations, the matrix used in the normal equations is not invertible. Ridge regression adds lambda to the diagonal of this matrix, which guarantees that it can be inverted.
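This invertibility point can be verified numerically. The following Python sketch (illustrative, with made-up data) shows that with more features than observations the normal-equations matrix is rank-deficient, while adding lambda on the diagonal makes the system solvable:

```python
# Illustrative sketch: with p > n, X'X is singular, but X'X + lambda*I
# (the ridge adjustment) is invertible.
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 20                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
r = np.linalg.matrix_rank(XtX)
print(r)                           # at most n (= 10) < p, so XtX is singular

lam = 1.0
beta = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)  # now solvable
print(beta.shape)                  # one coefficient per feature
```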
Ridge Regression versus Lasso Regression
Ridge regression with large lambda values gradually shrinks the coefficients towards zero: the coefficients asymptote to zero but never quite reach it. In contrast, Lasso regression produces a hard cutoff where some coefficients are reduced to exactly zero. Lasso regression can therefore produce simpler models with fewer coefficients. To implement Lasso regression using glmnet, set the value of the parameter ‘alpha’ to 1 when using the glmnet function.
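The contrast can be made concrete with the classical orthonormal-design closed forms (a standard textbook result, not from the original post): ridge shrinks a least-squares coefficient b to b/(1 + lambda), which is never exactly zero, while lasso soft-thresholds it to sign(b) * max(|b| - lambda, 0), which hits exactly zero once lambda exceeds |b|.

```python
# Illustrative sketch of ridge vs lasso shrinkage under an orthonormal design.
import numpy as np

def ridge_shrink(b, lam):
    # Ridge: proportional shrinkage, never exactly zero for non-zero b.
    return b / (1 + lam)

def lasso_shrink(b, lam):
    # Lasso: soft-thresholding, exactly zero when |b| <= lambda.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([3.0, 0.5, -1.5])
lam = 1.0
print(ridge_shrink(b, lam))  # → [ 1.5   0.25 -0.75]  small but non-zero
print(lasso_shrink(b, lam))  # → [ 2.   0.  -0.5]     the 0.5 coefficient is cut to exactly 0
```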
Get the code from Github
You can get all the code for this example from here.