A conceptual introduction to Regularized Regression
Ridge Regression
Lasso Regression
Elastic Net
Implementation with the glmnet
package
Building Regularized Regression models with the caret
package
Regularization is a general strategy for incorporating additional penalty terms into the model fitting process.
It is used in various other algorithms, not just regression.
The idea behind regularization is to constrain the size of the model coefficients in order to reduce their sampling variation and, hence, the variance of model predictions.
This reduction in variance comes at the expense of some bias in model predictions.
These constraints are typically incorporated into the loss function to be optimized.
There are two commonly used regularization strategies:
ridge penalty
lasso penalty
elastic net (a mix of ridge and lasso penalty)
The loss function for the unregularized linear regression: the sum of squared residuals across all observations.
$$\sum_{i=1}^{N} \epsilon_{(i)}^2$$
For ridge regression, we add a penalty term to this loss function, which is a function of all the regression coefficients in the model.
$$\text{Penalty} = \lambda \sum_{p=1}^{P} \beta_p^2,$$
Loss function to optimize for the ridge regression:
$$\text{Loss} = \sum_{i=1}^{N} \epsilon_{(i)}^2 + \lambda \sum_{p=1}^{P} \beta_p^2,$$
Let’s consider the same example from the previous class. Suppose that we would like to predict the target readability score for a given text from the Feature 220.
$$Y = \beta_0 + \beta_1 X + \epsilon$$
Assume the set of coefficients is $\{\beta_0, \beta_1\} = \{-1.5, 2\}$, so my model is $Y = -1.5 + 2X + \epsilon$.
Then, the value of the loss function when λ=0.2 would be equal to 19.02.
d <- readability_sub[,c('V220','target')]

b0 = -1.5
b1 = 2

d$predicted <- b0 + b1*d$V220
d$error     <- d$target - d$predicted

d
          V220      target  predicted       error
1  -0.13908258 -2.06282395 -1.7781652 -0.28465879
2   0.21764143  0.58258607 -1.0647171  1.64730321
3   0.05812133 -1.65313060 -1.3837573 -0.26937327
4   0.02526429 -0.87390681 -1.4494714  0.57556460
5   0.22430885 -1.74049148 -1.0513823 -0.68910918
6  -0.07795373 -3.63993555 -1.6559075 -1.98402809
7   0.43400714 -0.62284268 -0.6319857  0.00914304
8  -0.24364550 -0.34426981 -1.9872910  1.64302120
9   0.15893717 -1.12298826 -1.1821257  0.05913740
10  0.14496475 -0.99857142 -1.2100705  0.21149908
11  0.34222975 -0.87656742 -0.8155405 -0.06102693
12  0.25219145 -0.03304643 -0.9956171  0.96257066
13  0.03532625 -0.49529863 -1.4293475  0.93404886
14  0.36410633  0.12453660 -0.7717873  0.89632394
15  0.29988593  0.09678258 -0.9002281  0.99701073
16  0.19837037  0.38422270 -1.1032593  1.48748196
17  0.07807041 -0.58143038 -1.3438592  0.76242880
18  0.07935690 -0.34324576 -1.3412862  0.99804044
19  0.57000953 -0.39054205 -0.3599809 -0.03056111
20  0.34523284 -0.67548411 -0.8095343  0.13405021
lambda = 0.2

SSR     <- sum((d$error)^2)
penalty <- lambda*(b0^2 + b1^2)
loss    <- SSR + penalty

SSR
[1] 17.76657
penalty
[1] 1.25
loss
[1] 19.01657
Note that when λ is equal to zero, the loss function is identical to SSR; therefore, it becomes a linear regression with no regularization.
As the value of λ increases, the amount of penalty increases proportionally.
λ can take any non-negative value, from 0 to ∞.
A visual representation of the effect of penalty terms on the loss function and regression coefficients
The matrix solution we learned before for regression without regularization can also be applied to estimate the coefficients from ridge regression given a specific λ value.
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{Y}$$
X is an N x (P+1) design matrix for the set of predictor variables, including a column of ones for the intercept term,
β is a (P+1) x 1 column vector of regression coefficients,
I is a (P+1) x (P+1) identity matrix,
and λ is a non-negative real-valued number.
Suppose we want to predict the readability score using the two predictors, Feature 220 ($X_1$) and Feature 166 ($X_2$). Our model will be
$$Y_{(i)} = \beta_0 + \beta_1 X_{1(i)} + \beta_2 X_{2(i)} + \epsilon_{(i)}.$$
If we estimate the ridge regression coefficients by using λ=.5, the estimates would be
{ β0,β1,β2 } = {-.915, 1.169, -0.221}.
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220,readability_sub$V166))

lambda <- 0.5

beta <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta
           [,1]
[1,] -0.9151087
[2,]  1.1691920
[3,] -0.2206188
If we change the value of λ to 2, we will get different estimates for the regression coefficients.
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220,readability_sub$V166))

lambda <- 2

beta <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta
           [,1]
[1,] -0.7550986
[2,]  0.4685138
[3,] -0.1152953
Suppose you manipulate the value of λ from 0 to 100 with increments of .1 and calculate the regression coefficients at each value of λ.
Note that the regression coefficients will shrink toward zero but will never be exactly equal to zero in ridge regression.
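A minimal sketch of how such a λ sweep could be computed, assuming the readability_sub data and the closed-form solution shown above (the seq() grid and the plotting call are illustrative choices, not part of the original slides):

# Re-estimate the ridge coefficients for lambda = 0, 0.1, ..., 100
# using the closed-form solution, storing one row per lambda value.

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1, readability_sub$V220, readability_sub$V166))

lambdas <- seq(0, 100, by = 0.1)

coefs <- t(sapply(lambdas, function(lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% Y
}))

colnames(coefs) <- c('b0', 'b1', 'b2')

# Coefficients shrink toward zero as lambda grows, but never reach it exactly
matplot(lambdas, coefs, type = 'l', lty = 1,
        xlab = expression(lambda), ylab = 'Coefficient estimate')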
A critical complication arises in ridge regression when you have more than one predictor.
Different scales of different variables will affect the magnitude of the unstandardized regression coefficients.
A regression coefficient for a predictor ranging from 0 to 100 will be on a very different scale than a coefficient for a predictor ranging from 0 to 1.
If we work with the unstandardized variables, the ridge penalty will be amplified for the coefficients of those variables with a more extensive range of values.
Therefore, it is critical that we standardize variables before we use ridge regression.
When we standardize the variables, the mean of all variables becomes zero.
Therefore, the intercept estimate for any regression model with standardized variables is guaranteed to be zero.
The design matrix (X) doesn’t have to have a column of ones anymore because it is unnecessary (it would be a column of zeros if we had one).
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

# Standardized Y

Y <- scale(Y)
Y
             [,1]
 [1,] -1.34512103
 [2,]  1.39315702
 [3,] -0.92104526
 [4,] -0.11446655
 [5,] -1.01147297
 [6,] -2.97759768
 [7,]  0.14541127
 [8,]  0.43376355
 [9,] -0.37229209
[10,] -0.24350754
[11,] -0.11722056
[12,]  0.75591253
[13,]  0.27743280
[14,]  0.91902756
[15,]  0.89029923
[16,]  1.18783003
[17,]  0.18827737
[18,]  0.43482354
[19,]  0.38586691
[20,]  0.09092186
attr(,"scaled:center")
[1] -0.7633224
attr(,"scaled:scale")
[1] 0.9660852
# Standardized X

X <- scale(X)
X
             [,1]        [,2]
 [1,] -1.54285661  1.44758852
 [2,]  0.24727019 -0.47711825
 [3,] -0.55323997 -0.97867834
 [4,] -0.71812450  1.41817232
 [5,]  0.28072891 -0.62244962
 [6,] -1.23609737  0.11236158
 [7,]  1.33304531  0.34607530
 [8,] -2.06757849 -1.22697345
 [9,] -0.04732188  0.05882229
[10,] -0.11743882 -1.24554362
[11,]  0.87248435  1.93772977
[12,]  0.42065052  0.13025831
[13,] -0.66763117 -0.40479141
[14,]  0.98226625  1.39073712
[15,]  0.65999286  0.25182543
[16,]  0.15056337 -0.28808443
[17,] -0.45313072  0.02862469
[18,] -0.44667480  0.25187100
[19,]  2.01553800 -2.00805114
[20,]  0.88755458 -0.12237606
attr(,"scaled:center")
[1] 0.1683671 0.1005784
attr(,"scaled:scale")
[1] 0.19927304 0.06196686
The regression model's coefficients with standardized variables when there is no ridge penalty.
lambda <- 0

beta.s <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta.s
            [,1]
[1,]  0.42003881
[2,] -0.06335594
The regression coefficients when the ridge penalty is increased to 0.5.
lambda <- 0.5

beta.s <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta.s
            [,1]
[1,]  0.40931629
[2,] -0.06215875
Change in standardized regression coefficients when we manipulate the value of λ from 0 to 100 with increments of .1.
The glmnet package
Similar to the lm() function, we can use the glmnet() function from the glmnet package to run a regression model with a ridge penalty.
There are many arguments for the glmnet() function. For now, the arguments we need to know are:
x: an N x P input matrix, where N is the number of observations and P is the number of predictors
y: an N x 1 input matrix for the outcome variable
alpha: the mixing constant for the lasso and ridge penalties. When it is 0, ridge regression is conducted.
lambda: the penalty term
intercept: set to FALSE to drop the intercept when working with standardized variables
Regression with no penalty (traditional OLS regression): alpha = 0 and lambda = 0.
require(glmnet)

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

Y <- scale(Y)
X <- scale(X)

mod <- glmnet(x = X, y = Y, family = 'gaussian',
              alpha = 0, lambda = 0, intercept = FALSE)

coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  .
V1           0.42003881
V2          -0.06335594
Increase the penalty term to 0.5: alpha = 0 and lambda = 0.5.
require(glmnet)

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

Y <- scale(Y)
X <- scale(X)

mod <- glmnet(x = X, y = Y, family = 'gaussian',
              alpha = 0, lambda = 0.5, intercept = FALSE)

coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  .
V1           0.27809880
V2          -0.04571182
In Slide 14, when we estimated the ridge regression coefficients with λ = 0.5 using the matrix solution, we got different numbers than what the glmnet() function reports.
In the glmnet() function, it appears that the penalty term for ridge regression is specified as
$$\lambda \times N \times \sum_{p=1}^{P} \beta_p^2.$$
For identical results, we should use λ = 0.5/20 in the glmnet() package.
mod <- glmnet(x = X, y = Y, family = 'gaussian',
              alpha = 0, lambda = 0.5/20, intercept = FALSE)

coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  .
V1           0.40958102
V2          -0.06218857
Note that these numbers are still slightly different.
The difference is due to the numerical approximation glmnet uses when optimizing the loss function. glmnet does not use the closed-form matrix solution for ridge regression. This is a good thing, because a closed-form solution does not exist for other types of regularization (e.g., lasso); the computational approximation in glmnet is therefore essential moving forward.
In the context of machine learning, the quantities in a model can be classified into two types: parameters and hyperparameters.
The parameters are typically estimated from data and not set by users. In the context of ridge regression, the regression coefficients, {$\beta_0,\beta_1,...,\beta_P$}, are parameters to be estimated from data.
The hyperparameters are not estimable from data because there are no first-order or second-order derivatives for these hyperparameters; therefore, they must be set by the user. In the context of ridge regression, the penalty term, {$\lambda$}, is a hyperparameter.
The process of deciding what value to use for a hyperparameter is called Tuning.
It is usually a trial-and-error process. You try many different hyperparameter values and check how well the model performs based on specific criteria (e.g., MAE, MSE, RMSE) using k-fold cross-validation.
Then, you pick the hyperparameter value that provides the best predictive performance under cross-validation.
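As a minimal sketch (not part of the original slides), this tuning idea can be expressed with cv.glmnet(), which runs k-fold cross-validation over a grid of λ values for us; the standardized Y and X matrices from the earlier slides are assumed.

require(glmnet)

# 10-fold cross-validation over a sequence of lambda values for ridge (alpha = 0)
cv <- cv.glmnet(x = X, y = Y, family = 'gaussian', alpha = 0,
                nfolds = 10, type.measure = 'mse', intercept = FALSE)

cv$lambda.min   # the lambda value with the lowest cross-validated MSE
plot(cv)        # cross-validated error across the lambda path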
Please review the following notebook for applying Ridge Regression to predict readability scores from 768 features using the whole dataset.
Predicting Readability Scores using the Ridge Regression
Lasso regression is similar to the Ridge regression but applies a different penalty term.
$$\text{Penalty} = \lambda \sum_{p=1}^{P} |\beta_p|,$$
λ is the penalty constant,
$|\beta_p|$ is the absolute value of the regression coefficient for the pth predictor.
The loss function for the lasso regression to optimize:
$$\text{Loss} = \sum_{i=1}^{N} \epsilon_{(i)}^2 + \lambda \sum_{p=1}^{P} |\beta_p|$$
Let's consider the same example where we fit a simple linear regression model: the readability score is the outcome ($Y$) and Feature 220 is the predictor ($X$).
$$Y = \beta_0 + \beta_1 X + \epsilon,$$
Assume the set of coefficients is $\{\beta_0, \beta_1\} = \{-1.5, 2\}$, so my model is $Y = -1.5 + 2X + \epsilon$. Then, the value of the loss function when λ = 0.2 would be equal to 18.467.
d <- readability_sub[,c('V220','target')]

b0 = -1.5
b1 = 2

d$predicted <- b0 + b1*d$V220
d$error     <- d$target - d$predicted

d
          V220      target  predicted       error
1  -0.13908258 -2.06282395 -1.7781652 -0.28465879
2   0.21764143  0.58258607 -1.0647171  1.64730321
3   0.05812133 -1.65313060 -1.3837573 -0.26937327
4   0.02526429 -0.87390681 -1.4494714  0.57556460
5   0.22430885 -1.74049148 -1.0513823 -0.68910918
6  -0.07795373 -3.63993555 -1.6559075 -1.98402809
7   0.43400714 -0.62284268 -0.6319857  0.00914304
8  -0.24364550 -0.34426981 -1.9872910  1.64302120
9   0.15893717 -1.12298826 -1.1821257  0.05913740
10  0.14496475 -0.99857142 -1.2100705  0.21149908
11  0.34222975 -0.87656742 -0.8155405 -0.06102693
12  0.25219145 -0.03304643 -0.9956171  0.96257066
13  0.03532625 -0.49529863 -1.4293475  0.93404886
14  0.36410633  0.12453660 -0.7717873  0.89632394
15  0.29988593  0.09678258 -0.9002281  0.99701073
16  0.19837037  0.38422270 -1.1032593  1.48748196
17  0.07807041 -0.58143038 -1.3438592  0.76242880
18  0.07935690 -0.34324576 -1.3412862  0.99804044
19  0.57000953 -0.39054205 -0.3599809 -0.03056111
20  0.34523284 -0.67548411 -0.8095343  0.13405021
lambda = 0.2

SSR     <- sum((d$error)^2)
penalty <- lambda*(abs(b0) + abs(b1))
loss    <- SSR + penalty

SSR
[1] 17.76657
penalty
[1] 0.7
loss
[1] 18.46657
Below is a demonstration of what happens to the loss function and the regression coefficients for increasing levels of loss penalty (λ) for Lasso regression.
Note that the regression coefficients become equal to 0 at some point (this was not the case for ridge regression).
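A minimal sketch of how this behavior could be visualized, assuming the standardized Y and X matrices from the ridge examples (this relies on glmnet choosing its own λ sequence and is not part of the original slides):

require(glmnet)

# Fit the lasso (alpha = 1) across glmnet's default sequence of lambda values
lasso.path <- glmnet(x = X, y = Y, family = 'gaussian',
                     alpha = 1, intercept = FALSE)

# Each line is one coefficient's path; the lines hit exactly zero as lambda increases
plot(lasso.path, xvar = 'lambda')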
glmnet()
There is no closed-form solution for lasso regression due to the absolute value terms in the loss function.
The only way to estimate the coefficients of the lasso regression is to use numerical techniques and obtain computational approximations.
Similar to ridge regression, glmnet is an engine we can use to estimate the coefficients of the lasso regression.
The arguments are identical. The only difference is that we fix the alpha argument to 1 for lasso regression.
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

Y <- scale(Y)
X <- scale(X)

mod <- glmnet(x = X, y = Y, family = 'gaussian',
              alpha = 1, lambda = 0.2, intercept = FALSE)

coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
                   s0
(Intercept) .
V1          0.2174345
V2          .
The . symbol for the coefficient of the second predictor indicates that it is equal to zero.
While the regression coefficients in ridge regression shrink toward zero, they never end up being exactly equal to zero.
In contrast, lasso regression may yield a value of zero for some coefficients in the model.
For this reason, lasso regression may be used as a variable selection algorithm. The variables with coefficients equal to zero may be discarded from future considerations as they are not crucial for predicting the outcome.
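As a minimal sketch of this idea (not in the original slides), the nonzero coefficients can be pulled out of the two-predictor lasso fit above; with λ = 0.2, only V1 (Feature 220) would be retained.

# Convert the sparse coefficient vector to a regular matrix and keep
# the predictors whose coefficients are not exactly zero.
b <- as.matrix(coef(mod))

selected <- rownames(b)[b[, 1] != 0 & rownames(b) != '(Intercept)']
selected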
We implement a similar strategy for finding the optimal value of λ as we did for the Ridge Regression.
Please review the following notebook to apply Lasso Regression to predict readability scores from all 768 features using the whole dataset.
Predicting Readability Scores using the Lasso Regression
Elastic net combines the two types of penalty into one by mixing them with some weighted average. The penalty term for the elastic net could be written as
$$\lambda \left[ (1-\alpha)\sum_{p=1}^{P} \beta_p^2 + \alpha \sum_{p=1}^{P} |\beta_p| \right].$$
The loss function for the elastic net to optimize:
$$\text{Loss} = \sum_{i=1}^{N} \epsilon_{(i)}^2 + \lambda \left[ (1-\alpha)\sum_{p=1}^{P} \beta_p^2 + \alpha \sum_{p=1}^{P} |\beta_p| \right]$$
When α is set to 0, this is equivalent to ridge regression.
When α equals 1, this is equivalent to lasso regression.
When α takes any value between 0 and 1, this term becomes a weighted average of the ridge penalty and lasso penalty.
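A quick numeric check of this mixing behavior (not from the original slides), reusing the coefficients $\{\beta_0, \beta_1\} = \{-1.5, 2\}$ and λ = 0.2 from the earlier examples:

betas  <- c(-1.5, 2)
lambda <- 0.2

# Elastic net penalty as a weighted average of the ridge and lasso penalties
enet_penalty <- function(alpha) {
  lambda * ((1 - alpha) * sum(betas^2) + alpha * sum(abs(betas)))
}

enet_penalty(0)    # 1.25 -> identical to the ridge penalty computed earlier
enet_penalty(1)    # 0.7  -> identical to the lasso penalty computed earlier
enet_penalty(0.5)  # an equal mix of the two penalties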
In Elastic Net, two hyperparameters will be tuned: α and λ.
We can consider many possible combinations of these two hyperparameters and try to find the optimal combination using 10-fold cross-validation.
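A minimal sketch of such a grid search using the caret package (not part of the original slides; the grid values below are illustrative, and the standardized X and Y from the earlier examples are assumed):

require(caret)

# Give the predictor columns names so caret can refer to them
colnames(X) <- c('V220', 'V166')

# Candidate values for the two hyperparameters
grid <- expand.grid(alpha  = seq(0, 1, 0.1),
                    lambda = seq(0.001, 0.2, length = 20))

# 10-fold cross-validation
cv10 <- trainControl(method = 'cv', number = 10)

enet <- train(x = X, y = as.numeric(Y), method = 'glmnet',
              trControl = cv10, tuneGrid = grid)

enet$bestTune   # the (alpha, lambda) combination with the best CV performance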
Please review the following notebook for applying Elastic Net to predict readability scores from all 768 features using the whole dataset.
Predicting Readability Scores using the Elastic Net
Please review the following notebook for predicting the readability of a given text with the existing model objects.
Predicting Readability Scores for a new text