
Regularized (Penalized) Linear Regression

Cengiz Zopluoglu

College of Education, University of Oregon

1 / 31

Today's Goals:

  • A conceptual introduction to Regularized Regression

    • Ridge Regression

    • Lasso Regression

    • Elastic Net

  • Implementation with the glmnet package

  • Building Regularized Regression models with the caret package

2 / 31

What is regularization?

  • Regularization is a general strategy to incorporate additional penalty terms into the model fitting process

  • It is implemented in various other algorithms, not just regression.

  • The idea behind regularization is to constrain the size of model coefficients to reduce their sampling variation and, hence, the variance of model predictions.

  • This reduction in variance comes at the expense of some bias in model predictions.

  • These constraints are typically incorporated into the loss function to be optimized.

    • bias-variance tradeoff

  • There are two commonly used regularization penalties, plus a strategy that blends them:

    • ridge penalty

    • lasso penalty

    • elastic net (a mix of the ridge and lasso penalties)

3 / 31










Ridge Regression

4 / 31

Ridge Penalty

  • The loss function for the unregularized linear regression: the sum of squared residuals across all observations.

    $$\sum_{i=1}^{N}\epsilon_{(i)}^2$$

  • For ridge regression, we add a penalty term to this loss function, which is a function of all the regression coefficients in the model.

    $$Penalty = \lambda \sum_{p=1}^{P}\beta_p^2,$$

    • λ is a parameter that penalizes the regression coefficients when they get larger.
  • Loss function to optimize for the ridge regression:

$$Loss = \sum_{i=1}^{N}\epsilon_{(i)}^2 + \lambda \sum_{p=1}^{P}\beta_p^2$$

5 / 31

Let's consider the same example from the previous class. Suppose that we would like to predict the target readability score for a given text from Feature 220.

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Assume the set of coefficients is { β0, β1 } = {-1.5, 2}, so the model is $Y = -1.5 + 2X + \epsilon$.

Then, the value of the loss function when λ=0.2 would be equal to 19.02.

d <- readability_sub[,c('V220','target')]
b0 = -1.5
b1 = 2
d$predicted <- b0 + b1*d$V220
d$error <- d$target - d$predicted
d
V220 target predicted error
1 -0.13908258 -2.06282395 -1.7781652 -0.28465879
2 0.21764143 0.58258607 -1.0647171 1.64730321
3 0.05812133 -1.65313060 -1.3837573 -0.26937327
4 0.02526429 -0.87390681 -1.4494714 0.57556460
5 0.22430885 -1.74049148 -1.0513823 -0.68910918
6 -0.07795373 -3.63993555 -1.6559075 -1.98402809
7 0.43400714 -0.62284268 -0.6319857 0.00914304
8 -0.24364550 -0.34426981 -1.9872910 1.64302120
9 0.15893717 -1.12298826 -1.1821257 0.05913740
10 0.14496475 -0.99857142 -1.2100705 0.21149908
11 0.34222975 -0.87656742 -0.8155405 -0.06102693
12 0.25219145 -0.03304643 -0.9956171 0.96257066
13 0.03532625 -0.49529863 -1.4293475 0.93404886
14 0.36410633 0.12453660 -0.7717873 0.89632394
15 0.29988593 0.09678258 -0.9002281 0.99701073
16 0.19837037 0.38422270 -1.1032593 1.48748196
17 0.07807041 -0.58143038 -1.3438592 0.76242880
18 0.07935690 -0.34324576 -1.3412862 0.99804044
19 0.57000953 -0.39054205 -0.3599809 -0.03056111
20 0.34523284 -0.67548411 -0.8095343 0.13405021
lambda = 0.2
SSR <- sum((d$error)^2)
penalty <- lambda*(b0^2 + b1^2)
loss <- SSR + penalty
SSR
[1] 17.76657
penalty
[1] 1.25
loss
[1] 19.01657
6 / 31
  • Note that when λ is equal to zero, the loss function is identical to SSR; therefore, it becomes a linear regression with no regularization.

  • As the value of λ increases, the degree of penalty linearly increases.

  • The λ can technically take any positive value between 0 and ∞.

  • A visual representation of the effect of penalty terms on the loss function and regression coefficients
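
A minimal sketch of how such a demonstration could be produced for the toy model above (this is an illustration, not the original figure; it assumes the data frame d created on the previous slide is still in the workspace):

# Ridge loss as a function of the slope (b1), holding b0 fixed at -1.5,
# for a few lambda values. As lambda grows, the minimizing b1 moves toward zero.

b1.grid <- seq(-2, 6, 0.01)
lambdas <- c(0, 1, 5, 10)

loss <- sapply(lambdas, function(lambda){
  sapply(b1.grid, function(b1){
    sum((d$target - (-1.5 + b1*d$V220))^2) + lambda*((-1.5)^2 + b1^2)
  })
})

# One curve per lambda value
matplot(b1.grid, loss, type = 'l', lty = 1:4, col = 1,
        xlab = expression(beta[1]), ylab = 'Loss')

b1.grid[apply(loss, 2, which.min)]   # b1 value minimizing the loss for each lambda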

7 / 31

Model Estimation

The matrix solution we learned before for regression without regularization can also be applied to estimate the coefficients from ridge regression given a specific λ value.

$$\hat{\beta} = (X^TX + \lambda I)^{-1}X^TY$$

  • Y is an N x 1 column vector of observed values for the outcome variable,

  • X is an N x (P+1) design matrix for the set of predictor variables, including an intercept term,

  • β is a (P+1) x 1 column vector of regression coefficients,

  • I is a (P+1) x (P+1) identity matrix,

  • and λ is a positive real-valued number.
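
Wrapped as a small helper, this closed-form solution might look like the sketch below (ridge_beta is a hypothetical name, not part of any package; X is assumed to already contain a leading column of ones for the intercept):

# Closed-form ridge estimates for a given lambda
ridge_beta <- function(X, Y, lambda){
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% Y
}

# e.g., ridge_beta(X, Y, lambda = 0.5)

The next slides carry out exactly this computation directly with solve().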

8 / 31

Suppose we want to predict the readability score using the two predictors, Feature 220 ($X_1$) and Feature 166 ($X_2$). Our model will be

$$Y_{(i)} = \beta_0 + \beta_1 X_{1(i)} + \beta_2 X_{2(i)} + \epsilon_{(i)}.$$

If we estimate the ridge regression coefficients by using λ=.5, the estimates would be

{ β0,β1,β2 } = {-.915, 1.169, -0.221}.

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220,readability_sub$V166))
lambda <- 0.5
beta <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta
[,1]
[1,] -0.9151087
[2,] 1.1691920
[3,] -0.2206188
9 / 31

If we change the value of λ to 2, we will get different estimates for the regression coefficients.

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220,readability_sub$V166))
lambda <- 2
beta <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta
[,1]
[1,] -0.7550986
[2,] 0.4685138
[3,] -0.1152953
10 / 31

Suppose you manipulate the value of λ from 0 to 100 with increments of .1 and calculate the regression coefficients for different levels of λ values.

Note that the regression coefficients will shrink toward zero but will never be exactly equal to zero in ridge regression.
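
A minimal sketch of that manipulation, assuming X (including the leading column of ones) and Y are defined as on the previous slides:

lambda.grid <- seq(0, 100, 0.1)

# Closed-form ridge estimates for each lambda value;
# each column of 'coefs' holds {b0, b1, b2} for one lambda
coefs <- sapply(lambda.grid, function(lambda){
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% Y
})

# Plotting the coefficients against lambda shows the shrinkage toward zero
matplot(lambda.grid, t(coefs), type = 'l', lty = 1:3, col = 1,
        xlab = expression(lambda), ylab = 'Coefficient')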

11 / 31

Standardized variables

A critical complication arises in ridge regression when you have more than one predictor.

  • Different scales of different variables will affect the magnitude of the unstandardized regression coefficients.

  • A regression coefficient of a predictor ranging from 0 to 100 will be very different from a regression coefficient of a predictor from 0 to 1.

  • If we work with the unstandardized variables, the ridge penalty will be amplified for the coefficients of those variables with a more extensive range of values.

Therefore, it is critical that we standardize variables before we use ridge regression.

12 / 31
  • When we standardize the variables, the mean of all variables becomes zero.

  • Therefore, the intercept estimate for any regression model with standardized variables is guaranteed to be zero.

  • The design matrix (X) no longer needs a column of ones for the intercept term, since the intercept is zero anyway.

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))
# Standardized Y
Y <- scale(Y)
Y
[,1]
[1,] -1.34512103
[2,] 1.39315702
[3,] -0.92104526
[4,] -0.11446655
[5,] -1.01147297
[6,] -2.97759768
[7,] 0.14541127
[8,] 0.43376355
[9,] -0.37229209
[10,] -0.24350754
[11,] -0.11722056
[12,] 0.75591253
[13,] 0.27743280
[14,] 0.91902756
[15,] 0.89029923
[16,] 1.18783003
[17,] 0.18827737
[18,] 0.43482354
[19,] 0.38586691
[20,] 0.09092186
attr(,"scaled:center")
[1] -0.7633224
attr(,"scaled:scale")
[1] 0.9660852
# Standardized X
X <- scale(X)
X
[,1] [,2]
[1,] -1.54285661 1.44758852
[2,] 0.24727019 -0.47711825
[3,] -0.55323997 -0.97867834
[4,] -0.71812450 1.41817232
[5,] 0.28072891 -0.62244962
[6,] -1.23609737 0.11236158
[7,] 1.33304531 0.34607530
[8,] -2.06757849 -1.22697345
[9,] -0.04732188 0.05882229
[10,] -0.11743882 -1.24554362
[11,] 0.87248435 1.93772977
[12,] 0.42065052 0.13025831
[13,] -0.66763117 -0.40479141
[14,] 0.98226625 1.39073712
[15,] 0.65999286 0.25182543
[16,] 0.15056337 -0.28808443
[17,] -0.45313072 0.02862469
[18,] -0.44667480 0.25187100
[19,] 2.01553800 -2.00805114
[20,] 0.88755458 -0.12237606
attr(,"scaled:center")
[1] 0.1683671 0.1005784
attr(,"scaled:scale")
[1] 0.19927304 0.06196686
13 / 31

The regression model's coefficients with standardized variables when there is no ridge penalty.

lambda <- 0
beta.s <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta.s
[,1]
[1,] 0.42003881
[2,] -0.06335594

The regression coefficients when the ridge penalty is increased to 0.5.

lambda <- 0.5
beta.s <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta.s
[,1]
[1,] 0.40931629
[2,] -0.06215875
14 / 31

Change in standardized regression coefficients when we manipulate the value of λ from 0 to 100 with increments of .1.

15 / 31

Ridge regression with the glmnet package

Similar to the lm() function, we can use the glmnet() function from the glmnet package to run a regression model with ridge penalty.

There are many arguments for the glmnet() function. For now, the arguments we need to know are

  • x: an N x P input matrix, where N is the number of observations and P is the number of predictors

  • y: an N x 1 input matrix for the outcome variable

  • alpha: a mixing constant for the lasso and ridge penalties. When it is 0, ridge regression is conducted.

  • lambda: penalty term

  • intercept: set to FALSE to drop the intercept when working with standardized variables

16 / 31

Regression with no penalty (traditional OLS regression)

  • alpha = 0

  • lambda = 0.

require(glmnet)
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))
Y <- scale(Y)
X <- scale(X)
mod <- glmnet(x         = X,
              y         = Y,
              family    = 'gaussian',
              alpha     = 0,
              lambda    = 0,
              intercept = FALSE)
coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
V1 0.42003881
V2 -0.06335594
17 / 31

Increase the penalty term to 0.5.

  • alpha = 0

  • lambda = 0.5.

require(glmnet)
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))
Y <- scale(Y)
X <- scale(X)
mod <- glmnet(x         = X,
              y         = Y,
              family    = 'gaussian',
              alpha     = 0,
              lambda    = 0.5,
              intercept = FALSE)
coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
V1 0.27809880
V2 -0.04571182
18 / 31

In Slide 14, when we estimated the ridge regression coefficients with λ=0.5 using the matrix solution, we got different numbers than what the glmnet() function reports.

In the glmnet() function, it appears that the penalty term for the ridge regression is specified as

$$\lambda N \sum_{p=1}^{P}\beta_p^2.$$

For identical results, we should use λ=0.5/20 in the glmnet() package.

mod <- glmnet(x         = X,
              y         = Y,
              family    = 'gaussian',
              alpha     = 0,
              lambda    = 0.5/20,
              intercept = FALSE)
coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
V1 0.40958102
V2 -0.06218857

Note that these numbers are still slightly different.

The difference is due to the numerical approximation glmnet uses when optimizing the loss function. glmnet doesn't rely on the closed-form matrix solution for ridge regression. This is a good thing, because there is not always a closed-form solution for other types of regularization (e.g., lasso). The computational approximation in glmnet is therefore essential moving forward.

19 / 31

Tuning the Hyperparameter λ

In the context of machine learning, the quantities that define a model can be classified into two types: parameters and hyperparameters.

  • The parameters are typically estimated from data and not set by users. In the context of ridge regression, regression coefficients, {$\beta_0,\beta_1,...,\beta_P$}, are parameters to be estimated from data.

  • The hyperparameters cannot be estimated from data in the same way because the loss function provides no first-order or second-order derivative information for them. Therefore, they must be set by the user. In the context of ridge regression, the penalty term, {$\lambda$}, is a hyperparameter.

The process of deciding what value to use for a hyperparameter is called Tuning.

It is usually a trial-and-error process. You try many different hyperparameter values and check how well the model performs based on specific criteria (e.g., MAE, MSE, RMSE) using k-fold cross-validation.

Then, you pick the value of a hyperparameter that provides the best predictive performance on cross-validation.
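
As an illustration, the cv.glmnet() function from the glmnet package automates this k-fold search over a grid of λ values (a minimal sketch, assuming the standardized X and Y from the earlier slides are available):

require(glmnet)

# 10-fold cross-validation over a grid of lambda values for ridge regression
cv <- cv.glmnet(x            = X,
                y            = Y,
                family       = 'gaussian',
                alpha        = 0,        # ridge penalty
                nfolds       = 10,
                type.measure = 'mse',    # criterion used to compare lambda values
                intercept    = FALSE)

cv$lambda.min                # lambda with the smallest cross-validated MSE
coef(cv, s = 'lambda.min')   # coefficients at that lambda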

20 / 31

Using Ridge Regression to Predict Readability Scores

Please review the following notebook for applying Ridge Regression to predict readability scores from 768 features using the whole dataset.

Predicting Readability Scores using the Ridge Regression

21 / 31










Lasso Regression

22 / 31

Lasso Penalty

Lasso regression is similar to ridge regression but applies a different penalty term.

$$Penalty = \lambda \sum_{p=1}^{P}|\beta_p|,$$

  • λ is the penalty constant,

  • |βp| is the absolute value of the regression coefficient for the pth predictor.

The loss function for the lasso regression to optimize:

$$Loss = \sum_{i=1}^{N}\epsilon_{(i)}^2 + \lambda \sum_{p=1}^{P}|\beta_p|$$

23 / 31

Let's consider the same example where we fit a simple linear regression model: the readability score is the outcome ($Y$) and Feature 220 is the predictor ($X$).

$$Y = \beta_0 + \beta_1 X + \epsilon,$$

Assume the set of coefficients is { β0, β1 } = {-1.5, 2}, so the model is

$Y = -1.5 + 2X + \epsilon$. Then, the value of the loss function when λ=0.2 would be equal to 18.467.

d <- readability_sub[,c('V220','target')]
b0 = -1.5
b1 = 2
d$predicted <- b0 + b1*d$V220
d$error <- d$target - d$predicted
d
V220 target predicted error
1 -0.13908258 -2.06282395 -1.7781652 -0.28465879
2 0.21764143 0.58258607 -1.0647171 1.64730321
3 0.05812133 -1.65313060 -1.3837573 -0.26937327
4 0.02526429 -0.87390681 -1.4494714 0.57556460
5 0.22430885 -1.74049148 -1.0513823 -0.68910918
6 -0.07795373 -3.63993555 -1.6559075 -1.98402809
7 0.43400714 -0.62284268 -0.6319857 0.00914304
8 -0.24364550 -0.34426981 -1.9872910 1.64302120
9 0.15893717 -1.12298826 -1.1821257 0.05913740
10 0.14496475 -0.99857142 -1.2100705 0.21149908
11 0.34222975 -0.87656742 -0.8155405 -0.06102693
12 0.25219145 -0.03304643 -0.9956171 0.96257066
13 0.03532625 -0.49529863 -1.4293475 0.93404886
14 0.36410633 0.12453660 -0.7717873 0.89632394
15 0.29988593 0.09678258 -0.9002281 0.99701073
16 0.19837037 0.38422270 -1.1032593 1.48748196
17 0.07807041 -0.58143038 -1.3438592 0.76242880
18 0.07935690 -0.34324576 -1.3412862 0.99804044
19 0.57000953 -0.39054205 -0.3599809 -0.03056111
20 0.34523284 -0.67548411 -0.8095343 0.13405021
lambda = 0.2
SSR <- sum((d$error)^2)
penalty <- lambda*(abs(b0) + abs(b1))
loss <- SSR + penalty
SSR
[1] 17.76657
penalty
[1] 0.7
loss
[1] 18.46657
24 / 31

Below is a demonstration of what happens to the loss function and the regression coefficients for increasing levels of the penalty term (λ) in lasso regression.

Note that the regression coefficients become equal to 0 at some point (this was not the case for ridge regression).
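
A minimal sketch of such a demonstration, assuming the standardized X and Y from the earlier slides (glmnet computes the whole coefficient path when given a vector of λ values):

require(glmnet)

lasso.path <- glmnet(x         = X,
                     y         = Y,
                     family    = 'gaussian',
                     alpha     = 1,
                     lambda    = seq(1, 0, -0.01),   # glmnet prefers a decreasing lambda sequence
                     intercept = FALSE)

# Rows are coefficients, columns correspond to the lambda values;
# beyond some lambda, every coefficient is exactly zero.
round(as.matrix(coef(lasso.path))[, 1:5], 3)

# plot(lasso.path, xvar = 'lambda')   # coefficient paths against log(lambda)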

25 / 31

Model Estimation via glmnet()

  • There is no closed-form solution for lasso regression due to the absolute value terms in the loss function.

  • The only way to estimate the coefficients of the lasso regression is to use numerical techniques and obtain computational approximations.

  • Similar to ridge regression, glmnet is an engine we can use to estimate the coefficients of the lasso regression.

  • The arguments are identical. The only difference is that we fix the alpha= argument to 1 for lasso regression.

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))
Y <- scale(Y)
X <- scale(X)
mod <- glmnet(x         = X,
              y         = Y,
              family    = 'gaussian',
              alpha     = 1,
              lambda    = 0.2,
              intercept = FALSE)
coef(mod)
3 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
V1 0.2174345
V2 .
26 / 31
  • The . symbol for the coefficient of the second predictor indicates that it is equal to zero.

  • While the regression coefficients in ridge regression shrink toward zero, they never end up being exactly equal to zero.

  • In contrast, lasso regression may yield a value of zero for some coefficients in the model.

  • For this reason, lasso regression may be used as a variable selection algorithm. The variables with coefficients equal to zero may be discarded from future considerations as they are not crucial for predicting the outcome.
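
A minimal sketch of using a fitted lasso model this way (mod is the object fitted on the previous slide; the selection logic below is illustrative, not part of glmnet):

b <- as.matrix(coef(mod))    # coefficient estimates as a regular matrix

# Names of predictors with non-zero coefficients
selected <- rownames(b)[b[, 1] != 0 & rownames(b) != '(Intercept)']
selected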

27 / 31

Tuning λ

We implement a similar strategy for finding the optimal value of λ as we did for the Ridge Regression.
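
For instance, a minimal sketch (assuming the standardized X and Y from the earlier slides):

require(glmnet)

cv.lasso <- cv.glmnet(x = X, y = Y, alpha = 1, nfolds = 10,
                      type.measure = 'mse', intercept = FALSE)

cv.lasso$lambda.min   # lambda with the smallest cross-validated MSE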

Using Lasso Regression to Predict the Readability Scores

Please review the following notebook to apply Lasso Regression to predict readability scores from all 768 features using the whole dataset.

Predicting Readability Scores using the Lasso Regression

28 / 31










Elastic Net

29 / 31

Elastic net combines the two types of penalty into one by mixing them with some weighted average. The penalty term for the elastic net could be written as

$$\lambda \left[(1-\alpha)\sum_{p=1}^{P}\beta_p^2 + \alpha\sum_{p=1}^{P}|\beta_p|\right].$$

The loss function for the elastic net to optimize:

$$Loss = \sum_{i=1}^{N}\epsilon_{(i)}^2 + \lambda \left[(1-\alpha)\sum_{p=1}^{P}\beta_p^2 + \alpha\sum_{p=1}^{P}|\beta_p|\right]$$

  • When α is set to 0, this is equivalent to ridge regression.

  • When α equals 1, this is equivalent to lasso regression.

  • When α takes any value between 0 and 1, this term becomes a weighted average of the ridge penalty and lasso penalty.

In Elastic Net, two hyperparameters will be tuned: α and λ.

We can consider many possible combinations of these two hyperparameters and try to find the optimal combination using 10-fold cross-validation.
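
A sketch of what such a grid search could look like with the caret package (assuming the standardized X and Y from the earlier slides; with the full readability data you would pass all 768 features instead):

require(caret)
require(glmnet)

# Grid of candidate hyperparameter combinations
grid <- expand.grid(alpha  = seq(0, 1, 0.1),
                    lambda = seq(0, 1, 0.01))

cv10 <- trainControl(method = 'cv', number = 10)   # 10-fold cross-validation

enet <- train(x         = X,
              y         = Y[, 1],    # train() expects a vector outcome
              method    = 'glmnet',
              trControl = cv10,
              tuneGrid  = grid)

enet$bestTune   # the (alpha, lambda) pair with the best cross-validated performance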

30 / 31

Using Elastic Net to Predict the Readability Scores

Please review the following notebook for applying Elastic Net to predict readability scores from all 768 features using the whole dataset.

Predicting Readability Scores using the Elastic Net

Using the Prediction Model for a New Text

Please review the following notebook for predicting the readability of a given text with the existing model objects.

Predicting Readability Scores for a new text

31 / 31
