On the equivalency between frequentist Ridge (and LASSO) regression and hierarchial Bayesian regression

May 6, 2019 28 min read Statistical Modeling

Introduction

In this post, we will explore frequentist and Bayesian analogues of regularized/penalized linear regression models (e.g., LASSO [L1 penalty], Ridge regression [L2 penalty]), which are an extention of traditional linear regression models of the form:

\[y = \beta_{0}+X\beta + \epsilon\tag{1}\] where $\epsilon$ is the error, which is normally distributed as:

\[\epsilon \sim \mathcal{N}(0, \sigma)\tag{2}\] Unlike these traditional linear regression models, regularized linear regression models produce biased estimates for the $\beta$ weights. Specifically, both frequentist and Bayesian regularized linear regression models pool information across $\beta$ weights, resulting in regression toward a common mean. When the common mean is centered at 0, this pooling of information produces more conservative estimates for each $\beta$ weight (they are biased toward 0). In contrast, traditional linear regression models assume that $\beta$ weights share no group-level information (i.e. they are independent), which leads to so-called unbiased estimates.

So then, why are these models—which produce biased estimates—becoming increasingly popular throughout the social and behavioral sciences?

Learning Objectives

The current post seeks to show that we actually want biased estimates in many contexts. In doing so, we will also explore associations between frequentist and Bayesian regularization. Therefore, the learning objectives of the current post are to develop an understanding of:

why so-called biased estimates are actually good for science,
the difference between traditional and regularized linear regression models, and
the correspondence between frequentist regularization and hierarchical Bayesian regression

Data

For this post, we will use a college-admissions dataset that is freely available from Kaggle. Make sure to download the .csv file named Admission_Predict_Ver1.1.csv. This dataset contains 500 observations (i.e. rows) and 9 total variables (i.e. columns). 1 of these columns is a subject ID, which we will note use for modeling. Taken directly from the link:

“The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are : 1. GRE Scores ( out of 340 ) 2. TOEFL Scores ( out of 120 ) 3. University Rating ( out of 5 ) 4. Statement of Purpose and Letter of Recommendation Strength ( out of 5 ) 5. Undergraduate GPA ( out of 10 ) 6. Research Experience ( either 0 or 1 ) 7. Chance of Admit ( ranging from 0 to 1 )”

Our goal is to predict the likelihood of being admitted to graduate school (Chance of Admit), given the other variables, which we will now refer to as predictors.

Getting Started

First off, let’s load the libraries that we will use:

# For traditional LASSO/Ridge regression
library(glmnet)

## Warning: package 'glmnet' was built under R version 4.1.1

# For Bayesian modeling
library(rstan)
# For data wrangling/plotting
library(dplyr)
library(tidyr)
library(foreach)
library(ggplot2)
library(bayesplot) # Visualizing posteriors 
library(akima)     # for 3D plotting

## Warning: package 'akima' was built under R version 4.1.1

library(plotly)    # for 3D plotting

## Warning: package 'plotly' was built under R version 4.1.1

Next, we need to read in the data. Assuming that you have already downloaded the data from the link above, we can read it into R as follows:

# I was lazy and just left this in the downloads folder...
grad_dat <- read.csv("~/Downloads/Admission_Predict_Ver1.1.csv")

# View first few observations 
tbl_df(grad_dat)

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## # A tibble: 500 × 9
##    Serial.No. GRE.Score TOEFL.Score University.Rating   SOP   LOR  CGPA Research
##         <int>     <int>       <int>             <int> <dbl> <dbl> <dbl>    <int>
##  1          1       337         118                 4   4.5   4.5  9.65        1
##  2          2       324         107                 4   4     4.5  8.87        1
##  3          3       316         104                 3   3     3.5  8           1
##  4          4       322         110                 3   3.5   2.5  8.67        1
##  5          5       314         103                 2   2     3    8.21        0
##  6          6       330         115                 5   4.5   3    9.34        1
##  7          7       321         109                 3   3     4    8.2         1
##  8          8       308         101                 2   3     4    7.9         0
##  9          9       302         102                 1   2     1.5  8           0
## 10         10       323         108                 3   3.5   3    8.6         0
## # … with 490 more rows, and 1 more variable: Chance.of.Admit <dbl>

Creating Training and Test Sets

Before really looking at the data, we need to separate out the training and testing portions. The original competition version of data uploaded to Kaggle only included the first 400 observations, and competitors had to make predictions on the remaining 100 observations before the actual outcomes (i.e. likelihood of getting into graduate school) were released.

To show off the benefits of regularized over traditional methods, we will only train our models on the first 20 observations, and make predictions on the remaining 480. In these low data settings—which are common to many areas of social and behavioral science (e.g., psychology, neuroscience, human ecology, etc.)—regularized regression models show clear advantages over traditional regression models. In fact, I will go as far as claiming that regularized models should be the default choice in most areas of science, and the following examples should explain why.

# Training data (used to fit model)
train_dat <- grad_dat[1:20,] # Only first 20 for training

# Testing data (used to test model generalizability)
test_dat <- grad_dat[21:500,] # Testing on the rest

Now that we have the data read in and separated, it would be useful to visualize our training data to get a sense of what we are working with. A correlation matrix should do a good job here:

# Plotting correlation matrix
cor(train_dat) %>%
  as.data.frame() %>%
  mutate(Var1 = factor(row.names(.), levels=row.names(.))) %>% # For nice order
  gather(Var2, Correlation, 1:9) %>%
  ggplot(aes(reorder(Var2, Correlation), # Reorder to visualize
             reorder(Var1, -Correlation), fill = Correlation)) +
  geom_tile() +
  scale_fill_continuous(type = "viridis") +
  xlab("Variable") +
  ylab("Variable") +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This correlation matrix shows us that Serial.No is not really correlated with any of the other 8 variables. This makes sense, given that Serial.No is the subject ID (we will not use that in our models). Otherwise, it appears that there is a moderate amount of collinearity amoung the predictor and outcome variables.

Traditional Regression

Traditional linear regression models seek to mimimize the squared error between predicted and actual observations. Formally, we can represent this with a loss function of the following form:

\[\underset{\boldsymbol{\beta}}{argmin}\sum_{i=1}^{n}(y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij})^2\tag{3}\]

where $y_{i}$ is the outcome variable (probability of acceptance in our example) for observation $i$, $n$ is the total number of observations (20 students in our training data example), $p$ is the number of predictors (7 in our example), $\beta_{0}$ is the intercept of the model, and $\beta_{1,2,...,j}$ are the weights (i.e. slopes) for each of the $j$ predictor variables. Under the assumption of normality, both ordinary least squares and maximum likelihood estimation of $\beta$ weights in equation 3 will offer the same results—from here on, we will refer to these methods of minimizing equation 3 as traditional linear regression.

Below, we will fit a traditional linear regression model on the training set (which contains 20 observations total), and we will see how well it predicts the graduate school acceptance probability ($\text{Pr}(Acceptance)$) for each of the remaining 480 observations. We begin by scaling all variables—that is, we mean-center and divide each column by its own standard deviation (SD). Then, we apply the same standardization to the test data, followed by fitting the traditional model and making test-set predictions.

# Scale training data (and get rid of ID)
train_scale <- scale(train_dat[,2:9])

# Find means and SDs of training data variables
means <- attributes(train_scale)$`scaled:center`
SDs <- attributes(train_scale)$`scaled:scale`

# Scale test data using training data summary stats (no cheating!)
test_scale <- scale(test_dat[,-1], center = means, scale = SDs)

# Fit linear regression
fit_lr <- lm(Chance.of.Admit ~ ., data = as.data.frame(train_scale))

# Generate test-set predictions with linear regression
y_pred_lr <- predict(fit_lr, newdata = as.data.frame(test_scale[,-8]))

# Plot cor(predicted, actual)
qplot(x = y_pred_lr, y = test_scale[,8],
      main = paste0("Traditional Linear Regression\n", 
                    "r = ", round(cor(test_scale[,8], y_pred_lr), 2))) +
  xlab("Model Predicted Pr(Acceptance)") +
  ylab("Actual Pr(Acceptance)") +
  theme_minimal(base_size = 20)

Wow, not bad at all! Even with only 20 observations, the correlation between predicted and actual probability of acceptance for the remaining 480 observations is $r = $ $0.76$. This suggests that probability of acceptance is rather well described as a simple linear combination of the predictors.

Regularized Regression

As described above, regularized linear regression models aim to estimate more conservative values for the $\beta$ weights in a model, and this is true for both frequentist and Bayesian versions of regularization. While there are many methods that can be used to regularize your estimation procedure, we will focus specifically on two popular forms—namely, ridge and LASSO regression. We start below by describing each regression generally, and then proceed to implement both the frequentist and Bayesian versions.

Ridge Regression

The extention from traditional to ridge regression is actually very straightforward! Specifically, we modify the loss function (equation 3) to include a penalty term for model complexity, where model complexity is operationalized as the sum of squared $\beta$ weights:

\[\underbrace{\underset{\boldsymbol{\beta}}{argmin}\sum_{i=1}^{n}(y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij})^2}_{\text{Traditional Loss Function}} + \underbrace{\lambda\sum_{j=1}^{p}\beta_{j}^2}_{\text{Ridge Penalty}}\tag{4}\] Here, $\lambda~(0<\lambda<\infty)$ is a penalty parameter, which controls how much regularization we would like in the model. To gain an intuition for the behavior of this model, think of what happens at the very extremes of $\lambda$. For example, when $\lambda = 0$, what happens to equation 4? Well, the model simply reduces to equation 3, and we are back to traditional regression! What about as $\lambda \rightarrow \infty$? In that case, any non-zero values for $\beta$ will lead to an infinitely large penalty—then, the only solution to equation 4 is for all $\beta$ weights to be equal to 0, indicating no learning from the data at all.

Therefore, we can think of $\lambda$ as a parameter that controls how much we learn from the data, with smaller and larger values leading to more and less learning, respectively. Note that parameters that control how much we learn from data are typically called hyper-parameters. Framed in these terms, we can view traditional regression methods as those that maximally learn from the data, and regularized regression models as those that restrict learning from the data. It is in this way that traditional regression produces unbiased estimate of the $\beta$ weights. That is, $\beta$ weights are unbiased with respect to the information available in the training data. However, when data are noisy (i.e. contain large amounts of variability), such unbiased estimates will be unduly influenced by noise. Thus, although traditional regression offers unbiased estimates, it is also succeptable to estimating large magnitude $\beta$ weights based purely on noise within the data. Ridge regression minimizes the potential learning from noise by penalizing the model for the squared sum of all the $\beta$ weights. The squaring of the $\beta$ weights encodes our knowledge that large-magnitude $\beta$ weights are much less likely than small-magnitude $\beta$ weights (given that all our variables are on the same scale after standardization). Practically, this means that if a traditional regression would give us a very large magnitude $\beta$ weight, when data are highly variable, ridge regression will bias such large $\beta$ estimates toward 0.

We can actually visualize the effects of the ridge penalty on $\beta$ weights by plotting out the ridge penalty from equation 4. Specifically, the below plot shows the resulting penalty (where 0 is no penalty and increasingly negative numbers are stronger penalties) for the 2-dimensional case, where we only have 2 predictors. Importantly, the plot also includes the penalty functions for varying settings of $\lambda \in \{0, .5, 1.5\}$. Note that the flat surface is when $\lambda = 0$, which leads to no penalization.

The goal of ridge regression is to find the optimal combination of $\beta$ weights that minimizes the traditional loss function, while also accounting for the added “loss” incurred by the penalty term. Looking at the plot above, you can see that as the $\beta$ weights both move away from 0, the ridge penalty function shows a slow increase near 0, which becomes progressively steeper as we move away from the origin. Further, as $\lambda \rightarrow \infty$, the penalty function becomes steeper and steeper. Altogether, ridge regression does not prefer particular values for $\beta$ weights—it only prefers for all $\beta$ weights to be closer to 0.

Frequentist Ridge Regression

One of the problems that arises from equation 4 is that we need to select a value for $\lambda$, the parameter that controls how much we learn from our training data. But how should we select an optimal $\lambda$? Of course, we could just pick a value based on our intuition, but that would likely lead to non-optimal solutions in many cases. Therefore, frequentist ridge regression typically relies on cross-validation (CV) to select $\lambda$ (see here for more details on hyper-parameter tuning/optimization). CV proceeds by:

generating a grid of values for our hyper-parameter ($\lambda$)
splitting the training data into k equivalently sized training sets (k folds),
fitting a model to k-1 folds and testing prediction accuracy on the left-out fold,
repeat step 3 until each fold has been left out exactly once, and
iterating steps 2-4 for each hyper-parameter in the grid defined by step 1

When we are finished, we select the value of $\lambda$ in the grid that minimizes the CV error (step 3). The R code below proceeds through the above steps, using the R glmnet package to fit the model:

# Penalty term values to test
lambdas <- 10^seq(3, -2, by = -.1)

# Fit and cross-validate
fit_ridge <- glmnet(x = train_scale[,-8], y = train_scale[,8], 
                    alpha = 0, lambda = lambdas, intercept = F)
cv_ridge <- cv.glmnet(x = train_scale[,-8], y = train_scale[,8], 
                      alpha = 0, lambda = lambdas, intercept = F, 
                      grouped = FALSE)

# Find optimal penalty term (lambda)
opt_lambda <- cv_ridge$lambda.min

# Generate predictions with opitmal lambda model from cross-validation
y_pred <- predict(fit_ridge, s = opt_lambda, newx = test_scale[,-8])

# Plot cor(predicted, actual)
qplot(x = y_pred[,1], y = test_scale[,8],
      main = paste0("Frequentist Ridge Regression:\nEstimating ", expression(lambda), 
                    " with CV\nr = ", round(cor(test_scale[,8], y_pred[,1]), 2))) +
  #geom_point(aes(x = , y = train_scale[,8]))
  xlab("Model Predicted Pr(Acceptance)") +
  ylab("Actual Pr(Acceptance)") +
  theme_minimal(base_size = 20)

Woah, much better than traditional linear regression! Before diving into these results, however, let’s first explore Bayesian ridge regression.

Bayesian Ridge Regression

Bayesian Ridge regression differs from the frequentist variant in only one way, and it is with how we think of the $\lambda$ penalty term. In the frequentist perspective, we showed that $\lambda$ effectively tells our model how much it is allowed to learn from the data. In the Bayesian world, we can capture such an effect in the form of a prior distribution over our $\beta$ weights. To reveal the extraordinary power hiding behind this simple idea, let’s first discuss Bayesian linear regression.

Bayesian models view estimation as a problem of integrating prior information with information gained from data, which we formalize using probability distributions. This differs from the frequntist view, which treats regression as an opimization problem that results in a point estimate (e.g., minimizing squared error). A Bayesian regression model takes the form of:

\[y_{i} | \beta_{0},\boldsymbol{x_{i}},\boldsymbol{\beta},\sigma \sim \mathcal{N}(\beta_{0} + \sum_{j=1}^{p}x_{ij}\beta_{j}, \sigma) \tag{5}\] Importantly, Bayesian models require us to specify a prior distribution for each parameter we seek to estimate. Therefore, we need to specify a prior on the intercept ($\beta_{0}$), slopes ($\boldsymbol{\beta}$), and error variance ($\sigma$) in equation 5. Since we are standardizing all of our predictors and outcome variable(s), we will ignore the intercept term. Then, we are left with $\boldsymbol{\beta}$ and $\sigma$. Crucially, our choice of prior distribution on $\boldsymbol{\beta}$ is what determines how much information we learn from the data, analagous to the penalty term $\lambda$ used for frequentist regularization.

Like before, you can gain an intuition for this behavior by imagnining extreme cases. For example, suppose that we assumed that $\boldsymbol{\beta} \sim \mathcal{U}(-\infty,+\infty)$. That is, we assume that $\boldsymbol{\beta}$ can take on any real-valued number, and every value is equally likely. In this case, the mode of the posterior distribution on each $\beta$ weight will be equivalent to the maximum likelihood estimate of the respective $\beta$ weight, and by extention the OLS estimate under the normality assumption (see our prior post on parameter estimation for more details). If an unbounded uniform distribution on $\boldsymbol{\beta}$ produces the same behavior as traditional linear regression, it follows from our discussions above that it also allows us to maximally learn from the data.

Now, what can we do to restrict learning in a Bayesian model? Well, we can use a prior distribution that pulls the $\beta$ weights toward 0 (unlike the unbounded uniform distribution), similar to the $\lambda$ penalty parameter in frequentist regularization! Specifically, specifying the following prior distribution on $\boldsymbol{\beta}$ is mathematically equivalent in expectation (see here) to using the ridge penalty in the frequentist model:

\[\boldsymbol{\beta} \sim \mathcal{N}(0, \sigma_{\beta})\tag{6}\]

There are two important points that the prior distribution in equation 6 conveys:

The normal distribution places very little prior probability on large-magnitude $\beta$ weights (i.e. far from 0), while placing high prior probability on small-magnitude weights (i.e. near 0), and
$\sigma_{\beta}$ controls how wide the normal distribution is, thus controlling the specific amount of prior probability placed on small- to large-magnitude $\beta$ weights.

The Bayesian version differs most in that we jointly estimate the “penalization term” $\sigma_{\beta}$ and $\boldsymbol{\beta}$ in a single model, as opposed to using CV to select an optimal $\lambda$ for frequentist regression. Because we are jointly estimating $\sigma_{\beta}$ along with individual-level $\beta$ weights, we can actually view Bayesian ridge regression as a simple hierarchical Bayesian model, where $\sigma_{\beta}$ is interpreted as a group-level scaling parameter that is estimated from pooled information across individual $\beta$ weights.

Below is the Stan code that specifies the Bayesian variant of ridge regression. Note that this code is based on Erp, Oberski, & Mulder (2019):

data{
    int N_train;             // "# training observations"
    int N_test;              // "# test observations"
    int N_pred;              // "# predictor variables"
    vector[N_train] y_train; // "training outcomes"
    matrix[N_train, N_pred] X_train; // "training data"
    matrix[N_test, N_pred] X_test;   // "testing data"
}
parameters{
    real<lower=0> sigma;   // "error SD"
    real<lower=0> sigma_B; // "hierarchical SD across betas"
    vector[N_pred] beta;   // "regression beta weights"
}
model{
  // "group-level (hierarchical) SD across betas"
  sigma_B ~ cauchy(0, 1);
  
  // "model error SD"
  sigma ~ normal(0, 1);
  
  // "beta prior (provides 'ridge' regularization)"
  beta ~ normal(0, sigma_B);
    
  // "model likelihood"
    y_train ~ normal(X_train*beta, sigma);
}
generated quantities{ 
    real y_test[N_test]; // "test data predictions"
    for(i in 1:N_test){
        y_test[i] = normal_rng(X_test[i,] * beta, sigma);
    }
}

The important parts of the above code are in the model{...} section. While the parameterization looks different from the frequentist model (equation 4), the behavior of the model is equivalent. In particular, we have a $\sigma_{\beta}$ “penalty” term, can shrink the variation between $\beta$ weights toward 0. Intuitively, as $\sigma_{\beta} \rightarrow 0$, all ($\beta$) weights are shrunk toward 0 (no learning from data). Conversely, as $\sigma_{\beta} \rightarrow \infty$, prior on $\sigma_{\beta}$ becomes uniform. A uniform prior denotes no penalty at all, and we are left with traditional, non-regularized regression (albeit we get a joint posterior distribution as opposed to point estimates). It is worth noting that the mode of the resulting posterior distribution over $\boldsymbol{\beta}$ will be equivalent to the maximum likelihood estimate when the prior on $\boldsymbol{\beta}$ is uniform (for details, see MLE/MAP estimation described in a previous post).

Now, let’s try fitting the model!

# First, prepare data for Stan
stan_dat <- list(N_train = nrow(train_scale),
                 N_test  = nrow(test_scale),
                 N_pred  = ncol(train_scale)-1,
                 y_train = train_scale[,8],
                 X_train = train_scale[,-8],
                 X_test  = test_scale[,-8])

# Fit the model using Stan's NUTS HMC sampler
fit_bayes_ridge <- sampling(bayes_ridge, stan_dat, iter = 2000, 
                            warmup = 500, chains = 3, cores = 3)

# Extract posterior distribution (parameters and predictions)
post_ridge <- rstan::extract(fit_bayes_ridge)

# Compute mean of the posterior predictive distribution over test set predictors,
# which integrates out uncertainty in parameter estimates
y_pred_bayes <- apply(post_ridge$y_test, 2, mean)

# Plot correlation between posterior predicted mean and actual Pr(Acceptance)
qplot(x = y_pred_bayes, y = test_scale[,8],
      main = paste0("Bayesian Ridge Regression:\nEstimating ", expression(lambda), 
                    " Hierarchically\nr = ", round(cor(test_scale[,8], y_pred_bayes), 2))) +
    xlab("Model Predicted Pr(Acceptance)") +
    ylab("Actual Pr(Acceptance)") +
    theme_minimal(base_size = 20)

It is clear that the Bayesian ridge regression is giving results very similar (almost identical) to frequentist ridge regression as outlined above. Importantly, the Bayesian method also offers straightforward measures of uncertainty for both parameter estimates and predictions on the test data. The plot above integrates over the prediction uncertainty (i.e. integrating over the posterior predictive distribution). To really appreciate the prediction uncertainty, we compute the correlation between predicted and actual Pr(Acceptance) for each posterior sample, and then visualize the resulting distribution of correlations:

# Iterate through each posterior prediction and compute cor(predicted, actual),
# which preserves uncertainty in parameter estimates
y_pred_bayes2 <- apply(post_ridge$y_test, 1, function(x) cor(x, test_scale[,8]))

# Plot posterior predictive distribution of cor(predicted, actual)
qplot(x = y_pred_bayes2, geom = "histogram", bins = 50,
      xlab = "Correlation between\nPredicted and Actual Pr(Acceptance)") +
  theme_minimal(base_size = 20)

Clearly, there is good amount of uncertainty contained within the posterior predictive distribution, and the above plot shows it well. Importantly, this uncertainty reduce in proportion to how much data (i.e. observations) we have. Given that we only used 20 observations, these results are actually pretty good!

Another method is to visualize the scatterplot like above, but to include prediction intervals around the predicted mean estimates for each observation in the test set:

# `bayesplot` has many convenience functions for working with posteriors
color_scheme_set(scheme = "darkgray")
ppc_intervals(x = colMeans(post_ridge$y_test), y = test_scale[,8],
              yrep = post_ridge$y_test, prob = 0.95) +
  ggtitle("95% Posterior Prediction Intervals") +
  xlab("Model Predicted Pr(Acceptance)") +
  ylab("Actual Pr(Acceptance)") +
  theme_minimal(base_size = 20)

Looks great! We see that the 95% posterior prediction intervals contain the actual values in virtually all cases. Importantly, this visualization also makes it clear just how much uncertainty there is for each individual observation.

Next up, we will explore Traditional and Bayesian LASSO regressions, followed by a discussion of the benefits of Bayesian over frequentist regularization more broadly.

LASSO Regression

Now that we have covered ridge regression, LASSO regression only involves a minor revision to the loss function. Specifically, as opposed to penalizing the model based on the sum of squared $\beta$ weights, we will penalize the model by the sum of the absolute value of $\beta$ weights:

That’s it! But what difference does this make? At this point, it is useful to visualize the effect of the LASSO penalty on $\beta$ weight estimates, as we did above for the ridge penalty. As before, we are looking at penalty functions for varying settings of $\lambda \in \{0, .5, 1.5\}$:

Interesting! We can see here that unlike for the ridge penalty, there are sharp corners in the geometry, which correspond to when either $\beta_{1}$ or $\beta_{2}$ equal 0. Further, the LASSO penalty shows a linear increase as we move away from the origin. Together, the sharp corners on 0-valued $\beta$ weights and the linear increase in penalty make the LASSO penalty favor solutions that set one (or more) $\beta$ weights to exactly 0.

Frequentist LASSO Regression

The code below is very similar to what we had for traditional ridge regression, except that we will set $\alpha = 1$ as opposed to $\alpha = 0$:

# Scale training data (get rid of ID)
train_scale <- scale(train_dat[,2:9])

# Penalty term values to test
lambdas <- 10^seq(3, -2, by = -.1)

# Fit and cross-validate
fit_lasso <- glmnet(x = train_scale[,-8], y = train_scale[,8], 
                    alpha = 1, lambda = lambdas, intercept = F)
cv_lasso <- cv.glmnet(x = train_scale[,-8], y = train_scale[,8], 
                      alpha = 1, lambda = lambdas, intercept = F, 
                      grouped = FALSE)

# Find optimal penalty term (lambda)
opt_lambda_lasso <- cv_lasso$lambda.min

# Generate predictions on the test set
y_pred_lasso <- predict(fit_lasso, s = opt_lambda_lasso, newx = test_scale[,-8])

# Plot cor(predicted, actual)
qplot(x = y_pred_lasso[,1], y = test_scale[,8],
      main = paste0("r = ", round(cor(test_scale[,8], y_pred_lasso[,1]), 2))) +
  xlab("Model Predicted Pr(Acceptance)") +
  ylab("Actual Pr(Acceptance)") +
  theme_minimal(base_size = 20)

Clearly, there is not much of a performance difference between frequentist ridge and LASSO regression in this dataset. However, it is worth noting that the LASSO is often more useful when we expect the solution to be sparse—that is, we expect that many of the predictors in our model are mostly noise, and are goal is to identify more robust effects. In such a case, the LASSO does a good job of setting $\beta$ weights on noisy predictors to exactly 0.

Bayesian LASSO Regression

Recall that for Bayesian ridge regression, we only needed to specifiy a normal prior distribution to the $\beta$ weights that we were aiming to regularize. For Bayesian LASSO regression, the only difference is in the form of the prior distribution. Specifically, setting a Laplace (i.e. double-exponential) prior on the $\beta$ weights is mathematically equivalent in expectation to the frequentist LASSO penalty (see e.g., here):

\[\boldsymbol{\beta} \sim \text{double-exponential}(0, \tau_{\beta})\tag{8}\] Here, $\tau_{\beta}~(0 < \tau_{\beta} < \infty)$ is a scale parameter that controls how peaked the prior distribution is around the center (0 in this case). As $\tau_{\beta} \rightarrow 0$, then the model assigns infinite weight on 0-valued $\beta$ weights (i.e. no learning from data). Conversely, as $\tau_{\beta} \rightarrow \infty$, the prior reduces to a uniform prior, thus leading to no regularization at all (i.e traditional regression, albeit with posterior distributions rather than point estimates). It can be useful to visualize what this prior distribution looks like over a single $\beta$ weight. For example, centered at 0 with $\tau_{\beta} = 0.5$:

qplot(x = seq(-3, 3, .1), y = rmutil::dlaplace(seq(-3, 3, .1), 0, .5), geom = "line") +
  ggtitle("Laplace Prior with tau = 0.5") +
  xlab(expression(beta[1])) +
  ylab("Density") +
  theme_minimal(base_size = 20)

## Registered S3 method overwritten by 'rmutil':
##   method         from
##   print.response httr

Compared to the ridge prior, which is a normal distribution, it is clear that the Laplace distribiution places much more probability mass directly on 0, which produces the variable selection effect specific to LASSO regression. Note also that such peakedness explains why there are sharp corners in the frequentist penalty function (see the LASSO contour plot above).

Below is the Stan code that specifies this Bayesian variant of LASSO regression. Like for the Bayesian ridge regression above, this code is loosely based on Erp, Oberski, & Mulder (2019):

data{
    int N_train;             // "# training observations"
    int N_test;              // "# test observations"
    int N_pred;              // "# predictor variables"
    vector[N_train] y_train; // "training outcomes"
    matrix[N_train, N_pred] X_train; // "training data"
    matrix[N_test, N_pred] X_test;   // "testing data"
}
parameters{
    real<lower=0> sigma;   // "error SD"
    real<lower=0> sigma_B; // "(hierarchical) SD across betas"
    vector[N_pred] beta;   // "regression beta weights"
}
model{
  // "group-level (hierarchical) SD across betas"
  sigma_B ~ cauchy(0, 1);
  
  // "Prior on SD"
  sigma ~ normal(0, 1);
  
  // "beta prior (Note this is the only change!)"
  beta ~ double_exponential(0, sigma_B); 
    
  // "model likelihood"
    y_train ~ normal(X_train*beta, sigma);
}
generated quantities{ 
    real y_test[N_test]; // "test data predictions"
    for(i in 1:N_test){
        y_test[i] = normal_rng(X_test[i,] * beta, sigma);
    }
}

Time to fit the model and visualize results:

# Fit the model using Stan's NUTS HMC sampler
fit_bayes_lasso <- sampling(bayes_lasso, stan_dat, iter = 2000, 
                            warmup = 500, chains = 3, cores = 3)

# Extract posterior distribution (parameters and predictions)
post_lasso <- rstan::extract(fit_bayes_lasso)

# Compute mean of the posterior predictive distribution over test set predictors,
# which integrates out uncertainty in parameter estimates
y_pred_bayes_lasso <- apply(post_lasso$y_test, 2, mean)

# Plot correlation between posterior predicted mean and actual Pr(Acceptance)
qplot(x = y_pred_bayes_lasso, y = test_scale[,8],
      main = paste0("r = ", round(cor(test_scale[,8], y_pred_bayes_lasso), 2))) +
    xlab("Model Predicted Pr(Acceptance)") +
    ylab("Actual Pr(Acceptance)") +
    theme_minimal(base_size = 20)

Like before, the Bayesian LASSO is producing very similar results compared to the frequentist version.

Comparing the Models

So far, we have described and fit both the frequentist and Bayesian versions of ridge and LASSO regression to our training data, and we have shown that we can make pretty outstanding predictions on our held-out test set! However, we have not explored the parameters that each model has estimated. Here, we will begin to probe our models.

First off, let’s look at the ridge regression model parameters.

# Extract beta weights from traditional regression model
betas_trad <- fit_lr$coefficients[-1]

# Extract optimal ridge beta weights from CV
betas_ridge <- fit_ridge$beta[,which(cv_ridge$lambda==cv_ridge$lambda.min)]

# Plot posterior distributions and frequentist point estimates
p1 <- mcmc_areas(as.array(fit_bayes_ridge), pars = paste0("beta[", 1:7, "]"), 
           prob = 0.8, prob_outer = 0.99, point_est = "mean") +
  geom_point(x = rep(betas_ridge, 1024), y = rep(7:1, 1024), color = I("#c61d29"), size = 2) + # rep() here is an annoying work-around to this code breaking in an update...
  geom_point(x = rep(betas_trad, 1024), y = rep(7:1, 1024), color = I("black"), size = 2) +
  ggtitle("Ridge Penalty") +
  theme_minimal(base_size = 20)

# Extract optimal LASSO beta weights from CV
betas_lasso <- fit_lasso$beta[,which(cv_lasso$lambda==cv_lasso$lambda.min)]

# Plot posterior distributions and frequentist point estimates
p2 <- mcmc_areas(as.array(fit_bayes_lasso), pars = paste0("beta[", 1:7, "]"), 
           prob = 0.8, prob_outer = 0.99) +
  geom_point(x = rep(betas_lasso, 1024), y = rep(7:1, 1024), color = I("#c61d29"), size = 2) +
  geom_point(x = rep(betas_trad, 1024), y = rep(7:1, 1024), color = I("black"), size = 2) +
  ggtitle("LASSO Penalty") +
  theme_minimal(base_size = 20)

# Plot in grid
cowplot::plot_grid(p1, p2, ncol = 2)

Above, the disributions reflect the Bayesian posteriors for each $\beta$ weight (shading represents the 80% central interval, based on quantiles; bar representing the posterior mean), the red points indicate the regularized frequentist point estimates, and black points are estimates from the trditional, non-regularized frequentist regression model (shown on each plot for comparison purposes). Importantly, the traditional regression estimates are far from 0 in many cases, whereas both frequentist and Bayesian regularization estimates are much closer to 0. It is this conservatism that allows biased estimates (from regularization/penalization) to outperform traditional models that offer unbiased estimates when tested on external data.

Thus, the benefit of regularization is that we get better parameter estimates (translating to better model performance) in low and/or noisy data settings. Crucially, as we collect more and more data, the effects of regularization become less apparent. We can demonstrate this by fitting all the models above to 400—as opposed to only 20—observations and compare the coefficients to the above plot.

To get started, we first go through our pre-processing steps once again:

# Training data
train_dat2 <- grad_dat[1:400,] # Use 400 observations now

# Testing data
test_dat2 <- grad_dat[401:500,] # Only using last 100 now

# Scale training data (and get rid of ID)
train_scale2 <- scale(train_dat2[,2:9])

# Find means and SDs of training data variables
means2 <- attributes(train_scale2)$`scaled:center`
SDs2 <- attributes(train_scale2)$`scaled:scale`

# Scale test data using training data summary stats (no cheating!)
test_scale2 <- scale(test_dat2[,-1], center = means2, scale = SDs2)

# Prepare data for Stan
stan_dat2 <- list(N_train = nrow(train_scale2),
                  N_test  = nrow(test_scale2),
                  N_pred  = ncol(train_scale2)-1,
                  y_train = train_scale2[,8],
                  X_train = train_scale2[,-8],
                  X_test  = test_scale2[,-8])

Then, we re-fit all the models:

# Fit linear regression
fit_lr2 <- lm(Chance.of.Admit ~ ., data = as.data.frame(train_scale2))

# Fit and cross-validate ridge
fit_ridge2 <- glmnet(x = train_scale2[,-8], y = train_scale2[,8], 
                     alpha = 0, lambda = lambdas, intercept = F)
cv_ridge2 <- cv.glmnet(x = train_scale2[,-8], y = train_scale2[,8], 
                       alpha = 0, lambda = lambdas, intercept = F)

# Fit the model using Stan's NUTS HMC sampler
fit_bayes_ridge2 <- sampling(bayes_ridge, stan_dat2, iter = 2000, 
                             warmup = 500, chains = 3, cores = 3)
# Fit and cross-validate LASSO
fit_lasso2 <- glmnet(x = train_scale2[,-8], y = train_scale2[,8], 
                     alpha = 1, lambda = lambdas, intercept = F)
cv_lasso2 <- cv.glmnet(x = train_scale2[,-8], y = train_scale2[,8], 
                       alpha = 1, lambda = lambdas, intercept = F)
# Fit the model using Stan's NUTS HMC sampler
fit_bayes_lasso2 <- sampling(bayes_lasso, stan_dat2, iter = 2000, 
                             warmup = 500, chains = 3, cores = 3)

And time to plot!

# Extract beta weights from traditional regression model
betas_trad2 <- fit_lr2$coefficients[-1]

# Extract optimal ridge beta weights from CV
betas_ridge2 <- fit_ridge2$beta[,which(cv_ridge2$lambda==cv_ridge2$lambda.min)]

# Plot posterior distributions and frequentist point estimates
p3 <- mcmc_areas(as.array(fit_bayes_ridge2), pars = paste0("beta[", 1:7, "]"), 
           prob = 0.8, prob_outer = 0.99, point_est = "mean") +
  geom_point(x = rep(betas_ridge2, 1024), y = rep(7:1, 1024), color = I("#c61d29"), size = 2) +
  geom_point(x = rep(betas_trad2, 1024), y = rep(7:1, 1024), color = I("black"), size = 2, alpha = .6) +
  ggtitle("Ridge Penalty") +
  theme_minimal(base_size = 20)

# Extract optimal LASSO beta weights from CV
betas_lasso2 <- fit_lasso2$beta[,which(cv_lasso2$lambda==cv_lasso2$lambda.min)]

# Plot posterior distributions and frequentist point estimates
p4 <- mcmc_areas(as.array(fit_bayes_lasso2), pars = paste0("beta[", 1:7, "]"), 
           prob = 0.8, prob_outer = 0.99) +
  geom_point(x = rep(betas_lasso2, 1024), y = rep(7:1, 1024), color = I("#c61d29"), size = 2) +
  geom_point(x = rep(betas_trad2, 1024), y = rep(7:1, 1024), color = I("black"), size = 2, alpha = .6) +
  ggtitle("LASSO Penalty") +
  theme_minimal(base_size = 20)

cowplot::plot_grid(p3, p4, ncol = 2)

Woah! Unlike before, all of our estimates pretty much fall right on top of each other. Because this dataset has both: (1) many observations, and (2) a relatively high signal-to-noise ratio, both traditional and regularized regression methods offer the same conclusion when we make use the full training dataset.

Discussion

In this post, we learned about the benefits of using regularized/penalized regression models over traditional regression. We determined that in low and/or noisy data settings, the so-called unbiased estimates given by traditional regression modeling actually lead to worse-off model performance. Importantly, we learned that this occurs because being ubiased allows a model to learn a lot from the data, including learning patterns of noise. Then, we learned that biased methods such as ridge and LASSO regression restrict the amount of learning that we get from data, which leads to better estimates in low and/or noisy data settings.

Finally, we saw that hierarchical Bayesian models actually contain frequentist ridge and LASSO regression as a special case—namely, we can choose a prior distribution across the $\beta$ weights that gives us a solution that is equivalent to that of the frequentist ridge or LASSO methods! Not only that, but Bayesian regression gives us a full posterior distribution for each parameter, thus circumventing problems with frequentist regularization that require the use of bootstrapping to estimate confidence intervals.

So What?

Psychology—and many other areas of social and behavoiral science—is in the midst of a replication crisis. While many factors are at play here, one problem is that the methods we use to draw inference from our data are too liberal. Taken with the high measurement error, low sample sizes, and lack of mechanistic models in our research areas, we are bound to overestimate the magnitude (and even miss the sign) of effects that we are interested in if we continue to use methods that are prone to overfitting. Regularization provides an elegant solution to this problem of “learning too much” from our data.

Moving forward, I hope that you consider regularizing your models! (and if you are feeling really fancy, go Bayes! Embrace the uncertainty!)

R Machine Learning Regularization Bayesian Statistics

Nathaniel Haines

Data Scientist & Computational Psychologist, PhD

An academic Bayesian who is currently exploring the high dimensional posterior distribution of life