How to Estimate a Mean, and What It Means for Science
Introduction
If I asked you to estimate 30 means, you would probably compute 30 sample means. And you would be provably wrong (well.. maybe not wrong, but at least provably inefficient).
Now, this might seem like a strange claim. After all, you may already know that the sample mean is the minimum variance unbiased estimator (MVUE) for a single mean—as the verbose name implies, this makes the sample mean the best unbiased estimator of the true underlying mean. And if we are only interested in a single mean—say, the average height of adults in the United States—the sample mean is indeed the best we can do (in terms of frequentist risk). Of course, scientists almost never estimate just one mean. We estimate means for multiple individuals, multiple conditions, multiple groups, oftentimes all within a single analysis. However, as alluded to above, once we are estimating three or more means simultaneously, the sample mean is inadmissible—meaning there exist other estimators that have lower total mean squared error for every possible set of true values.
This result, first proven by Charles Stein in 1956 and made constructive by James and Stein in 1961 (i.e. the actual estimator was introduced), was so counterintuitive that it became known as “Stein’s Paradox” (for a fun, approachable read check out Efron & Morris, 1977). Yet over 60 years later, the vast majority of scientists—whether data or research scientists in industry or ivory tower academics—still use sample means as if they were optimal. Every time a researcher computes a mean response time or summed score per participant, a mean score per group, or a mean difference across conditions, they are using an estimator that is provably inferior to alternatives.
The most interesting part of all this is that alternative estimators were discovered independently across hundreds of years. Psychometricians had the core insight in the mid-1920s through the lens of classical test theory, reliability, and Kelley’s true score estimation. Mathematical statisticians formalized it through James-Stein estimators in 1961. And Bayesians—going back to the Reverend himself in 1763 (see Efron & Morris, 1977), but extended later by Laplace (1812) and Jeffreys (1939)—understood that combining outside information with observed data naturally produces shrinkage, and when done right this shrinkage produces better estimates than the observed data alone. Today, modern regularization techniques like ridge regression achieve the same effect by optimizing penalized likelihood functions. All of these approaches converge on the same fundamental principle: when estimating a mean (or really any expectation), pool information across units.
In this post, we will walk through how shrinkage works according to each of the above approaches, starting with a simulation to demonstrate the problem and ending with a discussion of what this all means for science and statistical inference more broadly. If you have read my previous post on measurement error, some of this will be familiar. Here, we will take a broader perspective, connecting the dots across fields and making the case that choosing good estimators is not just a curiosity, but an important part of getting the right answer when making statistical inferences.
Setting Up a Simulation
Before we compare estimators, we need some ground truth. Let’s simulate a realistic data-generating process where we know the true individual-level means, and we can then evaluate how well different estimators recover them.
We will simulate 30 individuals, each with a true mean \(\mu_i\) drawn from a population distribution. For each individual, we then generate 10 observations from a normal distribution centered at their true mean. The data-generating process is:
\[\mu_i \sim \mathcal{N}(\mu_{\text{pop}}, \sigma_{\text{pop}})\]
\[y_{i,t} \sim \mathcal{N}(\mu_i, \sigma_{\text{obs}})\]
where \(\mu_{\text{pop}} = 5\), \(\sigma_{\text{pop}} = 2\), and \(\sigma_{\text{obs}} = 6\):
library(dplyr)
library(tidyr)
library(foreach)
library(ggplot2)
set.seed(43202)
# Simulation parameters
N_subj <- 30
n_obs <- 10
mu_pop <- 5
sigma_pop <- 2
sigma_obs <- 6
# Generate true individual means from population distribution
mu_true <- rnorm(N_subj, mu_pop, sigma_pop)
# Generate observations for each individual
obs_data <- foreach(i = 1:N_subj, .combine = "rbind") %do% {
  data.frame(
    subj = i,
    mu_true = mu_true[i],
    y = rnorm(n_obs, mu_true[i], sigma_obs)
  )
}
To get a feel for the data, let’s look at the observations for 6 randomly selected individuals. The vertical black line shows each individual’s true mean \(\mu_i\), and the dusty rose histogram shows their observed data:
# Select 6 subjects to display
display_subjs <- sort(sample(1:N_subj, 6))
obs_data %>%
  filter(subj %in% display_subjs) %>%
  mutate(subj_label = paste0("Subject ", subj)) %>%
  ggplot(aes(x = y)) +
  geom_histogram(fill = "#DCBCBC", color = "white", bins = 12) +
  geom_vline(aes(xintercept = mu_true), linetype = 1, color = "black", linewidth = 1) +
  facet_wrap(~subj_label, nrow = 2, scales = "free_y") +
  xlab("Observed Value (y)") +
  ylab("Count") +
  ggtitle("Observations for 6 Individuals") +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
As you can see, with only 10 observations per individual and a large observation-level standard deviation (\(\sigma_{\text{obs}} = 6\)), the observations are quite spread out relative to the true means. This is the fundamental challenge: we are trying to estimate the “true” underlying person-level means from noisy observational data.
The Sample Mean
The obvious approach is to compute the sample mean for each individual:
\[\hat{\mu}_i^{\text{sample}} = \bar{y}_i = \frac{1}{n}\sum_{t=1}^{n}y_{i,t}\]
# Compute sample means and standard errors
subj_summary <- obs_data %>%
  group_by(subj) %>%
  summarize(mu_true = unique(mu_true),
            x_bar = mean(y),
            var_y = var(y),
            se_sample = sd(y) / sqrt(n()),
            .groups = "drop")
# RMSE
rmse_sample_mean <- sqrt(mean((subj_summary$x_bar - subj_summary$mu_true)^2))
We can evaluate how well the sample mean performs by plotting the true values against the estimated values. If the sample mean were a perfect estimator, all points would fall on the identity line (dashed gray):
subj_summary %>%
  ggplot(aes(x = mu_true, y = x_bar)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "gray") +
  geom_linerange(aes(ymin = x_bar - 1.96 * se_sample,
                     ymax = x_bar + 1.96 * se_sample),
                 color = "#DCBCBC", alpha = 0.5) +
  geom_point(color = "#DCBCBC", size = 3) +
  xlab(expression("True" ~ mu[i])) +
  ylab(expression("Estimated" ~ hat(mu)[i])) +
  ggtitle(paste0("Sample Mean (RMSE = ", round(rmse_sample_mean, 2), ")")) +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
The sample means scatter considerably around the identity line. Individuals with extreme sample means tend to have true values that are less extreme (regression to the mean), and individuals with truly average values can end up with sample means that are too extreme in either direction. We can use the root mean squared error (RMSE) to quantify this:
\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{\mu}_i - \mu_i)^2}\]
It will become clearer as we compare against other estimators, but we will note here that the sample mean is unbiased with high variance. This means that if we repeated the experiment many times, the sample means would be centered on the true values on average, but any particular set of sample means will be scattered widely around the truth. In other words, the sample mean gets it right “on average” but can be quite far off for any given individual in any given dataset—which is, of course, the only dataset we actually have. As we are about to see, we can do considerably better by accepting a small amount of bias in exchange for a large reduction in variance (this phenomenon is known as the bias-variance tradeoff throughout statistics and machine learning).
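To make the tradeoff concrete, here is a minimal sketch (separate from the simulation above, but using the same true parameter values) that repeats the experiment many times and compares the sample mean to a hypothetical “oracle” shrinkage estimator that is handed the true variance ratio:

```r
# Bias-variance tradeoff sketch: the oracle shrinkage estimator is biased for
# any individual, but its total error is smaller in every replication on average.
set.seed(1)
N_subj <- 30; n_obs <- 10
mu_pop <- 5; sigma_pop <- 2; sigma_obs <- 6
B_oracle <- sigma_pop^2 / (sigma_pop^2 + sigma_obs^2 / n_obs)  # signal / (signal + noise)

rmse <- replicate(2000, {
  mu_true <- rnorm(N_subj, mu_pop, sigma_pop)
  y_bar <- rnorm(N_subj, mu_true, sigma_obs / sqrt(n_obs))  # sampling dist. of each mean
  shrunk <- mu_pop + B_oracle * (y_bar - mu_pop)
  c(sample = sqrt(mean((y_bar - mu_true)^2)),
    shrunk = sqrt(mean((shrunk - mu_true)^2)))
})
rowMeans(rmse)  # shrinkage wins on average despite its bias
```

Real estimators must estimate the variance ratio from the data, but as we will see, they inherit most of this advantage.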
James-Stein Estimators
As alluded to above, in 1956, Charles Stein proved that when estimating the means of three or more normal distributions simultaneously (since extended beyond the normal), the sample mean is inadmissible. That is, there exists another estimator that has strictly lower total mean squared error for every possible set of true means. Five years later, James and Stein (1961) provided a constructive proof by giving the actual estimator.
The James-Stein (JS) estimator shrinks each sample mean toward the grand mean (or any fixed target) in proportion to the total noise in the estimates:
\[\hat{\mu}_i^{\text{JS}} = \bar{\bar{y}} + \left(1 - B_{\text{JS}}\right)(\bar{y}_i - \bar{\bar{y}})\]
where \(\bar{\bar{y}}\) is the grand mean across all individuals and the shrinkage factor is:
\[B_{\text{JS}} = \frac{(N-3)\hat{\sigma}^2_{\text{obs}}/n}{\sum_{i=1}^{N}(\bar{y}_i - \bar{\bar{y}})^2}\]
Intuitively, individuals whose sample means are far from the grand mean are pulled back toward it. The amount of shrinkage depends on the ratio of within-person variability to total variability: when the within-person noise is large relative to the total variability across individuals, there is a lot of shrinkage.
I find the behavior of the JS estimator more intuitive after a bit of rearranging in the shrinkage term. Specifically, note that the denominator \(\sum(\bar{y}_i - \bar{\bar{y}})^2\), once divided by \(N-1\), estimates the total variability of the sample means, which reflects both true between-person variance and within-person noise: \(\frac{1}{N-1}\sum(\bar{y}_i - \bar{\bar{y}})^2 \approx \sigma^2_{\text{pop}} + \sigma^2_{\text{obs}}/n\). Substituting into \(B_{\text{JS}}\) and noting that \((N-3)/(N-1) \rightarrow 1\) as \(N\) increases, the shrinkage factor \((1 - B_{\text{JS}})\) simplifies to approximately:
\[1 - B_{\text{JS}} \approx \frac{\sigma^2_{\text{pop}}}{\sigma^2_{\text{pop}} + \sigma^2_{\text{obs}}/n}\]
That is, the JS estimator shrinks in proportion to the ratio of between-person variance (signal) to total variance (signal plus noise). This “signal over signal-plus-noise” form will reappear in every method we cover from here on.
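We can sanity-check the approximation numerically with the true simulation values (a sketch using expected quantities, not estimates from our data):

```r
# E[SS] = (N - 1) * (signal + noise), so the expected JS shrinkage factor is
# 1 - (N - 3) * noise / E[SS]; the approximation drops the (N - 3)/(N - 1) term.
sigma_pop <- 2; sigma_obs <- 6; n <- 10; N <- 30
signal <- sigma_pop^2     # between-person variance
noise <- sigma_obs^2 / n  # sampling variance of each sample mean
1 - (N - 3) * noise / ((N - 1) * (signal + noise))  # expected exact factor: ~0.56
signal / (signal + noise)                           # approximation: ~0.53
```

The two values are close even at \(N = 30\), and the gap vanishes as \(N\) grows.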
The idea that you can improve your estimate of one individual’s mean by looking at other individuals’ data is genuinely strange. Stranger yet—Stein’s proof does not even require that the units being pooled together are related (e.g., the means could be from different studies, different levels of analysis, etc.). In fact, the example in Efron & Morris (1977) describes pooling together batting averages and foreign automotive import rates. Regardless of what our means represent, it works because shrinkage reduces variance by more than the bias it introduces.
We will use the “positive-part” JS estimator, which ensures that the shrinkage factor does not exceed 1 (which would flip the direction of estimates):
# James-Stein estimator
grand_mean <- mean(subj_summary$x_bar)
sigma2_pool <- mean(subj_summary$var_y)
se2_mean <- sigma2_pool / n_obs
SS <- sum((subj_summary$x_bar - grand_mean)^2)
B_js <- ((N_subj - 3) * se2_mean) / SS
B_js <- min(B_js, 1) # Positive-part JS: don't over-shrink
subj_summary <- subj_summary %>%
  mutate(js_est = grand_mean + (1 - B_js) * (x_bar - grand_mean),
         se_js = (1 - B_js) * se_sample)
rmse_js <- sqrt(mean((subj_summary$js_est - subj_summary$mu_true)^2))
Now let’s see how the JS estimator compares to the sample mean. The lines show the connections between the sample means and shrunken JS estimates for each individual:
subj_summary %>%
  ggplot(aes(x = mu_true, y = js_est)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "gray") +
  geom_linerange(aes(ymin = js_est - 1.96 * se_js,
                     ymax = js_est + 1.96 * se_js),
                 color = "#8F2727") +
  geom_segment(aes(xend = mu_true, yend = x_bar),
               color = "#DCBCBC") +
  geom_point(aes(x = mu_true, y = x_bar),
             color = "#DCBCBC", size = 2) +
  geom_point(color = "#8F2727", size = 3) +
  xlab(expression("True" ~ mu[i])) +
  ylab(expression("Estimated" ~ hat(mu)[i])) +
  ggtitle(paste0("James-Stein (RMSE = ", round(rmse_js, 2), ")")) +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
The JS estimates (dark red points) are noticeably closer to the identity line than the sample means (dusty rose points). The RMSE has decreased—we are doing better, and we are doing so for free, without collecting any additional data or making any assumptions regarding prior distributions. All we did was borrow information across individuals.
One thing worth noting is that the JS estimator does not come with a natural standard error. The error bars shown above use a conditional approximation: since \(\hat{\mu}_i^{JS}\) is a linear function of \(\bar{y}_i\) (treating \(B_{JS}\) as fixed), we can write \(SE_{JS} = (1 - B_{JS}) \cdot SE_{\text{sample}}\). This captures the intuition that shrinkage compresses the uncertainty intervals just as it compresses the point estimates. However, this approximation ignores the uncertainty in \(B_{JS}\) itself (which is estimated from the data) and does not account for the bias that shrinkage introduces for any given individual. So these intervals should be taken as rough guides rather than calibrated confidence intervals—a limitation we will see resolved by the approaches that follow.
Classical Test Theory and True Score Estimation
OK, so James and Stein surprised the statistics world in 1961. What is perhaps less well known is that psychometricians had essentially the same insight forty years earlier.
In the 1920s, Truman Kelley showed that the best estimate of an individual’s “true score” is not their observed score, but a reliability-weighted combination of the observed score and the group mean. Specifically, Kelley’s regression-based true score estimate is:
\[\hat{\mu}_i^{\text{CTT}} = (1 - \hat{\rho})\bar{X} + \hat{\rho}\bar{y}_i\]
which can be rearranged as:
\[\hat{\mu}_i^{\text{CTT}} = \bar{X} + \hat{\rho}(\bar{y}_i - \bar{X})\]
where \(\hat{\rho}\) is the estimated reliability. Look at that second form carefully—we start at the grand mean, then move toward the individual’s sample mean, but only by a fraction \(\hat{\rho}\). This is a shrinkage estimator! The reliability score plays exactly the same role as \((1 - B_{\text{JS}})\) in the James-Stein estimator.
In our simulation context, the reliability is:
\[\hat{\rho} = 1 - \frac{\hat{\sigma}^2_{\text{obs}}/n}{\text{Var}(\bar{y})}\]
Since \(\text{Var}(\bar{y}) = \hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n\), this simplifies to:
\[\hat{\rho} = \frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n}\]
That is, reliability is the ratio of between-person variance to total variance—signal over signal-plus-noise. This is the same form of shrinkage factor that will reappear in every method we cover from here on (if you want a deeper dive into reliability and true score estimation, I cover it in detail in a previous post).
# CTT true score estimation
var_x_bar <- var(subj_summary$x_bar)
rho_hat <- 1 - (se2_mean / var_x_bar)
subj_summary <- subj_summary %>%
  mutate(ctt_est = grand_mean + rho_hat * (x_bar - grand_mean),
         se_ctt = sd(x_bar) * sqrt(1 - rho_hat) * sqrt(rho_hat))
rmse_ctt <- sqrt(mean((subj_summary$ctt_est - subj_summary$mu_true)^2))
In our simulated data, the estimated reliability is \(\hat{\rho}\) = 0.58, meaning that only about 58% of the variance in the sample means reflects true between-person differences—the rest is noise. Because the within-person uncertainty is high relative to the between-person variability, each person’s estimate is pulled 42% of the way back toward the grand mean.
subj_summary %>%
  ggplot(aes(x = mu_true, y = ctt_est)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "gray") +
  geom_linerange(aes(ymin = ctt_est - 1.96 * se_ctt,
                     ymax = ctt_est + 1.96 * se_ctt),
                 color = "#8F2727") +
  geom_segment(aes(xend = mu_true, yend = x_bar),
               color = "#DCBCBC") +
  geom_point(aes(x = mu_true, y = x_bar),
             color = "#DCBCBC", size = 2) +
  geom_point(color = "#8F2727", size = 3) +
  xlab(expression("True" ~ mu[i])) +
  ylab(expression("Estimated" ~ hat(mu)[i])) +
  ggtitle(paste0("CTT True Score (RMSE = ", round(rmse_ctt, 2), ")")) +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
Unlike the JS estimator, the CTT framework provides a principled standard error for the true score estimates. As Kelley showed, the SE of the estimated true score is \(\hat{\sigma}_{\hat{\theta}} = \sigma_X\sqrt{1-\hat{\rho}}\sqrt{\hat{\rho}}\), which is smaller than the SE of the observed score \(\sigma_X\sqrt{1-\hat{\rho}}\) by a factor of \(\sqrt{\hat{\rho}}\) (for a derivation, see Brennan, 2012, and my previous post for a more thorough treatment). Notice in the plot above that the 95% intervals for the true score estimates are indeed narrower than those of the sample means, yet they still exhibit good coverage of the underlying true values.
I said it before and I will say it again—I think this is pretty cool for a technique developed in the 1920s :D. The CTT estimates look nearly identical to the James-Stein estimates—and they should, because they are doing essentially the same thing. Both shrink individual estimates toward the group mean in proportion to the signal-to-noise ratio. The main difference is the exact shrinkage coefficient: JS applies an \((N-3)\) correction in the numerator, whereas CTT’s reliability implicitly uses \((N-1)\) through the sample variance. In practice, these give very similar results, and with a large enough number of means they converge to the same solution.
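To see the convergence, here is a small sketch comparing the two coefficients at their expected values (plugging in the true variance components from the simulation, so this is illustrative rather than a re-analysis of our data):

```r
# JS:  1 - (N - 3) * se2 / SS
# CTT: rho = 1 - se2 / var(y_bar) = 1 - (N - 1) * se2 / SS
# With E[SS] = (N - 1) * (signal + noise), the gap shrinks at rate 1/N.
signal <- 4; noise <- 3.6  # true sigma_pop^2 and sigma_obs^2 / n
for (N in c(10, 30, 100, 1000)) {
  ratio <- noise / ((N - 1) * (signal + noise))  # expected se2 / SS
  js <- 1 - (N - 3) * ratio
  ctt <- 1 - (N - 1) * ratio
  cat(sprintf("N = %4d: JS = %.3f, CTT = %.3f, gap = %.4f\n", N, js, ctt, js - ctt))
}
```

The gap between the two coefficients is exactly \(2 \cdot \text{se}^2/\text{SS}\), which goes to zero as \(N\) grows.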
Empirical Bayes
So we have seen that both James-Stein estimators and classical true score estimation arrive at the same fundamental insight through different paths. And we saw that psychometricians had the core insight before the mathematical statistics community. But like many discoveries, the insight of shrinkage goes back even further than the 1920s! Efron & Morris (1977) describe how (our lord and savior) Reverend Thomas Bayes had the key insight as early as 1763:
The formula for the James-Stein estimator is strikingly similar to that of Bayes’s equation. Indeed, as the number of means being averaged grows very large, the two equations become identical… The James-Stein procedure, however, has one important advantage over Bayes’s method. The James-Stein estimator can be employed without knowledge of the prior distribution; indeed, one need not even suppose the means being estimated are normally distributed.
Although they describe limitations with Bayes’s method relative to the JS estimator (in particular, the need for a prior distribution), in practice we can skirt around this limitation by simply deriving a prior empirically from the data. Specifically, Empirical Bayes (EB; coined by Robbins (1956)) estimates the population-level distribution from the data and then uses it to form posterior estimates for each individual. In the normal-normal case (which our simulation satisfies), the EB posterior mean is a precision-weighted average of the individual’s sample mean and the estimated population mean, where \(n_i\) is the number of observations for individual \(i\):
\[\hat{\mu}_i^{\text{EB}} = \frac{\frac{n_i}{\hat{\sigma}^2_{\text{obs}}}\bar{y}_i + \frac{1}{\hat{\sigma}^2_{\text{pop}}}\hat{\mu}_{\text{pop}}}{\frac{n_i}{\hat{\sigma}^2_{\text{obs}}} + \frac{1}{\hat{\sigma}^2_{\text{pop}}}}\]
This can be simplified to a shrinkage form:
\[\hat{\mu}_i^{\text{EB}} = \hat{\mu}_{\text{pop}} + \hat{B}_{\text{EB}}(\bar{y}_i - \hat{\mu}_{\text{pop}})\]
where the shrinkage factor is:
\[\hat{B}_{\text{EB}} = \frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n_i}\]
The key advantage of EB over CTT and JS is its generality. Notice the \(n_i\) subscript in the formulas above—each individual’s shrinkage factor depends on their own sample size, so EB naturally accommodates unbalanced designs—something that JS and CTT as formulated above do not handle as easily. In our current simulation all individuals share the same sample size, so this advantage does not manifest. However, it becomes important in practice when different individuals contribute different amounts of data.
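Although our simulation is balanced, a quick hypothetical sketch shows how the EB shrinkage factor adapts when sample sizes differ (plugging in the true variance components from our simulation; the sample sizes are made up for illustration):

```r
# Hypothetical unbalanced design: EB shrinks low-n individuals harder.
sigma2_pop <- 4; sigma2_obs <- 36  # true variances from the simulation
n_i <- c(2, 10, 50)                # hypothetical per-individual sample sizes
B_eb <- sigma2_pop / (sigma2_pop + sigma2_obs / n_i)
round(B_eb, 2)  # 0.18 0.53 0.85 -- more data, less shrinkage
```

An individual with only 2 observations keeps just 18% of their deviation from the population mean, while one with 50 observations keeps 85%.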
# Empirical Bayes (method of moments for population parameters)
mu_pop_hat <- grand_mean
sigma2_pop_hat <- max(var_x_bar - se2_mean, 0.001)
subj_summary <- subj_summary %>%
  mutate(
    B_eb = sigma2_pop_hat / (sigma2_pop_hat + sigma2_pool / n_obs),
    eb_est = mu_pop_hat + B_eb * (x_bar - mu_pop_hat),
    se_eb = sqrt(1 / (n_obs / sigma2_pool + 1 / sigma2_pop_hat))
  )
rmse_eb <- sqrt(mean((subj_summary$eb_est - subj_summary$mu_true)^2))
subj_summary %>%
  ggplot(aes(x = mu_true, y = eb_est)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "gray") +
  geom_linerange(aes(ymin = eb_est - 1.96 * se_eb,
                     ymax = eb_est + 1.96 * se_eb),
                 color = "#8F2727") +
  geom_segment(aes(xend = mu_true, yend = x_bar),
               color = "#DCBCBC") +
  geom_point(aes(x = mu_true, y = x_bar),
             color = "#DCBCBC", size = 2) +
  geom_point(color = "#8F2727", size = 3) +
  xlab(expression("True" ~ mu[i])) +
  ylab(expression("Estimated" ~ hat(mu)[i])) +
  ggtitle(paste0("Empirical Bayes (RMSE = ", round(rmse_eb, 2), ")")) +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
Unlike the JS estimator, the EB framework naturally provides uncertainty estimates: the posterior standard deviation is \(\sqrt{1/(n_i/\hat{\sigma}^2_{\text{obs}} + 1/\hat{\sigma}^2_{\text{pop}})}\), which follows directly from the normal-normal conjugate update. The resulting intervals in the plot above are properly calibrated (conditional on the estimated population parameters).
At this point, you may be noticing a pattern :D. James-Stein, classical true scores, and empirical Bayes are all giving us essentially the same answer. This is not a coincidence—they are all manifestations of the same fundamental principle: when you have multiple related quantities to estimate, you should pool information across the units in the group.
Ridge Regression
So, shrinkage estimators have been re-discovered in different fields for the past couple hundred years. And yet, there is no sign of this trend stopping—modern machine learning literature is keeping the trend alive, and my favorite example is ridge regression.
Setting the stage—the sample mean for individual \(i\) is the value of \(\mu_i\) that minimizes the sum of squared residuals (equivalently, maximizes the normal likelihood):
\[\underset{\mu_1,...,\mu_N}{argmin}\sum_{i=1}^{N}\sum_{t=1}^{n_i}(y_{i,t} - \mu_i)^2\]
Ridge regression modifies this loss function by adding a penalty term for how far each individual’s mean deviates from zero:
\[\underbrace{\underset{\mu_1,...,\mu_N}{argmin}\sum_{i=1}^{N}\sum_{t=1}^{n_i}(y_{i,t} - \mu_i)^2}_{\text{Traditional Loss Function}} + \underbrace{\lambda\sum_{i=1}^{N}\mu_i^2}_{\text{Ridge Penalty}}\]
Here, \(\lambda~(\lambda \geq 0)\) is a penalty parameter that controls how strongly each individual’s estimate is pulled toward the shrinkage target. When \(\lambda = 0\), we recover the sample means exactly. As \(\lambda \rightarrow \infty\), the penalty dominates and all estimates are shrunk toward zero. For our balanced design with \(n\) observations per subject, the solution simplifies to:
\[\hat{\mu}_i^{\text{ridge}} = \frac{n}{n + \lambda}\bar{y}_i\]
which shrinks each sample mean toward zero by the factor \(n/(n + \lambda)\). If we include an intercept (as we do in practice), this becomes shrinkage toward the grand mean—the same operation as JS, CTT, and EB, with \(\lambda\) controlling the degree of shrinkage.
What makes this connection especially interesting is that ridge regression with an L2 penalty is equivalent to placing a normal prior on the coefficients in a Bayesian regression model (I cover this equivalence in more detail in a previous post). Under this equivalence, \(\lambda = \sigma^2_{\text{obs}} / \sigma^2_{\text{pop}}\), and the ridge shrinkage factor becomes:
\[\frac{n_i}{n_i + \lambda} = \frac{n_i}{n_i + \sigma^2_{\text{obs}} / \sigma^2_{\text{pop}}} = \frac{\sigma^2_{\text{pop}}}{\sigma^2_{\text{pop}} + \sigma^2_{\text{obs}}/n_i}\]
The same signal-over-signal-plus-noise ratio we saw for JS, CTT and empirical Bayes. The penalty parameter \(\lambda\) is just another way of specifying the variance ratio.
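We can verify the algebra directly with the true simulation values (a sketch of the identity, not the cross-validated \(\lambda\) that glmnet will pick below):

```r
# With lambda set to the true variance ratio, the closed-form ridge factor
# n/(n + lambda) matches the EB signal-over-signal-plus-noise factor exactly.
sigma2_pop <- 4; sigma2_obs <- 36; n <- 10  # true simulation values
lambda <- sigma2_obs / sigma2_pop           # = 9
n / (n + lambda)                            # ridge shrinkage factor
sigma2_pop / (sigma2_pop + sigma2_obs / n)  # EB shrinkage factor: identical
```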
If you don’t believe me, we can use glmnet to fit the ridge regression, letting cross-validation choose the penalty \(\lambda\) for us:
library(glmnet)
# Create design matrix of subject indicators (no intercept)
X_mat <- model.matrix(~ 0 + factor(subj), data = obs_data)
# Fit ridge regression with cross-validation
fit_ridge <- cv.glmnet(X_mat, obs_data$y, alpha = 0, intercept = TRUE,
                       standardize = FALSE)
# Extract coefficients at lambda.min
ridge_coefs <- coef(fit_ridge, s = "lambda.min")
ridge_intercept <- ridge_coefs[1]
ridge_subj <- as.numeric(ridge_coefs[-1])
# Individual estimates = intercept + subject coefficient
subj_summary <- subj_summary %>%
  mutate(ridge_est = ridge_intercept + ridge_subj)
rmse_ridge <- sqrt(mean((subj_summary$ridge_est - subj_summary$mu_true)^2))
subj_summary %>%
  ggplot(aes(x = mu_true, y = ridge_est)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "gray") +
  geom_segment(aes(xend = mu_true, yend = x_bar),
               color = "#DCBCBC") +
  geom_point(aes(x = mu_true, y = x_bar),
             color = "#DCBCBC", size = 2) +
  geom_point(color = "#8F2727", size = 3) +
  xlab(expression("True" ~ mu[i])) +
  ylab(expression("Estimated" ~ hat(mu)[i])) +
  ggtitle(paste0("Ridge Regression (RMSE = ", round(rmse_ridge, 2), ")")) +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
Once again, the ridge estimates are shrunk toward the grand mean, and the RMSE is lower than the sample mean. Note that glmnet selects \(\lambda\) via cross-validation, which differs from previous approaches that use signal-to-noise ratios empirically derived from the data. However, like the JS estimator, ridge regression does not come with a natural standard error for the shrunken estimates (you could bootstrap, but I will leave that as an exercise for the reader).
Hierarchical Models
All of the approaches so far share a common feature: they estimate the population parameters as point estimates and then plug those in to compute the individual-level estimates. A hierarchical model goes one step further by jointly estimating everything simultaneously. In the frequentist case, this gives us Best Linear Unbiased Predictors (BLUPs); in the Bayesian case, it gives us full posterior distributions.
The hierarchical model for our simulation is:
\[y_{i,t} \sim \mathcal{N}(\mu_i, \sigma_{\text{obs}})\] \[\mu_i \sim \mathcal{N}(\mu_{\text{pop}}, \sigma_{\text{pop}})\]
In the Bayesian version, we additionally specify priors:
\[\mu_{\text{pop}} \sim \mathcal{N}(0, 10)\] \[\sigma_{\text{pop}} \sim \text{half-}\mathcal{N}(0, 5)\] \[\sigma_{\text{obs}} \sim \text{half-}\mathcal{N}(0, 10)\]
In either case, hierarchical models impose a (by-now-familiar) shrinkage form on the individual (i.e. random) effects (see Gelman & Pardoe, 2006):
\[\hat{\mu}_i^{\text{hier}} = \hat{\mu}_{\text{pop}} + \hat{B}_i(\bar{y}_i - \hat{\mu}_{\text{pop}}), \quad \hat{B}_i = \frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n_i}\]
The difference from the previous methods is that \(\hat{\sigma}^2_{\text{pop}}\), \(\hat{\sigma}^2_{\text{obs}}\), and \(\hat{\mu}_{\text{pop}}\) are all estimated jointly with the individual-level parameters, rather than plugged in as fixed quantities. In practice this means that the uncertainty in variance parameters is propagated through the shrinkage term, which in theory should lead to improved performance—let’s take a look!
Frequentist: lme4
The simplest way to fit a hierarchical model in R is with lme4::lmer:
library(lme4)
# Fit hierarchical model (random intercept per subject)
fit_lmer <- lmer(y ~ 1 + (1 | subj), data = obs_data)
# Extract BLUPs (conditional modes + fixed effect)
blups <- coef(fit_lmer)$subj[, "(Intercept)"]
# Extract conditional variances for random effects
re_vars <- as.numeric(attr(ranef(fit_lmer, condVar = TRUE)$subj, "postVar"))
se_lmer <- sqrt(re_vars)
subj_summary <- subj_summary %>%
  mutate(hier_freq_est = blups,
         se_hier_freq = se_lmer)
rmse_hier_freq <- sqrt(mean((subj_summary$hier_freq_est - subj_summary$mu_true)^2))
Bayesian: brms
For a fully Bayesian analysis, we can use brms, which provides the same formula interface as lme4 but estimates full posterior distributions:
library(brms)
# Fit Bayesian hierarchical model
fit_brms <- brm(
  y ~ 1 + (1 | subj),
  data = obs_data,
  family = gaussian(),
  prior = c(
    prior(normal(0, 10), class = "Intercept"),
    prior(normal(0, 5), class = "sd"),
    prior(normal(0, 10), class = "sigma")
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 43202
)
# Extract posterior means for each subject
ranefs <- ranef(fit_brms)$subj[, , "Intercept"]
fixef_mu <- fixef(fit_brms)["Intercept", "Estimate"]
subj_summary <- subj_summary %>%
  mutate(hier_bayes_est = fixef_mu + ranefs[, "Estimate"],
         se_hier_bayes = ranefs[, "Est.Error"])
rmse_hier_bayes <- sqrt(mean((subj_summary$hier_bayes_est - subj_summary$mu_true)^2))
Let’s look at both hierarchical estimates:
# Combine estimates and SEs into long format
hier_long <- bind_rows(
  subj_summary %>%
    transmute(subj, mu_true, x_bar, estimate = hier_freq_est, se = se_hier_freq,
              method = paste0("lme4 (RMSE = ", round(rmse_hier_freq, 2), ")")),
  subj_summary %>%
    transmute(subj, mu_true, x_bar, estimate = hier_bayes_est, se = se_hier_bayes,
              method = paste0("brms (RMSE = ", round(rmse_hier_bayes, 2), ")"))
) %>%
  mutate(method = factor(method, levels = unique(method)))
hier_long %>%
  ggplot(aes(x = mu_true, y = estimate)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "gray") +
  geom_linerange(aes(ymin = estimate - 1.96 * se,
                     ymax = estimate + 1.96 * se),
                 color = "#8F2727") +
  geom_segment(aes(xend = mu_true, yend = x_bar),
               color = "#DCBCBC") +
  geom_point(aes(x = mu_true, y = x_bar),
             color = "#DCBCBC", size = 2) +
  geom_point(color = "#8F2727", size = 3) +
  facet_wrap(~method, nrow = 1) +
  xlab(expression("True" ~ mu[i])) +
  ylab(expression("Estimated" ~ hat(mu)[i])) +
  ggtitle("Hierarchical Models") +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())
As expected, the hierarchical model estimates are essentially identical to the JS, CTT, EB, and ridge estimates. Notice that lme4 provides conditional variances for its BLUPs (via condVar = TRUE in ranef()), while brms gives us full posterior distributions from which the standard deviations follow directly. In both cases, the uncertainty intervals are narrower than those of the sample means and have principled justification—no ad-hoc approximations required. This is a major advantage over the JS estimator, where (as we saw) the standard error requires assumptions that may not hold.
Putting It All Together
The Shrinkage Factors
Before comparing the estimates themselves, let’s put all the shrinkage factors side by side. Recall that every method we have covered takes the same general form:
\[\hat{\mu}_i = \hat{\mu}_{\text{pop}} + \hat{B} \cdot (\bar{y}_i - \hat{\mu}_{\text{pop}})\]
where \(\hat{B} \in [0, 1]\) controls how much we trust the individual data relative to the group. When \(\hat{B} = 1\), we get the sample mean (no shrinkage); when \(\hat{B} = 0\), every individual gets the grand mean (complete shrinkage). Let’s see what each method gives us:
# Extract variance components from lme4 and brms
vc_lmer <- as.data.frame(VarCorr(fit_lmer))
sigma2_pop_lmer <- vc_lmer$vcov[1]
sigma2_obs_lmer <- vc_lmer$vcov[2]
B_lmer <- sigma2_pop_lmer / (sigma2_pop_lmer + sigma2_obs_lmer / n_obs)
vc_brms <- VarCorr(fit_brms)
sigma2_pop_brms <- vc_brms$subj$sd[1]^2
sigma2_obs_brms <- vc_brms$residual__$sd[1]^2
B_brms <- sigma2_pop_brms / (sigma2_pop_brms + sigma2_obs_brms / n_obs)
# Ridge: compute effective shrinkage from the fitted coefficients (due to glmnet internal scaling)
# ridge_subj[i] is the subject deviation from the intercept;
# (x_bar[i] - grand_mean) is the unshrunken deviation.
# Their ratio gives the effective shrinkage factor.
B_ridge <- mean(ridge_subj / (subj_summary$x_bar - grand_mean), na.rm = TRUE)
# Collect all shrinkage factors as named scalars
B_values <- c(
"James-Stein" = 1 - B_js,
"CTT True Score" = rho_hat,
"Empirical Bayes" = subj_summary$B_eb[1],
"Ridge" = B_ridge,
"Hier. (lme4)" = B_lmer,
"Hier. (brms)" = B_brms
)
shrinkage_df <- data.frame(
Method = names(B_values),
Formula = c(
"$1 - \\frac{(N-3)\\hat{\\sigma}^2_{\\text{obs}}/n}{\\sum(\\bar{y}_i - \\bar{\\bar{y}})^2} \\approx \\frac{\\hat{\\sigma}^2_{\\text{pop}}}{\\hat{\\sigma}^2_{\\text{pop}} + \\hat{\\sigma}^2_{\\text{obs}}/n}$",
"$1 - \\frac{\\hat{\\sigma}^2_{\\text{obs}}/n}{\\text{Var}(\\bar{y})} = \\frac{\\hat{\\sigma}^2_{\\text{pop}}}{\\hat{\\sigma}^2_{\\text{pop}} + \\hat{\\sigma}^2_{\\text{obs}}/n}$",
"$\\frac{\\hat{\\sigma}^2_{\\text{pop}}}{\\hat{\\sigma}^2_{\\text{pop}} + \\hat{\\sigma}^2_{\\text{obs}}/n_i}$",
"$\\frac{n_i}{n_i + \\lambda} = \\frac{\\hat{\\sigma}^2_{\\text{pop}}}{\\hat{\\sigma}^2_{\\text{pop}} + \\hat{\\sigma}^2_{\\text{obs}}/n_i}$",
"$\\frac{\\hat{\\sigma}^2_{\\text{pop}}}{\\hat{\\sigma}^2_{\\text{pop}} + \\hat{\\sigma}^2_{\\text{obs}}/n_i}$",
"$\\frac{\\hat{\\sigma}^2_{\\text{pop}}}{\\hat{\\sigma}^2_{\\text{pop}} + \\hat{\\sigma}^2_{\\text{obs}}/n_i}$"
),
B_hat = round(as.numeric(B_values), 3)
)
knitr::kable(
shrinkage_df,
col.names = c("Method", "Shrinkage Factor ($\\hat{B}$)", "Value"),
align = "lcc",
escape = FALSE)
| Method | Shrinkage Factor (\(\hat{B}\)) | Value |
|---|---|---|
| James-Stein | \(1 - \frac{(N-3)\hat{\sigma}^2_{\text{obs}}/n}{\sum(\bar{y}_i - \bar{\bar{y}})^2} \approx \frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n}\) | 0.609 |
| CTT True Score | \(1 - \frac{\hat{\sigma}^2_{\text{obs}}/n}{\text{Var}(\bar{y})} = \frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n}\) | 0.580 |
| Empirical Bayes | \(\frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n_i}\) | 0.580 |
| Ridge | \(\frac{n_i}{n_i + \lambda} = \frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n_i}\) | 0.585 |
| Hier. (lme4) | \(\frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n_i}\) | 0.580 |
| Hier. (brms) | \(\frac{\hat{\sigma}^2_{\text{pop}}}{\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n_i}\) | 0.584 |
Despite arriving at this result from very different starting points—decision theory, psychometrics, Bayesian updating, penalized optimization, and mixed-effects modeling—all six methods converge on nearly the same shrinkage factor. I absolutely love this table—if you don’t feel warm and fuzzy inside when seeing it, you simply do not have a soul.
Comparing Estimates
Now let’s compare all of the estimates side by side:
# Reshape to long format for faceted comparison (with SEs)
# Order: red family (baseline/classical) -> green family (frequentist) -> blue family (Bayesian)
comparison_long <- bind_rows(
subj_summary %>% transmute(subj, mu_true, estimate = x_bar,
se = se_sample, method = "Sample Mean"),
subj_summary %>% transmute(subj, mu_true, estimate = js_est,
se = se_js, method = "James-Stein"),
subj_summary %>% transmute(subj, mu_true, estimate = ctt_est,
se = se_ctt, method = "CTT True Score"),
subj_summary %>% transmute(subj, mu_true, estimate = ridge_est,
se = NA_real_, method = "Ridge"),
subj_summary %>% transmute(subj, mu_true, estimate = hier_freq_est,
se = se_hier_freq, method = "Hier. (lme4)"),
subj_summary %>% transmute(subj, mu_true, estimate = eb_est,
se = se_eb, method = "Empirical Bayes"),
subj_summary %>% transmute(subj, mu_true, estimate = hier_bayes_est,
se = se_hier_bayes, method = "Hier. (brms)")
) %>%
mutate(method = factor(method,
levels = c("Sample Mean", "James-Stein", "CTT True Score",
"Ridge", "Hier. (lme4)", "Empirical Bayes", "Hier. (brms)")))
order_df <- comparison_long %>%
filter(method == "Sample Mean") %>%
arrange(estimate) %>%
mutate(subj_order = row_number()) %>%
select(subj, subj_order)
comparison_long <- comparison_long %>%
left_join(order_df, by = "subj")
truth_df <- comparison_long %>%
distinct(subj, subj_order, mu_true)
ggplot(comparison_long,
aes(x = subj_order, y = estimate, color = method)) +
geom_line(linewidth = 1.2) +
geom_line(data = truth_df,
aes(x = subj_order, y = mu_true, color = "True Value"),
linewidth = 1,
linetype = "dotted",
inherit.aes = FALSE) +
scale_color_manual(values = c(
"Sample Mean" = "#DCBCBC",
"James-Stein" = "#8F2727",
"CTT True Score" = "#B2182B",
"Ridge" = "#4daf4a",
"Hier. (lme4)" = "#78c679",
"Empirical Bayes" = "#2166AC",
"Hier. (brms)" = "#8ec2de",
"True Value" = "gray40"
)) +
scale_linetype_manual(values = c("True Value" = "dotted")) +
xlab("Subjects (ordered by sample mean)") +
ylab(expression(hat(mu)[i])) +
ggtitle("Shrinkage of Individual Estimates Across Methods") +
theme_minimal(base_size = 15) +
theme(panel.grid = element_blank(),
legend.position = "right")
# RMSE comparison
rmse_df <- data.frame(
Method = c("Sample Mean", "James-Stein", "CTT True Score",
"Ridge", "Hier. (lme4)", "Empirical Bayes", "Hier. (brms)"),
RMSE = c(rmse_sample_mean, rmse_js, rmse_ctt, rmse_ridge,
rmse_hier_freq, rmse_eb, rmse_hier_bayes)
) %>%
mutate(Method = factor(Method, levels = Method))
rmse_df %>%
ggplot(aes(x = Method, y = RMSE, fill = Method)) +
geom_col() +
scale_fill_manual(values = c(
"Sample Mean" = "#DCBCBC",
"James-Stein" = "#8F2727",
"CTT True Score" = "#B2182B",
"Ridge" = "#4daf4a",
"Hier. (lme4)" = "#78c679",
"Empirical Bayes" = "#2166AC",
"Hier. (brms)" = "#8ec2de"
)) +
xlab("") +
ylab("RMSE") +
ggtitle("Estimation Error Across Methods") +
theme_minimal(base_size = 15) +
theme(panel.grid = element_blank(),
axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")
Unsurprisingly, all of the shrinkage estimators substantially outperform the sample mean, and they all give very similar results to one another. This is the central message of this post—whether you call it James-Stein estimation, reliability-based true score estimation, empirical Bayes, ridge regression, or hierarchical modeling, the core insight is the same: pooling information across individuals in a group results in better individual-level estimates.
How Shrinkage Depends on Sample Size
Whenever I talk to people about how they should use hierarchical models (or other shrinkage estimators), they often ask, “But how do I know that it is the right amount of shrinkage? What if it is too much?” My answer is usually along the lines of, “Well, it is data dependent—if you can estimate the individual-level effects with near-perfect precision, the sample mean and the shrinkage estimator will give the same answer.” In fact, you can see this quite easily if we just pump up the number of observations per person:
set.seed(43202)
n_obs_vec <- c(5, 10, 25, 50, 100)
rmse_by_nobs <- foreach(n = seq_along(n_obs_vec), .combine = "rbind") %do% {
n_obs_k <- n_obs_vec[n]
# Generate data with same true means
obs_k <- foreach(i = 1:N_subj, .combine = "rbind") %do% {
data.frame(subj = i, mu_true = mu_true[i],
y = rnorm(n_obs_k, mu_true[i], sigma_obs))
}
# Sample means
summ_k <- obs_k %>%
group_by(subj) %>%
summarize(mu_true = unique(mu_true),
x_bar = mean(y), var_y = var(y), .groups = "drop")
# JS
gm_k <- mean(summ_k$x_bar)
s2p_k <- mean(summ_k$var_y)
se2_k <- s2p_k / n_obs_k
SS_k <- sum((summ_k$x_bar - gm_k)^2)
B_js_k <- min(((N_subj - 3) * se2_k) / SS_k, 1)
js_k <- gm_k + (1 - B_js_k) * (summ_k$x_bar - gm_k)
# CTT
vxb_k <- var(summ_k$x_bar)
rho_k <- 1 - (se2_k / vxb_k)
ctt_k <- gm_k + rho_k * (summ_k$x_bar - gm_k)
# EB
s2pop_k <- max(vxb_k - se2_k, 0.001)
B_eb_k <- s2pop_k / (s2pop_k + s2p_k / n_obs_k)
eb_k <- gm_k + B_eb_k * (summ_k$x_bar - gm_k)
# Ridge
X_k <- model.matrix(~ 0 + factor(subj), data = obs_k)
fit_r_k <- cv.glmnet(X_k, obs_k$y, alpha = 0, intercept = TRUE,
standardize = FALSE)
r_coefs <- coef(fit_r_k, s = "lambda.min")
ridge_k <- as.numeric(r_coefs[1] + r_coefs[-1])
# Hierarchical (lme4)
fit_k <- lmer(y ~ 1 + (1 | subj), data = obs_k)
hier_k <- coef(fit_k)$subj[, "(Intercept)"]
# Hierarchical (brms)
fit_brms_k <- brm(
y ~ 1 + (1 | subj), data = obs_k, family = gaussian(),
prior = c(
prior(normal(0, 10), class = "Intercept"),
prior(normal(0, 5), class = "sd"),
prior(normal(0, 10), class = "sigma")
),
chains = 4, iter = 2000, warmup = 1000, seed = 43202,
silent = 2, refresh = 0
)
ranefs_k <- ranef(fit_brms_k)$subj[, , "Intercept"]
fixef_k <- fixef(fit_brms_k)["Intercept", "Estimate"]
hier_bayes_k <- fixef_k + ranefs_k[, "Estimate"]
data.frame(
n_obs = n_obs_k,
Method = c("Sample Mean", "James-Stein", "CTT True Score",
"Empirical Bayes", "Ridge", "Hier. (lme4)", "Hier. (brms)"),
RMSE = c(sqrt(mean((summ_k$x_bar - summ_k$mu_true)^2)),
sqrt(mean((js_k - summ_k$mu_true)^2)),
sqrt(mean((ctt_k - summ_k$mu_true)^2)),
sqrt(mean((eb_k - summ_k$mu_true)^2)),
sqrt(mean((ridge_k - summ_k$mu_true)^2)),
sqrt(mean((hier_k - summ_k$mu_true)^2)),
sqrt(mean((hier_bayes_k - summ_k$mu_true)^2)))
)
}
rmse_by_nobs %>%
mutate(Method = factor(Method,
levels = c("Sample Mean", "James-Stein", "CTT True Score",
"Ridge", "Hier. (lme4)", "Empirical Bayes", "Hier. (brms)"))) %>%
ggplot(aes(x = n_obs, y = RMSE, color = Method)) +
geom_line(linewidth = 1) +
geom_point(size = 2.5) +
scale_color_manual(values = c("Sample Mean" = "#DCBCBC",
"James-Stein" = "#8F2727",
"CTT True Score" = "#B2182B",
"Ridge" = "#4daf4a",
"Hier. (lme4)" = "#78c679",
"Empirical Bayes" = "#2166AC",
"Hier. (brms)" = "#8ec2de")) +
xlab("Observations per Individual (n)") +
ylab("RMSE") +
ggtitle("Shrinkage Advantage Decreases with More Data") +
theme_minimal(base_size = 15) +
theme(panel.grid = element_blank(),
legend.position = "right")
As you can see, the advantage of shrinkage estimators is largest when data are sparse (small \(n\)). Even in big data settings, we find ourselves with sparse data as soon as we dive deep enough into subgroups of interest. Sparse data are especially common in behavioral research, where we often have relatively few trials or items per person per measure. As \(n\) grows, all methods converge toward the sample mean, because the sample mean becomes precise enough (i.e. \(\sigma^2_\text{obs} / n \to 0\)) that there is little room for improvement through shrinkage.
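You can also see this convergence directly from the shared shrinkage formula. Here is a quick sketch, with made-up variance components, of how \(\hat{B} = \hat{\sigma}^2_{\text{pop}} / (\hat{\sigma}^2_{\text{pop}} + \hat{\sigma}^2_{\text{obs}}/n)\) approaches 1 as \(n\) grows:

```r
# Sketch with hypothetical variance components: the shrinkage factor
# B = sigma2_pop / (sigma2_pop + sigma2_obs / n) approaches 1 as n grows,
# at which point shrinkage estimators reduce to the sample mean.
sigma2_pop <- 4    # between-person variance (made up)
sigma2_obs <- 25   # within-person observation variance (made up)
n <- c(5, 10, 25, 50, 100, 1000)
B <- sigma2_pop / (sigma2_pop + sigma2_obs / n)
round(B, 3)  # 0.444 0.615 0.800 0.889 0.941 0.994
```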
What This All Means for Science
So, dear reader, why should you care? It is simple, really—if you routinely estimate individual-level parameters using sample means (or analogous summary statistics) and then enter these estimates into secondary analyses (e.g., correlations, regressions, group comparisons), your analysis is at best underpowered, and at worst your statistical inference is biased. This two-stage approach implicitly assumes that the individual-level estimates contain no measurement error, which, as we have seen, is a very strong and typically false assumption. The consequences are real and show up everywhere:
The Reliability Paradox. Behavioral tasks that produce robust group-level effects (Stroop, IAT, Go/No-go, etc.) consistently show poor test-retest reliability when individual-level performance is estimated with summary statistics. As I have discussed in a previous post, this is not a paradox at all—it is a direct consequence of using noisy estimators. A well specified hierarchical model can leverage shrinkage/pooling to substantially improve the reliability of individual-level estimates from these tasks.
Attenuated Correlations. When we correlate two noisy summary statistic-based measures, the resulting correlation is systematically biased toward zero. This is the attenuation problem, which classical test theory addressed through disattenuation corrections (again, see my post on measurement error). Hierarchical models, or estimators derived from classical test theory, are necessary for unbiased inference on “true score” correlations.
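As a sketch of the classical disattenuation correction (the numbers below are hypothetical), the observed correlation equals the true-score correlation scaled by the square root of the product of the two reliabilities, so dividing by that factor recovers an estimate of the true-score correlation:

```r
# Classical disattenuation correction; all values are hypothetical.
r_obs <- 0.30  # observed correlation between two noisy measures
rel_x <- 0.50  # reliability of measure X
rel_y <- 0.60  # reliability of measure Y
r_true <- r_obs / sqrt(rel_x * rel_y)
round(r_true, 3)  # 0.548: the corrected (disattenuated) estimate
```

With reliabilities this low, the raw correlation understates the true-score correlation by nearly half.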
Regression to the Mean Artifacts. When researchers select individuals based on extreme sample means—which are noisy—and then examine their subsequent behavior, regression to the mean cannot be differentiated from the true psychological effects of interest. For example, there is an ongoing debate about whether the Dunning-Kruger effect is purely a measurement artifact resulting from regression to the mean. Hierarchical models can account for such statistical artifacts and support better inference on the true underlying psychological process parameters of interest; when you do so, the classic effect is still present (see my post linked above).
Mis-ranking individuals. Suppose you need to rank patients by health risk to decide who receives a costly intervention, or rank geographic regions by disease prevalence to target public health resources. If the data are imbalanced—some individuals (or regions) have many observations and others have few—summary statistics will produce misleading rankings. Individuals with fewer observations are more likely to land at the extremes (both top and bottom) simply due to noise, while those with more data will regress toward the middle. The result is misaligned resource allocation. Shrinkage estimators address this directly—individuals with less data get shrunk more toward the group mean, while those with more data retain estimates closer to their sample means. This produces rankings that better reflect true underlying differences rather than sampling noise. Many of the estimators we covered above can account for imbalance through variation in \(n_i\) (in the shrinkage term) across individuals.
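The mis-ranking point can be sketched directly. With imbalanced data, each individual gets their own shrinkage factor \(B_i = \sigma^2_{\text{pop}} / (\sigma^2_{\text{pop}} + \sigma^2_{\text{obs}}/n_i)\), so sparse individuals are pulled harder toward the grand mean. All numbers below are hypothetical:

```r
# Hypothetical illustration of per-individual shrinkage under imbalance:
# three individuals share the same extreme sample mean, but the one with
# the fewest observations is shrunk hardest toward the grand mean.
sigma2_pop <- 4
sigma2_obs <- 25
mu_pop <- 10
n_i   <- c(2, 10, 100)   # very sparse, moderate, data-rich
y_bar <- c(15, 15, 15)   # identical (extreme) sample means
B_i <- sigma2_pop / (sigma2_pop + sigma2_obs / n_i)
round(mu_pop + B_i * (y_bar - mu_pop), 2)  # 11.21 13.08 14.71
```

The data-rich individual keeps most of their extreme estimate, while the sparse individual is pulled most of the way back—exactly the correction a noise-driven ranking needs.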
And the list goes on!
Conclusion
Whether you call it James-Stein estimation, reliability-based true score estimation, empirical Bayes, ridge regression, or hierarchical modeling, the core insight is the same—when estimating multiple related quantities (and potentially even unrelated quantities…), you should pool information across individuals. For most scientists (whether you work in a brutalist building downtown or an ivory tower), the easiest path is to use hierarchical models (e.g., lmer or brms in R), which produce shrinkage automatically, are well supported with extensive documentation, and are easily extended to more complex settings. For example, estimating random effects on the individual-level variance component (i.e. allowing \(\sigma_{\text{obs}}\) to vary across individuals) allows hierarchical models to account for heteroscedasticity. For the brave souls among you, JS-, CTT-, or EB-based estimators can achieve similar results, although you may have to do some algebra. And hey, if you want to feel hipster, you can even try your hand at accomplishing all of this with ridge regression.
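As a sketch of that heteroscedastic extension (not fit here; the data and variable names mirror the examples in this post, and the settings are illustrative), a brms distributional model puts random effects on \(\log(\sigma_{\text{obs}})\):

```r
# Sketch (not run here): a brms distributional model that lets the
# observation-level sd vary by individual, as described above.
# Data/variable names (obs, y, subj) follow the examples in this post.
library(brms)

fit_hetero <- brm(
  bf(y ~ 1 + (1 | subj),
     sigma ~ 1 + (1 | subj)),  # random intercepts on log(sigma_obs)
  data = obs, family = gaussian(),
  chains = 4, iter = 2000
)
```

Because brms models `sigma` on the log scale by default, the individual-level deviations themselves get shrunk toward the population-average (log) residual sd—shrinkage on the variances, not just the means.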
The sample mean had a good run. Perhaps it is time to let it go :D.
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] brms_2.23.0 Rcpp_1.1.1 lme4_2.0-1 glmnet_4.1-10 Matrix_1.6-5
## [6] ggplot2_4.0.1 foreach_1.5.2 tidyr_1.3.2 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] tensorA_0.36.2.1 gtable_0.3.6 shape_1.4.6.1
## [4] QuickJSR_1.8.1 xfun_0.56 bslib_0.9.0
## [7] processx_3.8.6 inline_0.3.21 lattice_0.22-7
## [10] callr_3.7.6 ps_1.9.1 vctrs_0.7.1
## [13] tools_4.3.3 Rdpack_2.6.6 generics_0.1.4
## [16] stats4_4.3.3 parallel_4.3.3 tibble_3.3.1
## [19] pkgconfig_2.0.3 checkmate_2.3.3 RColorBrewer_1.1-3
## [22] S7_0.2.1 RcppParallel_5.1.11-1 distributional_0.6.0
## [25] lifecycle_1.0.5 compiler_4.3.3 farver_2.1.2
## [28] stringr_1.6.0 Brobdingnag_1.2-9 codetools_0.2-20
## [31] htmltools_0.5.9 sass_0.4.10 bayesplot_1.15.0
## [34] yaml_2.3.12 pillar_1.11.1 nloptr_2.2.1
## [37] jquerylib_0.1.4 MASS_7.3-60.0.1 cachem_1.1.0
## [40] StanHeaders_2.32.10 reformulas_0.4.4 iterators_1.0.14
## [43] bridgesampling_1.2-1 boot_1.3-32 abind_1.4-8
## [46] nlme_3.1-168 rstan_2.32.7 posterior_1.6.1
## [49] tidyselect_1.2.1 digest_0.6.39 mvtnorm_1.3-3
## [52] stringi_1.8.7 purrr_1.2.1 bookdown_0.46
## [55] labeling_0.4.3 splines_4.3.3 fastmap_1.2.0
## [58] grid_4.3.3 cli_3.6.5 magrittr_2.0.4
## [61] loo_2.9.0 pkgbuild_1.4.8 survival_3.8-6
## [64] withr_3.0.2 backports_1.5.0 scales_1.4.0
## [67] rmarkdown_2.30 matrixStats_1.5.0 otel_0.2.0
## [70] gridExtra_2.3 blogdown_1.23 coda_0.19-4.1
## [73] evaluate_1.0.5 knitr_1.51 rbibutils_2.4.1
## [76] rstantools_2.6.0 rlang_1.1.7 glue_1.8.0
## [79] formatR_1.14 renv_1.1.6 rstudioapi_0.18.0
## [82] minqa_1.2.8 jsonlite_2.0.0 R6_2.6.1