# The Reliability Paradox

## Defining Reliability

In 2017, Hedge, Powell, and Sumner (2017) conducted a study to determine the reliability of a variety of of behavioral tasks. Reliability has many different meanings throughout the psychological literature, but what Hedge et al. were interested in was how well a behavioral measure consistently ranks individuals. In other words, when I have people perform a task and then measure their performance, does the measure that I use to summarize their behavior show high test-retest reliability? Those studying individual differences—including but not limited to personality psychologists, clinical psychologists, and many developmental psychologists—are likley to be familiar with test-retest reliability, as we often desire measures that help us understand differences between people.

Despite what many of us may first think when we hear the word reliable, as Hedge et al. note, test-retest reliability is only one of many different definitions of reliable used throughout the psychological literature. For example, when someone states that an effect is reliable, they often mean that the effect is easily replicable.

### Replicable Behavioral Effects Aren’t Reliabile?

A commonly cited example of a replicable effect is the Stroop effect, which is often measured as the mean difference in response time between incongruent and congruent word-color pairs. For example, when tasked with responding to the color of a word, people tend to respond much more quickly when the word is consistent with its color (“Red” colored red) compared to when it is not (“Red” colored blue). Since the original study in 1935, this basic effect of an average difference in response times between congruent and incongruent trials has been replicated countless times (see MacLeod, 1991), thereby becoming one of the most well-known effects in all of psychology. In fact, at the time of writing this post, the original 1935 paper has almost 20,000 citations on Google Scholar alone.

Despite how replicable it is, Hedge et al. show that the Stroop effect is unreliable, in that it shows low test-retest reliability. While Hedge et al. assess many examples of such effects, we will focus on the Stroop effect throughout this post for brevity. Using their first experiment as an example, they had a a group of 50 college students complete a Stroop task two separate times separated by a period of three weeks. The Stroop task consisted of three conditions containing 240 trials each: (1) incongruent, (2) neutral, and (3) congruent conditions (for 720 trials total). On each trial, participants responded to the color of a word presented on a computer monitor which could be colored either red, blue, green, and yellow. Responses were collected from key presses. The word could be the same as the font color (congruent condition; e.g., “Red” colored red), unrelated to the font color (neutral; e.g., “lot” colored red), or conflicting with the font color (incongruent; e.g., “Blue” colored red).

After subjects completed the above Stroop task at both timepoints, Hedge et al. used the following behavioral model to estimate each individual’s Stroop effect at both timepoints:

\begin{aligned} Stroop_{i,\text{time}} & = \overline{RT}_{i,\text{time},\text{incongruent}} - \overline{RT}_{i,\text{time}, \text{congruent}} \end{aligned}\tag{1} In plain English, for each person $$i$$ at timepoint $$\text{time}$$, the Stroop effect is equal to the average response time across incongruent trials ($$\overline{RT}_{i,\text{time},\text{incongruent}}$$) minus the average response time across congruent trials ($$\overline{RT}_{i,\text{time},\text{congruent}}$$). Then, to estimate test-retest reliability, Hedge et al. use an Intraclass Correlation Coefficient (ICC) using a two-way random effects model for absolute agreement ($$ICC(2,1)$$). The $$ICC(2,1)$$ is similar to a traditional Pearson’s $$r$$, except that it is also sensitive to differences in the mean rather than just the variance of our observations. For example, if $$A = \{1,2,3\}$$ and $$B = \{4,5,6\}$$, the Pearson’s $$r$$ between $$A$$ and $$B$$ is $$1$$. However, the ICC(2,1) between these vectors is attenuated to $$.18$$ because the corresponding elements of each vector differ on average by some value (3 in this example). This is important when our aim is to ensure that, for example, two different coders give the same ratings to a set of stimuli.

Importantly, Hedge et al. found that the test-retest reliability of the Stroop effect as measured using equation 1 was $$ICC(2,1) = .6$$, which is a surprisingly low number given how robust the Stroop effect is at the group level! As Hedge et al. discuss, with such low reliability, any research that correlates individual-level Stroop estimates with other measures (e.g., BOLD fMRI signals, personality measures, etc.) should be correcting the estimated correlations for the high measurement error of the Stroop effect, which substantially increases the sample size required to make reasonably-powered statistical inferences. Because Hedge et al. replicated this result of low test-retest reliability across both multiple groups of participants and multiple different tasks (e.g., Flanker, Navon, Go/No-go, Stop-signal), their general conclusion naturally follows that behavioral tasks such as the Stroop task are severely limited in their ability to distinguish between individuals, thereby calling into question their utility for individual-differences research.

# Revisiting the Reliability Paradox

## Thinking Generatively Improves Reliability

This post will show that the conclusions drawn by Hedge et al. are highly dependent on the implict assumptions of their methodology. Specifically, we will develop a statistical model that is more consistent with the data-generating mechanism underlying the group-level Stroop effect, which will allow us to estimate more precise individual-level effects. In turn, we achieve higher test-retest reliability estimates. This work is similar in kind to that of Rouder & Haaf (2019), although we will have a more specific focus on estimating test-retest reliability and go into a bit more depth when discussing the implications of our results in reference to the current standard practices in psychology (I was also unfortunately not aware of this fantastic work until shortly after completing this project!). Additionally, we will use a model parameterization that is more familiar to those who regularly work with R with packages such as brms (e.g., compare the model we develop below to this brms write-up), which I hope will help readers more readily generalize this material to their own work.

In the following sections, we will download the Stroop data used by Hedge et al., walk through a new behavioral model of the Stroop effect, fit the new model to the data from Hedge et al., extract test-retest reliability estimates from our new model, and compare them to those that are estimated using the traditional model (equation 1). We will end by discussing the implications of our findings for individual-differences research, including suggestions for future research.

## Preprocessing and Summarizing the Stroop Data

To start, we can download the Stroop data from experiment 1 of Hedge et al., which is hosted on the Open Science Foundation webpage linked here.

Then, we use the following R code to read in and preprocess the Stroop data:

# For easy data manipulation
library(foreach)
library(dplyr)
library(tidyr)
library(broom)
# For pretty plots
library(ggplot2)
# For nice tables
library(knitr)
# For intra-class correlations
library(irr)
# For Bayesian modeling
library(rstan)
# For some useful utility functions
library(hBayesDM)
# Data path and individual subject file names
files_t1 <- list.files(data_path, pattern = "*1.csv")
files_t2 <- list.files(data_path, pattern = "*2.csv")

# Create long-format stroop task data including all subjects
long_stroop <- foreach(i=seq_along(files_t1), .combine = "rbind") %do% {
# For time 1
tmp_t1 <- read.csv(file.path(data_path, files_t1[i]), header = F) %>%
mutate(subj_num = i,
time = 1)
# For time 2 (about 3 weeks apart)
tmp_t2 <- read.csv(file.path(data_path, files_t2[i]), header = F) %>%
mutate(subj_num = i,
time = 2)
# Condition (0 = congruent, 1=neutral, 2=incongruent),
# Correct (1) or incorrect (0),
# Reaction time is in seconds
names(tmp_t1)[1:6] <- names(tmp_t2)[1:6] <- c("Block", "Trial", "Unused",
"Condition", "Correct", "RT")
rbind(tmp_t1, tmp_t2)
}

# Compute Stroop effect for each person at each time (equation 1)
sum_stroop <- long_stroop %>%
group_by(subj_num, time) %>%
summarize(stroop_eff = mean(RT[Condition==2]) - mean(RT[Condition==0]))
## summarise() has grouped output by 'subj_num'. You can override using the .groups argument.
# Peak at the data
kable(head(sum_stroop), digits = 3)
subj_num time stroop_eff
1 1 0.068
1 2 0.025
2 1 0.053
2 2 0.012
3 1 0.066
3 2 0.073

Now, we have a Stroop effect estimate for each subject at each timepoint, which we can use for further analyses. One thing to note is that unlike Hedge et al., we are not using any heiristic data cleaning rules. Specifically, Hedge et al. did not include response times (RTs) that were below 100 ms or above 3 times an indivual’s absolute median deviation of RTs in their computation of the Stroop effect. One other note is that the RTs in the raw data are in seconds (not ms). Here are the basic descriptives of our data:

# Means and SDs of Stroop effect at each timepoint
sum_stroop %>%
ungroup() %>%
group_by(time) %>%
summarize(N = n(),
Mean = round(mean(stroop_eff), 3),
SD = round(sd(stroop_eff), 3)) %>%
kable(digits = 3)
time N Mean SD
1 47 0.081 0.036
2 47 0.059 0.030

### Testing for a Group-level Stroop Effect

It is pretty clear from the descriptives above that there is a group-level Stroop effect, but we can conduct a t-test at each timepoint regardless:

# Test for group-level Stroop effect at time 1
sum_stroop %>%
filter(time==1) %>%
{t.test(.$stroop_eff)} %>% tidy() %>% kable(digits = 3) estimate statistic p.value parameter conf.low conf.high method alternative 0.081 15.418 0 46 0.071 0.092 One Sample t-test two.sided Here, you can see that the Stroop effect is apparent at time 1, with a mean increase in RT during the incongruent trials of about .08 seconds (80 ms). We can also check time 2: # Test for group-level Stroop effect at time 2 sum_stroop %>% filter(time==2) %>% {t.test(.$stroop_eff)} %>%
tidy() %>%
kable(digits = 3)
estimate statistic p.value parameter conf.low conf.high method alternative
0.059 13.633 0 46 0.05 0.068 One Sample t-test two.sided

Looks good! The Stroop effect is slightly lower at time 2, but still comes out significant at the group level. From these analyses, it is clear that the Stroop effect is indeed replicable using equation 1.

### Test-retest Reliability

To compute test-retest reliability as in Hedge et al., we will first reformat the data so that it can be used with the irr package, which allows us to estimate the $$ICC(2,1)$$:

# Format for test-retest analysis
stroop_unpooled <- sum_stroop %>%
ungroup() %>%
mutate(time = ifelse(time==1, "Stroop_T1", "Stroop_T2"),
pooled = "No",
Replication = "Sample Mean") %>% # these "pooled" and Replication variables will come in later
spread(key = time, value = stroop_eff)

# Intraclass correlation of stroop effect at time 1 and 2
stroop_unpooled %>%
select(Stroop_T1, Stroop_T2) %>%
{icc(., model = "twoway", type = "agreement", unit = "average")[c(1, 7:15)]} %>%
as.data.frame() %>%
kable(digits = 3)
subjects value r0 Fvalue df1 df2 p.value conf.level lbound ubound
47 0.578 0 2.951 46 13.419 0.017 0.95 0.059 0.793

Here, we see that the $$ICC(2,1) = .58$$, which is slightly lower compared to the results reported in Hedge et al. ($$ICC(2,1) = .6$$). Remember, we did not use the same pre-processing steps as Hedge et al. (i.e. we used all trials to compute RTs, rather than removing those beyond certain thresholds), which is why we have slightly different results. We can additionally compute a traditional Pearson’s $$r$$ to assess test-retest reliability, which offers similar conclusions:

# Pearson's correlation
stroop_unpooled %>%
select(Stroop_T1, Stroop_T2) %>%
{cor.test(.$Stroop_T1, .$Stroop_T2)} %>%
tidy() %>%
kable(digits = 3)
estimate statistic p.value parameter conf.low conf.high method alternative
0.503 3.908 0 45 0.253 0.691 Pearson’s product-moment correlation two.sided

In fact, Pearson’s $$r = .5$$—even lower than the $$ICC(2,1)$$ estimate! This happens because the model we are using for the $$ICC(2,1)$$ estimate treats the individual-level Stroop effect estimates as average units (since they are averages across all trials), which affects how variance is estimated. Still, the main conclusion holds—with a test-retest reliability estimate of only $$.5$$ to $$.58$$, the utility of the Stroop effect for individual-differences research is limited relative to more reliable measures.

## Building a Generative Model

While the above results paint a clear picture, there are a few problems with the methodology that could be improved upon to make better inference on individual-level Stroop estimates.

First, when we average across the RT data before including the averages in a model (i.e. to compute the $$ICC(2,1)$$), we are throwing away useful information by making the assumption that individual-level Stroop effects can be estimated without measurement error. We should be able to do much better by including all trials (not just averages) in the statistical model that we use to estimate test-retest reliability, which will allow us to build a hierarchical model that can pool information across individuals. By building our model to respect the structure of our data (e.g., trials within timepoints within subjects), we are able to gain a better understanding of the data-generating process that gives rise to the Stroop effect.

Second, the Stroop task exhibits practice effects, which could attenuate $$ICC(2,1)$$ estimates. For example, if everyone becomes faster by timepoint 2, then the $$ICC(2,1)$$ will go down even if their are consistent individual differences across timepoints. In fact, plotting out the Stroop effects at each timepoint reveals evidence for practice effects:

# Test-retest plot
stroop_unpooled %>%
ggplot(aes(x = Stroop_T1, y = Stroop_T2)) +
geom_abline(intercept = 0, slope = 1, linetype = 2, color = "black", size = 1) +
geom_point(color = I("#b5000c")) +
theme_minimal(base_size = 20) +
theme(panel.grid = element_blank())

Above, the black dotted line indicates the identity line, which is where all the red points (unpooled individual-level Stroop estimates) would fall if all subjects had the same Stroop effect estimates at each timepoint. Overall, we see that Stroop effects at time 2 are generally lower than at time 1, which is indicative of some form of practice effect.

### The Hierarchical Bayesian Approach

Hierarchical Bayesian Analysis (HBA) offers a statistical framework that allows us to develop a model that better captures the data-generating distribution of the group-level Stroop effect. Specifically, HBA simultaneously: (1) uses information from each individual subject to inform a group-level estimate, and (2) uses group-level information to inform all individual-level estimates. Also known as partial pooling, the structure imposed by a hierarchical model therefore allows us to more precisely estimate individual-level effects compared to methods that do not use pooling, which can lead to better estimates of data-generating parameters (e.g., Ahn et al., 2011).

Before walking through the equations, let’s first determine how we would like to structure our hierarchical model in a way that captures our assumptions. When most psychologists think of hierarchical modeling, they think of estimating “random slopes” or “random intercepts”, or some combination of the two. This style of thinking can sometimes obscure our actual intentions. Instead, we will think of hierarchical models as a way to encode our assumptions regarding different behavioral effects in the Stroop task. Specifically, we want a model that does the following:

1. Captures uncertainty in individual-level Stroop estimates from within-subject variability,
2. Estimates the difference in RT between incongruent and congruent trials,
3. Estimates the correlation between the difference score from (2) between timepoint 1 and 2 across subjects (i.e. test-retest reliability), and
4. Does 1-3 within a single hierarchical model

It is worth emphasizing that there are multiple ways we could construct a model to encode these assumptions (see Rouder & Haaf’s word mentioned previously), and the model we will develop is only one example. Now, here is the model we will use:

\begin{align*} \text{RT}_{i,t} & \sim \text{Normal} (\mu_{i,t}, \sigma_{\text{condition}}) \\ \mu_{i,t} & = (\beta_{\text{congruent}_{1,i}} + \beta_{\Delta_{1,i}} \cdot I_{\text{condition}}) \cdot I_\text{time} + \\ & ~~~~~ (\beta_{\text{congruent}_{2,i}} + \beta_{\Delta_{2,i}} \cdot I_{\text{condition}}) \cdot (1-I_\text{time}) \\ \begin{bmatrix} \beta_{\Delta_{1,i}} \\ \beta_{\Delta_{2,i}} \end{bmatrix} & \sim \text{MVNormal} \bigg (\begin{bmatrix} \mu_{\beta_{\Delta_{1}}} \\ \mu_{\beta_{\Delta_{2}}} \end{bmatrix}, \mathbf{S} \bigg ) \\ \mathbf S & = \begin{pmatrix} \sigma_{\beta_{\Delta_{1}}} & 0 \\ 0 & \sigma_{\beta_{\Delta_{2}}} \end{pmatrix} \mathbf R \begin{pmatrix} \sigma_{\beta_{\Delta_{1}}} & 0 \\ 0 & \sigma_{\beta_{\Delta_{2}}} \end{pmatrix} \\ \beta_{\text{congruent}_{1,i}} & \sim \text{Normal} (\mu_{\beta_{\text{congruent}_1}}, \sigma_{\beta_{\text{congruent}_1}}) \\ \beta_{\text{congruent}_{2,i}} & \sim \text{Normal} (\mu_{\beta_{\text{congruent}_2}}, \sigma_{\beta_{\text{congruent}_2}}) \\ \mu_{\beta_{\text{congruent}}} & \sim \text{Normal} (0, 1) \\ \mu_{\beta_{\Delta}} & \sim \text{Normal} (0, 1) \\ \sigma_{\beta_{\text{congruent}}} & \sim \text{HalfCauchy} (0, 1) \\ \sigma_{\beta_{\Delta}} & \sim \text{HalfCauchy} (0, 1) \\ \sigma_{\text{condition}} & \sim \text{HalfCauchy} (0, 1) \\ \mathbf R & \sim \text{LKJcorr} (1) \end{align*}\tag{2} Obviously, there is a lot going on here—yet, at its core, the model is not much different from a traditional linear regression model. Above, $$\text{RT}_{i,t}$$ indicates the response time for subject $$i$$ on trial $$t$$. We assume that each $$\text{RT}_{i,t}$$ is distributed normally, with a mean $$\mu_{i,t}$$ and standard deviation $$\sigma_{\text{condition}}$$. Note that a different standard deviation is assumed for the incongruent versus congruent conditions (indicated by $$\text{condition}$$).

The $$\mu_{i,t}$$ term is determined as a linear combination of some $$\beta$$ weights that we are estimating across conditions and timepoints. Specifically, $$\beta_{\text{congruent}_{1, i}}$$ can be thought of as an “intercept”, which captures the average $$\text{RT}$$ at timepoint 1 for subject $$i$$ (i.e. the first time they took the Stroop task). Then, $$\beta_{\Delta_{1, i}}$$ is a “slope” (or different score) that is added to the intercept term when the condition identifier ($$I_\text{condition}$$) returns 1. Note that $$I_\text{condition}$$ returns 1 only for the incongruent trials, and returns 0 otherwise. In this way, $$\beta_{\Delta_{1, i}}$$ is our estimate for the Stroop effect for subject $$i$$ at the first timepoint. Importantly, the entire term representing the first timpeoint is then multiplied by $$I_\text{time}$$, which indicates whether the current trial is from the first (1) or second (0) timepoint. Then, $$\beta_{\text{congruent}_{2, i}}$$ and $$\beta_{\Delta_{2, i}}$$ are interpreted in the same way, except they represent corresponding estimates for the second timepoint.

To estimate test-retest reliability, we then make the assumption that our individual-level estimates for $$\beta_{\Delta_{1, i}}$$ and $$\beta_{\Delta_{2, i}}$$ (i.e. the Stroop effect estimates for each subject at timepoints 1 and 2) are drawn from a multivariate normal distribution, which allows us to estimate the Pearson’s correlation between Stroop effect timepoints across subjects in the same model that we use to estimate individual-level Stroop effects. The multivariate normal distribution assumes that the individual-level effects are drawn from group-level Stroop effects for timepoint 1 ($$\mu_{\beta_{\Delta_{1}}}$$) and timepoint 2 ($$\mu_{\beta_{\Delta_{2}}}$$), and with covariance matrix $$\bf{S}$$. Covariance matrix $$\bf{S}$$ is itself constructed with the estimated SDs of each of the Stroop effects and correlation matrix $$\bf{R}$$. $$\bf{R}$$ is where we get our test-retest reliability estimate.

Then, we assume that the individual-level $$\beta_{\text{congruent}}$$ weights for each timepoint are drawn from separate group-level normal distributions, with means indicated by $$\mu_{\beta_{\text{congruent}_1}}$$ and $$\mu_{\beta_{\text{congruent}_2}}$$ and SDs indicated by $$\sigma_{\beta_{\text{congruent}_1}}$$ and $$\sigma_{\beta_{\text{congruent}_2}}$$.

Lastly, we have the prior distributions on the group-level parameters. These priors in particular are not really very informative, given that the RTs are in seconds (see the descriptives above). The one prior worth mentioning is the the $$\text{LKJcorr}$$ distribution, which specifies a prior on a correlation matrix. Since we specified 1 as the shape parameter and we are only estimating the correlation between two values, this particular prior is uniform from -1 to 1, which assumes that all possible correlations for test-retest reliability of the Stroop effect are equally likely. We could certainly do better, but this will be fine for a first pass at the data. For a more detailed description of this particular distribution and some informative visualizations, check out this great blog post.

Now that we have walked through the model, here is the Stan code that we will use to fit it:

data {
int N;      // # of subjects
int N_cond; // # of conditions
int N_time; // # of timepoints
int T_max;  // max # of trials across subjects
real RT[N, N_cond, N_time, T_max]; // Reaction times for each subject, condition, timepoint, and trial
}
parameters {
// Correlation matrix for Stroop effect test-retest reliability
corr_matrix[N_time] R;

// SDs for group-level parameters
vector<lower=0>[N_time] sigma_con;
vector<lower=0>[N_time] sigma_delta;

// SDs for normal model on RTs
vector<lower=0>[N_cond] sigma_RT;

// Means for group-level parameters
vector[N_time] mu_beta_con;
vector[N_time] mu_beta_delta;

// Individual-level parameters
vector[N_time] beta_con[N];
vector[N_time] beta_delta[N];
}
transformed parameters {
// Construct covariance matrix from SDs and correlation matrix
cov_matrix[N_time] S;
S = quad_form_diag(R, sigma_delta);
}
model {
// Priors on group-level SDs and correlation matrix
R ~ lkj_corr(1);
sigma_delta ~ normal(0, 1);
sigma_con   ~ normal(0, 1);
sigma_RT    ~ normal(0, 1);

// Priors on group-level means
mu_beta_con   ~ normal(0,1);
mu_beta_delta ~ normal(0,1);

// Priors on individual parameters
beta_con[1] ~ normal(mu_beta_con[1], sigma_con[1]);
beta_con[2] ~ normal(mu_beta_con[2], sigma_con[2]);
beta_delta  ~ multi_normal(mu_beta_delta, S);

// For each subject
for (i in 1:N) {
// Congruent at time 1
RT[i,1,1,:] ~ normal(beta_con[i,1], sigma_RT[1]);
// Incongruent at time 1
RT[i,2,1,:] ~ normal(beta_con[i,1] + beta_delta[i,1], sigma_RT[2]);
// Congruent at time 2
RT[i,1,2,:] ~ normal(beta_con[i,2], sigma_RT[1]);
// Incongruent at time 2
RT[i,2,2,:] ~ normal(beta_con[i,2] + beta_delta[i,2], sigma_RT[2]);
}
}

The comments describe much of what is going on in the code, and I wrote it to match equation 2 as best as possible. The only major change worth noting is that as opposed to using identifiers as in equation 2 (e.g., $$I_\text{condition}$$ and $$I_\text{time}$$), the Stan model has the different conditions and timepoints stored in different dimensions of a single array, which is more efficient. The underlying model is equivalent to equaiton 2.

### Fitting and Refining Our Model

Now, we can prepare the data and fit the Stan model:

# Number of subjects
n_subj <- length(unique(long_stroop$subj_num)) # Number of conditions n_cond <- 2 # Number of timepoints n_time <- 2 # All subjects have 240 trials within each condition within timepoints T_max <- 240 # Create RT data array for stan; dims = (subject, condition, time, trial) RT <- array(NA, dim = c(n_subj, n_cond, n_time, T_max)) for (i in 1:n_subj) { # RTs for congruent condition at time 1 RT[i, 1, 1,] = with(long_stroop, RT[subj_num==i & Condition==0 & time==1]) # RTs for incongruent condition at time 1 RT[i, 2, 1,] = with(long_stroop, RT[subj_num==i & Condition==2 & time==1]) # RTs for congruent condition at time 2 RT[i, 1, 2,] = with(long_stroop, RT[subj_num==i & Condition==0 & time==2]) # RTs for incongruent condition at time 2 RT[i, 2, 2,] = with(long_stroop, RT[subj_num==i & Condition==2 & time==2]) } # Stan-ready data list stan_dat <- list(N = n_subj, N_cond = n_cond, N_time = n_time, T_max = T_max, RT = RT) # Fit the hierarchical model fit_m1 <- sampling(stroop_m1, data = stan_dat, iter = 2000, warmup = 500, chains = 3, cores = 3, seed = 2) ## Warning: There were 451 divergent transitions after warmup. See ## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup ## to find out why this is a problem and how to eliminate them. ## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See ## http://mc-stan.org/misc/warnings.html#bfmi-low ## Warning: Examine the pairs() plot to diagnose sampling problems ## Warning: The largest R-hat is NA, indicating chains have not mixed. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#r-hat ## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#bulk-ess ## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#tail-ess Great! The model runs fine, although we get warning messages about divergent transitions. This happens in Stan because the sampler used (i.e. the No-U-Turn Hamiltonian Monte Carlo sampler) returns divergences when the sampler explores difficult parts of the parameter space. Such a large number of divergences is problematic, suggesting that we may not have explored important parts of the parameter space. This often happens when the means and SDs of hierarchical parameters are correlated in strange ways, which we can confirm by checking the bivariate distributions: # Pairs plot of the group-level congruent cond. means and SDs pairs(fit_m1, pars = c("mu_beta_con", "sigma_con")) The red dots above indicate the divergent transitions. It is clear that the divergences are occuring most frequently in parts of the parameter space that are funnel-shaped. Now known as Neal’s funnel, we can take care of this problem by re-parameterizing the model such that the hierarchical means and SDs are less dependent on each other (i.e. a non-centered parameterization). Additionally, we will re-parameterize the correlation matrix using a Cholesky decomposition, which will also help with estimation a bit. I will not go into more detail about why Cholesky decomposition is useful, and instead refer interested readers to the wiki page referred to above. The updated Stan code below uses a non-centered parameterization to estimate all hierarchical parameters, which is mathematically identical to the model above but allows for better parameter estimation due to changes in the shape of the parameter space that Stan’s NUTS-HMC sampler needs to explore. data { int N; // # of subjects int N_cond; // # of conditions int N_time; // # of timepoints int T_max; // max # of trials across subjects real RT[N, N_cond, N_time, T_max]; // Reaction times for each subject, condition, timepoint, and trial } parameters { // Group-level correlation matrix (cholesky factor for faster computation) cholesky_factor_corr[2] R_cholesky; // Group-level parameter SDs vector<lower=0>[2] sigma_con; vector<lower=0>[2] sigma_delta; // Group-level SDs for normal model vector<lower=0>[2] sigma_RT; // Group-level parameter means vector[2] mu_beta_con; vector[2] mu_beta_delta; // Individual-level parameters (before being transformed) matrix[N,2] beta_con_pr; matrix[2,N] beta_delta_pr; // order flipped here for operation below } transformed parameters { // Individual-level parameter off-sets (for non-centered parameterization) matrix[2,N] beta_delta_tilde; // Individual-level parameters matrix[N,2] beta_con; matrix[N,2] beta_delta; // Construct inidividual offsets (for non-centered parameterization) beta_delta_tilde = diag_pre_multiply(sigma_delta, R_cholesky) * beta_delta_pr; // Compute individual-level parameters from non-centered parameterization for (i in 1:N) { // Congruent at time 1 beta_con[i,1] = mu_beta_con[1] + sigma_con[1] * beta_con_pr[i,1]; // Congruent at time 2 beta_con[i,2] = mu_beta_con[2] + sigma_con[2] * beta_con_pr[i,2]; // Stroop effect at time 1 beta_delta[i,1] = mu_beta_delta[1] + beta_delta_tilde[1, i]; // Stroop effect at time 2 beta_delta[i,2] = mu_beta_delta[2] + beta_delta_tilde[2, i]; } } model { // Prior on cholesky factor of correlation matrix R_cholesky ~ lkj_corr_cholesky(1); // Priors on group-level SDs sigma_delta ~ cauchy(0, 1); sigma_con ~ cauchy(0, 1); sigma_RT ~ cauchy(0, 1); // Priors on individual-level parameters to_vector(beta_delta_pr) ~ normal(0, 1); to_vector(beta_con_pr) ~ normal(0, 1); // For each subject for (i in 1:N) { // Congruent at time 1 RT[i,1,1,:] ~ normal(beta_con[i,1], sigma_RT[1]); // Incongruent at time 1 RT[i,2,1,:] ~ normal(beta_con[i,1] + beta_delta[i,1], sigma_RT[2]); // Congruent at time 2 RT[i,1,2,:] ~ normal(beta_con[i,2], sigma_RT[1]); // Incongruent at time 2 RT[i,2,2,:] ~ normal(beta_con[i,2] + beta_delta[i,2], sigma_RT[2]); } } generated quantities { corr_matrix[2] R; // Reconstruct correlation matrix from cholesky factor R = R_cholesky * R_cholesky'; }  Now, let’s fit the non-centered model: # Fit the non-centered hierarchical model fit_m2 <- sampling(stroop_m2, data = stan_dat, iter = 2000, warmup = 500, chains = 3, cores = 3, seed = 2) And just like that, no divergent transitions! Also, we get the added bonus of slightly faster computation. We can again look at the pairs plot to see what happened to Neal’s funnel: # Pairs plot of the group-level congruent cond. means and SDs pairs(fit_m2, pars = c("mu_beta_con", "sigma_con")) Above, you can see that the funnel has disappeared completely. Instead, we have well-behaved, more-or-less elliptical bivariate distributions (which MCMC samplers love). We also need to check convergence more generally, using both visual and quantitative diagnostics. First, we can graph the traceplots, which should look like “furry caterpillars”: # Check all group-level parameters (and test-retest correlation estimate) traceplot(fit_m2, pars = c("mu_beta_con", "mu_beta_delta", "sigma_con", "sigma_delta", "sigma_RT", "R[1,2]")) These look good. We would usually also check individual-level parameters, but for the sake of brevity we can look at some quantitative diagnostics. In particular, $$\hat{R}$$ (a.k.a. the Gelman-Rubin statistic) is a measure of within- relative to between-chain variance, which should be close to 1 for all parameters if chains mix well. We can easily plot a distribution of $$\hat{R}$$ for all parameters in the model as follows: # rstan's default plotting method stan_rhat(fit_m2, bins = 30) ## Warning: Removed 3 rows containing non-finite values (stat_bin). Here, all the $$\hat{R}$$ statistics are very close to 1. In combination with the traceplots above and lack of divergences, we can be pretty sure that the three chains we used have converged. ### Making Inference with Our Model Now, the next step is to extract the parameter estimates and check the estimated correlation (i.e. test-retest estimate) across the individual-level Stroop effect parameters at each timepoint! First, we can plot the posterior distribution of the test-retest correlation: # Extract parameters from model pars <- extract(fit_m2) # Plot density of test-retest correlation estimate qplot(pars$R[,1,2], geom = "density", fill = I("#b5000c")) +
ggtitle(paste0("Posterior Mode = ", round(estimate_mode(pars$R[,1,2]), 2))) + xlab("Test-Retest Correlation") + theme_minimal(base_size = 15) + theme(panel.grid = element_blank(), legend.position = "none") The results are quite stunning—the mass of the posterior is up against 1, with a mode of $$0.96$$. Such a high test-retest reliability estimate is clearly at odds with the findings of Hedge et al., who reported an estimate of $$ICC(2,1) = .6$$. As discussed before, it is the pooling of information both across individual subjects and across timepoints within subjects that gives us these results. Now that our model is fit, we can readily visualize this pooling. The plot below shows the unpooled estimates from equation 1 (shown in our first figure at the start of the post), and the pooled estimates from our non-centered hierarchical model: # Extracting posterior modes of individual-level Stroop effect estimates stroop_pooled <- apply(pars$beta_delta, c(2,3), mean) %>%
as.data.frame() %>%
rename(Stroop_T1 = V1,
Stroop_T2 = V2) %>%
mutate(subj_num = row_number(),
pooled = "Yes")
# Pooled correlation
pooled_cor <- with(stroop_pooled, round(cor(Stroop_T1[pooled=="Yes"], Stroop_T2[pooled=="Yes"]), 2))

# My favorite visualization of all time
bind_rows(stroop_unpooled, stroop_pooled) %>%
mutate(subj_num = as.factor(subj_num)) %>%
ggplot(aes(x = Stroop_T1, y = Stroop_T2)) +
ggtitle(paste0("Posterior Means r = ", pooled_cor)) +
geom_abline(intercept = 0, slope = 1, linetype = 2, color = "black", size = 1) +
stat_ellipse(geom="polygon", type="norm", level=1/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=2/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=3/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=4/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=5/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=6/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=7/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=8/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=9/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=.99, size=0, alpha=1/10, fill="gray") +
geom_line(aes(group = subj_num), size = 1/4) +
geom_point(aes(group = subj_num, color = pooled)) +
scale_color_manual("Pooled?",
values = c("#990000", "#fee8c8")) +
theme_minimal(base_size = 20) +
theme(panel.grid = element_blank())

Here, we can see that the pooled estimates (individual-level posterior means pictured here) are actually highly consistent across timepoints within individuals. In fact, the correlation between the posterior means of the individual-level Stroop effects at time 1 versus time 2 is $$r = 0.98$$. We cannot really get much better than that! Of course, there is more uncertainty in the correlation matrix we estimated (pictured above), but this plot shows just how powerful the pooling from a well-constructed hierarchical model can be. Moreover, the hierarchical pooling removes the need to get rid of outliers at the individual-subject level—as you can see, data-points that fall far from the group average (which usually show large variance) shrink more strongly toward the group-level average.

## On Averaging Before Modeling

If you are not yet convinced that the traditional practice of averaging across trials before modeling a correlation is sub-optimal, we can actually view such practices as a special case within the context of the generative model that we developed above (equation 2). Specifically, equation 1 assumes that individual-level Stroop effects can be estimated with no measurement error—but how much does this actually affect our inference? We will explore this question below before summarizing our findings and discussing implications for future research on behavioral tasks.

Let’s begin with a thought experiment. If we were to encode the assumption of 0 measurement error into the generative model that we developed, how would we do so? At first, we may think to change the prior distributions, the likelihood of the model, or some other aspect of the model that can encode certainty/uncertainty. However, an idea that leads to better intuition of the implications is that we could:

1. Take our dataset containing all trials from all subjects in each condition/timepoint,
2. Append our dataset to itself an infinite number of times (yes, literally copy pasting your dataset repeatedly to create infinite data for each subject), and
3. Fit the data using the generative model from equation 2.

If we follow these three steps, the test-retest correlation we estimate in equation 2 will converge to the correlation we get by using the traditional analysis described by equation 1 and surrounding text. This follows because as we continue to artificially replicate our dataset ad infinitum, the estimates for each individual-level estimate will approach the sample mean, and the uncertainty estimates will reduce to single points—this is identical to what we are doing in equation 1. Therefore, the traditional practice of averaging before modeling is as functionally problematic as artificially replicating our datasets an infinite number of times before fitting our model. In fact, we can demonstrate this by observing what happens as we artificially replicate our dataset an increasing number of times. Of course, we cannot do this anywhere near an infinite number of times, but we can at least get a sense of the problem using this method. The R code below does exactly this, varying the number of artificially replicated datasets $$\in \{1, 2, 4, 8, 16, 32\}$$.

# Number of artificially replicated datasets
reps <- c(1, 2, 4, 8, 16, 32)

# Looping through each replication and saving model fits
results <- foreach(r=reps) %do% {
# Same as above
n_subj <- length(unique(long_stroop$subj_num)) n_cond <- 2 n_time <- 2 # If we multiply T_max by the reps variable, the number of trials alloted in # the RT array for each condition/timepoint will increase accordingly T_max <- 240 * r # Create RT data array for stan; dims = (subject, condition, time, trial) RT <- array(NA, dim = c(n_subj, n_cond, n_time, T_max)) for (i in 1:n_subj) { # Because we created an array that is larger than the vector of data that # we assigned to it, R will (by default) replicate the observations we are # assigning to each condition/timepoint to fill the corresponding RT array RT[i, 1, 1,] = with(long_stroop, RT[subj_num==i & Condition==0 & time==1]) RT[i, 2, 1,] = with(long_stroop, RT[subj_num==i & Condition==2 & time==1]) RT[i, 1, 2,] = with(long_stroop, RT[subj_num==i & Condition==0 & time==2]) RT[i, 2, 2,] = with(long_stroop, RT[subj_num==i & Condition==2 & time==2]) } # Stan-ready data list rep_dat <- list(N = n_subj, N_cond = n_cond, N_time = n_time, T_max = T_max, RT = RT) # Fit the model with artificially replicated data tmp_fit <- sampling(stroop_m2, data = rep_dat, iter = 2000, warmup = 500, chains = 3, cores = 3, seed = 2) # Save model fit in list (results) tmp_fit } ## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#bulk-ess ## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#bulk-ess ## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#bulk-ess ## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#tail-ess ## Warning: There were 4 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See ## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded ## Warning: Examine the pairs() plot to diagnose sampling problems ## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#bulk-ess ## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable. ## Running the chains for more iterations may help. See ## http://mc-stan.org/misc/warnings.html#tail-ess After fitting everything, we can visualize how the estimated test-retest correlation changes with respect the the number of times we artificially replicated our data (remember that the correlation should converge to the unpooled sample correlation): # Plot posterior distribution of test-retest correlation across replications foreach(i=seq_along(results), .combine = "rbind") %do% { data.frame(R = rstan::extract(results[[i]])$R[,1,2]) %>%
mutate(Replication = as.factor(reps[i]))
} %>%
ggplot(aes(x = Replication, y = R)) +
geom_violin(fill = I("#b5000c")) +
stat_summary(aes(x = Replication, y = R), fun.y = mean, geom = "point", size = 2) +
geom_hline(yintercept = .5, linetype = 2, color = I("black")) +
theme_minimal(base_size = 20) +
theme(panel.grid = element_blank())
## Warning: fun.y is deprecated. Use fun instead.

As expected, as we artificially replicated our dataset more and more, the model-estimated test-retest correlation (posterior means indicated by black points) converges toward the unpooled sample Pearson’s correlation (indicated by the dashed black line). Next, let’s take a look at this effect at the individual-level by plotting out the changes in posterior means with respect to the number of artificially replicated datasets:

# Plot convergence of estimated individual-level posterior means with sample means (Sample Means = equation 1/unpooled estimates)
foreach(i=seq_along(results), .combine = "rbind") %do% {
apply(rstan::extract(results[[i]])\$beta_delta, c(2,3), mean) %>%
as.data.frame() %>%
rename(Stroop_T1 = V1,
Stroop_T2 = V2) %>%
mutate(subj_num = row_number(),
Replication = as.character(reps[i]))
} %>%
bind_rows(stroop_unpooled) %>%
mutate(subj_num = as.factor(subj_num),
Replication = factor(Replication,
levels = c("1", "2", "4", "8", "16", "32", "Sample Mean"),
labels = c("1", "2", "4", "8", "16", "32", "Sample Mean"))) %>%
ggplot(aes(x = Stroop_T1, y = Stroop_T2)) +
geom_abline(intercept = 0, slope = 1, linetype = 2, color = "black", size = 1) +
stat_ellipse(geom="polygon", type="norm", level=1/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=2/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=3/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=4/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=5/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=6/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=7/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=8/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=9/10, size=0, alpha=1/10, fill="gray") +
stat_ellipse(geom="polygon", type="norm", level=.99, size=0, alpha=1/10, fill="gray") +
geom_line(aes(group = subj_num), size = 1/4) +
geom_point(aes(group = subj_num, color = Replication)) +
scale_color_brewer("Number of\nReplications",
type = "seq",
palette = "Reds") +
theme_minimal(base_size = 20) +
theme(panel.grid = element_blank())

Similar to the group-level test-retest correlation estimate, we see that the individual posterior means converge to the unpooled sample means as we increase the number of times we artificially replicate our data before fitting the model.

# Discussion

In this post, we showed that hierarchical models can: (1) pool information across individual subjects to more precisely estimate individual-level behavioral effects, and subsequently (2) increase test-retest reliability esimtates to the extent that we can infer strong, reliable individual differences between subjects in behavioral tasks. Therefore, our findings show that despite the conclusions drawn by Hedge et al., robust effects can in fact be reliable, with the caveat that we need to use more computationally expensive models that are more difficult to work with. Our findings are consistent with Rouder & Haaf (2019), who similarly showed that hierarchical models (even using a different functional form) are necessary to properly account for the high measurement error of behavioral effects when the goal is to make inference on individual-differences. I would highly recommend taking a look at Rouder & Haaf’s work if you are interested in reading more about the benefits of hierarchical modeling in the context of behavioral tasks.

More broadly, our results show the benefits of thinking about the data-generating process when analyzing behavioral data. For example, traditional analyses (e.g., equation 1) fail to consider variability within subjects, which is detrimental when we aim to make inference at the individual-subject level. Specifically, by failing to consider within-subject variability, we implicitly assume that behavioral summary statistics are infinitely precise—when made explicit, it is apparent just how inconsistent such assumptions are with our existing knowledge regarding behavioral data.

Equally important to the statistical issues that our results reveal are the ethical concerns brought forth by our last analysis. Specifically, we showed that averaging across trials before analyzing behavioral summary statistics is equivalent to artificially replicating each subject’s data an infinite number of times before fitting an inferential statistical model. Put in these terms, we can see just how problematic measurement error is when not properly accounted for. For example, if a researcher were caught artificially replicating their data before analyzing it, they would immediately be ridiculed—and likely formally investigated—for committing research misconduct and/or fraud. However, as it currently stands, researchers regularly commit to the practice of averaging before modeling, and we would never think to suggest that such behavior is outright fraud. This raises the question—should we?

Thanks for reading! I learned quite a lot writing this post, and I hope you gained some insight reading it :D

##### Nathaniel Haines
###### Computational Psychologist & Data Scientist, PhD

An academic Bayesian who is currently exploring the high dimensional posterior distribution of life