On Curbing Your Measurement Error: From Classical Corrections to Generative Models

Introduction

In this post, we will explore how measurement error arising from imprecise parameter estimation can be corrected. Specifically, we will explore the case where our goal is to estimate the correlation between a self-report measure and a behavioral measure–a common situation throughout the social and behavioral sciences.

For example, as someone who studies impulsivity and externalizing psychopathology, I am often interested in whether self-reports of trait impulsivity (e.g., the Barratt Impulsiveness Scale) correlate with performance on tasks designed to measure impulsive behavior (e.g., the Balloon Analogue Risk Task). In these cases, it is common for researchers to compute summed or averaged scores on the self-report measure (e.g., summing item responses and dividing by the number of items) and summary statistics of behavioral performance (e.g., percent risky choices) for each subject’s behavioral measure. The resulting estimates are then entered into a secondary statistical model to make inference (e.g., a correlation between the self-reported trait and behavioral measures). Importantly, this two-stage approach to inference assumes that our summary measures contain no measurement error–or, equivalently, that we have estimated these summary measures with perfect precision–a very strong assumption that is surely not met in practice.

Here, we will explore how such assumptions can bias our statistical inferences on individual differences. As we will show, this bias arises because these two-stage approaches ignore important sources of measurement error. We will begin with an exploration of traditional methods developed within the context of classical test theory, and we will then transition to the use of more contemporary generative models. Throughout, we will explore relationships between classical and generative approaches, which are actually more similar than they are different in many ways.

Classical Corrections

Imprecision and Reliability

At the level of a single measure, we can think of reliability as directly corresponding to precision–or how close our estimates are to the underlying “true score”. As our estimates become more precise at the individual-level, we should be able to better infer differences between individuals.

For example, assume that we are interested in estimating a trait score for an individual, which we measure using a 10-item self-report questionnaire. For simplicity, let’s also assume that each item requires a yes/no endorse/not endorse response (coded 1 and 0, respectively), no items are reverse scored, and all items share the same difficulty. A typical approach to score this questionnaire would be to sum the responses across items \(t = 1, \dots, T\) and divide by the number of items \(T\) (i.e. taking the average). So, the vector of responses \(\textbf{x}\) for a single subject may look something like:

\[\textbf{x} = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1]\]

We then compute our observed score like so:

\[X = \frac{1}{T}\sum^{T}_{t=1}\textbf{x}_{t}\]

The resulting observed score \(X\) is simply the proportion of yes/endorsed responses, which is \(X = .7\) in this example. We then interpret \(X\) as a quantitative measure of the underlying construct. However, there are a few important, related questions worth asking regarding this estimate:

  1. How close is our observed score measure \(X\) to the “true score” we aim to measure?
  2. Can we use this measurement approach to make inference on individual differences?
  3. If so, how can we best estimate the “true score” for each individual?

From the perspective of classical test theory, we can answer these questions by determining the reliability of our measure. Below, we will define reliability and discuss how it can be used to answer the questions above.

Defining Reliability

In psychology and education, reliability is often discussed within the context of classical test theory, which assumes that our estimates reflect some combination of the unobserved “true score” plus some error:

\[X = \theta + \epsilon\] where \(\epsilon\) is measurement error that contaminates the “true score” \(\theta\). Note that there are different interpretations of what the truth actually is, but here we will use the following definition: the true score is the expected value of the observed score over many independent realizations (for alternative definitions of the true score, see Lord, Novick, & Birnbaum, 1968). Mathematically, we can represent this as:

\[\theta = \mathbb{E}[X]\] Following this definition, the error for a single realization is then defined as:

\[\epsilon = X - \mathbb{E}[X]\] where \(\mathbb{E}[\epsilon] = 0\) and \(\text{Cov}(\epsilon, \theta) = 0\). In English, the expected error is 0, and the correlation/covariance between the error and true score is 0.

Given these assumptions, reliability is then defined as the squared correlation between the true score and observed score, or \(\rho^{2}_{X, \theta}\). Intuitively, \(\rho^{2}_{X, \theta} \rightarrow 1\) as \(X \rightarrow \theta\), and \(\rho^{2}_{X, \theta}\) is attenuated to the extent that \(X \ne \theta\).

As a consequence of the assumption that \(\text{Cov}(\epsilon, \theta) = 0\), the observed, true, and error variances have the following relationship:

\[\sigma^{2}_X = \sigma^{2}_\theta + \sigma^{2}_\epsilon\]

Therefore, \(\rho^{2}_{X, \theta}\) can also be thought of as the proportion of variance that the true score accounts for in the observed score (similar to \(R^2\) in regression), or alternatively as 1 minus the ratio of error variance relative to observed score variance:

\[\rho^{2}_{X, \theta} = \frac{\sigma^{2}_\theta}{\sigma^{2}_{X}} = 1 - \frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{X}}\]
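
If the equivalence between these two expressions feels abstract, a quick simulation can make it concrete. The R code below–a minimal sketch assuming normal true scores and independent normal errors, purely for convenience–shows that the squared true/observed score correlation and the variance ratio agree:

# Quick check: simulate true scores and independent errors, then compare
# the squared true/observed score correlation to the variance ratio
set.seed(1)
n       <- 100000
theta   <- rnorm(n, 0, 1)    # true scores, sd = 1
epsilon <- rnorm(n, 0, .5)   # errors, sd = .5, Cov(theta, epsilon) = 0
X       <- theta + epsilon   # observed scores

cor(X, theta)^2       # ~ .8
var(theta) / var(X)   # ~ .8, since 1/(1 + .25) = .8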

We now have a few different ways to think about reliability. Now, it is time to estimate it.

Estimating Reliability

To estimate \(\rho^{2}_{X, \theta}\), we need to somehow estimate either the true score or error variance, neither of which is directly observed. How do we do this? The answer is actually quite nuanced, and it all comes down to how we want to conceptualize the error variance \(\sigma^{2}_\epsilon\).

If we are simply interested in how precisely we can estimate the true score from observed data, it may be best to think of \(\sigma^{2}_\epsilon\) as the uncertainty in our estimate across administrations of an identical measure (termed precision or parallel forms reliability). If we are interested in whether or not multiple items on a scale capture the same underlying construct, we may use something like Cronbach’s \(\alpha\) (termed internal consistency). If we are interested in the consistency of a measure over time, then \(\sigma^{2}_\epsilon\) may best be thought of as uncommon variance across different administrations of the same measure over an arbitrary period of time (termed temporal consistency or test-retest reliability). Crucially, the equations above underlie all these different forms of reliability–the big difference being how we conceptualize and compute \(\sigma^{2}_\epsilon\).

Given our focus here on precision, the current post will explore the first form of reliability mentioned above–parallel forms reliability. Typically, parallel forms reliability is estimated by having a group of people take two identical versions of the same measure and then estimating the correlation between measures across individuals. Here, identical means that the measures tap into the same underlying trait/construct (i.e. the underlying true score is the same), share the same item characteristics, and have the same error variance \(\sigma^{2}_\epsilon\). Under these assumptions, the correlation between the observed scores across measure 1 and 2 (\(r_{X,X'}\)) is equivalent to \(\rho^{2}_{X, \theta}\):

\[r_{X,X'} = \frac{\sigma^{2}_\theta}{\sigma^{2}_{X}} = \rho^{2}_{X, \theta}\] This relationship allows us to think of reliability in two different ways: (1) as the correlation between observed scores of identical measures, or (2) as the ratio of true score variance to observed score variance.

However, what if we do not have parallel forms? Obviously, having two identical measures is infeasible in many settings. Assuming that there are no practice effects, we could always have our subjects re-take the measure and then estimate \(\rho^{2}_{X, \theta}\) as the correlation between their scores across administrations, although this would still demand more time on the part of participants.

Reliability as Measurement Precision

One of the key advantages of framing reliability in terms of precision is that we can estimate \(\rho^{2}_{X, \theta}\) from a single administration of our measure. In this section, we will employ this method and compare it to the parallel forms/test-retest method, where reliability is characterized as the correlation between observed scores of identical measures.

Using precision to estimate the error variance of our measure is actually pretty straightforward. Returning to the example in the introduction, say we have a vector of yes/no responses for a single individual \(i\) who responded to 10 items: \(\textbf{x}_i = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1]\). As before, if we take the mean of this vector as the observed score, we have \(X_i = \frac{1}{T}\sum^{T}_{t=1}\textbf{x}_{i,t} = .7\).

Because we have binary observed data, we can view each response as a bernoulli trial, allowing us to use the bernoulli distribution to estimate the error variance and standard error. According to the wisdom of Wikipedia, the mean of a bernoulli distribution is \(p\), where \(p\) is the success probability (i.e. probability of observing 1 rather than 0). In our case, \(p\) is estimated by the observed mean \(X_i\).

Again taking from Wikipedia, the variance of a single bernoulli trial is given by \(pq\), where \(p\) is as defined above and \(q\) is the failure probability, \(1-p\). We can then compute the squared standard error of measurement for this individual, \(\sigma^{2}_{\epsilon,i}\) (taking the square root gives the standard error \(\sigma_{\epsilon,i}\)), by dividing this variance by \(T-1\), where \(T\) is the number of items:

\[\sigma^{2}_{\epsilon,i} = \frac{pq}{T-1} = \frac{.7 \times (1-.7)}{10-1} \approx 0.023\] The standard error of measurement corresponds to the standard deviation of the sampling distribution of our observed score. The standard error of measurement thus gives us useful information about how close our observed score is to the true score. In particular, as we increase the number of items on our measure, the sampling distribution of the mean becomes more concentrated. In turn, on average, our observed score becomes closer to the true score.
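
For the example response vector above, this computation takes just a few lines of R:

# Squared standard error of measurement for the example response vector
x <- c(1, 0, 1, 0, 1, 1, 1, 0, 1, 1)
p <- mean(x)              # observed score X_i = .7
q <- 1 - p
(p*q) / (length(x) - 1)   # ~ 0.023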

If these ideas are unfamiliar to you, it is useful to visualize what happens to the sampling distribution as we increase the number of items on the questionnaire. The R code below simulates the sampling distribution for each of a set of item sizes ranging from 10 to 100, using a success probability of .7:

# First load some packages we will use
library(dplyr)
library(foreach)
library(ggplot2)
library(ggridges)
library(truncnorm)
# num of items and success probability
n_items    <- seq(10, 100, by = 5)
success_pr <- .7

# number of samples from sampling distribution
n_reps <- 10000

# Generate binomial sampling distribution
results <- foreach(i=seq_along(n_items), .combine = "rbind") %do% {
  data.frame(T = rep(n_items[i], n_reps),
             X = replicate(n_reps, 
                           mean(rbinom(n_items[i], 1, prob = success_pr))))
}

# Compute standard deviation of the sampling distribution 
# (aka the standard error of the mean)
sem <- results %>%
  mutate(T = factor(T, levels = rev(unique(sort(T))))) %>%
  group_by(T) %>%
  summarize(se_lo = mean(X) - sd(X), # mean - 1 SEM
            se_hi = mean(X) + sd(X)) # mean + 1 SEM

# Plot some cool ridges
results %>% 
  mutate(T = factor(T, levels = rev(unique(sort(T))))) %>%
  ggplot(aes(x = X, y = T, group = T)) +
  geom_density_ridges(aes(fill = I("#DCBCBC"), color = I("white")), 
                      stat = "binline", binwidth = .01, scale = 3) +
  geom_segment(aes(x = se_lo, xend = se_lo, y = as.numeric(T),
                   yend = as.numeric(T) + 1),
               data = sem, size = 1) +
  geom_segment(aes(x = se_hi, xend = se_hi, y = as.numeric(T),
                   yend = as.numeric(T) + 1),
               data = sem, size = 1) +
  coord_cartesian(xlim = c(.3, 1)) +
  ggtitle("The Bernoulli Sampling Distribution") +
  xlab("Sample Mean (X)") +
  ylab("# Items (T)") +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank(),
        legend.position = "none")

Here, the black bars on either side of each distribution represent the mean \(\pm ~1\) standard deviation of the sampling distribution. As you can see, with an increasing number of items, the sampling distribution becomes more concentrated and more normal (i.e. Gaussian). For us, this means that the range of plausible observed scores decreases as we use more items–therefore, an observed score computed from a measure containing many items is probably closer to the true score than one computed from a measure with only a few items. In other words, our uncertainty becomes lower, and our estimate more precise.

Importantly, the standard error of measurement has a direct relationship to the reliability of our measure. As described by Lord (1955) (see also Lord, 1957), if we calculate \(\sigma^{2}_{\epsilon,i}\) for all \(N\) individuals, the average of the resulting squared standard errors of measurement corresponds to the error variance we use to estimate reliability:

\[\sigma^{2}_\epsilon = \frac{1}{N}\sum_{i=1}^{N}\sigma^{2}_{\epsilon,i}\] Then, if \(X\) is the vector of observed scores for each individual, the total or observed variance is simply:

\[\sigma^{2}_X = \text{Var}(X)\] Remember, \(X\) here is a vector of observed scores across all individuals. Reliability is then computed using the same formula as always:

\[\rho^{2}_{X,\theta} = 1 - \frac{\sigma^{2}_\epsilon}{\sigma^{2}_X}\] The implication here is that we can actually use what we know about the standard error/precision of our estimates for each individual to estimate reliability pretty easily from just a single measure. In fact, assuming that the underlying true score is equivalent across measures/timepoints, this approach is equivalent to the parallel forms or test-retest reliability estimate that we discussed above. We can check this correspondence with a quick simulation, wherein we can look at the reliability estimates for each approach as a function of the number of items:

set.seed(43202)

# Number of subjects and items
n_subjs <- 30
n_items <- seq(10, 500, length.out = 30)

# Random sample of "true" scores around success prob. p = .7
theta <- rnorm(n_subjs, .7, .1)

# Estimate standard error of measurement (squared)
est_se2 <- function(x) {
  # Success and failure probability
  n <- length(x)
  p <- mean(x)
  q <- 1 - p

  sig2_ep_i <- (p*q)/(n-1)

  return(sig2_ep_i)
}

# Estimate observed and true score
rel_dat <- foreach(t=seq_along(n_items), .combine = "rbind") %do% {
    # Parallel form 1 (or timepoint 1 administration)
    X_all_f1 <- foreach(i=seq_along(theta), .combine = "rbind") %do% {
      # Simulate from binomial distribution
      rbinom(n_items[t], 1, prob = theta[i])
    }
    # Parallel form 2 (or timepoint 2 administration)
    X_all_f2 <- foreach(i=seq_along(theta), .combine = "rbind") %do% {
      # note theta here is equivalent to above (i.e. true scores are the same)
      rbinom(n_items[t], 1, prob = theta[i])  
    }
    
    # Computing X_i (p) for each individual
    X_f1 <- rowMeans(X_all_f1)
    X_f2 <- rowMeans(X_all_f2)
    
    # Standard error of measurement approach (just using form/timepoint 1)
    sig2_ep <- mean(apply(X_all_f1, 1, est_se2)) 
    sig2_X   <- var(X_f1)
    
    data.frame(n_items  = n_items[t],
               rho2_pf  = cor(X_f1, X_f2), # Parallel form/test-retest approach
               rho2_sem = 1 - (sig2_ep/sig2_X))
}
# ooooooh nice :D
rel_dat %>%
  ggplot() +
  geom_line(aes(x = n_items, y = rho2_pf), color = I("#DCBCBC")) +
  geom_line(aes(x = n_items, y = rho2_sem), color = I("#8F2727")) +
  ggtitle("Approaches to estimating reliability") +
  xlab("Number of Items") +
  ylab(bquote("Reliability (" ~rho[X~~","~theta]^2~ ")")) +
  annotate("text", x = 220, y = 1, label = "Parallel Forms", 
           color = "#DCBCBC", size = 5) +
  annotate("text", x = 320, y = .8, label = "Standard Error of\nMeasurement", 
           color = "#8F2727", size = 5) +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())

It is clear from the above figure that both approaches give the same reliability estimates on average, although there is some noise due to the probabilistic nature of the simulation.

Using Reliability to Improve Individual-level Inference

Importantly, we can use our reliability estimate to improve our inference on individual-level true scores.

Specifically, Kelley demonstrated–in as early as 1920 (see Kelley, 1947)–that we can estimate the true score for each individual, \(\hat{\theta}_i\), by using the reliability estimate to regress each observed score toward the group mean:

\[\hat{\theta}_i = (1-\rho^{2}_{X,\theta})\bar{X} + \rho^{2}_{X,\theta}X_i\]

Here, \(\bar{X}\) is the mean of the observed scores across all subjects, given by \(\bar{X} = \frac{1}{N}\sum_{i=1}^{N}X_i\). Intuitively, the true score for each subject is estimated by pooling their observed score toward the group-level mean score in proportion to the reliability of the individual-level estimate. If \(\rho^{2}_{X,\theta} = 1\), then \(X_i = \hat{\theta}_i\) for all individuals–there is no pooling. Conversely, as \(\rho^{2}_{X,\theta} \rightarrow 0\), the individual-level observed scores have no weight at all, and the estimated true score is simply the group-level mean for each individual.
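
To see the pooling behavior in isolation, here is Kelley’s formula as a small R helper, applied to a few hypothetical observed scores (the function name and scores are mine, just for illustration):

# Kelley's formula: shrink observed scores toward the group mean
# in proportion to reliability (rho2)
kelley <- function(X, rho2) (1 - rho2)*mean(X) + rho2*X

X <- c(.9, .7, .5)      # hypothetical observed scores (group mean = .7)
kelley(X, rho2 = 1)     # no pooling: .9, .7, .5
kelley(X, rho2 = .5)    # halfway to the mean: .8, .7, .6
kelley(X, rho2 = 0)     # complete pooling: .7, .7, .7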

In addition to estimating the true scores as observed scores pooled toward the group mean, Kelley also showed that the standard error of true scores is reduced in the following manner (for a derivation see this technical report by Brennan, 2012):

\[\hat{\sigma}_{\epsilon} = \sigma_{X}\sqrt{1-\rho^{2}_{X,\theta}}\sqrt{\rho^{2}_{X,\theta}}\] This standard error of the true score estimates is lower than the standard error of measurement of the observed scores, which is given by \(\sigma_{\epsilon} = \sigma_{X}\sqrt{1-\rho^{2}_{X,\theta}}\). Comparing the equations, we see that the true score standard error incorporates an additional term, \(\sqrt{\rho^{2}_{X,\theta}}\).

These equations are rather interesting for a few reasons. First, they were discovered in 1920 (100 years ago!), long before James-Stein estimators (which similarly pool individual-level estimates toward the group-level mean) made their way into statistics (check out these slides for some cool historical notes on this).

Second, they have a Bayesian interpretation! In particular, if we assume that the “true scores” have a normal prior distribution, and that the observed scores are normally distributed around the true scores, empirical Bayesian estimation produces posterior means that are equivalent to the Kelley true score point estimates described above (see page 22 of de Gruijter & van der Kamp, 2008). It is also worth noting that many popular multilevel/hierarchical modeling software packages work using something akin to empirical Bayesian estimation (e.g., lmer in R). It goes without saying that using group-level information can be exceedingly useful for improving individual-level inference.
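
As a brief aside, here is a minimal sketch of that connection using lme4 (lme4 is not used elsewhere in this post, so treat the details as illustrative rather than canonical): a logistic mixed model with a per-subject intercept yields shrunken, empirical-Bayes-style estimates of each subject’s success probability:

# Illustrative only: empirical-Bayes-style shrinkage with lme4
library(lme4)

set.seed(1)
n_subj  <- 20
n_items <- 10

# Per-subject success probabilities (clamped to [0,1] to be safe)
p_true <- pmin(pmax(rnorm(n_subj, .7, .1), 0), 1)

# Long-format binary responses: one row per subject x item
dat <- data.frame(subj = factor(rep(1:n_subj, each = n_items)),
                  y    = rbinom(n_subj*n_items, 1,
                                prob = rep(p_true, each = n_items)))

# Logistic mixed model with a per-subject intercept
fit <- glmer(y ~ 1 + (1 | subj), family = binomial, data = dat)

# Shrunken per-subject success probabilities (cf. raw means)
p_hat <- plogis(coef(fit)$subj[["(Intercept)"]])

Plotting p_hat against the raw per-subject means would show the same pooling toward the group mean that Kelley’s formula produces.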

OK, so this has all been a bit abstract. To make things more concrete, we can run a simulation to observe how the true score estimation/pooling described above works. The R code below simulates data from a binomial distribution for 20 “subjects” with success probabilities centered around \(.7\). Additionally, we will use item sizes of 10, 30, and 100 to demonstrate how the pooling effect changes as a function of the number of items in a measure (due to the effect of the number of items on the resulting reliability).

set.seed(43202)

# Number of subjects and items
n_subj  <- 20
n_items <- c(10, 30, 100)

# Random sample of "true" scores around .7
theta  <- rnorm(n_subj, .7, .1)

# Estimate observed and true score
dis_dat <- foreach(i=seq_along(n_items), .combine = "rbind") %do% {
  # Generate observed data for each subject using "true" score
  X_all <- foreach(t=seq_along(theta), .combine = "rbind") %do% {
   rbinom(n_items[i], 1, prob = theta[t]) 
  }
  
  # group average observed score
  X_bar <- mean(rowMeans(X_all))
  
  # Reliability
  X <- rowMeans(X_all)
  
  # Standard error of measurement approach
  sig2_ep  <- mean(apply(X_all, 1, est_se2)) 
  sig2_X   <- var(X)
  rho      <- 1 - (sig2_ep/sig2_X)

  foreach(t=seq_along(theta), .combine = "rbind") %do% {
    # Using observed scores from parallel form 1
    X_obs <- X_all[t,]
    X_i   <- mean(X_obs)
    
    data.frame(subj_num  = t,
               n_items   = n_items[i],
               theta     = theta[t],
               rho       = rho,
               X         = X_i,
               se_obs    = sd(X)*sqrt(1-rho),
               se_hat    = sd(X)*sqrt(1-rho)*sqrt(rho),
               theta_hat = (1-rho)*X_bar + rho*X_i)
  }
}

# Plot true, observed, and estimated true scores
dis_dat %>%
  mutate(subj_num = reorder(subj_num, theta)) %>%
  ggplot(aes(x = subj_num, y = theta)) + 
  geom_point(color = I("black")) +
  geom_point(aes(x = subj_num, y = X),
             color = I("#DCBCBC"),
             position = position_jitter(width=.2, height=0, seed = 1)) +
  geom_linerange(aes(x = subj_num,
                    ymin = X - 1.96*se_obs,
                    ymax = X + 1.96*se_obs),
                color = I("#DCBCBC"),
                position = position_jitter(width=.2, height=0, seed = 1)) +
  geom_point(aes(x = subj_num, y = theta_hat), 
             color = I("#8F2727"),
             position = position_jitter(width=.2, height=0, seed = 2)) +
  geom_linerange(aes(x = subj_num, 
                    ymin = theta_hat - 1.96*se_hat, 
                    ymax = theta_hat + 1.96*se_hat), 
                color = I("#8F2727"), 
                position = position_jitter(width=.2, height=0, seed = 2)) +
  geom_hline(yintercept = X_bar, linetype = 2, color = I("gray")) +
  annotate("text", x = 15, y = .4, label = expression("True"~theta[i]), 
           color = "black", size = 5) +
  annotate("text", x = 15, y = .3, label = expression("Obs"~X[i]),
           color = "#DCBCBC", size = 5) +
  annotate("text", x = 15, y = .2, label = expression("Est"~hat(theta)[i]),
           color = "#8F2727", size = 5) +
  facet_wrap(c("n_items"), nrow = 1) +
  ggtitle("Regression-Based True Score Estimates") +
  xlab("Subject") +
  ylab("Value") +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank(),
        axis.text.x.bottom = element_blank())

I don’t know about you, but I think this is pretty cool for a technique developed in 1920 :D. What we see is that the Kelley regression-based point estimates are pooled toward the group-level average (represented by the horizontal gray dotted line) in proportion to the reliability of the measure, which increases as a function of the number of items. Further, the confidence intervals for the estimated true scores \(\hat{\theta}_i\) are much narrower than those of the observed scores–yet they still exhibit good coverage properties of the underlying true scores \(\theta_i\).

A Brief Note on Assumptions

It is worth noting that the models we have discussed so far make many assumptions that might not be met in practice (e.g., normality of the sampling distribution, which is clearly not true when the number of items is low). Much of applied frequentist modeling relies on similar assumptions regarding normality when computing standard errors and p-values. Additionally, our model assumes that responses to each item are random with respect to the underlying true score, which is not true if items are of different difficulties. The need to relax these latter assumptions was the impetus for the development of what is now called Generalizability Theory, or G-Theory, which seeks to tease apart these various sources of error.

Using Reliability to Disattenuate a Correlation

So, it is now clear that correcting for low reliability can improve inference on individual-level true scores. However, how do these ideas translate to measuring the correlation between the true scores of two different measures? When we are interested in such quantities, we need to take into account measurement error–else we obtain an attenuated, or biased, correlation estimate, and we may falsely infer that a relationship does not exist when it does at the level of true scores.

The equation for attenuation is actually quite straightforward, and it is similar to Kelley’s regression-based true score estimation equation above. If we have two measures, say measures \(M_1\) and \(M_2\), each with corresponding reliability estimates of \(\rho^{2}_{M_1,\theta_1}\) and \(\rho^{2}_{M_2,\theta_2}\), the correlation between the observed scores of the measures \(r_{M_1,M_2}\) is attenuated as follows:

\[r_{M_1,M_2} = \hat{r}_{\theta_1,\theta_2}\sqrt{\rho^{2}_{M_1,\theta_1}\rho^{2}_{M_2,\theta_2}}\] Here, \(\hat{r}_{\theta_1,\theta_2}\) is our estimate of the correlation between true scores, but the observed correlation \(r_{M_1,M_2}\) is attenuated by the reliability of each measure. Intuitively, if \(\rho^{2}_{M_1,\theta_1} = \rho^{2}_{M_2,\theta_2} = 1\), then the observed scores on each measure are equal to the true scores (i.e. \(\theta_1 = M_1\) and \(\theta_2 = M_2\)), and there is no attenuation. Otherwise, the true correlation becomes attenuated in proportion to the square root of the product of each measure’s reliability.
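
A quick simulation–again assuming normal true scores and independent normal errors for convenience–confirms that the observed correlation shrinks by exactly this factor:

# True scores correlate at .6; adding independent measurement error
# attenuates the observed correlation by sqrt(rel1*rel2)
set.seed(1)
n      <- 100000
theta1 <- rnorm(n)
theta2 <- .6*theta1 + sqrt(1 - .6^2)*rnorm(n) # cor(theta1, theta2) = .6
M1     <- theta1 + rnorm(n, 0, 1)             # reliability = 1/(1+1)   = .5
M2     <- theta2 + rnorm(n, 0, .5)            # reliability = 1/(1+.25) = .8

cor(M1, M2)      # ~ .38
.6*sqrt(.5*.8)   # ~ .38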

To disattenuate the observed correlation, we can just use some algebra to rearrange the equation to get:

\[\hat{r}_{\theta_1,\theta_2} = \frac{r_{M_1,M_2}}{\sqrt{\rho^{2}_{M_1,\theta_1}\rho^{2}_{M_2,\theta_2}}}\]

Although this disattenuation formula is widely accepted, how to best estimate the corresponding confidence interval–which is necessary for making statistical inference–is much more controversial. For our purposes, we will use a simplified approach, which applies the disattenuation formula above to the lower and upper bounds of the observed correlation confidence interval (see Winne & Belfry, 1982):

\[\frac{Lower}{\sqrt{\rho^{2}_{M_1,\theta_1}\rho^{2}_{M_2,\theta_2}}} < \hat{r}_{\theta_1,\theta_2} < \frac{Upper}{\sqrt{\rho^{2}_{M_1,\theta_1}\rho^{2}_{M_2,\theta_2}}}\]
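
In R, both the point estimate and the interval corrections amount to a one-line helper (the function name and the CI bounds below are hypothetical, just for illustration):

# Disattenuate a correlation (or its CI bounds) given two reliabilities
disattenuate <- function(r, rel1, rel2) r / sqrt(rel1*rel2)

disattenuate(.38, rel1 = .5, rel2 = .8)          # point estimate ~ .6
disattenuate(c(.30, .46), rel1 = .5, rel2 = .8)  # lower/upper bounds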

We will explore these disattenuation corrections in more detail after we cover generative models below.

A Brief Note on Disattenuation and the Meaning of Test-retest Reliability

Before moving on, it is worth emphasizing the implications of the above attenuation equation for research on changes in individual differences over time. For example, if I am interested in how a psychological construct changes over the span of 1 month, I may have a group of participants take the same measure at two separate timepoints, followed by computing a correlation to determine if individual differences at time 1 are consistent at time 2. In this case, we would actually need to correct this correlation by the reliability of each measure if we wanted to infer whether or not observed changes result from actual changes in the underlying, supposedly trait-like construct (i.e. true score), versus measurement error resulting from imprecision. This is an important distinction, because confusion between these different sources of variation (i.e. actual change versus low precision) can result in strong conclusions regarding the utility of certain measures for individual differences research. If you are interested in more on this topic, I cover it in detail in a previous blog post, where I dive into the so-called “Reliability Paradox”. In fact, it isn’t so paradoxical after all.

Generative Modeling

Generative models can accomplish all of what we described above pertaining to classical test theory, but in a more intuitive and easier-to-implement way–or at least that is what I hope to convince you of by the end of this post.

When building generative models, as opposed to thinking about “true” and “observed” scores, we will think about “true” and “estimated” data-generating parameters, respectively. As opposed to referring to “error variance” or “standard errors of the mean”, we will discuss “uncertainty”–or precision–in our parameter estimates. Finally, as opposed to relying on the asymptotic assumptions necessary for us to estimate error variances using normal approximations, we will use hierarchical Bayesian modeling to jointly account for our uncertainty across all parameters of interest.

The Goal of Generative Modeling

From the perspective of generative modeling, our goal is to specify a joint probability distribution that describes both the relationships between parameters within our model and the relationship between model parameters and observed data. By specifying a generative model, all of our assumptions are made explicit, which is a useful practice conducive to theory development and testing.

For the current application, we are interested in the relationship between a trait and a behavioral measure, which are captured using a questionnaire and a response time task, respectively. Our goal is to determine if individual differences in one measure relate to individual differences in the other. Therefore, we need to develop models of the data-generating process underlying: (1) responses to questionnaire items, (2) response time distributions, and (3) the relationship between the parameters of 1 and 2. Additionally, we need to ensure that our model accounts for the uncertainty in our individual-level parameter estimates (analogous to dissociating error from observed scores to get true scores within classical test theory). Finally, we will compare how well the generative model uncovers individual differences relative to the disattenuation approach used within classical test theory.

Generative Model of Questionnaire Data

For the questionnaire data, we will use the bernoulli distribution, where the response to each item can be thought of as a bernoulli trial. Recall that we used the bernoulli distribution within the classical test theory example to estimate the standard error for our reliability calculation.

The bernoulli model, as we will now refer to it, is given by:

\[\text{Pr}(x_{i,t}=1) = p_{i} = 1 - \text{Pr}(x_{i,t}=0) = 1 - q_i\]

Pretty simple! As in the classical test theory example, \(p_i\) is the success probability, or the probability that the response of individual \(i\) is 1. Then, \(q_i\) is the failure probability, or the probability that the response is 0 (which is simply \(1-p_i\)). We will then define \(p_i\) such that:

\[p_i = \frac{1}{1+\text{exp}(-\theta_i)}\] where \(\theta_i ~(-\infty < \theta_i < +\infty)\) is unbounded, and the logistic function transforms \(\theta_i\) to ensure that \(p_i\) is between 0 and 1. Note that we use this transformation because \(p_i\) is a probability, and therefore must be between 0 and 1. We could use other types of transformations, but I chose the logistic function because it may be more familiar to readers (logistic regression!) compared to other functions. In fact, this model can be thought of as a very simple Item Response Theory model, with only an “ability” parameter for each person (i.e. \(\theta_i\)), item difficulties set to 0, and item discriminations set to 1.

Given the above generative specification, the parameter that we actually need to estimate for each individual is \(\theta_i\), where \(p_i\) is then determined by the logistic function.
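
Note that the logistic function is built into R as plogis(), so the transformation is easy to play with:

# The logistic function maps unbounded theta to a probability
theta <- c(-2, 0, 2)
1 / (1 + exp(-theta))   # 0.119, 0.500, 0.881
plogis(theta)           # same thing, built into R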

However, in addition to specifying how the observed responses are generated, we also need to specify how the individual-level \(\theta_i\) parameters are generated. Similar to how the classical test theory approach requires us to assume a population-level sampling distribution, the generative modeling approach requires us to assume a group-level generative distribution. For simplicity, we may assume that the individual-level \(\theta_i\) parameters are generated by a standard normal distribution:

\[\theta_i \sim \mathcal{N}(0,1)\] In Bayesian terminology, such a group-level distribution is often referred to as a prior distribution on \(\theta\), but I actually do not like this terminology–I instead prefer to think of it as our generative model of the parameters themselves. My reasoning is simple–we do not actually need to specify the parameters of the group-level normal distribution like we did above. Instead, we could estimate these group-level parameters from the data itself:

\[\theta_i \sim \mathcal{N}(\mu_{\theta},\sigma_{\theta})\]

Then, we need to make a generative assumption regarding the group-level mean and standard deviation parameters (or a prior on the group-level parameters in typical Bayes speak). In our case, we will assume that \(\mu_{\theta} \sim \mathcal{N}(0,1)\) and \(\sigma_{\theta} \sim \text{half-}\mathcal{N}(0,1)\). Here, \(\text{half-}\mathcal{N}\) indicates a half-normal distribution, which only places probability mass on values greater than 0 (given that standard deviations must be positive). Note that the generative model is now considered hierarchical–as we fit the model, the individual-level parameters will influence the group-level parameters, which will in turn influence all individual-level parameters. Much like the regression-based true score estimates in classical test theory, our individual-level parameters will be pooled toward the group-level mean, which will also shrink the uncertainty intervals at the individual level.

Because it is difficult to get a sense of how to interpret these generative assumptions with respect to the outcome, we can generate samples from the model to observe the distribution of \(p_i\) and ensure that it aligns with our expectations:

# Number of samples we will draw from the priors on group-level dist
n_draw <- 12
# Number of individual-level theta parameters drawn from group dist 
n_subj <- 30

# Loop through and plot
foreach(i=1:n_draw, .combine = "rbind") %do% {
  # Sample group-level parameters from priors 
  mu_theta <- rnorm(1,0,1)
  sig_theta <- rtruncnorm(1,0,1,a=0,b=Inf) # normal dist truncated at 0
  
  # Sample individual level parameters from group distribution
  theta <- rnorm(n_subj, mu_theta, sig_theta)
  # Logistic transform to ensure 0 < p < 1
  p <- 1/(1+exp(-theta))
  data.frame(n_draw = rep(i, n_subj),
             p      = p)
} %>%
  ggplot(aes(x = p)) +
  geom_histogram(fill = I("#8F2727")) +
  xlab(expression(p[i])) +
  facet_wrap("n_draw", ncol = 3, scales = "free_y") +
  theme_minimal(base_size = 12) +
  theme(panel.grid = element_blank())

Note that each of these 12 panels is a separate sample from the priors on the group-level distribution (i.e. separate draws from \(\mu_{\theta} \sim \mathcal{N}(0,1)\) and \(\sigma_{\theta} \sim \text{half-}\mathcal{N}(0,1)\)). Within each panel, 30 individual-level \(\theta_i\) parameters have been sampled, which we then transformed to the bernoulli success probability \(p_i\) plotted above.

From the panels, we can see that our generative specification can produce many different distributions of possible success probabilities–in other words, our model is relatively uninformative with respect to what the individual-level success probabilities will be. For our application, this is just fine! Because this is such a simple model, I feel comfortable looking at only the success probability generated from our model. When models are more complex, it is useful to do more rigorous prior predictive simulations, which simulate to the level of observed data.
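
For completeness, here is a minimal sketch of what simulating to the level of observed data would look like for the bernoulli model, using the same priors as above (the object names are mine):

# One prior predictive draw, all the way down to item responses
mu_theta  <- rnorm(1, 0, 1)
sig_theta <- rtruncnorm(1, a = 0, mean = 0, sd = 1)
theta     <- rnorm(30, mu_theta, sig_theta) # 30 simulated subjects
p         <- 1/(1 + exp(-theta))

# 10 simulated item responses per subject (30 x 10 matrix)
Y_sim <- t(sapply(p, function(p_i) rbinom(10, 1, prob = p_i)))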

Generative Model Parameters vs Classical Test Theory True Scores

Before moving on, we can take a look at how the generative model estimates compare to the classical test theory regression-based true score estimates that we computed above. As a refresher, the regression-based true score estimates were pooled toward the group-level mean, and they were also more precise (i.e. smaller confidence intervals) compared to the observed score counterparts.

First, here is the Stan code we will use to fit the generative bernoulli model (note that the compiled model is saved to the variable m_bernoulli in the R environment):

# Load in rstan first
library(rstan)
data {
    int N;      // # of subjects
    int N_items; // # of items
    int Y[N, N_items]; // Responses for each subject and item
}
parameters {
  // Group-level parameters
  // SDs
  real<lower=0> sigma_theta;
  // means
  real mu_theta;
  
  // Individual-level parameters
    vector[N] theta_pr;
}
transformed parameters {
  // Individual-level parameters 
  vector[N] theta;
  
  // Compute individual-level parameters from non-centered parameterization
  theta = mu_theta + sigma_theta * theta_pr;
}
model {
  // Priors on group-level means
  mu_theta ~ normal(0, 1);
  
  // Priors on group-level SDs
  sigma_theta ~ normal(0, 1);
  
  // Priors on individual-level parameters
  theta_pr ~ normal(0, 1);
    
  // For each subject
  for (i in 1:N) {
    // self-report model
    Y[i,:] ~ bernoulli_logit(theta[i]);
  }
}
generated quantities {
  vector[N] p;
  // Success probability estimate for each individual
  p = inv_logit(theta);
}

The Stan code above uses a non-centered parameterization of the group-level portion of the model for reasons described in a previous post, but otherwise is mathematically equivalent to the generative model as described by the equations above.
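
If you have not encountered the non-centered parameterization before, the idea is easy to demonstrate in R: shifting and scaling a standard normal draw is distributionally identical to drawing from \(\mathcal{N}(\mu, \sigma)\) directly:

# Non-centered parameterization in a nutshell: shifting and scaling a
# standard normal draw is the same as drawing from N(mu, sigma)
mu    <- 1
sigma <- .5
z     <- rnorm(100000, 0, 1)  # "theta_pr" in the Stan code
theta <- mu + sigma*z         # same distribution as rnorm(100000, mu, sigma)
c(mean(theta), sd(theta))     # ~ 1 and .5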

The R code below fits the generative bernoulli model to the same data we used above when estimating true scores, with item sizes of 10, 30, and 100 to demonstrate the effects of pooling:

# seeded to reproduce same theta's and data as above for comparison
set.seed(43202) 

# Number of subjects and items
n_subj  <- 20
n_items <- c(10, 30, 100)

# Random sample of "true" scores around .7
theta  <- rnorm(n_subj, .7, .1)

# Estimate observed and true score
gen_dat <- foreach(i=seq_along(n_items), .combine = "rbind") %do% {
  # Generate observed data for each subject using "true" score
  X_all <- foreach(t=seq_along(theta), .combine = "rbind") %do% {
   rbinom(n_items[i], 1, prob = theta[t]) # theta same as in above example
  }
  # Fit generative model
  fit_bernoulli <- sampling(m_bernoulli,
                            data   = list(N = n_subj,
                                          N_items = n_items[i],
                                          Y = X_all),
                            iter   = 3000,
                            warmup = 500,
                            chains = 4,
                            cores  = 4,
                            seed  = 43202) 
  pars <- rstan::extract(fit_bernoulli)
  foreach(t=seq_along(theta), .combine = "rbind") %do% {
    # Using observed scores from parallel form 1
    bayes_est <- pars$p[,t]
    hdi <- hBayesDM::HDIofMCMC(bayes_est)
    
    data.frame(subj_num  = t,
               n_items   = n_items[i],
               theta     = theta[t],
               bayes_theta = mean(bayes_est),
               bayes_lo    = hdi[1],
               bayes_hi    = hdi[2])
  }
}

# Plot true, observed, and estimated true scores
dis_dat %>%
  full_join(gen_dat) %>%
  mutate(subj_num = reorder(subj_num, theta)) %>%
  ggplot(aes(x = subj_num, y = theta)) + 
  geom_point(color = I("black")) +
  geom_point(aes(x = subj_num, y = theta_hat), 
             color = I("#DCBCBC"),
             position = position_jitter(width=.3, height=0, seed = 2)) +
  geom_linerange(aes(x = subj_num, 
                    ymin = theta_hat - 1.96*se_hat, 
                    ymax = theta_hat + 1.96*se_hat), 
                color = I("#DCBCBC"), 
                position = position_jitter(width=.3, height=0, seed = 2)) +
  geom_point(aes(x = subj_num, y = bayes_theta), 
             color = I("#8F2727"),
             position = position_jitter(width=.3, height=0, seed = 1)) +
  geom_linerange(aes(x = subj_num, 
                    ymin = bayes_lo, 
                    ymax = bayes_hi), 
                color = I("#8F2727"),
                position = position_jitter(width=.3, height=0, seed = 1)) +
  geom_hline(yintercept = X_bar, linetype = 2, color = I("gray")) +
  annotate("text", x = 15, y = .4, label = expression("True"~theta[i]), 
           color = "black", size = 5) +
  annotate("text", x = 15, y = .3, label = expression("Est"~hat(theta)[i]),
           color = "#DCBCBC", size = 5) +
  annotate("text", x = 15, y = .2, label = expression("Gen"~hat(theta)[i]), 
           color = "#8F2727", size = 5) +
  facet_wrap(c("n_items"), nrow = 1) +
  ggtitle("Regression-Based True Scores vs\nGenerative Model Estimates") +
  xlab("Subject") +
  ylab("Value") +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank(),
        axis.text.x.bottom = element_blank())

Again, I think this is super cool! What we see is that the generative model expectations (i.e. the posterior means) and uncertainty intervals (i.e. the 95% highest density intervals) are very similar to the corresponding regression-based true score estimates and 95% confidence intervals. In fact, the point estimates for both approaches are almost indistinguishable. Given that the regression-based true scores have a Bayesian interpretation, this should not be too surprising, yet it is still nice to see it work out empirically.

More generally, this example demonstrates that generative models can give us the same “true scores” that classical test theory does, yet we do not have to compute reliability, etc., to get there. Instead, we made generative, distributional assumptions regarding how our model parameters related to each other (e.g., the group-level generative model) and to the observed data. Then, we took the posterior expectation (i.e. the posterior mean) of the individual-level parameters to get our best estimates of the “true” underlying data-generating parameters.

Generative Model of Response Time Data

For the response time model, we will assume that response times for each individual \(i\) and for each trial \(t\) arise from a normal distribution, which we will term the normal model:

\[\text{RT}_{i,t} \sim \mathcal{N}(\delta_i, \sigma_i)\] Here, \(\delta_i\) and \(\sigma_i\) indicate the mean and standard deviation, respectively, of the response time distribution for individual \(i\). Of course, response times are not typically normally distributed in many behavioral tasks, but we will ignore that for this demonstration.

Like the bernoulli model for the questionnaire data, we can then assume that the individual-level normal model parameters are themselves generated by normal distributions such that \(\delta_i \sim \mathcal{N}(\mu_{\delta},\sigma_{\delta})\) and \(\text{log}(\sigma_i) \sim \mathcal{N}(\mu_{\sigma},\sigma_{\sigma})\). To specify the group-level parameters, it is important to remember that response times are positive valued, and typically greater than 200 milliseconds. We could encode this constraint into the prior distributions. However, for the sake of brevity, we will not focus too much on this.

For the \(\delta\) parameters, we can assume that \(\mu_{\delta} \sim \mathcal{N}(1,1)\) and \(\sigma_{\delta} \sim \text{half-}\mathcal{N}(0,0.5)\). Next, for the \(\sigma\) parameters, we can assume that \(\mu_{\sigma} \sim \mathcal{N}(-1,0.2)\) and \(\sigma_{\sigma} \sim \text{half-}\mathcal{N}(0,0.5)\). This parameterization should ensure that response times are generally above 0, while still allowing a good amount of variation. It is much harder to form an intuition for the types of response time distributions these priors would produce. Therefore, let us simulate!

# Number of samples we will draw from the priors on group-level dist
n_draw <- 10
# Number of individual-level parameter sets drawn from group dist 
n_subj <- 5
# Number of trials to simulate from individual-level normal distributions
n_trials <- 200

# Loop through and plot
foreach(d=1:n_draw, .combine = "rbind") %do% {
  # Sample group-level parameters from priors 
  mu_delta  <- rnorm(1,1,1)
  sig_delta <- rtruncnorm(1,0,.5,a=0,b=Inf) # normal dist truncated at 0
  mu_sigma  <- rnorm(1,-1,.2)
  sig_sigma <- rtruncnorm(1,0,.5,a=0,b=Inf)
  
  # Sample individual-level parameters from group dist
  delta <- rnorm(n_subj, mu_delta, sig_delta)
  sigma <- exp(rnorm(n_subj, mu_sigma, sig_sigma))
    
  foreach(i=1:n_subj, .combine = "rbind") %do% {
    # Sample individual level response times from indiv dist
    RT <- rnorm(n_trials, delta[i], sigma[i])
    
    data.frame(n_draw = rep(d, n_trials),
               n_subj = rep(i, n_trials),
               RT     = RT)
  }
} %>%
  ggplot(aes(x = RT)) +
  geom_histogram(fill = I("#8F2727"), binwidth = .5) +
  xlab("Response Time") +
  facet_grid(n_draw ~ n_subj, scales = "free") +
  coord_cartesian(xlim = c(-2, 2)) +
  theme_minimal(base_size = 12) +
  theme(panel.grid = element_blank())

Here, the rows represent different draws for the group-level distribution parameters, where the columns are the response time distributions generated by 5 different “subjects” drawn from the given group-level parameters. Note also that I have zoomed in so that the x-axis is between -2 and 2. There are clearly some aspects that we could improve on. For example, many response times are below 0, which is not possible (although this is not relevant for our example given our use of simulated data). That said, feel free to modify the priors to get a sense of what happens!

A Joint Generative Model of Questionnaire and Response Time Data

We have now demonstrated how generative models relate to classical test theory through the use of regression-based true score estimates, and we have developed generative models for both the questionnaire and response time data that we would like to use to make inference on individual differences. In the current application, this amounts to estimating a correlation between the generative parameters for each task, or the correlation between \(\theta_i\) and \(\delta_i\).

Although there are many ways to estimate such a correlation, one straightforward method is to assume that \(\theta_i\) and \(\delta_i\) are drawn from a multivariate normal distribution as opposed to from independent normal distributions. Mathematically, we can make this change by modifying \(\theta_i \sim \mathcal{N}(\mu_{\theta},\sigma_{\theta})\) and \(\delta_i \sim \mathcal{N}(\mu_{\delta},\sigma_{\delta})\) to instead be:

\[\begin{bmatrix} \theta_{i} \\ \delta_{i} \end{bmatrix} \sim \text{MVNormal} \bigg (\begin{bmatrix} \mu_{\theta} \\ \mu_{\delta} \end{bmatrix}, \mathbf{S} \bigg )\] Here, \(\mathbf S\) is a covariance matrix, which can be decomposed into the group-level standard deviations and a 2 by 2 correlation matrix \(\mathbf R\) that captures the correlation between the individual-level generative parameters:

\[\begin{align*} \mathbf S & = \begin{pmatrix} \sigma_{\theta} & 0 \\ 0 & \sigma_{\delta} \end{pmatrix} \mathbf R \begin{pmatrix} \sigma_{\theta} & 0 \\ 0 & \sigma_{\delta} \end{pmatrix} \end{align*}\]

Here, the correlation matrix contains one free parameter on the off-diagonal, \(\rho_{\theta,\delta}\), which is the correlation of interest:

\[\begin{align*} \mathbf R & = \begin{pmatrix} 1 & \rho_{\theta,\delta} \\ \rho_{\theta,\delta} & 1 \end{pmatrix} \end{align*}\]

The only other change we need to make is to set a prior distribution on the correlation matrix–or just \(\rho_{\theta,\delta}\) in our case. For our purposes, we will assume a non-informative prior, given by \(\mathbf R \sim \text{LKJcorr} (1)\). The LKJ distribution assumes that the off-diagonal correlations are beta-distributed after being rescaled to \((-1, 1)\), ensuring that our correlation is between -1 and 1. In our case, with a shape parameter of 1, the distribution is uniform across -1 to 1.
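
We can visualize this prior without Stan by exploiting a known fact about the LKJ distribution: for a 2 by 2 matrix, the off-diagonal marginal of \(\text{LKJcorr}(\eta)\) is a \(\text{Beta}(\eta, \eta)\) distribution rescaled to \((-1, 1)\):

# Off-diagonal of a 2x2 LKJ(1) prior: Beta(1,1) rescaled to (-1,1)
rho <- 2*rbeta(100000, 1, 1) - 1
hist(rho) # flat across (-1, 1)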

We are now ready to fit the model! The Stan code is given by the following script, which is assigned to the m_joint_RT_Bern variable in the R environment:

data {
    int N;             // # of subjects
    int N_items;       // # of items
    int T;             // max # of RT trials across subjects
    real RT[N, T];     // Reaction times for each subject and trial
    int Y[N, N_items]; // Responses for each subject and item
}
parameters {
  // Group-level parameters
  // correlation matrix (cholesky factor for faster computation)
  cholesky_factor_corr[2] R_chol;  
  // SDs
  vector<lower=0>[2] pars_sigma; 
  real<lower=0> sigma_SD;
  // means
  vector[2] pars_mu;
  real sigma_mu;
  
  // Individual-level parameters
    matrix[2,N] pars_pr; 
    vector[N] sigma_pr;
}
transformed parameters {
  // Individual-level parameter off-sets (for non-centered parameterization)
  matrix[2,N] pars_tilde;
  
  // Individual-level parameters 
  vector[N] theta;
  vector[N] delta;
  vector[N] sigma;
  
  // Construct individual offsets (for non-centered parameterization)
  pars_tilde = diag_pre_multiply(pars_sigma, R_chol) * pars_pr;
  
  // Compute individual-level parameters from non-centered parameterization
  for (i in 1:N) {
    theta[i] = pars_mu[1] + pars_tilde[1, i];
    delta[i] = pars_mu[2] + pars_tilde[2, i];
    sigma[i] = exp(sigma_mu + sigma_SD * sigma_pr[i]);
  }
}
model {
  // Prior on cholesky factor of correlation matrix
  R_chol ~ lkj_corr_cholesky(1);
  
  // Priors on group-level means
  pars_mu[1]  ~ normal(0, 1);
  pars_mu[2]  ~ normal(1, 1);
  sigma_mu ~ normal(-1, .2);
  
  // Priors on group-level SDs
  pars_sigma[1] ~ normal(0, 1);
  pars_sigma[2] ~ normal(0, .5);
  sigma_SD   ~ normal(0, .5);
  
  // Priors on individual-level parameters
  to_vector(pars_pr)  ~ normal(0, 1);
  to_vector(sigma_pr) ~ normal(0, 1);
    
  // For each subject
  for (i in 1:N) {
    // response time model
    RT[i,1:T] ~ normal(delta[i], sigma[i]);
    
    // self-report model
    Y[i,:] ~ bernoulli_logit(theta[i]);
  }
}
generated quantities { 
  corr_matrix[2] R;
    // Reconstruct correlation matrix from cholesky factor
  R = R_chol * R_chol';
} 

Like the Stan code for the bernoulli model, the code above uses a non-centered parameterization for the group-level portion of the model. Additionally, we are using a Cholesky decomposition to better estimate the correlation matrix. These choices do not affect the results, and for our current purposes we can think of these modifications simply as ways to fit the models more efficiently. Regardless, the model in the Stan code above is mathematically equivalent to the model described by the equations in the text.
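
If the Cholesky trick is unfamiliar, the reconstruction in the generated quantities block is easy to verify in R (using a made-up correlation matrix):

# Verify that R = L L' recovers the correlation matrix
R <- matrix(c(1, .5,
              .5, 1), nrow = 2)
L <- t(chol(R)) # chol() returns the upper factor, so transpose
L %*% t(L)      # recovers R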

Time to fit! First, we will load some relevant R packages:

library(mvtnorm)
library(doParallel)
library(rstan)
library(hBayesDM)

Now, the R code below will simulate data where the correlation between the individual-level generative parameters \(\theta_i\) and \(\delta_i\) (or the “true scores”) varies from 0 to 1. Then, we will use the classical, reliability-based disattenuation formula along with the joint generative model to recover the true correlation.

The simulation uses 100 “participants”. Additionally, the questionnaire has 10 items, and the response time task has 50 trials. To the code!:

set.seed(43201)

# Number of subjects and items/trials for questionnaire/task
n_subj   <- 100
n_items  <- 10
n_trials <- 50

# Create grid of true generating correlations in (0,1)
true_r <- seq(0,1,length.out = 20)

# Parallel cluster
cl <- makeCluster(4)
registerDoParallel(cl)

# Parallelize across grid of true correlations
joint_dat <- foreach(r=seq_along(true_r), .combine = "rbind", 
                   .packages = c("mvtnorm", "dplyr", 
                                 "foreach", "rstan", "hBayesDM")) %dopar% {
  
  # Construct correlation and covariance matrices
  M  <- c(0, .8) # group-level means for theta and delta
  SD <- c(1, .1) # group-level standard deviations for theta and delta
  R  <- matrix(c(1, true_r[r], true_r[r], 1), nrow = 2)
  S  <- diag(SD) %*% R %*% diag(SD)
    
  # Draw individual-level parameters from multivariate normal
  pars <- rmvnorm(n_subj, M, S) %>%
    as.data.frame()
  theta <- pars[,1] # for bernoulli model
  delta <- pars[,2] # for normal model
  sigma <- rnorm(n_subj, .4, .05) # for normal model
  
  # Simulate questionnaire data (i.e. bernoulli generative model)
  X_Q_all <- foreach(i=1:n_subj, .combine = "rbind") %do% {
    p  <- 1/(1 + exp(-theta[i])) # logistic transform
    rbinom(n_items, 1, prob = p)
  }
  # Simulate response time data (i.e. normal generative model)
  X_RT_all <- foreach(i=1:n_subj, .combine = "rbind") %do% {
    rnorm(n_trials, delta[i], sigma[i])
  }
  
  # group averages of observed scores
  X_bar_Q  <- mean(rowMeans(X_Q_all))
  X_bar_RT <- mean(rowMeans(X_RT_all))
  
  # individual-level observed scores
  X_Q  <- rowMeans(X_Q_all)
  X_RT <- rowMeans(X_RT_all)
  
  # Average of individual-level error variances 
  sig2_ep_Q  <- mean(apply(X_Q_all, 1, est_se2)) # same SE function from earlier 
  # var() already divides by n-1, so the squared SEM is var(x)/n
  sig2_ep_RT <- mean(apply(X_RT_all, 1, function(x) var(x)/length(x)))
  
  # Group-level observed score variance
  sig2_X_Q  <- var(X_Q)
  sig2_X_RT <- var(X_RT)
  
  # Compute reliability using SEM approach
  rho_Q  <- 1 - (sig2_ep_Q/sig2_X_Q)
  rho_RT <- 1 - (sig2_ep_RT/sig2_X_RT)

  # Observed correlation and 50% confidence interval
  obs_cor <- cor.test(X_Q, X_RT, conf.level = .5)
  obs_r   <- obs_cor$estimate
  obs_ci  <- obs_cor$conf.int
  
  # Disattenuated correlation and 50% confidence interval
  dis_r  <- obs_r / sqrt(rho_Q*rho_RT)
  dis_ci <- obs_ci / sqrt(rho_Q*rho_RT)
    
  # Stan data for joint generative model
  stan_data <- list(N       = n_subj,
                    N_items = n_items,
                    T       = n_trials,
                    Y       = X_Q_all,
                    RT      = X_RT_all)
  
  # Fit joint generative model
  fit_joint <- sampling(m_joint_RT_Bern, 
                        data   = stan_data,
                        iter   = 1000,
                        warmup = 200, 
                        chains = 1, 
                        cores  = 1,
                        seed   = 43201)
  pars <- rstan::extract(fit_joint) # extract parameters
  
  # Generative model correlation and 50% highest density interval
  bayes_r <- mean(pars$R[,1,2])
  hdi <- hBayesDM::HDIofMCMC(pars$R[,1,2], credMass = .5)
  
  # Save out data
  data.frame(true_r    = true_r[r],
             obs_r     = obs_r,
             obs_lo    = obs_ci[1],
             obs_hi    = obs_ci[2],
             dis_r     = dis_r,
             dis_lo    = dis_ci[1],
             dis_hi    = dis_ci[2],
             bayes_r   = bayes_r,
             bayes_lo  = hdi[1],
             bayes_hi  = hdi[2])
}
# Stop the parallel boi
stopCluster(cl)

# Hacky way to get some ggplot
qplot() +
  geom_line(aes(x = true_r, y = true_r), col = I("black"),
            linetype = 2, size = 1) +
  geom_ribbon(aes(x = true_r,
                  ymin = joint_dat$obs_lo,
                  ymax = joint_dat$obs_hi,
                  fill = I("gray")), alpha = .2) +
  geom_ribbon(aes(x = true_r,
                  ymin = joint_dat$bayes_lo,
                  ymax = joint_dat$bayes_hi,
                  fill = I("#8F2727")), alpha = .2) +
  geom_ribbon(aes(x = true_r,
                  ymin = joint_dat$dis_lo,
                  ymax = joint_dat$dis_hi,
                  fill = I("#DCBCBC")), alpha = .2) +
  geom_line(aes(x = true_r, y = joint_dat$obs_r, col = I("gray"))) +
  geom_line(aes(x = true_r, y = joint_dat$dis_r, col = I("#DCBCBC"))) +
  geom_line(aes(x = true_r, y = joint_dat$bayes_r, col = I("#8F2727"))) +
  annotate("text", x = .8, y = .45, label = "Observed", 
           color = "gray", size = 5) +
  annotate("text", x = .35, y = .72, label = "Disattenuated", 
           color = "#DCBCBC", size = 5) +
  annotate("text", x = .85, y = .75, label = "Generative", 
           color = "#8F2727", size = 5) +
  coord_cartesian(xlim = c(0,1), ylim = c(0,1)) +
  ggtitle("Self-report & Behavioral Task Convergence") +
  xlab("True Generating Correlation") +
  ylab("Estimated Correlation") +
  theme_minimal(base_size = 15) +
  theme(panel.grid = element_blank())

And there we have it! The generative and classical disattenuation approaches produce almost identical correlation estimates and 50% uncertainty intervals. In fact, if you are having trouble identifying the classical disattenuation line and intervals, it is because they are just that hard to distinguish from those of the generative model.

Note that there is some noise across the range of “true” (or generating) correlations, which arises from the probabilistic nature of the simulated data. We could get a better sense of what the approaches produce in expectation by running many iterations for each true/generating correlation, but I prefer the above approach to get a sense of how each may work in a single study.

Discussion

The current post demonstrated that generative models are well-suited to make inference on individual differences, and in fact they can give us results similar to approaches developed within the framework of classical test theory (i.e. the correction for attenuation, true score estimation, etc.). Regardless of the approach you take, the results here make it clear that accounting for uncertainty (or reliability) is very important when our goal is to make inference on individual differences. Otherwise, our statistical inferences will be biased and overconfident. Moving forward, I hope that you consider accounting for such uncertainty in your models, regardless of the approach you decide to take.

More generally, the generative modeling approach is easily extendable. Unlike the classical test theory approach, generative models do not require us to work out a new sampling distribution each time we modify the assumed data-generating model. Instead, we specify the model in a way that is consistent with our theory, and the joint estimation of all parameters allows us to account for uncertainty (or the lack of precision of parameter estimates) across all levels of analysis. Therefore, when doing generative modeling, we do not necessarily need to think about the reliability of our measure–instead, we can think about uncertainty in our model parameters. The model can then be refined, extended, and even simulated forward to generate predictions for observed data, and once we are happy with our generative model, we can use Bayesian estimation to condition on the data. The result is a joint probability distribution that contains all the information we need to make subsequent inferences and decisions–in our case, a self-report to behavioral task correlation. If our uncertainty intervals are too wide to make a good decision, we can collect more data. Further, we can conduct formal decision analyses (e.g., see Wiecki & Kumar, 2019). Altogether, generative modeling provides a flexible framework for developing and testing theories.

A Final Note

I hope that this post was useful for you! I sure learned a lot throughout, and it was great to finally delve into the relationships between classical test theory and generative modeling. Perhaps you will consider generative modeling for your next research project :D.

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] hBayesDM_1.0.2     Rcpp_1.0.1         doParallel_1.0.14 
##  [4] iterators_1.0.10   mvtnorm_1.0-10     rstan_2.19.2      
##  [7] StanHeaders_2.19.0 truncnorm_1.0-8    ggridges_0.5.1    
## [10] ggplot2_3.3.2      foreach_1.4.4      dplyr_1.0.1       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0   xfun_0.7           purrr_0.3.3       
##  [4] colorspace_1.4-1   vctrs_0.3.2        generics_0.0.2    
##  [7] htmltools_0.3.6    stats4_3.6.0       loo_2.1.0         
## [10] yaml_2.2.0         rlang_0.4.10       pkgbuild_1.0.3    
## [13] pillar_1.4.1       glue_1.4.1         withr_2.1.2       
## [16] matrixStats_0.54.0 lifecycle_0.2.0    plyr_1.8.4        
## [19] stringr_1.4.0      munsell_0.5.0      blogdown_0.13     
## [22] gtable_0.3.0       codetools_0.2-16   evaluate_0.14     
## [25] labeling_0.3       inline_0.3.15      knitr_1.23        
## [28] callr_3.2.0        ps_1.3.0           scales_1.0.0      
## [31] formatR_1.7        gridExtra_2.3      digest_0.6.19     
## [34] stringi_1.4.3      bookdown_0.11      processx_3.3.1    
## [37] grid_3.6.0         cli_1.1.0          tools_3.6.0       
## [40] magrittr_1.5       tibble_2.1.3       crayon_1.3.4      
## [43] pkgconfig_2.0.2    data.table_1.12.2  prettyunits_1.0.2 
## [46] assertthat_0.2.1   rmarkdown_1.13     R6_2.4.0          
## [49] compiler_3.6.0
