A guide to measurement error in psychological research — tracing the history from Spearman's original insight through Classical Test Theory, to a modern signal-to-noise framework that reveals what reliability really is and what it demands of us.
The reliability problem is not a recent discovery. It was identified in 1904 by Charles Spearman — in the very same paper that introduced the correlation coefficient to psychology. That Spearman recognised the measurement error problem at the moment he invented the tool most affected by it is one of the more elegant ironies in the history of science. And the framework he proposed that year remains the foundation on which all subsequent work, including the most modern approaches, is built.
Spearman's core insight was that every observed score on a psychological test is composed of two distinct parts:
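Observed score = True score + Measurement error (in classical notation, X = T + E).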
The true score is what we actually care about — the person's genuine, stable standing on whatever latent ability we are trying to measure: their working memory capacity, their face recognition ability, their susceptibility to a visual illusion. The measurement error is everything else: random fluctuations in attention on the day, guessing, ambiguous stimuli, fatigue, the particular items that happened to appear on this occasion. In classical test theory, error is treated as random with respect to the true score — that is, it is uncorrelated with the underlying ability we are trying to measure.
This decomposition has an immediate and important consequence. Because error is random, it contributes variance to observed scores without contributing any systematic information. As the proportion of error variance grows relative to true score variance, observed scores become increasingly poor guides to individuals' true standing on the latent ability — and, crucially, any statistic we compute from those observed scores inherits that contamination.
Reliability is the formal index of how much of the observed score variance reflects true score variance versus error. It is defined as the ratio:
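reliability = σ²_true / (σ²_true + σ²_error) = σ²_true / σ²_observed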
A reliability of 1.0 means no error — every point of observed variance reflects genuine individual differences. A reliability of 0 means pure noise. Most psychological measures fall somewhere in between, and the practical consequences of that position are severe — not merely in the form of imprecision, but in the form of a systematic, one-directional bias in every effect we estimate. This systematic bias is what Spearman named attenuation, and quantifying it was his second major contribution of 1904.
Reliability can be estimated in two broad ways: through test-retest reliability (correlating two administrations of the same measure) or through internal consistency (asking how well items within a single administration agree with each other, as indexed by Cronbach's α, introduced in 1951). Despite their different computational forms, both aim to estimate the same underlying ratio above, but they rely on different assumptions and can diverge in practice. The choice between them is often practical — internal consistency is cheaper to obtain — but they should not be treated as interchangeable in all contexts.
A critical implication follows: reliability is not just a property of a measure, but of a design. Tasks that maximise within-person effects often minimise between-person variance, limiting the signal available for individual differences — a tension that becomes central in modern reliability work.
Section 1 established that observed scores are a mixture of true score and error, and that reliability quantifies how much of the observed variance is signal. The natural next question is: what does this do to the statistics we compute from those scores? The answer — which Spearman derived in the same 1904 paper — is precise and sobering. Low reliability does not merely add imprecision; it introduces a systematic, one-directional bias that pushes every effect we measure toward zero. This bias is called attenuation, and it operates on correlations and group differences alike.
When two measures are both imperfectly reliable, the observed correlation between them is systematically smaller than the true underlying correlation, in expectation. Spearman's attenuation formula gives the exact relationship:
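r_observed = r_true × √(reliability_X × reliability_Y)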
The implication is stark. If two measures each have reliability of .5, and the true correlation between the underlying constructs is .8, the observed correlation will be .8 × √(.5 × .5) = .8 × .5 = .40 — exactly half the true value. A researcher who observes r = .40 and concludes the constructs are weakly related has been misled by measurement error, not by the constructs themselves.
A critical and often overlooked point: attenuation affects mean differences just as severely as it affects correlations. This is frequently misunderstood — researchers sometimes assume that poor reliability "only matters for individual differences research." This is wrong. Karvelis and Diaconescu (2025) demonstrate mathematically that observed standardised mean differences follow the same reliability-based scaling:
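d_observed = d_true × √reliability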
A true effect of d = 0.8 measured with an instrument of reliability .5 yields an observed d of 0.8 × √.5 ≈ 0.57. A "large" effect becomes a "medium" effect — not because the effect is smaller in reality, but because of how Cohen's d is constructed. Cohen's d is the mean difference between groups divided by the pooled standard deviation: d = (μ₁ − μ₂) / σ_pooled. Measurement error leaves the mean difference in the numerator unaffected — random noise cancels when averaged across participants. But that same noise inflates the pooled standard deviation in the denominator, making it larger than the true population SD. A larger denominator with an unchanged numerator always produces a smaller d. The effect is real and unchanged; the ruler has simply been made noisier, stretching the scale against which the difference is expressed.
At this point a reasonable question arises: if measurement error is random, why does it always push effects downward? If errors are equally likely to be positive or negative, shouldn't they sometimes inflate an observed correlation and sometimes deflate it — producing a noisy estimate centred on the true value rather than a systematically lower one?
This intuition is correct for means — random error cancels across people, leaving group means unbiased. But a correlation is not a mean. It is a ratio of covariance to variance, and error affects the numerator and denominator very differently:
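r = cov(X, Y) / √( var(X) × var(Y) )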
When you add random error to both measures, the covariance in the numerator is unchanged — provided the errors on measure X and the errors on measure Y are independent of each other. But those same errors inflate the variance in the denominator — because each person's observed score now fluctuates around their true score, making both distributions wider. A larger denominator with the same numerator always means a smaller fraction. Always.
Error spreads scores out (inflating variance) without making them track each other any better (covariance unchanged). You have made both distributions noisier without adding any genuine shared signal between them. The ratio of shared signal to total spread therefore shrinks — and this happens every time, in every sample, regardless of direction.
This connects to a concept that often surprises students: attenuation and regression to the mean (RTM) are the same underlying mathematical fact, viewed from different angles.
RTM describes what happens to a single person across two measurement occasions. If someone scores extremely high on a first measurement, their score on retest tends to be lower — closer to the group mean. The reason is identical to the attenuation argument: their extreme first score is partly driven by error. On retest that error is not repeated, so the expected score shifts back toward the mean. The extreme score regresses toward the mean in proportion to how much of its variance was error — that proportion is precisely 1 − reliability.
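Equivalently, the expected retest score is E[score₂ | score₁] = mean + reliability × (score₁ − mean): the lower the reliability, the stronger the pull back toward the mean.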
RTM (within-person, across occasions): A person's extreme observed score will likely be less extreme on retest — because the error that pushed it to the extreme is random and will not repeat. The score regresses toward the group mean in proportion to measurement error.
Attenuation (between-persons, across measures): An observed correlation between two imperfect measures is always smaller than the true correlation — because error inflates variance without inflating covariance. The correlation regresses toward zero in proportion to measurement error.
Both follow directly from the same source: the correlation between an observed score and the true score it represents is √reliability — less than 1.0 whenever measurement is imperfect.
Attenuation does not diminish as you add more participants. Adding more people makes you increasingly confident in a badly attenuated value — not in the true value. What can help is adding more trials per person (which reduces trial noise and raises reliability) or using a hierarchical model that explicitly accounts for that noise. The distinction between more participants and more trials — and why they have such different consequences — is developed in detail in Section 4.
Many researchers in psychology and neuroscience routinely convert continuous measures into groups — median splits into "high" and "low" scorers, clinical thresholds creating "patient" versus "control" groups, or factorial designs that require categorical independent variables. This practice introduces a further layer of distortion that compounds the attenuation problem described above.
When you dichotomise a continuous distribution, the observed standardised difference between your two groups is not simply a function of the true underlying difference. It depends on two things working together: the overall variance of the underlying distribution, and where you place the cut point.
Add measurement error and the situation worsens further. When the underlying measure is unreliable, individuals near the cut point are frequently misclassified — a true "high" scorer falls below the median on this occasion, and vice versa. The consequence is that attenuation compounds: the observed group difference is attenuated first by measurement error in the underlying measure (following δ_observed = δ_true × √ICC) and then further by the information discarded through dichotomisation itself.
The pressure toward dichotomisation often comes from the desire to use factorial ANOVA designs, which require categorical independent variables. As a rule, any analysis that uses a continuous measure as a categorical independent variable is throwing away information and inflating required sample sizes. Regression and mixed models with continuous predictors are almost always preferable.
The practical consequence is that researchers using dichotomised designs need substantially larger sample sizes to achieve the same statistical power as those using continuous analyses — often two to three times larger, depending on reliability and cut location. The two sources of information loss are multiplicative, not additive.
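A short simulation makes the compounding concrete. The sketch below is illustrative only; the reliability of .5, the effect size, and all variable names are assumptions, not values drawn from the studies discussed here. It generates true scores, adds measurement error, and then median-splits the noisy measure, showing the association with an outcome shrinking at each step.

```r
# Illustrative simulation: information loss from unreliability, then from a median split.
set.seed(1)
n    <- 10000
rel  <- 0.5                                          # assumed reliability of the measure
true <- rnorm(n)                                     # true scores on the latent ability
obs  <- sqrt(rel) * true + sqrt(1 - rel) * rnorm(n)  # observed score = signal + noise
outcome <- 0.5 * true + rnorm(n)                     # an outcome driven by the true score

cor(true, outcome)                # the association we would like to recover
cor(obs, outcome)                 # attenuated by unreliability alone
group <- obs > median(obs)        # median split discards the remaining graded information
cor(as.numeric(group), outcome)   # attenuated by unreliability AND dichotomisation
```

The last correlation is smaller again than the already-attenuated continuous one: the two losses multiply, which is why dichotomised designs need the larger samples described above.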
The previous sections showed that reliability determines the degree of attenuation in both correlations and group differences. But there is a deeper problem with how reliability is conventionally measured that limits our ability to diagnose and fix the problem. This section introduces a more informative framework — due to Rouder and Mehrvarz (2024, 2026) — that reframes reliability as a signal-to-noise ratio, and asks what that framework reveals when placed alongside the richer variance decomposition of linear mixed models.
Cronbach's alpha and test-retest ICC are the workhorses of reliability assessment. Both are widely reported and widely interpreted. But they share a critical limitation: they are not fixed properties of a task — they depend on how many trials or items you administer.
Consider two labs both studying the Stroop task. Lab A runs 200 trials per person per condition; Lab B runs 20 trials per person per condition. Lab A will report much higher test-retest reliability than Lab B — but the task is identical. The reliability coefficient has told us something about the experimental implementation, not about the task itself.
Rouder and Mehrvarz propose replacing the reliability coefficient with a signal-to-noise variance ratio:
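γ² = σ²_B / σ²_W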
Here σ²_B is the genuine variance in people's true ability — how much people actually differ from one another. σ²_W is the trial-by-trial noise — the random fluctuations within a single person's performance. The ratio of these two quantities is a property of the task and the population, entirely independent of how many trials you choose to run.
From γ², the expected test-retest reliability at any trial size L follows directly:
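For a score formed by averaging L trials, E[r_L] = γ² / (γ² + 1/L); for a within-person contrast with L trials per condition, the trial noise doubles, giving E[r_L] = γ² / (γ² + 2/L).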
This formula defines an entire curve of reliability as a function of trial size — and the shape of that curve is completely determined by γ². A task with high γ² rises steeply: reasonable reliability is achieved with modest trial counts. A task with low γ² rises slowly and may never reach acceptable reliability within any practical design.
Plotting this curve makes visible what a single ICC value completely conceals: the Stroop effect (γ ≈ .12) needs ~158 trials per condition to reach r = .70 and over 600 for r = .90 — yet most studies run 20–50. The Poggendorf illusion (γ ≈ 1.41) exceeds r = .90 with just 15 trials. The ceiling is set by γ², not by N. And since the ICC values in the attenuation formula are themselves determined by γ² and L, the degree of attenuation in any reported correlation is:
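r_observed = ρ_true × √( r_X(L_X) × r_Y(L_Y) ),

where r_X(L_X) and r_Y(L_Y) are the two tasks' expected reliabilities at their respective trial counts, read off the curve above.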
For contrast-based paradigms — such as the face inversion effect — observed γ is characteristically low because the subtraction cancels much of the between-person variance while doubling the noise. Poor cross-task correlations in such paradigms are not evidence of construct fragmentation; they are the mathematically predictable consequence of using a ruler too noisy to distinguish nearby positions.
Rouder's model contains no item subscript. His trial-noise term τ² (the σ²_W above) absorbs everything — item difficulty, item-by-person interactions, and pure trial noise — into a single lump. This works cleanly for the paradigms he has in mind, where trials are essentially random draws from a large pool. But many experimental paradigms in psychology use a finite set of stimuli created by the experimenter, and those stimuli vary: some items are harder than others, some discriminate between people better than others, and the degree of overlap between what was tested on occasion 1 and occasion 2 depends on how the materials were sampled. A linear mixed model partitions that lump into its constituents — between-item variance, person-by-item variance, and residual trial noise — alongside the between-person signal, and those components have different implications for reliability:
- Always contributes: σ²residual — irreducible within-person trial noise.
- Contributes when items are resampled: σ²item-sampling — instability from drawing a different item subset on each occasion.
- May contribute: σ²person×item — if people respond differently to specific items.

Plain item intercept variance does not enter the denominator when items are fixed across occasions — it is shared structure, not noise. The LMM framing is not a correction to Rouder but an elaboration: it makes visible why a task has low SNR and what lever — more trials, better items, or a different sampling strategy — is most likely to raise it.
Once a hierarchical model has been fitted, the estimated variance components are already in hand. A Rouder-style γ² can be computed directly from model output, plugged into the reliability curve, and used to ask how many more trials — or better-calibrated items — would be needed to reach a measurement target. The following example makes this concrete.
To make this concrete, consider a 100-item face recognition memory task with a 2×2 design: Factor A = Upright vs Inverted, Factor B = Familiar vs Unfamiliar. Two hundred participants each complete all conditions. You fit a crossed hierarchical model with participant and item random slopes for both factors and their interaction:
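In lme4 syntax the model just described might look like the sketch below. It is a minimal illustration under assumptions: a long-format data frame face_memory with a continuous trial-level score and columns orientation, familiarity, participant, and face (all names hypothetical, not taken from the original).

```r
# Minimal sketch of the crossed hierarchical model (lme4 syntax); names are illustrative.
library(lme4)

fit <- lmer(
  score ~ orientation * familiarity +                  # fixed effects for the 2 x 2 design
    (1 + orientation * familiarity | participant) +    # participant intercepts and slopes
    (1 + orientation * familiarity | face),            # item (face) intercepts and slopes
  data = face_memory
)

VarCorr(fit)   # variance components of the kind tabulated below
```

In practice such maximal crossed models can be slow to converge in lme4 and are often fitted in brms instead; either way, the variance components are the quantities of interest.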
Suppose the model yields the following variance components:
| Source | Component | Variance | Role |
|---|---|---|---|
| Participant intercept | σ²p-intercept | 0.40 | Overall memory ability signal |
| Participant slope A | σ²p-A | 0.12 | Individual inversion effect signal |
| Participant slope B | σ²p-B | 0.06 | Individual familiarity effect signal |
| Participant slope A×B | σ²p-A×B | 0.03 | Individual interaction signal |
| Item intercept | σ²i-intercept | 0.25 | Item difficulty (shared structure) |
| Item slope A | σ²i-A | 0.05 | Item-level inversion variability |
| Residual | σ²residual | 0.80 | Trial noise (always in denominator) |
Because faces are presented as a fixed set (the same 100 faces on both occasions), item intercepts are shared structure and do not contribute to noise. The effective noise denominator is σ²residual = 0.80 throughout. This gives a separate γ² for each inferential target:
- How well can we rank people by general face memory? γ² = σ²p-intercept / σ²residual = 0.40 / 0.80 = 0.50. Good signal — stable individual differences detectable with moderate trial counts.
- How well can we rank people by their upright–inverted sensitivity? γ² = σ²p-A / σ²residual = 0.12 / 0.80 = 0.15. Modest — people differ in inversion sensitivity, but many more trials are needed to localise individuals.
- How well can we rank people by familiar–unfamiliar sensitivity? γ² = σ²p-B / σ²residual = 0.06 / 0.80 = 0.075. Low — weakly discriminating for individual differences research.
- How well can we rank people by their interaction contrast? γ² = σ²p-A×B / σ²residual = 0.03 / 0.80 ≈ 0.038. Very low — 100 items is far too few; task redesign is likely needed.
This example makes a point that a single reliability coefficient would completely obscure: the same task is simultaneously excellent for measuring overall ability, mediocre for inversion sensitivity, and essentially unsuitable for studying individual differences in the interaction effect. A researcher who runs 100 items and reports a high reliability for the overall memory score has no grounds for assuming the inversion contrast is equally well measured — yet this assumption is routinely made in the literature.
Consulting the reliability curve for each γ²: overall ability (γ² = 0.50) achieves r ≈ .96 at L = 100 trials, meaning very few additional items are needed. The inversion effect (γ² = 0.15) achieves only r ≈ .88 — still useful, but considerably more fragile. The interaction (γ² = 0.038) achieves r ≈ .66 — below the conventional threshold of .70 even with 100 trials per condition, and would require over 450 trials to reach .90. No amount of modest sample-size increase rescues this; only more items or a fundamentally different task design will help.
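Once the variance components are in hand, reproducing this arithmetic takes a few lines. The sketch below assumes the table's values and uses the contrast-score form of the reliability curve, γ² / (γ² + 2/L), which reproduces the figures quoted above.

```r
# gamma^2 per inferential target, and expected reliability at L = 100 trials,
# using the contrast-score curve gamma^2 / (gamma^2 + 2/L). Values from the table above.
sig2       <- c(ability = 0.40, inversion = 0.12, familiarity = 0.06, interaction = 0.03)
sig2_resid <- 0.80

gamma2 <- sig2 / sig2_resid                      # 0.50, 0.15, 0.075, 0.038
rel_at <- function(g2, L) g2 / (g2 + 2 / L)      # expected reliability with L trials
round(rel_at(gamma2, L = 100), 2)                # roughly .96, .88, .79, .66

trials_for <- function(g2, target) ceiling(2 * target / (g2 * (1 - target)))
trials_for(gamma2["interaction"], 0.90)          # roughly 480 trials to reach r = .90
```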
Section 2 introduced Spearman's correction as the classical remedy for attenuation: divide the observed correlation by the square root of the product of the two reliabilities. This works, and Haines (2026) confirms it is approximately unbiased — across many simulations it centres on the true ρ (his post is discussed in detail at the end of this section). But it is fragile: reliability must be estimated and is itself noisy, so dividing one uncertain quantity by another inflates variance and produces impossible values above 1.0 in roughly 13% of cases. Most critically, it requires reliability as a known input to do anything at all.
The more modern approach — fitting a hierarchical model directly to trial-level data — arrives at the same disattenuated estimate through a fundamentally different route. Rather than estimating reliability and then applying a correction, it partitions the observed variance into its components (person signal, trial noise, item contributions) and estimates ρ as a direct parameter of the population covariance structure. No reliability figure is needed. No post-hoc correction is applied. Disattenuation is a natural consequence of modelling the data-generating process correctly. Haines shows that multivariate shrinkage and Bayesian hierarchical models all converge on the same answer — but with correctly calibrated uncertainty and no impossible values.
Spearman gets to the right answer on average — but needs reliability as input, and that input is noisy. The hierarchical model gets to the same right answer without reliability ever entering the calculation — and as a bonus, the variance components it estimates along the way are exactly the ingredients needed to compute Rouder's γ², consult the reliability curve, and reason about how many more trials or better-calibrated items would be needed. The modern approach does not just fix the correlation estimate; it generates the diagnostic information needed to improve the measurement itself.
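To make the modelling route concrete, here is one minimal way such a bivariate model could be specified in brms. It is a sketch under assumed structure (a trial-level data frame trials with columns participant, task, condition, and rt) and is not the specification used in the papers cited, which differ in detail such as priors, response distributions, and contrast coding.

```r
# Sketch: trial-level data from two tasks in one model. The correlation between the two
# tasks' person-level condition slopes is the disattenuated rho, estimated directly.
library(brms)

fit <- brm(
  rt ~ 0 + task + task:condition +
    (0 + task + task:condition | participant),   # person effects for each task and contrast
  data  = trials,
  cores = 4
)

summary(fit)   # the person-level correlation between the two condition slopes is rho
```

No post-hoc correction is applied anywhere; the correlation and its credible interval come straight out of the random-effects covariance matrix, with trial noise already accounted for.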
Section 2 flagged that adding more participants does not reduce attenuation — the bias is in the measurement, not the sampling. Rouder's 2026 paper formalises exactly why this is so troubling. The naive sample correlation between contrast scores is not merely attenuated — it is an inconsistent estimator of the true latent correlation. Most biased estimators are at least consistent: as sample size grows, the bias shrinks toward zero. Not here. The estimator converges to the attenuated value and stays there regardless of how many participants are added, creating what Rouder calls "statistical hell": as N grows, the confidence interval narrows around the wrong answer, and researchers become increasingly certain of an incorrect conclusion.
The standard error compounds this: it is computed conditional on the model being correct, so if trial noise is omitted, the SE reflects only sampling variability around the attenuated estimate — it contains no information about the systematic downward bias. A study with N = 800 reporting r = 0.10 with a tight CI looks like reliable evidence of a weak relationship, when in reality the measurement was simply too noisy to recover the true association.
Suppose the true latent correlation between two tasks is 0.50. Because of noisy difference scores, the observed correlation converges near 0.10. Then:
N = 40 → estimate ≈ 0.10, wide CI
N = 400 → estimate ≈ 0.10, narrow CI
N = 4000 → estimate ≈ 0.10, extremely narrow CI
Precision improves while validity does not. The large study gives high confidence in an estimate five times smaller than the truth.
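A few lines of simulation reproduce this pattern. The noise level below is an assumption chosen so that each contrast score has reliability .20, which makes the observed correlation settle near .10 when the latent correlation is .50.

```r
# Observed correlation between noisy contrast scores: stuck near .10 regardless of N.
set.seed(2)
sim_r <- function(n, rho = 0.5, noise_sd = 2) {
  theta <- MASS::mvrnorm(n, c(0, 0), matrix(c(1, rho, rho, 1), 2))  # true person effects
  x <- theta[, 1] + rnorm(n, sd = noise_sd)   # observed contrast, task 1 (reliability .20)
  y <- theta[, 2] + rnorm(n, sd = noise_sd)   # observed contrast, task 2 (reliability .20)
  unlist(cor.test(x, y)[c("estimate", "conf.int")])
}
sapply(c(40, 400, 4000), sim_r)   # the estimate hovers near .10; only the CI narrows
```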
The hierarchical model resolves both problems at once. It is a consistent estimator — as N grows it converges to the true ρ, not the attenuated value. And its credible intervals incorporate both sources of uncertainty: finite participants and finite trials. The Stroop–Flanker example from Rouder et al. (2026) makes this concrete:
N = 253 participants, ~93 trials per condition each (Rey-Mermet et al., 2018).
Conventional approach — sample contrast scores correlated:
r_sample = .045 | 95% CI: ±.123 (narrow, looks precise)
Bivariate hierarchical model fitted to trial-level data:
ρ_model = .17 | 95% CI: ±.45 (~4× wider, honestly uncertain)
The narrow CI on the sample correlation is not a sign of precision — it is a sign that trial noise was omitted from the uncertainty calculation entirely. The model-based estimate is nearly four times larger and its interval roughly four times wider, with both corrections going in the right direction simultaneously.
γ² and reliability remain useful for planning — they tell you how many trials are needed to achieve a target credible interval width before data collection. But for interpretation, the reliability coefficient is tangential. Fit the hierarchical model, report the posterior on ρ directly, and let the uncertainty speak for itself.
Poor γ² → low reliability at any practical trial count → biased and inconsistent observed correlation estimates → SE compression creating false confidence around the wrong value → misleading conclusions about construct structure, test validity, and theoretical models. Every link in this chain is quantifiable. The remedies are: better items (higher γ² via IRT-informed design), more trials per person, and — most powerfully — a hierarchical model that estimates the true correlation directly from trial-level data, requiring no reliability input and producing correctly calibrated uncertainty as a natural output.
Nathaniel Haines' (2026) post "How to Estimate a Correlation, and What It Means for Science" at haines-lab.com provides the definitive worked demonstration of everything discussed here — with R code, animated visualisations showing why univariate shrinkage cannot fix the correlation, and simulation results comparing Spearman, multivariate shrinkage, lme4, and brms side by side. The striking convergence of all four methods on the same disattenuated answer — despite their very different starting points — is the empirical punchline this section has been building toward.