Teaching Resource · Psychometrics & Individual Differences

The Reliability Problem
From Attenuation to Signal-to-Noise

A guide to measurement error in psychological research — tracing the history from Spearman's original insight through Classical Test Theory, to a modern signal-to-noise framework that reveals what reliability really is and what it demands of us.


Section 01 · Foundations

Observed Scores, True Scores,
and a Century of Consequences

The reliability problem is not a recent discovery. It was identified in 1904 by Charles Spearman — in the very same paper that introduced the correlation coefficient to psychology. That Spearman recognised the measurement error problem at the moment he invented the tool most affected by it is one of the more elegant ironies in the history of science. And the framework he proposed that year remains the foundation on which all subsequent work, including the most modern approaches, is built.

The Fundamental Decomposition

Spearman's core insight was that every observed score on a psychological test is composed of two distinct parts:

Observed Score = True Score + Measurement Error
The fundamental decomposition of Classical Test Theory (Spearman, 1904)

The true score is what we actually care about — the person's genuine, stable standing on whatever latent ability we are trying to measure: their working memory capacity, their face recognition ability, their susceptibility to a visual illusion. The measurement error is everything else: random fluctuations in attention on the day, guessing, ambiguous stimuli, fatigue, the particular items that happened to appear on this occasion. In classical test theory, error is treated as random with respect to the true score — that is, it is uncorrelated with the underlying ability we are trying to measure.

This decomposition has an immediate and important consequence. Because error is random, it contributes variance to observed scores without contributing any systematic information. As the proportion of error variance grows relative to true score variance, observed scores become increasingly poor guides to individuals' true standing on the latent ability — and, crucially, any statistic we compute from those observed scores inherits that contamination.

Reliability: Quantifying the Signal-to-Noise Balance

Reliability is the formal index of how much of the observed score variance reflects true score variance versus error. It is defined as the ratio:

Reliability = σ²_true / (σ²_true + σ²_error)
True score variance as a proportion of total observed variance

A reliability of 1.0 means no error — every point of observed variance reflects genuine individual differences. A reliability of 0 means pure noise. Most psychological measures fall somewhere in between, and the practical consequences of that position are severe — not merely in the form of imprecision, but in the form of a systematic, one-directional bias in every effect we estimate. This systematic bias is what Spearman named attenuation, and quantifying it was his second major contribution of 1904.
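The decomposition and the ratio above can be checked with a few lines of simulation. The sketch below is a minimal plain-Python illustration with arbitrary variances, chosen so that signal and noise are equal and the designed reliability is .5.

```python
import random
import statistics

random.seed(1)

# Observed = true + error, with equal signal and noise -> reliability 0.5
N = 20_000
var_true, var_error = 1.0, 1.0

true_scores = [random.gauss(0, var_true ** 0.5) for _ in range(N)]
observed = [t + random.gauss(0, var_error ** 0.5) for t in true_scores]

reliability = var_true / (var_true + var_error)   # by definition
empirical = statistics.variance(true_scores) / statistics.variance(observed)

print(reliability, round(empirical, 3))   # both close to 0.5
```

With large N the empirical variance ratio converges on the designed value; in a real study, of course, the true scores are unobservable and the ratio must be estimated indirectly.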

Reliability can be estimated in two broad ways: through test-retest reliability (correlating two administrations of the same measure) or through internal consistency (asking how well items within a single administration agree with each other, as indexed by Cronbach's α, introduced in 1951). Despite their different computational forms, both aim to estimate the same underlying ratio above, but they rely on different assumptions and can diverge in practice. The choice between them is often practical — internal consistency is cheaper to obtain — but they should not be treated as interchangeable in all contexts.

A critical implication follows: reliability is not just a property of a measure, but of a design. Tasks that maximise within-person effects often minimise between-person variance, limiting the signal available for individual differences — a tension that becomes central in modern reliability work.


Section 02 · Attenuation

Attenuation: The Predictable
Deflation of Every Effect

Section 1 established that observed scores are a mixture of true score and error, and that reliability quantifies how much of the observed variance is signal. The natural next question is: what does this do to the statistics we compute from those scores? The answer — which Spearman derived in the same 1904 paper — is precise and sobering. Low reliability does not merely add imprecision; it introduces a systematic, one-directional bias that pushes every effect we measure toward zero. This bias is called attenuation, and it operates on correlations and group differences alike.

Attenuation of Correlations

When two measures are both imperfectly reliable, the observed correlation between them is systematically smaller than the true underlying correlation, in expectation. Spearman's attenuation formula gives the exact relationship:

r_observed = r_true × √(ICC_x × ICC_y)
Spearman's attenuation formula for correlations

The implication is stark. If two measures each have reliability of .5, and the true correlation between the underlying constructs is .8, the observed correlation will be .8 × √(.5 × .5) = .8 × .5 = .40 — exactly half the true value. A researcher who observes r = .40 and concludes the constructs are weakly related has been misled by measurement error, not by the constructs themselves.

Attenuation of Group Differences

A critical and often overlooked point: attenuation affects mean differences just as severely as it affects correlations. This is frequently misunderstood — researchers sometimes assume that poor reliability "only matters for individual differences research." This is wrong. Karvelis and Diaconescu (2025) demonstrate mathematically that observed standardised mean differences follow the same reliability-based scaling:

δ_observed = δ_true × √ICC
Attenuation formula for standardised group differences (Cohen's d), assuming equal reliability across groups

A true effect of d = 0.8 measured with an instrument of reliability .5 yields an observed d of 0.8 × √.5 ≈ 0.57. A "large" effect becomes a "medium" effect — not because the effect is smaller in reality, but because of how Cohen's d is constructed. Cohen's d is the mean difference between groups divided by the pooled standard deviation: d = (μ₁ − μ₂) / σ_pooled. Measurement error leaves the mean difference in the numerator unaffected — random noise cancels when averaged across participants. But that same noise inflates the pooled standard deviation in the denominator, making it larger than the true population SD. A larger denominator with an unchanged numerator always produces a smaller d. The effect is real and unchanged; the ruler has simply been made noisier, stretching the scale against which the difference is expressed.
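The same arithmetic can be checked in simulation: the mean difference in the numerator survives the added noise, while the pooled SD in the denominator inflates, shrinking d by the predicted √ICC.

```python
import random
import statistics

random.seed(3)

# A true d of 0.8 measured with reliability .5 comes out near 0.8 * sqrt(.5)
n, d_true, icc = 30_000, 0.8, 0.5
err_sd = ((1 - icc) / icc) ** 0.5   # error variance equals true variance here

g1 = [random.gauss(0, 1) + random.gauss(0, err_sd) for _ in range(n)]
g2 = [random.gauss(d_true, 1) + random.gauss(0, err_sd) for _ in range(n)]

sd_pooled = ((statistics.variance(g1) + statistics.variance(g2)) / 2) ** 0.5
d_obs = (statistics.mean(g2) - statistics.mean(g1)) / sd_pooled

print(round(d_obs, 3), round(d_true * icc ** 0.5, 3))   # observed vs predicted
```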

Why Attenuation Is Always a Lower Bound — Never an Upper One

At this point a reasonable question arises: if measurement error is random, why does it always push effects downward? If errors are equally likely to be positive or negative, shouldn't they sometimes inflate an observed correlation and sometimes deflate it — producing a noisy estimate centred on the true value rather than a systematically lower one?

This intuition is correct for means — random error cancels across people, leaving group means unbiased. But a correlation is not a mean. It is a ratio of covariance to variance, and error affects the numerator and denominator very differently:

r = Covariance(X, Y) / √(Variance(X) × Variance(Y))
The correlation as a ratio — and why each component is affected differently by error

When you add random error to both measures, the covariance in the numerator is unchanged — provided the errors on measure X and the errors on measure Y are independent of each other. But those same errors inflate the variance in the denominator — because each person's observed score now fluctuates around their true score, making both distributions wider. A larger denominator with the same numerator always means a smaller fraction. Always.

The Asymmetry in Plain Language

Error spreads scores out (inflating variance) without making them track each other any better (covariance unchanged). You have made both distributions noisier without adding any genuine shared signal between them. The ratio of shared signal to total spread therefore shrinks — and this happens every time, in every sample, regardless of direction.

Attenuation and Regression to the Mean — The Same Phenomenon

This connects to a concept that often surprises students: attenuation and regression to the mean (RTM) are the same underlying mathematical fact, viewed from different angles.

RTM describes what happens to a single person across two measurement occasions. If someone scores extremely high on a first measurement, their score on retest tends to be lower — closer to the group mean. The reason is identical to the attenuation argument: their extreme first score is partly driven by error. On retest that error is not repeated, so the expected score shifts back toward the mean. The extreme score regresses toward the mean in proportion to how much of its variance was error — that proportion is precisely 1 − reliability.

RTM and Attenuation — One Idea, Two Views

RTM (within-person, across occasions): A person's extreme observed score will likely be less extreme on retest — because the error that pushed it to the extreme is random and will not repeat. The score regresses toward the group mean in proportion to measurement error.

Attenuation (between-persons, across measures): An observed correlation between two imperfect measures is always smaller than the true correlation — because error inflates variance without inflating covariance. The correlation regresses toward zero in proportion to measurement error.

Both follow directly from the same source: the correlation between an observed score and the true score it represents is √reliability — less than 1.0 whenever measurement is imperfect.
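Both views can be generated from a single simulated dataset. A plain-Python sketch, with reliability .5 and a selection threshold of +1.5 SD (both arbitrary choices):

```python
import random
import statistics

random.seed(5)

# RTM in simulation: with reliability 0.5, people selected for an extreme
# score on occasion 1 regress halfway back to the mean on occasion 2
N, icc = 100_000, 0.5
true_sd, err_sd = icc ** 0.5, (1 - icc) ** 0.5   # unit observed variance

true = [random.gauss(0, true_sd) for _ in range(N)]
t1 = [t + random.gauss(0, err_sd) for t in true]   # occasion 1
t2 = [t + random.gauss(0, err_sd) for t in true]   # occasion 2, fresh error

extreme = [(a, b) for a, b in zip(t1, t2) if a > 1.5]   # top scorers, occasion 1
m1 = statistics.mean(a for a, _ in extreme)
m2 = statistics.mean(b for _, b in extreme)

# Expected retest score for standardised scores: ICC x observed score
print(round(m1, 2), round(m2, 2), round(icc * m1, 2))
```

The retest mean of the selected group sits very close to ICC times its first-occasion mean, which is the attenuation factor wearing its within-person hat.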

Interactive: Regression to the Mean & Attenuation — Two Views of the Same Fact

The left panel shows RTM: as reliability decreases, extreme scores on a first measurement regress toward the group mean on retest. The right panel shows attenuation: the same reliability reduction deflates the observed correlation between two measures toward zero.

(The sliders set the reliability and the true correlation; the readouts show the expected retest score, ICC × observed, and the attenuated correlation, r_observed = r_true × ICC when both measures share the same reliability.)
Critical Point

Attenuation does not diminish as you add more participants. Adding more people makes you increasingly confident in a badly attenuated value — not in the true value. What can help is adding more trials per person (which reduces trial noise and raises reliability) or using a hierarchical model that explicitly accounts for that noise. The distinction between more participants and more trials — and why they have such different consequences — is developed in detail in Section 4.


Section 03 · Dichotomisation

The Hidden Cost of Cutting
Continuous Measures

Many researchers in psychology and neuroscience routinely convert continuous measures into groups — median splits into "high" and "low" scorers, clinical thresholds creating "patient" versus "control" groups, or factorial designs that require categorical independent variables. This practice introduces a further layer of distortion that compounds the attenuation problem described above.

Why Observed Group Differences Depend on Where You Cut

When you dichotomise a continuous distribution, the observed standardised difference between your two groups is not simply a function of the true underlying difference. It depends on two things working together: the overall variance of the underlying distribution, and where you place the cut point.

Interactive: Dichotomisation & Effect Size

See how the cut point location and measurement reliability jointly determine the observed effect size after dichotomisation.


The Misclassification Problem

Add measurement error and the situation worsens further. When the underlying measure is unreliable, individuals near the cut point are frequently misclassified — a true "high" scorer falls below the median on this occasion, and vice versa. The consequence is that attenuation compounds: the observed group difference is attenuated first by measurement error in the underlying measure (following δ_observed = δ_true × √ICC) and then further by the information discarded through dichotomisation itself.

Connection to ANOVA Designs

The pressure toward dichotomisation often comes from the desire to use factorial ANOVA designs, which require categorical independent variables. As a rule, any analysis that uses a continuous measure as a categorical independent variable is throwing away information and inflating required sample sizes. Regression and mixed models with continuous predictors are almost always preferable.

The practical consequence is that researchers using dichotomised designs need substantially larger sample sizes to achieve the same statistical power as those using continuous analyses — often two to three times larger, depending on reliability and cut location. The two sources of information loss are multiplicative, not additive.


Section 04 · Advanced

A Better Measure of Reliability:
The Signal-to-Noise Ratio

The previous sections showed that reliability determines the degree of attenuation in both correlations and group differences. But there is a deeper problem with how reliability is conventionally measured that limits our ability to diagnose and fix the problem. This section introduces a more informative framework — due to Rouder and Mehrvarz (2024, 2026) — that reframes reliability as a signal-to-noise ratio, and asks what that framework reveals when placed alongside the richer variance decomposition of linear mixed models.

The Problem with Conventional Reliability Coefficients

Cronbach's alpha and test-retest ICC are the workhorses of reliability assessment. Both are widely reported and widely interpreted. But they share a critical limitation: they are not fixed properties of a task — they depend on how many trials or items you administer.

Consider two labs both studying the Stroop task. Lab A runs 200 trials per person per condition; Lab B runs 20 trials per person per condition. Lab A will report much higher test-retest reliability than Lab B — but the task is identical. The reliability coefficient has told us something about the experimental implementation, not about the task itself.

The Signal-to-Noise Ratio γ²

Rouder and Mehrvarz propose replacing the reliability coefficient with a signal-to-noise variance ratio:

γ² = σ²_B / σ²_W
Between-person true variance (signal) divided by within-person trial noise (noise)

Here σ²_B is the genuine variance in people's true ability — how much people actually differ from one another. σ²_W is the trial-by-trial noise — the random fluctuations within a single person's performance. The ratio of these two quantities is a property of the task and the population, entirely independent of how many trials you choose to run.

From γ², the expected test-retest reliability at any trial size L follows directly:

E(r) ≈ γ² / (γ² + 2/L)
The reliability curve — reliability as a function of γ² and trial size L

This formula defines an entire curve of reliability as a function of trial size — and the shape of that curve is completely determined by γ². A task with high γ² rises steeply: reasonable reliability is achieved with modest trial counts. A task with low γ² rises slowly and may never reach acceptable reliability within any practical design.
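The curve and its inversion are two lines of arithmetic. A small planning sketch (the function names are mine):

```python
# Reliability curve E(r) = g2 / (g2 + 2/L), and its inversion for trial planning
def reliability(gamma_sq, L):
    return gamma_sq / (gamma_sq + 2 / L)

def trials_needed(gamma_sq, target_r):
    # Solve target_r = g2 / (g2 + 2/L) for L
    return 2 / (gamma_sq * (1 / target_r - 1))

for gamma in (0.12, 1.41):   # Stroop effect vs Poggendorff illusion
    g2 = gamma ** 2
    print(f"gamma = {gamma}: r at L=100 is {reliability(g2, 100):.2f}; "
          f"L for r=.90 is {trials_needed(g2, 0.90):.0f}")
```

The contrast is dramatic: the high-γ task reaches r = .90 within a dozen trials, while the low-γ task needs trial counts far beyond typical designs.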

Interactive: The Reliability Curve — γ² in Action

Adjust γ (the signal-to-noise standard deviation ratio) and see how the full reliability curve changes. Compare your task to real examples from the literature.

Reference values from the literature: Stroop effect γ ≈ .12, flanker effect γ ≈ .08, overall Stroop speed γ ≈ .96, Müller-Lyer illusion γ ≈ .72, Poggendorff illusion γ ≈ 1.41.

The Connection Back to Attenuation

The interactive curve above makes visible what a single ICC value completely conceals. Working through the curve formula, the Stroop effect (γ ≈ .12) needs over 300 trials per condition to reach r = .70 and over 1,200 for r = .90 — yet most studies run 20–50. The Poggendorff illusion (γ ≈ 1.41) exceeds r = .90 with just 15 trials. The ceiling is set by γ², not by N. And since the ICC values in the attenuation formula are themselves determined by γ² and L, the degree of attenuation in any reported correlation is:

r_observed ≈ r_true × γ²/(γ² + 2/L)
Attenuation as a direct function of task signal-to-noise and trial size

For contrast-based paradigms — such as the face inversion effect — observed γ is characteristically low because the subtraction cancels much of the between-person variance while doubling the noise. Poor cross-task correlations in such paradigms are not evidence of construct fragmentation; they are the mathematically predictable consequence of using a ruler too noisy to distinguish nearby positions.
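The mechanism can be seen in a simulation of a generic contrast paradigm. The magnitudes below are illustrative assumptions (person speed SD 100 ms, person contrast-effect SD 20 ms, trial SD 150 ms), not estimates from any dataset.

```python
import random
import statistics

random.seed(8)

# Subtraction cancels the large shared person variance (signal) and doubles
# the trial noise, so the contrast score's SNR is far below overall speed's
N, L = 5_000, 50
sd_speed, sd_effect, sd_trial = 100.0, 20.0, 150.0

contrast_scores = []
for _ in range(N):
    base = random.gauss(0, sd_speed)      # person's overall speed (cancels)
    effect = random.gauss(0, sd_effect)   # person's true contrast effect
    cong = statistics.mean(random.gauss(base, sd_trial) for _ in range(L))
    incong = statistics.mean(random.gauss(base + effect, sd_trial) for _ in range(L))
    contrast_scores.append(incong - cong)

g2_speed = sd_speed ** 2 / sd_trial ** 2      # about 0.44
g2_contrast = sd_effect ** 2 / sd_trial ** 2  # about 0.018, twenty-five times smaller

# Empirical reliability of the contrast score matches g2 / (g2 + 2/L)
rel_contrast = sd_effect ** 2 / statistics.variance(contrast_scores)
print(round(rel_contrast, 2), round(g2_contrast / (g2_contrast + 2 / L), 2))
```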

Bridging to Linear Mixed Models

Rouder's model contains no item subscript. His τ² — the within-person noise variance, σ²_W above — absorbs everything: item difficulty, item-by-person interactions, and pure trial noise, all in a single lump. This works cleanly for the paradigms he has in mind, where trials are essentially random draws from a large pool. But many experimental paradigms in psychology use a finite set of stimuli created by the experimenter, and those stimuli vary: some items are harder than others, some items discriminate between people better than others, and the degree of overlap between what was tested on occasion 1 and occasion 2 will depend on how the materials were sampled. A linear mixed model already partitions τ² into its constituents — between-person variance, between-item variance, and residual noise — and those components have different implications for reliability:

What Actually Enters the Noise Denominator

  • Always contributes: σ²_residual — irreducible within-person trial noise.
  • Contributes when items are resampled: σ²_item-sampling — instability from drawing a different item subset each occasion.
  • May contribute: σ²_person×item — if people respond differently to specific items.

Plain item intercept variance does not enter the denominator when items are fixed across occasions — it is shared structure, not noise. The LMM framing is not a correction to Rouder but an elaboration: it makes visible why a task has low SNR and what lever — more trials, better items, or a different sampling strategy — is most likely to raise it.

Once a hierarchical model has been fitted, the estimated variance components are already in hand. A Rouder-style γ² can be computed directly from model output, plugged into the reliability curve, and used to ask how many more trials — or better-calibrated items — would be needed to reach a measurement target. The following example makes this concrete.

A Worked Example: Face Recognition with a 2×2 Design

To make this concrete, consider a 100-item face recognition memory task with a 2×2 design: Factor A = Upright vs Inverted, Factor B = Familiar vs Unfamiliar. Two hundred participants each complete all conditions. You fit a crossed hierarchical model with participant and item random slopes for both factors and their interaction:

accuracy ~ A * B + (1 + A*B | participant) + (1 + A*B | item)
Full crossed model — random slopes for both participants and items across all effects

Suppose the model yields the following variance components:

Source                  Component        Variance   Role
Participant intercept   σ²_p-intercept   0.40       Overall memory ability signal
Participant slope A     σ²_p-A           0.12       Individual inversion effect signal
Participant slope B     σ²_p-B           0.06       Individual familiarity effect signal
Participant slope A×B   σ²_p-A×B         0.03       Individual interaction signal
Item intercept          σ²_i-intercept   0.25       Item difficulty (shared structure)
Item slope A            σ²_i-A           0.05       Item-level inversion variability
Residual                σ²_residual      0.80       Trial noise (always in denominator)

Because faces are presented as a fixed set (the same 100 faces on both occasions), item intercepts are shared structure and do not contribute to noise. The effective noise denominator is σ²_residual = 0.80 throughout. This gives a separate γ² for each inferential target:

Overall Ability

How well can we rank people by general face memory?

γ² = 0.40 / 0.80 = 0.50

Good signal — stable individual differences detectable with moderate trial counts.

Inversion Effect

How well can we rank people by their upright–inverted sensitivity?

γ² = 0.12 / 0.80 = 0.15

Modest — people differ in inversion sensitivity, but many more trials needed to localise individuals.

Familiarity Effect

How well can we rank people by familiar–unfamiliar sensitivity?

γ² = 0.06 / 0.80 = 0.075

Low — weakly discriminating for individual differences research.

A×B Interaction

How well can we rank people by their interaction contrast?

γ² = 0.03 / 0.80 = 0.038

Very low — 100 items is far too few; task redesign likely needed.

This example makes a point that a single reliability coefficient would completely obscure: the same task is simultaneously excellent for measuring overall ability, mediocre for inversion sensitivity, and essentially unsuitable for studying individual differences in the interaction effect. A researcher who runs 100 items and reports a high reliability for the overall memory score has no grounds for assuming the inversion contrast is equally well measured — yet this assumption is routinely made in the literature.

Consulting the reliability curve for each γ²: overall ability (γ² = 0.50) achieves r ≈ .96 at L = 100 trials, meaning very few additional items are needed. The inversion effect (γ² = 0.15) achieves only r ≈ .88 — still useful, but considerably more fragile. The interaction (γ² = 0.038) achieves r ≈ .66 — below the conventional threshold of .70 even with 100 trials per condition, and would require over 450 trials to reach .90. No amount of modest sample-size increase rescues this; only more items or a fundamentally different task design will help.
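Both the γ² values and the curve consultations can be reproduced in a few lines (component values from the table above; the labels are mine):

```python
# One gamma-squared per inferential target, from the worked example's variance
# components, plus the implied reliability at L = 100 via E(r) = g2/(g2 + 2/L)
var_residual = 0.80   # effective noise denominator (items fixed across occasions)
targets = {
    "overall ability":  0.40,   # participant intercept variance
    "inversion effect": 0.12,   # participant slope A variance
    "familiarity":      0.06,   # participant slope B variance
    "A×B interaction":  0.03,   # participant slope A×B variance
}

L = 100
for name, var_person in targets.items():
    g2 = var_person / var_residual
    r = g2 / (g2 + 2 / L)
    print(f"{name:>16}: gamma^2 = {g2:.3f}, r(L=100) = {r:.2f}")
```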

Interactive: From LMM Variance Components to γ² and Trial Planning

Enter variance components from a fitted crossed mixed model and set item overlap. The noise denominator is computed correctly — residual always contributes; item-sampling noise scales with (1 − overlap). Compare how different random effects (person intercept vs slopes) yield different γ² values for the same task.

Left: variance decomposition showing signal (green) vs noise components. Right: reliability curve implied by γ². Try setting overlap to 1.0 (face task) vs 0.0 (Stroop) to see how item design shifts the curve.
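The denominator rule described above can be sketched as a small helper (a hypothetical function, not taken from any package): residual noise always counts, while item-sampling variance counts only to the extent that item sets differ across occasions.

```python
# Sketch of the noise-denominator rule for a task-goodness gamma-squared
def gamma_sq(var_person, var_residual, var_item_sampling=0.0, item_overlap=1.0):
    # Item-sampling variance enters the noise in proportion to (1 - overlap)
    noise = var_residual + (1 - item_overlap) * var_item_sampling
    return var_person / noise

# Fixed item set (face task): item variance is shared structure, not noise
print(gamma_sq(0.40, 0.80, 0.25, item_overlap=1.0))   # 0.5
# Fresh items every occasion (Stroop-like pools): it all becomes noise
print(round(gamma_sq(0.40, 0.80, 0.25, item_overlap=0.0), 3))   # 0.381
```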

Correcting for Attenuation — Then and Now

Section 2 introduced Spearman's correction as the classical remedy for attenuation: divide the observed correlation by the square root of the product of the two reliabilities. This works, and Haines (2026) confirms it is approximately unbiased — across many simulations it centres on the true ρ (see Further Reading). But it is fragile: reliability must itself be estimated and is therefore noisy, so dividing one uncertain quantity by another inflates variance and, in Haines' simulations, produces impossible values above 1.0 in roughly 13% of cases. Most critically, it requires reliability as a known input before it can do anything at all.

The more modern approach — fitting a hierarchical model directly to trial-level data — arrives at the same disattenuated estimate through a fundamentally different route. Rather than estimating reliability and then applying a correction, it partitions the observed variance into its components (person signal, trial noise, item contributions) and estimates ρ as a direct parameter of the population covariance structure. No reliability figure is needed. No post-hoc correction is applied. Disattenuation is a natural consequence of modelling the data-generating process correctly. Haines shows that Spearman's correction, multivariate shrinkage, and Bayesian hierarchical models all converge on the same answer — but the model-based routes do so with correctly calibrated uncertainty and no impossible values.

The Central Insight: Same Answer, Without Needing Reliability

Spearman gets to the right answer on average — but needs reliability as input, and that input is noisy. The hierarchical model gets to the same right answer without reliability ever entering the calculation — and as a bonus, the variance components it estimates along the way are exactly the ingredients needed to compute Rouder's γ², consult the reliability curve, and reason about how many more trials or better-calibrated items would be needed. The modern approach does not just fix the correlation estimate; it generates the diagnostic information needed to improve the measurement itself.

Beyond Reliability: Consistency, Uncertainty, and Statistical Hell

Section 2 flagged that adding more participants does not reduce attenuation — the bias is in the measurement, not the sampling. Rouder's 2026 paper formalises exactly why this is so troubling. The naive sample correlation between contrast scores is not merely attenuated — it is an inconsistent estimator of the true latent correlation. Most biased estimators are at least consistent: as sample size grows, the bias shrinks toward zero. Not here. The estimator converges to the attenuated value and stays there regardless of how many participants are added, creating what Rouder calls "statistical hell": as N grows, the confidence interval narrows around the wrong answer, and researchers become increasingly certain of an incorrect conclusion.

The standard error compounds this: it is computed conditional on the model being correct, so if trial noise is omitted, the SE reflects only sampling variability around the attenuated estimate — it contains no information about the systematic downward bias. A study with N = 800 reporting r = 0.10 with a tight CI looks like reliable evidence of a weak relationship, when in reality the measurement was simply too noisy to recover the true association.

Statistical Hell: Precision Around the Wrong Number

Suppose the true latent correlation between two tasks is 0.50. Because of noisy difference scores, the observed correlation converges near 0.10. Then:

N = 40     →   estimate ≈ 0.10,   wide CI
N = 400    →   estimate ≈ 0.10,   narrow CI
N = 4000   →   estimate ≈ 0.10,   extremely narrow CI

Precision improves while validity does not. The large study gives high confidence in an estimate five times smaller than the truth.

The hierarchical model resolves both problems at once. It is a consistent estimator — as N grows it converges to the true ρ, not the attenuated value. And its credible intervals incorporate both sources of uncertainty: finite participants and finite trials. The Stroop–Flanker example from Rouder et al. (2026) makes this concrete:

The Stroop–Flanker Example: What the Numbers Actually Look Like

N = 253 participants, ~93 trials per condition each (Rey-Mermet et al., 2018).

Conventional approach — sample contrast scores correlated:

r_sample = .045  |  95% CI: ±.123   (narrow, looks precise)

Bivariate hierarchical model fitted to trial-level data:

ρ_model = .17   |  95% CI: ±.45   (~4× wider, honestly uncertain)

The narrow CI on the sample correlation is not a sign of precision — it is a sign that trial noise was omitted from the uncertainty calculation entirely. The model-based estimate is nearly four times larger and its interval roughly four times wider, with both corrections going in the right direction simultaneously.

γ² and reliability remain useful for planning — they tell you how many trials are needed to achieve a target credible interval width before data collection. But for interpretation, the reliability coefficient is tangential. Fit the hierarchical model, report the posterior on ρ directly, and let the uncertainty speak for itself.

Summary: The Full Chain of Measurement Error

Poor γ² → low reliability at any practical trial count → biased and inconsistent observed correlation estimates → SE compression creating false confidence around the wrong value → misleading conclusions about construct structure, test validity, and theoretical models. Every link in this chain is quantifiable. The remedies are: better items (higher γ² via IRT-informed design), more trials per person, and — most powerfully — a hierarchical model that estimates the true correlation directly from trial-level data, requiring no reliability input and producing correctly calibrated uncertainty as a natural output.

Key Takeaways — The Full Argument in Summary
§1 · Foundations
  • Every observed score = true score + random error. Reliability quantifies the signal-to-noise ratio of that mixture.
  • Reliability is not just a property of a measure — it depends on how individual scores are estimated from it.
§2 · Attenuation
  • Measurement error leaves the covariance numerator unbiased but inflates the variance denominator — so every observed correlation is smaller than the true value.
  • The same mechanism deflates Cohen's d: error inflates the pooled SD in the denominator without touching the mean difference.
  • Adding more participants does not reduce attenuation — only more trials per person, and better statistical modelling, can do that.
§3 · Dichotomisation
  • Cutting a continuous measure into groups discards information and compounds attenuation — the two penalties are multiplicative, not additive.
  • Continuous predictors in regression or mixed models are almost always preferable to dichotomised groups in ANOVA.
§4 · Signal-to-Noise & Modern Remedies
  • γ² = σ²_person / τ² is a trial-size-invariant measure of task goodness. It determines the entire reliability curve — not just one point on it.
  • Low γ² from contrast paradigms is not a paradox — the subtraction cancels between-person variance while doubling noise. Poor cross-task correlations follow mathematically.
  • A fitted LMM already contains the ingredients to compute γ² — partitioning person signal from item and residual noise reveals why a task has low SNR, not just that it does.
  • The same task can simultaneously measure overall ability well and individual contrast effects poorly — a single reliability coefficient conceals this; separate γ² values per random effect reveal it.
  • Spearman's correction and hierarchical models both disattenuate correlations — Haines (2026) shows they converge on the same answer. But the hierarchical model does so without needing reliability as input, produces correctly calibrated uncertainty, and generates the variance components needed for γ² and trial planning as a natural by-product.
  • Adding more participants to a study with noisy contrast scores makes you more certain of the wrong answer — the sample correlation is an inconsistent estimator of the true latent correlation. The fix is modelling trial noise explicitly, not recruiting more people.
Further Reading

Nathaniel Haines' (2026) post "How to Estimate a Correlation, and What It Means for Science" at haines-lab.com provides the definitive worked demonstration of everything discussed here — with R code, animated visualisations showing why univariate shrinkage cannot fix the correlation, and simulation results comparing Spearman, multivariate shrinkage, lme4, and brms side by side. The striking convergence of all four methods on the same disattenuated answer — despite their very different starting points — is the empirical punchline this section has been building toward.