Statistical Foundations · Part 2 of 2

When the Chain Breaks
— and How to Fix It

ANOVA in depth, then what to do when its assumptions fail — Linear Mixed Models, and then the deeper question of measurement quality through IRT.

§ 00

Where Part 1 left us

Part 1 traced a logical chain from first principles to inference. We established why averaging works, and what it guarantees about the resulting means. Now we go deeper into that chain — first to understand ANOVA properly, then to ask what happens when its assumptions no longer hold.

Law of Large Numbers: averaging → stable summaries
Central Limit Theorem: means → approximately normal
Group Means: each ~ N(μ, σ²/N)
t-test / ANOVA: precision-weighted comparison

The key insight from Part 1 was precision: larger N brings group means into sharper focus, and that sharpness — expressed as the standard error σ/√N — is what makes it possible to judge whether an observed difference between conditions is real. The t-statistic and the F-ratio are both, at bottom, measures of how large the observed difference is relative to the precision of our estimates — that is, the ratio of signal to noise.

In Part 2, we ask three questions in sequence. First: how does ANOVA actually partition variance to arrive at the F-ratio? The mechanics reveal why the assumptions exist — they are not arbitrary rules but direct consequences of the CLT logic we already have in hand (Part 1 · §05 introduced the t-test and ANOVA; §01 here opens up their machinery). Second: what happens when those assumptions are violated? This leads to Linear Mixed Models, which extend ANOVA by modelling the dependence structure the data actually contain. Third: how good is the measurement feeding into any of these models? This is the IRT question — precision at the item level, before the data even reach the statistical model.

Part I
Inside ANOVA — partitioning variance and understanding the assumptions
§ 01

How ANOVA actually works

Most introductory courses present ANOVA as a procedure: run the test, check if p < .05. But ANOVA has a precise mechanical logic that makes it far more interpretable — and that connects directly back to the LLN and CLT we established in Part 1.

The fundamental partition

Start from a simple observation: every data point (which, as we have seen, is typically an individual mean) deviates from the grand mean of the whole dataset. ANOVA's core move is to ask: where does that deviation come from? It partitions the total variability of the dataset into two non-overlapping sources:

The ANOVA partition — total = between + within: SStotal = SSbetween + SSwithin
SStotal: Sum of squared deviations of every observation from the grand mean. The total variability in the dataset, ignoring group structure.
SSbetween: Sum of squared deviations of each group mean from the grand mean, weighted by group size. This captures how far apart the conditions are from each other — the signal we are trying to detect.
SSwithin: Sum of squared deviations of each observation from its own group mean. This captures variability within conditions — the noise against which the signal is judged.

This partition is exact and algebraically necessary — it always holds. The ratio of these two sources, once each is adjusted for degrees of freedom, gives the F-statistic:

The F-ratio: F = MSbetween / MSwithin = (SSbetween / dfbetween) / (SSwithin / dfwithin)
MSbetween: Mean square between — the between-groups variance per degree of freedom. Reflects how spread out the group means are.
MSwithin: Mean square within — the average within-group variance. Reflects how much individuals vary within each condition. This is the baseline noise level.
F: When the group means are far apart relative to within-group scatter, F is large. When within-group noise is large relative to between-group differences, F shrinks. F = 1 means the two sources are equal — nothing unusual.
The precision connection

It is worth pausing here to connect MSwithin back explicitly to the precision story from Part 1, because they are the same quantity viewed from two angles.

Part 1 framed precision as the standard error of a mean: SE = σ/√N. The SE shrinks with N — this is the CLT result that makes large samples give sharp estimates. But σ is a property of our target population from which we have sampled, so we do not have it directly. We have to estimate it from the data, and the estimator is the within-group variance. MSwithin is ANOVA's estimate of σ² — pooled across groups on the assumption that each condition shares the same underlying noise level.

Once MSwithin is in hand, the SE of each group mean follows immediately: SEgroup = √(MSwithin / N). This is exactly the σ/√N formula from Part 1, with the estimated σ² dropped in. Larger N per group shrinks this SE just as the CLT predicted. The F-ratio then compares the spread of the group means (MSbetween) against that same noise level (MSwithin) — asking whether the means are further apart than their individual precision would suggest they should be by chance.

ANOVA is not a new framework. It is the Part 1 precision story packaged as a variance ratio: estimate σ² from the within-group scatter, use it to set the precision of each group mean, then ask whether the observed separation between means exceeds that precision.
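To make the mechanics concrete, here is a minimal sketch in R that computes the partition by hand on simulated three-group data and checks it against the built-in aov() function. The group means, σ, and N are invented for illustration.

set.seed(1)
n <- 20                                           # observations per group
d <- data.frame(
  group = rep(c("A", "B", "C"), each = n),
  score = c(rnorm(n, 5.0, 1), rnorm(n, 5.8, 1), rnorm(n, 6.5, 1))
)

grand_mean  <- mean(d$score)
group_means <- tapply(d$score, d$group, mean)

ss_total   <- sum((d$score - grand_mean)^2)
ss_between <- sum(n * (group_means - grand_mean)^2)
ss_within  <- sum((d$score - group_means[d$group])^2)

ms_between <- ss_between / (3 - 1)                # df_between = k - 1
ms_within  <- ss_within  / (3 * n - 3)            # df_within = total N - k
F_by_hand  <- ms_between / ms_within

all.equal(ss_total, ss_between + ss_within)       # the partition is exact
sqrt(ms_within / n)                               # SE of each group mean, estimated σ/√N
summary(aov(score ~ group, data = d))             # same SS, df, and F as computed by hand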

Explore it: what drives the F-ratio?

Simulation — ANOVA Variance Partition
Imagine the face-recognition experiment from Part 1, now extended to three conditions: upright faces, inverted faces, and scrambled faces. Each group of participants is tested in one condition and produces a set of recognition scores. The experimental effect — whether the manipulation matters at all — shows up as systematic differences between the three condition means. Everything else is within-group noise.

The two main sliders map directly onto the quantities the partition is built from. Group mean separation is the signal — it feeds straight into MSbetween. Within-group scatter (σ) is the noise level — it feeds straight into MSwithin. The F-ratio is simply the first divided by the second.

Once you have played with the sliders freely, work through these four reference patterns and watch how F and the partition respond (a code sketch of pattern (d) follows the simulation panel):
  (a) Large separation, small scatter — a real effect with consistent participants. MSbetween dominates MSwithin; F is large; p is small. This is the pattern a well-run experiment aims for.
  (b) Small separation, small scatter — no experimental effect, but each group mean is estimated precisely. F stays close to 1 because there is no signal to detect, even though the measurements are clean.
  (c) Large separation, large scatter — a real effect exists, but individual variability is so high that MSwithin catches up with MSbetween. F shrinks, and the real effect can be missed even though it is present in the means.
  (d) Hold separation and scatter fixed, slide N per group up and down — the group means are the same on average, and the noise level σ is the same, but each group mean is now a sharper estimate of its true condition mean (recall SE = σ/√N from Part 1). The same signal stands out more cleanly against the same noise floor; MSbetween, which weights by N, grows relative to MSwithin; F climbs and p drops. This is the precision lever from Part 1, now visible inside the ANOVA output — raising N improves precision, and the F-ratio rewards that precision directly.
[Interactive simulation: sliders for group mean separation (→ MSbetween), within-group scatter σ (→ MSwithin), and N per group; readouts for SS between, SS within, the F ratio, and an approximate p value; panels showing the data for the three groups (dots = individual observations, grand mean marked) and the variance partition.]
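Reference pattern (d) can also be checked in code. A rough sketch, with invented means and σ: the same three-group experiment is simulated at two per-group sample sizes, and the typical F-ratio is compared.

simulate_F <- function(n_per_group, means = c(5.0, 5.8, 6.5), sigma = 1.5) {
  d <- data.frame(
    group = rep(c("upright", "inverted", "scrambled"), each = n_per_group),
    score = rnorm(3 * n_per_group, mean = rep(means, each = n_per_group), sd = sigma)
  )
  summary(aov(score ~ group, data = d))[[1]][["F value"]][1]
}

set.seed(2)
median(replicate(500, simulate_F(10)))   # modest typical F at N = 10 per group
median(replicate(500, simulate_F(80)))   # same means, same σ, much larger typical F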

Why squares? — Laplace, Gauss, and the error distribution

A question the SS notation raises but rarely answers: why do we square the deviations? Why not simply sum the absolute distances between each observation and its group mean — which would also measure spread? The answer connects ANOVA to a deeper historical argument about what makes an estimator optimal.

Historical note — Laplace, Gauss, and the geometry of error

In the late eighteenth and early nineteenth centuries, Laplace and Gauss independently grappled with a fundamental question: given a set of imperfect measurements, what is the best way to combine them into a single estimate? Laplace's profound insight was that there is no universal answer — the optimal aggregation method depends on what you assume about the distribution of errors in your measurements.

Laplace himself initially favoured minimising the sum of absolute deviations — what we now call L1 estimation — because he was working under a different assumed error distribution (the double-exponential, or Laplace, distribution). The critical insight came when it was established that if errors are normally distributed, then minimising the sum of squared deviations is the optimal strategy. Under normality, least squares is not just computationally convenient — it is the maximum likelihood estimator (MLE). Maximum likelihood is one of the central pieces of machinery in frequentist statistics: given an assumed distribution for the data, it picks the parameter values that make the observed data most probable. When errors are Gaussian, those values are exactly the ones that minimise the sum of squared deviations — so least squares is not a computational shortcut but the MLE for a normal error distribution, and it extracts the most information the data can provide under that assumption.

This is where the argument connects directly back to Part 1. The CLT is the reason we expect errors to be approximately Gaussian in the first place — when an observation is the sum of many small, independent influences and no single source of error dominates, the CLT guarantees the combined error is close to normal. That is also why, before running an ANOVA or a t-test, researchers actively prefer data whose distribution looks approximately normal: the loss function built into the method was chosen for that distribution. The decision to use squared deviations and the expectation that errors will be Gaussian are not two separate assumptions bolted onto the method — they are the same design decision, both resting on the CLT.

Squaring does something geometrically important: it penalises large deviations disproportionately. An error of 4 contributes 16 to the sum of squares, not just 4. This aggressive penalisation of large deviations is precisely what the normal distribution implies — under normality, large errors are exponentially rare, and the estimator should reflect that by being pulled strongly away from them. Under a heavier-tailed distribution, where large errors are more common, squaring overpenalises and a different strategy is superior.

The consequence for ANOVA is that the normality assumption is doing two jobs simultaneously. It makes the F-distribution valid for generating p-values — the standard justification given in textbooks. But it also makes the sum-of-squares approach the right way to measure deviation in the first place. When normality fails, both justifications fail together: the p-values are wrong, and the SS partition is no longer the optimal way to quantify signal and noise.
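A small simulation makes the estimator-choice point tangible. Under Gaussian errors the sample mean (the least-squares estimate) is the more precise way to combine measurements; under heavier-tailed Laplace errors the sample median (the L1 estimate) wins. The sample size and scales below are arbitrary.

set.seed(3)
n   <- 30
sim <- function(draw_errors) replicate(5000, {
  x <- draw_errors(n)
  c(mean = mean(x), median = median(x))
})

gauss   <- sim(function(n) rnorm(n))                                      # normal errors
laplace <- sim(function(n) rexp(n) * sample(c(-1, 1), n, replace = TRUE)) # Laplace errors

apply(gauss,   1, sd)   # mean has the smaller spread: least squares is optimal here
apply(laplace, 1, sd)   # median has the smaller spread: squaring over-penalises here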

The assumptions — now they make sense

ANOVA's assumptions are usually taught as a checklist before the method is explained. But if you understand the CLT basis of ANOVA, the assumptions are not arbitrary — each one is a condition that the underlying mathematics requires to be valid.

Independence

The LLN requires observations to be independent draws. If one observation tells you something about another — because they come from the same person, the same school, the same trial block — the noise does not cancel cleanly. MSwithin underestimates true variability, F is inflated, and p-values are too small.

Normality of residuals

The F-distribution used to generate p-values assumes that the within-group deviations — the residuals — are normally distributed. A residual is another name for what we have been calling the error up to now: the leftover for each observation once its group mean has been subtracted off. This follows from the CLT: if each observation is a sum of many small independent influences, the residuals will be approximately normal. With large N the CLT makes this robust — small samples are where it matters most.

This is also precisely why, as we saw above in "Why squares?", the sum-of-squares partition requires normality — both justifications stand or fall together.

Homogeneity of variance

MSwithin is a pooled estimate — it averages within-group variances across all conditions. This pooling only makes sense if the conditions have similar variance. If one group is much more variable (spread out) than another, the pooled estimate is misleading and the F-ratio is distorted. Note that in the simulation above, "within-group scatter σ" was deliberately a single slider rather than three: the three groups shared one σ precisely because ANOVA pools them. In a real dataset with potentially heterogeneous variances, that shared slider does not exist, and the pooled MSwithin is averaging variances that should not be averaged together.

Aside — residuals and the "cancelling errors" picture from Part 1

Part 1 talked about random errors cancelling out when we average many observations. The word "residual" is the observable version of the same idea.

In the CLT framing, each observation is the true mean plus a random error: Yi = μ + εi. The εi's are what cancel — positive and negative values balance in expectation, and their average shrinks toward zero at rate 1/√N as N grows.

In practice we do not have μ. We estimate it with the sample mean x̄, and write Yi = x̄ + ri. The ri here is a residual. By construction, residuals from a least-squares fit sum to exactly zero — not approximately, exactly — because x̄ was chosen to make that happen. That is what being the least-squares estimate of the mean means. An equivalent statement in likelihood vocabulary: x̄ is the value that maximises the Gaussian likelihood of the data. Under normal errors these two statements are the same equation, written in different vocabulary.

Errors εi are deviations from the true mean μ. They cancel in expectation (the CLT result from Part 1).
Residuals ri are deviations from the fitted mean x̄. They cancel mechanically — they sum to zero exactly.

Residuals are our best reconstruction of the errors, with one unavoidable caveat: the sample mean has eaten one piece of information out of the data. That is why we divide by N−1 rather than N when estimating variance from residuals — Bessel's correction from Part 1 exists because the residuals have one degree of freedom fewer than the errors did. The CLT says the errors would cancel if we could see them directly; residuals are what appear on the page once we fit x̄ and force the balance.
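Both halves of the aside can be verified in a few lines of R, on arbitrary simulated scores: the residuals from the fitted mean sum to exactly zero, and dividing their sum of squares by N − 1 is exactly what var() does.

set.seed(4)
y <- rnorm(12, mean = 100, sd = 15)
r <- y - mean(y)                 # residuals from the fitted mean
sum(r)                           # zero, up to floating-point error
sum(r^2) / (length(y) - 1)       # Bessel-corrected variance estimate...
var(y)                           # ...which is exactly what var() returns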

ANOVA as a special case of regression

One issue with the way statistics is conventionally taught is that ANOVA and linear regression are presented separately, as if they were different underlying statistical beasts. They aren't! Knowing this makes later extensions feel natural rather than arbitrary. A between-groups ANOVA is, algebraically, a special case of linear regression where the predictors are categorical group memberships — you are trying to fit a set of predicted values that minimise the sum of squared residuals. For two groups, that reduces to a straight line between the two group means. For more than two groups, β₀ sets a horizontal baseline at the reference group's mean, and each further β coefficient is simply the vertical distance from that baseline to another group's mean. In either case the groups in our data are represented by dummy-coded variables (0 or 1), the intercept becomes the reference group's mean, and each slope coefficient represents the deviation of another group's mean from that reference:

ANOVA as a regression equation: Y = β₀ + β₁X₁ + β₂X₂ + … + ε
β₀: The intercept — equal to the reference group's mean. All other groups are compared against this baseline.
β₁, β₂, …: The slopes — each is the difference between one group's mean and the reference group's mean. These are the coefficients the model is estimating.
X₁, X₂, …: Dummy variables: 1 if the observation belongs to that group, 0 otherwise. They act as switches — turning each group's β coefficient on or off, so the fitted value for any observation is β₀ plus whichever group coefficient applies.
ε: The residual — each person's deviation from their group mean. This is MSwithin in the variance partition. The F-test asks: does knowing group membership predict scores significantly better than just using the grand mean (a flat line)?
Picturing the regression equation — two groups vs. three groups
[Figure: (a) two groups — a line through two means, Ŷ = β₀ + β₁X, with X = 0 for Group A (the reference) and X = 1 for Group B; (b) three or more groups — Ŷ = β₀ + β₁X₁ + β₂X₂ + β₃X₃, with β₀ as the reference group's baseline and each further β drawn as a rise from it.]
Left: with two groups, the dummy-coded regression is a straight line between the two group means — β₀ is the intercept (the reference group's mean at X = 0), and β₁ is the rise to the second group's mean at X = 1. Right: with three or more groups, β₀ (teal) is the reference group's mean, drawn as a horizontal baseline across the plot; each group's mean is the short black dash at that group's position; and each βi coefficient (red) is a slope — the rise or fall from the β₀ baseline over a unit x-step — whose vertical change equals the difference between that group's mean and β₀. In both cases the fit is the set of predicted values that minimises the sum of squared residuals.

Recognising this equivalence pays dividends immediately. Adding a continuous covariate to this equation gives you ANCOVA — the same model, one extra predictor. Multiplying two predictors together gives you an interaction term — the same arithmetic, now capturing whether the effect of one variable depends on the level of another. Factorial ANOVA, ANCOVA, and multiple regression are not separate techniques; they are the same linear model with different types of predictors. The F-ratio is just the test for whether the model explains significantly more variance than a flat line at the grand mean — in every case.
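The equivalence is easy to see in R. The sketch below refits invented three-group data with lm(): the intercept is the reference group's mean, the two slope coefficients are deviations from it, and the overall F is identical to the one aov() reports.

set.seed(1)
n <- 20
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = n)),
  score = c(rnorm(n, 5.0, 1), rnorm(n, 5.8, 1), rnorm(n, 6.5, 1))
)

fit <- lm(score ~ group, data = d)       # group is dummy-coded automatically
coef(fit)                                # (Intercept) = mean of group A (reference);
                                         # groupB, groupC = deviations from that mean
tapply(d$score, d$group, mean)           # check against the raw group means
anova(fit)                               # same F and p as summary(aov(score ~ group, data = d))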

The slope is an estimate — and it wobbles

Thinking of the condition effect as a regression slope brings a critical insight into view: the slope is not a truth, it is an estimate. Like all estimates, it has uncertainty around it — a degree of wobble that reflects how much individual participants diverge from the group-level effect. In an ideal experiment, every participant follows the group pattern closely: condition A is higher than condition B for everyone by roughly the same amount. The slope is tight and unwobbly, and we have high precision on the condition effect.

But when participants vary substantially in how they respond to the conditions — some showing a large effect, others showing none, others going in the opposite direction — the slope estimate is blurry. The residual ε is large not because the conditions have no effect, but because the effect is heterogeneous across people. Classical ANOVA has no mechanism to distinguish "no effect" from "highly variable effect" — both show up as large MSwithin and a suppressed F-ratio. The wobble in the slope and the noise in the denominator are, from ANOVA's perspective, the same thing.

The same "it is an estimate" logic applies to the group means that feed into MSbetween in the first place. Those group means are themselves estimates, with standard errors that depend on each group's internal spread — and if the within-group variances differ, the means are estimated with different precisions. This is the homogeneity of variance assumption from §01 seen from another angle. It was introduced there as the condition under which pooling within-group variances into a single MSwithin is sensible; here we meet the same assumption mattering for a different reason, in the numerator — group means drawn from groups with different within-group spreads are not equally trustworthy estimates, and that is something ANOVA has no machinery to reflect.

To make the point concrete: consider two groups with the same N but very different within-group spreads — one tight, one diffuse. The mean from the diffuse group has a much larger standard error; it could shift substantially with a different sample. ANOVA enters both means into MSbetween with equal weight and then attempts to compensate by inflating the pooled denominator — a blunt correction that does not actually solve the underlying problem.

Welch's correction — weighting by precision

Welch's ANOVA addresses this properly by computing a weighted grand mean, where groups with tighter distributions (smaller variances, more trustworthy means) contribute more weight than groups with wider distributions. Degrees of freedom are also adjusted downward to reflect the reduced effective information. This is the same precision logic from Part 1 — a mean estimated with high precision (small SE) deserves more influence than one estimated imprecisely. When group variances differ substantially (a rough rule of thumb is a ratio greater than 3:1 or 4:1), Welch's is the more principled choice.

Once the homogeneity assumption fails, the F-ratio is computing a signal-to-noise ratio using a signal component whose trustworthiness it cannot assess.

The standard advice is to check the homogeneity assumption with Levene's test and proceed if it comes back non-significant. But there is a tension here: small samples are common, and they are precisely the scenario where researchers most need the check to work — yet that is exactly where Levene's test has the least statistical power, and also the regime where the CLT has not fully kicked in for the normality-of-residuals assumption. All three assumptions — independence, homogeneity, normality — are simultaneously least satisfied and least detectable by formal checks at precisely the sample sizes where researchers most need them to hold. Visual inspection of group distributions is often more informative than the formal test in this regime.
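In R, Welch's ANOVA is one argument away from the classical test. A hedged sketch with deliberately unequal spreads (all values invented):

set.seed(6)
d2 <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 15)),
  score = c(rnorm(15, 5.0, 0.5), rnorm(15, 5.6, 0.5), rnorm(15, 6.2, 2.0))
)
tapply(d2$score, d2$group, sd)                            # one group is far more variable

oneway.test(score ~ group, data = d2, var.equal = TRUE)   # classical F, pooled variance
oneway.test(score ~ group, data = d2, var.equal = FALSE)  # Welch: precision-weighted grand
                                                          # mean, adjusted df
# car::leveneTest(score ~ group, data = d2)               # the formal check, if the car
                                                          # package is installed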

The small-N trap

Small N → more likely to have unequal variances  ·  Small N → less power to detect them  ·  Therefore: more likely to have the problem AND more likely to conclude you don't.

Part II
When assumptions break — Linear Mixed Models
§ 02

The problem with assuming independence

Independence is the assumption we have kept returning to. It is what licenses the LLN in Part 1, what lets the CLT produce a well-behaved sampling distribution, and what makes MSwithin an honest estimate of noise in ANOVA. It is also the assumption most commonly violated in behavioural science. Repeated measures (the same participant contributing in multiple conditions, sharing a baseline) and nested designs (students within classrooms, trials within participants) both produce observations that are not independent of each other.

In both cases, the problem is the same: MSwithin no longer isolates random noise. Within-group variance should reflect only random individual fluctuation — the noise against which we judge the signal. But when data are clustered or repeated, some of that "within-group" variance is actually systematic individual-difference variance: Person A is reliably faster than Person B across all conditions. ANOVA cannot distinguish these sources, so it lumps them together as noise.

← Recall from Part 1 · §02 The CLT gives us normally distributed group means provided observations are independent. When they are not, the standard error formula σ/√N is wrong — the effective sample size is smaller than N suggests, because correlated observations carry less information than independent ones. Precision is lower than it appears.

The natural objection — "but doesn't repeated-measures ANOVA already handle this?" — is taken up in §03. First, the simulation below illustrates what happens to the F-ratio if you ignore the dependence and run a standard ANOVA on data with a within-subjects structure.

Simulation — The Hidden Structure in Within-Group Variance
Each dot is one observation. In a repeated-measures design, each participant contributes one observation per condition. The dots are colour-coded by participant. Notice that much of the within-group scatter is actually consistent individual differences (some people are always higher, some always lower) — structure that ANOVA treats as undifferentiated noise.
[Interactive simulation: sliders for individual differences (between-person SD) and the condition effect (mean difference); readouts for the naive ANOVA F and p, the true condition effect, and the individual variation that ANOVA counts as noise; left panel: what ANOVA sees, ignoring participant structure; right panel: the hidden structure, same data coloured by participant.]
What to notice: Increase individual differences and watch ANOVA's F weaken — even though the condition effect hasn't changed. The individual variation is swamping the signal. But look at the right panel: each participant shows a clear within-person shift between conditions. A model that knows about this structure can recover that signal.
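The simulation's point can be reproduced in a few lines of R: a fixed within-person shift hidden under large, stable individual differences. The naive one-way ANOVA treats the person-level variance as noise; subtracting each person from themselves recovers the shift. All numbers are made up.

set.seed(7)
n_subj   <- 20
baseline <- rnorm(n_subj, mean = 5, sd = 1.5)        # stable person-level differences
d3 <- data.frame(
  subject   = factor(rep(1:n_subj, times = 2)),
  condition = factor(rep(c("A", "B"), each = n_subj)),
  score     = c(baseline, baseline + 0.8) + rnorm(2 * n_subj, sd = 0.3)
)

summary(aov(score ~ condition, data = d3))           # naive: person variance inflates MSwithin
diffs <- with(d3, score[condition == "B"] - score[condition == "A"])
t.test(diffs)                                        # one-sample t on within-person differences:
                                                     # the 0.8 shift stands out clearly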
§ 03

Repeated measures — structured dependence

Before turning to repeated measures, the model we are heading toward deserves a brief introduction. A Linear Mixed Model (LMM) is not a different kind of statistics — it is the same regression framework we have been working with, extended so that the model carries information about which observations belong together (same participant, same item, same cluster). The full machinery is built up in §04.

The natural objection before we get there: surely ANOVA already handles within-subjects data? Repeated-measures ANOVA exists. It does acknowledge that observations from the same participant are related. So why do we need LMMs at all?

The answer starts by recalling something we saw in §01: ANOVA is just a special case of regression. Regression always has a baseline — an intercept — and coefficients that represent shifts away from it. RM-ANOVA's key move is that each participant gets their own intercept instead of a single shared baseline: every person has their own overall level that lifts or lowers all their scores uniformly across conditions. In LMM parlance — which we will set out formally in §04 — this is called introducing a "per-participant random intercept". So although RM-ANOVA was developed as its own technique, its core machinery is the same one LMMs use: algebraically, RM-ANOVA is a Linear Mixed Model with only a per-participant random intercept and nothing else. On a balanced, complete dataset the two models produce identical F-statistics and identical p-values. Holding that equivalence in view is what makes the rest of this section work — the sphericity assumption, the role of random slopes, and the cases where LMMs do things RM-ANOVA cannot all follow from asking what the random-intercepts-only model can and cannot represent.
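The equivalence can be checked directly on balanced, complete, simulated data: the classical RM-ANOVA (aov() with an Error() term) and a random-intercept LMM (lme4's lmer()) give the same F for the condition effect. Parameter values are invented; p-values for the lmer fit need an add-on package such as lmerTest.

library(lme4)
set.seed(8)
n_subj  <- 24
d4      <- expand.grid(subject   = factor(1:n_subj),
                       condition = c("upright", "inverted", "scrambled"))
subj_fx <- rnorm(n_subj, sd = 1.2)                                   # per-person baselines
cond_fx <- c(upright = 0, inverted = -0.8, scrambled = -1.2)         # true condition effects
d4$score <- 5 + subj_fx[as.integer(d4$subject)] +
            cond_fx[as.character(d4$condition)] + rnorm(nrow(d4), sd = 0.7)

summary(aov(score ~ condition + Error(subject), data = d4))          # repeated-measures ANOVA
anova(lmer(score ~ condition + (1 | subject), data = d4))            # random-intercepts LMM:
                                                                     # same F on balanced data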

What repeated-measures ANOVA actually does

In a within-subjects design, the same participants contribute observations in every condition. RM-ANOVA acknowledges this dependence by pulling each participant's overall mean out of the error term before the condition effect is tested. In effect, the model subtracts each person from themselves, and then asks whether the residual pattern is consistent with a systematic condition effect. Independence is assumed between participants, not within them.

Repeated-measures ANOVA — partition: Yij = μ + αi + βj + (αβ)ij
Yij: The observation for person i in condition j.
μ: Grand mean — the overall average across conditions and people.
αi: The subject effect — person i's overall deviation from the grand mean. Partitioned out of the error term rather than lumped into it.
βj: The fixed effect of condition j — the quantity the F-test is targeting.
(αβ)ij: The subject × condition interaction — what RM-ANOVA treats as error. Residual variability left over after subject and condition main effects are removed.

The key move is that αi — the participant's stable individual level — is treated as a nuisance factor and removed. The F-test for the condition effect βj is built against the (αβ)ij interaction as its error term, which is a considerably smaller quantity than the undifferentiated MSwithin of a between-subjects ANOVA. Individual differences no longer contaminate the signal. This is a genuine improvement over between-subjects ANOVA whenever a within-subjects design is possible. But this move, treating individual differences as mere error, may seem a bit strange: after all, aren't many psychologists actually interested in individual differences? Remember that for later.

The sphericity assumption — what it actually requires

In a within-subjects design, the same participant contributes one score to each condition. Those scores are not independent of each other. A participant who is fast overall will tend to be fast in every condition; a participant who finds one condition hard may also find a related condition hard. Whenever we have repeated measures, we have to keep track of two things at once: how variable each condition is on its own, and how strongly each pair of conditions moves together across participants. This is what a covariance matrix records. It has one row and one column per condition. The diagonal entries are the variances within each condition (written σ², where σ is the standard deviation). The off-diagonal entries are the covariances between pairs of conditions (written ρ, here used as shorthand for the covariance between two conditions rather than a correlation in the narrow sense). With four conditions, the matrix is 4 × 4: four diagonal entries and twelve off-diagonal entries, though each off-diagonal pair is symmetric, so really six distinct pairwise numbers.

Sphericity is a constraint on that covariance matrix. The formal statement is that the variance of the differences between any two conditions must be the same for every pair — if you compute (condition 1 − condition 2) across participants and measure its spread, and then do the same for (condition 1 − condition 3), and for every other pair, you must get the same number every time. This is a structural echo of the homogeneity of variance assumption from §01 — both exist because the test statistic is built by pooling several quantities (there, within-group variances pooled into MSwithin; here, the variances of pairwise condition differences pooled into the F-test for repeated measures), and pooling only produces a meaningful summary when the things being pooled are roughly equivalent. An equivalent and easier pattern to read off the matrix is compound symmetry: every condition has the same variance σ², and every pair of conditions has the same covariance ρ. Every diagonal entry is identical. Every off-diagonal entry is identical. The matrix labelled compound symmetry below is exactly this pattern.

Compound symmetry
┌ σ²   ρ    ρ    ρ  ┐
│ ρ    σ²   ρ    ρ  │
│ ρ    ρ    σ²   ρ  │
└ ρ    ρ    ρ    σ² ┘

Every condition has the same variance (σ²). Every pair of conditions has the same covariance (ρ). One number on the diagonal, one number off it. This is the pattern RM-ANOVA assumes, and it is also the pattern a random-intercepts-only LMM implies.

Sphericity violated
┌ 1.00  0.72  0.48  0.22 ┐
│ 0.72  1.20  0.68  0.41 │
│ 0.48  0.68  0.90  0.65 │
└ 0.22  0.41  0.65  1.10 ┘

Diagonal: condition variances differ (1.00, 1.20, 0.90, 1.10) — no shared σ². Off-diagonal: covariances decay with distance from the diagonal — adjacent conditions are strongly correlated (~0.7), distant conditions weakly so (0.22 between conditions 1 and 4). This is what real repeated-measures data often look like: conditions tested close together in time are more correlated than ones tested far apart. RM-ANOVA cannot represent this directly; it can only adjust the degrees of freedom to compensate.

In real datasets, sphericity is routinely violated. Two patterns are common. First, conditions tested close together in time are more correlated than conditions separated by many others, so the off-diagonal entries are not constant — nearby pairs have larger ρ than distant pairs (see sphericity violated example earlier). Second, conditions differ in difficulty or variability, so the diagonal entries are not constant either. The standard response — Greenhouse–Geisser or Huynh–Feldt corrections — adjusts the degrees of freedom of the F-test to compensate, but leaves the underlying model unchanged. These are patches on a symptom, not fixes to the model.
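Both patterns can be read straight off a wide-format data frame (one row per participant, one column per condition). The sketch below simulates data in which compound symmetry holds by construction, just to show what the checks look like; real data would typically fail them.

set.seed(15)
n_subj <- 40
base   <- rnorm(n_subj, sd = 1)                       # shared person baseline
wide   <- data.frame(c1 = base + rnorm(n_subj, sd = 0.5),
                     c2 = base + rnorm(n_subj, sd = 0.5),
                     c3 = base + rnorm(n_subj, sd = 0.5),
                     c4 = base + rnorm(n_subj, sd = 0.5))

round(cov(wide), 2)               # roughly compound symmetric: one value on the diagonal,
                                  # one value off it
c(var(wide$c1 - wide$c2),         # sphericity: the variance of every pairwise
  var(wide$c1 - wide$c4),         # condition difference should be (roughly) the same
  var(wide$c2 - wide$c3))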

The structural equivalence

Sphericity is not a separate or mysterious assumption imposed on RM-ANOVA. It is the covariance structure produced by a model that lets each participant have their own personal baseline (their own intercept) but forces them all to share the same response slope across conditions. Picture every participant's regression line as parallel to every other participant's — the line can sit higher or lower for different people, but its slope is identical for everyone. (This "each person has their own intercept but a shared slope" setup is what is meant by a "random intercept per participant", and we will return to it formally in §04.) Once each person is reduced to a single offset plus random noise around a shared slope, equal variances at every condition and equal covariances between every pair follow automatically — exactly compound symmetry. The reason RM-ANOVA struggles when sphericity is violated is not that a diagnostic test has failed. It is that the underlying model is too rigid to represent what the data actually contain: it has no way to let one participant's response slope be steeper or flatter than another's. Greenhouse–Geisser does not make the model richer; it adjusts the distribution of the test statistic to compensate for a rigid model fitting data it was never designed for.

What fixes sphericity — random slopes

Sphericity is violated mainly because participants differ in their response to the conditions. Some show a large condition effect, others a small one, others a reversed one. A model with only a random intercept cannot represent that — it lets subjects differ in their personal baseline, but forces every line to share the same slope across conditions. That rigidity — the parallel-lines setup we just named — is what produces compound symmetry in the first place.

The fix follows directly from naming the rigidity. If the model lets participants differ in baseline but not in slope, the obvious extension is to let the slope vary too. Adding a random slope does exactly this: each participant gets their own condition effect, drawn from a distribution with its own variance. The parallel-lines picture relaxes — every participant's line can now have its own steepness as well as its own starting point. This single extension does two things at once. It captures the individual differences in the effect that RM-ANOVA cannot see, and it relaxes compound symmetry from the inside of the model — because once subjects have their own slopes, the variances at different conditions and the covariances between them are no longer constrained to be equal.

Random slopes are a capability RM-ANOVA does not have, and this is where the two frameworks actually diverge. If you never need to fit slopes — if a shared per-subject intercept really is all the individual-level information the data asks for, and if the design is balanced and complete — RM-ANOVA and the random-intercepts LMM give the same answer and there is nothing to choose between them. The case for LMMs is not that they do ANOVA "better". It is that they generalise, and allowing each person an individual slope is the most common and most important way that generalisation shows up in practice.

← Recall from §02 · Simulation You have already seen what random slopes model. In the earlier simulation The Hidden Structure in Within-Group Variance, the right-hand panel drew a line per participant connecting their two condition observations. Each of those lines is that participant's slope — and they were not all parallel: different people had different gradients. That visible non-parallelism is exactly what a random-intercepts-only model (and therefore RM-ANOVA) cannot represent — it would force every line to be parallel, just sitting at different heights. A random slope is the statistical machinery for letting each line have its own steepness as well as its own starting point, taking the structure that was already on the page and putting it into the model as information rather than averaging it away as noise.

The deeper move — LMMs let us stop aggregating

Random slopes change more than the sphericity story. Almost everything in this resource up to here has been about generating summary statistics — means for our participants collapsed across trials, differences of means to consider experimental effects, ratios of variances to compute precision. That was not arbitrary. The CLT is what justified it: with enough independent observations, sample means behave predictably, so inference at the level of aggregates was safe. But every aggregation step threw information away, and every summary came with its own chain of assumptions — independence, normality of residuals, homogeneity, sphericity.

Linear mixed models let us stop aggregating. Instead of reducing each participant to one or two cell means and then running an F-test on the reduced numbers, we can fit the model to the individual trials directly. Each participant's condition effect becomes a slope estimated from the full set of their trial-level observations — not a straight line between two cell means, but a regression line through the complete distribution of their responses. The within-condition variability that RM-ANOVA would have averaged away is information the model uses to tell how precisely each person's slope is known. A key intuition to take away here is that our ultimate goal is to compute estimates, and we strive for the best precision on those estimates (remember the blurry slope from earlier?). The message so far is that LMMs naturally lead to improved precision, if for no other reason than that you are not throwing information away.

ANOVA asked: is the difference between group means large enough relative to the pooled noise? LMMs ask: what is the structure of the data at the level they were actually generated — trial by trial, participant by participant — and what does that tell us about the effect? The F-ratio is one summary drawn from a richer fit.

Other places the equivalence breaks down

Random slopes are the most important generalisation, but LMMs extend their benefits beyond RM-ANOVA in three further ways that matter in practice.

Missing data

RM-ANOVA drops any participant who is missing an observation in any condition — listwise deletion, even if only one cell of many is missing. An LMM uses every available observation for every participant, estimating effects under a missing-at-random assumption rather than discarding cases.

Unbalanced or varying designs

If different participants have different numbers of trials per condition — standard in trial-level data, in longitudinal designs, whenever participants vary in compliance — RM-ANOVA has no principled handling. LMMs handle unbalanced designs natively.

Explicit covariance structures

When random slopes alone are not the right fix — for example, when adjacent timepoints in a longitudinal design are more correlated than distant ones — the covariance structure can be specified directly as unstructured, autoregressive, or Toeplitz. This requires going beyond standard lmer into tools like nlme::gls or glmmTMB, but it is unavailable to RM-ANOVA in any form.

Where this leaves RM-ANOVA

Repeated-measures ANOVA is not a different tool from LMMs — it is a special case of an LMM, the one where the only allowed random effect is a per-participant intercept. When that restricted model fits the data well, the two give identical answers. When it does not — because individuals differ in their response to conditions, because data are missing, or because the covariance structure is richer than compound symmetry — the LMM generalises, and RM-ANOVA is left with a patched F-statistic and no way to represent the structure the data contain. I guess you need to ask yourself: if you have used RM-ANOVA — and I have established you are already a closet LMM-user — why not embrace the approach more overtly?

§ 04

Linear Mixed Models — modelling the dependence structure

So it should be clear by now that a Linear Mixed Model (LMM) is not a fundamentally different kind of statistics. It is the basic framework that RM-ANOVA is quietly leaning on, extended by giving the model more information about how the data were generated. As we have just illustrated, an LMM knows that some observations belong to the same person — and it uses that knowledge to partition variance more intelligently. But there are a lot more benefits to this kind of model specification strategy than that, as we will see.

Fixed effects and random effects

An LMM decomposes each observation into three parts rather than two:

LMM — basic decomposition: Yij = μ + β·Xj + ui + εij
Yij: The observation for person i in condition j.
μ: The grand mean — the overall average across conditions and people.
β·Xj: The fixed effect of condition j — the systematic shift in mean level that applies to everyone equally. This is what ANOVA estimates too. β is the parameter we care about: the true condition effect.
ui: The random effect of person i — their individual deviation from the grand mean, assumed to be drawn from N(0, τ²). This captures the fact that some people are reliably higher or lower than average. ANOVA cannot see this; LMM estimates it explicitly.
εij: The residual error — the truly random, unexplained fluctuation. This is what the model cannot account for. Smaller ε means better precision.

Why smarter partitioning improves precision

The crucial move is what happens to the error term εij. In ANOVA, the within-group variance includes both genuine random noise and systematic individual differences (the ui terms) — they are inseparable. The MSwithin denominator is therefore inflated, F is suppressed, and the model has low precision for detecting the condition effect.

In an LMM, the random effects ui are estimated from the data and removed from the residuals. What remains — the εij — is the truly unexplained variation. This residual variance is smaller. A smaller residual means a smaller SE for the fixed effect β. And a smaller SE means a sharper, more precise estimate of the true condition effect. This is the same precision logic from Part 1: more information going into the model → tighter estimates → clearer focus on what we care about.

The key intuition

ANOVA is a model that knows only about conditions. It sees all within-condition variation as noise. An LMM is the same framework extended to also know about people — it recognises stable individual differences and estimates them explicitly. The residual shrinks; fixed-effect estimates sharpen. ANOVA's no-individual-differences assumption is almost always wrong in behavioural data, and when it is, you end up with a blurrier picture than you could have. LMM does not require more data; it requires a more accurate description of the data you already have.

Random slopes: when individual differences are not just in level

← Recall from §03 Random slopes came up several times in §03. As the principled fix for sphericity violations — a random-intercepts-only model (and therefore RM-ANOVA) forces every participant's regression line to be parallel, and a random slope is what lets each line have its own steepness. As the formal version of what the §02 simulation already showed visually: the non-parallel lines per participant in the right-hand panel. And as the move that motivates fitting the model on individual trials rather than on cell means. Here is the equation that makes explicit what a random slope is doing mathematically.

The model above gives each person their own intercept (baseline level) but assumes the condition effect β is the same for everyone (this was what we saw for the RM-ANOVA structure). Sometimes individual differences extend to the effect itself: some people show a large face-inversion effect, others show almost none. This is modelled by adding a random slope:

LMM with random slopes: Yij = μ + (β + vi)·Xj + ui + εij
vi: Person i's deviation from the average condition effect β. If vi is large and positive, this person shows a stronger condition effect than average; if negative, a weaker one. Estimating these allows the model to capture individual-level treatment response — and removes yet more systematic variance from the error term.

Each layer of structure added to the model — random intercepts, random slopes, nested clustering — is the same move: giving the model more knowledge about how the data were generated, which removes more systematic variance from the residuals, which sharpens the estimates of what we actually care about.
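In lme4 syntax, the random-intercept and random-slope models differ by one term in the formula. The sketch below simulates trial-level data in which both baselines and condition effects genuinely vary across people (all parameter values invented), then fits both models.

library(lme4)
set.seed(9)
n_subj   <- 30
n_trials <- 8                                            # trials per condition per person
subj     <- factor(rep(1:n_subj, each = 2 * n_trials))
cond     <- rep(rep(c(0, 1), each = n_trials), times = n_subj)   # 0 = condition A, 1 = B
u <- rnorm(n_subj, sd = 1.0)                             # random intercepts (baselines)
v <- rnorm(n_subj, sd = 0.6)                             # random slopes (per-person effects)
d5 <- data.frame(
  subject   = subj,
  condition = cond,
  score     = 5 + u[as.integer(subj)] + (0.8 + v[as.integer(subj)]) * cond +
              rnorm(length(subj), sd = 1.0)
)

m_int   <- lmer(score ~ condition + (1 | subject),             data = d5)
m_slope <- lmer(score ~ condition + (1 + condition | subject), data = d5)
anova(m_int, m_slope)   # does letting the slope vary by person improve the fit?
summary(m_slope)        # fixed effect near 0.8, plus variance components for u and v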

Partial pooling — borrowing strength across participants

There is a subtle but important mechanism underlying how LMMs estimate individual random effects, called partial pooling (sometimes called "shrinkage"). Understanding it reveals why LMMs are not only more honest about structure, but often more accurate at the individual level.

Consider estimating each participant's individual slope — their personal condition effect. One extreme would be to estimate each person's slope entirely from their own data, ignoring all other participants. This is no pooling: independent estimates, each based on whatever trials that person completed. With many trials per person this is fine; with few trials, the estimates are highly variable and unreliable. The other extreme would be to ignore individual differences in the effect altogether and give everyone the same slope — the group mean. This is complete pooling. In LMM terms, complete pooling is what you get when the random-effect variances are fixed to zero — the model is forbidden from estimating individual variation in the parts of the equation where the variance is zeroed out. Both classical ANOVA and RM-ANOVA are special cases of an LMM in exactly this sense. Classical ANOVA fixes every random effect to zero: no individual differences in baseline, no individual differences in response. RM-ANOVA relaxes this for the intercept (each participant gets their own baseline — the parallel-lines picture from §03), but the slope variance is still fixed to zero, forcing a single shared β across everyone. The pooling logic itself is the same move we first met in §01 with MSwithin: within-group variance from every group pooled into a single shared σ² on the assumption that all groups draw from a common noise distribution. Applied to the slope, the same logic gives one β for everyone, with individual differences in the effect set to zero by construction.

LMMs do neither. They partially pool: each person's slope estimate is a weighted compromise between their own data and the group average, where the weight is determined by how much data that person contributes and how noisy their responses are. In later web content I will go further into detail about partial pooling and how it is connected to improved estimates through a process called shrinkage.

Extraordinary individual differences require extraordinary data

A participant with many clean, consistent trials earns a slope estimate that stays close to their own data — their evidence is strong enough to support an individualised estimate. A participant with few trials, or highly variable responses, has their estimate pulled back toward the group mean — the model borrows strength from the rest of the sample to stabilise the estimate where individual data are thin.

This is exactly the right behaviour. We do not want to over-interpret a noisy participant who happened to show an extreme pattern in a handful of trials. Equally, we do not want to ignore a participant who consistently and clearly shows a pattern that differs from the group. Partial pooling implements this logic automatically — without any researcher degrees of freedom.

The group-level distribution of slopes also provides a principled regulariser: it tells the model what range of individual differences is plausible, and keeps estimates from wandering into implausible territory when data are sparse. This is why LMMs are often described as "borrowing strength" from the full dataset — information flows across participants in a way that benefits everyone's estimates.
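Partial pooling is visible if you compare per-person slopes estimated two ways. Reusing d5 and m_slope from the random-slopes sketch above: a separate regression per participant (no pooling) versus the LMM's per-person estimates from coef() (partial pooling).

no_pool <- sapply(split(d5, d5$subject),
                  function(x) coef(lm(score ~ condition, data = x))["condition"])
partial <- coef(m_slope)$subject[, "condition"]

range(no_pool)                 # per-person regressions: the widest spread of slopes
range(partial)                 # LMM estimates: pulled in toward the group-average slope
fixef(m_slope)["condition"]    # the group-average slope they are shrunk toward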

The design incentive problem — and how LMMs remove it

Because ANOVA's MSwithin bundles genuine individual differences with random noise, there is a structural incentive to suppress individual differences by design — recruiting homogeneous samples, using easy tasks where everyone performs similarly, averaging away within-person fluctuation. All of these shrink MSwithin and inflate F. The cost is that designs optimised for large F-ratios are exactly the designs that minimise the very individual differences psychologists most often care about. LMMs invert this. By partitioning individual-difference variance rather than treating it as noise, a heterogeneous sample with high variability is no longer a problem to be designed away — it is information the model can use.

Random effects for items, too

So far the LMM has been written as if participants were the only grouping factor. The data contain another one: the items. In most behavioural experiments each participant completes many trials, and those trials draw from a set of items — faces, words, scenes, sentences, problem instances — that differ from each other in the same ways participants do. Some items are harder, some easier; some elicit a larger condition effect than others.

Because the model is now fitted on trial-level data rather than on cell means, we have the machinery to put items directly into the model alongside participants. A random intercept for items captures differences in baseline difficulty across the stimulus set. A random slope for items captures the fact that the condition effect varies from one item to another. Participants and items sit side by side as two separate grouping factors, neither nested inside the other — a structure called crossed random effects.

The consequence for inference is practical. Analyses that ignore item variance tacitly treat the items actually tested as the complete population of items, rather than as a sample drawn from a larger pool — which they almost always are. That produces standard errors that are too narrow and p-values that are too small, because the model is not crediting the item-level variability that a replication with different stimuli would produce. Modelling items explicitly forces the condition effect to hold across participants and across the item sample before it is declared reliable — the inference generalises to both populations, not just to the participant one. The take home message is always the same: we use a better model to get better estimates based on better precision.
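In lme4, crossed random effects are simply two random-effect terms side by side. A simplified sketch with simulated data (invented variances; the subject term is intercept-only here just to keep it short):

library(lme4)
set.seed(11)
n_subj <- 30
n_item <- 20
d6 <- expand.grid(subject   = factor(1:n_subj),
                  item      = factor(1:n_item),
                  condition = c(0, 1))
u_s <- rnorm(n_subj, sd = 1.0)            # subject baselines
u_i <- rnorm(n_item, sd = 0.6)            # item difficulties
v_i <- rnorm(n_item, sd = 0.3)            # item-specific condition effects
d6$score <- 5 + u_s[as.integer(d6$subject)] + u_i[as.integer(d6$item)] +
            (0.8 + v_i[as.integer(d6$item)]) * d6$condition + rnorm(nrow(d6), sd = 1.0)

m_crossed <- lmer(score ~ condition + (1 | subject) + (1 + condition | item), data = d6)
summary(m_crossed)   # subject and item variance components estimated side by side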

There is another way to estimate these models that builds directly on the standard LMM — a Bayesian LMM. The model structure is identical; what changes is how the parameters are estimated and how uncertainty is expressed. We will cover Bayesian inference properly later in the resource, so there is no need to dig into it here. If you are already curious, the preview below unpacks what changes and why it matters. Otherwise, skip past it.

Teaser — coming later: From LMM to Bayesian LMM — same structure, different estimation engine. You can safely skip this on a first read — we will cover Bayesian inference properly in a later part of the resource.

A frequentist LMM and a Bayesian LMM have identical model structure: fixed effects, random intercepts, random slopes, residual error. The difference is in how parameters are estimated and how uncertainty is expressed. Where the frequentist approach finds the parameter values that make the observed data most likely (maximum likelihood / REML), the Bayesian approach combines prior beliefs about plausible parameter values with the evidence from the data, producing a posterior distribution for every parameter.

← Recall from Part 1 · §03 Part 1 contrasted a frequentist confidence interval — a statement about the procedure over many hypothetical repetitions — with a Bayesian credible interval, which is a direct probability statement about the parameter itself. The same distinction is playing out here at the model level. A Bayesian LMM does not just estimate effects differently; it produces posterior distributions that support the kinds of statements researchers actually want to make about their estimates.

This distinction matters practically in five ways:

Full uncertainty

Every parameter — condition effects, individual slopes, variance components — has a full posterior distribution, not just a point estimate ± SE. You can make direct probability statements: "There is a 97% probability the condition effect is positive" — exactly the credible-interval interpretation we set against the frequentist confidence interval in Part 1 §03. For individual slopes, you get a posterior per person, revealing not just whether people differ but how uncertain we are about each person's estimate.

Priors as regularisation

Weakly informative priors — "most effects are probably between −2 and +2 on the outcome scale", "very large variance components are unlikely" — keep estimates stable. Complex frequentist LMMs often fail to converge or produce boundary estimates (variance components of exactly zero). Bayesian LMMs typically resolve these instabilities, replacing silent failure with an honest posterior that reflects remaining uncertainty.

Explicit partial pooling

The shrinkage logic described above becomes fully transparent in the Bayesian frame. Each person's posterior slope is explicitly a compromise between their own data and the group-level prior — and the degree of shrinkage is a visible, interpretable quantity rather than a background computation. Participants with sparse data are pulled toward the group; participants with rich data stay near their own estimates.

Direct probability statements

Instead of "assuming the null is true, the probability of data at least this extreme is p = .04", you can say "there is an 85% probability the variance in slopes exceeds a practically meaningful threshold" or "for this participant, there is a 90% probability their effect is below the group average." These are the statements researchers — and clinicians, and real humans — actually care about.

Realistic likelihoods

Binary accuracy data, ordinal ratings, skewed reaction times — all can be embedded in the same hierarchical structure with appropriate likelihoods (Bernoulli, ordinal, ex-Gaussian), still with random intercepts, random slopes, priors, and posteriors. The model is built to match how the data were generated, not forced into normality assumptions that the CLT only partially rescues.
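As a preview only (estimation details come later): in the brms package the model formula is the same as the lme4 one, and the extra pieces are the priors and the posterior summaries. The sketch assumes the simulated d5 data frame from the random-slopes example in §04.

library(brms)
m_bayes <- brm(
  score ~ condition + (1 + condition | subject),
  data   = d5,                                    # the simulated trial-level data from §04
  family = gaussian(),
  prior  = set_prior("normal(0, 2)", class = "b") # weakly informative prior on the effect
)
posterior_summary(m_bayes)                 # a full posterior for every parameter
hypothesis(m_bayes, "condition > 0")       # a direct probability statement about the effect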

What LMMs fix — and what they do not

The progression from ANOVA → RM-ANOVA → LMM looks like a story of increasingly weak assumptions, which in one sense it is. But an honest characterisation is more specific: LMMs relax the dependence structure, not the distributional assumptions. Once this distinction is clear, the places where LMMs continue to mislead — and where further machinery is needed — come into focus.

Three things LMMs genuinely fix. Non-independence: observations from the same participant, item, or trial cluster are explicitly modelled via random effects, not treated as independent. Variance decomposition: between-participant, between-item, and residual variance are estimated separately rather than lumped into one error term. Partial pooling: individual estimates (a participant's slope, an item's effect) are weighted compromises between that unit's own data and the group-level distribution — sparse or noisy units borrow strength from the rest of the sample to stabilise their estimates, where ANOVA can only pool fully or not at all. These are substantial gains, and they change the conclusion of many analyses that ANOVA would have mishandled.

Note that partial pooling and the lowest-level residual assumptions are on different axes. Partial pooling is a capability ANOVA never had — it stabilises individual-level estimates by borrowing strength across the sample, and it operates on the random-effect estimation regardless of which likelihood the model uses. The residual-level assumptions are something else.

Every test we have considered so far in this resource — t-tests, ANOVA, RM-ANOVA, and the standard LMM — belongs to the Gaussian linear family: each assumes that the response, conditional on the predictors, is normally distributed. The simpler members are fitted by ordinary least squares (OLS); the LMM is fitted by maximum likelihood or REML; but the underlying likelihood is the same, and everything else follows from it: squaring deviations to measure error (as we saw in §01), the F-distribution that licenses p-values, the assumption that residuals are Gaussian, independent, and identically distributed (iid). Different response types (binary outcomes, counts, ordinal ratings, skewed reaction times) need a different likelihood, and a model that uses one is no longer a member of this family: it is a Generalised Linear Model (GLM), or a Generalised Linear Mixed Model (GLMM) when random effects are also needed. The fix for that kind of misspecification lives there, in stepping outside the Gaussian linear family altogether. We will cover models for binary (Y/N) and ordinal dependent variables, and the GLM/GLMM machinery they require, in a later resource.

So at the fundamental level, an LMM, like every other member of the Gaussian linear family, still assumes residuals are approximately normally distributed, homoscedastic, and independent after the modelled dependence has been removed. When those conditions fail, the LMM can still return estimates and standard errors that appear reasonable but are miscalibrated.

Residual assumptions persist

Even within the Gaussian family, the strict iid assumptions can fail: heteroscedasticity, skewed residuals, unmodelled correlation in the errors. The fixes here stay within the Gaussian linear model — robust (sandwich) standard errors, weighted regression, response transformations, residual correlation structures — rather than stepping out into a different likelihood family altogether.
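One illustration of a within-family fix, assuming the sandwich and lmtest packages and an ordinary single-level regression (cluster-robust analogues for mixed models exist, but the single-level case shows the idea; d and its variables are hypothetical):

library(sandwich); library(lmtest)
m <- lm(rt ~ condition, data = d)
coeftest(m, vcov = vcovHC(m, type = "HC3"))   # heteroscedasticity-consistent standard errors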

Misspecified random effects

Omitting a random slope that is warranted, ignoring item-level or trial-level clustering, or missing temporal dependence — all of these push structure back into the residuals. The model then estimates the wrong standard error for the fixed effects. Barr et al.'s "maximal random-effect structure" recommendation addresses this, but many published LMMs are specified too austerely.
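In lme4 syntax the difference is visible in the formula alone; the variables here are hypothetical:

lmer(rt ~ condition + (1 | subject), data = d)                 # intercepts only: often too austere
lmer(rt ~ condition + (1 + condition | subject), data = d)     # adds a by-subject slope for condition
lmer(rt ~ condition + (1 + condition | subject)
                    + (1 + condition | item), data = d)        # crossed subject and item effects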

Distributional mismatch

Reaction times are skewed and heavy-tailed; accuracy is binary; ratings are ordinal. A Gaussian LMM applied to any of these is a misspecified likelihood. Generalised LMMs with appropriate likelihoods (Bernoulli, ordinal, ex-Gaussian, lognormal) fix this, but a standard lmer() call does not — the estimates look clean and the output is miscalibrated.
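The contrast, in lme4 terms, with hypothetical 0/1 accuracy data:

lmer(acc ~ condition + (1 | subject), data = d)                       # Gaussian likelihood forced onto 0/1 data
glmer(acc ~ condition + (1 | subject), data = d, family = binomial)   # Bernoulli likelihood via a logit link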

A further subtlety: LMMs handle clustering — participants, items, groups — but not automatically other forms of within-cluster dependence. Autocorrelation between adjacent trials, learning or fatigue effects across a session, carryover between successive conditions — all of these must be explicitly modelled, either through additional random effects or through a residual correlation structure. An LMM that assumes residuals are independent given the random intercepts is miscalibrated whenever trials are ordered in time and that ordering is not modelled.
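One way to model trial-to-trial autocorrelation is a residual AR(1) structure. A sketch using the nlme package, assuming a hypothetical trial-order variable named trial:

library(nlme)
fit_ar1 <- lme(rt ~ condition, random = ~ 1 | subject,
               correlation = corAR1(form = ~ trial | subject), data = d)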

Why does everyone go on about normality? — and how the CLT fits in

Because the classical inference tools — t-tests on coefficients, F-tests for overall model fit, confidence intervals, p-values — are derived assuming residuals are approximately normal. When that assumption fails badly, those tests can return numbers that look fine but are miscalibrated.

In practice, though, the picture is more nuanced than "all residual assumptions, always". The "iid Gaussian residuals" assumption breaks into three pieces — independent, identically distributed (homoscedastic), Gaussian — and the CLT helps with one of them strongly, the other two not at all.

Normality of residuals (the G in iid G residuals) — CLT does most of the work for you. In moderate-to-large samples, the CLT delivers an approximately Gaussian sampling distribution for sample means (and F-ratios, and t-statistics) regardless of the underlying residual shape. So in the large-N regime, residual normality stops being something you need to worry much about — your inference will still be approximately calibrated. This is what "robust to non-normality in large samples" means. Worry mainly when N is small and residuals look badly non-Gaussian — that combination is where the CLT has not yet kicked in and the assumption it would have rescued is also violated.

Homoscedasticity (equal variances, the id in iid G residuals) — CLT does nothing. If groups have different residual variances, MSwithin pools them under the assumption they're equal; if they aren't, the pooled estimate is meaningless regardless of N. The Welch correction we discussed in §01 is the actual fix, not larger sample size. More N just makes you more confidently miscalibrated.

Independence (the i in iid G residuals) — CLT does nothing. Correlated observations (repeated measures, clustering) have smaller effective sample size than N suggests, so CLT-derived standard errors are too small. The whole §02–§04 discussion of LMMs exists precisely because the CLT does not save you from violations of independence.

So the practical takeaway: in large samples, residual normality matters less than textbooks often imply. But homoscedasticity and independence matter at any sample size, and these are where the real diagnostic effort should go.
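A small simulation sketch of both points (the numbers are illustrative, but the pattern is stable): skewed residuals are rescued by N, unequal variances are not.

set.seed(1)
n_sim <- 5000
# (1) Heavily skewed data, no true effect, N = 100 per group:
#     the t-test's false-positive rate is still close to .05 (the CLT at work).
p_skew <- replicate(n_sim, t.test(rexp(100), rexp(100))$p.value)
mean(p_skew < .05)
# (2) Unequal variances with unequal group sizes (small noisy group, large quiet group):
#     the pooled test is miscalibrated; Welch is not. More N does not repair the pooled test.
sim_het <- function(pooled) {
  y1 <- rnorm(30, sd = 3); y2 <- rnorm(150, sd = 1)
  t.test(y1, y2, var.equal = pooled)$p.value
}
mean(replicate(n_sim, sim_het(TRUE))  < .05)   # well above .05
mean(replicate(n_sim, sim_het(FALSE)) < .05)   # close to .05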

What to take away

ANOVA: strong assumptions about independence and error structure — often violated.

LMM: weaker assumptions about dependence structure — but still a model with distributional assumptions that must be checked. LMMs solve "my data are clustered." The rest of the residual story — what the CLT does and does not bail you out of — is unpacked in the callout above.

I hope it is clear by this point that LMMs provide a substantial structural upgrade over ANOVA. But they move you closer to the structure of the data, not all the way to it. You still have to check residuals, model the right dependencies, and ensure the likelihood is appropriate for the data you actually have.

The progression is not about statistical fashion — it is about building models that increasingly match how the data were actually generated.

Part III
How good is the measurement itself? — Item Response Theory
§ 05

The measurement question: precision at the item level

We have now built increasingly sophisticated statistical models — from ANOVA to LMM — to extract sharper estimates from the same data. But there is a prior question we have not yet asked: how good is the measurement that produces the data in the first place?

A question before we begin

Suppose you want to measure how strong someone is. You have 20 dumbbells, but you do not know how much any of them weighs. You ask the person to lift as many as they can — just once — and you count the number they managed: say, 10. Now a second person also lifts 10.

Are those two people equally strong? You cannot know. The dumbbells might vary wildly in weight. Lifting 10 light dumbbells requires very different strength from lifting 10 heavy ones. Lifting the same 10 on a single occasion might not reflect what either person could do on a different day, with different dumbbells, after different rest. The count of 10 is a summary — but it is a summary that discards all the information about the items that produced it.

This is exactly the situation with psychological measurement using summed or averaged scores. The "items" — questionnaire items, recognition trials, reaction time stimuli — vary in difficulty, discriminating power, and relevance to the construct. Treating them as interchangeable and counting how many a person "passes" discards the most important information: which items, with what characteristics. Item Response Theory exists to put that information back.

← Recall from Part 1 — Level 1 averaging
Part 1 identified two levels of averaging. Level 2 (across people → group mean) is what ANOVA and LMM work on. Level 1 is averaging within a person — across items on a questionnaire, or across trials in a task — to produce each person's score. Classical test theory and Cronbach's α operate here, and their logic rests on the same LLN assumptions: items should be independent and approximately exchangeable (tau-equivalent). When they are not, the person-level score is an imprecise proxy for the latent trait — and no amount of LMM sophistication can recover precision that was lost in the measurement.

Item Response Theory (IRT) is the answer to this problem at Level 1. Rather than assuming all items are interchangeable and simply summing or averaging them, IRT models the relationship between each individual item and the latent trait explicitly. A fuller treatment of IRT as a psychometric tool for honing the items researchers use will come in a later resource; here the aim is just a brief overview, since it bears directly on something researchers blissfully ignore whenever they generate summary statistics from sums or averages of item sets.

The item characteristic curve

In IRT, every item has a characteristic curve describing the probability of a correct (or endorsed) response as a function of a person's latent trait level θ (theta). The most common model — the 2-parameter logistic (2PL) — is governed by two item parameters:

2-Parameter Logistic IRT model P(correct | θ) = 1 / (1 + exp(−a(θ − b)))
θ (theta): The person's position on the latent trait — e.g. face recognition ability, anxiety level, cognitive capacity. This is what we want to estimate.
a (discrimination): How steeply the curve rises — how sharply the item distinguishes between people just below and just above its threshold. A high-a item is highly informative; a low-a item barely discriminates. In CTT terms, this is the item-total correlation.
b (difficulty / location): Where on the θ scale the curve is centred — the trait level at which a person has a 50% chance of responding correctly. An item with b = 2 is only informative for people high on the trait; one with b = −1 is informative at lower levels.
Simulation — Item Characteristic Curves & Information
Each curve shows how the probability of a correct response changes with trait level θ for three items with different discrimination (a) and difficulty (b) parameters. The information panel shows where on the θ scale each item is most informative — its contribution to precision of measurement.
Items shown: Item 1 (teal), a = 1.0, b = −1.0 · Item 2 (amber), a = 1.8, b = 0.0 · Item 3 (violet), a = 0.6, b = 1.5.
Panels: item characteristic curves (P(correct) vs trait level θ) and item information (precision contributed at each θ level, per item and in total).
What to notice: High discrimination (a) makes the curve steep — the item sharply separates people near its difficulty level. Low discrimination makes it flat — the item barely helps regardless of trait level. The information panel shows that each item contributes precision primarily near its own difficulty location b. A good test covers the trait range with overlapping, high-discrimination items. A poor one has flat curves (low a) clustered in the wrong part of the range.
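A sketch that reproduces these curves numerically from the 2PL formula above. The item information formula used here, I(θ) = a²·P(θ)·(1 − P(θ)), is the standard 2PL result; the three (a, b) pairs are the items listed above.

icc  <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))               # P(correct | theta)
info <- function(theta, a, b) { p <- icc(theta, a, b); a^2 * p * (1 - p) }  # 2PL item information
items <- data.frame(a = c(1.0, 1.8, 0.6), b = c(-1.0, 0.0, 1.5))
theta <- seq(-4, 4, by = 0.1)
P <- sapply(1:3, function(i) icc(theta,  items$a[i], items$b[i]))
I <- sapply(1:3, function(i) info(theta, items$a[i], items$b[i]))
se_theta <- 1 / sqrt(rowSums(I))       # location-specific SE of the trait estimate (see below)
matplot(theta, P, type = "l", lty = 1, ylab = "P(correct)")    # item characteristic curves
matplot(theta, I, type = "l", lty = 1, ylab = "Information")   # item information curves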

How IRT improves precision

A summed score treats all items as equal — a correct response on an easy item adds the same to your score as a correct response on a hard one. But an easy item that almost everyone passes tells you very little about where in the upper range of ability a person sits; a hard item that very few pass tells you almost nothing about the lower range. IRT weights each item's contribution to θ estimation by how informative it actually is at that location.

The result is a person-level estimate θ̂ that comes with a known standard error — the inverse of the square root of the total information at that point on the trait scale. This is the precision of the measurement, not just assumed but calculated. In regions of the trait scale where the test has many high-discrimination items, estimates are sharp. Where the test has few items or weak items, estimates are blurry — and the model tells you so.

What CTT/summed scores cannot tell you

A raw score of 18/30 carries no information about how precisely it estimates the underlying trait level. Two people with the same summed score could have very different true trait levels if they got different subsets of items correct. The SE of a CTT score is assumed constant across the trait range.

What IRT provides

Each θ̂ estimate has its own SE, varying across the trait range according to how much information the test provides there. Measurement precision is not assumed — it is calculated. This connects directly back to the SE logic in Part 1: precision of the measurement has the same structure as precision of the mean estimate.

A note on PCA and other summary methods

The problems with summed scores are not unique to simple averaging. Principal Components Analysis (PCA) is often used as a quick alternative to factor analysis — extracting component scores that summarise a set of items. But PCA makes no model of how items relate to an underlying trait; it finds directions of maximum variance in the observed data, which may or may not correspond to meaningful latent constructs. Researchers who interpret PCA component scores as measures of a psychological construct are making the same implicit assumptions as those who interpret summed scores — without testing them. The same questions apply: Is the component unidimensional? Do items contribute equivalently? Is the extracted variance signal or noise? PCA does not answer these questions; it sidesteps them.

The connection upward: measurement precision feeds statistical power

Poor measurement at Level 1 propagates upward to Level 2. When a summed score is a noisy proxy for the true latent trait, the person-level scores used in ANOVA or LMM carry extra measurement error on top of whatever natural variability exists in the trait itself. That extra error shows up in the statistical model as additional within-group variance — noise that obscures the condition effect, inflates MSwithin, suppresses F, and reduces statistical power. This is the same chain Part 1 established: σ at the item level propagates through σ/√k at the person level, and from there into the SE of every group mean the inferential model consumes. Unreliable measurement at the bottom of the chain weakens every stage above it — no matter how sophisticated the statistical model at the top.

Improving measurement quality — whether by adding more items (the LLN route), selecting higher-discrimination items (the IRT route), or matching item difficulty to the target population — reduces this contamination. Better measurement means cleaner person-level estimates, smaller residuals in the LMM, and sharper inference at the group level. Precision compounds across levels.
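A small simulation sketch of this propagation: the same true group difference, measured with high or low reliability, and the resulting power of a simple two-group comparison (all numbers illustrative).

set.seed(2)
power_sim <- function(reliability, n = 40, d = 0.5, n_sim = 2000) {
  err_sd <- sqrt((1 - reliability) / reliability)   # error SD implied by the reliability (true-score variance = 1)
  mean(replicate(n_sim, {
    theta <- c(rnorm(n, 0), rnorm(n, d))            # true trait levels: control vs condition
    score <- theta + rnorm(2 * n, sd = err_sd)      # observed score = trait + measurement error
    t.test(score[1:n], score[(n + 1):(2 * n)])$p.value < .05
  }))
}
power_sim(0.9)   # precise measurement
power_sim(0.5)   # noisy measurement: same true effect, substantially lower power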

The complete picture

Statistical power is not only a function of sample size N. It is a function of the entire chain: how precise are the item-level measurements (IRT), how well do the person-level estimates capture the latent trait (reliability), how intelligently does the statistical model partition the variance it receives (LMM vs ANOVA), and how large is the true effect relative to the residual noise. Improving any link in the chain improves power — and IRT addresses the link that is usually treated as fixed. The chain itself continues beyond what we have covered here: stepping outside the Gaussian linear family altogether (binary outcomes, counts, skewed reaction times) is the GLM and GLMM territory we will pick up in a later resource.

The complete chain — precision at every level

Law of Large Numbers (Part 1 · §01)
Averaging independent observations produces a stable summary. More observations → mean closer to true value at rate σ/√N. Noise cancels; signal accumulates.
Central Limit Theorem (Part 1 · §02)
The distribution of sample means becomes approximately normal regardless of population shape. Standard error σ/√N governs precision — the key precision measure.
Confidence intervals (Part 1 · §03)
Frequentist procedures that contain the true parameter a fixed proportion of the time under the stated assumptions. A single interval is a single realisation of that procedure — the interpretation is about the procedure, not about this particular interval.
Two-level averaging (Part 1 · §04)
CLT operates at Level 1 (items → person score) and Level 2 (people → group mean). Cronbach's α is one index of internal consistency, computed from the pattern of inter-item covariances — it rises with k following the LLN, but is only a reliability estimate under restrictive assumptions and says nothing about unidimensionality.
t-test / ANOVA (Part 1 · §05 · Part 2 · §01)
Group means compared via F-ratio (MS_between / MS_within). Inference rests on the Gaussian linear likelihood — residuals independent, identically distributed, Gaussian. The CLT helps with the Gaussian piece in large samples; the other two must be handled separately.
Linear Mixed Models (Part 2 · §02–04)
When independence fails (repeated measures, clustering), LMM partitions individual-difference variance via random intercepts and slopes. Partial pooling stabilises sparse individual estimates; crossed random effects let participants and items both vary. Smaller residuals → sharper fixed-effect estimates → better precision without collecting more data.
Item Response Theory (Part 2 · §05)
When item exchangeability fails, IRT models each item's discriminating power and difficulty explicitly. Person estimates carry known, location-specific SEs. Better measurement → cleaner person scores → less noise contaminating everything above.