Statistical Foundations · Part 1 of 2

Why We Average
— and What It Buys Us

From the Law of Large Numbers to Gaussian distributions — and why this is the foundation for t-tests and ANOVA.

§ 00

Why does adding things up and dividing work?

The sample mean — sum all your scores, divide by how many you have — is probably the most used operation in all of quantitative science. But why does it work? What principle justifies treating a single number as a summary of dozens of observations?

This is not a trivial question. When a researcher in psychology computes a participant's average score across 40 questionnaire items, or when a clinician uses a scale total as a proxy for some underlying construct, they are implicitly relying on a set of deep mathematical ideas. These ideas connect three things that are usually taught separately:

The Law of Large Numbers

As you add more observations, your sample mean gets closer and closer to the true underlying value. Randomness averages out — but only if you give it enough room to do so.

The Central Limit Theorem

The distribution of sample means becomes bell-shaped (Gaussian) as sample size grows — regardless of what the individual observations look like. This is why the normal distribution appears so widely in statistical practice.

Together, these ideas form the statistical bedrock of a practice that runs from polling and clinical trials to psychometric scale construction and latent variable modelling — and they are a central reason why mean-based procedures like t-tests and ANOVA often work. They are not the whole story: whether any particular test behaves as advertised also depends on design choices, independence assumptions, and the dependence structure of the data. This page traces the full chain from first principles to inference.

§ 01

The Law of Large Numbers

Suppose you toss a fair coin repeatedly. The true probability of heads is exactly 0.5 — but any single toss gives you either 0 or 1, never 0.5. After 4 tosses you might have seen 3 heads (75%). After 10, perhaps 6 heads (60%). But something remarkable happens as you accumulate more and more tosses: the running proportion of heads settles, with increasing reliability, toward 0.5.

This is the Law of Large Numbers (LLN) in its most basic form. It says that for any random variable with a well-defined mean, the sample mean converges to the population mean as sample size grows without bound.

Law of Large Numbers — informal statement X̄ₙ = (X₁ + X₂ + … + Xₙ) / n  →  μ  as  n → ∞
X₁, X₂, …, Xₙ — Individual observations — e.g. each coin toss, each person's height, each item response, or each reaction time in an experiment. Each is a random draw from the same underlying distribution.
n — The number of observations collected so far.
X̄ₙ — The sample mean — the arithmetic average of n observations: sum them all up and divide by n.
μ (mu) — The true population mean — the average you would get if you could observe every possible value. This is usually unknown; we use X̄ to estimate it.
→ as n → ∞ — "Converges to" as n grows. In practice: the larger n is, the closer X̄ tends to be to μ.

The key insight is what averaging does mechanically. When you add up n observations and divide by n, you are diluting the influence of any individual data point. An extreme value that dominates a sum of 5 contributes almost nothing to a sum of 500. As each idiosyncratic observation becomes a tiny fraction of the whole, what remains is the signal they share: the underlying mean.
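The dilution argument is easy to watch happen. Below is a minimal sketch (plain Python, no particular library assumed) that tracks the running mean of simulated fair-coin tosses — the same quantity the widget in this section animates:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def running_mean_of_flips(n_flips):
    """Simulate n_flips fair-coin tosses (1 = heads, 0 = tails) and
    return the running mean after each toss."""
    total = 0
    means = []
    for i in range(1, n_flips + 1):
        total += random.randint(0, 1)  # one Bernoulli(0.5) draw
        means.append(total / i)
    return means

means = running_mean_of_flips(100_000)
# Early values swing wildly; late values hug the true mu = 0.5,
# because each new toss is diluted by everything before it.
print(abs(means[9] - 0.5), abs(means[-1] - 0.5))
```

With this many flips the final running mean typically sits within about a hundredth of 0.5, while the first few values can be anywhere between 0 and 1 — exactly the narrowing trajectory described above.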

Looking ahead — this principle operates at two levels

The LLN applies whenever you average over independent draws from the same distribution. In psychology and behavioural science, this happens at two distinct levels, and both matter:

(1) Within a person: averaging across multiple items on a questionnaire, or averaging reaction times across many trials, gives a more stable estimate of that individual's true score or speed.

(2) Across people: averaging across many participants' scores gives a more stable estimate of the true group mean.

We'll unpack both in §04. The Central Limit Theorem — coming next — applies at both levels too, and both applications are essential for understanding why the standard toolkit of inferential statistics works.

Try it yourself: a coin-toss simulator

Simulation 1 — Law of Large Numbers
Watch the running mean of coin tosses (1 = heads, 0 = tails) converge toward the true probability μ = 0.5 as you accumulate more flips. The same logic applies to any averaging process — reaction time trials, questionnaire items, or participants in a study.
[Interactive widget: readouts for total flips (n), the running mean X̄, the true mean μ = 0.500, and the distance |X̄ − μ|; the chart plots the running mean against the reference line μ = 0.5.]

Notice that early on the line can swing wildly — this is sampling variability. But with more flips, the trajectory narrows and presses toward 0.5. The variability shrinks in proportion to 1/√n: double the sample size and the expected distance from μ shrinks by a factor of √2, not 2.

Key Intuition

The mean works because independent noise cancels when you aggregate. Each individual observation carries genuine signal (the true mean) plus random error. Averaging many observations allows the errors — being unsystematic — to cancel each other out, leaving the signal behind. The more observations, the cleaner the signal.

§ 02

From Averaging to Gaussian: the Central Limit Theorem

The LLN tells us where the mean ends up. But it says nothing about the shape of the uncertainty around it. That is the job of the Central Limit Theorem (CLT). Before we get there, though, we need to clarify two things: the quantities we are trying to estimate, and what we mean when we say an estimate is precise.

A Useful Distinction: Mean vs. Variance (and Standard Deviation)

Any population has at least two important statistical properties:

A true mean μ — the average level of the trait across individuals.

A true variance σ² — a measure of how much individuals differ from that mean, computed as the average squared deviation from μ. We more often work with its square root, the true standard deviation σ, because σ is expressed in the same units as the data itself: if the trait is measured in milliseconds, σ is in milliseconds, while σ² is in milliseconds². Variance and standard deviation carry the same information — σ² is the bookkeeping form, σ is the readable form.

Both μ and σ are real features of the population — facts about the world, present whether or not anyone samples them. The CLT tells us how well we can estimate μ from a sample. It does not collapse σ away — true individual differences remain, and quantifying them is a separate scientific goal we return to in Part 2.

Before we leave σ behind, a point to keep in mind: σ will resurface almost immediately in a second role. Not as a target of estimation in its own right, but as the quantity that enters the formula for how precisely we can estimate μ from a finite sample. The same σ will therefore be doing two jobs in what follows — describing real variability between individuals in the population, and governing the error in our estimate of the mean. These are different things, and teaching routinely muddles them, so we will keep them apart deliberately.

What is the standard error, and why does it shrink?

As we saw in the earlier law-of-large-numbers example, the sample mean drifts towards the true mean μ as n grows. Our point estimate becomes more accurate with more sampling.

Every estimate drawn from finite data carries uncertainty: as we have illustrated, each fresh sample yields a slightly different mean. This is the natural consequence of sampling from any distribution. What we really have around any point estimate, then, is a distribution of plausible alternatives — the range of means we would have obtained had the coin of sampling landed a different way. Just as the familiar standard deviation describes the spread of observed data, this distribution of hypothetical means has a spread of its own. That spread is called the standard error, and it is what we mean by the precision of an estimate.

Standard error — in plain English SE = σ/√n
SE — The typical distance between one experiment's sample mean and the true mean μ. If you ran the same experiment many times, SE would be roughly the standard deviation of the resulting means — it is a standard deviation of an estimate, not of raw observations.
precision — An inverse notion: small SE = high precision, large SE = low precision. When we say a later sample is "more precise" we mean its SE is smaller.

Two kinds of variability: world vs. estimation error

The σ that appears in SE = σ/√n is the same σ introduced in the Mean vs. Variance box above — the true standard deviation of the population. Here, however, it plays a different role, and the distinction is worth noting. σ and SE may look similar on the page: both are standard deviations, both are associated with a mean, and σ appears in both formulas. Yet they describe fundamentally different kinds of uncertainty. σ represents the real variability in the population, while SE measures the error in our estimate of μ. One describes how individuals differ; the other describes the uncertainty in our inference procedure — which is why σ is fixed while SE decreases as the sample size increases.

σ — real variability in the world
The spread of individual observations. A feature of the population: if individuals genuinely differ, σ captures that heterogeneity. It does not shrink with more data — collecting more participants sharpens your estimate of σ, but the world does not become less variable because you measured it more thoroughly.
SE — error in the estimate of μ
The spread of sample means across hypothetical replications of your study. A property of your estimation procedure, not of the raw data. It does shrink with more data: averaging cancels noise, and the error on your estimate gets smaller.
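The world-vs-estimation contrast can be checked directly by simulation. A minimal sketch, using made-up population values (μ = 100, σ = 15, n = 50 — the symbols and numbers here are illustrative, not from any real study):

```python
import random
import statistics

random.seed(1)
MU, SIGMA, N = 100.0, 15.0, 50  # hypothetical population and study size

def one_study(n):
    """Draw one sample of n observations from N(MU, SIGMA^2)."""
    return [random.gauss(MU, SIGMA) for _ in range(n)]

# Replicate the whole study many times.
studies = [one_study(N) for _ in range(2000)]
sample_means = [statistics.mean(s) for s in studies]

# SD of raw observations: estimates sigma, does NOT shrink with more data.
sd_of_raw = statistics.stdev([x for s in studies for x in s])

# SD of the sample means across replications: this is the SE in action,
# and it sits near sigma / sqrt(n), far below sigma itself.
sd_of_means = statistics.stdev(sample_means)

print(sd_of_raw, sd_of_means, SIGMA / N ** 0.5)
```

The first number stabilises near 15 however much data accumulates; the second sits near 15/√50 ≈ 2.12 — two standard deviations, two different levels.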

This is a nice illustration of a recurring hazard in statistical notation: the same symbol — and even the same word variance — is pressed into service for quantities that mean quite different things in practice. σ² at the population level describes how much individuals differ; σ²/n at the sampling-distribution level describes how much an estimate of the mean wobbles from one study to the next. Both are variances formally, but they live at different levels and play different scientific roles. A useful reading habit: when you see "mean ± X" in a paper, ask — X of what, and at which level — the population, or the estimator?

Computing these from a sample: population vs. sample formulas

In practice we almost never have the whole population in hand. We work from a sample of size n and use it to estimate σ. Three formulas cover everything you need.

Population SD (σ)
σ = √[ Σ(xᵢ − μ)² / N ]
Used when you have the whole population. Divides by N. Mainly a theoretical and simulation quantity — the widget below uses it, since there we "play god" with the true distribution. In real research σ is rarely available.
Sample SD (s)
s = √[ Σ(xᵢ − x̄)² / (n − 1) ]
The workhorse — what statistical software actually computes from your data. Divides by n − 1, not n: this is Bessel's correction. Intuitively, sample points sit closer on average to x̄ than to the (unknown) μ, so dividing by n would slightly under-estimate real variability; n − 1 compensates. In later discussions you will see this n − 1 'correction' popping up frequently.
SE of the mean
SE = s/√n   (or σ/√n if σ is known)
Whichever SD is available goes on top. The CLT is stated in terms of σ/√n; applied research almost always uses s/√n. But the important learning point is that √n on the bottom is a key driver — more data, smaller SE.
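The three formulas map directly onto Python's standard library, which is a convenient way to see the n vs. n − 1 distinction in the open: `statistics.stdev` divides by n − 1 (the sample SD s) while `statistics.pstdev` divides by n (the population form). The data below are invented purely for illustration:

```python
import math
import statistics

sample = [4.2, 5.1, 3.8, 5.6, 4.9, 5.3, 4.4, 5.0]  # made-up observations
n = len(sample)
xbar = statistics.mean(sample)

# Sample SD: divide by n - 1 (Bessel's correction) -- what software reports.
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

# "Population-style" SD of the same numbers: divide by n. Always smaller.
sd_n = math.sqrt(sum((x - xbar) ** 2 for x in sample) / n)

# Estimated standard error of the mean: s over root n.
se = s / math.sqrt(n)

assert abs(s - statistics.stdev(sample)) < 1e-12    # stdev uses n - 1
assert abs(sd_n - statistics.pstdev(sample)) < 1e-12  # pstdev uses n
print(round(s, 4), round(sd_n, 4), round(se, 4))
```

Note the ordering: the n − 1 version is always a touch larger than the n version, which is exactly the direction Bessel's correction is meant to push.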

Forward pointer. Once s replaces σ, SE is itself an estimate and carries uncertainty of its own. For small n this is why inference about means uses Student's t rather than the normal — a point we return to when we build confidence intervals.

Why √n, not n?

This √n shrinkage is not new — it already appeared in Section 1, where doubling the sample size shrank the expected distance from μ by a factor of √2 rather than halving it outright. The question we deferred there was why the denominator is √n and not n itself.

Averaging cancels noise: observations above μ offset those below, and the more observations, the more thoroughly the errors cancel. But the cancellation is only partial — a lucky high observation cannot exactly undo an unlucky low one. The mathematics of variance says that the variance of a sum of n independent quantities grows in proportion to n, so the sum's standard deviation grows as √n — and the standard deviation of the average therefore shrinks as 1/√n. The practical consequence: quadrupling the sample halves the SE, rather than quartering it. Gains in precision are real but get expensive — the statistical reason behind the familiar lament that a study with twice the data is not twice as informative.
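A quick simulation makes the √n law concrete. The sketch below (hypothetical σ = 2; nothing here refers to real data) estimates the SE empirically at n = 25 and n = 100: quadrupling n should roughly halve the spread of sample means, not quarter it:

```python
import random
import statistics

random.seed(7)
SIGMA = 2.0  # assumed population SD for this illustration

def sd_of_sample_means(n, reps=4000):
    """Empirical SD of the mean of n draws from N(0, SIGMA^2),
    estimated over `reps` replications."""
    means = [statistics.mean(random.gauss(0, SIGMA) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

se_n = sd_of_sample_means(25)    # theory: SIGMA / 5  = 0.40
se_4n = sd_of_sample_means(100)  # theory: SIGMA / 10 = 0.20
print(se_n, se_4n, se_n / se_4n)  # ratio lands near 2, not 4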

From width to shape: the Central Limit Theorem

With SE in hand we have the width of the distribution of sample means — a number on a scale. The CLT now goes one step further and tells us its shape. Imagine repeating an experiment many times, each time drawing a fresh sample of size n and computing its mean. You would get a different mean each time. The CLT makes a remarkable claim about the distribution of those means:

Central Limit Theorem — key result As n → ∞,  the distribution of X̄ₙ approaches N(μ, σ²/n)
N(μ, σ²/n) — A normal (Gaussian) distribution — the bell curve — centred on the true mean μ, with spread determined by σ²/n.
σ²/n — The variance of the sampling distribution of means (SE squared). This shrinks as n grows — more observations per sample → less uncertainty about where the mean sits.
σ/√n — The standard error of the mean — already met above. The square root of σ²/n; the typical distance of a sample mean from μ.

Does the CLT only apply to averages?

Yes — and this matters. The CLT is a theorem specifically about the behaviour of sums and averages of random variables. It does not say that raw individual observations will become normally distributed; they won't. A reaction time, a number of errors, an item response — these may be skewed or irregular no matter how many you collect.

What the CLT tells you is that if you take n such observations and compute their sum — or, equivalently, their average — then that sum or average will be approximately normally distributed, and the approximation improves as n grows. In its classical form the CLT is a theorem about sums of independent random variables; because the mean is just the sum divided by n, the same result carries across to averages. This is why the Gaussian distribution is central to statistical inference: we are almost always working with sums, means, or totals, not raw individual values.
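This is easy to verify with a deliberately skewed population. The sketch below draws means from an Exponential(1) distribution (μ = σ = 1, the same setup as Simulation 2): the raw draws are strongly right-skewed, yet the means cluster symmetrically around μ with spread near σ/√n:

```python
import random
import statistics

random.seed(3)
MU = 1.0  # true mean of Exponential(rate=1), a right-skewed distribution
N = 40    # observations per "experiment"

def mean_of_exponential_sample(n):
    """Average of n draws from an exponential population."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# 5000 experiments, each recording one sample mean.
means = [mean_of_exponential_sample(N) for _ in range(5000)]

# The raw draws stay skewed no matter what; the MEANS behave like a
# Gaussian centred on mu with SD close to sigma / sqrt(n) = 1 / sqrt(40).
print(statistics.mean(means), statistics.stdev(means), 1 / N ** 0.5)
```

The empirical SD of the means landing near 1/√40 ≈ 0.158 — with the raw population untouched and still skewed — is the CLT doing exactly what the boxed statement above claims.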

Explore it: watch the CLT emerge

Simulation 2 — Central Limit Theorem
The population here is skewed (exponential) with true mean μ = 1.0. Each "experiment" draws n observations and records their mean — just as you might compute an average RT for one participant, or an average score for one group. Watch the distribution of means accumulate, and watch precision improve as you run more experiments.
[Interactive widget: sliders for sample size per experiment (n, default 5) and experiments run (k); chips reporting the current mean estimate (running average of sample means, approaches μ as k grows), the population SD σ (pooled spread of raw observations, stabilises at the true σ), and the SE of the μ estimate (shrinks as k grows). Left panel: the raw exponential population, always skewed. Right panel: the accumulating histogram of sample means, overlaid with the Gaussian predicted by the CLT and the line μ = 1.0.]
Two levers, two different jobs

n (sample size per experiment) sets the width of the right panel — the sampling distribution of single-experiment means has spread σ/√n. Raise n and that histogram narrows: each individual experiment's mean estimate is sharper.

k (number of experiments) does two things. It reveals the shape of that sampling distribution — as k grows the histogram fills in and its Gaussian form becomes visible (CLT). And it sharpens our pooled estimate of μ: the Current mean estimate uses all k·n observations, so its uncertainty shrinks like σ/√(k·n). That is the SE reported in the chip, and that is why it keeps decreasing.

Reading the chips above

Current mean estimate — running average of the sample means. Converges on μ = 1.0 as k grows.

Population SD (σ) — spread of raw observations pooled across experiments. Wobbly at first, then stabilises at the true σ = 1.0 as observations accumulate; does not shrink further. This is variability in the world — a property of the population, not an inference error.

SE of μ estimate — σ̂/√(k·n), the uncertainty in the Current mean estimate. Steadily decreases as k grows. This is inference error — it shrinks as we collect more data. Note this is not the visual width of the right panel (which stays at σ/√n); it is the SE of our pooled μ estimate.

A last observation to take with you: at n = 1, the right panel mirrors the left — no averaging has happened, so the "distribution of means" is just the distribution of raw observations. Raise n and the histogram of means becomes symmetric and bell-shaped even though the raw population stays skewed. The orange curve — the Gaussian predicted by the CLT — is the shape it converges to.

§ 03

Confidence intervals — and why they mean less than you think

Under broad conditions, the CLT tells us that the sampling distribution of the mean is approximately normal, centred on the true value μ. That is mathematically elegant, but on its own it does not tell us what to do with a single experiment. The confidence interval is the inferential payoff — the tool that turns a sampling distribution into a statement about where the true value might be. Understanding how it is constructed, and what it really says, exposes an interesting philosophical corner of frequentist inference that most textbooks glide over.

The construction: substitute, then adjust

You have run one experiment. You have one sample mean and one sample standard deviation. Somewhere out there is a true population mean μ you will never observe. The CLT says that if you could repeat the experiment many times, the resulting sample means would cluster around μ in a normal distribution with spread σ/√n. But you only ran the experiment once. You do not know σ. You do not know μ. You have one number.

Because μ and σ are unknown, the confidence interval is built by substituting the sample estimates into the CLT's formula: the sample mean X̄ stands in for μ, and the sample standard deviation s stands in for σ. Substituting s for σ adds extra uncertainty — we are using an estimate inside an estimate — so instead of reading widths off the normal distribution, we read them off the t distribution with n−1 degrees of freedom, which is a little wider to account for that extra uncertainty. The middle 95% of that t distribution gives us the interval.

Confidence interval — construction CI₉₅ = X̄ ± t(n−1, .025) × (s/√n)
X̄ — Your observed sample mean — treated, for the purposes of constructing the interval, as if it were the population mean.
s/√n — The estimated standard error of the mean — computed by substituting the sample SD (s) for the population SD (σ), which is unknown.
t(n−1, .025) — The critical t-value capturing the middle 95% of a t-distribution with n−1 degrees of freedom. Approximately 2.78 at n=5, 2.06 at n=25, 1.98 at n=100 — converging on the familiar z-value of 1.96 for very large samples.

The substitution is what lets us compute an interval from a single experiment — but it also constrains how we are allowed to interpret it. Because we plugged in estimates rather than known values, we do not get to conclude "95% probability μ is in this interval." Instead we fall back on the one thing the CLT really does let us say — a statement about what would happen across many hypothetical repetitions of the study.

The correct interpretation

A 95% confidence interval is defined by the procedure used to compute it, not by the single interval in front of you. If the study were repeated many times, with a fresh sample each time, 95% of the intervals constructed by this procedure would contain the true value μ. The other 5% would not. You cannot tell which kind yours is.
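The procedure-level claim can be simulated directly. A sketch using the Simulation 3 setup (μ = 5, σ = 2, n = 25; the critical value 2.064 for 24 degrees of freedom is hard-coded here rather than computed, to keep the example dependency-free):

```python
import math
import random
import statistics

random.seed(11)
MU, SIGMA, N = 5.0, 2.0, 25
T_CRIT = 2.064  # 97.5th percentile of t with 24 df (hard-coded)

def ci_from_one_experiment():
    """Run one experiment: draw a sample, build its 95% CI."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    return xbar - T_CRIT * se, xbar + T_CRIT * se

# Each interval either contains mu or misses it -- a binary event.
hits = sum(lo <= MU <= hi for lo, hi in
           (ci_from_one_experiment() for _ in range(4000)))
coverage = hits / 4000
print(coverage)  # long-run coverage settles near 0.95
```

Note where the 95% shows up: only in `coverage`, a property of the whole loop. No single `(lo, hi)` pair carries a probability of its own.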

Explore it: 95% of intervals contain the truth — which is not the same as "95% probability"

Simulation 3 — Confidence Intervals over Repeated Experiments
A true population mean μ = 5.0 sits beneath an amber line that runs across the plot. Each vertical bar is one experiment: a fresh sample of size n drawn from N(μ, σ²), producing its own mean (dot) and its own 95% CI (the bar). Intervals that contain μ are teal; intervals that miss μ are red. Run 100 experiments and watch roughly 95 land on truth — and about 5 miss. No individual interval knows which kind it is.
[Interactive widget: slider for sample size per experiment (n, default 25); counters for experiments run, intervals containing μ, intervals missing μ, and the coverage rate. Each vertical bar is one experiment's 95% CI with a dot at its sample mean: teal bars contain μ, red bars miss it, and the amber line marks the true μ = 5.0.]
What to look for

After 100 experiments the coverage rate should settle near 95%. Increase n and the intervals get narrower — the procedure is more precise — but the coverage rate stays the same. Precision and coverage are two different properties: a smaller interval is more informative when it is right, but the rate at which the procedure hits μ is fixed by the critical t-value, not by n. The t-value adjusts with sample size specifically to keep that coverage rate at 95% across the whole range.

What a single CI really tells you

Remember that we have set our confidence level at 95%, which is to say that about 5% (roughly 1 in 20) of repeated experiments will generate a CI that does not contain the true mean μ. Look at the plot above once it has 100 or so experiments on it. Typically about 5 of those intervals will miss μ; each individual interval either contains μ or it does not. There is no probability in between — the truth is fixed, the interval is fixed, and the relationship between them is either "contains" or "misses." Saying "there is a 95% probability that this interval contains μ" is, under the frequentist framework, saying something that is either 0 or 1 — not 0.95. The 95% lives in the procedure, not in any individual interval.

A note on interpretation — and a look ahead

The numerical bounds of a single CI — the 0 and the 0.66, the 4.2 and the 5.8 — do not tell us, for this particular study, where μ probably lies. They are the output of a procedure that, across repeated studies, would capture μ 95% of the time. A wider CI from a smaller sample and a narrower CI from a larger sample have the same coverage rate; they differ in precision, not in probability of being correct.

If that feels restrictive, it is — and it reflects how frequentist statistics was originally set up. In this tradition, probabilities are attached to procedures, not to particular parameter values. A different tradition — Bayesian statistics — attaches probabilities directly to parameters, and so can support statements like "there is an 85% probability the true effect lies between .2 and .5". The Bayesian credible interval looks superficially like a CI, but carries exactly the interpretation readers often reach for and that the frequentist CI cannot. We return to this alternative in Part 2 (Bayesian LMMs).

Why the weird interpretation is the honest one

It is tempting to read all of this as a bookkeeping technicality — to note the unusual interpretation and then revert, in practice, to treating the CI as "probably where μ is." But the weirdness is pointing at something real. The frequentist framework is committed to an austere ontology: probability is a long-run frequency property of a repeatable procedure, not a degree of belief about a fixed-but-unknown quantity. Under that commitment, μ is not a random variable. The interval is. And the only probability statement you get to make is about the behaviour of the procedure — not about the truth.

This forces two things to be binary at the same time, and it is worth keeping them distinct. The procedural assumptions — that the sampling distribution of the mean really is approximately normal, that observations are independent, that the sample SD is a reasonable substitute for σ — either hold for the population and design you drew from, or they do not. There is no partial credit; you were sampling from the right kind of world, or you were not. Separately, the realisation on your single run is also binary: μ is either inside the interval you just computed, or it is outside. The 95% does not apply to this interval. It applies to the long-run behaviour of the procedure across hypothetical repetitions.

A direct consequence follows, and it is worth stating plainly: under this framework, when you are wrong, you are totally wrong. The interval either contains μ or it does not. There is no "close to μ" partial credit for a near miss, no sliding scale of being wrong by a little. The procedure has an advertised coverage rate of 95%, but the output on any particular run is one of two things — a hit or a miss — and you have no way of telling which kind yours is without knowing μ, which by definition you do not. You ran a procedure whose long-run behaviour is well-characterised; you received one draw from that long-run behaviour; and the event on that draw is binary. This is part of why the Bayesian alternative, which attaches probability directly to μ and so admits degrees of being close, feels so natural to people the first time they meet it.

The substitute-then-adjust structure is the price of inference without priors. You construct the interval by plugging sample estimates into the CLT's formula (otherwise you have nothing to work with), and you interpret it under the rule that those estimates were a calculation device, not a claim of equality with the population values. The resulting interval is useful and principled — but its formal frequentist interpretation is narrower than the intuitive, Bayesian-flavoured reading many people reach for on first encounter.

Looking ahead

The CLT + CI machinery is what makes t-tests and ANOVA work — the next section shows how the same logic of "mean ± precision" generates the t-statistic and the F-ratio. The odd interpretation of a single CI is the same odd interpretation that sits behind every p-value you will ever compute.

§ 04

Two Levels of Averaging in Behavioural Science

As foreshadowed in §01, the logic of LLN and CLT applies at two distinct levels in a typical study. Both are applications of the same mathematical idea; both carry the same assumptions; and both can fail in the same kinds of ways.

The Double Averaging Structure
[Diagram — The Double Averaging Structure. Level 1 (items → person score): k item responses are averaged (Σ/k) to give each of N people a score. If the items tap one construct, each is an imperfect indicator; averaging k items cancels independent error, and by the LLN the score converges on the person's true score (α is one index of internal consistency, under assumptions). Level 2 (people → group estimate): the N person scores are averaged (Σ/N) to give the group mean, and by the CLT X̄ → N(μ, σ²/N), enabling t-tests and ANOVA. Same theorems, two levels — same assumptions.]

Level 1: Averaging items — and what Cronbach's α is actually measuring

Consider a 30-item questionnaire measuring trait anxiety. Each item is a noisy measure of the participant's true anxiety level — it captures the construct imperfectly and also picks up item-specific variance plus random response error. When we average across those 30 responses, the LLN does its work: to the extent that item-specific noise is independent across items, it cancels. What remains is predominantly the shared signal — the person's underlying trait level.

Psychometric researchers are naturally wary about whether this averaging has actually worked — that is, whether the mean score we generate for each individual represents something close to their true trait level, or has been overwhelmed by item-specific noise. This concern has a name in measurement theory: reliability — the degree to which an observed score reflects the underlying trait rather than measurement error. A low-reliability measure is dominated by error; a high-reliability measure is one in which the averaging has done its work and the score is a stable estimate of the trait.

Cronbach's α is best thought of as one index of internal consistency, computed from the pattern of inter-item covariances. Under restrictive assumptions — essentially tau-equivalence (items measure the same construct with equal loadings) and uncorrelated errors — α can be interpreted as a lower bound on reliability in the classical-test-theory sense. Its connection to the LLN is direct: α increases as the number of items k increases. Add more items and the aggregate becomes a more stable summary of whatever the items are jointly tapping — just as adding more coin flips brings the running mean closer to 0.5. A fuller treatment of α, its assumptions, and the alternatives (ω, composite reliability, factor-based indices) is the subject of a later page.

It is important not to misread what a high α tells you, however. A high α does not necessarily mean that the items are strongly intercorrelated. Because α is sensitive to the number of items, a scale with a large number of items can achieve a high α even when the average inter-item correlation is quite low. What a high α does tell you — and this is the LLN point — is that with enough items, the aggregate average is a reasonably stable summary of whatever signal the items share, whatever that signal happens to be. Whether it is the right signal, and whether all items contribute equally to it, are separate questions.
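The item-count effect is easy to demonstrate. A sketch under a hypothetical one-factor model — weak items with an assumed loading of 0.3 on a single latent trait, so the average inter-item correlation is only about .08 — showing α climbing with k even though the items never get any better:

```python
import random
import statistics

random.seed(5)
N_PEOPLE = 2000

def simulate_scale(k, loading=0.3):
    """Each item = loading * true score + unit noise: deliberately weak
    items (low inter-item correlation) tapping one hypothetical construct."""
    data = []
    for _ in range(N_PEOPLE):
        t = random.gauss(0, 1)  # the person's latent true score
        data.append([loading * t + random.gauss(0, 1) for _ in range(k)])
    return data

def cronbach_alpha(data):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = len(data[0])
    item_vars = [statistics.variance(col) for col in zip(*data)]
    total_var = statistics.variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

a5 = cronbach_alpha(simulate_scale(5))    # few weak items: modest alpha
a40 = cronbach_alpha(simulate_scale(40))  # same weak items, more of them
print(round(a5, 2), round(a40, 2))
```

With only 5 of these weak items α is modest, but 40 of them push α well above the conventional .7 threshold — while the average inter-item correlation stays stuck near .08 throughout. That is the LLN point, and the caution, in one run.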

The sharpest version of this point

High reliability + poor construct validity = consistent measurement of the wrong thing.

A scale can be internally consistent — items agreeing with each other, α high, LLN doing its job — while systematically measuring something other than the intended construct. The averaging is working; the target is wrong. Reliability tells you the score is stable. It says nothing about whether the stable thing is the thing you care about. Construct validity is a separate question that reliability evidence alone cannot answer.

Although the LLN gives us some sense that the observed individual mean is probably a reasonable proxy for the individual's true score, there are a bunch of assumptions at play under the hood. Three are worth touching on.

First, that errors across items are uncorrelated and random — the key assumption of the LLN. For psychometric measures this may not be true: items with similar wording, shared response formats, or carry-over effects from one item to the next can introduce correlated error, at which point the cancelling logic of the LLN no longer holds.

Second, that all items contribute to the observed estimate of a participant's ability equally. This is called tau-equivalence, and we return to it in a later discussion. If some items are much stronger or weaker indicators than others, the averaging is not cleanly estimating one latent quantity.

Third, that all the numbers we are combining represent a single construct — for instance, the participant's singular latent ability on whatever is being tested. If this is wrong, we are combining apples and pears and thinking we have only apples, when in fact we have a fruit salad.

This is one reason why more flexible approaches — which we preview at the end of this page and explore fully in Part 2 — are sometimes necessary.

Level 2: Averaging people — why the CLT assures us of normality

Now we have N person-level scores, each treated as a single observation drawn from the population of trait levels. As we saw in § 02, when we compute the group mean across N people, we are computing a sample mean of N draws from some population distribution, and the CLT tells us that with enough draws the sampling distribution of that mean is approximately normal (Gaussian).

And to remind you: so long as the earlier assumptions we discussed hold, the CLT applies. Regardless of what the population distribution of individual scores looks like (even if individual anxiety scores are skewed, bimodal, or otherwise non-normal), the distribution of sample means will be approximately normal, centred on the true population mean μ, with standard error σ/√N. This is not an assumption about the data; it is a mathematical consequence of independence and finite variance.

This is what makes t-tests, ANOVAs, and confidence intervals legitimate. We are not assuming individual data points are Gaussian. We are relying on the CLT's guarantee that means of sufficiently large, independent samples will be approximately Gaussian. Larger N means a better approximation and more reliable inference.
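A short simulation illustrates the guarantee. The choices below are illustrative: an exponential population (strongly right-skewed, about as un-Gaussian as everyday data gets) and samples of N = 50:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0          # Exponential(1): mean 1, sd 1, heavily right-skewed
N, n_studies = 50, 10_000

# Each row is one "study" of N skewed observations; take each study's mean.
samples = rng.exponential(scale=mu, size=(n_studies, N))
means = samples.mean(axis=1)

skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
print(f"mean of means ~ {means.mean():.3f}  (theory: mu = {mu})")
print(f"sd of means   ~ {means.std():.3f}  (theory: sigma/sqrt(N) = {sigma/np.sqrt(N):.3f})")
print(f"skew of means ~ {skew:.2f}  (population skew = 2)")
```

The means centre on μ, their spread matches σ/√N, and their skew is a small fraction of the population's: the bell shape emerges from the averaging, not from the raw data.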

The shared logic

Both levels of our averaging thus rest on the same foundation: independent observations, averaging, noise cancellation, and convergence toward a well-behaved normal distribution. The same mathematical engine drives reliability at Level 1 and inference at Level 2. Which means the same failure modes — non-independence (observations influence one another, so knowing one carries information about others), heterogeneity (observations are drawn from different underlying distributions rather than a single common one), and violated exchangeability (the order or grouping of observations matters, so they cannot be freely relabelled) — threaten both.

§ 05

From Means to Inference: t-tests and ANOVA

We have now established why frequentist statisticians average in the first place (along with a number of the assumptions that averaging relies on!), and why the means we compute from samples can be approximately normally distributed. The next question is natural: once we have group means, what do we do with them?

Suppose a researcher runs a simple experiment with two conditions: say, a face recognition task using typical and inverted faces. Fifty participants complete each condition. The researcher computes a mean accuracy for each condition, X̄₁ and X̄₂. The CLT, as we now know, assures us that each of these means is approximately normally distributed around its true condition mean, μ₁ and μ₂ respectively.

The natural question is whether the observed difference X̄₁ − X̄₂ reflects a real difference in the underlying population means. But as I have tried to make clear throughout, the answer depends on how much we trust each mean estimate: how precisely it represents its true condition value (particularly given that, as we have established, we substitute the sample SD for the unknown population SD when working out that precision). This is the theme of estimate fidelity, or precision, that runs through everything on this page. In the case of the CLT, the message is that the more observations we aggregate, the sharper our estimates come into focus, and the more confidently we can judge whether an apparent difference between them is genuine.

The t-test: how precisely do we know each mean?

The t-statistic captures this idea of precision directly. The standard error of each mean (σ/√N, from the CLT) tells us how sharply that mean is in focus. A small standard error means the estimate is tight: with high N, we can be confident the observed mean is close to the true condition mean. And because the difference between two means inherits its precision from the two means themselves, when both standard errors are small the difference is estimated with high precision, so even a very small observed difference may be informative.

Independent samples t-statistic: t = (X̄₁ − X̄₂) / SEdiff

X̄₁ − X̄₂: the observed difference between sample means. This is what we want to interpret, but its meaning depends on how precisely each mean is known.

SEdiff: the standard error of the difference, derived from σ/√N for each group. This quantifies the precision of the observed difference: a small SE means both means are in sharp focus, and thus so is the difference between them; a large SE means the means are blurry and the difference could shift substantially with different samples.

t: the ratio of the observed difference to the precision of that estimate. A large t means the observed difference is large relative to the uncertainty in our mean estimates. Under the null hypothesis that μ₁ = μ₂, this ratio follows a known t-distribution, giving us a p-value. Note the ratio form: it captures the key idea of signal (the difference we care about) over noise (the uncertainty blurring it). The larger the signal relative to the noise, the larger the t value, and the more confident we can be that the observed difference is not just sampling noise. Keep the idea of ratios in mind; it comes up a lot!

The CLT runs through every part of this. SEdiff is derived from σ/√N — precisely the standard error the CLT gives us for each sample mean. As N grows, the CLT makes our mean estimates sharper; the SE shrinks; and any observed difference between conditions comes into correspondingly clearer focus. The t-distribution itself depends on the normality of the sample means — which the CLT justifies. Without the CLT, the t-test has no foundation.
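To make the mechanics concrete, here is a minimal sketch that computes the pooled-variance t-statistic by hand and checks it against `scipy.stats.ttest_ind` (the accuracy values and group sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical accuracy scores for the two face conditions (illustrative numbers).
typical  = rng.normal(0.85, 0.08, size=50)
inverted = rng.normal(0.70, 0.08, size=50)

n1, n2 = len(typical), len(inverted)
m1, m2 = typical.mean(), inverted.mean()
v1, v2 = typical.var(ddof=1), inverted.var(ddof=1)

# Pooled variance, then the standard error of the difference in means
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_diff = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_manual = (m1 - m2) / se_diff

t_scipy, p = stats.ttest_ind(typical, inverted)   # pooled-variance t-test by default
print(f"t (by hand) = {t_manual:.3f}, t (scipy) = {t_scipy:.3f}, p = {p:.2g}")
```

The hand-computed t and SciPy's agree to machine precision; the whole statistic is just the observed difference divided by the CLT-derived standard error of that difference.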

ANOVA: the same idea, extended to multiple groups

If there are more than two groups — say, three face conditions — pairwise t-tests become unwieldy and inflate the risk of false positives. Analysis of Variance (ANOVA) generalises the t-test logic to any number of groups simultaneously.

Instead of comparing two means directly, ANOVA partitions the total variance in the data into two components and examines their ratio:

ANOVA — the F-ratio: F = Variance between groups / Variance within groups

Between-groups variance: how much the group means vary around the overall grand mean. If the conditions have different true means (a real effect), this will be large.

Within-groups variance: how much individuals vary within each condition, the baseline noise level against which we judge the between-group signal.

F: the same signal-over-noise ratio I flagged earlier. When F is large, the differences between group means are large relative to within-group noise, which is evidence for a real effect. Like the t-statistic, F follows a known distribution under the null hypothesis, allowing a p-value to be calculated.

The same fidelity logic underpins ANOVA. Each group mean is a CLT-guaranteed estimate of the true condition mean — sharper when N is large, blurrier when it is small. The F-ratio asks whether the differences between those group means are large relative to the precision with which each is known. The F-distribution, used to calculate p-values, is derived from the same normality of sample means that the CLT provides. ANOVA is t-test logic, generalised to multiple groups.
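The same by-hand exercise works for the F-ratio. The sketch below (three invented face conditions) partitions the total sum of squares into between- and within-group components and checks the result against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Three hypothetical face conditions (illustrative means and spread)
groups = [rng.normal(mu, 0.08, size=50) for mu in (0.85, 0.80, 0.70)]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-groups: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-groups: spread of individuals around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p = stats.f_oneway(*groups)
print(f"F (by hand) = {f_manual:.2f}, F (scipy) = {f_scipy:.2f}, p = {p:.2g}")
```

Both numerator and denominator are variances per degree of freedom, so F really is the between-group signal judged against the within-group noise.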

The full chain, end to end

We can now trace the complete logical sequence from raw observation to statistical inference:

Law of Large Numbers: averaging independent observations → stable summary
Central Limit Theorem: distribution of means → approximately normal
Group means: each condition mean ~ N(μ, σ²/N)
t-test / ANOVA: signal-to-noise ratio on normally distributed means

Every step depends on the one before. The t-test requires normally distributed means. Normally distributed means require the CLT. The CLT applies to averages. And averaging cancels noise only when the LLN conditions — independence, identical distribution — are satisfied. The chain is only as strong as its weakest link.

What could go wrong?

To reiterate, then: this chain of assumptions holds when (a) observations within a person or condition are independent and exchangeable (i.e., noise is random and 'cancels out'); (b) sample sizes are large enough for the CLT approximation to be good (see simulation 2); (c) the items or trials we average over really are measuring the same thing with similar precision (see the discussion of Cronbach's α above).

When these conditions fail — very small samples, clustered or repeated-measures data, heterogeneous item quality, constructs that are genuinely multidimensional — the neat chain from LLN to valid inference breaks down. In Part 2, we examine what happens then, and how more flexible modelling frameworks (Linear Mixed Models; Item Response Theory) respond.
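One of these failures is easy to demonstrate. In the sketch below (all sizes and variances are illustrative), the null hypothesis is true, but each "observation" is a trial nested within a participant; feeding the trials into an ordinary t-test as if they were independent inflates the false-positive rate far above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n_subj, n_trials, n_sims = 10, 20, 2000

def clustered_sample():
    """20 trials from each of 10 participants, flattened to 200 'observations'."""
    subj_effects = rng.normal(0, 1, size=(n_subj, 1))   # stable person-level differences
    trial_noise = rng.normal(0, 1, size=(n_subj, n_trials))
    return (subj_effects + trial_noise).ravel()

false_pos = 0
for _ in range(n_sims):
    # Null is true: both "conditions" come from the same generative process.
    _, p = stats.ttest_ind(clustered_sample(), clustered_sample())
    false_pos += p < 0.05

print(f"false-positive rate ~ {false_pos / n_sims:.2f}  (nominal: 0.05)")
```

Because knowing one of a participant's trials tells you about their others, the effective sample size is closer to the number of participants than the number of trials. This is exactly the kind of clustering that the mixed models of Part 2 handle explicitly.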

Error is not one thing

Alongside the LLN and the CLT sits a wider psychometric framework called Classical Test Theory (CTT), which I will discuss on a later page. For now it is worth highlighting another key underlying assumption. CTT bundles all imprecision into a single term, E, the error. This view has been implicit throughout this page: measurement error is out there, but you can try to deal with it through repeated sampling. In practice, however, "error" is not a single undifferentiated quantity. The deviation between an observed score and some true value can come from many sources simultaneously:

Sources of error in a typical study

Item variance: some items are harder, more ambiguous, or tap slightly different sub-facets of the construct.
Person variance: individuals differ in their baseline levels and in how consistently they respond.
Session variance: performance fluctuates with time of day, fatigue, and motivation.
Context variance: the experimenter, the setting, and the order of conditions all introduce systematic differences that CTT folds silently into E.

Why it matters

When error comes from multiple structured sources, rolling them into a single E is not just imprecise — it is actively misleading. Some of what CTT calls "error" is actually systematic and predictable variation. A model that knows where the variance comes from can separate these sources, producing sharper estimates at each level. This is the core motivation for Linear Mixed Models (which partition person, item, and residual variance) and for Generalizability Theory (which formally estimates variance components from each source).

The same logic applies to the inference chain built in this page. When the LLN averages across items or across people, it cancels random error. But if some of the apparent error is actually structured — the same items always harder, the same people always more variable — averaging doesn't cancel it, it obscures it. The error term in CTT is not a scientific description of noise; it is a confession that the model does not know where the noise came from. The irony is that researchers often do know about these error sources a priori: they know which participants contributed which observations, which things were repeated and which were not, and which set of items each mean was derived from. So why not specify those details in the statistical model? This is a major theme in the argument for stepping away from ANOVA in the first place!

Where we got to — a summary of the learning points
  1. The LLN. Averaging independent observations cancels random noise, and sample means drift toward μ as n grows. The error shrinks like 1/√n, not 1/n: more data helps, but precision gets more expensive the more of it you already have.
  2. σ vs SE. Two standard deviations that describe two different things. σ is real variability in the population — a fixed property of the world. SE is error in our estimate of μ — a property of our inference procedure. One does not shrink with more data; the other does.
  3. Sample SD (s) and Bessel's correction. In practice we estimate σ from the sample as s, with n − 1 in the denominator to correct for sample points sitting closer to x̄ than to μ. The n − 1 correction recurs throughout statistics.
  4. The CLT. For sufficiently large n, the distribution of sample means is approximately normal, centred on μ, with spread σ/√n — no matter what the population distribution looks like. This is what makes inference on means mathematically tractable.
  5. Confidence intervals describe a procedure, not a particular parameter. A 95% CI would capture μ 95% of the time across repeated studies; it is not a probability statement about any particular interval. Bayesian credible intervals (Part 2) attach probabilities to parameters directly.
  6. Signal over noise. The t- and F-statistics are ratios: observed difference divided by the precision of the estimate. The signal-to-noise idea — and ratios more generally — recur throughout inferential statistics.
  7. Two levels of averaging, same assumptions. Averaging items to form a person score (reliability, Cronbach's α) and averaging people to form a group estimate (CLT on means) rest on the same assumptions — independence, equal contribution, a single construct — and share the same failure modes.
  8. Assumptions everywhere. Each step of the chain rests on assumptions that may not hold in real behavioural data: uncorrelated errors, tau-equivalence, a single latent construct, random rather than structured noise. When those assumptions break, the neat chain from LLN to valid inference breaks too — which is where Part 2 picks up.

Part 2: When the Chain Breaks

This page established the foundations: averaging works when independence holds, and normally distributed means make inference possible. But in real behavioural data, independence is often violated, sample sizes are often small, and constructs are often multidimensional. What then?

Linear Mixed Models

When observations are clustered — repeated measures within participants, students within classrooms — standard ANOVA assumptions break. LMMs model individual variability explicitly rather than averaging it away.

Item Response Theory

When items differ in difficulty and discriminating power, tau-equivalence fails and summed scores are imprecise. IRT models each item's relationship to the latent trait, producing better-calibrated person estimates.

Measurement & Validity

When summed scores carry systematic bias, or measurement error attenuates correlations, statistical power suffers in predictable and partly correctable ways. Understanding measurement quality is inseparable from understanding inference.