From the Law of Large Numbers to Gaussian distributions — and why this is the foundation for t-tests and ANOVA.
The sample mean — sum all your scores, divide by how many you have — is probably the most used operation in all of quantitative science. But why does it work? What principle justifies treating a single number as a summary of dozens of observations?
This is not a trivial question. When a researcher in psychology computes a participant's average score across 40 questionnaire items, or when a clinician uses a scale total as a proxy for some underlying construct, they are implicitly relying on two deep mathematical ideas that are usually taught separately:
The Law of Large Numbers: as you add more observations, your sample mean gets closer and closer to the true underlying value. Randomness averages out — but only if you give it enough room to do so.
The Central Limit Theorem: the distribution of sample means becomes bell-shaped (Gaussian) as sample size grows — regardless of what the individual observations look like. This is why the normal distribution appears so widely in statistical practice.
Together, these ideas form the statistical bedrock of a practice that runs from polling and clinical trials to psychometric scale construction and latent variable modelling — and they are a central reason why mean-based procedures like t-tests and ANOVA often work. They are not the whole story: whether any particular test behaves as advertised also depends on design choices, independence assumptions, and the dependence structure of the data. This page traces the full chain from first principles to inference.
Suppose you toss a fair coin repeatedly. The true probability of heads is exactly 0.5 — but any single toss gives you either 0 or 1, never 0.5. After 4 tosses you might have seen 3 heads (75%). After 10, perhaps 6 heads (60%). But something remarkable happens as you accumulate more and more tosses: the running proportion of heads settles, with increasing reliability, toward 0.5.
This is the Law of Large Numbers (LLN) in its most basic form. It says that for any random variable with a well-defined mean, the sample mean converges to the population mean as sample size grows without bound: X̄ₙ → μ as n → ∞.
| Symbol | Meaning |
|---|---|
| X₁, X₂, …, Xₙ | Individual observations — e.g. each coin toss, each person's height, each item response, or each reaction time in an experiment. Each is a random draw from the same underlying distribution. |
| n | The number of observations collected so far. |
| X̄ₙ | The sample mean — the arithmetic average of n observations: sum them all up and divide by n. |
| μ (mu) | The true population mean — the average you would get if you could observe every possible value. This is usually unknown; we use X̄ to estimate it. |
| → as n → ∞ | "Converges to" as n grows. In practice: the larger n is, the closer X̄ tends to be to μ. |
The key insight is what averaging does mechanically. When you add up n observations and divide by n, you are diluting the influence of any individual data point. An extreme value that dominates a sum of 5 contributes almost nothing to a sum of 500. As each idiosyncratic observation becomes a tiny fraction of the whole, what remains is the signal they share: the underlying mean.
Notice that early on the line can swing wildly — this is sampling variability. But with more flips, the trajectory narrows and presses toward 0.5. The variability shrinks in proportion to 1/√n: double the sample size and the expected distance from μ shrinks by a factor of √2, not 2.
The mean works because independent noise cancels when you aggregate. Each individual observation carries genuine signal (the true mean) plus random error. Averaging many observations allows the errors — being unsystematic — to cancel each other out, leaving the signal behind. The more observations, the cleaner the signal.
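This cancellation is easy to watch in a few lines of simulation. A minimal sketch of the coin example (the seed and the 10,000-flip horizon are arbitrary choices, not anything from the text above):

```python
# Running proportion of heads across repeated fair-coin flips (LLN in action).
import numpy as np

rng = np.random.default_rng(seed=1)
flips = rng.integers(0, 2, size=10_000)                    # each flip: 0 = tails, 1 = heads
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (4, 10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: proportion of heads so far = {running_mean[n - 1]:.3f}")
# Early values can swing wildly (sampling variability); later ones press toward 0.5.
```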
The LLN tells us where the mean ends up. But it says nothing about the shape of the uncertainty around it. That is the job of the Central Limit Theorem (CLT). Before we get there, though, we need to clarify two things: the quantities we are trying to estimate, and what we mean when we say an estimate is precise.
Any population has at least two important statistical properties:
A true mean μ — the average level of the trait across individuals.
A true variance σ² — a measure of how much individuals differ from that mean, computed as the average squared deviation from μ. We more often work with its square root, the true standard deviation σ, because σ is expressed in the same units as the data itself: if the trait is measured in milliseconds, σ is in milliseconds, while σ² is in milliseconds². Variance and standard deviation carry the same information — σ² is the bookkeeping form, σ is the readable form.
Both μ and σ are real features of the population — facts about the world, present whether or not anyone samples them. The CLT tells us how well we can estimate μ from a sample. It does not collapse σ away — true individual differences remain, and quantifying them is a separate scientific goal we return to in Part 2.
Before we leave σ behind, a point to keep in mind: σ will resurface almost immediately in a second role. Not as a target of estimation in its own right, but as the quantity that enters the formula for how precisely we can estimate μ from a finite sample. The same σ will therefore be doing two jobs in what follows — describing real variability between individuals in the population, and governing the error in our estimate of the mean. These are different things, and teaching routinely muddles them, so we will keep them apart deliberately.
As we saw in the earlier example of the Law of Large Numbers, as n grows, the sample mean X̄ converges toward the true mean μ. Our point estimate becomes more accurate with more sampling.
Every estimate drawn from finite data carries uncertainty: as we have illustrated, each fresh sample experiment gives a slightly different mean. This is the natural consequence of sampling from a putative distribution. Consequently, what we really have around any point estimate is a distribution of plausible alternatives — the range of means we would have obtained had the coin of sampling landed a different way. Just as the familiar standard deviation describes the spread of an observed distribution, the spread of this distribution of plausible means has its own measure: the standard error. This is what we mean by the precision of an estimate.
| Term | Meaning |
|---|---|
| SE | The typical distance between one experiment's sample mean and the true mean μ. If you ran the same experiment many times, SE would be roughly the standard deviation of the resulting means — it is a standard deviation of an estimate, not of raw observations. |
| precision | An inverse notion: small SE = high precision, large SE = low precision. When we say a later sample is "more precise" we mean its SE is smaller. |
The σ that appears in SE = σ/√n is the same σ introduced in the Mean vs. Variance box above — the true standard deviation of the population. Here, however, it plays a different role, and the distinction is worth noting. σ and SE may look similar on the page: both are standard deviations, both are associated with a mean, and σ appears in both formulas. Yet they describe fundamentally different kinds of uncertainty. σ represents the real variability in the population, while SE measures the error in our estimate of μ. One describes how individuals differ; the other describes the uncertainty in our inference procedure — which is why σ is fixed while SE decreases as the sample size increases.
This is a nice illustration of a recurring hazard in statistical notation: the same symbol — and even the same word variance — is pressed into service for quantities that mean quite different things in practice. σ² at the population level describes how much individuals differ; σ²/n at the sampling-distribution level describes how much an estimate of the mean wobbles from one study to the next. Both are variances formally, but they live at different levels and play different scientific roles. A useful reading habit: when you see "mean ± X" in a paper, ask — X of what, and at which level — the population, or the estimator?
In practice we almost never have the whole population in hand. We work from a sample of size n and use it to estimate σ. Three formulas cover everything you need:
X̄ = (X₁ + X₂ + … + Xₙ)/n, the sample mean, which estimates μ.
s = √( Σ(Xᵢ − X̄)² / (n − 1) ), the sample standard deviation, which estimates σ; the n − 1 denominator corrects the small downward bias from measuring deviations around X̄ rather than around the unknown μ.
SE = s/√n, the estimated standard error of the mean.
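As a sketch in numpy (the five scores are invented purely for illustration):

```python
# The three sample formulas: mean, SD (n - 1 denominator), and estimated SE.
import numpy as np

x = np.array([4.1, 5.3, 4.8, 6.0, 5.1])   # hypothetical sample of n = 5 scores
n = x.size
x_bar = x.mean()                 # sample mean: estimates mu
s = x.std(ddof=1)                # sample SD with n - 1 denominator: estimates sigma
se = s / np.sqrt(n)              # estimated standard error of the mean
print(f"mean = {x_bar:.2f}, s = {s:.2f}, SE = {se:.2f}")
```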
Forward pointer. Once s replaces σ, SE is itself an estimate and carries uncertainty of its own. For small n this is why inference about means uses Student's t rather than the normal — a point we return to when we build confidence intervals.
This √n shrinkage is not new — it already appeared in Section 1, where doubling the sample size shrank the expected distance from μ by a factor of √2 rather than halving it outright. The question we deferred there was why the denominator is √n and not n itself.
Averaging cancels noise: observations above μ offset those below, and the more observations, the more thoroughly the errors cancel. But the cancellation is only partial — a lucky high observation cannot exactly undo an unlucky low one. The mathematics of variance makes this precise: for independent observations, Var(X̄) = Var(X₁ + … + Xₙ)/n² = nσ²/n² = σ²/n, so the standard deviation of the mean is σ/√n. The variance of a sum grows in proportion to n, so the spread of the average shrinks only as √n. The practical consequence: quadrupling the sample halves the SE, rather than quartering it. Gains in precision are real but get expensive — the statistical reason behind the familiar lament that a study with twice the data is not twice as informative.
With SE in hand we have the width of the distribution of sample means — a number on a scale. The CLT now goes one step further and tells us its shape. Imagine repeating an experiment many times, each time drawing a fresh sample of size n and computing its mean. You would get a different mean each time. The CLT makes a remarkable claim about the distribution of those means: for large enough n, it is approximately N(μ, σ²/n).
| Symbol | Meaning |
|---|---|
| N(μ, σ²/n) | A normal (Gaussian) distribution — the bell curve — centred on the true mean μ, with spread determined by σ²/n. |
| σ²/n | The variance of the sampling distribution of means (SE squared). This shrinks as n grows — more observations per sample → less uncertainty about where the mean sits. |
| σ/√n | The standard error of the mean — already met above. The square root of σ²/n; the typical distance of a sample mean from μ. |
One caveat matters here. The CLT is a theorem specifically about the behaviour of sums and averages of random variables. It does not say that raw individual observations will become normally distributed; they won't. A reaction time, a number of errors, an item response — these may be skewed or irregular no matter how many you collect.
What the CLT tells you is that if you take n such observations and compute their sum (or, equivalently, their average), the result will be approximately normally distributed, and the approximation improves as n grows. In its classical form the CLT is a theorem about sums of independent, identically distributed random variables; because the mean is just the sum divided by n, the same result carries across to averages. This is why the Gaussian distribution is central to statistical inference: we are almost always working with sums, means, or totals, not raw individual values.
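This is straightforward to demonstrate by simulation: draw many samples from a heavily skewed population and look at the distribution of their means. A sketch assuming an exponential population (an arbitrary choice; any finite-variance distribution behaves the same way):

```python
# CLT demo: skewed raw observations, approximately Gaussian sample means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n, k = 30, 10_000                                   # n observations per sample, k samples
samples = rng.exponential(scale=1.0, size=(k, n))   # raw data: right-skewed (skewness ~2)
means = samples.mean(axis=1)                        # one mean per simulated experiment

print(f"skew of raw observations: {stats.skew(samples.ravel()):.2f}")  # ~2
print(f"skew of sample means:     {stats.skew(means):.2f}")            # ~2/sqrt(n), much nearer 0
print(f"SD of means: {means.std(ddof=1):.3f} vs sigma/sqrt(n) = {1 / np.sqrt(n):.3f}")
```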
n (sample size per experiment) sets the width of the right panel — the sampling distribution of single-experiment means has spread σ/√n. Raise n and that histogram narrows: each individual experiment's mean estimate is sharper.
k (number of experiments) does two things. It reveals the shape of that sampling distribution — as k grows the histogram fills in and its Gaussian form becomes visible (CLT). And it sharpens our pooled estimate of μ: the Current mean estimate uses all k·n observations, so its uncertainty shrinks like σ/√(k·n). That is the SE reported in the chip, and that is why it keeps decreasing.
Reading the chips above: Current mean estimate — running average of the sample means. Converges on μ = 1.0 as k grows.
Population SD (σ) — spread of raw observations pooled across experiments. Wobbly at first, then stabilises at the true σ = 1.0 as observations accumulate; does not shrink further. This is variability in the world — a property of the population, not an inference error.
SE of μ estimate — σ̂/√(k·n), the uncertainty in the Current mean estimate. Steadily decreases as k grows. This is inference error — it shrinks as we collect more data. Note this is not the visual width of the right panel (which stays at σ/√n); it is the SE of our pooled μ estimate.
A last observation to take with you: at n = 1, the right panel mirrors the left — no averaging has happened, so the "distribution of means" is just the distribution of raw observations. Raise n and the histogram of means becomes symmetric and bell-shaped even though the raw population stays skewed. The orange curve — the Gaussian predicted by the CLT — is the shape it converges to.
Under broad conditions, the CLT tells us that the sampling distribution of the mean is approximately normal, centred on the true value μ. That is mathematically elegant, but on its own it does not tell us what to do with a single experiment. The confidence interval is the inferential payoff — the tool that turns a sampling distribution into a statement about where the true value might be. Understanding how it is constructed, and what it really says, exposes an interesting philosophical corner of frequentist inference that most textbooks glide over.
You have run one experiment. You have one sample mean and one sample standard deviation. Somewhere out there is a true population mean μ you will never observe. The CLT says that if you could repeat the experiment many times, the resulting sample means would cluster around μ in a normal distribution with spread σ/√n. But you only ran the experiment once. You do not know σ. You do not know μ. You have one number.
Because μ and σ are unknown, the confidence interval is built by substituting the sample estimates into the CLT's formula: the sample mean X̄ stands in for μ, and the sample standard deviation s stands in for σ. Substituting s for σ adds extra uncertainty — we are using an estimate inside an estimate — so instead of reading widths off the normal distribution, we read them off the t distribution with n−1 degrees of freedom, which is a little wider to account for that extra uncertainty. The middle 95% of that t distribution gives us the interval: X̄ ± tₙ₋₁, .025 × s/√n.
| Symbol | Meaning |
|---|---|
| X̄ | Your observed sample mean — treated, for the purposes of constructing the interval, as if it were the population mean. |
| s/√n | The estimated standard error of the mean — computed by substituting the sample SD (s) for the population SD (σ), which is unknown. |
| tₙ₋₁, .025 | The critical t-value capturing the middle 95% of a t-distribution with n−1 degrees of freedom. Approximately 2.78 at n=5, 2.06 at n=25, 1.98 at n=100 — converging on the familiar z-value of 1.96 for very large samples. |
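Putting those pieces together, a sketch of the construction in scipy (the five scores are hypothetical; stats.t.ppf supplies the critical value from the table above):

```python
# 95% CI for the mean of one small sample, using the t distribution.
import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 4.8, 6.0, 5.1])      # hypothetical single experiment, n = 5
n, x_bar, s = x.size, x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)         # ~2.78 at n = 5, as noted above
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI: [{x_bar - half_width:.2f}, {x_bar + half_width:.2f}]")
```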
The substitution is what lets us compute an interval from a single experiment — but it also constrains how we are allowed to interpret it. Because we plugged in estimates rather than known values, we do not get to conclude "95% probability μ is in this interval." Instead we fall back on the one thing the CLT really does let us say — a statement about what would happen across many hypothetical repetitions of the study.
A 95% confidence interval is defined by the procedure used to compute it, not by the single interval in front of you. If the study were repeated many times, with a fresh sample each time, 95% of the intervals constructed by this procedure would contain the true value μ. The other 5% would not. You cannot tell which kind yours is.
Remember that we have set our confidence interval (CI) at 95%, which is to say that about 5% (or roughly 1 in 20) of our sampled experiments will generate a CI that does not contain the true mean (μ). Look at the plot above once it has 100 or so experiments on it. Typically about 5 of those intervals will miss μ; each individual interval either contains μ or it does not. There is no probability in between — the truth is fixed, the interval is fixed, and the relationship between them is either "contains" or "misses." Saying "there is a 95% probability that this interval contains μ" is, under the frequentist framework, saying something that is either 0 or 1 — not 0.95. The 95% lives in the procedure, not in any individual interval.
The numerical bounds of a single CI — the 0 and the 0.66, the 4.2 and the 5.8 — do not tell us, for this particular study, where μ probably lies. They are the output of a procedure that, across repeated studies, would capture μ 95% of the time. A wider CI from a smaller sample and a narrower CI from a larger sample have the same coverage rate; they differ in precision, not in probability of being correct.
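The coverage claim itself can be checked by brute force. A sketch assuming a normal population with μ = 1 and σ = 1 (the specific values are arbitrary):

```python
# How often does the 95% t-interval contain the true mu? Should be ~0.95.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
mu, sigma, n, reps = 1.0, 1.0, 20, 10_000
t_crit = stats.t.ppf(0.975, df=n - 1)

hits = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    half = t_crit * x.std(ddof=1) / np.sqrt(n)
    hits += (x.mean() - half <= mu <= x.mean() + half)   # hit or miss: binary
print(f"coverage: {hits / reps:.3f}")   # close to 0.950; ~5% of intervals miss
```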
If that feels restrictive, it is — and it reflects how frequentist statistics was originally set up. In this tradition, probabilities are attached to procedures, not to particular parameter values. A different tradition — Bayesian statistics — attaches probabilities directly to parameters, and so can support statements like "there is an 85% probability the true effect lies between .2 and .5". The Bayesian credible interval looks superficially like a CI, but carries exactly the interpretation readers often reach for and that the frequentist CI cannot. We return to this alternative in Part 2 (Bayesian LMMs).
It is tempting to read all of this as a bookkeeping technicality — to note the unusual interpretation and then revert, in practice, to treating the CI as "probably where μ is." But the weirdness is pointing at something real. The frequentist framework is committed to an austere ontology: probability is a long-run frequency property of a repeatable procedure, not a degree of belief about a fixed-but-unknown quantity. Under that commitment, μ is not a random variable. The interval is. And the only probability statement you get to make is about the behaviour of the procedure — not about the truth.
This forces two things to be binary at the same time, and it is worth keeping them distinct. The procedural assumptions — that the sampling distribution of the mean really is approximately normal, that observations are independent, that the sample SD is a reasonable substitute for σ — either hold for the population and design you drew from, or they do not. There is no partial credit; you were sampling from the right kind of world, or you were not. Separately, the realisation on your single run is also binary: μ is either inside the interval you just computed, or it is outside. The 95% does not apply to this interval. It applies to the long-run behaviour of the procedure across hypothetical repetitions.
A direct consequence follows, and it is worth stating plainly: under this framework, when you are wrong, you are totally wrong. The interval either contains μ or it does not. There is no "close to μ" partial credit for a near miss, no sliding scale of being wrong by a little. The procedure has an advertised coverage rate of 95%, but the output on any particular run is one of two things — a hit or a miss — and you have no way of telling which kind yours is without knowing μ, which by definition you do not. You ran a procedure whose long-run behaviour is well-characterised; you received one draw from that long-run behaviour; and the event on that draw is binary. This is part of why the Bayesian alternative, which attaches probability directly to μ and so admits degrees of being close, feels so natural to people the first time they meet it.
The substitute-then-adjust structure is the price of inference without priors. You construct the interval by plugging sample estimates into the CLT's formula (otherwise you have nothing to work with), and you interpret it under the rule that those estimates were a calculation device, not a claim of equality with the population values. The resulting interval is useful and principled — but its formal frequentist interpretation is narrower than the intuitive, Bayesian-flavoured reading many people reach for on first encounter.
As foreshadowed in §01, the logic of LLN and CLT applies at two distinct levels in a typical study. Both are applications of the same mathematical idea; both carry the same assumptions; and both can fail in the same kinds of ways.
Consider a 30-item questionnaire measuring trait anxiety. Each item is a noisy measure of the participant's true anxiety level — it captures the construct imperfectly and also picks up item-specific variance plus random response error. When we average across those 30 responses, the LLN does its work: to the extent that item-specific noise is independent across items, it cancels. What remains is predominantly the shared signal — the person's underlying trait level.
Psychometric researchers are naturally concerned with whether this averaging has actually worked — that is, whether the mean score we generate for each individual represents something close to their true trait level, or has been overwhelmed by item-specific noise. This concern has a name in measurement theory: reliability — the degree to which an observed score reflects the underlying trait rather than measurement error. A low-reliability measure is dominated by error; a high-reliability measure is one in which the averaging has done its work and the score is a stable estimate of the trait.
Cronbach's α is best thought of as one index of internal consistency, computed from the pattern of inter-item covariances. Under restrictive assumptions — essentially tau-equivalence (items measure the same construct with equal loadings) and uncorrelated errors — α can be interpreted as a lower bound on reliability in the classical-test-theory sense. Its connection to the LLN is direct: α increases as the number of items k increases, following the same logic as the LLN. Add more items and the aggregate becomes a more stable summary of whatever the items are jointly tapping — just as adding more coin flips brings the running mean closer to 0.5. A fuller treatment of α, its assumptions, and the alternatives (ω, composite reliability, factor-based indices) is the subject of a later website.
It is important not to misread what a high α tells you, however. A high α does not necessarily mean that the items are strongly intercorrelated. Because α is sensitive to the number of items, a scale with a large number of items can achieve a high α even when the average inter-item correlation is quite low. What a high α does tell you — and this is the LLN point — is that with enough items, the aggregate average is a reasonably stable summary of whatever signal the items share, whatever that signal happens to be. Whether it is the right signal, and whether all items contribute equally to it, are separate questions.
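The k-dependence is easiest to see in the standardized form of α, which depends only on the number of items k and the average inter-item correlation r̄: α = k·r̄ / (1 + (k − 1)·r̄). A sketch with a deliberately low, hypothetical r̄ of 0.15:

```python
# Standardized Cronbach's alpha from item count k and mean inter-item correlation.
def standardized_alpha(k: int, r_bar: float) -> float:
    return k * r_bar / (1 + (k - 1) * r_bar)

for k in (5, 10, 20, 40):
    print(f"k = {k:>2} items, r_bar = 0.15  ->  alpha = {standardized_alpha(k, 0.15):.2f}")
# alpha climbs from ~0.47 to ~0.88 while the items stay only weakly correlated.
```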
High reliability + poor construct validity = consistent measurement of the wrong thing.
A scale can be internally consistent — items agreeing with each other, α high, LLN doing its job — while systematically measuring something other than the intended construct. The averaging is working; the target is wrong. Reliability tells you the score is stable. It says nothing about whether the stable thing is the thing you care about. Construct validity is a separate question that reliability evidence alone cannot answer.
Although the LLN gives us some sense that the observed individual mean is probably a reasonable proxy for the individual's true score, there are a bunch of assumptions at play under the hood. Three are worth touching on.
First, that errors across items are uncorrelated and random — the key assumption of the LLN. For psychometric measures this may not be true: similar wording, shared response formats, or carry-over effects from one item to the next can all introduce correlated error, at which point the cancelling logic of the LLN no longer holds.
Second, that all items contribute to the observed estimate of a participant's ability equally. This is called tau-equivalence, and we return to it in a later discussion. If some items are much stronger or weaker indicators than others, the averaging is not cleanly estimating one latent quantity.
Third, that all the numbers we are combining represent a single construct — for instance, the participant's singular latent ability on whatever is being tested. If this is wrong, we are combining apples and pears and thinking we have only apples, when in fact we have a fruit salad.
This is one reason why more flexible approaches — which we preview at the end of this page and explore fully in Part 2 — are sometimes necessary.
Now we have N person-level scores, each treated as a single observation drawn from the population of trait levels. As we saw in section 2, when we compute the group mean across N people, we are computing a sample mean of N draws from some population distribution — and the remarkable power of the CLT is that, across hypothetical repetitions, the distribution of that group mean is approximately normal (Gaussian).
And to remind you: so long as the earlier assumptions we discussed hold, the CLT now applies. Regardless of what the population distribution of individual scores looks like — even if individual anxiety scores are skewed, bimodal, or non-normal — the distribution of sample means will be approximately normal, centred on the true population mean μ, with standard error σ/√N. This is not an assumption about the data; it is a mathematical consequence of independence and finite variance.
This is what makes t-tests, ANOVAs, and confidence intervals legitimate. We are not assuming individual data points are Gaussian. We are relying on the CLT's guarantee that means of sufficiently large, independent samples will be approximately Gaussian. Larger N means a better approximation and more reliable inference.
Both levels of our averaging thus rest on the same foundation: independent observations, averaging, noise cancellation, and convergence toward a well-behaved normal distribution. The same mathematical engine drives reliability at Level 1 and inference at Level 2. Which means the same failure modes — non-independence (observations influence one another, so knowing one carries information about others), heterogeneity (observations are drawn from different underlying distributions rather than a single common one), and violated exchangeability (the order or grouping of observations matters, so they cannot be freely relabelled) — threaten both.
We have now established the initial motivations for why frequentist statisticians average (along with a bunch of their assumptions!), and why the means we compute from samples can be approximately normally distributed. The next question is natural: once we have group means, what do we do with them?
Suppose a researcher runs a simple two-condition experiment — say, a face recognition task under two conditions: typical and inverted faces. Fifty participants complete each condition. The researcher computes a mean accuracy for each condition: X̄₁ and X̄₂. The CLT, as we now know, assures us that each of these means is approximately normally distributed around its true condition mean μ₁ and μ₂ respectively.
The natural question any scientist will want answered is whether the observed difference X̄₁ − X̄₂ reflects a real difference in the underlying population means. But as I have tried to make clear throughout, the answer we get will always depend on how much we trust each mean estimate — how precisely it represents its true condition value (particularly given that, as we have established, we are also substituting the sample SD for the unknown population SD when we work out that precision). This is the theme of estimate fidelity or precision that runs through everything on this page: in the case of the CLT, the message is that the more observations we aggregate, the sharper our estimates come into focus, and the more confidently we can judge whether an apparent difference between them is genuine.
The t-statistic captures this idea of precision directly. The standard error of each mean — σ/√N, from the CLT — tells us how sharply that mean estimate is in focus. A small standard error means the estimate is tight: with high N, we can be confident the observed mean is close to the true condition mean. And when both condition means are estimated precisely, so is the difference between them, which is why, with low SEs, even a very small observed difference may be informative. The t-statistic is simply this ratio of difference to precision: t = (X̄₁ − X̄₂) / SEdiff.
| Symbol | Meaning |
|---|---|
| X̄₁ − X̄₂ | The observed difference between sample means. This is what we want to interpret — but its meaning depends on how precisely each mean is known. |
| SEdiff | The standard error of the difference: SEdiff = √(SE₁² + SE₂²), derived from σ/√N for each group. This quantifies the precision of the observed difference: a small SE means both means are in sharp focus, and thus the difference between them is equally sharp; a large SE means the means are blurry and the difference could shift substantially with different samples. |
| t | The ratio of the observed difference to the precision of that estimate. A large t means the observed difference is large relative to the uncertainty in our mean estimates. Under the null hypothesis that μ₁ = μ₂, this ratio follows a known t-distribution, giving us a p-value. Note the ratio form: it captures the key idea of signal (the observed difference) over noise (the uncertainty blurring it). The larger the signal relative to the noise, the larger the t value, and the more confident we are that the estimated difference reflects the true difference. Remember this idea of ratios — it comes up a lot! |
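As a sketch of the computation (the accuracy numbers are invented; scipy's ttest_ind forms exactly this ratio of difference to SEdiff):

```python
# Two-condition comparison: face recognition accuracy, typical vs inverted.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
typical  = rng.normal(0.90, 0.05, size=50)   # hypothetical accuracies, 50 per condition
inverted = rng.normal(0.75, 0.05, size=50)

t, p = stats.ttest_ind(typical, inverted)    # difference of means / SE of the difference
print(f"t = {t:.2f}, p = {p:.2g}")
```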
The CLT runs through every part of this. SEdiff is derived from σ/√N — precisely the standard error the CLT gives us for each sample mean. As N grows, the CLT makes our mean estimates sharper; the SE shrinks; and any observed difference between conditions comes into correspondingly clearer focus. The t-distribution itself depends on the normality of the sample means — which the CLT justifies. Without the CLT, the t-test has no foundation.
If there are more than two groups — say, three face conditions — pairwise t-tests become unwieldy and inflate the risk of false positives. Analysis of Variance (ANOVA) generalises the t-test logic to any number of groups simultaneously.
Instead of comparing two means directly, ANOVA partitions the total variance in the data into two components and examines their ratio:
| Component | Meaning |
|---|---|
| Between-groups variance | How much the group means vary from the overall grand mean. If conditions have different true means (a real effect), this will be large. |
| Within-groups variance | How much individuals vary within each condition — the baseline "noise" level against which we judge the between-group signal. |
| F | Remember the ratios mentioned earlier, capturing the idea of signal over noise? The same principle applies here: F is the ratio of between-groups variance to within-groups variance. When F is large, the differences between group means are large relative to within-group noise — evidence for a real effect. Like the t-statistic, F follows a known distribution under the null hypothesis, allowing a p-value to be calculated. |
The same fidelity logic underpins ANOVA. Each group mean is a CLT-guaranteed estimate of the true condition mean — sharper when N is large, blurrier when it is small. The F-ratio asks whether the differences between those group means are large relative to the precision with which each is known. The F-distribution, used to calculate p-values, is derived from the same normality of sample means that the CLT provides. ANOVA is t-test logic, generalised to multiple groups.
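A sketch of the three-condition case (again with invented accuracy data; stats.f_oneway computes the between/within ratio described above):

```python
# One-way ANOVA across three hypothetical face conditions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
cond1 = rng.normal(0.90, 0.05, size=50)
cond2 = rng.normal(0.85, 0.05, size=50)
cond3 = rng.normal(0.75, 0.05, size=50)

F, p = stats.f_oneway(cond1, cond2, cond3)   # between-groups vs within-groups variance
print(f"F = {F:.1f}, p = {p:.2g}")
```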
We can now trace the complete logical sequence from raw observation to statistical inference: independent observations → averaging cancels noise (LLN) → sample means are approximately normal with spread σ/√n (CLT) → the SE quantifies the precision of each mean estimate → t and F compare observed differences against that precision.
Every step depends on the one before. The t-test requires normally distributed means. Normally distributed means require the CLT. The CLT applies to averages. And averaging cancels noise only when the LLN conditions — independence, identical distribution — are satisfied. The chain is only as strong as its weakest link.
To reiterate then, this chain of assumptions holds when: (a) observations within a person or condition are independent and exchangeable (i.e., noise is random and 'cancels out'); (b) sample sizes are large enough for the CLT approximation to be good (see simulation 2); (c) the items or trials we average over really are measuring the same thing with similar precision (see discussion of Cronbach).
When these conditions fail — very small samples, clustered or repeated-measures data, heterogeneous item quality, constructs that are genuinely multidimensional — the neat chain from LLN to valid inference breaks down. In Part 2, we examine what happens then, and how more flexible modelling frameworks (Linear Mixed Models; Item Response Theory) respond.
Alongside the observations of the LLN and the CLT sits a wider psychometric philosophy called classical test theory (CTT), which I will discuss in a later website. For now it is worth highlighting another key underlying assumption. CTT models every observed score as X = T + E — a true score plus error — and bundles all imprecision into that single term E. This assumption ran implicitly throughout this web-page: measurement error is out there, but you can try to deal with it through repeated sampling. In practice, however, "error" is not a single undifferentiated quantity. The deviation between an observed score and some true value can come from many sources simultaneously:
Item variance — some items are harder, more ambiguous, or tap slightly different sub-facets of the construct.
Person variance — individuals differ in their baseline levels and in how consistently they respond.
Session variance — performance fluctuates across time of day, fatigue, motivation.
Context variance — the experimenter, the setting, the order of conditions all introduce systematic differences that CTT folds silently into E.
When error comes from multiple structured sources, rolling them into a single E is not just imprecise — it is actively misleading. Some of what CTT calls "error" is actually systematic and predictable variation. A model that knows where the variance comes from can separate these sources, producing sharper estimates at each level. This is the core motivation for Linear Mixed Models (which partition person, item, and residual variance) and for Generalizability Theory (which formally estimates variance components from each source).
The same logic applies to the inference chain built in this page. When the LLN averages across items or across people, it cancels random error. But if some of the apparent error is actually structured — the same items always harder, the same people always more variable — averaging doesn't cancel it; it obscures it. The error term in CTT is not a scientific description of noise; it is a confession that the model does not know where the noise came from. What is ironic is that researchers often do know about the sources of this error a priori — they know which participants contributed which scores, which things were repeated and which were not, which items a given mean was derived from, and so on. So why not specify those details in your statistical model? This will be a major theme in the argument for stepping away from ANOVA in the first place!