Moving beyond p < .05 — how big is your effect, and how many participants do you actually need?
An effect size is a standardised, unit-free measure of how big an effect is — independent of how many participants you collected. It answers the question that a p-value cannot: not just is there an effect?, but how meaningful is it?
There is no single effect size statistic. Which one you use depends on the kind of research question you are asking. There are three main families, each asking the "how big?" question in a different way:
Today we focus on mean difference effect sizes — specifically Cohen's d and its variants — because this is what you need for the experimental two-group designs at the heart of most lab studies. When you read a study and want to replicate it, d is typically what you are working from.
The p-value tells you the probability of observing a result as extreme as yours, assuming the null hypothesis is true. Cross the threshold of .05 and researchers declare "significance" — but whether you cross that threshold is partly just a function of how many participants you collected.
The widget below fixes the true effect at d = 0.2 (a small effect by conventional standards). Watch what happens to the p-value as you increase the sample size. The underlying effect does not change at all.
At small N, p is large and the effect is "non-significant." At large N, p is tiny and the effect is "highly significant." Yet the thing being measured — the true effect — is the same d = 0.2 throughout. Significance is about precision of estimation, not about the size of the phenomenon.
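If you want to recreate the widget's behaviour yourself, here is a minimal simulation sketch (one random run per sample size, so the exact p-values will bounce around, but the trend is the point):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d = 0.2  # the fixed true effect, as in the widget

for n in [20, 80, 320, 1280]:
    control = rng.normal(0.0, 1.0, size=n)       # SD = 1, so the gap between means IS d
    treatment = rng.normal(true_d, 1.0, size=n)
    result = stats.ttest_ind(treatment, control)
    print(f"n per group = {n:4d}   t = {result.statistic:5.2f}   p = {result.pvalue:.4f}")
```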
Cohen's d expresses the mean difference between two groups in standard deviation units. It is structurally identical to a z-score — instead of asking "how many SDs above the mean is this single observation?", we ask "how many SDs apart are these two group means?"
The result is unit-free. Whether you measured reaction times in milliseconds or exam scores as percentages, d = 0.5 means the same thing: the two group means are half a standard deviation apart. This makes effects comparable across different studies and different measurement scales.
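A minimal sketch of the computation, using the pooled-SD version of d (the reaction-time numbers are made up for illustration); note that rescaling the units leaves d untouched:

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    g1, g2 = np.asarray(group1, dtype=float), np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

rt_a = np.array([510., 545., 498., 530., 522.])   # condition A, milliseconds
rt_b = np.array([560., 575., 541., 590., 566.])   # condition B, milliseconds

print(cohens_d(rt_b, rt_a))                # d computed on milliseconds...
print(cohens_d(rt_b / 1000, rt_a / 1000))  # ...is identical after converting to seconds
```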
A natural way to visualise d is as two overlapping distributions. When d = 0 the groups are completely indistinguishable. As d increases, the distributions separate and the overlap shrinks.
Cohen's U3 is a useful way to explain d to non-statisticians: it is the percentage of Group 1 scores that fall below the mean of Group 2. At d = 1.0, 84% of one group scores below the average of the other — a substantial separation. At d = 0.2, only 58% do — barely better than chance.
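Under the usual equal-variance normal model, U3 is simply the normal CDF evaluated at d, which is where the 84% and 58% figures come from:

```python
from scipy.stats import norm

# Cohen's U3: proportion of the lower-scoring group falling below the other group's mean
for d in [0.2, 0.5, 0.8, 1.0]:
    print(f"d = {d:.1f}   U3 = {norm.cdf(d):.0%}")
```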
Cohen's d divides the raw mean difference by a standard deviation. But which standard deviation? This turns out to be a genuine methodological question with no single correct answer. Different choices produce different variants of d, each making different assumptions about what the "reference" variability in your population should be.
The honest conclusion: all are defensible compromises. The right choice depends on your data and your assumptions — and the differences between them are smallest when the two group SDs are similar.
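As a sketch of what these choices look like in practice, here are three commonly used variants (pooled-SD Cohen's d, Glass's delta using only the control group's SD, and Hedges' g, a small-sample bias correction); these are illustrative examples of the family, not necessarily the exact set compared above:

```python
import numpy as np

def effect_size_variants(treatment, control):
    """Three common answers to 'which SD?' (illustrative, not exhaustive)."""
    t, c = np.asarray(treatment, float), np.asarray(control, float)
    n1, n2 = len(t), len(c)
    diff = t.mean() - c.mean()

    pooled_sd = np.sqrt(((n1 - 1) * t.var(ddof=1) + (n2 - 1) * c.var(ddof=1))
                        / (n1 + n2 - 2))
    d_pooled = diff / pooled_sd                           # classic Cohen's d
    glass_delta = diff / c.std(ddof=1)                    # Glass's delta: control-group SD only
    hedges_g = d_pooled * (1 - 3 / (4 * (n1 + n2) - 9))   # small-sample correction of d
    return d_pooled, glass_delta, hedges_g

print(effect_size_variants([23, 27, 31, 26, 28], [21, 24, 25, 22, 23]))
```

When the two group SDs are similar and samples are not tiny, the three numbers land close together, which is the point made above.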
You have read a published study. It reports Cohen's d = 0.5 for a two-group comparison. Your task is to replicate it. The critical question: how many participants do you need?
To answer this, we need to be clear about the difference between two related but very different things: the standard deviation (SD) and the standard error (SE).
The standard deviation (SD) is a property of the population. It describes how much individual scores vary around the mean, and it does not change as you collect more participants — it is a stable characteristic of what you are measuring.
The standard error (SE) is a property of your estimate. It describes how precisely your sample mean estimates the true population mean, and it shrinks as N increases — larger samples produce more stable mean estimates.
The key relationship is: SE = SD / √N
The SD gives you the raw material — the inherent variability in what you are measuring. The SE is what happens to that variability when you estimate a mean from a sample of size N. A population can have enormous spread, yet your estimate of the mean can be very precise if N is large enough. This is exactly the σ/√N machinery developed in Part 1 — repurposed here as the engine that determines whether a true effect of size d is detectable at a given sample size.
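A quick simulation check of that relationship (the population mean and SD here are arbitrary): the spread of individual scores never changes, but the spread of sample means shrinks exactly as SD / √N predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
sd = 15.0   # population SD; fixed regardless of how many participants you run

for n in [10, 100, 1000]:
    # 10,000 simulated studies, each estimating the mean from n participants
    sample_means = rng.normal(100, sd, size=(10_000, n)).mean(axis=1)
    print(f"N = {n:4d}   empirical SE = {sample_means.std(ddof=1):5.2f}   "
          f"SD/sqrt(N) = {sd / np.sqrt(n):5.2f}")
```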
This distinction matters for power analysis because it is the SE that determines how detectable your effect is. Your test statistic (t or z) divides the raw mean difference by the SE. As N increases, SE shrinks, so the same true effect d produces a larger and larger test statistic — eventually reliably clearing the significance threshold.
Power analysis works backwards from the question: given the effect I expect to find, how many participants do I need to detect it reliably? You supply three inputs and the analysis outputs the required N.
The underlying logic is geometric. You have two overlapping distributions of your test statistic. The null distribution (blue) is centred on zero — the "no effect" world. The alternative distribution (orange) is shifted by the true effect. Power (green shaded area) is the proportion of the alternative distribution that clears your critical value. Increasing N squeezes both distributions tighter, separating them. Power analysis finds the N at which the overlap drops to your acceptable threshold.
Try the calculator: set d = 0.8 (large) and compare the N to d = 0.4 (medium). Then compare d = 0.4 to d = 0.2 (small). The pattern is dramatic — halving d roughly quadruples the required N — and explains why so many underpowered studies in the literature have failed to replicate. The sketch below reproduces these numbers in code.
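If you want to reproduce the calculator offline, a short sketch using statsmodels (assuming the conventional defaults of a two-sided independent-samples t-test, alpha = .05, and 80% power):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in [0.8, 0.4, 0.2]:
    # Solve for the per-group sample size needed to reach 80% power
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       alternative='two-sided')
    print(f"d = {d:.1f}   ~{n_per_group:.0f} participants per group")
```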
With a point null hypothesis — the conventional assumption that the true effect is exactly zero — statistical significance is ultimately just a function of how long you are willing to collect data. Because SE shrinks with N, the null and alternative distributions will always eventually separate, for any true effect however trivial. Run enough participants and even a meaningless difference becomes "statistically significant."
The solution is to stretch the null — to replace the fictional point at zero with an honest region around zero that reflects what genuine scepticism or practical meaningfulness actually requires. Two frameworks do this in different ways: Reverse Bayes / AnCred does it retrospectively from the data; ROPE and Meehl's crud threshold do it prospectively before the study. Both address the same structural flaw.
The point null says: "the true effect is precisely zero." But in the real world, no two groups are ever exactly identical. Any manipulation, any group difference, any intervention will move the mean by some amount — if only by a fraction. The question is never really is the effect zero? It is is the effect large enough to matter?
Because SE = SD / √N, increasing N always shrinks the SE, making both sampling distributions narrower and pushing the null and alternative further apart relative to their spread. This means any true effect — even one too small to care about — will produce p < .05 with a sufficiently large sample. Significance becomes a measure of your patience, not of the importance of your finding.
Psychologist Paul Meehl identified this concretely with his concept of the crud factor: in psychology, almost any two groups you compare will differ on almost any measure you take, typically producing Cohen's d values around 0.2, purely because of the accumulated background correlation between all measured variables. A d of 0.2 is ambient noise — routinely observable, rarely meaningful. Yet with a couple of thousand participants per group it will reliably produce p < .001. The point null has nothing useful to say about this.
The root cause is not a flaw in how statisticians set up the test — it is a mathematical property of what a mean actually is. Understanding it removes the mystery.
When you compute a sample mean, you are adding up N scores and dividing by N. If each score is an independent observation from the same population, each one carries a fresh, non-redundant piece of information about the true mean. Because the information is independent, it accumulates — each new participant makes your estimate a little more stable. This is not a convention or a design choice; it is what happens algebraically when you average independent quantities. The formal result is that the variance of your mean estimate equals σ²/N, which gives a standard error of σ/√N.
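For reference, the same result in compact form; the middle step, where the variances of independent observations simply add, is exactly where independence does the work:

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} X_i\right)
  = \tfrac{1}{N^2}\textstyle\sum_{i=1}^{N}\operatorname{Var}(X_i)
  = \tfrac{N\sigma^2}{N^2}
  = \tfrac{\sigma^2}{N},
\qquad
\operatorname{SE}(\bar{X}) = \tfrac{\sigma}{\sqrt{N}}.
```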
So the SE shrinks by the square root of N not because of anything NHST is doing, but because averaging is inherently a precision-gaining operation whenever observations are independent. More independent observations → more stable mean → smaller SE. This would be equally true if you built a Bayesian model or a likelihood ratio test instead. It is a property of the estimator, not the inference framework built on top of it.
Imagine estimating the true weight of a coin by flipping it and counting heads. With 4 flips you might get 3 heads just by luck — your estimate of P(heads) = 0.75 is unstable. With 400 flips, extreme runs of luck get "outvoted" by the mass of typical outcomes — your estimate converges reliably on 0.5. Your estimate didn't get better because of the statistical test you chose. It got better because you averaged more independent coin flips. That is SE shrinking with N, in its simplest form.
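The same idea as a few lines of simulation (the repeat count and seed are arbitrary): the spread of the estimates of P(heads) collapses as the number of flips grows.

```python
import numpy as np

rng = np.random.default_rng(3)
p_heads = 0.5  # true probability of heads for a fair coin

for n_flips in [4, 400, 40_000]:
    # 10,000 repeats of "flip the coin n_flips times and estimate P(heads)"
    estimates = rng.binomial(n_flips, p_heads, size=10_000) / n_flips
    print(f"{n_flips:6d} flips   spread of estimates (SD) = {estimates.std():.3f}")
```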
The key word throughout is independent. If your observations are not independent — say, you measured the same participant ten times and counted each measurement separately — the SE does not shrink at 1/√N, because repeat measurements from the same person share information rather than adding fresh information. This is why clustered data and repeated measures designs require more sophisticated models that account for the dependency structure: the Linear Mixed Models we built up in Part 2.
Now here is where the problem enters. The t-statistic that NHST computes is simply the observed mean difference divided by its standard error: t = (M₁ − M₂) / SE.
Because SE appears in the denominator, and SE shrinks toward zero as N grows, the t-statistic grows toward infinity for any nonzero true effect — no matter how small that effect is. And because p is a function of t, p must eventually cross .05 for any true effect whatsoever, given a large enough sample. There is no escaping this: it is built into the algebra of averaging.
This is not a bug in the statistics — the estimator is doing exactly what it should, becoming more and more precise. The bug is in the interpretation: treating "the t-statistic cleared the threshold" as equivalent to "the effect is meaningfully large." Those are completely different claims, and the mathematics guarantees they will eventually come apart for any small but nonzero true effect collected with sufficient patience.
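You can watch the algebra play out with a back-of-the-envelope calculation (d = 0.05 is an arbitrarily chosen "trivially small" effect; SD is set to 1 for simplicity, so the mean difference equals d):

```python
import numpy as np

d = 0.05  # a trivially small but nonzero true effect
for n in [100, 1_000, 10_000, 100_000]:
    se = np.sqrt(2 / n)      # SE of a two-group mean difference with SD = 1, n per group
    expected_t = d / se      # the t-statistic we expect on average at this sample size
    verdict = "crosses 1.96" if expected_t > 1.96 else "below 1.96"
    print(f"n per group = {n:7d}   expected t = {expected_t:6.2f}   {verdict}")
```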
Compute from the observed CI what the null region must be to make the result credible. Then ask whether that region is defensible given your domain knowledge. The Scepticism Limit (SL) is the boundary of that region.
Define the null region before the study based on subject-matter knowledge — e.g., "anything below d = 0.2 is crud and doesn't count." Then require significance to clear that expanded target. Forces a more honest answer without seeing the data first.
AnCred (Analysis of Credibility), developed by statistician Robert Matthews, computes the SL directly from your confidence interval. The sceptic maintains a prior distribution centred on zero — they are specifically doubtful that any effect exists. The SL answers: how wide does that distribution need to be before your data can overcome their scepticism?
A wide sceptical prior is actually a weak form of scepticism — a sceptic who concedes that effects up to 40 points are possible is easy to convince. The genuinely demanding sceptic has a tight prior around zero. A low SL means your data defeated even that hard opponent.
The SL is expressed in your outcome's own units — points, milliseconds, whatever you measured — so you can immediately judge whether the implied sceptical region is defensible in your field. This is the key advantage over a p-value, which tells you nothing about the scale of doubt involved.
The verdict rule is simple:
When your result is non-significant — the CI straddles zero — the SL cannot be computed. But there is an equally important question for that situation: just because you didn't reach significance, does that mean the effect is absent?
The Advocacy Limit answers: what is the maximum effect size that even a committed advocate of the hypothesis can claim the data supports? A large AL means the study was simply too weak to say much either way — the absence of significance is not evidence of absence.
The most instructive example is the ORBITA trial, which tested whether stenting improved exercise duration in patients with stable angina. The result was non-significant (p ≈ .2), and many commentators concluded that stents don't work. But the AL was approximately 115 seconds — meaning the data were perfectly consistent with a clinically meaningful positive effect. The trial was underpowered, not decisive. The appropriate conclusion was "we don't yet know," not "no effect." This maps directly onto the small-N paradox in power analysis: a non-significant result from an underpowered study is an uninformative result, not evidence of zero effect. The AL makes that explicit and quantitative.
The SL framework uses the word prior differently from classical Bayesian analysis.
Counterintuitively, a flat prior is easy to defeat — it already spreads probability across large effects — whereas the genuinely demanding sceptic holds a tight prior around zero, so a low SL means your data defeated that hard opponent. This also clarifies why NHST's point null is a false form of scepticism: it looks maximally demanding (all mass at exactly zero) but is trivially defeated with large N. The SL models scepticism as a region — which is how real scientific doubt actually operates.