Statistical Foundations · Part 4 of 4

Effect Size & Power Analysis

Moving beyond p < .05 — how big is your effect, and how many participants do you actually need?

Overview

What is an effect size?

An effect size is a standardised, unit-free measure of how big an effect is — independent of how many participants you collected. It answers the question that a p-value cannot: not just is there an effect?, but how meaningful is it?

There is no single effect size statistic. Which one you use depends on the kind of research question you are asking. There are three main families, each asking the "how big?" question in a different way:

1. Mean difference (our focus today)
   Cohen's d, Hedges' g, Glass's Δ
   "How far apart are these two group means?"
   Used in: t-tests, experimental designs, clinical trials, lab replication studies

2. Variance explained (also important)
   r², η², ω², f²
   "How much of the outcome does my predictor account for?"
   Used in: correlation, regression, ANOVA, factor analysis

3. Categorical / risk (also important)
   Odds ratio, risk ratio, NNT
   "How does group membership change the odds of an outcome?"
   Used in: clinical trials, epidemiology, public health research

Today we focus on mean difference effect sizes — specifically Cohen's d and its variants — because this is what you need for the experimental two-group designs at the heart of most lab studies. When you read a study and want to replicate it, d is typically what you are working from.

Why not just use p-values? A p-value tells you whether an effect is likely to be real (rather than a sampling fluke). An effect size tells you whether it is worth caring about. You need both. A tiny, trivial effect can generate p < .001 with a large enough sample. Effect sizes are not dependent on N — they are a property of the phenomenon itself.
Section 1

Why effect size? The problem with p < .05

The p-value tells you the probability of observing a result as extreme as yours, assuming the null hypothesis is true. Cross the threshold of .05 and researchers declare "significance" — but this threshold is partly just a function of how many participants you collected.

The widget below fixes the true effect at d = 0.2 (a small effect by conventional standards). Watch what happens to the p-value as you increase the sample size. The underlying effect does not change at all.

[Interactive widget: The p-value problem — same effect, different N. Cohen's d is fixed at 0.20; a sample-size slider (default N = 40) updates the t-statistic, the two-tailed p-value, and the significance verdict.]

At small N, p is large and the effect is "non-significant." At large N, p is tiny and the effect is "highly significant." Yet the thing being measured — the true effect — is the same d = 0.2 throughout. Significance is about precision of estimation, not about the size of the phenomenon.
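This behaviour is easy to reproduce in a few lines. Below is a minimal Python sketch (standard library only) using a normal approximation to the two-sample t-test, so the p-values at small N are slightly optimistic; the function name is illustrative, not from any particular package.

```python
from math import sqrt
from statistics import NormalDist

def p_two_tailed(d, n_per_group):
    """Approximate two-tailed p-value for a two-group comparison with
    true standardised effect d and n participants per group.
    Uses the normal approximation to the t-test."""
    z = d * sqrt(n_per_group / 2)          # test statistic implied by d and N
    return 2 * (1 - NormalDist().cdf(z))   # two-tailed tail area

# Same true effect (d = 0.2), four different sample sizes
for n in (40, 100, 400, 1000):
    print(f"n = {n:4d} per group  ->  p = {p_two_tailed(0.2, n):.5f}")
```

At n = 40 per group the effect is comfortably "non-significant" (p ≈ .37); by n = 1000 it is "highly significant" (p < .001), with d untouched throughout.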

Cohen's benchmarks — use with caution. Jacob Cohen (1988) offered rough labels: d = 0.2 (small), d = 0.5 (medium), d = 0.8 (large). These are starting points for intuition, not universal standards. A d of 0.2 might be negligible in a lab reaction time study but clinically important in a drug trial where the outcome is mortality. Always interpret effect size in context.
Section 2

Computing Cohen's d

Cohen's d expresses the mean difference between two groups in standard deviation units. It is structurally identical to a z-score — instead of asking "how many SDs above the mean is this single observation?", we ask "how many SDs apart are these two group means?"

d = (M₁ − M₂) / SDpooled

The result is unit-free. Whether you measured reaction times in milliseconds or exam scores as percentages, d = 0.5 means the same thing: the two group means are half a standard deviation apart. This makes effects comparable across different studies and different measurement scales.

A natural way to visualise d is as two overlapping distributions. When d = 0 the groups are completely indistinguishable. As d increases, the distributions separate and the overlap shrinks.

[Interactive calculator: Cohen's d — overlapping distributions. Sliders set the two group means and shared SD (defaults M₁ = 50, M₂ = 65, SD = 15, giving d = 1.00, large); the display shows Cohen's d, Cohen's U3 (84%), and the distribution overlap (45%), plotted as the Group 1 and Group 2 curves.]

Cohen's U3 is a useful way to explain d to non-statisticians: it is the percentage of Group 1 scores that fall below the mean of Group 2. At d = 1.0, 84% of one group scores below the average of the other — a substantial separation. At d = 0.2, only 58% do — barely better than chance.
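Both statistics take one line each from the normal CDF. A sketch assuming normal distributions and a shared SD (the calculator's simplified case); the function names are illustrative:

```python
from statistics import NormalDist

def cohens_d(m1, m2, sd):
    """Cohen's d when both groups share a common SD (the simplified
    case; with unequal SDs you would pool them first)."""
    return (m2 - m1) / sd

def u3(d):
    """Cohen's U3: proportion of Group 1 scores falling below the
    Group 2 mean, assuming normal distributions."""
    return NormalDist().cdf(d)

d = cohens_d(50, 65, 15)   # the calculator's default values
print(f"d = {d:.2f}, U3 = {u3(d):.0%}")
```

`u3(0.2)` returns about 0.58, matching the "barely better than chance" figure above.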

Section 3

The denominator problem — all are compromises

Cohen's d divides the raw mean difference by a standard deviation. But which standard deviation? This turns out to be a genuine methodological question with no single correct answer. Different choices produce different variants of d, each making different assumptions about what the "reference" variability in your population should be.

The honest conclusion: all are defensible compromises. The right choice depends on your data and your assumptions — and the differences between them are smallest when the two group SDs are similar.

Denominator comparison — same data, three answers

Scenario: Control group M = 50, SD = 10, n = 20  ·  Treatment group M = 62, SD = 16, n = 20
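The three variants can be computed side by side for the scenario above (hypothetical numbers). A Python sketch; variable names are illustrative:

```python
from math import sqrt

# Scenario above: control M = 50, SD = 10, n = 20; treatment M = 62, SD = 16, n = 20
m_c, sd_c, n_c = 50, 10, 20
m_t, sd_t, n_t = 62, 16, 20
diff = m_t - m_c  # raw mean difference: 12 points

# Cohen's d: pooled SD, weighting each group's variance by its df
sd_pooled = sqrt(((n_c - 1) * sd_c**2 + (n_t - 1) * sd_t**2) / (n_c + n_t - 2))
d = diff / sd_pooled

# Glass's Δ: control-group SD only (treats the control as the untouched baseline)
delta = diff / sd_c

# Hedges' g: Cohen's d with a small-sample bias correction
df = n_c + n_t - 2
g = d * (1 - 3 / (4 * df - 1))

print(f"Cohen's d = {d:.2f}, Glass's Δ = {delta:.2f}, Hedges' g = {g:.2f}")
```

Same 12-point difference, three answers: d ≈ 0.90, Δ = 1.20, g ≈ 0.88. The spread between them is driven entirely by the unequal group SDs.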

Practical guidance For most lab replication studies with random assignment, Cohen's d (pooled SD) is the standard default and what you will typically see reported. Glass's Δ is justified when you have a genuine untreated baseline group and believe the treatment may have altered score variability. Hedges' g is preferred with small samples (n < 20 per group) because it corrects for the slight upward bias in small-sample SD estimates. With reasonable sample sizes all three converge.
Section 4

The replication bridge — from effect size to sample size

You have read a published study. It reports Cohen's d = 0.5 for a two-group comparison. Your task is to replicate it. The critical question: how many participants do you need?

To answer this, we need to be clear about the difference between two related but very different things: the standard deviation (SD) and the standard error (SE).

Standard deviation (SD)

A property of the population. Describes how much individual scores vary around the mean. Does not change as you collect more participants — it is a stable characteristic of what you are measuring.

Standard error (SE)

A property of your estimate. Describes how precisely your sample mean estimates the true population mean. Shrinks as N increases — larger samples produce more stable mean estimates.

The key relationship is: SE = SD / √N

The SD gives you the raw material — the inherent variability in what you are measuring. The SE is what happens to that variability when you estimate a mean from a sample of size N. A population can have enormous spread, yet your estimate of the mean can be very precise if N is large enough. This is exactly the σ/√N machinery developed in Part 1 — repurposed here as the engine that determines whether a true effect of size d is detectable at a given sample size.
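The relationship is a single line of code. A sketch using the same fixed SD of 15 as the widget below; the function name is illustrative:

```python
from math import sqrt

def standard_error(sd, n):
    """SE of a sample mean: the population spread scaled down by sqrt(N)."""
    return sd / sqrt(n)

sd = 15.0  # population SD: fixed, a property of what you measure
for n in (20, 80, 320):
    print(f"N = {n:3d}: SE = {standard_error(sd, n):.2f}")
```

Each quadrupling of N halves the SE (3.35 → 1.68 → 0.84) while the SD never moves.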

[Interactive widget: SD vs SE — population spread vs. precision of your mean estimate. The population SD is fixed at 15.00; a sample-size slider (default N = 20) updates the standard error and the precision gain vs. the SD, plotting the population distribution (fixed SD) against the sampling distribution of the mean (SE shrinks with N).]

This distinction matters for power analysis because it is the SE that determines how detectable your effect is. Your test statistic (t or z) divides the raw mean difference by the SE. As N increases, SE shrinks, so the same true effect d produces a larger and larger test statistic — eventually reliably clearing the significance threshold.

Section 5

Power analysis — reverse engineering your sample size

Power analysis works backwards from the question: given the effect I expect to find, how many participants do I need to detect it reliably? You supply three inputs and the analysis outputs the required N.

d: expected effect size (from the study you are replicating)
α: acceptable false positive rate (typically .05)
1 − β: desired power, the probability of detecting the effect if it is real (typically .80)

The underlying logic is geometric. You have two overlapping distributions of your test statistic. The null distribution (blue) is centred on zero — the "no effect" world. The alternative distribution (orange) is shifted by the true effect. Power (green shaded area) is the proportion of the alternative distribution that clears your critical value. Increasing N squeezes both distributions tighter, separating them. Power analysis finds the N at which the overlap drops to your acceptable threshold.

[Interactive calculator: Power analysis. Inputs: effect size d from the study you're replicating (default 0.50), alpha level α, and desired power; at the defaults it reports 64 participants needed per group (128 total), plotting the null distribution (H₀ true) against the alternative (H₁ true), with power shaded as the area beyond the critical value under H₁ and α as the area beyond it under H₀.]
The N-scaling rule — commit this to memory. Required N scales inversely with the square of the effect size: halve d and you roughly quadruple N. This is why overestimating the effect size from a small pilot study is dangerous: the original estimate may be inflated by sampling error, and your replication will be designed for an effect that was over-optimistic. When replicating, consider using 90% power rather than 80%, or apply a modest downward correction to the published effect size to account for publication bias.

Try the calculator: set d = 0.8 (large) and compare the N to d = 0.4 (medium). Then compare d = 0.4 to d = 0.2 (small). The quadrupling effect is dramatic — and explains why so many underpowered studies in the literature have failed to replicate.
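The calculator's arithmetic can be sketched with the standard closed-form normal approximation (the exact t-based answer, as reported by tools like G*Power and by the calculator above, runs about one participant higher, e.g. 64 rather than 63 at d = 0.5). The function name is illustrative:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Required participants per group for a two-sample, two-tailed
    comparison, via the closed-form normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical z for two-tailed alpha
    z_beta = NormalDist().inv_cdf(power)           # z for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.8, 0.4, 0.2):
    print(f"d = {d}: {n_per_group(d)} per group")
```

Halving d from 0.8 to 0.4 raises the requirement from 25 to 99 per group; halving again to 0.2 raises it to 393. That is the quadrupling rule in action.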

Advanced Section

The AnCred framework — credibility beyond p < .05

The central problem

With a point null hypothesis — the conventional assumption that the true effect is exactly zero — statistical significance is ultimately just a function of how long you are willing to collect data. Because SE shrinks with N, the null and alternative distributions will always eventually separate, for any true effect however trivial. Run enough participants and even a meaningless difference becomes "statistically significant."

The solution is to stretch the null — to replace the fictional point at zero with an honest region around zero that reflects what genuine scepticism or practical meaningfulness actually requires. Two frameworks do this in different ways: Reverse Bayes / AnCred does it retrospectively from the data; ROPE and Meehl's crud threshold do it prospectively before the study. Both address the same structural flaw.

Why the point null is a straw man

The point null says: "the true effect is precisely zero." But in the real world, no two groups are ever exactly identical. Any manipulation, any group difference, any intervention will move the mean by some amount — if only by a fraction. The question is never really is the effect zero? It is is the effect large enough to matter?

Because SE = SD / √N, increasing N always shrinks the SE, which always squeezes the null and alternative distributions further apart. This means any true effect — even one too small to care about — will produce p < .05 with a sufficiently large sample. Significance becomes a measure of your patience, not the importance of your finding.

Psychologist Paul Meehl identified this concretely with his concept of the crud factor: in psychology, almost any two groups you compare will differ on almost any measure you take, typically producing Cohen's d values around 0.2, purely because of the accumulated background correlation between all measured variables. A d of 0.2 is ambient noise — routinely observable, rarely meaningful. Yet with N = 200 per group it will produce p < .05. The point null has nothing useful to say about this.

The null-stretching solution. Instead of asking "does the effect differ from precisely zero?", ask "does the effect clear a region around zero that represents either (a) what a reasonable sceptic should be willing to concede is plausible, or (b) the minimum size we'd consider practically meaningful?" Both reframings force significance to mean something real rather than just reflecting accumulated N.

Why does this happen? The mathematics of averaging

The root cause is not a flaw in how statisticians set up the test — it is a mathematical property of what a mean actually is. Understanding it removes the mystery.

When you compute a sample mean, you are adding up N scores and dividing by N. If each score is an independent observation from the same population, each one carries a fresh, non-redundant piece of information about the true mean. Because the information is independent, it accumulates — each new participant makes your estimate a little more stable. This is not a convention or a design choice; it is what happens algebraically when you average independent quantities. The formal result is that the variance of your mean estimate equals σ²/N, which gives a standard error of σ/√N.

So the SE shrinks by the square root of N not because of anything NHST is doing, but because averaging is inherently a precision-gaining operation whenever observations are independent. More independent observations → more stable mean → smaller SE. This would be equally true if you built a Bayesian model or a likelihood ratio test instead. It is a property of the estimator, not the inference framework built on top of it.

A simple way to see it

Imagine estimating the true weight of a coin by flipping it and counting heads. With 4 flips you might get 3 heads just by luck — your estimate of P(heads) = 0.75 is unstable. With 400 flips, extreme runs of luck get "outvoted" by the mass of typical outcomes — your estimate converges reliably on 0.5. Your estimate didn't get better because of the statistical test you chose. It got better because you averaged more independent coin flips. That is SE shrinking with N, in its simplest form.
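The coin-flip intuition is easy to check by simulation. A stdlib-only sketch (the function name and experiment counts are illustrative, and the numbers are approximate, as with any simulation):

```python
import random
from math import sqrt

def empirical_se(n_flips, n_experiments=10_000, p=0.5, seed=1):
    """Repeat an experiment of n_flips fair-coin flips many times and
    measure how much the estimate of P(heads) varies across repeats."""
    rng = random.Random(seed)
    estimates = [
        sum(rng.random() < p for _ in range(n_flips)) / n_flips
        for _ in range(n_experiments)
    ]
    mean = sum(estimates) / n_experiments
    return sqrt(sum((e - mean) ** 2 for e in estimates) / n_experiments)

se_4, se_400 = empirical_se(4), empirical_se(400)
print(f"SE with 4 flips: {se_4:.3f}, with 400 flips: {se_400:.3f}")
```

The first comes out near the theoretical sqrt(0.25/4) = 0.25, the second near sqrt(0.25/400) = 0.025: a 100-fold increase in flips buys a 10-fold gain in precision, exactly the 1/√N law.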

The key word throughout is independent. If your observations are not independent — say, you measured the same participant ten times and counted each measurement separately — the SE does not shrink at 1/√N, because repeat measurements from the same person share information rather than adding fresh information. This is why clustered data and repeated measures designs require more sophisticated models that account for the dependency structure: the Linear Mixed Models we built up in Part 2.

Now here is where the problem enters. The t-statistic that NHST computes is simply:

t = mean difference / SE   =   mean difference / (SD / √N)

Because SE appears in the denominator, and SE shrinks toward zero as N grows, the t-statistic grows toward infinity for any nonzero true effect — no matter how small that effect is. And because p is a function of t, p must eventually cross .05 for any true effect whatsoever, given a large enough sample. There is no escaping this: it is built into the algebra of averaging.

This is not a bug in the statistics — the estimator is doing exactly what it should, becoming more and more precise. The bug is in the interpretation: treating "the t-statistic cleared the threshold" as equivalent to "the effect is meaningfully large." Those are completely different claims, and the mathematics guarantees they will eventually come apart for any small but nonzero true effect collected with sufficient patience.

The punchline for your studies. When you read a study with a tiny effect size and a very low p-value, ask: is this significant because the effect is large, or because the sample was enormous? The p-value cannot tell you. Effect size partially helps — but as the AnCred framework below shows, effect size alone is not enough either, because a large effect from a small noisy study can be just as misleading. What you need is a way to evaluate both the size of the effect and the precision with which it was estimated — simultaneously. That is exactly what the Scepticism Limit does.
Route 1 — Retrospective (data-driven)
Reverse Bayes / AnCred (Matthews)

Compute from the observed CI what the null region must be to make the result credible. Then ask whether that region is defensible given your domain knowledge. The Scepticism Limit (SL) is the boundary of that region.

Route 2 — Prospective (theory-driven)
ROPE / Meehl's crud threshold

Define the null region before the study based on subject-matter knowledge — e.g., "anything below d = 0.2 is crud and doesn't count." Then require significance to clear that expanded target. Forces a more honest answer without seeing the data first.

The Scepticism Limit (SL) — how Reverse Bayes works

AnCred (Analysis of Credibility), developed by statistician Robert Matthews, computes the SL directly from your confidence interval. The sceptic maintains a prior distribution centred on zero — they are specifically doubtful that any effect exists. The SL answers: how wide does that distribution need to be before your data can overcome their scepticism?

A wide sceptical prior is actually a weak form of scepticism — a sceptic who concedes that effects up to 40 points are possible is easy to convince. The genuinely demanding sceptic has a tight prior around zero. A low SL means your data defeated even that hard opponent.

SL = (U − L)² / (4 × √(U × L)), where U and L are the upper and lower CI bounds. Both must be positive; that is, the result must be significant, with both bounds on the same side of zero.

The SL is expressed in your outcome's own units — points, milliseconds, whatever you measured — so you can immediately judge whether the implied sceptical region is defensible in your field. This is the key advantage over a p-value, which tells you nothing about the scale of doubt involved.

The verdict rule is simple:

Observed effect > SL → Credible Your effect sits outside the sceptic's zone. They cannot accommodate your result without widening their prior to an indefensible degree.
Observed effect < SL → Fragile Your effect sits inside the sceptic's zone. Even a sceptic centred on zero can comfortably absorb your result as "essentially no effect" — regardless of the p-value.
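The SL formula and verdict rule are mechanical enough to script. A minimal sketch; the function names are illustrative:

```python
from math import sqrt

def scepticism_limit(lower, upper):
    """Matthews' Scepticism Limit from 95% CI bounds, both on the
    same (positive) side of zero."""
    if lower <= 0:
        raise ValueError("SL needs a significant result: both bounds > 0")
    return (upper - lower) ** 2 / (4 * sqrt(upper * lower))

def verdict(observed, lower, upper):
    """Credible if the observed effect clears the sceptic's zone."""
    return "Credible" if observed > scepticism_limit(lower, upper) else "Fragile"

# Scenario B from the comparison table at the end of this section
print(scepticism_limit(1.28, 6.72), verdict(4, 1.28, 6.72))
```

Applied to Scenario B it returns SL ≈ 2.5 with a "Credible" verdict, matching the table.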
Scepticism Limit calculator

These three scenarios all use reading accuracy (0–100 points) as the outcome. Load a preset or enter your own CI bounds.

[Interactive calculator inputs: lower CI bound L (must be > 0), upper CI bound U, and observed mean difference. Outputs: CI width (U − L), the Scepticism Limit (SL), SL as a proportion of the observed effect, and a verdict, with the sceptical prior zone (shaded) plotted against the observed mean difference (vertical line) on the outcome-unit scale.]

The Advocacy Limit (AL) — for non-significant results

When your result is non-significant — the CI straddles zero — the SL cannot be computed. But there is an equally important question for that situation: just because you didn't reach significance, does that mean the effect is absent?

The Advocacy Limit answers: what is the maximum effect size that even a committed advocate of the hypothesis can claim the data supports? A large AL means the study was simply too weak to say much either way — the absence of significance is not evidence of absence.

The most instructive example is the ORBITA trial, which tested whether stenting improved exercise duration in patients with stable angina. The result was non-significant (p ≈ .2), and many commentators concluded that stents don't work. But the AL was approximately 115 seconds — meaning the data were perfectly consistent with a clinically meaningful positive effect. The trial was underpowered, not decisive. The appropriate conclusion was "we don't yet know," not "no effect." This maps directly onto the small-N paradox in power analysis: a non-significant result from an underpowered study is an uninformative result, not evidence of zero effect. The AL makes that explicit and quantitative.

AL in plain language. When your study is non-significant, ask: is the upper bound of my CI clinically or practically meaningful? If the CI runs from −2 to +40 points, you have not shown the effect is zero — you have shown the data are consistent with effects from trivial to substantial. A large AL is a call for a better-powered study, not a declaration of null results.

A note on what "prior" means here

The SL framework uses the word prior differently from classical Bayesian analysis.

Bayesian epistemic prior: encodes your genuine belief state before seeing the data. A flat prior expresses ignorance ("I have no idea what to expect") and is appropriate when you genuinely lack prior knowledge. It is a statement about your knowledge state.

SL sceptical prior: models a specific argumentative position, "I doubt an effect exists." A tight prior around zero is not ignorance; it is an informative claim. The SL asks what the sceptic must concede as possible, even while remaining doubtful. It is a dialectical tool, not an epistemic one.

Counterintuitively, a flat prior is easy to defeat — it already spreads probability across large effects. The genuinely hard sceptic holds a tight prior around zero. A low SL means your data defeated that hard opponent. And this clarifies why NHST's point null is a false form of scepticism: it looks maximally demanding (all mass at exactly zero) but is trivially defeated with large N. The SL models scepticism as a region — which is how real scientific doubt actually operates.

Connecting the two routes: ROPE, Meehl, and SL used together. ROPE (Region of Practical Equivalence, Kruschke) and Meehl's crud threshold both stretch the null prospectively — you decide in advance what "negligible" means (e.g., d < 0.2 is just crud), then require significance to clear that expanded target. SL stretches it retrospectively — you let the data reveal what level of scepticism it can overcome, then judge that level. Used together, they give two independent evaluations: does your effect clear the region you pre-specified as meaningful (ROPE/Meehl), and is your data precise enough to defeat a reasonable sceptic (SL)? A result that clears both is genuinely robust. A result that clears only one deserves scrutiny.
The three scenarios — comparing p-value, effect size, and credibility
Scenario A — large effect, small N · N/group: 18 · mean diff: 8 pts · 95% CI: (0.16, 15.84) · p = .048 · SL = 38.5 pts · verdict: Fragile
Scenario B — modest effect, large N · N/group: 150 · mean diff: 4 pts · 95% CI: (1.28, 6.72) · p = .004 · SL = 2.5 pts · verdict: Credible
Scenario C — large effect, adequate N · N/group: 60 · mean diff: 8 pts · 95% CI: (3.71, 12.29) · p = .0003 · SL = 2.7 pts · verdict: Credible
Scenarios A and C have the same observed effect (8 pts) and the same p-value threshold classification — yet their SLs are 38.5 vs 2.7. The SL reveals what the p-value hides: Scenario A's CI barely cleared zero and the width did the heavy lifting. Scenario C's narrower CI means the same effect is genuinely credible.