Statistical Foundations · Part 3 of 3

Where does your data come from?

Understanding the generative process behind your observations — and why it determines which statistical models are appropriate.

Section 1

Every number in your dataset has a story

When a participant responds correctly or incorrectly on a trial, when they report a pain rating of 4, when their reaction time is 312ms — each of these observations was produced by some real-world process. That process is not arbitrary. It has a characteristic shape: some outcomes are more probable than others, and the probabilities follow a pattern reflecting the nature of the generating mechanism.

This pattern is what statisticians formalise as a probability distribution. Statistical models are not just tools for analysing data—they are claims about how the data were generated.

Every analysis implicitly answers the question: what kind of process produced these observations?

ANOVA says: my residuals are Gaussian. Logistic regression says: my outcomes are Bernoulli. Poisson regression says: my counts follow a Poisson process. The model is not just a computational procedure — it is a substantive claim about the generative mechanism.

The central question

A common starting point is: "Which test should I use?"

A better starting point is: "What kind of data-generating process could have produced these observations?"

First decision

Before thinking about specific models, ask: is my dependent variable discrete or continuous?

  • Discrete: counts, categories, correct/incorrect — outcomes are separate, countable values
  • Continuous: measurements on a scale — reaction time, intensity, ratings (often treated as continuous)

This distinction immediately narrows the set of plausible generative models.

Why getting this wrong matters

If your model assumes the wrong distribution, the consequences range from negligible to severe depending on how far reality departs from the assumption. Standard errors may be wrong, p-values may be inflated or deflated, and effect size estimates may be biased. More subtly, the model may be making arithmetically impossible claims — estimating probabilities above 1, predicting negative counts — which is a sign that the assumed generative process simply does not match the data-generating reality.
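One of those impossible claims is easy to produce on demand. A minimal R sketch (with simulated data; `x` and `y` are illustrative names, not from any real study): fitting an ordinary linear model to binary outcomes yields fitted "probabilities" outside [0, 1].

```r
set.seed(1)
x <- seq(-3, 3, length.out = 200)     # a continuous predictor
y <- rbinom(200, 1, plogis(2 * x))    # binary outcomes from a Bernoulli process
lin <- lm(y ~ x)                      # Gaussian model applied to Bernoulli data
range(fitted(lin))                    # fitted "probabilities" escape [0, 1]
```

The linear fit is not merely imprecise; it asserts probabilities above 1 at one end of the predictor and below 0 at the other, which no Bernoulli process could generate.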

Section 2

Eight distributions worth knowing

Each distribution is associated with a particular kind of generating process. Seeing how each family's shape responds to parameter changes builds the intuition to recognise which family your DV is likely to belong to before you run any formal tests.

Section 3

The mean–variance independence question

There is one property that distinguishes the Gaussian from almost all other common distributions, and it is the property that matters most for ANOVA. In a Gaussian distribution, the mean and the variance are entirely independent parameters. Knowing the mean tells you nothing about the variance. You can have a distribution centred on 50 with very low spread, or the same centre with very high spread — they are simply different parameterisations with no necessary relationship between them.

In most other distributions the mean constrains the variance to some degree. The table below summarises where each distribution stands on this crucial question.

Distribution     Mean               Variance                       Independent?
Gaussian         μ                  σ²                             Yes ✓ — completely free parameters
Bernoulli        p                  p(1 − p)                       No ✗ — variance fully determined by mean
Binomial         np                 np(1 − p)                      No ✗ — variance determined by mean and n
Poisson          λ                  λ                              No ✗ — variance always equals mean exactly
Log-normal       e^(μ + σ²/2)       (e^(σ²) − 1) e^(2μ + σ²)       No ✗ — both depend on both parameters
Ex-Gaussian      μ + τ              σ² + τ²                        Partial ≈ — τ affects both, but μ and σ provide extra freedom
Beta             α/(α + β)          αβ/[(α + β)²(α + β + 1)]       No ✗ — entangled through α and β
Neg. binomial    μ                  μ + μ²/r                       No ✗ — variance exceeds mean, scales with mean

This table is not a curiosity — it is the key to understanding ANOVA's assumptions. ANOVA decomposes total variability into between-group variance (signal) and within-group variance (noise). For the F-ratio to work properly, the within-group variance needs to be a free parameter — independent of where the group means happen to fall. If variance is locked to the mean, that independence is broken.
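The Bernoulli row of the table can be verified in two lines of R (simulated data, illustrative variable names): once the means of two conditions differ, their variances differ automatically, with no free parameter to hold them equal.

```r
set.seed(42)
easy <- rbinom(1e5, 1, 0.9)   # high-accuracy condition, p = 0.9
hard <- rbinom(1e5, 1, 0.5)   # chance-level condition, p = 0.5
# variances are dictated by the means: p(1 - p) gives 0.09 and 0.25
c(var(easy), var(hard))
```

Nothing was done to make the variances unequal; setting the means did it. That is what "variance locked to the mean" looks like in practice.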

Section 4

Fitting distributions to data

Having developed intuitions about which distribution family your DV is likely to follow — from first principles, substantive knowledge, or visual inspection — you can go further and formally fit competing distributions to test that intuition. Rather than assuming a distribution, you let the data speak to which generative model provides the most plausible account of what was observed.

The logic is similar to model comparison in regression: fit each candidate distribution to your data using maximum likelihood estimation (which finds the parameter values that make the observed data most probable under that distribution), then compare the fits using an information criterion. The distribution with the lower AIC or BIC provides a better trade-off between fit quality and model complexity, since more parameters always improve fit and information criteria penalise for that.

Key idea

Maximum likelihood estimation asks: given this distribution family, what parameter values make my observed data most probable? AIC/BIC then asks: across several families, which one achieves the best fit without overfitting?
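A compact illustration of that logic in base R, using simulated RT-like data (note one simplification: `sd()` uses the n − 1 denominator, which is close enough to the ML estimate at this sample size):

```r
set.seed(7)
rt <- rlnorm(500, meanlog = 6, sdlog = 0.4)   # skewed, RT-like data (ms)

# Log-likelihood of each candidate at (approximately) its ML estimates
ll_norm  <- sum(dnorm(rt, mean(rt), sd(rt), log = TRUE))
ll_lnorm <- sum(dlnorm(rt, mean(log(rt)), sd(log(rt)), log = TRUE))

aic <- function(ll, k) 2 * k - 2 * ll         # k = number of free parameters
c(normal = aic(ll_norm, 2), lognormal = aic(ll_lnorm, 2))
```

Because the data really were generated by a log-normal process, the log-normal candidate achieves the lower AIC; with two parameters each, the complexity penalty is identical and the likelihood decides.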

A practical workflow

1. Visual inspection

Plot a histogram of your DV overlaid with density curves from candidate distributions. Look at the shape — symmetric or skewed? Bounded or unbounded? Long-tailed or compact? This narrows the candidates before any formal fitting.

2. Fit candidate distributions

Use maximum likelihood to estimate the parameters of each plausible distribution. The fitdistrplus package in R makes this straightforward, with automatic starting values and convergence checks across a wide range of families.

3. Compare fits formally

Compare AIC and BIC across fitted distributions. Inspect Q-Q plots and P-P plots for each — these show whether the fitted distribution's quantiles match the empirical quantiles. A well-fitting distribution produces points close to the diagonal.

4. Interpret substantively

Statistical fit alone is not enough. The winning distribution should also make sense given what you know about the generating process. A distribution that fits well but is implausible mechanistically deserves scrutiny — it may be overfitting, or it may reveal something genuinely interesting about your data.

Competing generative stories for reaction time data

For positively skewed continuous variables such as reaction times, different generative stories imply different distributions:

  • Gaussian: noise is the sum of many small additive influences (symmetric, so often implausible for raw RTs)
  • Log-normal: influences combine multiplicatively rather than additively
  • Gamma: the response is the sum of successive exponential waiting times
  • Ex-Gaussian: a Gaussian stage plus an exponential tail, with τ indexing the tail

The goal is not to pick a convenient model, but to test which of these generative accounts best matches the data.

library(fitdistrplus)

# Fit several candidate distributions to RT data
fit_norm  <- fitdist(rt_data, "norm")
fit_lnorm <- fitdist(rt_data, "lnorm")
fit_gamma <- fitdist(rt_data, "gamma")

# Compare by AIC
gofstat(list(fit_norm, fit_lnorm, fit_gamma))

# Visual comparison
denscomp(list(fit_norm, fit_lnorm, fit_gamma),
         legendtext = c("Normal", "Log-normal", "Gamma"))
qqcomp(list(fit_norm, fit_lnorm, fit_gamma))

For ex-Gaussian specifically, the retimes package in R provides direct fitting and the decomposition into μ, σ, and τ components — which is particularly useful since τ carries theoretical interpretability as an index of attentional or executive function variability.
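If you prefer not to depend on a package, the ex-Gaussian likelihood is simple enough to fit directly with `optim`. The sketch below uses simulated data with known μ = 400, σ = 40, τ = 100 and moment-based starting values (all names and starting heuristics are illustrative, not a canonical recipe):

```r
# Ex-Gaussian density: Gaussian(mu, sigma) plus an independent Exponential(1/tau)
dexgauss <- function(x, mu, sigma, tau) {
  exp(sigma^2 / (2 * tau^2) - (x - mu) / tau) / tau *
    pnorm((x - mu) / sigma - sigma / tau)
}

set.seed(3)
rt <- rnorm(1000, 400, 40) + rexp(1000, rate = 1 / 100)  # mu = 400, sigma = 40, tau = 100

# Negative log-likelihood, guarded against invalid parameter values
nll <- function(par) {
  d <- dexgauss(rt, par[1], par[2], par[3])
  if (any(!is.finite(d)) || any(d <= 0)) return(1e10)
  -sum(log(d))
}

# Moment-based starting values, then Nelder-Mead
tau0  <- 0.8 * sd(rt)
start <- c(mean(rt) - tau0, sqrt(max(var(rt) - tau0^2, 1)), tau0)
fit   <- optim(start, nll)
round(fit$par)   # approximately recovers mu, sigma, tau
```

Writing the likelihood by hand also makes the decomposition concrete: μ and σ describe the Gaussian stage, τ the exponential tail.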

Note on GAMLSS

For regression models where you want to allow the distribution to vary as a function of predictors — not just estimate its overall shape — the GAMLSS framework (Generalised Additive Models for Location, Scale and Shape) extends GLMs to cover a very wide range of distributions including beta, ex-Gaussian, negative binomial, and many others, while letting all parameters of the distribution vary with covariates.

Section 5

ANOVA's generative assumption — a quick recap

If you have read Part 2 of this resource, you have seen the OLS-family argument in detail: ANOVA, RM-ANOVA, and the standard LMM all share one underlying likelihood — residuals are independent, identically distributed, and Gaussian (iid) — and that one likelihood is what defines the family. This section pins the same idea down in distributional terms, so the alternatives surveyed in Section 6 land cleanly.

ANOVA assumes that residual variation follows a Gaussian (normal) distribution: each score is the group's true mean plus a random error drawn from a normal distribution with mean zero and variance σ². This is not the same as saying the errors are "random" — randomness does not imply any particular distribution. ANOVA makes a stronger claim: that the unexplained variation behaves like the sum of many small, independent influences, an assumption under which Gaussian noise is a reasonable approximation. It also claims that the error variance is the same across all groups — the homogeneity-of-variance assumption.

Key idea

Randomness does not imply normality — Gaussian error is a specific assumption about how randomness is structured.

Yij = μj + εij    where    εij ~ N(0, σ²)
Each score equals a group mean plus Gaussian noise — with the same σ² across all groups

The critical feature of this model is that σ² is an entirely free parameter — it says nothing about where the group means μj happen to lie. The noise is the same whether the groups are close together or far apart. This is exactly the Gaussian mean–variance independence highlighted in Section 3's table. Distributions where that independence breaks down — Bernoulli, binomial, Poisson, negative binomial, log-normal, beta — produce data that ANOVA's likelihood cannot represent honestly. Part 2 unpacks the binary case in full detail (variance = p(1−p), variance heterogeneity is automatic the moment groups differ in mean); here we extend the same logic to the other common generating processes you will meet in behavioural data.

Why count DVs require care

With count data the generating process is typically Poisson or negative binomial — both of which have variance that scales with the mean. Applying ANOVA to raw counts implicitly assumes the noise is constant across groups with different mean counts, which is structurally implausible. When counts are large and roughly symmetric, the Gaussian approximation becomes tolerable, but it is always an approximation — and one that quietly weakens whenever group counts span a wide range or whenever dispersion exceeds what a Poisson process would produce.
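The overdispersion check can be run directly as a model comparison. A sketch with simulated two-group count data (`glm.nb` comes from the MASS package, which ships with R; group names and parameter values are illustrative):

```r
library(MASS)   # for glm.nb

set.seed(9)
group  <- rep(c("control", "treatment"), each = 150)
mu     <- ifelse(group == "control", 3, 8)
counts <- rnbinom(300, size = 2, mu = mu)   # variance = mu + mu^2/2, well above mu

pois <- glm(counts ~ group, family = poisson)
nb   <- glm.nb(counts ~ group)
AIC(pois, nb)   # the negative binomial wins when counts are overdispersed
```

When the data really are Poisson the two models give essentially the same fit and AIC favours the simpler one, so the comparison costs little and protects against understated standard errors.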

Section 6

Matching the model to the generative process

The choice of statistical model should be guided by the likely generating process of your DV. The guide below covers the most common cases encountered in behavioural and cognitive research.

A practical map of behavioural data

In practice, most behavioural data fall into a small number of recurring types. Each type implies a different generative model:

  • Reaction times / durations (continuous) → Gaussian, log-normal, gamma. Right-skewed data; compare symmetric-noise vs multiplicative vs waiting-time processes.
  • Binary outcomes (discrete) → Bernoulli / binomial. Variance determined by the mean; ANOVA assumptions structurally violated at the trial level.
  • Counts (discrete) → Poisson / negative binomial. Variance scales with the mean; overdispersion is common in behavioural data.
  • Proportions (continuous) → beta. Bounded between 0 and 1; Gaussian models can produce impossible values.

The deeper logic

The goal is not to try every possible distribution, but to compare a small set of plausible generative models based on the nature of the data.

  • Binary outcome → logistic regression / binomial GLM. Bernoulli errors; variance locked to the mean; ANOVA's Gaussian assumption fundamentally violated.
  • Count of rare events → Poisson regression. Variance equals the mean; integer-valued and right-skewed; check for overdispersion first.
  • Overdispersed counts → negative binomial regression. Variance exceeds the mean; the realistic choice for most behavioural count data where Poisson's strict mean = variance constraint fails.
  • Reaction times → ex-Gaussian model or log-transform. Right-skewed with a long tail; the ex-Gaussian τ parameter is theoretically interpretable, while a log-transform is simpler but loses the decomposition.
  • Bounded proportions → beta regression. Continuous on (0, 1): proportion of fixation time, proportion of trials correct across a block. Gaussian is unbounded; beta is the natural choice.
  • Total score over many items → ANOVA often acceptable. The binomial approaches the Gaussian as n grows; an acceptable approximation with many items and middling difficulty, but not guaranteed near floor or ceiling.
  • Ordinal rating scale → ANOVA defensible with caution. Unequal intervals are a genuine violation; tolerable with 5+ points and roughly symmetric distributions. Ordinal mixed models for rigour.
  • Continuous interval measure → ANOVA well-suited. Gaussian errors are plausible; mean and variance genuinely independent; assumptions met at the level of measurement.

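The floor/ceiling caveat for total scores is easy to demonstrate by simulation: a 20-item binomial score is nearly symmetric at middling difficulty but strongly skewed near ceiling (a sketch; the moment-based skewness function is an illustrative helper, not a library call):

```r
set.seed(5)
mid  <- rbinom(1e4, 20, 0.50)   # middling difficulty: roughly Gaussian
ceil <- rbinom(1e4, 20, 0.95)   # near ceiling: compressed, left-skewed

skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(mid = skewness(mid), ceiling = skewness(ceil))
```

The mid-difficulty scores show skewness near zero, while the near-ceiling scores are markedly left-skewed, which is exactly where the Gaussian approximation for summed items breaks down.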
The deeper lesson

Statistical models are not neutral computational procedures. They are claims about the world — specifically, about the process that generated your observations. Learning to ask "what distribution does my DV follow, and why?" before reaching for a test is one of the most transferable skills in quantitative research. Formal distribution fitting gives you the tools to move from plausible intuition to empirical evidence about which generative model is most defensible for your data.