Three fallacies lurking inside every Likert scale, wellbeing questionnaire, and pain rating in psychology.
Companion track to the four-part Statistical Foundations series. Reads naturally after Part 2 §01 (How ANOVA actually works).
The "What kind of DV does ANOVA need?" prelude in Part 2 §01 of the Foundations series spent its time inside ANOVA — showing that the dependent variable has to be measured at interval level or above for group means to be meaningful summaries. But that prelude took interval-level measurement as a given. It didn't ask the more uncomfortable question: do we actually have it?
When a participant circles 7 on a 1–10 pain scale, three separate assumptions ride silently on that number — and almost none of them are ever tested:
1. Your 7 and my 7 refer to the same internal experience. Subjective equivalence across people is assumed, not demonstrated.
2. The gap from 4 to 5 equals the gap from 6 to 7. Equal spacing between scale points is assumed, not demonstrated.
3. My questionnaire and your questionnaire measure the same construct. Instrument equivalence, and invariance across groups, is assumed, not demonstrated.
These are not fringe complaints. They have occupied philosophers of measurement — Joel Michell, Paul Meehl, and others — for decades. Michell goes so far as to call psychology's unthinking use of numerical scales a pathology: we adopt the machinery of quantitative science (means, variances, correlations, factor loadings) without first establishing that the thing we are measuring has quantitative structure at all. Classical test theory's familiar equation

X = T + E (observed score = true score + error)

encourages exactly this shortcut. Researchers read T as if it were a direct proxy for some latent ability — treating the observed score as a slightly noisy readout of the underlying construct. CTT itself warns against this, but the warning rarely survives contact with applied research.
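To see what that equation does and does not license, here is a minimal simulation sketch (all numbers invented): true scores T are generated, the observed scores X = T + E are all the analyst ever sees, and reading X as a direct readout of T quietly misorders a noticeable fraction of people.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

true_score = rng.normal(50, 10, n)   # T: the latent construct, never observed directly
error = rng.normal(0, 8, n)          # E: measurement error
observed = true_score + error        # X = T + E: all the dataset ever contains

# Share of observed-score variance that is true-score variance (reliability)
print(f"reliability ≈ {true_score.var() / observed.var():.2f}")

# Treating X as a direct readout of T quietly misorders people: in what fraction
# of random pairs does the person with the higher observed score actually have
# the lower true score?
i, j = rng.integers(0, n, size=(2, 5_000))
misordered = np.mean((observed[i] > observed[j]) != (true_score[i] > true_score[j]))
print(f"pairs misordered by the observed score: {misordered:.0%}")
```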
This is not the "summed scores lose information" argument from classical test theory — that one sits on a separate page. This page is the deeper layer beneath it: before you worry about whether summing is the right aggregation, you should worry about whether the numbers you are summing are numbers at all.
A GP asks two patients to rate their pain on a 0–10 scale. Both say 7. The clinical note records the same number for both. The subsequent repeated-measures ANOVA treats those two 7s as identical observations, contributing identically to the group mean and the within-subjects error term.
But are they the same experience? We have no way of knowing. The patient who fractured her wrist last year and remembers that as a 9 has one internal benchmark for what a 7 feels like. The patient whose worst prior experience was a migraine he rated 4 has a different one. Their 7s are anchored in different personal histories, different tolerance thresholds, and — quite possibly — different underlying nociceptive responses.
The reported number is the same. The underlying experience could be anywhere along the scale — and we have no instrument that can tell us.
This is what makes self-report measurement uniquely difficult. In physics, if you and I measure the length of a table and both get 52 inches, we are appealing to an external, shared, operationally defined reference: the inch. The scale is outside the observer. With a pain rating, the scale is inside the observer — calibrated privately, with no mechanism for cross-person alignment. The number travels from a private reference frame into a public dataset, where it is then treated as though it came from a shared one.
This is the backbone of essentially every survey-based subfield: wellbeing research, personality psychology, attitudinal research, quality-of-life studies in medicine. The entire inferential apparatus assumes that subjective scale anchors are, on average, shared across the population — an assumption that is rarely tested and, where tested (through anchoring-vignette studies, differential item functioning, or qualitative follow-ups), is often violated.
Averaging subjective ratings from 300 participants does not eliminate this problem. If individuals' personal scales are miscalibrated in systematic ways — tied to culture, gender, age, clinical history — then the "noise" is structured bias, not random error. No amount of N solves structured bias.
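A small simulation sketch of that point, with an invented calibration bias: two groups share exactly the same latent wellbeing distribution, but one group's private anchors map the same latent level onto slightly higher scale numbers. The spurious group difference survives any sample size.

```python
import numpy as np

rng = np.random.default_rng(1)

def observed_difference(n_per_group, calibration_bias=0.5):
    # Both groups draw from the SAME latent wellbeing distribution...
    latent_a = rng.normal(0.0, 1.0, n_per_group)
    latent_b = rng.normal(0.0, 1.0, n_per_group)
    # ...but group B's private anchors map the same latent level onto
    # higher scale numbers (a hypothetical culture-linked response style).
    rating_a = np.clip(np.round(4 + 1.5 * latent_a), 1, 7)
    rating_b = np.clip(np.round(4 + 1.5 * latent_b + calibration_bias), 1, 7)
    return rating_b.mean() - rating_a.mean()

for n in (30, 300, 30_000):
    print(f"n per group = {n:>6}: observed 'group difference' ≈ {observed_difference(n):.2f}")
# The spurious difference does not shrink towards zero as n grows;
# it converges on the size of the calibration bias instead.
```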
Suppose — charitably — that we solve the first problem. Everyone's subjective scale is perfectly calibrated. My 7 and your 7 reliably refer to the same underlying amount of whatever the construct is. We still have a second, equally serious problem: the spacing between the numbers on the scale.
When we compute a group mean from a set of Likert responses, we treat the scale as interval-level. That is, we assume the distance from 1 to 2 equals the distance from 2 to 3, which equals the distance from 6 to 7. Only under this assumption does arithmetic — summing, averaging, computing variances — correspond to anything meaningful on the underlying construct.
There is no a priori reason to believe this assumption holds. Consider a pain scale where 1 = "none," 4 = "moderate," 7 = "worst imaginable." Is the jump from 1 to 4 really the same distance as the jump from 4 to 7? Almost certainly not: the ceiling is a hard, qualitative threshold that compresses the upper end. Or consider a wellbeing item anchored at "strongly disagree" through "strongly agree." The jump from neutral to "agree" may be psychologically enormous, while "agree" to "strongly agree" may be trivial.
What Joel Michell calls the ordinal fallacy is the slide from ordered response categories to the assumption of quantitative structure. Likert-type scales give us ranks — an ordering — but the numbers attached to those ranks are a convenience. Michell's deeper point (1999, 2008) is that standard psychometric techniques — factor analysis, IRT, coefficient alpha — assume quantitative structure in order to run. They do not independently test for it. A factor loading is a correlation between an item and a latent score; it says nothing about whether the latent variable has the additive structure required for interval measurement.
Here is a small trial dataset: 60 participants, randomly assigned to a drug or placebo arm, rating their pain relief on a 7-point scale. Under the standard analytic assumption — that the 7 scale points are equally spaced on the underlying construct — the drug arm reports a moderate benefit over placebo. Drag the slider to change only the assumption about scale spacing. The data never change.
The response distributions are fixed. Only the assumed spacing of the 7 scale points on the underlying interval continuum is changing.
Under equal spacing the effect is moderate — the kind of result that would appear in a published trial with a clean p-value and a plausible clinical story. Compress the top of the scale (the "ceiling" interpretation) and the same raw data produces a smaller effect. Stretch the top (the "catastrophic pain is qualitatively different" interpretation) and the same raw data produces a considerably larger one. Nothing about the underlying responses changed. The only thing that changed was an assumption we never tested in the first place.
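The demo's logic is straightforward to reproduce. The sketch below uses invented response counts (not the trial dataset above) and three hypothetical spacing assumptions; only the mapping from category labels to latent values changes, yet the standardized effect moves with it.

```python
import numpy as np

# Invented response counts on a 7-point relief scale (not the page's actual dataset)
drug    = np.repeat(np.arange(1, 8), [1, 2, 3, 4, 6, 7, 7])   # 30 participants
placebo = np.repeat(np.arange(1, 8), [2, 4, 6, 9, 5, 3, 1])   # 30 participants

# Three hypothetical assumptions about where the 7 category labels sit
# on the underlying interval continuum
spacings = {
    "equal spacing":      np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]),
    "compressed ceiling": np.array([1.0, 2.0, 3.0, 4.0, 4.4, 4.7, 4.9]),
    "stretched ceiling":  np.array([1.0, 2.0, 3.0, 4.0, 5.5, 8.0, 11.0]),
}

for label, values in spacings.items():
    d, p = values[drug - 1], values[placebo - 1]
    diff = d.mean() - p.mean()
    pooled_sd = np.sqrt((d.var(ddof=1) + p.var(ddof=1)) / 2)
    print(f"{label:<19} standardized effect ≈ {diff / pooled_sd:.2f}")
# Identical responses, three different effect sizes: only the (untested)
# spacing assumption changed.
```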
Michell argues that psychology has, for most of its history, treated the quantitative structure of its constructs as a methodological convenience rather than an empirical claim. Stevens's (1946) taxonomy of nominal / ordinal / interval / ratio was meant as a warning — a classification of what inference each scale type supports. In practice it became a licence: if a scale has more than a handful of points and looks vaguely symmetric, researchers proceed with interval-level statistics and move on.
There is a formal condition under which ordinal data can be promoted to interval status: additive conjoint measurement, developed by Luce and Tukey (1964). The idea is that if two variables combine additively to produce an outcome in a way that satisfies certain axioms (cancellation, solvability, Archimedean), then you have demonstrated — not assumed — quantitative structure. Almost no psychological measurement instrument has ever been shown to satisfy these axioms. Michell's verdict: we quantify without proof, and then act as though the proof had been supplied.
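For the curious, the core of the double-cancellation condition can be checked mechanically. The sketch below is a bare-bones illustration on made-up 3 × 3 tables, not a serious test procedure (real applications need an error theory and many more factor levels):

```python
import itertools
import numpy as np

def double_cancellation_holds(table):
    """Check the double-cancellation condition on a two-factor table, where
    table[i][j] is the (at least ordinal) outcome of combining row level i
    with column level j.  Required: whenever table[a][y] >= table[b][x] and
    table[b][z] >= table[c][y], then table[a][z] >= table[c][x]."""
    t = np.asarray(table, dtype=float)
    for a, b, c in itertools.permutations(range(t.shape[0]), 3):
        for x, y, z in itertools.permutations(range(t.shape[1]), 3):
            if t[a, y] >= t[b, x] and t[b, z] >= t[c, y] and t[a, z] < t[c, x]:
                return False
    return True

# A table with genuinely additive structure satisfies the condition...
print(double_cancellation_holds([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))   # True
# ...while this one violates it, so no additive representation can exist.
print(double_cancellation_holds([[1, 5, 3], [2, 6, 9], [4, 7, 8]]))   # False
```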
The practical response from within psychometrics has been to treat this as a nuisance best handled statistically. Item response theory (IRT), for example, estimates item difficulties and person abilities on a latent logit scale that is interval by construction — but the "interval-ness" comes from the model, not from an independent demonstration that the construct has that structure. The numbers are still earned by assumption, just a more principled-looking one.
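A Rasch-style illustration of that point (the difficulty value is hypothetical): equal steps on the logit scale correspond to equal log-odds increments because the model is built that way, not because anyone has demonstrated that the construct combines additively.

```python
import numpy as np

def rasch_prob(theta, difficulty):
    # One-parameter logistic (Rasch) item response function
    return 1.0 / (1.0 + np.exp(-(theta - difficulty)))

for theta in (-1.0, 0.0, 1.0, 2.0):
    p = rasch_prob(theta, difficulty=0.5)
    print(f"theta = {theta:+.1f}   P(endorse) = {p:.2f}   log-odds = {np.log(p / (1 - p)):+.2f}")
# Each 1-logit step in theta adds exactly 1 to the log-odds: the interval structure
# is a property of the model, not an empirical finding about the construct.
```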
The third fallacy is the one that bites hardest in applied research. Suppose we solve fallacy 1 (subjective scales are calibrated) and fallacy 2 (the spacing is genuinely interval). We still haven't answered: does my questionnaire measure the same thing as yours?
In the physical sciences, this is trivial. If I measure your height with my ruler and you measure it with yours, we should get the same number — because "length" has a shared, well-defined operational reference. In psychology, two instruments both claiming to measure "wellbeing" frequently disagree substantially. Why?
Both questionnaires claim to measure "wellbeing." Questionnaire A is mostly loaded on hedonic affect, with a side of meaning. Questionnaire B is mostly loaded on meaning, with a side of affect. They are congeneric — measuring related but distinct latent blends — not parallel.
This is the problem of congeneric measurement. Two instruments can both load on the same general construct while also loading, differentially, on distinct sub-constructs. They are not interchangeable readouts of the same latent variable. A participant scoring high on questionnaire A may score only moderately on questionnaire B — not because they are inconsistent, but because the instruments are tapping different weighted blends of the same underlying landscape. Treating them as equivalent, or combining them in meta-analysis as if they were, produces the conceptual equivalent of averaging temperatures measured in Celsius with temperatures measured in decibels.
Psychometrics offers formal tools to diagnose this: tests of tau-equivalence ask whether items within a single instrument load equally on their common factor (the condition under which Cronbach's α actually equals the scale's reliability rather than merely bounding it from below). Congeneric models allow loadings to vary. Comparisons across instruments require either explicit multi-trait multi-method designs or structural-equation-model-based linking. Most applied papers use none of these (Graham, 2006; Peterson & Kim, 2013).
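A quick simulation sketch of the tau-equivalence point, with invented loadings: once the loadings are unequal (congeneric), Cronbach's α no longer equals the sum score's reliability; it only bounds it from below.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000                                   # large n: the gap below is structural, not sampling noise

factor = rng.normal(size=n)                  # one common factor
loadings = np.array([1.2, 0.9, 0.4, 0.1])    # hypothetical, deliberately unequal (congeneric) loadings
items = factor[:, None] * loadings + rng.normal(size=(n, 4))
total = items.sum(axis=1)

# Cronbach's alpha from the item covariance matrix
k = items.shape[1]
cov = np.cov(items, rowvar=False)
alpha = (k / (k - 1)) * (1 - np.trace(cov) / total.var(ddof=1))

# True reliability of the sum score: true-score variance over total variance
reliability = (factor * loadings.sum()).var(ddof=1) / total.var(ddof=1)

print(f"Cronbach's alpha ≈ {alpha:.2f}   true reliability ≈ {reliability:.2f}")
# With unequal loadings, alpha only bounds the reliability from below;
# the two coincide only under (essential) tau-equivalence.
```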
A second and closely related issue: even within a single instrument, does the scale work the same way for all groups? If an item on a depression inventory functions differently for men versus women — say, men are less likely to endorse the item "I feel tearful" at the same underlying depression level — then group comparisons using total scores are systematically distorted. This is differential item functioning (DIF), and its detection requires measurement-invariance testing: a hierarchy of model comparisons that asks whether factor loadings, intercepts, and residuals are equivalent across groups.
The item "I feel tearful" crosses the 50% endorsement probability at a lower depression level for women than for men. At the same true depression level, a woman is more likely to endorse the item. A summed total score will therefore overestimate women's depression (or underestimate men's) — not because either group is misrepresenting, but because the item's difficulty is group-dependent.
Measurement invariance is the formal name for the property we implicitly require whenever we compare groups on a psychological scale. Testing it requires multi-group confirmatory factor analysis or its IRT equivalent. These tests are standard in large-scale educational testing (PISA, TIMSS) — the field will not report cross-country comparisons without them — but remain rare in clinical, social, and health psychology, where group comparisons on raw summed scores dominate the literature.
Psychometrics can detect congeneric structure, DIF, and invariance failures when they exist. It cannot rescue an instrument whose underlying construct has no quantitative structure in the first place — the ordinal fallacy is upstream. Measurement invariance testing presupposes a measurement model, and the measurement model presupposes that "more" of the construct is a meaningful quantity at all.
So what does applied research actually do about all this? The honest answer: usually nothing. The bulk of applied psychology proceeds as if these fallacies did not exist, because the tools for addressing them are either unfamiliar, computationally demanding, or philosophically uncomfortable. But three broad responses are available, and each trades off rigour against tractability in a different way.
Pragmatic retreat: if a measure predicts behaviour, occupational outcomes, or treatment response, its validity is demonstrated by use. The philosophical question of whether it captures a "real" quantitative construct is shelved. Most of working psychology is implicitly operationalist.
Bayesian measurement modelling: don't assume exact measurement — model the uncertainty. Represent the latent construct with a posterior distribution rather than a point estimate. Propagate measurement error through downstream inferences. Hierarchical Bayesian models formalise what CTT only gestures at (a minimal sketch of the idea follows this list).
Item response modelling: abandon the pretence that raw Likert numbers are interval data. Estimate person abilities and item characteristics on a logit scale, and report the assumptions (unidimensionality, monotonicity, invariance) as testable claims. Imperfect but auditable.
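To make the second response concrete: under a simple normal-normal measurement model with a known (here, merely assumed) reliability, the posterior for a person's true score is Kelley's classic shrinkage estimate, and downstream analyses can be run over posterior draws rather than raw sums. The function, scores, and reliability below are illustrative; a full hierarchical Bayesian treatment would estimate the reliability rather than assume it.

```python
import numpy as np

def plausible_true_scores(observed, reliability, n_draws=10_000, rng=None):
    """Posterior draws for latent true scores under a normal-normal measurement
    model with a known reliability: the posterior mean is Kelley's estimate
    (observed scores shrunk towards the group mean), and the posterior sd is the
    classical standard error of estimation."""
    if rng is None:
        rng = np.random.default_rng(3)
    post_mean = reliability * observed + (1 - reliability) * observed.mean()
    post_sd = observed.std(ddof=1) * np.sqrt(reliability * (1 - reliability))
    return post_mean + post_sd * rng.normal(size=(n_draws, observed.size))

# Hypothetical wellbeing sum scores and an assumed reliability of .70
scores = np.array([12.0, 18.0, 25.0, 31.0, 36.0])
draws = plausible_true_scores(scores, reliability=0.70)

print(draws.mean(axis=0).round(1))   # shrunken point estimates, not the raw sums
print(draws.std(axis=0).round(1))    # per-person uncertainty a raw sum score conceals
# Downstream analyses run on every draw, so measurement uncertainty propagates
# instead of disappearing into a single number per person.
```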
The deeper response is philosophical. Michell and Meehl would both argue that the discipline needs to decide, explicitly, whether a given construct is genuinely quantitative — and if the answer is no, or unknown, to use methods appropriate to ordered categorical data rather than methods that silently presuppose a metric that was never earned. That means ordinal regression, nonparametric analogues of familiar tests, and probabilistic modelling approaches — not ANOVA on Likert sums.
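As a minimal illustration of the nonparametric route, a rank-based test on the invented trial data from the earlier sketch uses only the ordering of the responses and is therefore immune to the spacing assumption; a proportional-odds ordinal regression would be the fuller, model-based analogue.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# The invented 7-point relief ratings from the earlier re-scoring sketch
drug    = np.repeat(np.arange(1, 8), [1, 2, 3, 4, 6, 7, 7])
placebo = np.repeat(np.arange(1, 8), [2, 4, 6, 9, 5, 3, 1])

# A rank-based test uses only the ordering of the responses, so its result
# cannot depend on any assumption about how the 7 categories are spaced.
stat, p = mannwhitneyu(drug, placebo, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.0f}, p = {p:.4f}")
```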
For the student encountering this the first time, the practical upshot is not paralysis. It is awareness: when you read a paper that computes a mean Likert score, a paired-sample t-test on wellbeing, or a Cronbach's α on a seven-item depression scale, you now know what assumptions are riding underneath. You can ask whether the authors tested them — most won't have — and whether their conclusions would survive if those assumptions were weakened. Often they would; sometimes they wouldn't. The fact that the question is askable at all is the beginning of a more honest measurement culture.
Psychology's statistics are a sophisticated edifice built on a foundation that is rarely inspected. The fallacies of opinion, ordinal structure, and the instrument are cracks in that foundation — not necessarily fatal, but worth knowing about before you trust the building with too much weight.