Measurement & Psychometrics · Part 2 of 3

Measuring what we cannot see

An introduction to reliability, Cronbach's alpha, and item response theory for psychological science.

Part A

Cronbach's alpha and the reliability problem

When a psychologist administers a test — whether it measures working memory capacity, reading ability, or anxiety — they face a fundamental problem: the number produced by the test is not the number they actually want. What they want is the person's true level of the underlying ability or trait. What they obtain is something inevitably noisier. The core project of psychometrics is to understand how far these two things diverge, and to develop measurement procedures where the gap is as small as possible.

Classical Test Theory: the measurement model

Classical Test Theory (CTT) provides the foundational framework. Its central insight is simple: every observed score is the sum of two unobservable components — the true score and measurement error.

X = T + E

Here, X is the observed score (what is actually recorded), T is the true score (the idealised, error-free quantity we wish we could measure directly), and E is error. The model makes three key assumptions about the nature of error:

1

Error has expected value zero

Across many measurements, errors are equally likely to inflate or deflate a score. There is no systematic bias. Formally, the expected value of the error term is zero.

2

Error is uncorrelated with the true score

How much error affects a score does not depend on how high or low the person's true ability is. High-ability respondents are not systematically luckier or unluckier than low-ability respondents.

3

Errors are uncorrelated across items

The error on one item does not predict the error on any other item. Guessing correctly on item 3 is entirely uninformative about what happens on items 4, 5, or 6. This is sometimes called the assumption of local independence.

These assumptions have a powerful consequence. Because errors are random and independent across items, observed score variance partitions neatly:

Var(X) = Var(T) + Var(E)

This allows a precise definition of reliability — the fundamental quantity CTT aims to estimate.

Definition: reliability

Reliability ρXX is the proportion of observed score variance attributable to true score variance:

ρXX = Var(T) / Var(X)
A value of 1.0 means all variance is true score variance (no error). A value of 0 means all variance is error.

Reliability so defined is a population parameter — not directly observable. Cronbach's alpha is one method of estimating it from data.

Cronbach's alpha: the formula and its logic

When a test comprises multiple items, the goal is usually to combine them into a total or average score. Cronbach's alpha (α) asks: how reliable is that composite? Given k items and a sample of responses, alpha is computed as:

α = (k / (k − 1)) × (1 − Σσ²i / σ²total)

Here σ²i is the variance of item i, and σ²total is the variance of the total sum score across all k items. The logic is intuitive: if items share a common signal (true score variance), then item variances will be small relative to total score variance, and alpha will be high. When items are standardised to have equal variance (unit variance), this reduces to the more transparent expression:

α = (k × r̄) / (1 + (k − 1) × r̄)

where r̄ is the mean pairwise inter-item correlation. This version makes two things immediately visible: reliability increases with the number of items k, and it increases with the average correlation between items r̄. The interactive widget below lets you explore this relationship directly.

One of the most intuitive ways to interpret alpha is via its square root. Because reliability is defined as the proportion of observed score variance attributable to true score variance, the square root of alpha estimates the correlation between observed scores and true scores (Nunnally & Bernstein, 1994). So a scale with α = 0.81 implies that observed scores correlate approximately 0.90 with the underlying true score — a reassuringly concrete statement about measurement quality. Conversely, α = 0.49 implies a correlation of only 0.70, meaning that nearly half the variance in observed scores is noise. This framing can make reliability values feel more tangible than the abstract variance-ratio definition alone.

It is important to note that alpha is an exact estimate of reliability only under a specific structural assumption called tau-equivalence: all items must carry the same true score component — in matrix terms, all factor loadings must be equal. When factor loadings differ (as is common in practice), alpha is technically a lower bound on reliability rather than an exact estimate. It is better understood as an approximation whose accuracy depends on how close the items are to being tau-equivalent.

The alpha trap: what a high value does not tell you

Despite decades of methodological critique, the dominant practice in scale development remains what Flake, Pek, & Hehman (2017) call the sum-and-alpha approach: items are written, summed into a total score, and a single alpha coefficient is reported alongside an appeal to face validity as the primary evidence that the scale measures its intended construct. No factor analysis, no dimensionality check, no DIF testing across groups. This approach is widespread precisely because it is easy and because reviewers have historically accepted it — but it leaves a cascade of assumptions untested and mistakes the absence of scrutiny for the presence of quality. The high alpha is then read as confirmation that a test is reliable, unidimensional, and fair across groups — all at once. This is sometimes called the alpha trap, and it matters because each of those three claims requires a different kind of evidence that alpha simply cannot provide.

A common but mistaken inference

"Our empathy scale produced α = 0.88, demonstrating that it is reliable, measures a single empathy construct, and is valid across genders."

Each part of this sentence is doing something alpha cannot support. A single reliability coefficient cannot confirm unidimensionality, and it cannot confirm measurement equivalence across groups. Only targeted checks — factor analysis for the first, measurement invariance testing or DIF analysis for the second — can address those questions.

Consider the empathy example more carefully. Empathy researchers have long distinguished between emotional empathy (feeling what another feels) and cognitive empathy (understanding another's perspective). A scale that mixes items tapping both facets may well produce α = 0.88 — the two facets are positively correlated, so the inter-item covariance is substantial, and alpha dutifully reports a high value. But the composite score is averaging across two conceptually distinct abilities, and downstream conclusions about individual differences in "empathy" will be systematically ambiguous. The high alpha did not warn you because it was never designed to.

More specifically, alpha provides no protection against three distinct failures, each requiring its own kind of evidence:

Multidimensionality. Positively correlated facets generate substantial inter-item covariance, so a composite that mixes conceptually distinct constructs can still return a high alpha, exactly as in the empathy example above.

Lack of measurement equivalence. Alpha says nothing about whether items function the same way across groups; only invariance testing or DIF analysis can address that.

Poor generalisability. Alpha samples only the item facet of measurement error and is blind to occasion, rater, and context effects, the sources of variance taken up next.

Why Cronbach eventually moved beyond alpha

Perhaps the most striking evidence that alpha is insufficient comes from Cronbach himself. In later work, he explicitly distanced himself from the coefficient that bears his name, arguing that it addressed only one facet of reliability — the sampling of items — while leaving untouched the many other sources of variance that contribute to measurement error in practice. These include day-to-day fluctuation in a respondent's state, rater effects in subjectively scored assessments, differences in testing mode (paper versus computer, supervised versus unsupervised), and occasion-specific context effects.

His response was Generalisability Theory (Cronbach et al., 1963), a framework that replaces the single CTT error term with a partition of variance across multiple crossed facets — items, occasions, raters, contexts. A generalisability coefficient is then computed that reflects reliability not just across items but across whatever combination of facets the researcher wishes to generalise over. The practical implication is clear: a test with high alpha may still have poor generalisability if occasion-to-occasion variability is large, if rater disagreement is substantial, or if performance is sensitive to administration context. Alpha, by design, is blind to all of these sources.

The bottom line

Alpha is a useful first screening tool. A very low alpha almost certainly signals a problem. A high alpha rules out one specific failure mode — insufficient item intercorrelation — but confirms nothing else. Researchers who report only alpha and draw conclusions about unidimensionality, measurement equivalence, or generalisability across contexts are over-interpreting a statistic that was never designed to carry that weight.

How k and r̄ jointly determine alpha

Rewriting the standardised formula reveals an important limiting behaviour:

α = 1 / (1 + (1 − r̄) / (k × r̄))

As k grows large (with r̄ held constant and positive), the fraction (1 − r̄) / (k × r̄) shrinks toward zero, so alpha approaches 1. Critically, this only happens when r̄ > 0. If items are genuinely independent (r̄ = 0), alpha equals zero regardless of how many items you administer — there is no shared signal for aggregation to amplify. Explore this in the widget below.
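
The limiting behaviour is easy to verify numerically. A minimal R sketch of the standardised formula (the function name alpha_std is ours, for illustration):

alpha_std <- function(k, rbar) (k * rbar) / (1 + (k - 1) * rbar)

alpha_std(10, 0.3)    # ~0.81
alpha_std(40, 0.3)    # ~0.94: more items, same shared signal
alpha_std(200, 0.3)   # ~0.99: alpha approaches 1 as k grows
alpha_std(40, 0)      # 0: with no shared signal, more items cannot help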

The hierarchy of measurement assumptions

Not all CTT models are equally restrictive. There is a hierarchy of assumptions about how items relate to the underlying true score, running from most to least restrictive. Understanding this hierarchy clarifies both what alpha requires and why it so frequently underestimates reliability in practice.

Model | What it requires | Reliability estimator
Parallel | Items have identical true scores and identical error variances. Extremely restrictive; rarely defensible. | Split-half (Spearman-Brown)
Tau-equivalent | Items share the same true score (equal factor loadings) but may have different error variances. | Cronbach's alpha (exact)
Essentially tau-equivalent | True scores are linearly related across items: loadings equal, but item means may differ by a constant. | Cronbach's alpha (approximate)
Congeneric | Items measure the same construct with different loadings and different error variances. The most realistic case in practice. | Omega total; alpha is only a lower bound

The practical implication is stark: most psychological scales are congeneric — items capture the same construct but with demonstrably different factor loadings. In these circumstances, alpha is not a point estimate of reliability but a lower bound, and it can underestimate true reliability by as much as 20% when loadings are heterogeneous (Green & Yang, 2009; McNeish, 2017). A researcher who obtains α = 0.65 and concludes the scale is borderline acceptable may in fact be looking at a genuinely reliable measure that alpha is undervaluing because it mistakenly assumes all loadings are equal.
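
A short simulation makes the lower-bound behaviour concrete. The sketch below assumes five congeneric items with heterogeneous loadings (all values illustrative) and compares alpha with the population reliability of the unit-weighted sum:

set.seed(1)
lambda <- c(0.9, 0.7, 0.5, 0.4, 0.3)    # heterogeneous loadings: the congeneric case
n      <- 10000
f      <- rnorm(n)                      # common factor (true score)
items  <- sapply(lambda, function(l) l * f + rnorm(n, sd = sqrt(1 - l^2)))

# Population reliability of the unit-weighted sum score:
sum(lambda)^2 / (sum(lambda)^2 + sum(1 - lambda^2))   # ~0.71

psych::alpha(items)$total$raw_alpha                   # ~0.68: below the true value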

A brief historical note

The index universally called "Cronbach's alpha" was in fact derived by Guttman (1945) as his third lambda coefficient. Cronbach's (1951) genuine contributions were demonstrating that the index equals the average of all possible split-half reliabilities — elegantly removing the need to choose a specific split — and popularising it under a more accessible notation. McDonald (1999) proposed the more accurate label Guttman-Cronbach alpha. Cronbach acknowledged that the single-error-source framework of CTT was insufficient, and by the 1960s had developed Generalisability Theory as a more complete answer. That a coefficient so widely used carries the name of a researcher who ultimately advocated moving beyond it should give practitioners pause.

Interactive — alpha simulator
Adjust the number of items and mean inter-item correlation to see how Cronbach's alpha changes.
The α readout appears on a 0–1 gauge with markers at the conventional 0.70 and 0.80 benchmarks.
Key insight

If r̄ = 0 — true independence — alpha stays at zero regardless of k. The aggregation mechanism only works when there is some shared signal present in the first place.

The aggregation principle

The widget illustrates a principle with deep roots in statistics: when many noisy measurements all contain a small fragment of the same underlying signal, aggregating them can recover that signal with useful precision. Each individual item score is dominated by noise, but summing many items allows the noise — being random and uncorrelated — to partially cancel while the common signal accumulates coherently. This is why a 40-item test can be substantially more reliable than a 10-item test, even when the average inter-item correlation is unchanged.
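
The cancellation argument can be watched directly in simulation. In the sketch below (assumed setup: k standardised items, each carrying the same small fragment of a common signal), the squared correlation between the sum score and the signal reproduces the standardised alpha formula:

set.seed(2)
n <- 5000; k <- 40; rbar <- 0.15
signal <- rnorm(n)
items  <- replicate(k, sqrt(rbar) * signal + sqrt(1 - rbar) * rnorm(n))

cor(rowSums(items), signal)^2        # ~0.88, although each item is mostly noise
(k * rbar) / (1 + (k - 1) * rbar)    # standardised alpha agrees: ~0.88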

However, this mechanism rests entirely on the CTT error assumptions. Errors must be genuinely random and uncorrelated across items. If items share a common source of unwanted variance — for example, groups of items sharing a reading passage, or later items all being affected by fatigue — then the errors are no longer independent, the cancellation logic breaks down, and alpha will overestimate reliability. The formula gives you a number, but the number is only meaningful if its assumptions are met.

Beyond Cronbach's alpha: choosing a reliability estimator

Because alpha requires tau-equivalence and most scales are congeneric, it routinely underestimates reliability. A family of modern alternatives avoids this by using the actual factor loadings from a factor analysis rather than assuming them all to be equal. The decision about which to use follows from two questions: is the scale unidimensional, and is the composite scored with equal item weights or with weights that reflect each item's individual reliability?

Decision framework

Step 1. Is the scale unidimensional? Use parallel analysis and EFA to check. If clearly multidimensional, analyse subscales separately or use omega hierarchical (ωh) to estimate reliability for the general factor while accounting for subfactors.

Step 2 (if unidimensional). Is your composite an equal-weighted sum? If yes: use omega total (ωt). Is it an optimally weighted composite where more discriminating items carry more weight? If yes: use Coefficient H — but see the small-sample caution below.

Omega total (composite reliability)

Omega total is the natural successor to alpha for congeneric scales. It requires a factor analysis of the items first, then uses the estimated factor loadings (λi) and item error variances (θi) directly:

ωt = (Σλi)² / [ (Σλi)² + Σθi ]

Items with larger loadings contribute proportionally more to the numerator. When all loadings happen to be equal (tau-equivalence), ωt and α give identical results — omega total subsumes alpha as a special case. When loadings differ, omega total is larger and more accurate. In R: MBESS::ci.reliability() or userfriendlyscience::scaleStructure().
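
The formula translates directly into code. A minimal sketch (the function name omega_total is ours; in practice the loadings and error variances come from a fitted one-factor model):

omega_total <- function(lambda, theta) sum(lambda)^2 / (sum(lambda)^2 + sum(theta))

lambda <- c(0.8, 0.6, 0.5)                   # illustrative loadings
omega_total(lambda, theta = 1 - lambda^2)    # ~0.67 for standardised items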

An honest note on the practical difference

It is worth acknowledging that in practice, omega total and Cronbach's alpha often produce similar numerical values — the gap is frequently in the range of 0.02 to 0.08 for typical psychological scales. The debate about alpha is not primarily about whether the numbers are dramatically different; it is about what the number assumes and what conclusions it licenses. Omega total is more accurate because it makes fewer assumptions, not necessarily because it always produces a strikingly higher value. The more important distinctions — whether the scale is unidimensional, whether errors are independent, whether the test works equivalently across groups — cannot be resolved by any reliability coefficient alone, however well chosen. Both alpha and omega can be substantially affected by non-normality in the score distribution, and neither substitutes for a proper factor analysis. Reliability estimation is a starting point for evaluation, not its conclusion.

Omega hierarchical and multidimensional scales

When a scale contains a dominant general factor plus one or more minor subfactors, omega hierarchical (ωh) estimates reliability for the general factor alone after partialling out variance from the group factors. This requires a Schmid-Leiman rotation to orthogonalise the general and group factors. The gap between ωt and ωh reflects variance attributable to subfactors rather than the main construct — if it is large, a composite score is pooling incommensurable sources of variance.

Coefficient H and a small-sample caution

Coefficient H (Hancock & Mueller, 2001) estimates the reliability of an optimally weighted composite, where each item's weight is proportional to its loading-to-error-variance ratio. It represents the theoretical ceiling of reliability for any linear composite of these items and is always at least as large as omega total. However, it is only appropriate to report H when the composite is actually scored using those optimal weights. Reporting H for a simple sum score will overstate the reliability of the measure you are actually using.

Small-sample caution for Coefficient H

Aguirre-Urreta, Rönkkö, & McIntosh (2018) showed through simulation that the sample estimate of Coefficient H is positively biased at small N — it systematically exceeds the population value. Simultaneously, the true reliability of the optimally weighted composite formed from sample data is negatively biased — the actual composite is less reliable than the statistic suggests. These opposing biases compound: at N = 25 with three items, the total discrepancy can exceed 25 percentage points, and it remains noticeable at N = 150. Composite reliability (omega total), by contrast, is essentially unbiased across all sample sizes examined. For typical psychological sample sizes, omega total is the safer and more accurate choice.

The standard error of measurement

Any reliability estimate immediately implies a standard error of measurement (SEM) — a concrete statement about how much an observed score typically deviates from a person's true score:

SEM = SD(X) × √(1 − ρXX)

The SEM is expressed in the same units as the test score, making it directly interpretable. A test with score SD = 8 and reliability = 0.84 has SEM = 8 × √0.16 = 3.2 points, so a 95% confidence interval for any observed score spans roughly ±6 points around it. A critical limitation of the CTT-based SEM is that it is constant across all ability levels — the same margin of error applies whether a person scores near the floor or near the ceiling. In reality, tests often measure middle-range ability more precisely than the extremes. This limitation is one of the central motivations for IRT, which replaces the single SEM with a test information function that shows exactly where along the ability continuum the test is precise and where it is not — the topic of Part B.
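
The worked example reduces to two lines of R (the helper name sem and the observed score of 100 are ours, for illustration):

sem <- function(sd_x, reliability) sd_x * sqrt(1 - reliability)

sem(8, 0.84)                           # 3.2 points
100 + c(-1, 1) * 1.96 * sem(8, 0.84)   # ~93.7 to ~106.3: 95% interval around a score of 100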

Part B

Item Response Theory: modelling the item

Classical Test Theory operates at the level of the test. It treats the item sum as the fundamental quantity and asks how reliable it is. Item Response Theory (IRT) zooms in a level further: it models how each individual item behaves, and makes explicit, testable predictions about the probability of a correct response as a function of a person's underlying ability.

This shift has significant practical consequences. In CTT, a person's score is partly a function of the difficulty of the particular items they happened to receive — an easier set of items produces higher scores, not higher ability. IRT places person ability and item difficulty on the same underlying scale, making them directly comparable and — crucially — independent of the specific items administered. This property of parameter invariance, called specific objectivity within Rasch theory, is what gives IRT its power for adaptive testing and cross-test comparisons.

The Rasch model (1PL)

The Rasch model — also called the one-parameter logistic model (1PL) — is the most parsimonious IRT model for binary (correct/incorrect) item responses. It proposes that the probability of person j answering item i correctly is determined by exactly two things: person ability (θj) and item difficulty (βi).

P(Xij = 1 | θj, βi) = exp(θj − βi) / (1 + exp(θj − βi))

Both ability and difficulty are expressed on a common logit scale — the log-odds of a correct response. The model has an elegant, interpretable property: when person ability exactly matches item difficulty (θ = β), the probability of a correct response is exactly 0.5. As ability increasingly exceeds difficulty, the probability approaches 1; as difficulty increasingly exceeds ability, the probability approaches 0.
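
The model is a single line of code. A sketch using R's built-in logistic function (plogis(x) computes exp(x) / (1 + exp(x))):

p_rasch <- function(theta, beta) plogis(theta - beta)

p_rasch(0, 0)     # 0.50: ability matches difficulty
p_rasch(2, 0)     # ~0.88: ability exceeds difficulty
p_rasch(-2, 0)    # ~0.12: difficulty exceeds ability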

When plotted, this relationship produces an S-shaped (sigmoid) curve called an Item Characteristic Curve (ICC). Under the Rasch model, every item's ICC has the same shape — the same slope at its inflection point — with items differing only in their horizontal position (their difficulty). This is the key visual signature of the Rasch model: a family of parallel S-curves, equally steep, spread across the ability axis.

Why the Rasch model validates summing

There is a direct and important mathematical link between the Rasch model and the classical practice of summing item scores. Under the Rasch model, the raw sum score ΣiXij is a sufficient statistic for person ability θj. This means the sum score contains every piece of information about ability that exists in the data — no additional information about ability can be extracted by examining the pattern of individual responses beyond what is captured in the total.

This is a remarkable result. It means that if the Rasch model accurately describes how your items work, the traditional practice of simply summing responses is not just convenient — it is statistically optimal. CTT and Rasch IRT converge on the same answer when the data conform to Rasch expectations.

Seen from this angle, fitting a Rasch model to your data is a way of empirically testing whether summing was justified in the first place. If the model fits, the sum is vindicated. If the model fails to fit, you know that something in the Rasch assumptions is wrong — items may discriminate differently, the data may be multidimensional, or there may be local dependence — and the simple sum is missing something important.

Rasch fit statistics: infit and outfit

Rasch fit statistics ask, item by item and person by person: do the observed responses conform to what the Rasch model predicts? The two most widely reported statistics are infit and outfit, both expressed as mean square residuals with an expected value of 1.0 under perfect model fit.

Statistic | How it is computed | Most sensitive to | Expected value
Infit | Information-weighted mean square of standardised residuals | Unexpected patterns near the item's difficulty level | 1.0
Outfit | Unweighted mean square of standardised residuals | Unexpected patterns far from the item's difficulty level (outlier responses) | 1.0

Values substantially above 1.0 indicate underfit: responses are more erratic than the model predicts — more noise than the Rasch mechanism would generate. Values substantially below 1.0 indicate overfit: responses are more deterministic than the model predicts (Guttman-like patterns where people always get easy items right and hard items wrong, with no "noise" at the boundaries). Common practical guidelines place acceptable fit between 0.7 and 1.3 for high-stakes assessments, with more generous bounds (0.5–1.5) sometimes used for other applications. These are guidelines, not rigid rules, and should be interpreted in context.
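
In R, infit and outfit can be obtained from a fitted Rasch model. A sketch with the eRm package, where resp stands for an assumed 0/1 response matrix:

fit <- eRm::RM(resp)                 # Rasch model via conditional maximum likelihood
pp  <- eRm::person.parameter(fit)    # person ability estimates
eRm::itemfit(pp)                     # infit and outfit mean squares per item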

Interactive — item characteristic curves
Compare how items behave under the Rasch (1PL) and 2PL models.
Rasch / 1PL view:

Under the Rasch model, all item characteristic curves have the same slope — items differ only in their horizontal position (difficulty). The dots show where each curve crosses P = 0.5, which is where ability and difficulty are matched. The raw sum score is sufficient for person ability precisely because of this uniform slope.

The 2PL model: when items discriminate differently

The Rasch model constrains every item to have the same discriminating power — the same slope on its ICC. The two-parameter logistic (2PL) model relaxes this by adding an item-specific discrimination parameter αi:

P(Xij = 1 | θj, αi, βi) = exp(αi(θj − βi)) / (1 + exp(αi(θj − βi)))

The discrimination parameter αi controls the steepness of the ICC. A high value (say, 2.0) produces a steeply rising curve — the item sharply separates people whose ability is just above from those just below the difficulty threshold, making it highly informative about ability at that point. A low value (say, 0.5) produces a shallow, nearly flat curve — the item is only weakly related to ability and discriminates poorly. Select the 2PL view in the ICC widget above to see how curves of different slopes spread across the ability axis.
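
A sketch of the 2PL response function, extending the Rasch version above with the discrimination parameter (function and argument names are ours):

p_2pl <- function(theta, a, b) plogis(a * (theta - b))

p_2pl(0.5, a = 2.0, b = 0)   # ~0.73: a steep item separates sharply near its difficulty
p_2pl(0.5, a = 0.5, b = 0)   # ~0.56: a shallow item barely distinguishes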

The practical implication is important: if items vary substantially in discrimination, then equal weighting is suboptimal. Items with high discrimination carry more information about ability and ideally should receive proportionally more weight. Simply summing scores — which treats each item as equally informative — will be less precise than the weighted combination implied by the 2PL estimates.

Deciding between 1PL and 2PL

Whether Rasch or 2PL is preferred is an empirical question answered by a likelihood ratio test (LRT). The 2PL adds one extra parameter per item (the discrimination αi). If the 2PL fits significantly better — assessed against a χ² distribution with degrees of freedom equal to the number of items — then discriminations differ and the Rasch constraint is untenable. If the difference is not significant, parsimony favours the Rasch model and the sum score remains fully justified.
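
One way to run this comparison is with the mirt package. A sketch, with resp again standing for an assumed 0/1 response matrix:

library(mirt)
m_rasch <- mirt(resp, 1, itemtype = "Rasch")
m_2pl   <- mirt(resp, 1, itemtype = "2PL")
anova(m_rasch, m_2pl)    # likelihood ratio test: if significant, prefer the 2PL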

Item and test information functions

IRT provides something CTT cannot: a way to say precisely where on the ability continuum a test (or individual item) is most accurate. The item information function Ii(θ) quantifies how much statistical information an item contributes at each ability level. For the 2PL model:

Ii(θ) = αi² × Pi(θ) × (1 − Pi(θ))

An item provides maximum information at the ability level where Pi(θ) = 0.5 — where ability and difficulty are matched. The product P(1 − P) is maximised at 0.5 and falls to zero at the extremes. The total test information function is the sum of all item information functions, and its reciprocal gives the standard error of measurement at each ability level:

I(θ) = Σi Ii(θ)      SEM(θ) = 1 / √I(θ)

This is one of IRT's most practically valuable features. Instead of a single reliability coefficient that applies globally across all ability levels, IRT tells you exactly where on the ability scale the test is precise and where it is not. A well-designed test concentrates its items — and therefore its information — where precision is most needed.
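
Both functions are a few lines of R. The sketch below computes the test information curve and the ability-dependent SEM for three illustrative 2PL items (all parameter values are ours):

info_2pl <- function(theta, a, b) {
  p <- plogis(a * (theta - b))
  a^2 * p * (1 - p)                  # item information at ability theta
}

a <- c(1.8, 1.2, 0.7); b <- c(-1, 0, 1.5)    # illustrative item parameters
theta_grid <- seq(-3, 3, by = 0.1)
test_info  <- rowSums(mapply(info_2pl, a = a, b = b,
                             MoreArgs = list(theta = theta_grid)))
sem_theta  <- 1 / sqrt(test_info)            # SEM now varies across the ability range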

Rasch person and item reliability

The Rasch model produces its own reliability-like statistics that are distinct from alpha but serve analogous roles. These are derived from the spread of person (or item) estimates relative to their estimation precision, and they offer a useful bridge between the CTT and IRT frameworks.

Statistic | What it quantifies | Benchmark
Person reliability | The signal-to-noise ratio of person ability estimates: true ability variance divided by mean-square measurement error. Analogous to the CTT reliability coefficient. | > 0.8 desirable
Person separation | The range of person estimates expressed in error units — how many statistically distinct ability strata the test can discriminate. A separation of 2 corresponds roughly to three distinguishable strata; 3 to roughly four. | > 2.0 desirable
Item reliability | How replicable the rank order of item difficulties is across samples — the confidence that the difficulty hierarchy would reproduce in a new sample drawn from the same population. | > 0.8 desirable
Item separation | How well spread the items are along the difficulty continuum in error units. Low item separation suggests items are clustered around a narrow difficulty range, limiting the test's ability to target different ability levels. | > 2.0 desirable

An important nuance is that person reliability behaves like CTT reliability in one familiar way: it will be larger for heterogeneous samples (wide ability range) and longer tests, and smaller for homogeneous samples. This means person reliability is partly a property of the sample, not just the instrument — exactly the same limitation that motivated reliability corrections for restriction of range in CTT contexts. Item separation, by contrast, is more informative about test design: low item separation signals that the test lacks items targeting the extremes of the ability range, which should prompt targeted item development.

Part C

Before you fit a model: what to check

Neither CTT nor IRT is automatically appropriate for any given dataset. Both frameworks rest on assumptions that must be evaluated — not merely declared — before the resulting scores and estimates can be interpreted with confidence. The checks described in this section should be understood as a coherent programme of model evaluation, each addressing a different facet of the same underlying question: are these data consistent with the measurement model I intend to apply?

1. Unidimensionality

The most fundamental assumption underlying any single-number composite — whether a CTT sum score or an IRT ability estimate — is that the items primarily reflect variation along a single latent dimension. If items tap two or more substantially independent factors, a single sum combines incommensurable quantities, and the resulting score resists straightforward interpretation. Evaluating dimensionality is therefore the logical first step.

Parallel analysis

Compares the eigenvalues of the item correlation matrix against eigenvalues derived from random data of the same dimensions. Factors with eigenvalues exceeding the random baseline are plausibly real. More conservative and more defensible than the traditional Kaiser rule (eigenvalue > 1).

psych::fa.parallel(data)

Exploratory factor analysis

Examine factor loadings and the scree plot. A clearly dominant first factor — with substantially smaller subsequent factors — provides evidence for approximate unidimensionality. Also examine the ratio of the first to second eigenvalue.

psych::fa(data, nfactors = 1)

Confirmatory factor analysis

Formally test a single-factor structural model. Key indices: RMSEA < 0.06, CFI > 0.95, SRMR < 0.08 are common guidelines for adequate fit. Examine residual correlations for evidence of remaining local dependence.

lavaan::cfa(model, data = data)

PCAR (Rasch-specific)

Principal Components Analysis of Rasch Residuals. After the Rasch dimension is extracted, residuals should be structureless noise. A first-contrast eigenvalue below 2.0 is a practical guideline suggesting no substantial secondary dimension, though this threshold is debated and should not be applied mechanically.

eRm::PCM() + residual PCA

2. Local independence

Local independence states that, conditional on a person's ability level θ, their responses to individual items should be statistically independent. Knowing that someone answered item 3 correctly should tell you nothing more about their probability of answering item 7 correctly, once their ability is taken into account. Violations — called local item dependence (LID) — inflate alpha, distort IRT parameter estimates, and cause the effective test length to be smaller than the nominal item count.

The most common cause of LID is the presence of testlets: groups of items sharing a common stimulus such as a reading passage, a figure, or a scenario. Items within a testlet share stimulus-specific variance that has nothing to do with the target construct. Other causes include highly similar item wordings and sequential dependencies in item content.

Yen's Q3 statistic

The most widely used diagnostic for local dependence. Q3 is the pairwise Pearson correlation between the standardised residuals of two items after the IRT model has been fitted. Under local independence, these residuals should be uncorrelated. A corrected Q3 greater than approximately 0.2 is conventionally taken as evidence of problematic local dependence between a pair of items. Flagged pairs should be examined for shared content, format, or context. The raw Q3 has an expected negative value due to sum-score conditioning; correction for this bias is recommended before applying the threshold.
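
Q3 is available directly from standard IRT software. A sketch with mirt, where resp is again an assumed 0/1 response matrix:

library(mirt)
fit <- mirt(resp, 1, itemtype = "2PL")
residuals(fit, type = "Q3")    # pairwise Q3 matrix; inspect pairs above ~0.2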

3. Monotonicity

IRT models assume that the probability of a correct response is a monotonically non-decreasing function of ability: higher ability should never make a correct response less likely. This seems self-evident, but can fail in practice — for example, when high-ability respondents detect an ambiguity that lower-ability respondents miss, when a distractor is differentially attractive to more knowledgeable respondents, or when item wording creates interpretation differences across the ability range.

Mokken scale analysis provides a nonparametric framework for evaluating monotonicity without committing to a specific parametric model. The item scalability coefficient H quantifies how well each item conforms to the expected stochastic ordering. Conventional thresholds: H < 0.3 — item is a poor fit and should be revised or removed; 0.3 ≤ H < 0.5 — acceptable; H ≥ 0.5 — strong scalability.

mokken::coefH(data)
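
mokken also provides a direct check of manifest monotonicity, which complements the scalability coefficients:

mono <- mokken::check.monotonicity(data)
summary(mono)    # reports monotonicity violations item by item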

4. Differential item functioning

A test is only fair if it measures the same construct in the same way across different groups of respondents — for example, across age groups, genders, or first-language backgrounds. If two people have the same true ability but one group systematically performs differently on a given item due to group membership rather than ability, that item exhibits differential item functioning (DIF). DIF items compromise the validity of group comparisons and should be investigated before drawing conclusions that involve between-group differences.

5. Practical preconditions

Sample size

IRT estimation is sample-intensive. As rough guidance: Rasch / 1PL typically requires n ≥ 200; 2PL requires n ≥ 500; 3PL requires n ≥ 1000. Small samples yield unstable parameter estimates and may produce spuriously good or poor fit statistics. Note also that Coefficient H / maximal reliability is substantially biased upward at n < 150, making omega total the preferred reliability estimate for smaller studies.

Missing data

IRT estimation assumes data are missing at random (MAR) or missing completely at random (MCAR). Systematic missingness — for instance, due to test speededness or item non-response related to ability — can seriously bias parameter estimates. Examine the pattern before proceeding.

Speededness

IRT models assume that item non-response reflects inability, not time pressure. If a substantial proportion of respondents fail to reach later items, the test is effectively speeded and standard IRT models are inappropriate without modification.

Item format consistency

Binary items (correct/incorrect) and polytomous items (rating scales, partial credit) require different model families. Mixing formats without accommodation in the model produces incorrect estimates. Ensure the model chosen matches the response format of all items.

Putting it together: a practical workflow

The three parts of this resource form a logical hierarchy. Alpha (Part A) answers the most basic question: is the composite score adequately consistent? IRT (Part B) goes further: does the data conform to a specific model of how ability generates responses — and if so, what is that structure? The checks in Part C are the preconditions that must be satisfied before either of those questions can be answered coherently. Skipping Part C and proceeding directly to model fitting is like checking whether a ladder is well-constructed without first checking whether it is resting against the right wall.

1

Evaluate dimensionality

Parallel analysis + EFA. If the data are clearly multidimensional, analyse subscales separately before fitting any composite model.

2

Examine local independence

Fit a preliminary model and inspect Q3 residual correlations. Investigate item pairs above the threshold for shared content or context.

3

Check monotonicity

Mokken H coefficients. Items with H < 0.3 should be revised or removed before parametric IRT fitting.

4

Fit Rasch / 1PL

Examine item and person fit statistics (infit, outfit). Identify misfitting items and determine whether misfit reflects item problems, person aberrance, or model limitations.

5

Test Rasch against 2PL

Likelihood ratio test. If 2PL is significantly better, item discriminations vary and equal item weighting is suboptimal.

6

Evaluate DIF if group comparisons are planned

Test for differential item functioning before drawing conclusions about between-group ability differences.