A High Score Is Not Evidence of Understanding
June 11, 2026
The Score Is Not the Signal
A 2023 study published in Frontiers in Physiology produced a finding that should concern every faculty member running summative exams: medical students who could recognize the correct answer on a multiple-choice question were 47 percentage points less likely to demonstrate conceptual knowledge of the underlying physiology when tested in a different format. Same material. Different question structure. Completely different result.
This is not a quirk of medical education. It reflects a structural flaw in how most institutions measure learning.
What Multiple-Choice Questions Actually Measure
Multiple-choice questions are the dominant format for large-scale assessment in higher education, and their practical advantages are real: they score quickly, scale easily across large cohorts, and produce consistent numerical outputs. A 2023 qualitative study of university teachers found that educators broadly believe MCQs can test higher-order cognition, but only up to the "apply" and "analyse" levels of Bloom's taxonomy. Synthesis, evaluation, and conceptual explanation remained outside what most faculty thought MCQs could reliably reach.
That ceiling matters. The most valuable signal in student learning assessment is not whether a student selected the correct option. It is whether the student can reconstruct the reasoning behind it.
When a student picks the right answer, it is impossible to determine whether they understood the concept or recognized a pattern from repeated exposure to similar items. Those two cognitive states produce identical outputs on a standard exam, but they diverge sharply in transfer, retention, and professional application.
The Cognitive Engagement Gap
Generative AI has made this problem more visible, but it did not create it. A 2026 paper in Education Sciences introduced the term "cognitive engagement gap" to describe a condition that existed in passive assessment long before AI tools arrived: successful assessment completion no longer reliably indicates learning or epistemic development.
The authors argue that assessment historically enforced cognitive engagement by structural necessity. You had to think to produce an answer. That assumption began eroding well before AI, because even a well-designed distractor can be eliminated through familiarity rather than reasoning. The format was never a guarantee of engagement. It was always a proxy.
Recall-based assessment treats recognition as evidence of understanding. The research consistently shows it is not. Students who perform well on familiar item banks often demonstrate significantly lower conceptual knowledge when the same content appears in unfamiliar formats or requires active articulation.
Why This Gap Survives Institutional Review
Assessment teams and accreditation bodies typically evaluate whether exam questions map to learning outcomes, whether items are properly calibrated for difficulty, and whether scores distribute as expected. None of these review processes can detect whether a passing score reflects genuine understanding or successful pattern recognition.
A 2025 commentary in Medical Science Educator described the paradox precisely: higher education has produced students who know more and understand less. The problem emerges from an assessment infrastructure that rewards performance on the measured format rather than mastery of the underlying concept.
A student who scores 78 on a multiple-choice exam and a student who scores 78 and can accurately explain every concept behind those questions are reported identically. The institution cannot tell them apart.
This is exactly what conceptual mastery assessment is designed to solve.
Assessment That Reveals What Scores Cannot
The alternative is not to eliminate multiple-choice questions. It is to pair them with an instrument that verifies the understanding behind the score. When a student must explain a concept, correct a specific misconception, and defend their reasoning under scrutiny, the cognitive process is fundamentally different from answer selection. The explanation either holds or it breaks down. There is no correct-adjacent option to select.
This is the logic behind teach-back assessment: explaining a concept to something that does not yet understand it requires a student to hold the full structure of an idea, locate the fault in a flawed version, and reconstruct it accurately. Pattern recognition cannot carry a student through that process.
Unlike a standard formative assessment platform that monitors whether a student selected the correct answer, Axiom Flow measures whether a student can explain and defend conceptual understanding under active examination. Its AI model, Sam, begins each assessment initialized with specific misconceptions. Sam updates its understanding based only on the quality of the student's teaching. If the explanation is vague, partial, or structurally incorrect, Sam remains wrong, and the score reflects that precisely.
The final MCQ component, taken by Sam using only what it was taught, confirms whether the teaching was accurate enough to produce correct application. The two-stage structure produces a record no single-format exam can: evidence that the student both understood the concept and could communicate it with precision.
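The two-stage logic described above can be sketched in a few lines of Python. This is a toy illustration, not Axiom Flow's implementation: the class ToyLearner, the function final_mcq_score, the QUALITY_THRESHOLD value, and the numeric quality ratings are all assumptions made up for the example. In particular, the sketch takes explanation quality as a given number, whereas judging that quality is the hard part of any real system.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed cut-off for "good enough" teaching; chosen only for illustration.
QUALITY_THRESHOLD = 0.7


@dataclass
class ToyLearner:
    """Hypothetical stand-in for a taught model that starts out wrong."""
    beliefs: Dict[str, str]  # concept -> statement the learner currently holds

    def be_taught(self, concept: str, explanation: str, quality: float) -> None:
        # The learner changes its belief only when the explanation is strong
        # enough; a vague or partial explanation leaves the misconception intact.
        if quality >= QUALITY_THRESHOLD:
            self.beliefs[concept] = explanation


def final_mcq_score(learner: ToyLearner, answer_key: Dict[str, str]) -> float:
    """Stage two: the learner answers using only what it was taught."""
    correct = sum(
        1 for concept, answer in answer_key.items()
        if learner.beliefs.get(concept) == answer
    )
    return correct / len(answer_key)


answer_key = {
    "osmosis": "water moves toward higher solute concentration",
    "diffusion": "particles move down a concentration gradient",
}

# Stage one starts from seeded misconceptions.
learner = ToyLearner(beliefs={
    "osmosis": "solute moves toward water",
    "diffusion": "particles move up a concentration gradient",
})

# One clear explanation and one vague one (quality ratings are assumed inputs).
learner.be_taught("osmosis", answer_key["osmosis"], quality=0.9)
learner.be_taught("diffusion", answer_key["diffusion"], quality=0.4)

score = final_mcq_score(learner, answer_key)
print(score)
```

In this sketch the student who explained only one of the two concepts well ends with a learner that still holds the second misconception, so the final score exposes the gap that a recognition-only exam would hide.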
The Institutional Stakes
Every institution running summative exams carries some risk that scores and mastery have diverged. That risk grows with the familiarity of the item pool, the length of time a course has run, and the size of the class: each raises the probability that some high-performing students owe their results to recognition rather than understanding.
This is not a question of academic integrity. It is a question of what assessment actually measures, and whether institutions are equipped with instruments that can tell the difference.