Sebastián Sarmiento

Framework · Measurement 2023

A mastery model that survived a real curriculum

How we replaced a checklist of standards with a mastery model that teachers could act on — and that statistics could defend.

Confidentiality — Institution, datasets, and proprietary tooling are abstracted. What follows is the transferable decision logic, not internal exposure.

01 — Context

A mathematics program had adopted a mastery framing on paper, but operated as a coverage checklist in practice. Teachers marked standards “done”; the system reported green; and nobody could say what a green cell actually entitled you to believe about a student.

The brief looked like a reporting problem. It was a measurement problem wearing a reporting costume.

02 — The real decision

The decision was not “which dashboard?” It was: what is the unit of mastery, and what evidence licenses the claim that a student holds it? Everything downstream — item design, reporting, intervention — is determined by that one answer.

A mastery model is a claim about a learner. If you can’t say what would make the claim false, you are not measuring — you are decorating.

03 — My role

I led the curricular strategy and owned the measurement logic end to end — defining the constructs, designing and validating items, and specifying how the model would be read by teachers and rendered by the product. I worked between the classroom, the psychometrics, and the engineering, and translated in all three directions.

04 — Constraints

01 · Minutes, not hours. Teachers had to interpret a result at a glance.
02 · Finite item volume. We could not test every sub-skill directly.
03 · Audit-ready. Reporting had to survive a skeptical head of department.

05 — The logic used

We modeled each mastery target as a latent construct with an explicit evidence model, calibrated with Item Response Theory so that difficulty and discrimination were estimated properties of items, not opinions. Where targets were sequential, Bayesian Knowledge Tracing carried belief forward from one assessment to the next instead of resetting it each time.

construct        → evidence model → item bank
response         → IRT calibration → ability estimate
prior × evidence → BKT posterior   → mastery claim
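
The three rows above can be sketched in a few lines. This is a minimal illustration, not the production model: it assumes a two-parameter logistic (2PL) IRT form, which the mention of difficulty and discrimination suggests but does not confirm, and the standard BKT update with slip, guess, and learn parameters. The names `p_correct` and `bkt_update` and every parameter value are hypothetical.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b (assumed model form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def bkt_update(prior: float, correct: bool,
               slip: float = 0.1, guess: float = 0.2,
               learn: float = 0.15) -> float:
    """One Bayesian Knowledge Tracing step.

    prior : P(skill known) before this response
    slip  : P(wrong answer | skill known)
    guess : P(right answer | skill not known)
    learn : P(not known -> known) after this practice opportunity
    All default values are illustrative, not calibrated.
    """
    if correct:
        num = prior * (1.0 - slip)
        den = num + (1.0 - prior) * guess
    else:
        num = prior * slip
        den = num + (1.0 - prior) * (1.0 - guess)
    posterior = num / den                      # Bayes rule on the observation
    return posterior + (1.0 - posterior) * learn  # chance of learning afterwards


# Belief is carried forward across responses instead of being reset.
belief = 0.3  # hypothetical starting prior
for observed in [True, True, False, True]:
    belief = bkt_update(belief, observed)
```

Because `belief` is passed from one response to the next, a single miss lowers the estimate without discarding what earlier evidence established.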

06 — Alternatives considered

A raw percent-correct cutoff was simplest but conflated easy and hard evidence. A pure machine-learned classifier predicted well but couldn’t be explained to a teacher or defended in an audit. We chose the model we could argue for, accepting a small cost in raw fit for a large gain in legibility and accountability.
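
A toy example shows why percent-correct conflates easy and hard evidence. The sketch below assumes a 2PL model and a simple grid-search maximum-likelihood estimate; the item parameters are synthetic and the function names are hypothetical. Two students both score 3 of 4, one on easy items and one on hard items.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood ability estimate.

    responses : list of 0/1 outcomes
    items     : list of (discrimination, difficulty) pairs
    """
    best_theta, best_ll = lo, -math.inf
    for k in range(steps):
        theta = lo + k * (hi - lo) / (steps - 1)
        ll = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if x else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


# Synthetic item banks: same discrimination, difficulties shifted by 2 logits.
easy_items = [(1.0, -1.5), (1.0, -1.0), (1.0, -0.5), (1.0, 0.0)]
hard_items = [(1.0, 0.5), (1.0, 1.0), (1.0, 1.5), (1.0, 2.0)]
responses = [1, 1, 1, 0]  # both students: 3 of 4 correct

percent_correct = sum(responses) / len(responses)
theta_easy = estimate_ability(responses, easy_items)
theta_hard = estimate_ability(responses, hard_items)
```

Both students share the same `percent_correct`, yet `theta_hard` comes out well above `theta_easy`, which is the distinction a raw cutoff cannot see.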

07 — The system designed

The output was not a dashboard but a small, honest object: a mastery claim, the evidence behind it, and a stated confidence — designed so a teacher could disagree with it intelligently.

Signature module · The mastery claim, dissected

This student can model a linear relationship from a table — not just complete the worksheet that contained one.

Claim
States the construct, the conditions, and the “again”.
Evidence
Four items across two difficulties, plus one transfer task.
Confidence
High — held out, not self-confirmed.
Fig. 1 — Reconstructed from the production card; axes relabeled, data synthetic.
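
One way to picture the card as data is a small immutable record. This is a sketch under stated assumptions, not the production schema: the class names, field names, and difficulty labels are all hypothetical, and the example values are synthetic, mirroring only what the card above states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    item_id: str       # hypothetical identifier
    difficulty: str    # hypothetical labels, e.g. "core" or "stretch"
    is_transfer: bool  # True for a task in an unfamiliar context


@dataclass(frozen=True)
class MasteryClaim:
    construct: str     # what the student is claimed to be able to do
    conditions: str    # under which conditions the claim holds
    evidence: List[Evidence]
    confidence: str    # e.g. "high"
    held_out: bool     # True if checked on items not used to form the claim

    def summary(self) -> str:
        """One-line reading of the claim and the evidence behind it."""
        n_transfer = sum(e.is_transfer for e in self.evidence)
        n_items = len(self.evidence) - n_transfer
        levels = len({e.difficulty for e in self.evidence if not e.is_transfer})
        return (f"{self.construct} ({self.conditions}): "
                f"{n_items} items across {levels} difficulties, "
                f"{n_transfer} transfer task(s); confidence {self.confidence}")


claim = MasteryClaim(
    construct="model a linear relationship",
    conditions="from a table",
    evidence=[
        Evidence("i1", "core", False), Evidence("i2", "core", False),
        Evidence("i3", "stretch", False), Evidence("i4", "stretch", False),
        Evidence("t1", "stretch", True),
    ],
    confidence="high",
    held_out=True,
)
```

Keeping the evidence attached to the claim is what lets a teacher see why the model believes something, and disagree with it on specifics.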

08 — Abstracted artifacts

09 — Validation & quality criteria

  • Items passed fit statistics and were reviewed for construct relevance, not just difficulty.
  • Mastery claims were checked against held-out performance, not against themselves.
  • A claim a teacher couldn’t act on was treated as a defect, not a feature.
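
The held-out check in the second criterion can be sketched as a comparison between groups. The source does not describe the actual procedure, so this is only one plausible shape: the function name, the 0.8 threshold, and the records are all hypothetical and synthetic.

```python
def held_out_check(records, threshold=0.8):
    """Compare held-out performance for claimed vs. unclaimed mastery.

    records   : list of (mastery_probability, held_out_score) pairs, where
                held_out_score is the fraction correct on items that were
                NOT used to form the mastery claim
    threshold : probability at or above which mastery is claimed (assumed)
    """
    claimed = [score for prob, score in records if prob >= threshold]
    unclaimed = [score for prob, score in records if prob < threshold]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return {
        "claimed_mean": mean(claimed),
        "unclaimed_mean": mean(unclaimed),
        "gap": mean(claimed) - mean(unclaimed),
    }


# Synthetic records, for illustration only.
records = [(0.92, 0.9), (0.85, 0.8), (0.88, 1.0),
           (0.40, 0.3), (0.65, 0.5), (0.30, 0.4)]
result = held_out_check(records)
```

A claim earns its confidence only if the claimed group clearly outperforms the unclaimed group on items the model never saw; a gap near zero would mean the claims were confirming themselves.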

10 — Reflections

The hardest work was deciding what not to measure. A smaller set of well-evidenced claims beat a complete map of guesses, and that discipline is the part that has transferred to every measurement problem I have touched since.