Capability-floor evaluation

Which AI model is adequate for which task — and when do local models catch up

Purpose

For each task class in the STX physics setting, identify the smallest and cheapest model that performs adequately. The result drives production routing decisions, cost projections, and the migration timeline for self-hosted open-weight models.

Deliverable. A versioned eval dataset + runner + a periodically-updated capability-floor report. Reusable beyond M’s contract — successors re-run quarterly as new models release.

The picture

For each task we run, we set a minimum score threshold the AI must hit to be useful. We then watch how each of the four model tiers (frontier cloud API, self-hosted server, server-local, on-device) tracks against that threshold over time. When a tier crosses the threshold, that is our estimated migration date for routing the task to that tier.

[Figure: AI capability by tier over time, with time-lag annotations showing when each tier crosses an 80% capability threshold. Four rising curves over 2023–2027 for the four tiers (AI API, server, server-local, on-device) against an 80% horizontal threshold. Vertical drop-lines mark when each tier crosses, with months of lag labelled: Tier 1 mid-2025, Tier 2 Q1 2026 (~8 months later), Tier 3 early 2026 (~12 months later), Tier 4 late 2027 (~28 months later).]

Catch-up timelines — what waiting buys you. Each tier reaches today’s frontier on a different cadence:

  • Tier 2 (server) lags Tier 1 by ~8 months on hard reasoning. Efficient MoE architectures (DeepSeek V4) have closed most of a gap that used to be larger. Wait two release cycles, get a comparable open-weight model.
  • Tier 3 (server-local) lags by ~12 months. The constraint is what fits on a single workstation; aggressive quantisation and smaller capable architectures (Qwen 3.5 27B at 85.5%, Gemma 4 31B at 84.3%) keep narrowing this. Wait a year, run frontier-equivalent on a 128 GB Mac.
  • Tier 4 (on-device) currently lags by ~24–28 months on hard reasoning, and the lag is mostly real hardware constraint, not self-imposed. The bottleneck is device RAM. The iPhone 15 Pro and iPhone 16 lineup have 8 GB total (Tom’s Guide), and Apple reserves ~4 GB for AI (The Register). That budget supports a ~3 B-parameter model at 2/4-bit quantisation — which is exactly what the Apple Intelligence on-device foundation model uses — but not a 7 B or 14 B model. Analysts argue full on-device LLMs need 20 GB+ (Wccftech analyst commentary). The iPhone 17 Pro is rumoured to ship with 12 GB (Tom’s Guide) — which would open headroom for ~7 B models. There are also research paths (Apple’s flash-storage weight streaming) that may shorten the lag without waiting for hardware refreshes. Treat the late-2027 crossing as a working estimate; update as hardware refreshes land.
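
The memory arithmetic behind the Tier 4 bullet can be sketched as follows. This is a minimal weights-only estimate (parameters × bits per weight ÷ 8); the function names and the 1 GB allowance for KV cache and runtime overhead are illustrative assumptions, not measured values.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits_budget(params_billion: float, bits_per_weight: float,
                budget_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed overhead fit in the memory budget.

    overhead_gb (KV cache, activations, runtime) is a hypothetical allowance.
    """
    needed = weight_memory_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= budget_gb


if __name__ == "__main__":
    # ~4 GB is the on-device AI budget quoted in the text.
    for params in (3, 7, 14):
        need = weight_memory_gb(params, 4)
        print(f"{params}B at 4-bit: ~{need:.1f} GB weights, "
              f"fits in 4 GB: {fits_budget(params, 4, 4.0)}")
```

Under these assumptions a 3 B model at 4-bit needs about 1.5 GB for weights and fits in the ~4 GB budget, while a 7 B model needs about 3.5 GB and no longer fits once overhead is added.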

Three things this plot makes concrete:

  • Some tasks always want state-of-the-art — high thresholds (complex multi-step physics reasoning, nuanced pedagogical scaffolding) may stay on the cloud frontier indefinitely.
  • Other tasks have lower thresholds — summarisation, formatting, structured extraction, simple Q&A. For these, open-weight catches up in months, and on-device follows a year or two behind.
  • The timeline is estimable. We’re not guessing when to migrate — we’re tracking published benchmark trajectories and projecting against our task-specific thresholds. The projection sharpens every time a new model releases.
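
The projection described in the last bullet can be illustrated with a minimal sketch: fit a straight line to a tier's benchmark scores over time and solve for when it reaches the task threshold. The function names are hypothetical, the data points are made up for illustration, and the linear trend is a simplifying assumption (real scores saturate near 100%), so the actual report may well use a different model.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_time(months: Sequence[float], scores: Sequence[float],
                  threshold: float) -> Optional[float]:
    """Month at which the fitted trend reaches the threshold, or None if flat/falling."""
    slope, intercept = fit_line(months, scores)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


if __name__ == "__main__":
    # Illustrative data only: months since first measurement, score in %.
    months = [0, 6, 12, 18]
    scores = [40, 50, 60, 70]
    print(crossing_time(months, scores, threshold=80))
```

Re-running the fit each time a new model score is published is what makes the estimate sharpen over time.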

Public benchmarks we draw on

The eval doesn’t reinvent the wheel: it composes published benchmarks with STX-physics-specific task data. Primary sources for the model panel and trajectory estimation:

| Benchmark | What it measures | Relevance |
| --- | --- | --- |
| GPQA | Graduate-level physics, chemistry, biology questions | Direct upper-bound proxy for STX physics reasoning |
| MATH | Competition-level mathematics | Closely tracks formula-and-derivation tasks |
| MMLU | Massive multitask language understanding | Broad knowledge ceiling |
| MMMU | College-level multimodal understanding (figures, diagrams) | Tracks the multimodal-input path |
| HumanEval / MBPP | Code generation | Relevant for the code-execution MCP tool |
| Open LLM Leaderboard (HuggingFace) | Aggregated open-weight comparisons | Live tracker for tier 3 candidates |
| Artificial Analysis | Cross-provider model comparisons (price, speed, quality) | Live tracker for tier 1 and tier 2 |
| Epoch AI | Long-run AI capability trends | Source for the trajectory projections |
| AILANG benchmarks | Project-internal panel runs across multiple providers | Primary source for AIPLA’s per-task scores |

External benchmarks calibrate the model panel; the AIPLA eval adds physics-specific task definitions and the threshold scores.

Real numbers — GPQA Diamond, the closest public proxy

GPQA Diamond is graduate-level physics, chemistry, and biology — the closest public benchmark to upper-secondary physics reasoning, and harder than most STX physics tasks will require. A snapshot of current scores (verified 2026-05-15) across the four tiers:

| Tier | Model | Hardware needed | GPQA Diamond | Source |
| --- | --- | --- | --- | --- |
| 1: AI API | Claude Opus 4.7 | None (API; cloud-agnostic: Anthropic / Bedrock / Vertex) | 94.2% | Vellum leaderboard |
| 1: AI API | Gemini 3.1 Pro Preview | None (Vertex AI, EU regions available) | 94.1% | Artificial Analysis |
| 1: AI API | GPT-5.5 (xhigh) | None (OpenAI / Azure) | 93.5% | Artificial Analysis |
| 1: AI API | Gemini 2.5 Pro | None (Vertex AI EU; prototype default) | ~85–86% | Google model card |
| 2: Self-hosted server | DeepSeek V4 Pro (1.6T MoE / 49B active) | Cluster: 8× H100 / 4× H200 (~800+ GB VRAM at Q4_K_M; NVLink) | 90.1% | Framia benchmarks |
| 2: Self-hosted server | DeepSeek V4 Flash (284B MoE / 13B active) | ~4× H100 (~280 GB VRAM at FP8); top open-weight by OpenRouter usage | 79.0% | Codersera deep-dive |
| 3: Server-local | Qwen 3.5 27B | ~30 GB at Q6; fits on 128 GB Mac or single H100/A100 | 85.5% | Awesome Agents |
| 3: Server-local | Gemma 4 31B | ~32 GB at Q6; fits on 128 GB Mac or single H100 | 84.3% | ai.rs roundup |
| 3: Server-local | Phi-4 (14B) | ~14 GB FP8; single consumer GPU (RTX 4090) or high-end laptop | 56.1% | TokenMix |
| 4: On-device | Apple Intelligence (~3B) | iPhone 15 Pro+, iPad M-series, Apple Silicon Mac | n/a (not GPQA-benchmarked) | Apple |
| 4: On-device | Gemini Nano (3.25B) | Pixel 8+, Samsung S24+, supporting Chromebooks | n/a (not GPQA-benchmarked) | Google |

GPQA Diamond is a graduate-level science benchmark — Tier 4 models aren’t designed for that level of reasoning, which is why they don’t have published scores on it. They serve light text tasks (summarisation, formatting, short Q&A), which is what gets routed to them in practice.

Four observations:

  1. The cloud frontier sits at ~94% on graduate science. Whatever threshold AIPLA picks for its hardest tasks, Tier 1 already clears it.
  2. Tier 2 (UCPH-hosted server) is about 4 points behind the frontier. DeepSeek V4 Pro at 90.1% is essentially equivalent to cloud for AIPLA purposes. With UCPH IT providing GPU cluster hosting, AIPLA can serve near-frontier capability without sending any data to a cloud provider at all.
  3. Tier 3 (server-local workstation) clears 80% comfortably. Qwen 3.5 27B and Gemma 4 31B both land at 84–86%. A 128 GB Mac, a single H100, or a small departmental server runs these. No GPU cluster needed.
  4. Tier 4 (on-device) is the remaining gap. Smaller models trail at 40–56% on this benchmark and the truly tiny on-device models (Apple Intelligence, Gemini Nano) aren’t designed for graduate science at all. They serve light tasks, which is what gets routed to them.

Implication for the UCPH self-host migration timeline. “Cloud now, on-prem someday” is too conservative. A more accurate framing for AIPLA:

  • For tasks needing 80%+ capability: Tier 3 (Qwen 3.5 27B / Gemma 4 31B on a single machine) is viable today.
  • For tasks needing up to ~90% capability: Tier 2 (DeepSeek V4 Pro on a UCPH GPU cluster) sits at 90.1%, so it is viable today if the cluster exists.
  • For tasks needing 93%+ capability: Tier 1 (cloud API) is still the answer in 2026; the gap narrows every release cycle.
  • For light tasks: Tier 4 is already operational for what it’s designed for.

Where this populates from going forward. AIPLA’s own capability-floor report (a periodic deliverable, refreshed as new models release) will include current scores across the full model panel for each AIPLA task class — not just GPQA Diamond, which is a proxy. The eval composes published benchmark trajectories with AIPLA’s STX-physics-specific task definitions.

Task taxonomy

Anchored on the four AIPLA research questions, refined with AR’s curriculum priorities (experiments, mathematical representations, written assessment).

| ID | Task class | Mapped RQ | Source of examples |
| --- | --- | --- | --- |
| T1 | Problem-set hints (mechanics, electromagnetism, etc.) | RQ1 (exam problem sets) | STX exam past papers; AR’s curriculum annotations |
| T2 | Experimental troubleshooting | RQ2 (experiments) | Lab guides; teacher input |
| T3 | Conceptual exploration / Socratic dialogue | RQ3 (conceptual) | Curriculum core-material topics |
| T4 | Presentation critique | RQ4 (presentations) | Anonymised student slide decks (when available) |
| T5 | Worksheet OCR / diagram interpretation | cross-cuts RQ1–2 | Photos of free-body diagrams, hand-drawn graphs |
| T6 | Tabular / sensor data interpretation | RQ2 | CSV exports, Tracker outputs |
| T7 | LaTeX / formula generation | cross-cuts | Embedded in T1, T3 |

Each task class needs ~20–50 graded items for its score to be statistically meaningful. Initial scale: 10–20 per class for v0.1, which gives indicative results only; expand iteratively towards the 20–50 target.
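
To see why the item count matters, the sketch below computes a 95% Wilson score interval for an observed pass rate. The choice of the Wilson interval is an assumption made here for illustration; the eval itself does not prescribe a particular statistic.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Same 80% observed pass rate at two dataset sizes.
    for passes, n in ((16, 20), (40, 50)):
        low, high = wilson_interval(passes, n)
        print(f"{passes}/{n}: {low:.2f} to {high:.2f}")
```

With 16 of 20 items passed the interval runs from roughly 0.58 to 0.92, and with 40 of 50 it narrows to roughly 0.67 to 0.89, so a score near the 80% threshold on a small set cannot be read as a clear pass or fail.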

Capability dimensions

| Dimension | Why it matters | Measurement |
| --- | --- | --- |
| Correctness | Did the model give the right physics answer? | Deterministic where possible (exam answer keys); LLM-as-judge otherwise |
| Pedagogical appropriateness | Did it hint rather than solve? | LLM-as-judge against ESRU / IBSE rubric |
| Multimodal accuracy | Did it correctly read the diagram/photo? | Human-graded sample + LLM-as-judge |
| LaTeX validity | Are formulae renderable and correct? | Parse + render check |
| Latency | Acceptable for classroom UX? | p50/p95 ms |
| Cost | €/1000 tasks | Per-provider rate card |
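
As an illustration of the latency row, p50 and p95 could be derived from per-request timings with the standard library. The function name and the sample timings are hypothetical; how the runner actually records latency is not specified here.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50 and p95 from raw per-request latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two measurements")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}


if __name__ == "__main__":
    # Illustrative timings only.
    sample = [820, 910, 1040, 1100, 1250, 1300, 1480, 1900, 2400, 5200]
    print(latency_summary(sample))
```

Reporting p95 alongside p50 keeps occasional slow responses visible, which matters more for classroom use than the average does.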

Model panel

The full panel of models tracked is the four-tier table above. The eval composes published benchmark trajectories (GPQA, MMLU, MMMU, MATH, AILANG’s own runs) with AIPLA’s STX-physics-specific task definitions. Panel updates as new models release.

KPIs

Per task class, tracked over time:

  • Capability floor — smallest/cheapest model achieving ≥ 80% on the task class. This is the headline metric.
  • Cost ratio — €/1000 tasks at floor vs. €/1000 at frontier
  • Latency at floor — p95 ms
  • Local-readiness fraction — % of task classes where an open-weight model sits at the floor. When this approaches 100%, full local deployment becomes viable.
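
The KPIs above can be sketched in a few lines. This is a minimal illustration under stated assumptions: "smallest/cheapest" is interpreted as lowest €/1000 tasks, the frontier is taken to be the highest-scoring model, and the model names and numbers in `PANEL` are placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelResult:
    model: str
    open_weight: bool
    score: float               # fraction of items passed, 0..1
    cost_eur_per_1000: float   # euro per 1000 tasks


def capability_floor(results: List[ModelResult],
                     threshold: float = 0.80) -> Optional[ModelResult]:
    """Cheapest model meeting the threshold, or None if no model does."""
    adequate = [r for r in results if r.score >= threshold]
    if not adequate:
        return None
    return min(adequate, key=lambda r: r.cost_eur_per_1000)


def cost_ratio(results: List[ModelResult],
               threshold: float = 0.80) -> Optional[float]:
    """Cost at the floor divided by cost at the frontier (highest score)."""
    floor = capability_floor(results, threshold)
    if floor is None:
        return None
    frontier = max(results, key=lambda r: r.score)
    return floor.cost_eur_per_1000 / frontier.cost_eur_per_1000


def local_readiness(results_by_task: Dict[str, List[ModelResult]],
                    threshold: float = 0.80) -> float:
    """Fraction of task classes whose floor model is open-weight."""
    floors = [capability_floor(rs, threshold) for rs in results_by_task.values()]
    ready = sum(1 for f in floors if f is not None and f.open_weight)
    return ready / len(floors)


# Placeholder panel: names and numbers are invented for illustration.
PANEL = {
    "T1": [
        ModelResult("example-frontier-api", False, 0.95, 12.0),
        ModelResult("example-open-27b", True, 0.84, 1.5),
        ModelResult("example-tiny-3b", True, 0.40, 0.1),
    ],
    "T3": [
        ModelResult("example-frontier-api", False, 0.85, 12.0),
        ModelResult("example-open-27b", True, 0.70, 1.5),
    ],
}

if __name__ == "__main__":
    for task, results in PANEL.items():
        floor = capability_floor(results)
        print(task, floor.model if floor else None, cost_ratio(results))
    print("local readiness:", local_readiness(PANEL))
```

Latency at floor would follow the same pattern by adding a p95 field to `ModelResult` and reading it from the floor model.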

Scoring methodology

  • Deterministic where possible — exam problems with answer keys, structured extractions, LaTeX renders.
  • LLM-as-judge with rubric for qualitative dimensions (pedagogical appropriateness, conceptual depth). Judge model = Claude Opus 4.7 or an equivalent frontier model; rubric versioned alongside the eval.
  • Human-rated calibration sample — ~20 items per task class graded by AR or a physics teacher, used to calibrate the LLM judge.
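
One way to use the human-rated sample is to measure chance-corrected agreement between human grades and judge grades on the same items. The sketch below uses Cohen's kappa; that choice of statistic is an assumption, since the calibration method is not fixed above, and the labels are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between two raters who labelled the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative grades for ten items.
    human = ["pass", "pass", "pass", "fail", "fail",
             "pass", "fail", "pass", "pass", "fail"]
    judge = ["pass", "pass", "fail", "fail", "fail",
             "pass", "fail", "pass", "pass", "pass"]
    print(round(cohens_kappa(human, judge), 3))
```

A low kappa on the calibration sample would suggest revising the rubric or the judge prompt before trusting judge scores on the full set.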

Versioning and reproducibility

  • Eval set lives under infrastructure/evals/tasks/ with version tags (v0.1, v0.2, …)
  • Runner is deterministic given a fixed model panel and dataset version
  • Results snapshots dated and never overwritten
  • Every capability-floor report revision cites the dataset and snapshot versions it draws on
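
The "dated and never overwritten" rule can be enforced at write time. The following is a minimal sketch: the file naming scheme, the payload layout, and the `write_snapshot` helper are all hypothetical, and the results directory is a temporary one because the real location is not specified here.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def write_snapshot(results_dir: Path, dataset_version: str,
                   scores: dict, run_date: date) -> Path:
    """Write a dated results snapshot; refuse to overwrite an existing one."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{run_date.isoformat()}_{dataset_version}.json"
    payload = {
        "dataset_version": dataset_version,
        "run_date": run_date.isoformat(),
        "scores": scores,
    }
    # Mode "x" is exclusive creation: it raises FileExistsError if the file exists.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scores = {"T1": {"example-model": 0.82}}  # placeholder values
        out = write_snapshot(Path(tmp), "v0.1", scores, date(2026, 5, 15))
        print("wrote", out.name)
        try:
            write_snapshot(Path(tmp), "v0.1", scores, date(2026, 5, 15))
        except FileExistsError:
            print("refused to overwrite existing snapshot")
```

Because each snapshot records the dataset version it was run against, a report revision can cite both by file name.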

How this connects

The Architecture model router reads the capability-floor matrix to choose models per task class. The local-readiness fraction signals to UCPH IT when on-prem self-hosting is worth the GPU investment. New model classes from the Strand C investigation plug into the same eval to be compared against the LLM baselines. The eval itself is a research instrument that outlives the contract.


Session analytics — pedagogical rubrics (v1.1+ direction)

The capability-floor eval covers model quality. A separate question is what to do with the raw session data AIPLA accumulates once pilots are running: every chat turn, workbench state write, and progress tick is already in BigQuery. The gap between that raw activity stream and pedagogically meaningful signal for teachers is where a rubric framework lives.

Two frameworks from the physics-education-research literature fit AIPLA’s data shape and are under consideration for v1.1+. Framework choice sits with JB and AR — engineering implements once the framework is picked and the BigQuery sink (sprint 1.2) is live.

ICAP (Chi & Wylie 2014) — engagement quality

Labels each student utterance/action with a cognitive engagement mode:

| Mode | Typical AIPLA signal |
| --- | --- |
| Passive: receives without processing | Long pauses; one-word acknowledgements after a tutor message |
| Active: manipulates given information | Slider drag, sim launch, pressing “launch” button |
| Constructive: produces something new | Student turn with causal language (“I think…”, “because…”, “fordi…”) |
| Interactive: builds on the tutor’s reasoning | Question that extends the tutor’s previous question; “but what if…” |

Output: a per-session engagement histogram (e.g., “28% passive, 41% active, 22% constructive, 9% interactive”). Each mode maps to existing AIPLA data; no new instrumentation needed — detection runs as a post-session LLM-labelling pass over the BigQuery log. An ACL 2025 paper validated this approach on GenAI tutor conversations specifically.

A session with mostly Passive + Active engagement and low Constructive/Interactive is a signal for the teacher to adjust scaffolding — not a mark of poor tutor performance.
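
Once each turn or action carries an ICAP label, the per-session histogram is a simple aggregation. The sketch below assumes the labels have already been produced by the labelling pass; the function name and the lowercase label strings are illustrative choices.

```python
from collections import Counter
from typing import Dict, Sequence

ICAP_MODES = ("passive", "active", "constructive", "interactive")


def engagement_histogram(labels: Sequence[str]) -> Dict[str, float]:
    """Percentage of labelled events per ICAP mode for one session."""
    unknown = set(labels) - set(ICAP_MODES)
    if unknown:
        raise ValueError(f"unknown ICAP labels: {sorted(unknown)}")
    if not labels:
        return {mode: 0.0 for mode in ICAP_MODES}
    counts = Counter(labels)
    return {mode: 100 * counts[mode] / len(labels) for mode in ICAP_MODES}


if __name__ == "__main__":
    # Illustrative session with ten labelled events.
    session = (["passive"] * 4 + ["active"] * 3
               + ["constructive"] * 2 + ["interactive"])
    for mode, pct in engagement_histogram(session).items():
        print(f"{mode}: {pct:.0f}%")
```

Rejecting unknown labels up front guards against a labelling pass that drifts away from the four ICAP modes.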

FCI misconception taxonomy (Hestenes et al.) — concept tracking

The Force Concept Inventory defines ~30 labelled Newtonian misconceptions, each with recognisable linguistic signatures. Applied to AIPLA’s Boldkast and KineBot sessions, it lets the teacher analytics chat say something specific rather than generic:

| Misconception | Linguistic signature in chat | Relevant activity |
| --- | --- | --- |
| velocity-proportional-to-force | “if I push harder it goes faster for longer” | Boldkast, KineBot |
| motion-implies-active-force | “what’s pushing it now?” during free-flight | Boldkast |
| vector-composition-nonvectorial | “the horizontal and vertical cancel each other” | Boldkast (the key appresent DRA for vx/vy independence) |
| gravitational-mass-dependence | “heavier falls faster” | Boldkast |
| position-velocity-undiscriminated | treats x(t) and v(t) graphs as the same | KineBot graph plotter |

FCI labelling is a per-turn classification against a finite misconception list — tractable as a lightweight LLM call. Note: FCI is Newtonian-mechanics-specific; LED Planck and future quantum/waves activities need a different taxonomy (no FCI equivalent exists for those domains yet — AR’s input needed).
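
A per-turn classification against a finite list might look like the sketch below. The model call is passed in as a plain function so no particular provider API is assumed; the prompt wording, `classify_turn`, and `stub_llm` are all hypothetical, and only the five labels from the table above are included.

```python
from typing import Callable, Optional

FCI_LABELS = (
    "velocity-proportional-to-force",
    "motion-implies-active-force",
    "vector-composition-nonvectorial",
    "gravitational-mass-dependence",
    "position-velocity-undiscriminated",
)


def build_prompt(turn_text: str) -> str:
    """Assemble a closed-list classification prompt for one student turn."""
    options = "\n".join(f"- {label}" for label in FCI_LABELS)
    return (
        "Classify the student turn against this misconception list.\n"
        "Reply with exactly one label from the list, or 'none'.\n"
        f"{options}\n\nStudent turn: {turn_text}"
    )


def classify_turn(turn_text: str, llm: Callable[[str], str]) -> Optional[str]:
    """Return a known FCI label, or None if the reply is 'none' or off-list."""
    reply = llm(build_prompt(turn_text)).strip().lower()
    return reply if reply in FCI_LABELS else None


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, for demonstration only."""
    if "heavier falls faster" in prompt:
        return "gravitational-mass-dependence"
    return "none"


if __name__ == "__main__":
    print(classify_turn("I think heavier falls faster", stub_llm))
    print(classify_turn("ok", stub_llm))
```

Validating the reply against the fixed list means an unexpected model output is dropped instead of being reported to the teacher as a misconception.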

Minimum viable starting point

A combined ICAP + FCI pass on Boldkast sessions, run post-session, with results surfaced in the teacher analytics chat as:

“Group bold-kazoo-87: 38% constructive engagement — above class average. Two FCI misconception signals detected: motion-implies-active-force (turns 12, 17) and vector-composition-nonvectorial (turn 23). The tutor addressed the second but not the first.”

This requires: the BigQuery sink live (1.2), a per-activity DRA map authored by AR + JB, and a labelling-pass design review with JB. Not to be started before the mid-point review.