Capability-floor evaluation
Which AI model is adequate for which task — and when do local models catch up
Purpose
For each task class in the STX physics setting, identify the smallest and cheapest model that performs adequately. The result drives production routing decisions, cost projections, and the migration timeline for self-hosted open-weight models.
Deliverable. A versioned eval dataset, a runner, and a periodically updated capability-floor report. Reusable beyond M’s contract: successors re-run it quarterly as new models release.
The picture
For each task we run, we set a minimum score threshold the AI must hit to be useful. We then watch how each model tier — frontier cloud, capable open-weight, small/on-device — tracks against that threshold over time. When a tier crosses the threshold, that’s our estimated migration date for routing the task to that tier.
Catch-up timelines — what waiting buys you. Each tier reaches today’s frontier on a different cadence:
- Tier 2 (server) lags Tier 1 by ~8 months on hard reasoning. Efficient MoE architectures (DeepSeek V4) have collapsed the dense-model gap that used to be larger. Wait two release cycles, get a comparable open-weight model.
- Tier 3 (server-local) lags by ~12 months. The constraint is what fits on a single workstation; aggressive quantisation and smaller capable architectures (Qwen 3.5 27B at 85.5%, Gemma 4 31B at 84.3%) keep narrowing this. Wait a year, run frontier-equivalent on a 128 GB Mac.
- Tier 4 (on-device) currently lags by ~24–28 months on hard reasoning, and the lag is mostly real hardware constraint, not self-imposed. The bottleneck is device RAM. The iPhone 15 Pro and iPhone 16 lineup have 8 GB total (Tom’s Guide), and Apple reserves ~4 GB for AI (The Register). That budget supports a ~3 B-parameter model at 2/4-bit quantisation — which is exactly what the Apple Intelligence on-device foundation model uses — but not a 7 B or 14 B model. Analysts argue full on-device LLMs need 20 GB+ (Wccftech analyst commentary). The iPhone 17 Pro is rumoured to ship with 12 GB (Tom’s Guide) — which would open headroom for ~7 B models. There are also research paths (Apple’s flash-storage weight streaming) that may shorten the lag without waiting for hardware refreshes. Treat the late-2027 crossing as a working estimate; update as hardware refreshes land.
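The RAM argument for Tier 4 can be checked with back-of-the-envelope arithmetic. The sketch below assumes that quantised weights dominate memory use and adds a hypothetical 20% overhead for KV cache and runtime buffers. The overhead factor and the function names are illustrative assumptions, not measured values.

```python
def model_footprint_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Approximate resident memory in GB for a quantised model.

    overhead is an ASSUMED multiplier covering KV cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


def fits(params_billion: float, bits_per_weight: float, budget_gb: float) -> bool:
    """True if the estimated footprint stays within the memory budget."""
    return model_footprint_gb(params_billion, bits_per_weight) <= budget_gb


AI_BUDGET_GB = 4.0  # the ~4 GB reserved for AI on 8 GB iPhones, per the text above

for size in (3, 7, 14):
    gb = model_footprint_gb(size, bits_per_weight=4)
    verdict = "fits" if fits(size, 4, AI_BUDGET_GB) else "does not fit"
    print(f"{size}B at 4-bit: ~{gb:.1f} GB, {verdict} in {AI_BUDGET_GB} GB")
```

Under these assumptions a 3B model at 4-bit needs about 1.8 GB and fits, while a 7B model needs about 4.2 GB and does not, which is consistent with the claim above. The result is sensitive to the assumed overhead, so treat it as an order-of-magnitude check only.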
Three things this plot makes concrete:
- Some tasks always want state-of-the-art — high thresholds (complex multi-step physics reasoning, nuanced pedagogical scaffolding) may stay on the cloud frontier indefinitely.
- Other tasks have lower thresholds — summarisation, formatting, structured extraction, simple Q&A. For these, open-weight catches up in months, and on-device follows a year or two behind.
- The timeline is estimable. We’re not guessing when to migrate — we’re tracking published benchmark trajectories and projecting against our task-specific thresholds. The projection sharpens every time a new model releases.
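One simple way to turn a benchmark trajectory into a projected crossing date is a least-squares line through the dated scores for a tier. The sketch below is an assumption about how such a projection could be done, not the project's chosen method. The function name and the sample observations are hypothetical placeholders, and a straight line ignores the saturation that benchmark scores show near 100%.

```python
from datetime import date, timedelta


def project_crossing(observations, threshold):
    """Fit score = a + b * days by least squares; return the date the fit hits threshold.

    observations: list of (date, score) pairs for one tier on one benchmark.
    Returns None when the fitted trend is flat or falling.
    A date in the past means the fitted line has already crossed the threshold.
    """
    t0 = observations[0][0]
    xs = [(d - t0).days for d, _ in observations]
    ys = [s for _, s in observations]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or sxy <= 0:
        return None
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    days_to_cross = (threshold - intercept) / slope
    return t0 + timedelta(days=round(days_to_cross))


# Made-up scores for illustration only; these are NOT published results.
hypothetical_tier = [
    (date(2025, 1, 1), 50.0),
    (date(2025, 7, 1), 61.0),
    (date(2026, 1, 1), 69.0),
]
print(project_crossing(hypothetical_tier, threshold=80.0))
```

Re-running the fit each time a new model lands is what "the projection sharpens" amounts to in practice: each added point tightens the estimated slope.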
Public benchmarks we draw on
The eval doesn’t reinvent the wheel: it composes published benchmarks with STX-physics-specific task data. Primary sources for the model panel and trajectory estimation:
| Benchmark | What it measures | Relevance |
|---|---|---|
| GPQA | Graduate-level physics, chemistry, biology questions | Direct upper-bound proxy for STX physics reasoning |
| MATH | Competition-level mathematics | Closely tracks formula-and-derivation tasks |
| MMLU | Massive multitask language understanding | Broad knowledge ceiling |
| MMMU | College-level multimodal understanding (figures, diagrams) | Tracks the multimodal-input path |
| HumanEval / MBPP | Code generation | Relevant for the code-execution MCP tool |
| Open LLM Leaderboard (HuggingFace) | Aggregated open-weight comparisons | Live tracker for Tier 3 candidates |
| Artificial Analysis | Cross-provider model comparisons (price, speed, quality) | Live tracker for Tier 1 and Tier 2 |
| Epoch AI | Long-run AI capability trends | Source for the trajectory projections |
| AILANG benchmarks | Project-internal panel runs across multiple providers | Primary source for AIPLA’s per-task scores |
External benchmarks calibrate the model panel; the AIPLA eval adds physics-specific task definitions and the threshold scores.
Real numbers — GPQA Diamond, the closest public proxy
GPQA Diamond is graduate-level physics, chemistry, and biology — the closest public benchmark to upper-secondary physics reasoning, and harder than most STX physics tasks will require. A snapshot of current scores (verified 2026-05-15) across the four tiers:
| Tier | Model | Hardware needed | GPQA Diamond | Source |
|---|---|---|---|---|
| 1 — AI API | Claude Opus 4.7 | None (API; cloud-agnostic — Anthropic / Bedrock / Vertex) | 94.2% | Vellum leaderboard |
| 1 — AI API | Gemini 3.1 Pro Preview | None (Vertex AI, EU regions available) | 94.1% | Artificial Analysis |
| 1 — AI API | GPT-5.5 (xhigh) | None (OpenAI / Azure) | 93.5% | Artificial Analysis |
| 1 — AI API | Gemini 2.5 Pro | None (Vertex AI EU — prototype default) | ~85–86% | Google model card |
| 2 — Self-hosted server | DeepSeek V4 Pro (1.6T MoE / 49B active) | Cluster: 8× H100 / 4× H200 (~800+ GB VRAM at Q4_K_M; NVLink) | 90.1% | Framia benchmarks |
| 2 — Self-hosted server | DeepSeek V4 Flash (284B MoE / 13B active) | ~4× H100 (~280 GB VRAM at FP8); top open-weight by OpenRouter usage | 79.0% | Codersera deep-dive |
| 3 — Server-local | Qwen 3.5 27B | ~30 GB at Q6 — fits on 128 GB Mac or single H100/A100 | 85.5% | Awesome Agents |
| 3 — Server-local | Gemma 4 31B | ~32 GB at Q6 — fits on 128 GB Mac or single H100 | 84.3% | ai.rs roundup |
| 3 — Server-local | Phi-4 (14B) | ~14 GB FP8 — single consumer GPU (RTX 4090) or high-end laptop | 56.1% | TokenMix |
| 4 — On-device | Apple Intelligence (~3B) | iPhone 15 Pro+, iPad M-series, Apple Silicon Mac | n/a (not GPQA-benchmarked) | Apple |
| 4 — On-device | Gemini Nano (3.25B) | Pixel 8+, Samsung S24+, supporting Chromebooks | n/a (not GPQA-benchmarked) | |
GPQA Diamond is a graduate-level science benchmark — Tier 4 models aren’t designed for that level of reasoning, which is why they don’t have published scores on it. They serve light text tasks (summarisation, formatting, short Q&A), which is what gets routed to them in practice.
Four observations:
- The cloud frontier sits at ~94% on graduate science. Whatever threshold AIPLA picks for its hardest tasks, Tier 1 already clears it.
- Tier 2 (UCPH-hosted server) is about 4 points behind the frontier. DeepSeek V4 Pro at 90.1% is essentially equivalent to cloud for AIPLA purposes. With UCPH IT providing GPU cluster hosting, AIPLA can serve near-frontier capability without sending any data to a cloud provider at all.
- Tier 3 (server-local workstation) clears 80% comfortably. Qwen 3.5 27B and Gemma 4 31B both land at 84–86%. A 128 GB Mac, a single H100, or a small departmental server runs these. No GPU cluster needed.
- Tier 4 (on-device) is the remaining gap. Smaller models trail at 40–56% on this benchmark and the truly tiny on-device models (Apple Intelligence, Gemini Nano) aren’t designed for graduate science at all. They serve light tasks, which is what gets routed to them.
Implication for the UCPH self-host migration timeline. “Cloud now, on-prem someday” is too conservative. A more accurate framing for AIPLA:
- For tasks needing 80%+ capability: Tier 3 (Qwen 3.5 27B / Gemma 4 31B on a single machine) is viable today.
- For tasks needing 88–92% capability: Tier 2 (DeepSeek V4 Pro on a UCPH GPU cluster) sits at 90.1% — viable today if the cluster exists.
- For tasks needing 93%+ capability: Tier 1 (cloud API) is still the answer in 2026; the gap narrows every release cycle.
- For light tasks: Tier 4 is already operational for what it’s designed for.
Where this populates from going forward. AIPLA’s own capability-floor report (a periodic deliverable, refreshed as new models release) will include current scores across the full model panel for each AIPLA task class — not just GPQA Diamond, which is a proxy. The eval composes published benchmark trajectories with AIPLA’s STX-physics-specific task definitions.
Live trackers (verify current scores here)
- Artificial Analysis — cross-provider model comparisons (price, speed, quality, latency); live and updates frequently
- Open LLM Leaderboard (HuggingFace) — comprehensive open-weight tracker
- LMArena — human-preference comparisons; useful for chat-style tasks
- Epoch AI Benchmarking Dashboard — long-run trajectories and projections
- GPQA leaderboard — primary source for GPQA Diamond results
Task taxonomy
Anchored on the four AIPLA research questions, refined with AR’s curriculum priorities (experiments, mathematical representations, written assessment).
| ID | Task class | Mapped RQ | Source of examples |
|---|---|---|---|
| T1 | Problem-set hints (mechanics, electromagnetism, etc.) | RQ1 (exam problem sets) | STX exam past papers; AR’s curriculum annotations |
| T2 | Experimental troubleshooting | RQ2 (experiments) | Lab guides; teacher input |
| T3 | Conceptual exploration / Socratic dialogue | RQ3 (conceptual) | Curriculum core-material topics |
| T4 | Presentation critique | RQ4 (presentations) | Anonymised student slide decks (when available) |
| T5 | Worksheet OCR / diagram interpretation | cross-cuts RQ1–2 | Photos of free-body diagrams, hand-drawn graphs |
| T6 | Tabular / sensor data interpretation | RQ2 | CSV exports, Tracker outputs |
| T7 | LaTeX / formula generation | cross-cuts | Embedded in T1, T3 |
Each task class needs ~20–50 graded items to be statistically meaningful. Initial scale: 10–20 per class for v0.1, expand iteratively.
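To see why item counts in this range matter, it helps to look at the uncertainty around an observed pass rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. Picking this particular interval is an assumption here, not something the eval design prescribes.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Same observed 80% pass rate at three different item counts.
for n in (10, 20, 50):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n}: interval {low:.2f} to {high:.2f}")
```

With 20 items, an observed 80% is compatible with a true rate anywhere from roughly 58% to 92%, so a v0.1 set of 10–20 items can rank models only coarsely against an 80% threshold.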
Capability dimensions
| Dimension | Why it matters | Measurement |
|---|---|---|
| Correctness | Did the model give the right physics answer? | Deterministic where possible (exam answer keys); LLM-as-judge otherwise |
| Pedagogical appropriateness | Did it hint rather than solve? | LLM-as-judge against ESRU / IBSE rubric |
| Multimodal accuracy | Did it correctly read the diagram/photo? | Human-graded sample + LLM-as-judge |
| LaTeX validity | Are formulae renderable and correct? | Parse + render check |
| Latency | Acceptable for classroom UX? | p50/p95 ms |
| Cost | €/1000 tasks | Per-provider rate card |
Model panel
The full panel of models tracked is the four-tier table above. The eval composes published benchmark trajectories (GPQA, MMLU, MMMU, MATH, AILANG’s own runs) with AIPLA’s STX-physics-specific task definitions. Panel updates as new models release.
KPIs
Per task class, tracked over time:
- Capability floor — smallest/cheapest model achieving ≥ 80% on the task class. This is the headline metric.
- Cost ratio — €/1000 tasks at floor vs. €/1000 at frontier
- Latency at floor — p95 ms
- Local-readiness fraction — % of task classes where an open-weight model sits at the floor. When this approaches 100%, full local deployment becomes viable.
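The KPIs above can be computed from a simple score matrix. The sketch below is illustrative: the `ModelResult` structure, the model names, and all numbers are hypothetical placeholders. It also assumes "smallest/cheapest" is resolved by cost per 1000 tasks and that the frontier model is designated by name.

```python
from dataclasses import dataclass

THRESHOLD = 0.80  # the ">= 80%" floor criterion from the KPI list


@dataclass
class ModelResult:
    name: str
    open_weight: bool
    eur_per_1000: float  # cost in EUR per 1000 tasks
    scores: dict         # task class ID -> score in [0, 1]


def capability_floor(results, task):
    """Cheapest model whose score on `task` meets the threshold, or None."""
    ok = [r for r in results if r.scores.get(task, 0.0) >= THRESHOLD]
    return min(ok, key=lambda r: r.eur_per_1000) if ok else None


def cost_ratio(results, task, frontier_name):
    """Cost at the floor divided by cost at the designated frontier model."""
    floor = capability_floor(results, task)
    frontier = next(r for r in results if r.name == frontier_name)
    return None if floor is None else floor.eur_per_1000 / frontier.eur_per_1000


def local_readiness(results, tasks):
    """Fraction of task classes whose floor model is open-weight."""
    floors = [capability_floor(results, t) for t in tasks]
    return sum(1 for f in floors if f is not None and f.open_weight) / len(tasks)


# Hypothetical placeholder panel, for illustration only.
panel = [
    ModelResult("frontier-api", False, 10.0, {"T1": 0.95, "T7": 0.97}),
    ModelResult("open-27b", True, 2.0, {"T1": 0.74, "T7": 0.91}),
]
print(capability_floor(panel, "T7").name, local_readiness(panel, ["T1", "T7"]))
```

In this toy panel the open-weight model is the floor for T7 but not for T1, so the local-readiness fraction is one half. A router could read the same matrix to pick a model per task class.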
Scoring methodology
- Deterministic where possible — exam problems with answer keys, structured extractions, LaTeX renders.
- LLM-as-judge with rubric for qualitative dimensions (pedagogical appropriateness, conceptual depth). Judge model: Claude Opus 4.7 or an equivalent frontier model; the rubric is versioned alongside the eval.
- Human-rated calibration sample — ~20 items per task class graded by AR or a physics teacher, used to calibrate the LLM judge.
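One way to check the judge against the human-rated sample is a chance-corrected agreement statistic. The sketch below uses Cohen's kappa. That choice is an assumption, since the methodology above does not name a calibration metric, and the labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for a 20-item calibration sample.
human_labels = ["pass"] * 10 + ["fail"] * 10
judge_labels = ["pass"] * 8 + ["fail"] * 2 + ["fail"] * 8 + ["pass"] * 2
print(round(cohens_kappa(human_labels, judge_labels), 2))
```

Raw agreement here is 80%, but kappa is lower because half of that agreement would be expected by chance. A low kappa on a task class would suggest revising the rubric before trusting the judge's scores there.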
Versioning and reproducibility
- Eval set lives under `infrastructure/evals/tasks/` with version tags (v0.1, v0.2, …)
- Runner is deterministic given a fixed model panel and dataset version
- Results snapshots dated and never overwritten
- Every capability-floor report revision cites the dataset and snapshot versions it draws on
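The "dated and never overwritten" rule can be enforced at write time rather than by convention. A minimal sketch follows, assuming JSON snapshots named by run date and dataset version. The file layout, the function name, and the payload fields are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def write_snapshot(results: dict, dataset_version: str, out_dir: Path,
                   run_date: date) -> Path:
    """Write a dated results snapshot and refuse to overwrite an existing one."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_date.isoformat()}_{dataset_version}.json"
    payload = {
        "dataset_version": dataset_version,
        "run_date": run_date.isoformat(),
        "results": results,
    }
    # Mode "x" is exclusive creation: it raises FileExistsError if the file exists.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
    return path
```

Because each snapshot embeds its dataset version, a report revision can cite both identifiers from the file itself.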
How this connects
The Architecture model router reads the capability-floor matrix to choose models per task class. The local-readiness fraction signals to UCPH IT when on-prem self-hosting is worth the GPU investment. New model classes from the Strand C investigation plug into the same eval to be compared against the LLM baselines. The eval itself is a research instrument that outlives the contract.
Session analytics — pedagogical rubrics (v1.1+ direction)
The capability-floor eval covers model quality. A separate question is what to do with the raw session data AIPLA accumulates once pilots are running: every chat turn, workbench state write, and progress tick is already in BigQuery. The gap between that raw activity stream and pedagogically meaningful signal for teachers is where a rubric framework lives.
Two frameworks from the physics-education-research literature fit AIPLA’s data shape and are under consideration for v1.1+. Framework choice sits with JB and AR — engineering implements once the framework is picked and the BigQuery sink (sprint 1.2) is live.
ICAP (Chi & Wylie 2014) — engagement quality
Labels each student utterance/action with a cognitive engagement mode:
| Mode | Typical AIPLA signal |
|---|---|
| Passive — receives without processing | Long pauses; one-word acknowledgements after a tutor message |
| Active — manipulates given information | Slider drag, sim launch, pressing “launch” button |
| Constructive — produces something new | Student turn with causal language (“I think…”, “because…”, “fordi…”) |
| Interactive — builds on the tutor’s reasoning | Question that extends the tutor’s previous question; “but what if…” |
Output: a per-session engagement histogram (e.g., “28% passive, 41% active, 22% constructive, 9% interactive”). Each mode maps to existing AIPLA data; no new instrumentation needed — detection runs as a post-session LLM-labelling pass over the BigQuery log. An ACL 2025 paper validated this approach on GenAI tutor conversations specifically.
A session with mostly Passive + Active engagement and low Constructive/Interactive is a signal for the teacher to adjust scaffolding — not a mark of poor tutor performance.
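Once each turn carries an ICAP label, the per-session histogram is a simple aggregation. The sketch below assumes the labelling pass yields one lowercase mode string per turn. That label format and the function name are illustrative assumptions.

```python
from collections import Counter

ICAP_MODES = ("passive", "active", "constructive", "interactive")


def engagement_histogram(labels):
    """Return each ICAP mode's share of labelled turns as a whole percentage."""
    unknown = set(labels) - set(ICAP_MODES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        return {mode: 0 for mode in ICAP_MODES}
    counts = Counter(labels)
    return {mode: round(100 * counts[mode] / len(labels)) for mode in ICAP_MODES}


# Made-up labels for one 20-turn session.
session = ["passive"] * 5 + ["active"] * 10 + ["constructive"] * 4 + ["interactive"]
print(engagement_histogram(session))
```

Because each share is rounded independently, the percentages may not sum to exactly 100 for some sessions.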
FCI misconception taxonomy (Hestenes et al.) — concept tracking
The Force Concept Inventory defines ~30 labelled Newtonian misconceptions, each with recognisable linguistic signatures. Applied to AIPLA’s Boldkast and KineBot sessions, it lets the teacher analytics chat say something specific rather than generic:
| Misconception | Linguistic signature in chat | Relevant activity |
|---|---|---|
| `velocity-proportional-to-force` | “if I push harder it goes faster for longer” | Boldkast, KineBot |
| `motion-implies-active-force` | “what’s pushing it now?” during free-flight | Boldkast |
| `vector-composition-nonvectorial` | “the horizontal and vertical cancel each other” | Boldkast (the key appresent DRA for vx/vy independence) |
| `gravitational-mass-dependence` | “heavier falls faster” | Boldkast |
| `position-velocity-undiscriminated` | treats x(t) and v(t) graphs as the same | KineBot graph plotter |
FCI labelling is a per-turn classification against a finite misconception list — tractable as a lightweight LLM call. Note: FCI is Newtonian-mechanics-specific; LED Planck and future quantum/waves activities need a different taxonomy (no FCI equivalent exists for those domains yet — AR’s input needed).
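Classifying against a finite list mainly requires a constrained prompt and validation of the model's answer. The sketch below shows that shape with the model call injected as a plain function. The prompt wording, the `none` fallback label, and `stub_llm` are hypothetical stand-ins, and no real provider API is shown.

```python
FCI_LABELS = (
    "velocity-proportional-to-force",
    "motion-implies-active-force",
    "vector-composition-nonvectorial",
    "gravitational-mass-dependence",
    "position-velocity-undiscriminated",
)
NONE_LABEL = "none"  # assumed label for turns with no misconception signal


def build_prompt(turn_text: str) -> str:
    """Build a single-label classification prompt for one student turn."""
    options = "\n".join(f"- {label}" for label in FCI_LABELS + (NONE_LABEL,))
    return (
        "Classify the student turn against the misconception list.\n"
        f"Answer with exactly one label from:\n{options}\n\n"
        f"Student turn: {turn_text}"
    )


def classify_turn(turn_text: str, call_llm) -> str:
    """call_llm is any function str -> str; unrecognised answers map to 'none'."""
    answer = call_llm(build_prompt(turn_text)).strip().lower()
    return answer if answer in FCI_LABELS else NONE_LABEL


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call, keyed on one signature phrase from the table.
    if "heavier falls faster" in prompt:
        return "gravitational-mass-dependence"
    return "none"


print(classify_turn("I think the heavier falls faster", stub_llm))
```

Validating the answer against the known list keeps a free-text or invented label from reaching the teacher-facing summary.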
Minimum viable starting point
A combined ICAP + FCI pass on Boldkast sessions, run post-session, with results surfaced in the teacher analytics chat as:
“Group bold-kazoo-87: 38% constructive engagement, above class average. Two FCI misconception signals detected: `motion-implies-active-force` (turns 12, 17) and `vector-composition-nonvectorial` (turn 23). The tutor addressed the second but not the first.”
This requires: BigQuery sink live (1.2), per-activity DRA map authored by AR + JB, and a labelling-pass design review with JB. Not before mid-point review.