The three strands

What gets built, what gets investigated, and when

The four-month contract splits into three strands. A and B are priority delivery — working pilots in front of teachers. C is a scoping investigation that informs what gets built after the contract ends.

Strand	Type	Priority	Deliverable
A — Pedagogical bot infrastructure	Build	High	Working pilot in front of teachers by end of contract
B — Simulation & game creation	Build	High	Working pilot similar to A
C — Investigation and scoping	Research	Lower	5–10pp scoping note with feasibility + recommendations

Year-1 posture: breadth over depth. The emphasis this year is exploring a wide range of tools and activity types — many thin, end-to-end probes — over scaling or polishing any single feature. Year 1 is about discovering which interfaces and activities are worth pursuing, so coverage of the design space is worth more than depth on any one bet. Read the skill and workbench lists below in that light: a deliberately broad exploration, each item committed only to the scope its marker shows.

Timeline at a glance

gantt
    title When each strand gets attention
    dateFormat YYYY-MM-DD
    axisFormat %b
    todayMarker off

    section A — Bots
    Build core                :a1, 2026-05-15, 42d
    Post-holiday iterate      :a2, 2026-07-15, 30d
    Teacher pilot             :a3, 2026-08-14, 32d

    section B — Sims/games
    Build                     :b1, 2026-07-15, 28d
    Pilot                     :b2, after b1, 14d

    section C — Scoping
    Investigation             :c1, 2026-07-15, 21d
    Scoping note              :c2, after c1, 21d

    section Gates
    Jutland demo (v0.1)       :milestone, m0, 2026-05-27, 0d
    Mid-point review          :milestone, m1, 2026-06-26, 0d
    M + JB holiday (wk 27)    :crit, 2026-06-29, 7d
    Final handover            :milestone, m2, 2026-09-15, 0d

A v0.1 demo URL landed by 27 May, in time for JB and AR’s Jutland teacher visit. Strand A’s core build then ran to the mid-point review on 26 June, after which M and JB go on holiday (week 27, 29 June – 5 July). Strands B and C kick off on their return. The teacher pilot starts when Danish teachers return in mid-August.
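The dates in the chart can be cross-checked with a little date arithmetic. The Python sketch below is illustrative only: it assumes each bar ends at its start date plus its duration, and the names `STRAND_A` and `bar_end` are invented for this example.

```python
from datetime import date, timedelta

# (start, duration in days) copied from the Strand A rows of the Gantt chart.
STRAND_A = {
    "build_core": (date(2026, 5, 15), 42),
    "post_holiday_iterate": (date(2026, 7, 15), 30),
    "teacher_pilot": (date(2026, 8, 14), 32),
}


def bar_end(start: date, days: int) -> date:
    """End of a bar, assuming end = start + duration."""
    return start + timedelta(days=days)


ends = {name: bar_end(start, days) for name, (start, days) in STRAND_A.items()}

# ISO week number of the first holiday day (29 June 2026).
holiday_week = date(2026, 6, 29).isocalendar()[1]

for name, end in ends.items():
    print(f"{name}: ends {end.isoformat()}")
print(f"holiday starts in ISO week {holiday_week}")
```

Under that assumption the core build ends on the mid-point review date (26 June) and the teacher pilot ends on the final-handover date (15 September).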

How the strands relate

flowchart TB
    A[Strand A<br/>Pedagogical bots]
    B[Strand B<br/>Simulation / games]
    C[Strand C<br/>Scoping note]

    subgraph Shared["Shared infrastructure"]
        direction LR
        Auth[Auth / sessions]
        Logs[Anonymised chat logs]
        Router[Model router]
        RAG[RAG store]
    end

    Eval[Capability-floor eval]

    A --> Shared
    B --> Shared
    Shared --> Router
    Router --> Eval

    C -.investigates.-> Eval
    C -.may bridge to.-> A

    style A fill:#fce7e8,stroke:#901a1e,stroke-width:2px
    style B fill:#fce7e8,stroke:#901a1e,stroke-width:2px
    style C fill:#fff,stroke:#901a1e,stroke-width:2px,stroke-dasharray:5
    style Eval fill:#f5f5f5,stroke:#666

Strand B reuses Strand A’s plumbing — the load-bearing decision that makes both pilots achievable in the contract window. Strand C may produce a concrete prototype (the Plaud-like interface) that bridges back into Strand A.

Skills

What’s committed vs. direction-of-travel. Each skill below is tagged with a scope marker. We set the v1 deliverable conservatively and over-deliver where we can; the architecture enables all of these, but the 4-month contract commits only to a focused subset.

v0.1 — Jutland teacher demo (Wed 27 May)
v1 — Teacher pilot (mid-August onwards)
Year-2 — Architectural direction; enabled by the platform but not in the 4-month deliverable

Concrete skills that will sit on top of the Strand A infrastructure. Designed with AR; refined as the teacher pilot reveals what actually gets used.

The canonical pairing. AR’s preferred workflow (Examples page) pairs physics-sim-builder with a tutor-config skill (typically problem-set-helper-config) — sim + tutor designed together as a unit, with the tutor’s prompt referencing the simulation’s specific UI elements. Standalone-tutor skills (no paired sim) cover topics like pure conceptual dialogue where there’s no UI to reference.

Teacher-facing (primary)

All teacher tasks happen in the same UI app that students use: a single multi-surface frontend where the AI directs each output to a surface (chat for conversation, workspace for active views like dashboards and artefacts, sidebar for persistent context, modal for focused approval flows). See ADR-015.
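One way to picture the surface routing described above is a small lookup from output kind to surface. This is a hedged sketch, not the ADR-015 design: the `Output` type, the kind labels, and the fallback-to-chat rule are assumptions made for illustration; only the four surface names come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Surface(Enum):
    CHAT = "chat"            # conversation
    WORKSPACE = "workspace"  # active views: dashboards, artefacts
    SIDEBAR = "sidebar"      # persistent context
    MODAL = "modal"          # focused approval flows


@dataclass(frozen=True)
class Output:
    kind: str     # hypothetical label attached by the AI, e.g. "dashboard"
    payload: str


# Hypothetical default mapping; in the platform the AI makes this choice.
DEFAULT_ROUTES = {
    "message": Surface.CHAT,
    "dashboard": Surface.WORKSPACE,
    "artefact": Surface.WORKSPACE,
    "context": Surface.SIDEBAR,
    "approval": Surface.MODAL,
}


def route(output: Output) -> Surface:
    """Pick a surface for an output, falling back to chat for unknown kinds."""
    return DEFAULT_ROUTES.get(output.kind, Surface.CHAT)
```

Falling back to chat keeps an unrecognised output visible to the user instead of dropping it silently; that is a design choice of this sketch, not a documented behaviour.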

Content creation

Skill	What it does	Surface	Scope
`physics-sim-builder`	Generates a paired sim+tutor unit. Phenomenon-simulation form factor (sliders, observation, parameter exploration). Structured inputs: Topic (full scope) + Student Difficulty (emphasis). Anti-reduction discipline: address the misconception inside the complete topic, never collapse to it. Output structure: HTML simulation + Predict→Explore→Reflect→Summarize learning flow + paired Socratic tutor referencing the sim’s UI. Modelled on AR’s iteratively refined trial prompt.	MCP App (sim) + tutor-prompt feed into a config skill	Year-2
`physics-lab-builder`	Generates a paired virtual-lab+tutor unit. Procedural form factor (draggable equipment, wiring, calibration, measurement workflow). Structured inputs: Experiment source (PDF or text of the real procedure) + Equipment list (anti-hallucination ground truth) + Student Difficulty. Output structure: Interactive HTML lab + checklist-driven procedure + paired Socratic tutor referencing the lab’s specific equipment and measurement steps. Modelled on AR’s second trial (LED Planck-constant lab — see Examples).	MCP App (lab) + tutor-prompt feed into a config skill	Year-2
`illustration-builder`	Generates SVG/PNG illustrations for student materials	A2UI inline image	Year-2
`misconception-pair`	Generates a correct illustration and a textbook-distractor wrong variant — pulls on the pedagogical pattern AR is already using to mine instructive AI failure modes	A2UI side-by-side	Year-2

v1 alternative — a hand-curated library of paired sim+tutor units, plus PhET sims paired with AIPLA-authored tutors for topics PhET already covers well. PhET sims are iframe-embeddable and pre-translated; AIPLA hosts a Socratic tutor whose prompt references the specific PhET sim’s UI. Same paired-unit pattern as AR’s; we just don’t author the sim ourselves for these. The AI-generation skills above ship when the artefact-generation pipeline is mature.

Bot configuration

Skill	What it does	Surface	Scope
`problem-set-helper-config`	Configures a student-facing tutor for a problem set or simulation. When paired with `physics-sim-builder` the tutor’s prompt is generated alongside the sim and references its specific UI; standalone, it just structures the hint sequence.	A2UI configuration form	v1
`concept-dialogue-config`	Configures a standalone Socratic conceptual-exploration tutor for a topic (no paired sim)	A2UI configuration form	v1
`lab-troubleshoot-config`	Configures a tutor that helps students debug experimental data	A2UI form + AILANG Parse for uploads	Year-2

Dashboard / management

Skill	What it does	Surface	Scope
`class-status`	Current class usage, budget remaining, time-to-reset, active groups	A2UI dashboard block	Year-2
`manage-class`	Create class, generate group codes, bind groups to activities, list groups	A2UI form	✅ v0.1 shipped (code generation + class binding live)
`review-artefacts`	Pending generated artefacts awaiting teacher approval before students see them	A2UI list with inline preview + approve/reject	Year-2
`chat-log-search`	Search this teacher’s anonymised class chat logs	A2UI search results	Year-2 (BigQuery direct for v1)

Student-facing (consumes teacher configurations)

Skill	What it does	Scope
Problem-set hints — Boldkast	Socratic tutor for projectile motion (Danish stx); paired with the Boldkast simulator workbench	✅ v0.1 shipped 2026-05-20 — live at aipla-v01-frontend-wgwhd7mspa-lz.a.run.app
LED Planck virtual lab	Socratic tutor for Planck’s constant (Danish stx physics-A); paired with AR’s LED threshold voltage lab (procedural-lab artefact class — distinct from Boldkast’s phenomenon-sim)	✅ shipped 2026-05-27 — second physics skill
KineBot kinematics tutor	Socratic tutor for NCERT/CBSE Class 11 kinematics (English); paired with 7 canvas-based simulations. Beta cohort: DK’s Indian students	✅ shipped 2026-05-28 — third physics skill, establishes external-artefact migration runbook
Conceptual exploration	Socratic dialogue on a topic the teacher pre-configured	v1
Student image / document upload	Students upload handwritten diagrams, photos of experimental setups, or draft notes; tutor gives feedback grounded in the curriculum DRAs for the activity. Confirmed as the most-requested student-facing feature in the 3 June teacher check-in	✅ v1.1 shipped — paperclip/camera in the student chat, the tutor sees the image; curriculum-grounded feedback via the activity’s cited docs also live (RAG)
Lab data troubleshooting	Photo / CSV / Tracker uploads + deeper analysis beyond single-image feedback	v1.2

Stretch (Year-2 / post-handover)

Skill	What it does	Notes
`oral-exam-prep`	AI plays the role of a Danish physics examiner; student draws a topic, prepares, then presents. AI gives formative feedback on conceptual completeness and DRA coverage. Confirmed as high-value in 3 June teacher check-in.	v1.1 stretch — requires DRA map per topic + voice mode
`student-note-taking`	AI compares student notes (typed or handwritten photo) to curriculum DRAs for the active activity; flags gaps and vague phrasing; tracks improvement across drafts over a school year.	v1.2 — pedagogically significant (students generating representations, not just consuming sims)
`student-as-creator` (Strand B)	Lets students generate their own simulations with AI assistance. Confirmed as a desired direction in 3 June teacher check-in.	Same machinery as `physics-sim-builder`; different skill prompts and review gates
`physics-sandbox-tool` (Strand B)	Interactive 2D physics sandbox embedded as an MCP App — drag-and-drop objects, run simulations, query state via MCP tools.	Build on Matter.js or Planck.js (MIT-licensed). Post-handover stretch.
`video-analysis-feedback`	Analyses recorded video of a student working through a task and gives formative feedback on progress	JB; multimodal; depends on Gemini video capabilities and student video consent posture
`student-model-tracker`	Maintains and updates a per-group concept network (C3)	The most speculative item; gated by the Strand C scoping recommendation

For the authoritative, code-verified status (what’s live, in progress, and backlog) see Current status on the Timeline page; the home page carries the headline. The lists below are the detailed build log behind them.

Platform features shipped alongside skills (early build, to 2026-06-01 — later shipped work is in Current status):

Teacher UI — five /teacher/* routes wired to real backend: class list, class detail (groups + assigned activities), activity configuration (teaching goal live; parameters and code tabs in progress), session report, analytics chat placeholder. {teacher_focus} injection confirmed end-to-end.
Chat log pipeline — BigQuery-backed; every chat turn and workbench event stored keyed on anonymous group ID. Durable session reports. aiplatform logs CLI for researcher access.
Lesson picker — /lessons route; students see all activities their group can access.
Proactive tutor greet — tutor speaks first on join, eliminating the dead-start UX gap.
Commit-on-submit workbench state — slider changes held locally, flushed only on explicit commit (Play button or chat submit).
TTS read-aloud — every tutor message has a read-aloud button; browser speechSynthesis, Danish, no privacy gate.
Mobile tab layout — chat ↔︎ workspace tab swap below tablet width; works on a single shared phone.
Student progress checklist — self-assessment checklist wired to the agent observability pipeline.
Boldkast vx/vy component arrows — velocity decomposition shown visually in the simulator, making the independence of horizontal and vertical motion directly visible rather than inferential.
MCP Apps spec compliance — standard JSON-RPC ui/initialize handshake across all artefacts; new artefacts follow the same path without custom hooks.
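The commit-on-submit item above can be sketched as a small state holder. This is a Python illustration of the idea rather than the platform’s frontend code; the class name, method names, and slider keys are hypothetical.

```python
from typing import Callable, Dict


class WorkbenchState:
    """Holds slider edits locally until an explicit commit."""

    def __init__(self, initial: Dict[str, float],
                 on_commit: Callable[[Dict[str, float]], None]):
        self._committed = dict(initial)
        self._pending: Dict[str, float] = {}
        self._on_commit = on_commit

    def set_slider(self, name: str, value: float) -> None:
        # Local only: nothing is reported to the tutor yet.
        self._pending[name] = value

    def commit(self) -> Dict[str, float]:
        # Called on Play or chat submit: flush pending edits as one event.
        if self._pending:
            self._committed.update(self._pending)
            self._pending.clear()
            self._on_commit(dict(self._committed))
        return dict(self._committed)


events = []
state = WorkbenchState({"angle_deg": 45.0, "speed_ms": 10.0}, events.append)
state.set_slider("angle_deg", 30.0)
state.set_slider("angle_deg", 60.0)  # overwrites the earlier uncommitted edit
state.commit()                       # emits exactly one event
```

Only the final slider position reaches the log, so intermediate dragging does not flood the chat-log pipeline with events.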

v1.1 direction (next sprint window, pre-August pilot):

Status (late June). Several items below have since shipped — selectable tutor personas, summary-first session reports, the researcher role, student image upload, teacher activity creation (concept activities) with an authoring co-pilot, and a teacher co-working co-pilot. Others remain planned or backlog: call-teacher, full bidirectional voice, the in-session consent prompt, the offline-lab workbench, the live class dashboard, and the mobile-performance pass. The authoritative, code-verified breakdown is Current status on the Timeline page; the list below is the original feedback-driven direction it grew from.

Teacher activity creation and branching — teachers create activities from scratch (topic, workbench type, Socratic prompt, uploaded materials) or branch from existing ones. Non-sim activities first class: a workbench can be a quiz, a drawing board, or nothing. Confirmed as the primary design priority in the 3 June teacher check-in.
Student image / document upload — handwritten diagrams, experimental photos, draft notes. Tutor gives feedback grounded in the activity’s curriculum concepts (e.g. “what are the units?” → student re-uploads corrected). The upload affordance and the uploaded image live on the workbench surface, not inline in the chat (16 June). Privacy guardrail: uploads are for no-person-in-frame material — diagrams, graphs, notes — which keeps the consent profile low. Also powers the no-laptop flow: at the end of class the AI ingests photographed handwritten notes and summarises them against the activity’s learning goals. Most-requested student feature in the 3 June check-in; reinforced 9 June.
Proactive sim-reactive tutor — tutor comments on slider changes without waiting for a student message. Teachers explicitly asked for this.
Student in-session consent prompt — opt-in for chat logging shown at session start. Keeps anonymous group auth; gated on JB sign-off.
Log summaries as primary display — narrative summary first in session reports; full transcript collapsed. Also the privacy strategy for eventual audio inclusion. Teachers can optionally share a session summary back to the student as formative feedback (teacher-gated, their choice; raised 16 June).
Researcher role — separate permission tier above teacher: cross-class, cross-teacher access to all sessions and raw BigQuery. Raised again 15 June (third session running); now specified in ADR-016.
Call teacher — a student can escalate to a human teacher mid-session; the teacher sees a “raised hand” signal in the live class view. Raised 15 June; planned, not yet built.
Student–AI relationship & permission framing — students read the tutor as an unfamiliar third entity — not a casual home chatbot, not their teacher — and are unsure they have permission to ask it; the introduction text goes unread (a student reached for Google for something the bot could answer). The fix is in-flow onboarding: the teacher frames the bot’s role at session start, the bot invites questions early and low-stakes (“you can ask me anything”), and the ask affordance is discoverable at the point of need. Surfaced 16 June, sharpened by JB. Pairs with call-teacher and the not-a-sycophant persona.
Bidirectional voice — student speaks → tutor responds with audio (sound in and out). Teachers stressed this on 9 June. The shipped path is turn-based TTS read-aloud plus a voice provider + selectable personas; full duplex voice is deferred (the late-June target slipped past the freeze). Research angle on whether voice changes Socratic interaction quality.
Group code persistence — codes last a school year (currently 30 days); connects to student portfolio download.
Selectable tutor personas — the tutor’s interaction style is a preset a teacher (and optionally a student) chooses per activity — e.g. concise and directive (“just try this”, no follow-ups), warm-but-not-sycophantic, or a rigorous exam-level persona — alongside the Socratic-questioning default, not replacing it. Raised 9 June; sycophancy was observed first-hand in the 16 June demo, reinforcing the warm-but-not-sycophantic preset.
Formative-first, with deliberate friction — the tutor optimises for what to do next, not just what happened, and deliberately preserves visible friction a teacher can see and correct rather than dispensing frictionless answers. Entry timing (typed vs pasted) is a candidate summative-integrity signal. Raised 9 June. Teachers also want a rolling class-level summary during the lesson (~every 5 minutes) — confirmed wanted 15 June; the live dashboard is gated on the R1 analytics-framework decision (before the 29 June freeze).
Offline-lab workbench — for lessons run without laptops: the teacher sets up a real experiment, students run it offline and enter their measurements, and the AI checks the entered data for mistakes in chat — keeping the experiment hands-on. Raised 9 June.
Mobile performance — runtime performance on phones (distinct from the mobile tab layout already shipped): students share a single phone, so load and response times on mobile are a usability gate. Flagged 15 June; audio latency is the same concern on the bidirectional-voice path.

Workbench type expansion: the workbench extends beyond HTML simulators to drawing boards (Excalidraw/tldraw), phone-sensor experiment tools, video analysis, lab notebooks, and a computational tool (a calculator that extends to code execution to help with maths; raised 16 June, broadened by JB). Twenty-three Danish physics apps at jitt.dk are immediate artefact candidates. Non-sim workbenches (quiz, lab notebook, drawing board) are also first class; teachers confirmed they want activities with no simulator at all.

Teacher activity authoring gains a referenceable library of common-curriculum PDFs (by stx level: A / B / C) that teachers can cite or extend with their own uploads. Teachers choose which sources feed a given activity (the activity’s RAG inputs; confirmed as an explicit teacher control 15 June), and the AI can co-design the missing workbench elements around the equipment a teacher already has.

Simulations are now a validated building block (POC complete); the 9 June priority is letting teachers add their own and mix activity types into hybrid lessons. The 16 June steer (JB) is to lead demos with this breadth of workbench types, showing the platform’s flexibility, rather than depth on any single simulator.

Vetted artefact library lives on GitHub for portability — survives any future hosting move (UCPH IT, alternate provider). Approved artefacts skip the content-review pipeline on each use; new artefacts pass through it.

Budget controls. All skill invocations bill against the AIPLA project’s centrally managed keys; students and teachers never see or provide an API key. Per-group and per-class budget caps are enforced at the model router (ADR-014) so one class can’t drain another’s allocation. The teacher dashboard surfaces current usage and time-to-reset.
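A minimal sketch of how per-group and per-class caps could be checked before a model call is forwarded. ADR-014 is not reproduced here, so everything below (the `BudgetLedger` name, the cost units, the reject-on-overrun rule) is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BudgetLedger:
    """Tracks spend per class and per group; names and units are hypothetical."""
    class_cap: float
    group_cap: float
    class_spend: Dict[str, float] = field(default_factory=dict)
    group_spend: Dict[str, float] = field(default_factory=dict)

    def try_charge(self, class_id: str, group_id: str, cost: float) -> bool:
        """Record the cost and return True only if both caps still hold."""
        new_class = self.class_spend.get(class_id, 0.0) + cost
        new_group = self.group_spend.get(group_id, 0.0) + cost
        if new_class > self.class_cap or new_group > self.group_cap:
            return False  # the router would refuse the call
        self.class_spend[class_id] = new_class
        self.group_spend[group_id] = new_group
        return True

    def remaining(self, class_id: str) -> float:
        """Budget left for a class, as a dashboard might display it."""
        return self.class_cap - self.class_spend.get(class_id, 0.0)


ledger = BudgetLedger(class_cap=10.0, group_cap=4.0)
first = ledger.try_charge("class-a", "group-1", 3.0)   # within both caps
second = ledger.try_charge("class-a", "group-1", 2.0)  # would push group-1 to 5.0
other = ledger.try_charge("class-b", "group-9", 4.0)   # class-b has its own allocation
```

Because spend is keyed by class, the refused charge in `class-a` has no effect on what `class-b` can still spend.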

Strand A — Pedagogical bot infrastructure

Goal. Deploy purpose-configured chatbots (“pedagogical bots”) that teachers use with their stx physics classes. The canonical pattern (per AR’s existing GenAI trials) is a paired unit — a sandboxed HTML simulation in workspace plus a Socratic tutor in chat whose prompt references that specific simulation’s UI. Sim and tutor are designed together; the tutor is the simulation’s pedagogical scaffold, not a generic physics chatbot.

v1 commits to the chat-tutor side plus a hand-curated library of paired sim+tutor units teachers can hand out. The on-demand AI generation of new paired units (physics-sim-builder, misconception-pair) is direction-of-travel for Year-2 — the architecture supports it but v1 ships the curated library.

Requirements (from the brief, updated with 3 June teacher feedback):

GDPR / privacy-by-design. EU-hosted, no PII leakage to US providers where avoidable. Holds up to UCPH data-protection review and to teachers’ and parents’ scrutiny. Student in-session consent prompt added for chat logging (gated on JB sign-off).
Usage-based billing on the project side. Students and teachers don’t need their own accounts or subscriptions.
Researcher access to all data. Researchers (JB, AR, M) need a separate permission tier above teacher — cross-class, cross-teacher access to all sessions and raw BigQuery. Chat log summaries (not verbatim transcripts) are the preferred research artefact; this also lowers the privacy profile when audio is eventually included.
Teacher activity creation. Teachers configure activities from scratch or branch from existing ones — setting topic, workbench type, Socratic prompt, and uploaded materials — without a developer in the loop. Non-sim activities (quiz-only, drawing board, lab notebook) are first class. Confirmed as the primary design priority from teacher feedback.
Resource upload. Teachers upload curriculum extracts, problem sets, lab guides. Students upload images and documents (handwritten diagrams, experimental photos, draft notes) for AI feedback.
Rich physics output. LaTeX formulae, tables, diagrams, graphs where possible.
Multimodal input. Photos of hand-drawn free-body diagrams, experimental setups, draft notes. CSV/Excel sensor exports. Video analysis.
Socratic constraint — not taking over. Teachers observed the tutor being too directive in early sessions. Explicit length caps and question-ending rules are enforced in the base system prompt; tutor responds to sim events but leaves students room to notice things themselves.
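The Socratic constraint above is enforced in the system prompt, but the same two rules can also be checked after the fact, for example when auditing chat logs. The sketch below is one hypothetical way to do that; the 60-word cap and the rule names are assumptions, not values from the platform.

```python
from typing import List

MAX_WORDS = 60  # hypothetical cap; the real limit lives in the base system prompt


def violates_socratic_rules(reply: str, max_words: int = MAX_WORDS) -> List[str]:
    """Return the rules a tutor reply breaks (an empty list means it passes)."""
    problems = []
    if len(reply.split()) > max_words:
        problems.append("too_long")
    if not reply.rstrip().endswith("?"):
        problems.append("no_closing_question")
    return problems


good = violates_socratic_rules("What happens to the range if you double the angle?")
bad = violates_socratic_rules("The answer is 45 degrees.")
```

A check like this only catches the surface form of a reply; whether a question actually leaves the student room to notice something still needs human review.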

Deliverable. A working pilot we can put in front of a small group of teachers by the end of the contract, plus documentation that lets us hand it off cleanly to the postdocs joining the project later.

Timeline. See Architecture for the ADRs that govern how Strand A gets built. Build phase weeks 3–6; internal iteration weeks 6–9; teacher pilot weeks 13–17. Mid-point review at week 9 is the first hard gate.

Strand B — Student-as-creator

Goal. Extend Strand A’s teacher-creates-artefacts pattern to students creating their own simulations and games with AI assistance. Same MCP-App + safety-pipeline machinery as Strand A; the change is in the skills exposed to students.

Two directions inside Strand B:

Generative — student-as-creator skill, AI builds a sim from a student prompt. Same machinery as physics-sim-builder; tighter review gates.
Interactive sandbox — physics-sandbox-tool, a Matter.js-based MCP App where students drag-and-drop objects and run physics. This is the Algodoo-replacement direction — interactive scene building inside AIPLA, no desktop-software lock-in, MCP-tool-accessible for AI-assisted scene authoring.

A GitHub-based workflow is the working assumption for the artefact library: versioned, inspectable, and portable across hosting (which matters because a UCPH IT takeover is on the self-hosting horizon). Students get exposure to a real engineering workflow; researchers get a complete edit history.

Deliverable. A working pilot similar to Strand A.

Status (July). The infrastructure half is effectively delivered by Strand A: AI-generated physics sims run as portable MCP Apps — inside the platform and in external AI hosts — behind the artefact review pipeline (ADR-013). What remains is the student-facing slice: the student-as-creator skill (tighter review gates than the teacher path), versioned student artefacts in the GitHub-based library, and a pilot framing so a class authors sims in weeks 14–15. The games direction rides the same machinery.

Timeline. Kickoff after the mid-point review (week 7+). Pilot-ready by weeks 14–15. The interactive sandbox is post-handover stretch unless an early scope opportunity appears.

Strand C — Investigation and scoping

Strand C addresses three core open questions that go beyond what gets built in this contract. The deliverable is a written scoping note (5–10 pages) covering these, plus any newer scoping inputs that surface (see the exam archive below), with a recommendation on which directions are worth pursuing and rough effort estimates.

C1 — Beyond LLMs

Are there model classes that might be better than LLMs for some of the project’s tasks, or that could supplement them? Vision-language models (VLMs) and embodied/world-model classes are obvious candidates worth investigating.

Questions to answer: What’s available now? What’s GDPR-tractable? What would it cost to host? Where does each model class shine vs. fall short for stx physics tasks?

This feeds directly into the capability-floor eval: once a new model class is in the eval, its capability is tracked alongside the LLM panel.

C2 — Beyond chat interfaces

Chat is the default but probably not the right interface for every physics task. Voice interfaces, tree-structured / branching interfaces, and concept-map interfaces all have a case.

One concrete option to investigate — a Plaud-like setup (see plaud.ai): record a student group’s discussion, transcribe, then have AI analyse it. This sits at the intersection of A and C — if a thin end-to-end prototype is feasible, it could be a useful target to design against during Strand A development.

C3 — Student models

The most speculative item — and the one the team is most excited about.

Can we have an AI maintain and update a model of a student’s understanding over the course of a topic — for example, as a network of concepts and relations — and compare it to a reference model of the same topic?

sequenceDiagram
    autonumber
    participant S as Student
    participant Bot as AI tutor
    participant SM as Student model<br/>(concept network)
    participant Ref as Reference model<br/>(curated for the topic)

    Note over S,Bot: Over the course of a topic
    loop Each interaction
        S->>Bot: free-form work, discussion, questions
        Bot->>SM: extract & update<br/>concepts and relations
    end
    SM->>Ref: compare topology
    Ref-->>Bot: gaps, divergences,<br/>misconceptions
    Bot-->>S: formative feedback<br/>"here is what I can see in your understanding"

Why this is exciting. Intelligent tutoring systems have long modelled student understanding (Bayesian Knowledge Tracing, knowledge-component models, etc.), but the cost of building and maintaining these models has historically been high. LLMs change that arithmetic substantially: they can ingest free-form student work and produce structured concept-map outputs without requiring per-domain hand-engineering. If this turns out to work well, it opens up:

Formative feedback at three levels — individual student, group, whole class. Output back to students framed as “here is what I, the AI, can see in your understanding.”
Teacher visibility into how a class is converging or diverging from a reference understanding.
A research instrument for studying how student understanding evolves over a topic — potentially novel data for physics-education research.
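The comparison step in the diagram can be illustrated with plain set operations on labelled edges. This is a deliberately naive sketch: it assumes concept labels match exactly, which real student language will not, and the example edges are invented.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]  # (concept, relation, concept)


def compare(student: Set[Edge], reference: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Split edges into shared, gaps (missing), and divergences (extra)."""
    return {
        "shared": student & reference,
        "gaps": reference - student,
        "divergences": student - reference,
    }


reference = {
    ("net force", "causes", "acceleration"),
    ("gravity", "acts on", "projectile"),
    ("horizontal velocity", "is independent of", "vertical velocity"),
}
student = {
    ("gravity", "acts on", "projectile"),
    ("net force", "causes", "velocity"),  # a classic force-implies-motion slip
}
report = compare(student, reference)

for label, edges in report.items():
    print(label, sorted(edges))
```

The divergences are only candidate misconceptions: an extra edge may also be a correct relation that the reference model happens to omit, which is part of what the scoping work would need to assess.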

One concrete realization — the teachable robot. Instead of the AI quietly modelling the student behind the scenes, flip it: the concept network is the object the student manipulates, shown in the workbench, and the chatbot answers from it. The map starts disconnected or wrongly connected; the robot’s answers are wrong in ways the student can see. Students correct the robot by showing it evidence — an experiment result, a worked problem — and the robot updates its model, which changes its answers. The developing map becomes the assessment object, formative and summative, usable at examinations. This is the learning-by-teaching paradigm (cf. Vanderbilt’s Betty’s Brain; the protégé effect), with physics-specific concept networks in the lineage of Koponen & Nousiainen. It lands directly on the existing Chatbot | Workbench form factor — workbench as concept-map editor, chat as the robot reasoning over it — and adds a concept-map editor to the workbench types already being built.

The decision that makes or breaks it: the robot must reason from the map, not from the model’s own physics knowledge. An LLM “knows” the right answer regardless of what the student’s map says; if it answers from that, the map is decorative and the mechanic collapses. The graph has to be the source of truth, with the model only verbalizing a traversal of the student’s network (closed-world over their map). This is also what makes it a research contribution rather than a demo, and it is squarely a scoping question.
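A toy version of the closed-world rule, assuming the student’s map is stored as an adjacency list. A fixed template stands in for the model’s verbalisation step, and the map contents are invented; the point is only that the answer is derived from the graph and nothing else.

```python
from typing import Dict, List, Tuple

# The student's map: concept -> list of (relation, concept).
StudentMap = Dict[str, List[Tuple[str, str]]]


def answer_from_map(graph: StudentMap, concept: str) -> str:
    """Verbalise only what the map states; never add outside knowledge."""
    links = graph.get(concept, [])
    if not links:
        return f"My map says nothing about {concept} yet."
    parts = [f"{concept} {relation} {target}" for relation, target in links]
    return "According to my map: " + "; ".join(parts) + "."


# The robot starts with a wrong link and repeats it faithfully.
robot_map: StudentMap = {
    "heavier objects": [("fall faster than", "lighter objects")],
}
print(answer_from_map(robot_map, "heavier objects"))
print(answer_from_map(robot_map, "air resistance"))

# The student corrects the map; the robot's answer changes with it.
robot_map["heavier objects"] = [("fall at the same rate as", "lighter objects")]
print(answer_from_map(robot_map, "heavier objects"))
```

In a real build the template would be replaced by a model call constrained to the retrieved edges, and verifying that the model stays inside that constraint is the hard part the scoping note has to address.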

What the scoping needs to assess. Clear-eyed feasibility, not advocacy:

Are LLMs actually consistent enough at extracting concept-map representations from free-form student input? Run trials against curated examples.
How well do LLM-extracted models align with expert-built reference models? Calibrate on physics-education concept inventories where these exist (e.g., FCI for mechanics).
What is the smallest useful slice? A “thin end-to-end prototype” — one topic, one concept network, one comparison — would teach more than a long literature review.
What’s the privacy posture? Student understanding models are sensitive data; the GDPR analysis needs to be in scope from day one.
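For the consistency question in the first item above, one simple measure is the average overlap between the edge sets produced by repeated extraction runs on the same student input. The sketch below uses Jaccard similarity; the choice of metric and the function names are assumptions, and the scoping work may well settle on something different.

```python
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (concept, relation, concept)


def jaccard(a: Set[Edge], b: Set[Edge]) -> float:
    """Overlap of two edge sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(runs: List[Set[Edge]]) -> float:
    """Average Jaccard similarity over every pair of extraction runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three invented runs over the same student text; the third one drifts.
runs = [
    {("force", "causes", "acceleration"), ("mass", "resists", "acceleration")},
    {("force", "causes", "acceleration"), ("mass", "resists", "acceleration")},
    {("force", "causes", "acceleration"), ("force", "causes", "velocity")},
]
score = mean_pairwise_agreement(runs)
print(f"mean pairwise agreement: {score:.3f}")
```

Like the comparison against a reference model, this treats edges as exact strings, so paraphrased labels count as disagreement; any real trial would need a label-normalisation step first.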

Recommendation framing. The scoping note should give JB a decision-quality assessment: is this worth investing in as a Year-2 development item, is it worth a small Year-2 exploratory study, or is it a dead end in current model capability? M’s recommendation will come with rough effort estimates and risk markers.

Timeline. Scoping kicks off in mid-July, after the mid-point review. The initial assessment is based on existing literature + AILANG benchmark probes. A thin proof-of-concept against a single physics topic is ideal but not required; if it fits in weeks 14–15 without compromising Strand A or B, do it.

Exam archive as a standards corpus

A newer scoping input (9 June teacher feedback): a complete archive of national written exams back to 2010. Two uses are in tension, and the point of scoping is to weigh them rather than decide prematurely:

Standards / concept extraction — mine the papers to build a ground-truth set of concepts and expected competences per topic and stx level. This feeds the C3 reference model and the teacher-analytics DRA vocabulary.
Exam-training tool — students practise against past papers with AI feedback: the written-exam analogue of the oral-exam-prep direction.

The papers are copyrighted: concept extraction sidesteps most of the issue, verbatim re-serving does not. Either way the IP and GDPR posture is in scope before anything ships.