The three strands

What gets built, what gets investigated, and when

The four-month contract splits into three strands. A and B are priority delivery — working pilots in front of teachers. C is a scoping investigation that informs what gets built after the contract ends.

| Strand | Type | Priority | Deliverable |
| --- | --- | --- | --- |
| A: Pedagogical bot infrastructure | Build | High | Working pilot in front of teachers by end of contract |
| B: Simulation & game creation | Build | High | Working pilot similar to A |
| C: Investigation and scoping | Research | Lower | 5–10pp scoping note with feasibility + recommendations |

Timeline at a glance

gantt
    title When each strand gets attention
    dateFormat YYYY-MM-DD
    axisFormat %b
    todayMarker off

    section A — Bots
    Build core                :a1, 2026-05-15, 42d
    Post-holiday iterate      :a2, 2026-07-15, 30d
    Teacher pilot             :a3, 2026-08-14, 32d

    section B — Sims/games
    Build                     :b1, 2026-07-15, 28d
    Pilot                     :b2, after b1, 14d

    section C — Scoping
    Investigation             :c1, 2026-07-15, 21d
    Scoping note              :c2, after c1, 21d

    section Gates
    Jutland demo (v0.1)       :milestone, m0, 2026-05-27, 0d
    Mid-point review          :milestone, m1, 2026-06-26, 0d
    M + JB holiday (wk 27)    :crit, 2026-06-29, 7d
    Final handover            :milestone, m2, 2026-09-15, 0d

A v0.1 demo URL lands by Wed 27 May, in time for JB and Aswin’s Jutland teacher visit. Strand A’s core build then runs to the mid-point review on 26 June, before Mark and JB go on holiday (week 27, 29 June – 5 July). Strands B and C kick off after their return, from 15 July. Teacher pilot starts when Danish teachers return in mid-August.
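
The dates in the chart can be cross-checked with a few lines of date arithmetic. The sketch below copies start dates and durations from the Gantt chart; the bar labels and the `end_of` helper are illustrative names, not part of the project.

```python
from datetime import date, timedelta

# Start dates and durations copied from the Gantt chart above.
BARS = {
    "A core build": (date(2026, 5, 15), 42),
    "A post-holiday iterate": (date(2026, 7, 15), 30),
    "A teacher pilot": (date(2026, 8, 14), 32),
    "Holiday (wk 27)": (date(2026, 6, 29), 7),
}


def end_of(name: str) -> date:
    """Return the exclusive end date of a bar (start + duration)."""
    start, days = BARS[name]
    return start + timedelta(days=days)


if __name__ == "__main__":
    for name in BARS:
        print(f"{name}: ends {end_of(name).isoformat()}")
    # ISO week number of the first holiday day.
    print("holiday ISO week:", date(2026, 6, 29).isocalendar()[1])
```

Under these inputs the core build ends on the mid-point review date, the iteration bar ends where the teacher pilot begins, and the pilot bar ends on the final-handover date.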

How the strands relate

flowchart TB
    A[Strand A<br/>Pedagogical bots]
    B[Strand B<br/>Simulation / games]
    C[Strand C<br/>Scoping note]

    subgraph Shared["Shared infrastructure"]
        direction LR
        Auth[Auth / sessions]
        Logs[Anonymised chat logs]
        Router[Model router]
        RAG[RAG store]
    end

    Eval[Capability-floor eval]

    A --> Shared
    B --> Shared
    Shared --> Router
    Router --> Eval

    C -.investigates.-> Eval
    C -.may bridge to.-> A

    style A fill:#fce7e8,stroke:#901a1e,stroke-width:2px
    style B fill:#fce7e8,stroke:#901a1e,stroke-width:2px
    style C fill:#fff,stroke:#901a1e,stroke-width:2px,stroke-dasharray:5
    style Eval fill:#f5f5f5,stroke:#666

Strand B reuses Strand A’s plumbing — the load-bearing decision that makes both pilots achievable in the contract window. Strand C may produce a concrete prototype (the Plaud-like interface) that bridges back into Strand A.

Skills

What’s committed vs. direction-of-travel. Each skill below is tagged with a scope marker. We set the v1 deliverable conservatively and over-deliver where we can; the architecture enables all of these but the 4-month contract only commits a focused subset.

  • v0.1 — Jutland teacher demo (Wed 27 May)
  • v1 — Teacher pilot (mid-August onwards)
  • Year-2 — Architectural direction; enabled by the platform but not in the 4-month deliverable

Concrete skills that will sit on top of the Strand A infrastructure. Designed with AR; refined as the teacher pilot reveals what actually gets used.

The canonical pairing. AR’s preferred workflow (Examples page) pairs physics-sim-builder with a tutor-config skill (typically problem-set-helper-config) — sim + tutor designed together as a unit, with the tutor’s prompt referencing the simulation’s specific UI elements. Standalone-tutor skills (no paired sim) cover topics like pure conceptual dialogue where there’s no UI to reference.

Teacher-facing (primary)

All teacher tasks happen in the same UI app as students — a single multi-surface frontend where the AI directs which surface each output goes to (chat for conversation, workspace for active views like dashboards and artefacts, sidebar for persistent context, modal for focused approval flows). See ADR-015.
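
As a rough illustration of the multi-surface idea, the sketch below maps a surface name chosen by the AI onto one of the four surfaces named above. The `Surface`, `Output`, and `route` names are hypothetical, and the fall-back to chat for unknown names is an assumption, not something ADR-015 is known to specify.

```python
from dataclasses import dataclass
from enum import Enum


class Surface(Enum):
    CHAT = "chat"            # conversation
    WORKSPACE = "workspace"  # active views: dashboards, artefacts
    SIDEBAR = "sidebar"      # persistent context
    MODAL = "modal"          # focused approval flows


@dataclass(frozen=True)
class Output:
    surface: Surface
    payload: str


def route(raw_surface: str, payload: str) -> Output:
    """Map the surface name chosen by the AI to a known surface.

    Assumption: unknown surface names fall back to chat rather than failing.
    """
    try:
        surface = Surface(raw_surface)
    except ValueError:
        surface = Surface.CHAT
    return Output(surface, payload)
```

For example, `route("workspace", "<dashboard>")` targets the workspace, while an unrecognised name such as `"popup"` lands in chat.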

Content creation

| Skill | What it does | Surface | Scope |
| --- | --- | --- | --- |
| physics-sim-builder | Generates a paired sim+tutor unit. Phenomenon-simulation form factor (sliders, observation, parameter exploration). Structured inputs: Topic (full scope) + Student Difficulty (emphasis). Anti-reduction discipline: address the misconception inside the complete topic, never collapse to it. Output structure: HTML simulation + Predict→Explore→Reflect→Summarize learning flow + paired Socratic tutor referencing the sim’s UI. Modelled on AR’s iteratively refined prompt (sources/aswin-trials/prompt-aswin.txt). | MCP App (sim) + tutor-prompt feed into a config skill | Year-2 |
| physics-lab-builder | Generates a paired virtual-lab+tutor unit. Procedural form factor (draggable equipment, wiring, calibration, measurement workflow). Structured inputs: Experiment source (PDF or text of the real procedure) + Equipment list (anti-hallucination ground truth) + Student Difficulty. Output structure: Interactive HTML lab + checklist-driven procedure + paired Socratic tutor referencing the lab’s specific equipment and measurement steps. Modelled on AR’s second trial (LED Planck-constant lab; see Examples). | MCP App (lab) + tutor-prompt feed into a config skill | Year-2 |
| illustration-builder | Generates SVG/PNG illustrations for student materials | A2UI inline image | Year-2 |
| misconception-pair | Generates a correct illustration and a textbook-distractor wrong variant; pulls on the pedagogical pattern AR is already using to mine instructive AI failure modes | A2UI side-by-side | Year-2 |

v1 alternative — a hand-curated library of paired sim+tutor units, plus PhET sims paired with AIPLA-authored tutors for topics PhET already covers well. PhET sims are iframe-embeddable and pre-translated; AIPLA hosts a Socratic tutor whose prompt references the specific PhET sim’s UI. Same paired-unit pattern as AR’s; we just don’t author the sim ourselves for these. The AI-generation skills above ship when the artefact-generation pipeline is mature.

Bot configuration

| Skill | What it does | Surface | Scope |
| --- | --- | --- | --- |
| problem-set-helper-config | Configures a student-facing tutor for a problem set or simulation. When paired with physics-sim-builder the tutor’s prompt is generated alongside the sim and references its specific UI; standalone, it just structures the hint sequence. | A2UI configuration form | v1 |
| concept-dialogue-config | Configures a standalone Socratic conceptual-exploration tutor for a topic (no paired sim) | A2UI configuration form | v1 |
| lab-troubleshoot-config | Configures a tutor that helps students debug experimental data | A2UI form + AILANG Parse for uploads | Year-2 |

Dashboard / management

| Skill | What it does | Surface | Scope |
| --- | --- | --- | --- |
| class-status | Current class usage, budget remaining, time-to-reset, active groups | A2UI dashboard block | Year-2 |
| manage-class | Create class, generate group codes, bind groups to activities, list groups | A2UI form | ✅ v0.1 shipped (code generation + class binding live) |
| review-artefacts | Pending generated artefacts awaiting teacher approval before students see them | A2UI list with inline preview + approve/reject | Year-2 |
| chat-log-search | Search this teacher’s anonymised class chat logs | A2UI search results | Year-2 (BigQuery direct for v1) |

Student-facing (consumes teacher configurations)

| Skill | What it does | Scope |
| --- | --- | --- |
| Problem-set hints: Boldkast | Socratic tutor for projectile motion (Danish stx); paired with the Boldkast simulator workbench | ✅ v0.1 shipped 2026-05-20; live at aipla-v01-frontend-wgwhd7mspa-lz.a.run.app |
| LED Planck virtual lab | Socratic tutor for Planck’s constant (Danish stx physics-A); paired with AR’s LED threshold voltage lab (procedural-lab artefact class, distinct from Boldkast’s phenomenon-sim) | ✅ shipped 2026-05-27; second physics skill, validates the two-artefact-class pattern |
| Conceptual exploration | Socratic dialogue on a topic the teacher pre-configured | v1 |
| Lab data troubleshooting | Photo / CSV / Tracker uploads + analysis | Year-2 (multimodal upload not in v1) |

Stretch (Year-2 / post-handover)

| Skill | What it does | Notes |
| --- | --- | --- |
| student-as-creator (Strand B) | Lets students generate their own simulations with AI assistance | Same machinery as physics-sim-builder; different skill prompts and review gates |
| physics-sandbox-tool (Strand B) | Interactive 2D physics sandbox embedded as an MCP App: drag-and-drop objects, run simulations, query state via MCP tools. The “Algodoo-but-with-API” direction discussed in the 18 May meeting. | Validated by AR’s LED Planck-constant lab, a concrete instance of the form factor (movable equipment, click-terminal wiring, calibration, real lab procedure) AIPLA’s stretch sandbox would generalise. Build on Matter.js or Planck.js (MIT-licensed JS 2D physics). Algodoo itself is desktop-only with no external API; this stretch goal would give AIPLA an interactive scene-building surface that fits the MCP App architecture natively. |
| video-analysis-feedback | Analyses recorded video of a student working through a task and gives formative feedback on progress | New from JB; multimodal; depends on Gemini’s video capabilities and on the safety/privacy posture for student video |
| student-model-tracker | Maintains and updates a per-group concept network (C3) | The most speculative item; gated by the Strand C scoping recommendation |

Net v1 commitment. Out of the skills above, v1 ships five: problem-set-helper-config, concept-dialogue-config, manage-class (code generation + class binding), and the two student-facing chat skills (Boldkast + LED Planck, with KineBot to follow). Plus the hand-curated sim library teachers can hand out. Everything else is the architecture’s direction-of-travel, not the 4-month deliverable.

Platform features shipped alongside skills (post-Jutland, as of 2026-05-27):

  • Teacher UI — five /teacher/* routes wired to real backend: class list, class detail (groups + assigned activities), activity configuration (teaching goal live; parameters and code tabs wireframed for v1.1), session report, analytics chat placeholder. {teacher_focus} injection confirmed end-to-end.
  • Lesson picker: /lessons route replacing the hardcoded v0.1 redirect; students see all activities their group can access.
  • Proactive tutor greet — tutor speaks first on join, eliminating the dead-start UX gap observed in the first live student test.
  • Commit-on-submit workbench state — slider changes held locally, flushed to the agent only on explicit commit (Play button or chat submit). Keeps the agent context clean during exploratory interaction.
  • Sim-core architectural principle — workbench artefacts contain the simulation element only; instructions, data recording, and hints live in the tutor. Learned from the LED Planck integration; now the rule for all incoming artefacts.
  • TTS read-aloud — every tutor message has a read-aloud button using the browser’s native speechSynthesis API. No privacy gate; speaks in Danish. Useful for students who find reading the tutor’s full response a barrier.
  • Mobile tab layout — on phone-sized screens the chat and workspace panels swap via a tab toggle rather than appearing side-by-side; the platform works on a single shared phone, matching the typical school lab setup.
  • Student progress checklist — self-assessment checklist embedded in the workspace; each tick is wired to the agent observability pipeline so teachers can see at a glance how far each group progressed through the activity’s steps.
  • Boldkast vx/vy component arrows — the Boldkast simulator now shows decomposed velocity component arrows (vx horizontal, vy vertical) overlaid on the trajectory canvas, making the independence of horizontal and vertical motion visually present rather than purely inferential.
  • MCP Apps spec compliance — all workbench artefacts (Boldkast, LED Planck) migrated to the standard JSON-RPC ui/initialize handshake; the iframe-message harness is now fully on-spec, which means any future artefact follows the same integration path without custom hooks.
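
The commit-on-submit behaviour in the list above can be sketched as a small state holder. This is a minimal illustration under assumptions: the `WorkbenchState` class and its method names are hypothetical, and the real frontend is not written in Python.

```python
class WorkbenchState:
    """Hold slider changes locally and release them only on an explicit commit."""

    def __init__(self) -> None:
        self._committed: dict[str, float] = {}  # what the agent has seen
        self._pending: dict[str, float] = {}    # local, exploratory changes

    def set_slider(self, name: str, value: float) -> None:
        # Exploratory interaction: nothing is sent to the agent here.
        self._pending[name] = value

    def commit(self) -> dict[str, float]:
        """Flush pending changes (Play button or chat submit).

        Returns only the values that differ from the last committed state,
        so the agent context receives one update per commit.
        """
        changed = {
            name: value
            for name, value in self._pending.items()
            if self._committed.get(name) != value
        }
        self._committed.update(self._pending)
        self._pending.clear()
        return changed
```

Dragging a slider through several values and then committing yields a single update carrying the final value; a second commit with no changes yields nothing.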

Workbench type expansion (from JB post-Jutland feedback, 2026-05-26): the workbench will expand beyond HTML simulators to include drawing boards (Excalidraw/tldraw), phone-sensor experiment tools, video analysis, and lab notebooks. Twenty-three Danish physics apps at jitt.dk, built by a Jutland teacher and freely available, are immediate artefact candidates for activities 4–8 without generating new HTML. Design brief in the prototypes folder; implementation from v1.1.

Vetted artefact library lives on GitHub for portability — survives any future hosting move (UCPH IT, alternate provider). Approved artefacts skip the content-review pipeline on each use; new artefacts pass through it.
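
One way to implement "approved artefacts skip review, new ones do not" is to key approval on a content fingerprint, so that any edit to an approved artefact sends it back through review. The sketch below assumes that design; the hashing approach and the function names are illustrative, not the project's confirmed mechanism.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Content hash of an artefact; changes whenever the artefact changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def needs_review(html: str, approved: set[str]) -> bool:
    """True when the artefact is not in the vetted library (assumed hash-keyed)."""
    return fingerprint(html) not in approved
```

With this shape, an unmodified library artefact is served directly, while a modified copy is treated as new.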

Budget controls. All skill invocations bill against the AIPLA project’s centrally-managed keys — students and teachers never see or provide an API key. Per-group and per-class budget caps enforced at the model router (ADR-014) so one class can’t drain another’s allocation. Teacher dashboard surfaces current usage and time-to-reset.
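
A minimal sketch of per-group and per-class caps, assuming a single in-memory ledger consulted before each model call. The `BudgetLedger` name, the uniform cap values, and the cost units are hypothetical; ADR-014 may define this differently.

```python
from collections import defaultdict


class BudgetLedger:
    """Track spend per class and per group; refuse charges that exceed a cap."""

    def __init__(self, class_cap: float, group_cap: float) -> None:
        self.class_cap = class_cap
        self.group_cap = group_cap
        self.class_spent: dict[str, float] = defaultdict(float)
        self.group_spent: dict[tuple[str, str], float] = defaultdict(float)

    def try_charge(self, class_id: str, group_id: str, cost: float) -> bool:
        """Record the cost and return True only if both caps still hold."""
        key = (class_id, group_id)
        if self.group_spent[key] + cost > self.group_cap:
            return False
        if self.class_spent[class_id] + cost > self.class_cap:
            return False
        self.group_spent[key] += cost
        self.class_spent[class_id] += cost
        return True

    def remaining(self, class_id: str) -> float:
        """Budget left for a class, as a dashboard might surface it."""
        return self.class_cap - self.class_spent[class_id]
```

Because spend is keyed by class, exhausting one class's allocation leaves every other class untouched.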


Strand A — Pedagogical bot infrastructure

Goal. Deploy purpose-configured chatbots (“pedagogical bots”) that teachers use with their stx physics classes. The canonical pattern (per AR’s existing GenAI trials) is a paired unit — a sandboxed HTML simulation in workspace plus a Socratic tutor in chat whose prompt references that specific simulation’s UI. Sim and tutor are designed together; the tutor is the simulation’s pedagogical scaffold, not a generic physics chatbot.

v1 commits to the chat-tutor side plus a hand-curated library of paired sim+tutor units teachers can hand out. The on-demand AI generation of new paired units (physics-sim-builder, misconception-pair) is direction-of-travel for Year-2 — the architecture supports it but v1 ships the curated library.

Requirements (from the brief):

  • GDPR / privacy-by-design. EU-hosted, no PII leakage to US providers where avoidable. Holds up to UCPH data-protection review and to teachers’ and parents’ scrutiny.
  • Usage-based billing on the project side. Students and teachers don’t need their own accounts or subscriptions.
  • Researcher access to chat logs. Retrievable, anonymisable, and analysable. Core research data.
  • Resource upload. Teachers and the project team upload their own materials (curriculum extracts, problem sets, lab guides) that a given bot draws on. RAG-style.
  • Rich physics output. LaTeX formulae, tables, diagrams, graphs where possible. Output compatible with TI-Nspire / GeoGebra would be a plus.
  • Multimodal input. Photos of hand-drawn free-body diagrams, screenshots of Tracker video-analysis results, CSV/Excel exports from sensors.
  • Teacher bot configuration. Teachers configure or instantiate bots for a specific topic / assignment without a developer in the loop each time. Doesn’t have to be polished in v1, but the path to it should be clear.
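
For the chat-log requirement, one common pseudonymisation approach is a keyed hash of the group identifier, which keeps a group's records linkable for research without exposing the original code. The sketch below assumes that approach; the record fields, the `grp-` prefix, and the function names are hypothetical. It covers identifiers only and does not scrub personal data typed into message text.

```python
import hashlib
import hmac


def pseudonym(group_code: str, secret: bytes) -> str:
    """Stable pseudonym for a group code, derived with a project-held secret."""
    digest = hmac.new(secret, group_code.encode("utf-8"), hashlib.sha256).hexdigest()
    return "grp-" + digest[:12]


def anonymise(record: dict, secret: bytes) -> dict:
    """Return a copy of a log record with the group code replaced."""
    out = dict(record)
    out["group"] = pseudonym(record["group"], secret)
    return out
```

The same group always maps to the same pseudonym under one secret, so sessions can still be analysed per group.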

Deliverable. A working pilot we can put in front of a small group of teachers by the end of the contract, plus documentation that lets us hand it off cleanly to the postdocs joining the project later.

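Timeline. See Architecture for the ADRs that govern how Strand A gets built. Following the Gantt chart above: core build from 15 May to the mid-point review on 26 June; post-holiday iteration from 15 July; teacher pilot from 14 August to final handover on 15 September. The mid-point review is the first hard gate.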


Strand B — Student-as-creator

Goal. Extend Strand A’s teacher-creates-artefacts pattern to students creating their own simulations and games with AI assistance. Same MCP-App + safety-pipeline machinery as Strand A; the change is in the skills exposed to students.

Two directions inside Strand B:

  • Generative: the student-as-creator skill, where AI builds a sim from a student prompt. Same machinery as physics-sim-builder; tighter review gates.
  • Interactive sandbox: physics-sandbox-tool, a Matter.js-based MCP App where students drag-and-drop objects and run physics. This is the Algodoo-replacement direction: interactive scene building inside AIPLA, no desktop-software lock-in, MCP-tool-accessible for AI-assisted scene authoring.

A GitHub-based workflow is the working assumption for the artefact library: versioned, inspectable, and portable across hosting, which matters because a UCPH IT takeover is on the self-hosting horizon. Students get exposure to a real engineering workflow; researchers get a complete edit history.

Deliverable. A working pilot similar to Strand A.

Timeline. Kickoff after the mid-point review (week 7+). Pilot-ready by weeks 14–15. Strand B needs less new infrastructure than the original framing suggested, because most of the heavy lifting is already done by Strand A’s artefact machinery. The interactive sandbox is post-handover stretch unless an early scope opportunity appears.


Strand C — Investigation and scoping

Three open questions that go beyond what gets built in this contract. The deliverable is a written scoping note (5–10 pages) covering all three, with a recommendation on which directions are worth pursuing and rough effort estimates.

C1 — Beyond LLMs

Are there model classes that might be better than LLMs for some of the project’s tasks, or that could supplement them? Vision-language models (VLMs) and embodied/world-model classes are obvious candidates worth investigating.

Questions to answer: What’s available now? What’s GDPR-tractable? What would it cost to host? Where does each model class shine vs. fall short for stx physics tasks?

This feeds directly into the capability-floor eval — once a new model class is in the eval, its capability becomes tracked alongside the LLM panel.
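
The text does not define the capability floor in detail, so the sketch below assumes one plausible reading: a model clears the floor when its score on every tracked task meets that task's minimum. The function names, task names, and score scale are all hypothetical.

```python
def clears_floor(scores: dict[str, float], floor: dict[str, float]) -> bool:
    """True when every task in the floor is met; a missing score counts as 0."""
    return all(scores.get(task, 0.0) >= minimum for task, minimum in floor.items())


def panel_report(
    panel: dict[str, dict[str, float]], floor: dict[str, float]
) -> dict[str, bool]:
    """Evaluate every model in the panel, including any newly added model class."""
    return {model: clears_floor(scores, floor) for model, scores in panel.items()}
```

Adding a new model class is then just one more entry in `panel`, after which it is tracked alongside the existing LLMs.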

C2 — Beyond chat interfaces

Chat is the default but probably not the right interface for every physics task. Voice interfaces, tree-structured / branching interfaces, and concept-map interfaces all have a case.

One concrete option to investigate — a Plaud-like setup (see plaud.ai): record a student group’s discussion, transcribe, then have AI analyse it. This sits at the intersection of A and C — if a thin end-to-end prototype is feasible, it could be a useful target to design against during Strand A development.

C3 — Student models

The most speculative item — and the one the team is most excited about.

Can we have an AI maintain and update a model of a student’s understanding over the course of a topic — for example, as a network of concepts and relations — and compare it to a reference model of the same topic?

sequenceDiagram
    autonumber
    participant S as Student
    participant Bot as AI tutor
    participant SM as Student model<br/>(concept network)
    participant Ref as Reference model<br/>(curated for the topic)

    Note over S,Bot: Over the course of a topic
    loop Each interaction
        S->>Bot: free-form work, discussion, questions
        Bot->>SM: extract & update<br/>concepts and relations
    end
    SM->>Ref: compare topology
    Ref-->>Bot: gaps, divergences,<br/>misconceptions
    Bot-->>S: formative feedback<br/>"here is what I can see in your understanding"

Why this is exciting. Intelligent tutoring systems have long modelled student understanding (Bayesian Knowledge Tracing, knowledge-component models, etc.), but the cost of building and maintaining these models has historically been high. LLMs change that arithmetic substantially: they can ingest free-form student work and produce structured concept-map outputs without requiring per-domain hand-engineering. If this turns out to work well, it opens up:

  • Formative feedback at three levels — individual student, group, whole class. Output back to students framed as “here is what I, the AI, can see in your understanding.”
  • Teacher visibility into how a class is converging or diverging from a reference understanding.
  • A research instrument for studying how student understanding evolves over a topic — potentially novel data for physics-education research.
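
In its simplest form, the "compare topology" step in the diagram above could be a set difference over relation triples. The sketch below assumes concept networks are stored as (concept, relation, concept) triples; that representation, the `compare` function, and the projectile-motion example are illustrative, not project data.

```python
Edge = tuple[str, str, str]  # (concept, relation, concept)


def compare(student: set[Edge], reference: set[Edge]) -> dict[str, set[Edge]]:
    """Split the two networks into shared edges, gaps, and divergences."""
    return {
        "shared": student & reference,       # matches the reference model
        "gaps": reference - student,         # in the reference, missing for the student
        "divergences": student - reference,  # student-only edges, possible misconceptions
    }


# Illustrative networks for projectile motion (made-up example).
reference = {
    ("gravity", "accelerates", "vertical velocity"),
    ("horizontal velocity", "is independent of", "vertical velocity"),
}
student = {
    ("gravity", "accelerates", "vertical velocity"),
    ("gravity", "slows", "horizontal velocity"),
}
result = compare(student, reference)
```

Exact triple matching is deliberately naive: real student phrasing would need relation and concept names normalised before a comparison like this is meaningful.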

What the scoping needs to assess. Clear-eyed feasibility, not advocacy:

  • Are LLMs actually consistent enough at extracting concept-map representations from free-form student input? Run trials against curated examples.
  • How well do LLM-extracted models align with expert-built reference models? Calibrate on physics-education concept inventories where these exist (e.g., FCI for mechanics).
  • What is the smallest useful slice? A “thin end-to-end prototype” — one topic, one concept network, one comparison — would teach more than a long literature review.
  • What’s the privacy posture? Student understanding models are sensitive data; the GDPR analysis needs to be in scope from day one.
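
The consistency question in the first bullet can be made measurable by extracting a concept network from the same student input several times and scoring how much the runs agree. The sketch below uses mean pairwise Jaccard similarity over edge sets; the metric choice and function names are assumptions, one option among several.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap of two edge sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(runs: list[set]) -> float:
    """Average Jaccard similarity across all pairs of extraction runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # zero or one run: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score near 1.0 across repeated runs on curated examples would suggest the extraction is stable enough to build on; a low score would be an early signal against the approach.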

Recommendation framing. The scoping note should give JB a decision-quality assessment: is this worth investing in as a Year-2 development item, is it worth a small Year-2 exploratory study, or is it a dead end in current model capability? Mark’s recommendation will come with rough effort estimates and risk markers.

Timeline. Scoping kicks off in mid-July, after the mid-point review. Initial assessment based on existing literature + AILANG benchmark probes. A thin proof-of-concept against a single physics topic is ideal but not required; if it fits in weeks 14–15 without compromising Strand A or B, do it.