```mermaid
gantt
    title When each strand gets attention
    dateFormat YYYY-MM-DD
    axisFormat %b
    todayMarker off
    section A — Bots
    Build core :a1, 2026-05-15, 42d
    Post-holiday iterate :a2, 2026-07-15, 30d
    Teacher pilot :a3, 2026-08-14, 32d
    section B — Sims/games
    Build :b1, 2026-07-15, 28d
    Pilot :b2, after b1, 14d
    section C — Scoping
    Investigation :c1, 2026-07-15, 21d
    Scoping note :c2, after c1, 21d
    section Gates
    Jutland demo (v0.1) :milestone, m0, 2026-05-27, 0d
    Mid-point review :milestone, m1, 2026-06-26, 0d
    M + JB holiday (wk 27) :crit, 2026-06-29, 7d
    Final handover :milestone, m2, 2026-09-15, 0d
```
The three strands
What gets built, what gets investigated, and when
The four-month contract splits into three strands. A and B are priority delivery — working pilots in front of teachers. C is a scoping investigation that informs what gets built after the contract ends.
| Strand | Type | Priority | Deliverable |
|---|---|---|---|
| A — Pedagogical bot infrastructure | Build | High | Working pilot in front of teachers by end of contract |
| B — Simulation & game creation | Build | High | Working pilot similar to A |
| C — Investigation and scoping | Research | Lower | 5–10pp scoping note with feasibility + recommendations |
Timeline at a glance
A v0.1 demo URL lands by Wed 27 May, in time for JB and Aswin’s Jutland teacher visit. Strand A’s core build then runs to the mid-point review on 26 June, just before the M + JB holiday (week 27, 29 June – 5 July). Strands B and C kick off after the holiday, on 15 July. The teacher pilot starts when Danish teachers return in mid-August.
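The dates in this paragraph can be cross-checked against the Gantt durations with a few lines of date arithmetic. This is only a sanity check; the variable names are illustrative, and the dates and durations are copied from the chart.

```python
from datetime import date, timedelta

# Start dates and durations as they appear in the Gantt chart.
build_start = date(2026, 5, 15)
build_end = build_start + timedelta(days=42)      # Strand A "Build core"

pilot_start = date(2026, 8, 14)
pilot_end = pilot_start + timedelta(days=32)      # Strand A "Teacher pilot"

# ISO week 27 of 2026, Monday through Sunday.
holiday_start = date.fromisocalendar(2026, 27, 1)
holiday_end = date.fromisocalendar(2026, 27, 7)

demo_day = date(2026, 5, 27)

print(build_end)                      # 2026-06-26 (mid-point review)
print(holiday_start, holiday_end)     # 2026-06-29 2026-07-05
print(pilot_end)                      # 2026-09-15 (final handover)
print(demo_day.isoweekday())          # 3, i.e. Wednesday
```

The core build ends on the review date, the pilot ends on the handover date, and week 27 matches the holiday range quoted above.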
How the strands relate
```mermaid
flowchart TB
    A[Strand A<br/>Pedagogical bots]
    B[Strand B<br/>Simulation / games]
    C[Strand C<br/>Scoping note]
    subgraph Shared["Shared infrastructure"]
        direction LR
        Auth[Auth / sessions]
        Logs[Anonymised chat logs]
        Router[Model router]
        RAG[RAG store]
    end
    Eval[Capability-floor eval]
    A --> Shared
    B --> Shared
    Shared --> Router
    Router --> Eval
    C -.investigates.-> Eval
    C -.may bridge to.-> A
    style A fill:#fce7e8,stroke:#901a1e,stroke-width:2px
    style B fill:#fce7e8,stroke:#901a1e,stroke-width:2px
    style C fill:#fff,stroke:#901a1e,stroke-width:2px,stroke-dasharray:5
    style Eval fill:#f5f5f5,stroke:#666
```
Strand B reuses Strand A’s plumbing — the load-bearing decision that makes both pilots achievable in the contract window. Strand C may produce a concrete prototype (the Plaud-like interface) that bridges back into Strand A.
Skills
What’s committed vs. direction-of-travel. Each skill below is tagged with a scope marker. We set the v1 deliverable conservatively and over-deliver where we can; the architecture enables all of these but the 4-month contract only commits a focused subset.
- v0.1 — Jutland teacher demo (Wed 27 May)
- v1 — Teacher pilot (mid-August onwards)
- Year-2 — Architectural direction; enabled by the platform but not in the 4-month deliverable
Concrete skills that will sit on top of the Strand A infrastructure. Designed with AR; refined as the teacher pilot reveals what actually gets used.
The canonical pairing. AR’s preferred workflow (Examples page) pairs physics-sim-builder with a tutor-config skill (typically problem-set-helper-config) — sim + tutor designed together as a unit, with the tutor’s prompt referencing the simulation’s specific UI elements. Standalone-tutor skills (no paired sim) cover topics like pure conceptual dialogue where there’s no UI to reference.
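One way to picture the pairing is as a single record holding the simulation's UI elements and the tutor prompt written for them. The sketch below is an assumption about shape, not AR's actual format: the `PairedUnit` class, its field names, and the control labels are all hypothetical. It shows a simple check that the tutor prompt mentions each control in the sim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedUnit:
    """A simulation plus the tutor prompt designed for it (hypothetical shape)."""
    topic: str
    sim_controls: tuple[str, ...]   # labels of UI elements in the simulation
    tutor_prompt: str

    def unreferenced_controls(self) -> list[str]:
        """Return controls the tutor prompt never mentions (case-insensitive)."""
        prompt = self.tutor_prompt.lower()
        return [c for c in self.sim_controls if c.lower() not in prompt]


# Illustrative values only; not taken from a real unit.
unit = PairedUnit(
    topic="projectile motion",
    sim_controls=("launch angle", "initial speed", "Play"),
    tutor_prompt=(
        "Ask the student to predict the range before pressing Play. "
        "Then have them vary the launch angle and describe what changes."
    ),
)

print(unit.unreferenced_controls())  # ['initial speed']
```

A check like this would only catch missing references by label; whether the tutor uses the control well is still a matter for human review.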
Teacher-facing (primary)
All teacher tasks happen in the same UI app as students — a single multi-surface frontend where the AI directs which surface each output goes to (chat for conversation, workspace for active views like dashboards and artefacts, sidebar for persistent context, modal for focused approval flows). See ADR-015.
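As a rough illustration of surface routing, the sketch below groups tagged outputs by the four surfaces named above. It is not the ADR-015 design: the `route` function, the payload shapes, and the fallback to chat for unrecognised surface names are assumptions made for the example.

```python
from collections import defaultdict
from enum import Enum


class Surface(Enum):
    CHAT = "chat"
    WORKSPACE = "workspace"
    SIDEBAR = "sidebar"
    MODAL = "modal"


def route(outputs):
    """Group (surface_name, payload) pairs by surface.

    Unknown surface names fall back to chat (an assumption of this sketch).
    """
    by_surface = defaultdict(list)
    for surface_name, payload in outputs:
        try:
            surface = Surface(surface_name)
        except ValueError:
            surface = Surface.CHAT
        by_surface[surface].append(payload)
    return dict(by_surface)


routed = route([
    ("chat", "Here is this week's usage."),
    ("workspace", {"kind": "dashboard", "title": "Class status"}),
    ("modal", {"kind": "approval", "artefact": "example-sim"}),
    ("popup", "unrecognised surface"),
])

for surface, payloads in routed.items():
    print(surface.value, len(payloads))
```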
Content creation
| Skill | What it does | Surface | Scope |
|---|---|---|---|
| `physics-sim-builder` | Generates a paired sim+tutor unit. Phenomenon-simulation form factor (sliders, observation, parameter exploration). Structured inputs: Topic (full scope) + Student Difficulty (emphasis). Anti-reduction discipline: address the misconception inside the complete topic, never collapse to it. Output structure: HTML simulation + Predict→Explore→Reflect→Summarize learning flow + paired Socratic tutor referencing the sim’s UI. Modelled on AR’s iteratively refined prompt (sources/aswin-trials/prompt-aswin.txt). | MCP App (sim) + tutor-prompt feed into a config skill | Year-2 |
| `physics-lab-builder` | Generates a paired virtual-lab+tutor unit. Procedural form factor (draggable equipment, wiring, calibration, measurement workflow). Structured inputs: Experiment source (PDF or text of the real procedure) + Equipment list (anti-hallucination ground truth) + Student Difficulty. Output structure: Interactive HTML lab + checklist-driven procedure + paired Socratic tutor referencing the lab’s specific equipment and measurement steps. Modelled on AR’s second trial (LED Planck-constant lab — see Examples). | MCP App (lab) + tutor-prompt feed into a config skill | Year-2 |
| `illustration-builder` | Generates SVG/PNG illustrations for student materials | A2UI inline image | Year-2 |
| `misconception-pair` | Generates a correct illustration and a textbook-distractor wrong variant — pulls on the pedagogical pattern AR is already using to mine instructive AI failure modes | A2UI side-by-side | Year-2 |
v1 alternative — a hand-curated library of paired sim+tutor units, plus PhET sims paired with AIPLA-authored tutors for topics PhET already covers well. PhET sims are iframe-embeddable and pre-translated; AIPLA hosts a Socratic tutor whose prompt references the specific PhET sim’s UI. Same paired-unit pattern as AR’s; we just don’t author the sim ourselves for these. The AI-generation skills above ship when the artefact-generation pipeline is mature.
Bot configuration
| Skill | What it does | Surface | Scope |
|---|---|---|---|
| `problem-set-helper-config` | Configures a student-facing tutor for a problem set or simulation. When paired with physics-sim-builder the tutor’s prompt is generated alongside the sim and references its specific UI; standalone, it just structures the hint sequence. | A2UI configuration form | v1 |
| `concept-dialogue-config` | Configures a standalone Socratic conceptual-exploration tutor for a topic (no paired sim) | A2UI configuration form | v1 |
| `lab-troubleshoot-config` | Configures a tutor that helps students debug experimental data | A2UI form + AILANG Parse for uploads | Year-2 |
Dashboard / management
| Skill | What it does | Surface | Scope |
|---|---|---|---|
| `class-status` | Current class usage, budget remaining, time-to-reset, active groups | A2UI dashboard block | Year-2 |
| `manage-class` | Create class, generate group codes, bind groups to activities, list groups | A2UI form | ✅ v0.1 shipped (code generation + class binding live) |
| `review-artefacts` | Pending generated artefacts awaiting teacher approval before students see them | A2UI list with inline preview + approve/reject | Year-2 |
| `chat-log-search` | Search this teacher’s anonymised class chat logs | A2UI search results | Year-2 (BigQuery direct for v1) |
Student-facing (consumes teacher configurations)
| Skill | What it does | Scope |
|---|---|---|
| Problem-set hints — Boldkast | Socratic tutor for projectile motion (Danish stx); paired with the Boldkast simulator workbench | ✅ v0.1 shipped 2026-05-20 — live at aipla-v01-frontend-wgwhd7mspa-lz.a.run.app |
| LED Planck virtual lab | Socratic tutor for Planck’s constant (Danish stx physics-A); paired with AR’s LED threshold voltage lab (procedural-lab artefact class — distinct from Boldkast’s phenomenon-sim) | ✅ shipped 2026-05-27 — second physics skill, validates the two-artefact-class pattern |
| Conceptual exploration | Socratic dialogue on a topic the teacher pre-configured | v1 |
| Lab data troubleshooting | Photo / CSV / Tracker uploads + analysis | Year-2 (multimodal upload not in v1) |
Stretch (Year-2 / post-handover)
| Skill | What it does | Notes |
|---|---|---|
| `student-as-creator` (Strand B) | Lets students generate their own simulations with AI assistance | Same machinery as physics-sim-builder; different skill prompts and review gates |
| `physics-sandbox-tool` (Strand B) | Interactive 2D physics sandbox embedded as an MCP App — drag-and-drop objects, run simulations, query state via MCP tools. The “Algodoo-but-with-API” direction discussed in the 18 May meeting. Validated by AR’s LED Planck-constant lab — a concrete instance of the form factor (movable equipment, click-terminal wiring, calibration, real lab procedure) AIPLA’s stretch sandbox would generalise. | Build on Matter.js or Planck.js (MIT-licensed JS 2D physics). Algodoo itself is desktop-only with no external API; this stretch goal would give AIPLA an interactive scene-building surface that fits the MCP App architecture natively. |
| `video-analysis-feedback` | Analyses recorded video of a student working through a task and gives formative feedback on progress | New from JB; multimodal; depends on Gemini’s video capabilities and on the safety/privacy posture for student video |
| `student-model-tracker` | Maintains and updates a per-group concept network (C3) | The most speculative item; gated by the Strand C scoping recommendation |
Net v1 commitment. Out of the skills above, v1 ships five: problem-set-helper-config, concept-dialogue-config, manage-class (code generation + class binding), and the two student-facing chat skills (Boldkast + LED Planck, with KineBot to follow). Plus the hand-curated sim library teachers can hand out. Everything else is the architecture’s direction-of-travel, not the 4-month deliverable.
Platform features shipped alongside skills (post-Jutland, as of 2026-05-27):
- Teacher UI — five `/teacher/*` routes wired to real backend: class list, class detail (groups + assigned activities), activity configuration (teaching goal live; parameters and code tabs wireframed for v1.1), session report, analytics chat placeholder. `{teacher_focus}` injection confirmed end-to-end.
- Lesson picker — `/lessons` route replacing the hardcoded v0.1 redirect; students see all activities their group can access.
- Proactive tutor greet — tutor speaks first on join, eliminating the dead-start UX gap observed in the first live student test.
- Commit-on-submit workbench state — slider changes held locally, flushed to the agent only on explicit commit (Play button or chat submit). Keeps the agent context clean during exploratory interaction.
- Sim-core architectural principle — workbench artefacts contain the simulation element only; instructions, data recording, and hints live in the tutor. Learned from the LED Planck integration; now the rule for all incoming artefacts.
- TTS read-aloud — every tutor message has a read-aloud button using the browser’s native `speechSynthesis` API. No privacy gate; speaks in Danish. Useful for students who find reading the tutor’s full response a barrier.
- Mobile tab layout — on phone-sized screens the chat and workspace panels swap via a tab toggle rather than appearing side-by-side; the platform works on a single shared phone, matching the typical school lab setup.
- Student progress checklist — self-assessment checklist embedded in the workspace; each tick is wired to the agent observability pipeline so teachers can see at a glance how far each group progressed through the activity’s steps.
- Boldkast vx/vy component arrows — the Boldkast simulator now shows decomposed velocity component arrows (vx horizontal, vy vertical) overlaid on the trajectory canvas, making the independence of horizontal and vertical motion visually present rather than purely inferential.
- MCP Apps spec compliance — all workbench artefacts (Boldkast, LED Planck) migrated to the standard JSON-RPC `ui/initialize` handshake; the iframe-message harness is now fully on-spec, which means any future artefact follows the same integration path without custom hooks.
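The commit-on-submit behaviour in the list above can be sketched as a small state holder. This is a minimal illustration in Python rather than the frontend code itself; `WorkbenchState`, its method names, and the slider names are hypothetical.

```python
class WorkbenchState:
    """Holds slider edits locally and releases them only on an explicit commit."""

    def __init__(self, initial):
        self._committed = dict(initial)
        self._pending = {}

    def set_slider(self, name, value):
        # Exploratory change: recorded locally, nothing is sent to the agent.
        self._pending[name] = value

    def commit(self, send_to_agent):
        """Flush pending edits as one update (Play button or chat submit).

        Returns False and sends nothing if there were no pending edits.
        """
        if not self._pending:
            return False
        self._committed.update(self._pending)
        self._pending = {}
        send_to_agent(dict(self._committed))
        return True


sent = []  # stands in for the channel to the agent
state = WorkbenchState({"angle_deg": 45, "speed_m_s": 10})
for angle in (30, 50, 60):
    state.set_slider("angle_deg", angle)   # three drags, nothing sent yet
state.commit(sent.append)                  # student presses Play

print(sent)  # [{'angle_deg': 60, 'speed_m_s': 10}]
```

Three slider drags produce a single message carrying only the final value, which is the sense in which the agent context stays clean.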
Workbench type expansion (from JB post-Jutland feedback, 2026-05-26): the workbench will expand beyond HTML simulators to include drawing boards (Excalidraw/tldraw), phone-sensor experiment tools, video analysis, and lab notebooks. Twenty-three Danish physics apps at jitt.dk, built by a Jutland teacher and freely available, are immediate artefact candidates for activities 4–8 without generating new HTML. Design brief in the prototypes folder; implementation from v1.1.
Vetted artefact library lives on GitHub for portability — survives any future hosting move (UCPH IT, alternate provider). Approved artefacts skip the content-review pipeline on each use; new artefacts pass through it.
Budget controls. All skill invocations bill against the AIPLA project’s centrally-managed keys — students and teachers never see or provide an API key. Per-group and per-class budget caps enforced at the model router (ADR-014) so one class can’t drain another’s allocation. Teacher dashboard surfaces current usage and time-to-reset.
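A minimal sketch of how a two-level cap could be checked before a request is forwarded, assuming one ledger per class. The `BudgetLedger` class, its fields, and the abstract cost units are hypothetical; ADR-014 governs the real enforcement.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Tracks spend for one class and its groups (hypothetical structure)."""
    class_cap: float
    group_caps: dict
    class_spent: float = 0.0
    group_spent: dict = field(default_factory=dict)

    def try_charge(self, group, cost):
        """Record the cost and return True only if both caps still hold."""
        spent = self.group_spent.get(group, 0.0)
        if spent + cost > self.group_caps.get(group, 0.0):
            return False          # group cap would be exceeded
        if self.class_spent + cost > self.class_cap:
            return False          # class cap would be exceeded
        self.group_spent[group] = spent + cost
        self.class_spent += cost
        return True


ledger = BudgetLedger(class_cap=10.0, group_caps={"g1": 4.0, "g2": 8.0})
print(ledger.try_charge("g1", 3.0))   # True
print(ledger.try_charge("g1", 2.0))   # False: g1 would reach 5.0 of 4.0
print(ledger.try_charge("g2", 7.0))   # True: class total is now 10.0
print(ledger.try_charge("g2", 0.5))   # False: class cap of 10.0 reached
```

Because each class has its own ledger in this sketch, a rejected charge in one class never touches another class's allocation.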
Strand A — Pedagogical bot infrastructure
Goal. Deploy purpose-configured chatbots (“pedagogical bots”) that teachers use with their stx physics classes. The canonical pattern (per AR’s existing GenAI trials) is a paired unit — a sandboxed HTML simulation in workspace plus a Socratic tutor in chat whose prompt references that specific simulation’s UI. Sim and tutor are designed together; the tutor is the simulation’s pedagogical scaffold, not a generic physics chatbot.
v1 commits to the chat-tutor side plus a hand-curated library of paired sim+tutor units teachers can hand out. The on-demand AI generation of new paired units (physics-sim-builder, misconception-pair) is direction-of-travel for Year-2 — the architecture supports it but v1 ships the curated library.
Requirements (from the brief):
- GDPR / privacy-by-design. EU-hosted, no PII leakage to US providers where avoidable. Holds up to UCPH data-protection review and to teachers’ and parents’ scrutiny.
- Usage-based billing on the project side. Students and teachers don’t need their own accounts or subscriptions.
- Researcher access to chat logs. Retrievable, anonymisable, and analysable. Core research data.
- Resource upload. Teachers and the project team upload their own materials (curriculum extracts, problem sets, lab guides) that a given bot draws on. RAG-style.
- Rich physics output. LaTeX formulae, tables, diagrams, graphs where possible. Output compatible with TI-Nspire / GeoGebra would be a plus.
- Multimodal input. Photos of hand-drawn free-body diagrams, screenshots of Tracker video-analysis results, CSV/Excel exports from sensors.
- Teacher bot configuration. Teachers configure or instantiate bots for a specific topic / assignment without a developer in the loop each time. Doesn’t have to be polished in v1, but the path to it should be clear.
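For the chat-log requirement, one common building block is a keyed pseudonym that replaces the group code before logs reach researchers. The sketch below assumes a log record with `group_code`, `role` and `text` fields; those field names, the `grp_` prefix, and the function names are hypothetical. Keyed pseudonymisation on its own does not make data anonymous in the GDPR sense, and free-text messages can still contain personal details, so this would be one step among several.

```python
import hashlib
import hmac


def pseudonym(group_code: str, secret: bytes) -> str:
    """Stable keyed pseudonym for a group code; the key stays with the project."""
    digest = hmac.new(secret, group_code.encode("utf-8"), hashlib.sha256)
    return "grp_" + digest.hexdigest()[:12]


def anonymise(record: dict, secret: bytes) -> dict:
    """Replace the group code with a pseudonym and keep only research fields."""
    return {
        "group": pseudonym(record["group_code"], secret),
        "role": record["role"],
        "text": record["text"],
    }


secret = b"example-key-not-for-production"
raw = {
    "group_code": "3A-GROUP-2",
    "role": "student",
    "text": "Why does the ball keep moving sideways?",
    "teacher_email": "teacher@example.org",   # dropped by anonymise()
}
clean = anonymise(raw, secret)
print(sorted(clean))  # ['group', 'role', 'text']
```

The same group code always maps to the same pseudonym under a given key, so a group's conversations remain linkable for analysis without exposing the code itself.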
Deliverable. A working pilot we can put in front of a small group of teachers by the end of the contract, plus documentation that lets us hand it off cleanly to the postdocs joining the project later.
Timeline. See Architecture for the ADRs that govern how Strand A gets built. Build phase weeks 3–6; internal iteration weeks 6–9; teacher pilot weeks 13–17. Mid-point review at week 9 is the first hard gate.
Strand B — Student-as-creator
Goal. Extend Strand A’s teacher-creates-artefacts pattern to students creating their own simulations and games with AI assistance. Same MCP-App + safety-pipeline machinery as Strand A; the change is in the skills exposed to students.
Two directions inside Strand B:
- Generative — `student-as-creator` skill, AI builds a sim from a student prompt. Same machinery as `physics-sim-builder`; tighter review gates.
- Interactive sandbox — `physics-sandbox-tool`, a Matter.js-based MCP App where students drag-and-drop objects and run physics. This is the Algodoo-replacement direction — interactive scene building inside AIPLA, no desktop-software lock-in, MCP-tool-accessible for AI-assisted scene authoring.
GitHub-based workflow is the working assumption for the artefact library — versioned, inspectable, portable across hosting (matters because UCPH-IT-takeover is on the self-hosting horizon). Students get exposure to a real engineering workflow; researchers get a complete edit history.
Deliverable. A working pilot similar to Strand A.
Timeline. Kickoff after the mid-point review (week 7+). Pilot-ready by weeks 14–15. Less new infrastructure than the original framing suggested — most of the heavy lifting is already done by Strand A’s artefact machinery. The interactive sandbox is post-handover stretch unless an early scope opportunity appears.
Strand C — Investigation and scoping
Three open questions that go beyond what gets built in this contract. The deliverable is a written scoping note (5–10 pages) covering all three, with a recommendation on which directions are worth pursuing and rough effort estimates.
C1 — Beyond LLMs
Are there model classes that might be better than LLMs for some of the project’s tasks, or that could supplement them? Vision-language models (VLMs) and embodied/world-model classes are obvious candidates worth investigating.
Questions to answer: What’s available now? What’s GDPR-tractable? What would it cost to host? Where does each model class shine vs. fall short for stx physics tasks?
This feeds directly into the capability-floor eval — once a new model class is in the eval, its capability becomes tracked alongside the LLM panel.
C2 — Beyond chat interfaces
Chat is the default but probably not the right interface for every physics task. Voice interfaces, tree-structured / branching interfaces, concept-map interfaces all have a case.
One concrete option to investigate — a Plaud-like setup (see plaud.ai): record a student group’s discussion, transcribe, then have AI analyse it. This sits at the intersection of A and C — if a thin end-to-end prototype is feasible, it could be a useful target to design against during Strand A development.
C3 — Student models
Can we have an AI maintain and update a model of a student’s understanding over the course of a topic — for example, as a network of concepts and relations — and compare it to a reference model of the same topic?
```mermaid
sequenceDiagram
    autonumber
    participant S as Student
    participant Bot as AI tutor
    participant SM as Student model<br/>(concept network)
    participant Ref as Reference model<br/>(curated for the topic)
    Note over S,Bot: Over the course of a topic
    loop Each interaction
        S->>Bot: free-form work, discussion, questions
        Bot->>SM: extract & update<br/>concepts and relations
    end
    SM->>Ref: compare topology
    Ref-->>Bot: gaps, divergences,<br/>misconceptions
    Bot-->>S: formative feedback<br/>"here is what I can see in your understanding"
```
Why this is exciting. Intelligent tutoring systems have long modelled student understanding (Bayesian Knowledge Tracing, knowledge-component models, etc.), but the cost of building and maintaining these models has historically been high. LLMs change that arithmetic substantially: they can ingest free-form student work and produce structured concept-map outputs without requiring per-domain hand-engineering. If this turns out to work well, it opens up:
- Formative feedback at three levels — individual student, group, whole class. Output back to students framed as “here is what I, the AI, can see in your understanding.”
- Teacher visibility into how a class is converging or diverging from a reference understanding.
- A research instrument for studying how student understanding evolves over a topic — potentially novel data for physics-education research.
What the scoping needs to assess. Clear-eyed feasibility, not advocacy:
- Are LLMs actually consistent enough at extracting concept-map representations from free-form student input? Run trials against curated examples.
- How well do LLM-extracted models align with expert-built reference models? Calibrate on physics-education concept inventories where these exist (e.g., FCI for mechanics).
- What is the smallest useful slice? A “thin end-to-end prototype” — one topic, one concept network, one comparison — would teach more than a long literature review.
- What’s the privacy posture? Student understanding models are sensitive data; the GDPR analysis needs to be in scope from day one.
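To make the "one topic, one concept network, one comparison" slice concrete, the sketch below treats each network as a set of (concept, relation, concept) triples and compares a student network with a reference using plain set operations. The triple representation, the `compare` function, and the example physics triples are assumptions for illustration; a real comparison would likely need fuzzy matching of concept names.

```python
def compare(student: set, reference: set) -> dict:
    """Compare two concept networks given as sets of (concept, relation, concept) triples."""
    shared = student & reference
    union = student | reference
    return {
        "missing": sorted(reference - student),   # in the reference, not yet in the student model
        "extra": sorted(student - reference),     # in the student model only; review by hand
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Illustrative triples only; not a curated reference model.
reference = {
    ("net force", "causes", "acceleration"),
    ("gravity", "acts on", "projectile"),
    ("horizontal velocity", "is independent of", "vertical velocity"),
}
student = {
    ("gravity", "acts on", "projectile"),
    ("motion", "requires", "net force"),
}

result = compare(student, reference)
print(result["jaccard"])   # 0.25
print(result["missing"])
print(result["extra"])     # [('motion', 'requires', 'net force')]
```

The same overlap score could be applied to two extractions from the same student input as a rough measure of run-to-run consistency, which is the first feasibility question in the list above.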
Recommendation framing. The scoping note should give JB a decision-quality assessment: is this worth investing in as a Year-2 development item, is it worth a small Year-2 exploratory study, or is it a dead end at current model capability? Mark’s recommendation will come with rough effort estimates and risk markers.
Timeline. Scoping kickoff in mid-July, after the mid-point review. Initial assessment based on existing literature + AILANG benchmark probes. A thin proof-of-concept against a single physics topic is ideal but not required: if it fits in weeks 14–15 without compromising Strand A or B, do it.