AILANG — Contracts Improve AI Code Generation

Research Context

This experiment extends ideas from "Towards Trustworthy AI-Generated Code: A Neurosymbolic Approach" (Pei et al., 2025), which showed that combining LLM code generation with formal verification produces more reliable software. We test a specific hypothesis: does teaching an AI agent to write requires/ensures contracts and use ailang ai-check (Z3 SMT verification) as a debugging tool improve its ability to solve programming benchmarks?

Experiment Design

26 contract-guided benchmarks ranging from easy to very hard. Each benchmark has a task description, expected output, and a reference solution with formal contracts. An AI agent (Claude Haiku 4.5) solves each benchmark under two conditions — same task, same model, different prompt.

Baseline Control

Standard prompt: task description + expected output. No mention of contracts or verification tools.

Write code to match expected output
Run and compare output
Fix and retry if wrong

Tool-Aware Treatment

Same task + guidance on writing contracts and using ailang ai-check as a debugging tool. No contract spec given.

Write code with ensures/requires contracts
Run ailang ai-check to find bugs via Z3
Fix counterexamples, then run for output
Contracts are self-authored, not given

Results by Difficulty

Difficulty	Count	Baseline	Tool-Aware	Delta	Z3 Used (Tool-Aware)	Contracts Written
Easy	6	100% (6/6)	100% (6/6)	0	5/6	6/6
Medium	9	89% (8/9)	89% (8/9)	0	6/9	8/9
Hard	8	75% (6/8)	88% (7/8)	+13%	4/8	7/8
Very Hard	3	33% (1/3)	100% (3/3)	+67%	2/3	2/3
Total	26	80.8% (21/26)	88.5% (23/26)	+7.7%	17/26 (65%)	23/26 (88%)

17/26

Used Z3 Verification

65% of tool-aware runs actually invoked ailang ai-check to verify contracts, compared to just 1/26 in baseline (where the agent had no knowledge of the tool).

23/26

Wrote Contracts

88% of tool-aware solutions included ensures/requires contracts, even though they weren't given a spec. Only 46% of baseline solutions wrote any contracts.

3/3

Very Hard Solved

All three very hard benchmarks passed with contracts. These involve recursive functions, 8-deep dependency chains, and subtle leap year traps that baseline consistently missed.

Benchmark Deep Dives

Safe Subtract — The Simplest Contract

Subtraction that floors at zero. Both conditions pass this easily, but it demonstrates the core principle: a contract makes implicit requirements explicit.

The Trap

Naive a - b goes negative when b > a. Z3 counterexample: a=2, b=100 → result=-98 violates ensures { result >= 0 }.

Contract Spec

export pure func safeSub(a: int, b: int) -> int
  requires { a >= 0, b >= 0 }
  ensures  { result >= 0 }

reference/safe_subtract.ail Reference Solution

module benchmarks/contract_guided/reference/safe_subtract

export pure func safeSub(a: int, b: int) -> int
  requires { a >= 0, b >= 0 }
  ensures  { result >= 0 }
{
  if a >= b then a - b
  else 0
}

export func main() -> () ! {IO} {
  println(show(safeSub(10, 3)));
  println(show(safeSub(5, 5)));
  println(show(safeSub(2, 100)))
}

Escalation Guard — Security Invariant

A privilege escalation check where a "bonus" for trusted actions can push the granted level past the role's base weight. The contract catches a subtle cross-function bug.

The Trap

The naive formula roleWeight + actionBonus - actionThreshold can exceed roleWeight when the bonus exceeds the threshold. Z3 counterexample: EDITOR(60) + READ bonus(15) - READ threshold(10) = 65, which exceeds roleWeight(EDITOR) = 60. Must cap the result.

Key Contract

-- The granted level must NEVER exceed the role's base weight.
export pure func grantedLevel(role: Role, action: Action) -> int
  ensures { result <= roleWeight(role) }

reference/escalation_guard.ail Reference Solution

module benchmarks/contract_guided/reference/escalation_guard

export type Role = ADMIN | EDITOR | VIEWER | GUEST
export type Action = READ | WRITE | DELETE

export pure func roleWeight(role: Role) -> int
  ensures { result >= 0, result <= 100 }
{
  match role {
    ADMIN  => 100,
    EDITOR => 60,
    VIEWER => 30,
    GUEST  => 10
  }
}

export pure func actionThreshold(action: Action) -> int
  ensures { result >= 10, result <= 80 }
{
  match action {
    READ   => 10,
    WRITE  => 50,
    DELETE => 80
  }
}

export pure func grantedLevel(role: Role, action: Action) -> int
  ensures { result <= roleWeight(role) }
{
  let base = roleWeight(role);
  let bonus = match action {
    READ   => 15,
    WRITE  => 5,
    DELETE => 0
  };
  let raw = base + bonus - actionThreshold(action);
  -- Guard: cap at roleWeight to prevent escalation
  if raw > base then base
  else if raw < 0 then 0
  else raw
}

Date Day-of-Year — Recursive Calendar with Leap Year Traps

8 functions including recursive daysBeforeMonth, 12-way month match, leap year with the 100/400 rule, and 3 consistency proofs. The baseline agent got 1900 wrong (treated it as a leap year). The tool-aware agent caught this via ai-check.

The Traps

Leap year 100/400 rule: 1900 is NOT a leap year (divisible by 100 but not 400). 2000 IS a leap year (divisible by 400). LLMs frequently get this wrong or reversed.

Recursive accumulation: daysBeforeMonth must correctly sum all previous months. An off-by-one in the recursion base case shifts all dates.

Consistency proofs: lastDayConsistent proves dayOfYear(y, 12, 31) == daysInYear(y) — if any month's day count is wrong, this fails immediately.

Key Contracts (3 consistency proofs)

export pure func lastDayConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
-- proves dayOfYear(y, 12, 31) == daysInYear(y)

export pure func leapFebConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
-- proves Feb days match leap year status

export pure func marchFirstConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
-- proves March 1 = day 61 (leap) or 60 (non-leap)

reference/date_dayofyear.ail Reference Solution

module benchmarks/contract_guided/reference/date_dayofyear

export pure func isLeapYear(year: int) -> bool
  requires { year >= 1 }
{
  (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

export pure func daysInMonth(year: int, month: int) -> int
  requires { year >= 1, month >= 1, month <= 12 }
  ensures  { result >= 28, result <= 31 }
{
  match month {
    1  => 31,
    2  => if isLeapYear(year) then 29 else 28,
    3  => 31,  4  => 30,  5  => 31,  6  => 30,
    7  => 31,  8  => 31,  9  => 30,  10 => 31,
    11 => 30,  _  => 31
  }
}

pure func daysBeforeMonth(year: int, month: int) -> int
  requires { year >= 1, month >= 1, month <= 12 }
{
  if month == 1 then 0
  else daysBeforeMonth(year, month - 1) + daysInMonth(year, month - 1)
}

export pure func dayOfYear(year: int, month: int, day: int) -> int
  requires { year >= 1, month >= 1, month <= 12, day >= 1, day <= 31 }
  ensures  { result >= 1, result <= 366 }
{
  daysBeforeMonth(year, month) + day
}

export pure func lastDayConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
{
  dayOfYear(year, 12, 31) == daysInYear(year)
}

Payroll Chain — 8-Function Dependency Cascade

A complete payroll calculation in cents: gross pay → taxable income → federal tax (3-bracket) → state tax → FICA → total deductions → net pay → consistency proof. Getting any single function wrong cascades through every downstream result.

The Traps

Cascading errors: 8 functions in a dependency chain. If federalTax computes brackets wrong, totalDeductions is wrong, netPay is wrong, and deductionsValid fails.

Progressive brackets: Each tax bracket applies only to income in that bracket, not total income. Naive taxable * rate / 100 overtaxes.

FICA wage base cap: Forgetting min(gross, 16020000) makes FICA grow without bound, breaking netPay ensures result >= 0.

Basis points: FICA is 765/10000 (7.65%), not 765/1000. Easy integer math error.

Key Contracts (chain invariants)

export pure func taxableIncome(gross: int) -> int
  ensures { result >= 0, result <= gross }

export pure func federalTax(taxable: int) -> int
  ensures { result >= 0, result <= taxable }

export pure func netPay(hourlyRate: int, hours: int) -> int
  ensures { result >= 0 }

export pure func deductionsValid(gross: int) -> bool
  ensures { result == true }
-- proves totalDeductions <= gross for ALL valid inputs

reference/payroll_chain.ail Reference Solution

module benchmarks/contract_guided/reference/payroll_chain

export pure func grossPay(hourlyRate: int, hours: int) -> int
  requires { hourlyRate >= 0, hours >= 0 }
  ensures  { result >= 0 }
{ hourlyRate * hours }

export pure func standardDeduction() -> int
  ensures { result == 1250000 }
{ 1250000 }

export pure func taxableIncome(gross: int) -> int
  requires { gross >= 0 }
  ensures  { result >= 0, result <= gross }
{
  let ti = gross - standardDeduction() in
  if ti < 0 then 0 else ti
}

export pure func federalTax(taxable: int) -> int
  requires { taxable >= 0 }
  ensures  { result >= 0, result <= taxable }
{
  let b1 = if taxable > 3000000 then 3000000 * 10 / 100
           else taxable * 10 / 100 in
  let over3m = if taxable > 3000000 then taxable - 3000000 else 0 in
  let b2 = if over3m > 4000000 then 4000000 * 22 / 100
           else over3m * 22 / 100 in
  let over7m = if taxable > 7000000 then taxable - 7000000 else 0 in
  let b3 = over7m * 32 / 100 in
  b1 + b2 + b3
}

export pure func ficaTax(gross: int) -> int
  requires { gross >= 0 }
  ensures  { result >= 0 }
{
  let base = if gross > 16020000 then 16020000 else gross in
  base * 765 / 10000
}

export pure func netPay(hourlyRate: int, hours: int) -> int
  requires { hourlyRate >= 0, hours >= 0 }
  ensures  { result >= 0 }
{
  let gross = grossPay(hourlyRate, hours) in
  gross - totalDeductions(gross)
}

export pure func deductionsValid(gross: int) -> bool
  requires { gross >= 0 }
  ensures  { result == true }
{ totalDeductions(gross) <= gross }

Why Contracts Help AI Coding

Implicit → Explicit

Writing ensures { result >= 0 } forces the agent to think about edge cases it would otherwise ignore. The contract is a thinking tool, not just a verification tool.

Counterexample Debugging

When ailang ai-check returns a Z3 counterexample like a=2, b=100, the agent gets a concrete failing case to fix — much more useful than "wrong output".

Cascading Confidence

In payroll_chain, verifying federalTax individually before composing it into totalDeductions prevents cascading errors. Each function is proven correct in isolation.

Try It Yourself

AILANG's contract verification is available now. Write ensures/requires contracts on your pure functions, then run ailang ai-check to verify them with Z3.

Live Z3 Demo Contract Docs Benchmark Source

Contracts Improve AI Coding

Teaching AI to Write Contracts
Improves Code Quality

Research Context

Experiment Design

Baseline Control

Tool-Aware Treatment

Results by Difficulty

Used Z3 Verification

Wrote Contracts

Very Hard Solved

Benchmark Deep Dives

The Trap

Contract Spec

The Trap

Key Contract

The Traps

Key Contracts (3 consistency proofs)

The Traps

Key Contracts (chain invariants)

Why Contracts Help AI Coding

Implicit → Explicit

Counterexample Debugging

Cascading Confidence

Try It Yourself

Teaching AI to Write Contracts Improves Code Quality

Research Context

Experiment Design

Baseline Control

Tool-Aware Treatment

Results by Difficulty

Used Z3 Verification

Wrote Contracts

Very Hard Solved

Benchmark Deep Dives

The Trap

Contract Spec

The Trap

Key Contract

The Traps

Key Contracts (3 consistency proofs)

The Traps

Key Contracts (chain invariants)

Why Contracts Help AI Coding

Implicit → Explicit

Counterexample Debugging

Cascading Confidence

Try It Yourself

Teaching AI to Write Contracts
Improves Code Quality