AILANG

Contracts Improve AI Coding

Experimental evidence from 26 benchmarks
Experimental Results

Teaching AI to Write Contracts
Improves Code Quality

When an LLM agent learns to write ensures/requires contracts and verify them with ailang ai-check, it solves harder problems more reliably. Inspired by neurosymbolic verification research.

80.8%
Baseline
21/26 pass
88.5%
With Contracts
23/26 pass
+7.7%
Overall Delta
+2 benchmarks
33→100%
Very Hard
1/3 → 3/3

Research Context

This experiment extends ideas from "Towards Trustworthy AI-Generated Code: A Neurosymbolic Approach" (Pei et al., 2025), which showed that combining LLM code generation with formal verification produces more reliable software. We test a specific hypothesis: does teaching an AI agent to write requires/ensures contracts and use ailang ai-check (Z3 SMT verification) as a debugging tool improve its ability to solve programming benchmarks?

Experiment Design

26 contract-guided benchmarks ranging from easy to very hard. Each benchmark has a task description, expected output, and a reference solution with formal contracts. An AI agent (Claude Haiku 4.5) solves each benchmark under two conditions — same task, same model, different prompt.

Baseline Control

Standard prompt: task description + expected output. No mention of contracts or verification tools.

  • Write code to match expected output
  • Run and compare output
  • Fix and retry if wrong

Tool-Aware Treatment

Same task + guidance on writing contracts and using ailang ai-check as a debugging tool. No contract spec given.

  • Write code with ensures/requires contracts
  • Run ailang ai-check to find bugs via Z3
  • Fix counterexamples, then run for output
  • Contracts are self-authored, not given

Results by Difficulty

Difficulty Count Baseline Tool-Aware Delta Z3 Used (Tool-Aware) Contracts Written
Easy 6 100% (6/6) 100% (6/6) 0 5/6 6/6
Medium 9 89% (8/9) 89% (8/9) 0 6/9 8/9
Hard 8 75% (6/8) 88% (7/8) +13% 4/8 7/8
Very Hard 3 33% (1/3) 100% (3/3) +67% 2/3 2/3
Total 26 80.8% (21/26) 88.5% (23/26) +7.7% 17/26 (65%) 23/26 (88%)
17/26

Used Z3 Verification

65% of tool-aware runs actually invoked ailang ai-check to verify contracts, compared to just 1/26 in baseline (where the agent had no knowledge of the tool).

23/26

Wrote Contracts

88% of tool-aware solutions included ensures/requires contracts, even though they weren't given a spec. Only 46% of baseline solutions wrote any contracts.

3/3

Very Hard Solved

All three very hard benchmarks passed with contracts. These involve recursive functions, 8-deep dependency chains, and subtle leap year traps that baseline consistently missed.

Benchmark Deep Dives

Safe Subtract — The Simplest Contract
Subtraction that floors at zero. Both conditions pass this easily, but it demonstrates the core principle: a contract makes implicit requirements explicit.

The Trap

Naive a - b goes negative when b > a. Z3 counterexample: a=2, b=100 → result=-98 violates ensures { result >= 0 }.

Contract Spec

export pure func safeSub(a: int, b: int) -> int
  requires { a >= 0, b >= 0 }
  ensures  { result >= 0 }
reference/safe_subtract.ail Reference Solution
module benchmarks/contract_guided/reference/safe_subtract

export pure func safeSub(a: int, b: int) -> int
  requires { a >= 0, b >= 0 }
  ensures  { result >= 0 }
{
  if a >= b then a - b
  else 0
}

export func main() -> () ! {IO} {
  println(show(safeSub(10, 3)));
  println(show(safeSub(5, 5)));
  println(show(safeSub(2, 100)))
}
Escalation Guard — Security Invariant
A privilege escalation check where a "bonus" for trusted actions can push the granted level past the role's base weight. The contract catches a subtle cross-function bug.

The Trap

The naive formula roleWeight + actionBonus - actionThreshold can exceed roleWeight when the bonus exceeds the threshold. Z3 counterexample: EDITOR(60) + READ bonus(15) - READ threshold(10) = 65, which exceeds roleWeight(EDITOR) = 60. Must cap the result.

Key Contract

-- The granted level must NEVER exceed the role's base weight.
export pure func grantedLevel(role: Role, action: Action) -> int
  ensures { result <= roleWeight(role) }
reference/escalation_guard.ail Reference Solution
module benchmarks/contract_guided/reference/escalation_guard

export type Role = ADMIN | EDITOR | VIEWER | GUEST
export type Action = READ | WRITE | DELETE

export pure func roleWeight(role: Role) -> int
  ensures { result >= 0, result <= 100 }
{
  match role {
    ADMIN  => 100,
    EDITOR => 60,
    VIEWER => 30,
    GUEST  => 10
  }
}

export pure func actionThreshold(action: Action) -> int
  ensures { result >= 10, result <= 80 }
{
  match action {
    READ   => 10,
    WRITE  => 50,
    DELETE => 80
  }
}

export pure func grantedLevel(role: Role, action: Action) -> int
  ensures { result <= roleWeight(role) }
{
  let base = roleWeight(role);
  let bonus = match action {
    READ   => 15,
    WRITE  => 5,
    DELETE => 0
  };
  let raw = base + bonus - actionThreshold(action);
  -- Guard: cap at roleWeight to prevent escalation
  if raw > base then base
  else if raw < 0 then 0
  else raw
}
Date Day-of-Year — Recursive Calendar with Leap Year Traps
8 functions including recursive daysBeforeMonth, 12-way month match, leap year with the 100/400 rule, and 3 consistency proofs. The baseline agent got 1900 wrong (treated it as a leap year). The tool-aware agent caught this via ai-check.

The Traps

Leap year 100/400 rule: 1900 is NOT a leap year (divisible by 100 but not 400). 2000 IS a leap year (divisible by 400). LLMs frequently get this wrong or reversed.

Recursive accumulation: daysBeforeMonth must correctly sum all previous months. An off-by-one in the recursion base case shifts all dates.

Consistency proofs: lastDayConsistent proves dayOfYear(y, 12, 31) == daysInYear(y) — if any month's day count is wrong, this fails immediately.

Key Contracts (3 consistency proofs)

export pure func lastDayConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
-- proves dayOfYear(y, 12, 31) == daysInYear(y)

export pure func leapFebConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
-- proves Feb days match leap year status

export pure func marchFirstConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
-- proves March 1 = day 61 (leap) or 60 (non-leap)
reference/date_dayofyear.ail Reference Solution
module benchmarks/contract_guided/reference/date_dayofyear

export pure func isLeapYear(year: int) -> bool
  requires { year >= 1 }
{
  (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

export pure func daysInMonth(year: int, month: int) -> int
  requires { year >= 1, month >= 1, month <= 12 }
  ensures  { result >= 28, result <= 31 }
{
  match month {
    1  => 31,
    2  => if isLeapYear(year) then 29 else 28,
    3  => 31,  4  => 30,  5  => 31,  6  => 30,
    7  => 31,  8  => 31,  9  => 30,  10 => 31,
    11 => 30,  _  => 31
  }
}

pure func daysBeforeMonth(year: int, month: int) -> int
  requires { year >= 1, month >= 1, month <= 12 }
{
  if month == 1 then 0
  else daysBeforeMonth(year, month - 1) + daysInMonth(year, month - 1)
}

export pure func dayOfYear(year: int, month: int, day: int) -> int
  requires { year >= 1, month >= 1, month <= 12, day >= 1, day <= 31 }
  ensures  { result >= 1, result <= 366 }
{
  daysBeforeMonth(year, month) + day
}

export pure func lastDayConsistent(year: int) -> bool
  requires { year >= 1 }
  ensures  { result == true }
{
  dayOfYear(year, 12, 31) == daysInYear(year)
}
Payroll Chain — 8-Function Dependency Cascade
A complete payroll calculation in cents: gross pay → taxable income → federal tax (3-bracket) → state tax → FICA → total deductions → net pay → consistency proof. Getting any single function wrong cascades through every downstream result.

The Traps

Cascading errors: 8 functions in a dependency chain. If federalTax computes brackets wrong, totalDeductions is wrong, netPay is wrong, and deductionsValid fails.

Progressive brackets: Each tax bracket applies only to income in that bracket, not total income. Naive taxable * rate / 100 overtaxes.

FICA wage base cap: Forgetting min(gross, 16020000) makes FICA grow without bound, breaking netPay ensures result >= 0.

Basis points: FICA is 765/10000 (7.65%), not 765/1000. Easy integer math error.

Key Contracts (chain invariants)

export pure func taxableIncome(gross: int) -> int
  ensures { result >= 0, result <= gross }

export pure func federalTax(taxable: int) -> int
  ensures { result >= 0, result <= taxable }

export pure func netPay(hourlyRate: int, hours: int) -> int
  ensures { result >= 0 }

export pure func deductionsValid(gross: int) -> bool
  ensures { result == true }
-- proves totalDeductions <= gross for ALL valid inputs
reference/payroll_chain.ail Reference Solution
module benchmarks/contract_guided/reference/payroll_chain

export pure func grossPay(hourlyRate: int, hours: int) -> int
  requires { hourlyRate >= 0, hours >= 0 }
  ensures  { result >= 0 }
{ hourlyRate * hours }

export pure func standardDeduction() -> int
  ensures { result == 1250000 }
{ 1250000 }

export pure func taxableIncome(gross: int) -> int
  requires { gross >= 0 }
  ensures  { result >= 0, result <= gross }
{
  let ti = gross - standardDeduction() in
  if ti < 0 then 0 else ti
}

export pure func federalTax(taxable: int) -> int
  requires { taxable >= 0 }
  ensures  { result >= 0, result <= taxable }
{
  let b1 = if taxable > 3000000 then 3000000 * 10 / 100
           else taxable * 10 / 100 in
  let over3m = if taxable > 3000000 then taxable - 3000000 else 0 in
  let b2 = if over3m > 4000000 then 4000000 * 22 / 100
           else over3m * 22 / 100 in
  let over7m = if taxable > 7000000 then taxable - 7000000 else 0 in
  let b3 = over7m * 32 / 100 in
  b1 + b2 + b3
}

export pure func ficaTax(gross: int) -> int
  requires { gross >= 0 }
  ensures  { result >= 0 }
{
  let base = if gross > 16020000 then 16020000 else gross in
  base * 765 / 10000
}

export pure func netPay(hourlyRate: int, hours: int) -> int
  requires { hourlyRate >= 0, hours >= 0 }
  ensures  { result >= 0 }
{
  let gross = grossPay(hourlyRate, hours) in
  gross - totalDeductions(gross)
}

export pure func deductionsValid(gross: int) -> bool
  requires { gross >= 0 }
  ensures  { result == true }
{ totalDeductions(gross) <= gross }

Why Contracts Help AI Coding

Implicit → Explicit

Writing ensures { result >= 0 } forces the agent to think about edge cases it would otherwise ignore. The contract is a thinking tool, not just a verification tool.

Counterexample Debugging

When ailang ai-check returns a Z3 counterexample like a=2, b=100, the agent gets a concrete failing case to fix — much more useful than "wrong output".

Cascading Confidence

In payroll_chain, verifying federalTax individually before composing it into totalDeductions prevents cascading errors. Each function is proven correct in isolation.

Try It Yourself

AILANG's contract verification is available now. Write ensures/requires contracts on your pure functions, then run ailang ai-check to verify them with Z3.