Experiment Design
26 contract-guided benchmarks ranging from easy to very hard. Each benchmark has a task description, expected output, and a reference solution with formal contracts. An AI agent (Claude Haiku 4.5) solves each benchmark under two conditions — same task, same model, different prompt.
Baseline Control
Standard prompt: task description + expected output. No mention of contracts or verification tools.
- Write code to match expected output
- Run and compare output
- Fix and retry if wrong
Tool-Aware Treatment
Same task + guidance on writing contracts and using ailang ai-check as a debugging tool. No contract spec given.
- Write code with ensures/requires contracts
- Run ailang ai-check to find bugs via Z3
- Fix counterexamples, then run for output
- Contracts are self-authored, not given
Results by Difficulty
| Difficulty | Count | Baseline | Tool-Aware | Delta | Z3 Used (Tool-Aware) | Contracts Written |
|---|---|---|---|---|---|---|
| Easy | 6 | 100% (6/6) | 100% (6/6) | 0 | 5/6 | 6/6 |
| Medium | 9 | 89% (8/9) | 89% (8/9) | 0 | 6/9 | 8/9 |
| Hard | 8 | 75% (6/8) | 88% (7/8) | +13% | 4/8 | 7/8 |
| Very Hard | 3 | 33% (1/3) | 100% (3/3) | +67% | 2/3 | 2/3 |
| Total | 26 | 80.8% (21/26) | 88.5% (23/26) | +7.7% | 17/26 (65%) | 23/26 (88%) |
Used Z3 Verification
65% of tool-aware runs actually invoked ailang ai-check to verify contracts,
compared to just 1/26 in baseline (where the agent had no knowledge of the tool).
Wrote Contracts
88% of tool-aware solutions included ensures/requires contracts,
even though they weren't given a spec. Only 46% of baseline solutions wrote any contracts.
Very Hard Solved
All three very hard benchmarks passed with contracts. These involve recursive functions, 8-deep dependency chains, and subtle leap year traps that baseline consistently missed.
Benchmark Deep Dives
The Trap
Naive a - b goes negative when b > a.
Z3 counterexample: a=2, b=100 → result=-98 violates ensures { result >= 0 }.
Contract Spec
export pure func safeSub(a: int, b: int) -> int requires { a >= 0, b >= 0 } ensures { result >= 0 }
module benchmarks/contract_guided/reference/safe_subtract
export pure func safeSub(a: int, b: int) -> int
requires { a >= 0, b >= 0 }
ensures { result >= 0 }
{
if a >= b then a - b
else 0
}
export func main() -> () ! {IO} {
println(show(safeSub(10, 3)));
println(show(safeSub(5, 5)));
println(show(safeSub(2, 100)))
}The Trap
The naive formula roleWeight + actionBonus - actionThreshold can exceed
roleWeight when the bonus exceeds the threshold. Z3 counterexample:
EDITOR(60) + READ bonus(15) - READ threshold(10) = 65, which exceeds
roleWeight(EDITOR) = 60. Must cap the result.
Key Contract
-- The granted level must NEVER exceed the role's base weight. export pure func grantedLevel(role: Role, action: Action) -> int ensures { result <= roleWeight(role) }
module benchmarks/contract_guided/reference/escalation_guard
export type Role = ADMIN | EDITOR | VIEWER | GUEST
export type Action = READ | WRITE | DELETE
export pure func roleWeight(role: Role) -> int
ensures { result >= 0, result <= 100 }
{
match role {
ADMIN => 100,
EDITOR => 60,
VIEWER => 30,
GUEST => 10
}
}
export pure func actionThreshold(action: Action) -> int
ensures { result >= 10, result <= 80 }
{
match action {
READ => 10,
WRITE => 50,
DELETE => 80
}
}
export pure func grantedLevel(role: Role, action: Action) -> int
ensures { result <= roleWeight(role) }
{
let base = roleWeight(role);
let bonus = match action {
READ => 15,
WRITE => 5,
DELETE => 0
};
let raw = base + bonus - actionThreshold(action);
-- Guard: cap at roleWeight to prevent escalation
if raw > base then base
else if raw < 0 then 0
else raw
}daysBeforeMonth, 12-way month match,
leap year with the 100/400 rule, and 3 consistency proofs. The baseline agent got 1900 wrong
(treated it as a leap year). The tool-aware agent caught this via ai-check.
The Traps
Leap year 100/400 rule: 1900 is NOT a leap year (divisible by 100 but not 400).
2000 IS a leap year (divisible by 400). LLMs frequently get this wrong or reversed.
Recursive accumulation: daysBeforeMonth must correctly sum
all previous months. An off-by-one in the recursion base case shifts all dates.
Consistency proofs: lastDayConsistent proves
dayOfYear(y, 12, 31) == daysInYear(y) — if any month's day count is wrong,
this fails immediately.
Key Contracts (3 consistency proofs)
export pure func lastDayConsistent(year: int) -> bool requires { year >= 1 } ensures { result == true } -- proves dayOfYear(y, 12, 31) == daysInYear(y) export pure func leapFebConsistent(year: int) -> bool requires { year >= 1 } ensures { result == true } -- proves Feb days match leap year status export pure func marchFirstConsistent(year: int) -> bool requires { year >= 1 } ensures { result == true } -- proves March 1 = day 61 (leap) or 60 (non-leap)
module benchmarks/contract_guided/reference/date_dayofyear
export pure func isLeapYear(year: int) -> bool
requires { year >= 1 }
{
(year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}
export pure func daysInMonth(year: int, month: int) -> int
requires { year >= 1, month >= 1, month <= 12 }
ensures { result >= 28, result <= 31 }
{
match month {
1 => 31,
2 => if isLeapYear(year) then 29 else 28,
3 => 31, 4 => 30, 5 => 31, 6 => 30,
7 => 31, 8 => 31, 9 => 30, 10 => 31,
11 => 30, _ => 31
}
}
pure func daysBeforeMonth(year: int, month: int) -> int
requires { year >= 1, month >= 1, month <= 12 }
{
if month == 1 then 0
else daysBeforeMonth(year, month - 1) + daysInMonth(year, month - 1)
}
export pure func dayOfYear(year: int, month: int, day: int) -> int
requires { year >= 1, month >= 1, month <= 12, day >= 1, day <= 31 }
ensures { result >= 1, result <= 366 }
{
daysBeforeMonth(year, month) + day
}
export pure func lastDayConsistent(year: int) -> bool
requires { year >= 1 }
ensures { result == true }
{
dayOfYear(year, 12, 31) == daysInYear(year)
}The Traps
Cascading errors: 8 functions in a dependency chain. If federalTax
computes brackets wrong, totalDeductions is wrong, netPay is wrong, and
deductionsValid fails.
Progressive brackets: Each tax bracket applies only to income in that bracket,
not total income. Naive taxable * rate / 100 overtaxes.
FICA wage base cap: Forgetting min(gross, 16020000) makes FICA grow
without bound, breaking netPay ensures result >= 0.
Basis points: FICA is 765/10000 (7.65%), not 765/1000. Easy integer math error.
Key Contracts (chain invariants)
export pure func taxableIncome(gross: int) -> int ensures { result >= 0, result <= gross } export pure func federalTax(taxable: int) -> int ensures { result >= 0, result <= taxable } export pure func netPay(hourlyRate: int, hours: int) -> int ensures { result >= 0 } export pure func deductionsValid(gross: int) -> bool ensures { result == true } -- proves totalDeductions <= gross for ALL valid inputs
module benchmarks/contract_guided/reference/payroll_chain
export pure func grossPay(hourlyRate: int, hours: int) -> int
requires { hourlyRate >= 0, hours >= 0 }
ensures { result >= 0 }
{ hourlyRate * hours }
export pure func standardDeduction() -> int
ensures { result == 1250000 }
{ 1250000 }
export pure func taxableIncome(gross: int) -> int
requires { gross >= 0 }
ensures { result >= 0, result <= gross }
{
let ti = gross - standardDeduction() in
if ti < 0 then 0 else ti
}
export pure func federalTax(taxable: int) -> int
requires { taxable >= 0 }
ensures { result >= 0, result <= taxable }
{
let b1 = if taxable > 3000000 then 3000000 * 10 / 100
else taxable * 10 / 100 in
let over3m = if taxable > 3000000 then taxable - 3000000 else 0 in
let b2 = if over3m > 4000000 then 4000000 * 22 / 100
else over3m * 22 / 100 in
let over7m = if taxable > 7000000 then taxable - 7000000 else 0 in
let b3 = over7m * 32 / 100 in
b1 + b2 + b3
}
export pure func ficaTax(gross: int) -> int
requires { gross >= 0 }
ensures { result >= 0 }
{
let base = if gross > 16020000 then 16020000 else gross in
base * 765 / 10000
}
export pure func netPay(hourlyRate: int, hours: int) -> int
requires { hourlyRate >= 0, hours >= 0 }
ensures { result >= 0 }
{
let gross = grossPay(hourlyRate, hours) in
gross - totalDeductions(gross)
}
export pure func deductionsValid(gross: int) -> bool
requires { gross >= 0 }
ensures { result == true }
{ totalDeductions(gross) <= gross }Why Contracts Help AI Coding
Implicit → Explicit
Writing ensures { result >= 0 }
forces the agent to think about edge cases it would otherwise ignore.
The contract is a thinking tool, not just a verification tool.
Counterexample Debugging
When ailang ai-check
returns a Z3 counterexample like a=2, b=100,
the agent gets a concrete failing case to fix — much more useful than "wrong output".
Cascading Confidence
In payroll_chain, verifying federalTax
individually before composing it into totalDeductions
prevents cascading errors. Each function is proven correct in isolation.
Try It Yourself
AILANG's contract verification is available now. Write ensures/requires
contracts on your pure functions, then run ailang ai-check
to verify them with Z3.