Benchmarks

Eval scores — the structural targets our AI optimizes against during development.

- Test Files: 69
- Composite Score: 93.9% (7 metrics, 17 features)
- Parsers Compared: 8
- Roadmap Gap (Honest): 6.1% (aspirational features that lower our score)
These are eval scores, not post-hoc benchmarks. AILANG Parse is built with an eval-first methodology: we define structural test files with hand-verified ground truth, then an AI agent writes AILANG parsing code specifically to pass them. High scores are a design outcome of this development loop, not cherry-picking. This inverts the traditional approach where parsers are hand-coded and then measured. The comparison tables below show the real capability gap — and every result is fully reproducible.

Head-to-Head: DOCX Parsing (29 Files)

The fairest comparison: 29 DOCX files that all 8 tools can parse. Same files, same format, same metrics. No format advantage — just parsing quality on the most important Office format.

Apples-to-apples. All scores below are computed on the same 29 DOCX files, including 15 challenge files with hand-verified ground truth testing track changes, comments, equations, bookmarks, merged cells, multi-level lists, field codes, footnotes, and formatting. Seven scoring dimensions drawn from ECMA-376 spec references. Run date: 2026-04-06.
| Tool | Version | Composite | Feat. Det. | Struct. Quality | Content Fidelity | Metadata |
|---|---|---|---|---|---|---|
| AILANG Parse | v0.3.0 (AILANG) | 90.9% | 89.7% | 94.2% | 78.7% | 99.5% |
| Raw OOXML | v1.0.0 (stdlib) | 82.4% | 78.1% | 75.5% | 66.0% | 100.0% |
| Kreuzberg | v4.7.2 (Rust) | 73.8% | 61.3% | 59.9% | 60.7% | 87.7% |
| Pandoc | v3.9.0 (Haskell) | 69.0% | 67.3% | 72.5% | 63.3% | 9.3% |
| Unstructured | v0.22.16 | 59.8% | 57.1% | 55.7% | 61.2% | 0.0% |
| Docling | v2.84.0 (IBM) | 59.7% | 52.8% | 62.3% | 58.6% | 0.0% |

29 DOCX files, 7 scoring dimensions. AILANG Parse leads by 8.5+ points even on pure DOCX. Scores include aspirational dimensions (bookmarks, footnote text, section breaks) that intentionally lower every tool's score — including ours. Parser versions as of 2026-04-06.

Why the gap? AILANG Parse parses Office XML directly — reading w:tbl, w:p, and w:r nodes from the ZIP archive and mapping them to typed blocks. Raw OOXML (stdlib Python) takes the same XML approach but loses quality on heading levels, author attribution, and list structure. Kreuzberg and Pandoc use intermediate Markdown, dropping track changes, comments, and text boxes entirely. The gap is architectural, not cosmetic.

Full Suite: OfficeDocBench (69 Files, 11 Formats, 7 Metrics)

The full suite spans 11 formats with 54 core files plus 15 challenge files. Scores use 7 weighted metrics including ECMA-376 spec-driven dimensions. Coverage-Adjusted = Composite × format coverage — penalizes tools that skip formats.
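The adjustment is a straight multiplication; a minimal sketch, using the Raw OOXML numbers from the leaderboard:

```python
def coverage_adjusted(composite: float, files_parsed: int, total_files: int) -> float:
    """Penalize tools that skip formats: Composite x format coverage."""
    return composite * (files_parsed / total_files)

# Raw OOXML: strong on what it parses, but only 43 of 69 files
raw_ooxml = coverage_adjusted(0.844, 43, 69)   # 0.526
```

Tools with 100% coverage keep their composite unchanged; a tool that parses only 62% of the files has its score scaled down by the same factor.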

Open source. The full OfficeDocBench suite — ground truth annotations, scoring scripts, adapter interface, and all parser results — is available on GitHub. Run uv run benchmarks/officedocbench/eval_officedocbench.py --all to reproduce every result on this page. Apache 2.0 licensed.
| Tool | Files | Coverage | Composite | Adjusted | Feat. Det. | Struct. Quality | Content Fidelity | Metadata |
|---|---|---|---|---|---|---|---|---|
| AILANG Parse | 69/69 | 100% | 93.9% | 93.9% | 91.9% | 95.3% | 81.4% | 99.6% |
| Kreuzberg | 66/69 | 96% | 71.3% | 68.2% | 66.8% | 63.9% | 61.0% | 86.2% |
| LlamaParse | 69/69 | 100% | 54.4% | 54.4% | 61.0% | 60.6% | 45.0% | 17.4% |
| Raw OOXML | 43/69 | 62% | 84.4% | 52.6% | 82.0% | 80.3% | 68.1% | 100.0% |
| MarkItDown | 52/69 | 75% | 67.9% | 51.2% | 72.0% | 73.1% | 64.6% | 15.4% |
| Pandoc | 45/69 | 65% | 74.6% | 48.6% | 75.7% | 81.2% | 66.1% | 24.4% |
| Docling | 42/69 | 61% | 64.0% | 38.9% | 61.3% | 72.2% | 61.4% | 2.4% |
| Unstructured | 43/69 | 62% | 62.1% | 38.7% | 62.4% | 62.4% | 61.8% | 2.3% |

Coverage-Adjusted = Composite × (files parsed / total files). Raw OOXML scores 84.4% on the 43 files it can parse, but covers only 62% of the benchmark. AILANG Parse and LlamaParse are the only tools with 100% format coverage; AILANG Parse leads on every dimension except Metadata, where Raw OOXML scores a perfect 100.0% on its 43-file subset. Run date: 2026-04-08.

Per-Format Breakdown (AILANG Parse)

| Format | Files | Score |
|---|---|---|
| DOCX | 29 files | 90.9% |
| PPTX | 8 files | 94.4% |
| XLSX | 6 files | 96.3% |
| ODT | 6 files | 92.6% |
| ODP | 2 files | 97.7% |
| ODS | 4 files | 100% |
| EPUB | 3 files | 98.5% |
| HTML | 5 files | 94.2% |
| CSV | 2 files | 100% |
| MD | 3 files | 97.7% |

The DOCX score drops to 90.9% due to aspirational metrics (bookmarks, footnotes, section breaks, hyperlinks). Scores below 100% represent real roadmap targets. Several of these formats are not supported by most competitors.


Exclusive Features: AILANG Parse Only

These structural features are extracted by AILANG Parse but missed by most or all other parsers tested. They represent real capabilities that matter for document understanding — and they're why the composite gap exists.

| Feature | AILANG Parse | Raw OOXML | Pandoc | Kreuzberg | Others | Why It Matters |
|---|---|---|---|---|---|---|
| Track Changes | 3/3 | 2/3 | 3/3 | 0/3 | 0/3 | Legal review, contract redlining, audit trails |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | 0/2 | Review feedback, annotation extraction |
| Headers/Footers | 3/3 | 2/2 | 0/3 | 2/3 | 0–1 | Document metadata, page numbering, letterheads |
| Text Boxes | 2/2 | 1/2 | 0/2 | 0/2 | 0/2 | Callouts, sidebars, floating content |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | OMML math content extraction |
| Field Codes (§17.16) | 1/1 | 1/1 | 0/1 | 1/1 | 0/1 | Dates, page counts, cross-references |
| Merged Cells | Yes | Yes | No | No | No | Complex tables, financial reports, schedules |

Raw OOXML and Pandoc detect some features through XML/AST access, but miss others. No tool detects footnotes, bookmarks, section breaks, hyperlinks, or styles yet — these are aspirational targets for all.


Feature Detection Heatmap (17 Features)

Feature-by-feature detection across all 69 test files. Fractions show files where the feature was correctly detected out of files where it should be present. Features marked ASPIRATIONAL are ECMA-376 spec targets that no tool handles yet.

| Feature | AILANG Parse | Raw OOXML | Pandoc | Kreuzberg | MarkItDown | Unstructured | Docling |
|---|---|---|---|---|---|---|---|
| Headings | 30/30 | 21/21 | 22/23 | 24/29 | 9/9 | 20/21 | 20/21 |
| Tables | 36/36 | 16/16 | 18/18 | 29/34 | 17/17 | 9/10 | 13/15 |
| Track Changes | 3/3 | 2/3 | 3/3 | 0/3 | 0/3 | 0/3 | |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | 0/2 | | |
| Headers/Footers | 3/3 | 2/2 | 0/3 | 2/3 | 1/2 | 0/2 | |
| Text Boxes | 2/2 | 1/2 | 0/2 | 0/2 | 0/2 | | |
| Images | 7/8 | 3/4 | 5/5 | 3/7 | 2/3 | 0/4 | 2/4 |
| Lists | 15/15 | 1/4 | 11/12 | 8/13 | 4/4 | 4/4 | 1/4 |
| Sheet Names | 4/5 | 1/1 | 4/4 | 0/1 | 0/1 | | |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
| Field Codes (§17.16) | 1/1 | 1/1 | 0/1 | 1/1 | 0/1 | 0/1 | |
| Footnotes ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
| Hyperlinks ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
| Styles ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
| Bookmarks (§17.13.6) ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
| Section Breaks (§17.6) ASPIRATIONAL | 0/8 | 0/8 | 0/8 | 0/8 | 0/8 | 0/8 | |

Aspirational features (marked ASPIRATIONAL) are ECMA-376 spec-referenced targets that intentionally lower every tool's score. They represent the next frontier — features that should be extracted but currently aren't by any parser. AILANG Parse uniquely detects equations and leads on track changes, comments, headers/footers, text boxes, and lists.


What We Measure: 7 Scoring Dimensions

OfficeDocBench uses 7 weighted scoring dimensions, many driven by ECMA-376 spec references. Each dimension scores [0, 1] and the composite is a weighted average. Aspirational sub-dimensions intentionally score 0 when ground truth data exists but no parser handles the feature yet.

| Dimension | Weight | What It Tests | Spec References |
|---|---|---|---|
| Feature Detection | 15% | Binary: does the parser detect each feature present in the document? 17 features tested, including aspirational targets. | §17.13.6, §22.1, §17.16, §17.6 |
| Structural Recall | 20% | Completeness: correct counts of tables, track changes, comments, headings, headers/footers, images. Type matching for track-change operations. | |
| Structural Quality | 15% | Heading level distribution, track-change author attribution, comment text matching, list numbering accuracy, table merge span accuracy, heading text match, section break detection, comment range accuracy. | §17.9, §18.3.1.55, §17.6, §17.13.1 |
| Content Fidelity | 15% | Key phrase recall, paragraph count, element ordering (LCS), hyperlink extraction, style preservation, equation text, field display text, footnote text, bookmark detection. | §22.1, §17.16, §17.13.6 |
| Text Jaccard | 10% | Word-level Jaccard similarity between ground truth and parser output; measures raw text extraction quality. | |
| Element Count | 15% | Per-type count precision across 9 element types: headings, tables, track changes, comments, images, lists, text boxes, footnotes, speaker notes. | |
| Metadata | 10% | Exact match on title, author, created, and modified timestamps; sheet name accuracy with partial credit for partial matches. | |
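The Text Jaccard dimension above is simple to sketch: a word-level set Jaccard between ground truth and parser output. The real scorer may tokenize or normalize differently; this is the basic shape.

```python
def text_jaccard(ground_truth: str, parsed: str) -> float:
    """Word-level Jaccard similarity: |A intersect B| / |A union B|."""
    a = set(ground_truth.lower().split())
    b = set(parsed.lower().split())
    if not a and not b:
        return 1.0  # two empty documents match perfectly
    return len(a & b) / len(a | b)
```

Because it is set-based, word order and duplication are ignored, which is why the benchmark pairs it with ordering-sensitive dimensions like element ordering (LCS) and element counts.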

Weights emphasize structural extraction (Structural Recall 20% + Element Count 15% = 35%) since that is the benchmark's primary purpose. Aspirational sub-dimensions within Structural Quality and Content Fidelity intentionally lower scores to create roadmap targets.


Why AILANG?

AILANG Parse is written in AILANG, a language designed for building AI-native applications. AILANG isn't incidental to the results — it's why AILANG Parse achieves them.

Deterministic by Default

AILANG's effect system separates pure parsing logic from IO. The pure func annotation guarantees the XML→Block pipeline produces identical output for identical input — no hidden state, no ambient randomness. That's why structural recall scores 98.8% across all 69 test files.

Algebraic Data Types

The Block ADT (9 variants: Text, Heading, Table, Image, Audio, Video, List, Section, Change) is exhaustively matched by the compiler. Adding a new block type is a compile error until every parser and formatter handles it. Other parsers typically use untyped dictionaries.

Zero Runtime Dependencies

Every parser is AILANG + stdlib. No Python, no Java, no system libraries. ZIP extraction, XML parsing, and format routing are all AILANG code with inline tests and Z3-verified contracts. The entire parser fits in a single binary.

Contracts + Inline Tests

50+ contracts verified by Z3 at build time — format detection returns valid categories, mappers preserve element counts, filters respect bounds. Inline tests [...] on every function catch regressions before CI.

AI as an Effect

PDF and image parsing use AILANG's AI effect — the model is pluggable (--ai gemini-2.5-flash, --ai claude-haiku, --ai granite-docling). Office parsing doesn't use AI at all: it's deterministic XML extraction. The benchmark scores reflect the code, not the model.

See the Code

Here's how AILANG Parse extracts track changes — a feature unique to direct XML parsing:

docparse/types/document.ail
-- A content block extracted from a document
export type Block = TextBlock({text: string, style: string, level: int})
                  | TableBlock({rows: [[TableCell]], headers: [TableCell]})
                  | ImageBlock({data: string, description: string, mime: string})
                  | AudioBlock({data: string, transcription: string, mime: string})
                  | VideoBlock({data: string, description: string, mime: string})
                  | ListBlock({items: [string], ordered: bool})
                  | HeadingBlock({text: string, level: int})
                  | SectionBlock({kind: string, blocks: [Block]})
                  | ChangeBlock({changeType: string, author: string, date: string, text: string})
docparse/services/docx_parser.ail
-- Extract track change annotations from a paragraph
-- Finds w:ins, w:del, w:moveTo, w:moveFrom and creates ChangeBlocks
-- with author, date, change type, and affected text.
pure func extractParagraphChanges(p: XmlNode) -> [Block] {
  let children = getChildren(p);
  flatMap(extractChangeFromNode, children)
}

-- Create a ChangeBlock from a track change XML node
pure func makeChangeBlock(changeType: string, node: XmlNode) -> Block {
  let author = getOrElse(getAttr(node, "w:author"), "Unknown");
  let date = getOrElse(getAttr(node, "w:date"), "");
  let text = extractChangeText(node);
  ChangeBlock({changeType: changeType, author: author, date: date, text: text})
}
docparse/services/format_router.ail
-- Z3-verified format detection with inline tests
export pure func detectFormat(ext: string) -> string
  ensures {
    result == "zip-office" || result == "pdf" || result == "image" ||
    result == "audio" || result == "video" || result == "csv" ||
    result == "markdown" || result == "html" || result == "epub" ||
    result == "zip-odf" || result == "text" || result == "unknown"
  }
  tests [
    ("docx", "zip-office"),
    ("pptx", "zip-office"),
    ("xlsx", "zip-office"),
    ("odt", "zip-odf"),
    ("pdf", "pdf"),
    ("png", "image"),
    ("csv", "csv"),
    ("md", "markdown")
  ]

Full source: 31 AILANG modules, 50+ contracts, 76+ inline tests.

Learn more about AILANG — effect system, algebraic types, Z3 contracts, pluggable AI, and compiles to Go or WASM: ailang.sunholo.com

Eval-Driven Development

AILANG Parse is not hand-coded and then benchmarked. The benchmarks come first, and AI writes the parsing code against them. This is why the scores are high — and why we want to find documents that lower them.

01

Define the eval

We write a test file with known structure — merged cells, track changes, speaker notes, threaded comments — and hand-verify the expected output as ground truth.

02

AI writes the parser

An AI agent reads the relevant spec (ECMA-376 for Office, RFC 5322 for email, and more as we expand) and writes AILANG extraction code, targeting the eval. Z3 contracts verify correctness.

03

Score improves

The eval score goes up because the code was written specifically to pass it. This is the point — and it's why finding new test cases that lower the score is so valuable.

Traditional parsers are hand-coded and then measured. AILANG Parse inverts this. And because we test for structural features that other parsers skip entirely — track changes, merged cells, comments, threaded annotations — the comparison tables above reveal real capability gaps, not just scoring differences.

Help Us Find Our Gaps

We score 93.9% today — intentionally including aspirational metrics that lower our own score. ECMA-376 alone is over 5,000 pages, and we're expanding into email, calendar, and more formats. There are documents in the wild that will break our parser. We want to find them.

If you have a document that AILANG Parse doesn't handle well, send it to us. If it reveals a gap that gets added to our eval corpus, you get 1 month of Business tier free (worth €99).

1. Submit
Email your document (anonymize if needed)
2. We parse
We run it through the pipeline
3. New eval
If it reveals a gap, it enters the corpus
4. AI fixes
AI writes new code to pass the eval
5. You win
1 month Business tier free
Submit a document: docparse@sunholo.com

Your document is used only for eval purposes. We can anonymize content before adding it to the benchmark corpus. Documents that don't reveal new gaps are deleted. Any format welcome — Office, email, calendar, or anything else you throw at us.


PDF Benchmark (OmniDocBench)

For PDFs, AILANG Parse delegates to whatever AI model you plug in via AILANG's AI effect. The benchmark measures the model's accuracy, not our parsing code — and we're transparent about that.

| Model | Text ED ↓ | Table TEDS ↑ | Reading Order ED ↓ |
|---|---|---|---|
| Gemini 2.5 Flash | 0.183 | 0.871 | 0.141 |
| Gemini 2.0 Flash | 0.210 | 0.842 | 0.168 |
| Ollama (granite-docling) | 0.890 | 0.120 | 0.920 |
Recommended: gemini-2.5-flash — best balance of accuracy and speed. Ollama models score low due to structured JSON output limitations.
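Text ED is a normalized edit distance (lower is better; 0 means identical text). A minimal character-level version is sketched below; the actual OmniDocBench scorer may tokenize and normalize differently.

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer string's length."""
    if not a and not b:
        return 0.0
    # Classic dynamic-programming Levenshtein, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))
```

Table TEDS (tree edit distance similarity over table structure) follows the same edit-distance idea but operates on the table's cell tree rather than flat text, which is why it is reported as a similarity (higher is better).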

Run the Benchmarks

Every number on this page comes from benchmarks/officedocbench/results/summary.json, regenerated whenever the eval runs. Three ways to reproduce, depending on what you have installed:

# 1. AILANG Parse only — instant, no extra dependencies
uv run benchmarks/officedocbench/eval_officedocbench.py

# 2. Single competitor — requires that adapter installed
uv run benchmarks/officedocbench/eval_officedocbench.py --adapter kreuzberg

# 3. Full leaderboard — all 8 adapters
uv pip install -e '.[competitors]'
uv run benchmarks/officedocbench/eval_officedocbench.py --all

Useful flags: --live re-parses files instead of using cached golden outputs, --format docx filters to one format, --json / --latex change report output.

The --all run writes summary.json (canonical) plus a mirror at docs/data/officedocbench-summary.json. The website reads the mirror at page load and rewrites every score in place — no rebuild step. Adding your own parser? Implement OfficeDocBenchAdapter in adapters/ and register it in eval_officedocbench.py:load_adapter().
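A new adapter is a small class. The sketch below is hypothetical — the method names and the return shape are illustrative assumptions, so check the OfficeDocBenchAdapter interface in adapters/ for the real signatures before implementing.

```python
class MyParserAdapter:
    """Hypothetical OfficeDocBench adapter wrapping an external parser.

    The interface shown here (name, supports, parse) is an illustrative
    guess at the contract, not the actual adapter API.
    """
    name = "myparser"
    supported_formats = {"docx", "xlsx"}

    def supports(self, ext: str) -> bool:
        """Declare which file extensions this parser can handle."""
        return ext in self.supported_formats

    def parse(self, path: str) -> dict:
        """Run the parser and return the normalized structure the
        scorer compares against ground truth."""
        return {
            "features": {"tables": True, "track_changes": False},
            "elements": {"headings": 0, "tables": 0},
            "text": "",
            "metadata": {"title": None, "author": None},
        }
```

Whatever the exact contract, the key design point stands: each tool is scored through the same normalized output shape, so the comparison measures parsing quality rather than output-format differences.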

Frequently Asked Questions

How does AILANG Parse compare to other document parsers on benchmarks?

On OfficeDocBench (69 files, 11 formats, 7 scoring dimensions, 17 features), AILANG Parse scores 93.9% composite with 100% format coverage. The benchmark tests table merge spans, track changes, comments, heading hierarchy, equation text, field codes, bookmarks, section breaks, and more across DOCX, PPTX, XLSX, and 8 additional formats. Coverage-adjusted, the nearest competitor (Kreuzberg) scores 68.2%.

Is the OfficeDocBench benchmark open source?

Yes. The benchmark suite, golden output files, and scoring scripts are all open source. You can run the full comparison on your own machine. The methodology uses structural diff scoring, not text similarity, so it measures actual structural fidelity.

Why do Docling and Unstructured score nearly the same on OfficeDocBench?

Both Docling and Unstructured use the same fundamental approach for Office formats: convert to PDF first, then apply ML-based layout detection. Since the same structural information is lost during PDF conversion — merge attributes, track changes, comments — both tools hit the same ceiling. AILANG Parse bypasses this ceiling entirely by reading Office XML directly.

Why does AILANG Parse score so much higher than competitors?

AILANG Parse is built with an eval-first methodology: we define structural benchmark tests with hand-verified ground truth, then an AI agent writes AILANG parsing code specifically to pass them. We intentionally include aspirational metrics (bookmarks, footnotes, section breaks) that lower our own score to 93.9% — creating transparent roadmap targets. The gap reflects fundamentally different development approaches, not cherry-picked tests. Every result is fully reproducible.

How can I submit a document that doesn't parse correctly?

Email it to docparse@sunholo.com. If your document reveals a parsing gap that gets added to our eval corpus, you receive 1 month of Business tier free (worth €99). You can anonymize content before sending. We accept any format — Office, email, calendar, or anything else. See the Document Submission Program for details.