Head-to-Head: DOCX Parsing (29 Files)
The fairest comparison: 29 DOCX files that all 8 tools can parse. Same files, same format, same metrics. No format advantage — just parsing quality on the most important Office format.
| Tool | Version | Composite | Feat. Det. | Struct. Quality | Content Fidelity | Metadata |
|---|---|---|---|---|---|---|
| AILANG Parse | v0.3.0 (AILANG) | 90.9% | 89.7% | 94.2% | 78.7% | 99.5% |
| Raw OOXML | v1.0.0 (stdlib) | 82.4% | 78.1% | 75.5% | 66.0% | 100.0% |
| Kreuzberg | v4.7.2 (Rust) | 73.8% | 61.3% | 59.9% | 60.7% | 87.7% |
| Pandoc | v3.9.0 (Haskell) | 69.0% | 67.3% | 72.5% | 63.3% | 9.3% |
| Unstructured | v0.22.16 | 59.8% | 57.1% | 55.7% | 61.2% | 0.0% |
| Docling | v2.84.0 (IBM) | 59.7% | 52.8% | 62.3% | 58.6% | 0.0% |
29 DOCX files, 7 scoring dimensions. AILANG Parse leads by 8.5+ points even on pure DOCX. Scores include aspirational dimensions (bookmarks, footnote text, section breaks) that intentionally lower every tool's score — including ours. Parser versions as of 2026-04-06.
AILANG Parse reads the document XML directly, extracting w:tbl, w:p, and w:r nodes from the ZIP archive and mapping them to typed blocks. Raw OOXML (stdlib Python) takes the same XML approach but loses quality on heading levels, author attribution, and list structure. Kreuzberg and Pandoc use intermediate Markdown, dropping track changes, comments, and text boxes entirely. The gap is architectural, not cosmetic.

Full Suite: OfficeDocBench (69 Files, 11 Formats, 7 Metrics)
The full suite spans 11 formats with 54 core files plus 15 challenge files. Scores use 7 weighted metrics including ECMA-376 spec-driven dimensions. Coverage-Adjusted = Composite × format coverage — penalizes tools that skip formats.
Run uv run benchmarks/officedocbench/eval_officedocbench.py --all to reproduce every result on this page. Apache 2.0 licensed.

| Tool | Files | Coverage | Composite | Adjusted | Feat. Det. | Struct. Quality | Content Fidelity | Metadata |
|---|---|---|---|---|---|---|---|---|
| AILANG Parse | 69/69 | 100% | 93.9% | 93.9% | 91.9% | 95.3% | 81.4% | 99.6% |
| Kreuzberg | 66/69 | 96% | 71.3% | 68.2% | 66.8% | 63.9% | 61.0% | 86.2% |
| LlamaParse | 69/69 | 100% | 54.4% | 54.4% | 61.0% | 60.6% | 45.0% | 17.4% |
| Raw OOXML | 43/69 | 62% | 84.4% | 52.6% | 82.0% | 80.3% | 68.1% | 100.0% |
| MarkItDown | 52/69 | 75% | 67.9% | 51.2% | 72.0% | 73.1% | 64.6% | 15.4% |
| Pandoc | 45/69 | 65% | 74.6% | 48.6% | 75.7% | 81.2% | 66.1% | 24.4% |
| Docling | 42/69 | 61% | 64.0% | 38.9% | 61.3% | 72.2% | 61.4% | 2.4% |
| Unstructured | 43/69 | 62% | 62.1% | 38.7% | 62.4% | 62.4% | 61.8% | 2.3% |
Coverage-Adjusted = Composite × (files parsed / total files). Raw OOXML scores 84.4% on the 43 files it can parse, but only covers 62% of the benchmark. AILANG Parse and LlamaParse are the only tools with 100% format coverage; AILANG Parse leads on every dimension. Run date: 2026-04-08.
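The adjustment described above can be sketched in a few lines of Python. This mirrors the stated formula using numbers from the table; it is an illustration, not the benchmark's internal code:

```python
def coverage_adjusted(composite: float, files_parsed: int, total_files: int) -> float:
    """Composite score scaled by the fraction of benchmark files parsed."""
    return composite * (files_parsed / total_files)

# Raw OOXML from the table: high composite, low coverage
print(round(coverage_adjusted(84.4, 43, 69), 1))  # 52.6
```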
Per-Format Breakdown (AILANG Parse)
DOCX score drops to 90.9% due to aspirational metrics (bookmarks, footnotes, section breaks, hyperlinks). Scores below 100% represent real roadmap targets. Formats with blue borders are not supported by most competitors.
Exclusive Features AILANG Parse Only
These structural features are extracted by AILANG Parse but missed by most or all other parsers tested. They represent real capabilities that matter for document understanding — and they're why the composite gap exists.
| Feature | AILANG Parse | Raw OOXML | Pandoc | Kreuzberg | Others | Why It Matters |
|---|---|---|---|---|---|---|
| Track Changes | 3/3 | 2/3 | 3/3 | 0/3 | 0/3 | Legal review, contract redlining, audit trails |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | 0/2 | Review feedback, annotation extraction |
| Headers/Footers | 3/3 | 2/2 | 0/3 | 2/3 | 0–1 | Document metadata, page numbering, letterheads |
| Text Boxes | 2/2 | 1/2 | 0/2 | 0/2 | 0/2 | Callouts, sidebars, floating content |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | OMML math content extraction |
| Field Codes (§17.16) | 1/1 | 1/1 | 0/1 | 1/1 | 0/1 | Dates, page counts, cross-references |
| Merged Cells | Yes | Yes | No | No | No | Complex tables, financial reports, schedules |
Raw OOXML and Pandoc detect some features through XML/AST access, but miss others. No tool detects footnotes, bookmarks, section breaks, hyperlinks, or styles yet — these are aspirational targets for all.
Feature Detection Heatmap (17 Features)
Feature-by-feature detection across all 69 test files. Fractions show files where the feature was correctly detected out of files where it should be present. Features marked ASPIRATIONAL are ECMA-376 spec targets that no tool handles yet.
| Feature | AILANG Parse | Raw OOXML | Pandoc | Kreuzberg | MarkItDown | Unstructured | Docling |
|---|---|---|---|---|---|---|---|
| Headings | 30/30 | 21/21 | 22/23 | 24/29 | 9/9 | 20/21 | 20/21 |
| Tables | 36/36 | 16/16 | 18/18 | 29/34 | 17/17 | 9/10 | 13/15 |
| Track Changes | 3/3 | 2/3 | 3/3 | 0/3 | — | 0/3 | 0/3 |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | — | 0/2 | 0/2 |
| Headers/Footers | 3/3 | 2/2 | 0/3 | 2/3 | — | 1/2 | 0/2 |
| Text Boxes | 2/2 | 1/2 | 0/2 | 0/2 | — | 0/2 | 0/2 |
| Images | 7/8 | 3/4 | 5/5 | 3/7 | 2/3 | 0/4 | 2/4 |
| Lists | 15/15 | 1/4 | 11/12 | 8/13 | 4/4 | 4/4 | 1/4 |
| Sheet Names | 4/5 | 1/1 | — | 4/4 | 0/1 | — | 0/1 |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Field Codes (§17.16) | 1/1 | 1/1 | 0/1 | 1/1 | — | 0/1 | 0/1 |
| Footnotes ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Hyperlinks ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Styles ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Bookmarks (§17.13.6) ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Section Breaks (§17.6) ASPIRATIONAL | 0/8 | 0/8 | 0/8 | 0/8 | — | 0/8 | 0/8 |
Aspirational features (purple rows) are ECMA-376 spec-referenced targets that intentionally lower every tool's score. They represent the next frontier — features that should be extracted but currently aren't by any parser. AILANG Parse uniquely detects equations and leads on track changes, comments, headers/footers, text boxes, and lists.
What We Measure: 7 Scoring Dimensions
OfficeDocBench uses 7 weighted scoring dimensions, many driven by ECMA-376 spec references. Each dimension scores [0, 1] and the composite is a weighted average. Aspirational sub-dimensions intentionally score 0 when ground truth data exists but no parser handles the feature yet.
| Dimension | Weight | What It Tests | Spec References |
|---|---|---|---|
| Feature Detection | 15% | Binary: does the parser detect each feature present in the document? 17 features tested including aspirational targets. | §17.13.6, §22.1, §17.16, §17.6 |
| Structural Recall | 20% | Completeness: correct count of tables, track changes, comments, headings, headers/footers, images. Type matching for track change operations. | — |
| Structural Quality | 15% | Heading level distribution, TC author attribution, comment text matching, list numbering accuracy, table merge span accuracy, heading text match, section break detection, comment range accuracy. | §17.9, §18.3.1.55, §17.6, §17.13.1 |
| Content Fidelity | 15% | Key phrase recall, paragraph count, element ordering (LCS), hyperlink extraction, style preservation, equation text, field display text, footnote text, bookmark detection. | §22.1, §17.16, §17.13.6 |
| Text Jaccard | 10% | Word-level Jaccard similarity between ground truth and parser output. Measures raw text extraction quality. | — |
| Element Count | 15% | Per-type count precision across 9 element types: headings, tables, track changes, comments, images, lists, text boxes, footnotes, speaker notes. | — |
| Metadata | 10% | Exact match on title, author, created, modified timestamps. Sheet name accuracy with partial credit for partial matches. | — |
Weights emphasize structural extraction (Structural Recall 20% + Element Count 15% = 35%) since that is the benchmark's primary purpose. Aspirational sub-dimensions within Structural Quality and Content Fidelity intentionally lower scores to create roadmap targets.
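Two of the dimensions above are simple enough to sketch directly: word-level Jaccard similarity and the weighted composite. The weights below mirror the table; the real scoring lives in eval_officedocbench.py, so treat this as an illustration only:

```python
def text_jaccard(ground_truth: str, parsed: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over word sets."""
    a, b = set(ground_truth.lower().split()), set(parsed.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Dimension weights from the table above (sum to 1.0)
WEIGHTS = {
    "feature_detection": 0.15,
    "structural_recall": 0.20,
    "structural_quality": 0.15,
    "content_fidelity": 0.15,
    "text_jaccard": 0.10,
    "element_count": 0.15,
    "metadata": 0.10,
}

def composite(scores: dict) -> float:
    """Weighted average of per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(text_jaccard("the quick brown fox", "the quick fox"))  # 0.75
```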
Why AILANG?
AILANG Parse is written in AILANG, a language designed for building AI-native applications. AILANG isn't incidental to the results — it's why AILANG Parse achieves them.
Deterministic by Default
AILANG's effect system separates pure parsing logic from IO. The pure func annotation guarantees the XML→Block pipeline produces identical output for identical input — no hidden state, no ambient randomness. That's why structural recall scores 98.8% across all 69 test files.
Algebraic Data Types
The Block ADT (9 variants: Text, Heading, Table, Image, Audio, Video, List, Section, Change) is exhaustively matched by the compiler. Adding a new block type is a compile error until every parser and formatter handles it. Other parsers typically use untyped dictionaries.
Zero Runtime Dependencies
Every parser is AILANG + stdlib. No Python, no Java, no system libraries. ZIP extraction, XML parsing, and format routing are all AILANG code with inline tests and Z3-verified contracts. The entire parser fits in a single binary.
Contracts + Inline Tests
50+ contracts verified by Z3 at build time — format detection returns valid categories, mappers preserve element counts, filters respect bounds. Inline tests [...] on every function catch regressions before CI.
AI as an Effect
PDF and image parsing use AILANG's AI effect — the model is pluggable (--ai gemini-2.5-flash, --ai claude-haiku, --ai granite-docling). Office parsing doesn't use AI at all: it's deterministic XML extraction. The benchmark scores reflect the code, not the model.
See the Code
Here's how AILANG Parse extracts track changes — a feature unique to direct XML parsing:
```
-- A content block extracted from a document
export type Block =
    TextBlock({text: string, style: string, level: int})
  | TableBlock({rows: [[TableCell]], headers: [TableCell]})
  | ImageBlock({data: string, description: string, mime: string})
  | AudioBlock({data: string, transcription: string, mime: string})
  | VideoBlock({data: string, description: string, mime: string})
  | ListBlock({items: [string], ordered: bool})
  | HeadingBlock({text: string, level: int})
  | SectionBlock({kind: string, blocks: [Block]})
  | ChangeBlock({changeType: string, author: string, date: string, text: string})

-- Extract track change annotations from a paragraph
-- Finds w:ins, w:del, w:moveTo, w:moveFrom and creates ChangeBlocks
-- with author, date, change type, and affected text.
pure func extractParagraphChanges(p: XmlNode) -> [Block] {
  let children = getChildren(p);
  flatMap(extractChangeFromNode, children)
}

-- Create a ChangeBlock from a track change XML node
pure func makeChangeBlock(changeType: string, node: XmlNode) -> Block {
  let author = getOrElse(getAttr(node, "w:author"), "Unknown");
  let date = getOrElse(getAttr(node, "w:date"), "");
  let text = extractChangeText(node);
  ChangeBlock({changeType: changeType, author: author, date: date, text: text})
}

-- Z3-verified format detection with inline tests
export pure func detectFormat(ext: string) -> string
  ensures {
    result == "zip-office" || result == "pdf" || result == "image" ||
    result == "audio" || result == "video" || result == "csv" ||
    result == "markdown" || result == "html" || result == "epub" ||
    result == "zip-odf" || result == "text" || result == "unknown"
  }
  tests [
    ("docx", "zip-office"),
    ("pptx", "zip-office"),
    ("xlsx", "zip-office"),
    ("odt", "zip-odf"),
    ("pdf", "pdf"),
    ("png", "image"),
    ("csv", "csv"),
    ("md", "markdown")
  ]
```
Full source: 31 AILANG modules, 50+ contracts, 76+ inline tests.
Eval-Driven Development
AILANG Parse is not hand-coded and then benchmarked. The benchmarks come first, and AI writes the parsing code against them. This is why the scores are high — and why we want to find documents that lower them.
Define the eval
We write a test file with known structure — merged cells, track changes, speaker notes, threaded comments — and hand-verify the expected output as ground truth.
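A hand-verified ground-truth entry might look roughly like the sketch below. The field names here are illustrative assumptions; the real schema lives in the benchmark's golden output files:

```python
# Hypothetical shape of one eval file's hand-verified ground truth.
# Field names and structure are illustrative, not the actual schema.
ground_truth = {
    "file": "track_changes_basic.docx",
    "features": ["headings", "track_changes", "comments"],
    "expected": {
        "headings": [{"text": "Contract Draft", "level": 1}],
        "track_changes": [
            {"type": "insertion", "author": "A. Reviewer", "text": "net 30 days"},
            {"type": "deletion", "author": "A. Reviewer", "text": "net 60 days"},
        ],
        "comments": [{"author": "B. Editor", "text": "Confirm payment terms."}],
    },
}
```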
AI writes the parser
An AI agent reads the relevant spec (ECMA-376 for Office, RFC 5322 for email, and more as we expand) and writes AILANG extraction code, targeting the eval. Z3 contracts verify correctness.
Score improves
The eval score goes up because the code was written specifically to pass it. This is the point — and it's why finding new test cases that lower the score is so valuable.
Help Us Find Our Gaps
We score 93.9% today — intentionally including aspirational metrics that lower our own score. ECMA-376 alone is over 5,000 pages, and we're expanding into email, calendar, and more formats. There are documents in the wild that will break our parser. We want to find them.
If you have a document that AILANG Parse doesn't handle well, send it to us. If it reveals a gap that gets added to our eval corpus, you get 1 month of Business tier free (worth €99).
Your document is used only for eval purposes. We can anonymize content before adding it to the benchmark corpus. Documents that don't reveal new gaps are deleted. Any format welcome — Office, email, calendar, or anything else you throw at us.
PDF Benchmark (OmniDocBench)
For PDFs, AILANG Parse delegates to whatever AI model you plug in via AILANG's AI effect. The benchmark measures the model's accuracy, not our parsing code — and we're transparent about that.
| Model | Text ED ↓ | Table TEDS ↑ | Reading Order ED ↓ |
|---|---|---|---|
| Gemini 2.5 Flash | 0.183 | 0.871 | 0.141 |
| Gemini 2.0 Flash | 0.210 | 0.842 | 0.168 |
| Ollama (granite-docling) | 0.890 | 0.120 | 0.920 |
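The Text ED column is a normalized character-level edit distance (lower is better). A minimal sketch, assuming standard Levenshtein distance divided by the longer string's length; OmniDocBench's exact implementation may differ:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_ed(truth: str, output: str) -> float:
    """Edit distance normalized to [0, 1] by the longer string."""
    if not truth and not output:
        return 0.0
    return edit_distance(truth, output) / max(len(truth), len(output))

print(round(normalized_ed("kitten", "sitting"), 4))  # 0.4286
```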
gemini-2.5-flash offers the best balance of accuracy and speed. Ollama models score low due to structured JSON output limitations.

Run the Benchmarks
Every number on this page comes from benchmarks/officedocbench/results/summary.json, regenerated whenever the eval runs. Three ways to reproduce, depending on what you have installed:
```shell
# 1. AILANG Parse only — instant, no extra dependencies
uv run benchmarks/officedocbench/eval_officedocbench.py

# 2. Single competitor — requires that adapter installed
uv run benchmarks/officedocbench/eval_officedocbench.py --adapter kreuzberg

# 3. Full leaderboard — all 8 adapters
uv pip install -e '.[competitors]'
uv run benchmarks/officedocbench/eval_officedocbench.py --all
```
Useful flags: --live re-parses files instead of using cached golden outputs, --format docx filters to one format, --json / --latex change report output.
The --all run writes summary.json (canonical) plus a mirror at docs/data/officedocbench-summary.json. The website reads the mirror at page load and rewrites every score in place — no rebuild step. Adding your own parser? Implement OfficeDocBenchAdapter in adapters/ and register it in eval_officedocbench.py:load_adapter().
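A new adapter might look roughly like the skeleton below. The exact OfficeDocBenchAdapter interface (method names, return shape) is not shown on this page, so everything here is an assumption; check the real base class in adapters/ before implementing:

```python
# Hypothetical adapter skeleton. Method names (supports, parse) and the
# return shape are illustrative assumptions, not the actual interface.
class MyParserAdapter:
    name = "myparser"
    supported_formats = {"docx", "xlsx"}

    def supports(self, ext: str) -> bool:
        """Report whether this adapter handles the given file extension."""
        return ext in self.supported_formats

    def parse(self, path: str) -> dict:
        """Return the benchmark's normalized structure for one file."""
        return {"blocks": [], "metadata": {}}

adapter = MyParserAdapter()
print(adapter.supports("docx"))  # True
```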
Frequently Asked Questions
How does AILANG Parse compare to other document parsers on benchmarks?
On OfficeDocBench (69 files, 11 formats, 7 scoring dimensions, 17 features), AILANG Parse scores 93.9% composite with 100% format coverage. The benchmark tests table merge spans, track changes, comments, heading hierarchy, equation text, field codes, bookmarks, section breaks, and more across DOCX, PPTX, XLSX, and 8 additional formats. Coverage-adjusted, the nearest competitor (Kreuzberg) scores 68.2%.
Is the OfficeDocBench benchmark open source?
Yes. The benchmark suite, golden output files, and scoring scripts are all open source. You can run the full comparison on your own machine. The methodology uses structural diff scoring, not text similarity, so it measures actual structural fidelity.
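The difference between structural diff scoring and text similarity can be shown with a toy example: two parses can contain identical words yet differ structurally. This is an illustration, not the benchmark's scoring code:

```python
# Ground truth vs a parse that flattened to text and lost track changes.
truth  = {"headings": 2, "tables": 1, "track_changes": 3}
parsed = {"headings": 2, "tables": 1, "track_changes": 0}

def count_recall(truth: dict, parsed: dict) -> float:
    """Average per-type recall of element counts (capped at 1.0 each)."""
    scores = [min(parsed.get(k, 0), v) / v for k, v in truth.items() if v]
    return sum(scores) / len(scores)

# Text similarity could still be near-perfect here, but structural
# recall exposes the three missing track changes.
print(round(count_recall(truth, parsed), 2))  # 0.67
```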
Why do Docling and Unstructured score the same on OfficeDocBench?
Both Docling and Unstructured use the same fundamental approach for Office formats: convert to PDF first, then apply ML-based layout detection. Since the same structural information is lost during PDF conversion — merge attributes, track changes, comments — both tools hit the same ceiling. AILANG Parse bypasses this ceiling entirely by reading Office XML directly.
Why does AILANG Parse score so much higher than competitors?
AILANG Parse is built with an eval-first methodology: we define structural benchmark tests with hand-verified ground truth, then an AI agent writes AILANG parsing code specifically to pass them. We intentionally include aspirational metrics (bookmarks, footnotes, section breaks) that lower our own score to 93.9% — creating transparent roadmap targets. The gap reflects fundamentally different development approaches, not cherry-picked tests. Every result is fully reproducible.
How can I submit a document that doesn't parse correctly?
Email it to docparse@sunholo.com. If your document reveals a parsing gap that gets added to our eval corpus, you receive 1 month of Business tier free (worth €99). You can anonymize content before sending. We accept any format — Office, email, calendar, or anything else. See the Document Submission Program for details.