Head-to-Head: DOCX Parsing (29 Files)
The fairest comparison: 29 DOCX files that all 8 tools can parse. Same files, same format, same metrics. No format advantage — just parsing quality on the most important Office format.
| Tool | Version | Composite | Feat. Det. | Struct. Quality | Content Fidelity | Metadata |
|---|---|---|---|---|---|---|
| AILANG Parse | v0.3.0 (AILANG) | 90.9% | 89.7% | 94.2% | 78.7% | 99.5% |
| Raw OOXML | v1.0.0 (stdlib) | 82.4% | 78.1% | 75.5% | 66.0% | 100.0% |
| Kreuzberg | v4.7.2 (Rust) | 73.8% | 61.3% | 59.9% | 60.7% | 87.7% |
| Pandoc | v3.9.0 (Haskell) | 69.0% | 67.3% | 72.5% | 63.3% | 9.3% |
| Unstructured | v0.22.16 | 59.8% | 57.1% | 55.7% | 61.2% | 0.0% |
| Docling | v2.84.0 (IBM) | 59.7% | 52.8% | 62.3% | 58.6% | 0.0% |
29 DOCX files, 7 scoring dimensions. AILANG Parse leads by 8.5+ points even on pure DOCX. Scores include aspirational dimensions (bookmarks, footnote text, section breaks) that intentionally lower every tool's score — including ours. Parser versions as of 2026-04-06.
AILANG Parse reads the document XML directly, extracting w:tbl, w:p, and w:r nodes from the ZIP archive and mapping them to typed blocks. Raw OOXML (stdlib Python) takes the same XML approach but loses quality on heading levels, author attribution, and list structure. Kreuzberg and Pandoc use intermediate Markdown, dropping track changes, comments, and text boxes entirely. The gap is architectural, not cosmetic.

Full Suite: OfficeDocBench (69 Files, 11 Formats, 7 Metrics)
The full suite spans 11 formats with 54 core files plus 15 challenge files. Scores use 7 weighted metrics including ECMA-376 spec-driven dimensions. Coverage-Adjusted = Composite × format coverage — penalizes tools that skip formats.
Run uv run benchmarks/officedocbench/eval_officedocbench.py --all to reproduce every result on this page. Apache 2.0 licensed.

| Tool | Files | Coverage | Composite | Adjusted | Feat. Det. | Struct. Quality | Content Fidelity | Metadata |
|---|---|---|---|---|---|---|---|---|
| AILANG Parse | 69/69 | 100% | 93.9% | 93.9% | 91.9% | 95.3% | 81.4% | 99.6% |
| Kreuzberg | 66/69 | 96% | 71.3% | 68.2% | 66.8% | 63.9% | 61.0% | 86.2% |
| LlamaParse | 69/69 | 100% | 54.4% | 54.4% | 61.0% | 60.6% | 45.0% | 17.4% |
| Raw OOXML | 43/69 | 62% | 84.4% | 52.6% | 82.0% | 80.3% | 68.1% | 100.0% |
| MarkItDown | 52/69 | 75% | 67.9% | 51.2% | 72.0% | 73.1% | 64.6% | 15.4% |
| Pandoc | 45/69 | 65% | 74.6% | 48.6% | 75.7% | 81.2% | 66.1% | 24.4% |
| Docling | 42/69 | 61% | 64.0% | 38.9% | 61.3% | 72.2% | 61.4% | 2.4% |
| Unstructured | 43/69 | 62% | 62.1% | 38.7% | 62.4% | 62.4% | 61.8% | 2.3% |
Coverage-Adjusted = Composite × (files parsed / total files). Raw OOXML scores 84.4% on the 43 files it can parse, but only covers 62% of the benchmark. AILANG Parse and LlamaParse are the only tools with 100% format coverage; AILANG Parse leads on every dimension. Run date: 2026-04-08.
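The adjustment described above can be sketched in a few lines of Python. This mirrors the stated formula using numbers from the table; it is an illustration, not the benchmark's internal code:

```python
def coverage_adjusted(composite: float, files_parsed: int, total_files: int) -> float:
    """Composite score scaled by the fraction of benchmark files parsed."""
    return composite * (files_parsed / total_files)

# Raw OOXML from the table: high composite, low coverage
print(round(coverage_adjusted(84.4, 43, 69), 1))  # 52.6
```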
Per-Format Breakdown (AILANG Parse)
DOCX score drops to 90.9% due to aspirational metrics (bookmarks, footnotes, section breaks, hyperlinks). Scores below 100% represent real roadmap targets. Formats with blue borders are not supported by most competitors.
Exclusive Features AILANG Parse Only
These structural features are extracted by AILANG Parse but missed by most or all other parsers tested. They represent real capabilities that matter for document understanding — and they're why the composite gap exists.
| Feature | AILANG Parse | Raw OOXML | Pandoc | Kreuzberg | Others | Why It Matters |
|---|---|---|---|---|---|---|
| Track Changes | 3/3 | 2/3 | 3/3 | 0/3 | 0/3 | Legal review, contract redlining, audit trails |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | 0/2 | Review feedback, annotation extraction |
| Headers/Footers | 3/3 | 2/2 | 0/3 | 2/3 | 0–1 | Document metadata, page numbering, letterheads |
| Text Boxes | 2/2 | 1/2 | 0/2 | 0/2 | 0/2 | Callouts, sidebars, floating content |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | OMML math content extraction |
| Field Codes (§17.16) | 1/1 | 1/1 | 0/1 | 1/1 | 0/1 | Dates, page counts, cross-references |
| Merged Cells | Yes | Yes | No | No | No | Complex tables, financial reports, schedules |
Raw OOXML and Pandoc detect some features through XML/AST access, but miss others. No tool detects footnotes, bookmarks, section breaks, hyperlinks, or styles yet — these are aspirational targets for all.
Feature Detection Heatmap (17 Features)
Feature-by-feature detection across all 69 test files. Fractions show files where the feature was correctly detected out of files where it should be present. Features marked ASPIRATIONAL are ECMA-376 spec targets that no tool handles yet.
| Feature | AILANG Parse | Raw OOXML | Pandoc | Kreuzberg | MarkItDown | Unstructured | Docling |
|---|---|---|---|---|---|---|---|
| Headings | 30/30 | 21/21 | 22/23 | 24/29 | 9/9 | 20/21 | 20/21 |
| Tables | 36/36 | 16/16 | 18/18 | 29/34 | 17/17 | 9/10 | 13/15 |
| Track Changes | 3/3 | 2/3 | 3/3 | 0/3 | — | 0/3 | 0/3 |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | — | 0/2 | 0/2 |
| Headers/Footers | 3/3 | 2/2 | 0/3 | 2/3 | — | 1/2 | 0/2 |
| Text Boxes | 2/2 | 1/2 | 0/2 | 0/2 | — | 0/2 | 0/2 |
| Images | 7/8 | 3/4 | 5/5 | 3/7 | 2/3 | 0/4 | 2/4 |
| Lists | 15/15 | 1/4 | 11/12 | 8/13 | 4/4 | 4/4 | 1/4 |
| Sheet Names | 4/5 | 1/1 | — | 4/4 | 0/1 | — | 0/1 |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Field Codes (§17.16) | 1/1 | 1/1 | 0/1 | 1/1 | — | 0/1 | 0/1 |
| Footnotes ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Hyperlinks ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Styles ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Bookmarks (§17.13.6) ASPIRATIONAL | 0/1 | 0/1 | 0/1 | 0/1 | — | 0/1 | 0/1 |
| Section Breaks (§17.6) ASPIRATIONAL | 0/8 | 0/8 | 0/8 | 0/8 | — | 0/8 | 0/8 |
Aspirational features (purple rows) are ECMA-376 spec-referenced targets that intentionally lower every tool's score. They represent the next frontier — features that should be extracted but currently aren't by any parser. AILANG Parse uniquely detects equations and leads on track changes, comments, headers/footers, text boxes, and lists.
What We Measure: 7 Scoring Dimensions
OfficeDocBench uses 7 weighted scoring dimensions, many driven by ECMA-376 spec references. Each dimension scores [0, 1] and the composite is a weighted average. Aspirational sub-dimensions intentionally score 0 when ground truth data exists but no parser handles the feature yet.
| Dimension | Weight | What It Tests | Spec References |
|---|---|---|---|
| Feature Detection | 15% | Binary: does the parser detect each feature present in the document? 17 features tested including aspirational targets. | §17.13.6, §22.1, §17.16, §17.6 |
| Structural Recall | 20% | Completeness: correct count of tables, track changes, comments, headings, headers/footers, images. Type matching for track change operations. | — |
| Structural Quality | 15% | Heading level distribution, TC author attribution, comment text matching, list numbering accuracy, table merge span accuracy, heading text match, section break detection, comment range accuracy. | §17.9, §18.3.1.55, §17.6, §17.13.1 |
| Content Fidelity | 15% | Key phrase recall, paragraph count, element ordering (LCS), hyperlink extraction, style preservation, equation text, field display text, footnote text, bookmark detection. | §22.1, §17.16, §17.13.6 |
| Text Jaccard | 10% | Word-level Jaccard similarity between ground truth and parser output. Measures raw text extraction quality. | — |
| Element Count | 15% | Per-type count precision across 9 element types: headings, tables, track changes, comments, images, lists, text boxes, footnotes, speaker notes. | — |
| Metadata | 10% | Exact match on title, author, created, modified timestamps. Sheet name accuracy with partial credit for partial matches. | — |
Weights emphasize structural extraction (Structural Recall 20% + Element Count 15% = 35%) since that is the benchmark's primary purpose. Aspirational sub-dimensions within Structural Quality and Content Fidelity intentionally lower scores to create roadmap targets.
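Two of the dimensions above are simple enough to sketch directly: word-level Jaccard similarity and the weighted composite. The weights below mirror the table; the real scoring lives in eval_officedocbench.py, so treat this as an illustration only:

```python
def text_jaccard(ground_truth: str, parsed: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over word sets."""
    a, b = set(ground_truth.lower().split()), set(parsed.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Dimension weights from the table above (sum to 1.0)
WEIGHTS = {
    "feature_detection": 0.15,
    "structural_recall": 0.20,
    "structural_quality": 0.15,
    "content_fidelity": 0.15,
    "text_jaccard": 0.10,
    "element_count": 0.15,
    "metadata": 0.10,
}

def composite(scores: dict) -> float:
    """Weighted average of per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(text_jaccard("the quick brown fox", "the quick fox"))  # 0.75
```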
Why AILANG?
AILANG Parse is written in AILANG, a language designed for building AI-native applications. AILANG isn't incidental to the results — it's why AILANG Parse achieves them.
Deterministic by Default
AILANG's effect system separates pure parsing logic from IO. The pure func annotation guarantees the XML→Block pipeline produces identical output for identical input — no hidden state, no ambient randomness. That's why structural recall scores 98.8% across all 69 test files.
Algebraic Data Types
The Block ADT (9 variants: Text, Heading, Table, Image, Audio, Video, List, Section, Change) is exhaustively matched by the compiler. Adding a new block type is a compile error until every parser and formatter handles it. Other parsers typically use untyped dictionaries.
Zero Runtime Dependencies
Every parser is AILANG + stdlib. No Python, no Java, no system libraries. ZIP extraction, XML parsing, and format routing are all AILANG code with inline tests and Z3-verified contracts. The entire parser fits in a single binary.
Contracts + Inline Tests
50+ contracts verified by Z3 at build time — format detection returns valid categories, mappers preserve element counts, filters respect bounds. Inline tests [...] on every function catch regressions before CI.
AI as an Effect
PDF and image parsing use AILANG's AI effect — the model is pluggable (--ai gemini-2.5-flash, --ai claude-haiku, --ai granite-docling). Office parsing doesn't use AI at all: it's deterministic XML extraction. The benchmark scores reflect the code, not the model.
See the Code
Here's how AILANG Parse extracts track changes — a feature unique to direct XML parsing:
```
-- A content block extracted from a document
export type Block =
    TextBlock({text: string, style: string, level: int})
  | TableBlock({rows: [[TableCell]], headers: [TableCell]})
  | ImageBlock({data: string, description: string, mime: string})
  | AudioBlock({data: string, transcription: string, mime: string})
  | VideoBlock({data: string, description: string, mime: string})
  | ListBlock({items: [string], ordered: bool})
  | HeadingBlock({text: string, level: int})
  | SectionBlock({kind: string, blocks: [Block]})
  | ChangeBlock({changeType: string, author: string, date: string, text: string})

-- Extract track change annotations from a paragraph
-- Finds w:ins, w:del, w:moveTo, w:moveFrom and creates ChangeBlocks
-- with author, date, change type, and affected text.
pure func extractParagraphChanges(p: XmlNode) -> [Block] {
  let children = getChildren(p);
  flatMap(extractChangeFromNode, children)
}

-- Create a ChangeBlock from a track change XML node
pure func makeChangeBlock(changeType: string, node: XmlNode) -> Block {
  let author = getOrElse(getAttr(node, "w:author"), "Unknown");
  let date = getOrElse(getAttr(node, "w:date"), "");
  let text = extractChangeText(node);
  ChangeBlock({changeType: changeType, author: author, date: date, text: text})
}

-- Z3-verified format detection with inline tests
export pure func detectFormat(ext: string) -> string
  ensures {
    result == "zip-office" || result == "pdf" || result == "image" ||
    result == "audio" || result == "video" || result == "csv" ||
    result == "markdown" || result == "html" || result == "epub" ||
    result == "zip-odf" || result == "text" || result == "unknown"
  }
  tests [
    ("docx", "zip-office"),
    ("pptx", "zip-office"),
    ("xlsx", "zip-office"),
    ("odt", "zip-odf"),
    ("pdf", "pdf"),
    ("png", "image"),
    ("csv", "csv"),
    ("md", "markdown")
  ]
```
Full source: 31 AILANG modules, 50+ contracts, 76+ inline tests.
Eval-Driven Development
AILANG Parse is not hand-coded and then benchmarked. The benchmarks come first, and AI writes the parsing code against them. This is why the scores are high — and why we want to find documents that lower them.
Define the eval
We write a test file with known structure — merged cells, track changes, speaker notes, threaded comments — and hand-verify the expected output as ground truth.
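A hand-verified ground-truth entry might look roughly like the sketch below. The field names here are illustrative assumptions; the real schema lives in the benchmark's golden output files:

```python
# Hypothetical shape of one eval file's hand-verified ground truth.
# Field names and structure are illustrative, not the actual schema.
ground_truth = {
    "file": "track_changes_basic.docx",
    "features": ["headings", "track_changes", "comments"],
    "expected": {
        "headings": [{"text": "Contract Draft", "level": 1}],
        "track_changes": [
            {"type": "insertion", "author": "A. Reviewer", "text": "net 30 days"},
            {"type": "deletion", "author": "A. Reviewer", "text": "net 60 days"},
        ],
        "comments": [{"author": "B. Editor", "text": "Confirm payment terms."}],
    },
}
```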
AI writes the parser
An AI agent reads the relevant spec (ECMA-376 for Office, RFC 5322 for email, and more as we expand) and writes AILANG extraction code, targeting the eval. Z3 contracts verify correctness.
Score improves
The eval score goes up because the code was written specifically to pass it. This is the point — and it's why finding new test cases that lower the score is so valuable.
Help Us Find Our Gaps
We score 93.9% today — intentionally including aspirational metrics that lower our own score. ECMA-376 alone is over 5,000 pages, and we're expanding into email, calendar, and more formats. There are documents in the wild that will break our parser. We want to find them.
If you have a document that AILANG Parse doesn't handle well, send it to us. If it reveals a gap that gets added to our eval corpus, you get 1 month of Business tier free (worth €99).
Your document is used only for eval purposes. We can anonymize content before adding it to the benchmark corpus. Documents that don't reveal new gaps are deleted. Any format welcome — Office, email, calendar, or anything else you throw at us.
PDF Benchmark (OmniDocBench)
For PDFs, AILANG Parse delegates to whatever AI model you plug in via AILANG's AI effect. The benchmark measures the model's accuracy, not our parsing code — and we're transparent about that.
| Model | Text ED ↓ | Table TEDS ↑ | Reading Order ED ↓ |
|---|---|---|---|
| Gemini 2.5 Flash | 0.183 | 0.871 | 0.141 |
| Gemini 2.0 Flash | 0.210 | 0.842 | 0.168 |
| Ollama (granite-docling) | 0.890 | 0.120 | 0.920 |
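The Text ED column is a normalized character-level edit distance (lower is better). A minimal sketch, assuming standard Levenshtein distance divided by the longer string's length; OmniDocBench's exact implementation may differ:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_ed(truth: str, output: str) -> float:
    """Edit distance normalized to [0, 1] by the longer string."""
    if not truth and not output:
        return 0.0
    return edit_distance(truth, output) / max(len(truth), len(output))

print(round(normalized_ed("kitten", "sitting"), 4))  # 0.4286
```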
gemini-2.5-flash offers the best balance of accuracy and speed. Ollama models score low due to structured JSON output limitations.

Run the Benchmarks
Every number on this page comes from benchmarks/officedocbench/results/summary.json, regenerated whenever the eval runs. Three ways to reproduce, depending on what you have installed:
```shell
# 1. AILANG Parse only — instant, no extra dependencies
uv run benchmarks/officedocbench/eval_officedocbench.py

# 2. Single competitor — requires that adapter installed
uv run benchmarks/officedocbench/eval_officedocbench.py --adapter kreuzberg

# 3. Full leaderboard — all 8 adapters
uv pip install -e '.[competitors]'
uv run benchmarks/officedocbench/eval_officedocbench.py --all
```
Useful flags: --live re-parses files instead of using cached golden outputs, --format docx filters to one format, --json / --latex change report output.
The --all run writes summary.json (canonical) plus a mirror at docs/data/officedocbench-summary.json. The website reads the mirror at page load and rewrites every score in place — no rebuild step. Adding your own parser? Implement OfficeDocBenchAdapter in adapters/ and register it in eval_officedocbench.py:load_adapter().
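A new adapter might look roughly like the skeleton below. The exact OfficeDocBenchAdapter interface (method names, return shape) is not shown on this page, so everything here is an assumption; check the real base class in adapters/ before implementing:

```python
# Hypothetical adapter skeleton. Method names (supports, parse) and the
# return shape are illustrative assumptions, not the actual interface.
class MyParserAdapter:
    name = "myparser"
    supported_formats = {"docx", "xlsx"}

    def supports(self, ext: str) -> bool:
        """Report whether this adapter handles the given file extension."""
        return ext in self.supported_formats

    def parse(self, path: str) -> dict:
        """Return the benchmark's normalized structure for one file."""
        return {"blocks": [], "metadata": {}}

adapter = MyParserAdapter()
print(adapter.supports("docx"))  # True
```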
Frequently Asked Questions
How does AILANG Parse compare to other document parsers on benchmarks?
On OfficeDocBench (69 files, 11 formats, 7 scoring dimensions, 17 features), AILANG Parse scores 93.9% composite with 100% format coverage. The benchmark tests table merge spans, track changes, comments, heading hierarchy, equation text, field codes, bookmarks, section breaks, and more across DOCX, PPTX, XLSX, and 8 additional formats. Coverage-adjusted, the nearest competitor (Kreuzberg) scores 68.2%.
Is the OfficeDocBench benchmark open source?
Yes. The benchmark suite, golden output files, and scoring scripts are all open source. You can run the full comparison on your own machine. The methodology uses structural diff scoring, not text similarity, so it measures actual structural fidelity.
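The difference between structural diff scoring and text similarity can be shown with a toy example: two parses can contain identical words yet differ structurally. This is an illustration, not the benchmark's scoring code:

```python
# Ground truth vs a parse that flattened to text and lost track changes.
truth  = {"headings": 2, "tables": 1, "track_changes": 3}
parsed = {"headings": 2, "tables": 1, "track_changes": 0}

def count_recall(truth: dict, parsed: dict) -> float:
    """Average per-type recall of element counts (capped at 1.0 each)."""
    scores = [min(parsed.get(k, 0), v) / v for k, v in truth.items() if v]
    return sum(scores) / len(scores)

# Text similarity could still be near-perfect here, but structural
# recall exposes the three missing track changes.
print(round(count_recall(truth, parsed), 2))  # 0.67
```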
Why do Docling and Unstructured score the same on OfficeDocBench?
Both Docling and Unstructured use the same fundamental approach for Office formats: convert to PDF first, then apply ML-based layout detection. Since the same structural information is lost during PDF conversion — merge attributes, track changes, comments — both tools hit the same ceiling. AILANG Parse bypasses this ceiling entirely by reading Office XML directly.
Why does AILANG Parse score so much higher than competitors?
AILANG Parse is built with an eval-first methodology: we define structural benchmark tests with hand-verified ground truth, then an AI agent writes AILANG parsing code specifically to pass them. We intentionally include aspirational metrics (bookmarks, footnotes, section breaks) that lower our own score to 93.9% — creating transparent roadmap targets. The gap reflects fundamentally different development approaches, not cherry-picked tests. Every result is fully reproducible.
How can I submit a document that doesn't parse correctly?
Email it to docparse@sunholo.com. If your document reveals a parsing gap that gets added to our eval corpus, you receive 1 month of Business tier free (worth €99). You can anonymize content before sending. We accept any format — Office, email, calendar, or anything else. See the Document Submission Program for details.