Why AILANG Parse? Stop Parsing Photos of Your Documents

The Invisible Assumption

Most business documents originate as Office files: DOCX, PPTX, XLSX. Yet the default instinct is to export to PDF and then parse. Converting DOCX to PDF before parsing destroys structured data, then spends compute trying to recover it.

It's like taking a photograph of a spreadsheet and then using computer vision to read the numbers — when you could just open the file.

A DOCX file is not opaque binary. It's a ZIP archive containing well-structured XML, governed by the ECMA-376 standard. Track changes have author attribution and timestamps. Tables have explicit merge spans. Comments are threaded. Headers and footers are section-scoped. All of this information is right there in the XML.
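To make the "ZIP archive containing XML" point concrete, here is a minimal sketch using only the Python standard library. It is not AILANG Parse's implementation. It builds a tiny stand-in archive in memory (a real DOCX has more parts, such as `[Content_Types].xml`), then reads `word/document.xml` and lists tracked insertions and deletions with their author and date. The sample text, authors, and the helper names `build_sample_docx` and `tracked_changes` are illustrative assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as defined by ECMA-376.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative stand-in for word/document.xml: one paragraph containing
# one tracked deletion and one tracked insertion.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Payment due in </w:t></w:r>
      <w:del w:id="1" w:author="Alice" w:date="2024-03-01T10:00:00Z">
        <w:r><w:delText>30</w:delText></w:r>
      </w:del>
      <w:ins w:id="2" w:author="Bob" w:date="2024-03-02T09:30:00Z">
        <w:r><w:t>45</w:t></w:r>
      </w:ins>
      <w:r><w:t xml:space="preserve"> days.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def build_sample_docx() -> bytes:
    """Pack the sample XML into an in-memory ZIP (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    return buf.getvalue()


def tracked_changes(docx_bytes: bytes) -> list:
    """Return tracked insertions, then deletions, with author and date."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    changes = []
    # Inserted text lives in w:t, deleted text in w:delText.
    for kind, tag, text_tag in (("insert", "ins", "t"), ("delete", "del", "delText")):
        for el in root.iter(f"{{{W}}}{tag}"):
            text = "".join(t.text or "" for t in el.iter(f"{{{W}}}{text_tag}"))
            changes.append({
                "kind": kind,
                "author": el.get(f"{{{W}}}author"),
                "date": el.get(f"{{{W}}}date"),
                "text": text,
            })
    return changes
```

Calling `tracked_changes(build_sample_docx())` yields one record per revision. The sketch ignores move tracking, formatting changes, and revisions in headers, footnotes, or comments.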

What Gets Lost

Feature	Direct XML Parsing	After PDF Conversion
Track changes	Author, date, and change type preserved	Lost: only the accepted or the rejected view survives
Merged cells	Structural colspan/rowspan from XML	Flattened or mangled by layout detection
Comments	Author-attributed, threaded, with anchors	Dropped entirely in PDF rendering
Headers / footers	Per-section, first/even/odd distinct	Mixed into body text or ignored
Text boxes	Position, content, and z-order preserved	Lost or garbled by spatial analysis
Metadata	Title, author, dates, keywords, revision count	Stripped or partially preserved
Footnotes	Numbered, linked to reference points	Detached from context, misattributed
Hyperlinks	URL targets from relationship XML	Visual blue text — URL lost

Every row in this table represents real-world data that legal, compliance, and audit teams rely on. Track changes in a contract are not cosmetic — they are the negotiation history. Losing them is not a quality tradeoff; it is data destruction.
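The hyperlinks row is a good illustration of how the data is stored. In WordprocessingML, a `w:hyperlink` element carries only a relationship id, and the URL itself sits in the part's relationship file (`word/_rels/document.xml.rels`). The sketch below resolves one against the other with the standard library. The sample XML, the URL, and the function name `resolve_hyperlinks` are illustrative assumptions, not AILANG Parse code.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

# Illustrative stand-in for word/_rels/document.xml.rels.
RELS_XML = f"""<Relationships xmlns="{PKG}">
  <Relationship Id="rId5" Type="{R}/hyperlink"
                Target="https://example.com/terms" TargetMode="External"/>
</Relationships>"""

# Illustrative stand-in for word/document.xml.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body><w:p>
    <w:hyperlink r:id="rId5"><w:r><w:t>Terms of service</w:t></w:r></w:hyperlink>
  </w:p></w:body>
</w:document>"""


def resolve_hyperlinks(document_xml: str, rels_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each hyperlink's visible text with its target URL (None if unresolved)."""
    # Map relationship id -> target from the .rels part.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG}}}Relationship")
    }
    links = []
    for link in ET.fromstring(document_xml).iter(f"{{{W}}}hyperlink"):
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        links.append((text, targets.get(link.get(f"{{{R}}}id"))))
    return links
```

Internal links that use `w:anchor` instead of `r:id` have no relationship entry, so this sketch reports their target as `None`.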

Why Nobody Did This Before

The OOXML specification is over 5,000 pages. Track changes alone involve multiple XML namespaces with move tracking and range markers that span paragraph boundaries. Merged cells in DOCX tables use a different model than HTML's colspan and rowspan: each cell carries gridSpan and vMerge elements in its properties, and column positions are counted against a separate table grid definition. Numbered lists resolve through a chain of abstract numbering definitions, numbering instances, and style overrides.
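As a rough sketch of the merged-cell model, the code below walks a small `w:tbl` and converts its `w:gridSpan` and `w:vMerge` markers into HTML-style colspan and rowspan values. It is a simplified illustration rather than AILANG Parse's logic: it assumes a top-level table and ignores nested tables, `w:gridBefore`/`w:gridAfter`, and malformed merges. The sample table and the name `html_spans` are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"

# Illustrative 3-column table: "Region" spans two rows, "Sales" spans two columns.
TBL_XML = f"""<w:tbl xmlns:w="{W}">
  <w:tblGrid><w:gridCol/><w:gridCol/><w:gridCol/></w:tblGrid>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p><w:r><w:t>Region</w:t></w:r></w:p></w:tc>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p><w:r><w:t>Sales</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>Q1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Q2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""


def html_spans(tbl_xml: str) -> list:
    """Return one dict per visible cell with its grid position, colspan and rowspan."""
    tbl = ET.fromstring(tbl_xml)
    cells = []
    open_merges = {}  # grid column -> cell whose vertical merge is still open
    for row_idx, tr in enumerate(tbl.findall("w:tr", NS)):
        col = 0  # position in the table grid, not the cell index
        for tc in tr.findall("w:tc", NS):
            grid_span = tc.find("w:tcPr/w:gridSpan", NS)
            v_merge = tc.find("w:tcPr/w:vMerge", NS)
            colspan = int(grid_span.get(VAL)) if grid_span is not None else 1
            # A w:vMerge without w:val means "continue the merge from above".
            if v_merge is not None and v_merge.get(VAL, "continue") == "continue":
                if col in open_merges:
                    open_merges[col]["rowspan"] += 1
            else:
                text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
                cell = {"row": row_idx, "col": col, "colspan": colspan,
                        "rowspan": 1, "text": text}
                cells.append(cell)
                if v_merge is not None:  # w:val="restart" opens a new merge
                    open_merges[col] = cell
                else:
                    open_merges.pop(col, None)
            col += colspan
    return cells
```

The key difference from HTML is visible in the second row: the continuation cell still exists in the XML and must be folded into the cell above it, whereas HTML simply omits it.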

Nobody wanted to write a correct parser from scratch. The rational economic choice was: convert to PDF (which LibreOffice already handles) and parse the PDF (which has a mature ecosystem). The structured data loss was the price of tractability.

What Changed

AILANG made it tractable. AI agents write the extraction logic from the ECMA-376 spec, and Z3 formal verification catches edge cases that manual testing would miss. The result: 63 verified contracts across 31 modules, covering filter bounds, structural invariants, and 1:1 mapper preservation.

The moat is not the parser itself — it is the parser-building system. When a new OOXML edge case appears, an AI agent reads the relevant spec section, writes the extraction logic, and Z3 proves it correct for all inputs. The gap between "discovered" and "fixed" collapses from weeks to hours.

The Numbers

OfficeDocBench is the first structural benchmark specifically for Office document parsing — testing track changes, merged cells, comments, headers/footers, text boxes, images, and metadata across 69 test files in 11 formats.

How parsers compare:

Coverage-adjusted composite (penalises tools that skip formats):

Parser	Score
AILANG Parse	93.9%
Kreuzberg	68.0%
Raw OOXML	52.6%
Pandoc	48.2%
Docling	38.5%
Unstructured	38.7%
MarkItDown	51.2%

The gap comes almost entirely from track changes, merged cells, comments, and the long tail of formats most tools skip — structural features that don't survive PDF conversion.

These scores are also our development targets: the AI agent optimizes against OfficeDocBench. When you see 93.9%, you're seeing eval-driven development, not cherry-picked benchmarks.

How it runs: Office parsing makes zero AI calls and zero network requests, with zero per-page billing. Output is deterministic: the same input always produces the same blocks.

Help us raise the bar. Submit documents that challenge our parser to docparse@sunholo.com. If they improve our eval corpus, you get 1 month of Business tier free.

Works Alongside Your Existing Pipeline

AILANG Parse reads Office formats directly from XML. For PDFs, scanned documents, and images, tools like Unstructured, Docling, and LlamaParse have mature OCR pipelines and layout analysis.

The typical approach elsewhere is to convert to PDF (losing structure), use basic text extractors (missing track changes), or skip Office formats entirely. AILANG Parse handles Office formats with full structural fidelity — use it standalone or alongside your existing PDF pipeline.

With Unstructured

Route DOCX/PPTX/XLSX through AILANG Parse for structural extraction. Route PDFs through Unstructured's layout analysis. Merge into a unified block stream.
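The routing described above can be sketched as a small dispatcher. This is a hedged illustration only: the route names, the extension sets, and the `parsers` mapping are assumptions, and the callables stand in for whatever client code you use to invoke AILANG Parse and your PDF tool. No real API of either product is shown.

```python
from pathlib import Path

# Assumed extension sets; adjust to the formats your pipeline actually sees.
OFFICE_EXTENSIONS = {".docx", ".pptx", ".xlsx"}
LAYOUT_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def choose_parser(path: str) -> str:
    """Pick a route name from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        return "office"   # structural extraction straight from the XML
    if suffix in LAYOUT_EXTENSIONS:
        return "layout"   # OCR / layout analysis pipeline
    raise ValueError(f"no parser configured for {path!r}")


def parse_all(paths, parsers):
    """Run each file through its route and merge the results into one block stream.

    `parsers` maps a route name to a callable that takes a path and returns
    a list of dict blocks (a hypothetical interface for this sketch).
    """
    merged = []
    for path in paths:
        route = choose_parser(path)
        for block in parsers[route](path):
            # Tag every block with its origin so downstream code can tell
            # structurally extracted blocks from layout-detected ones.
            merged.append({"source": path, "route": route, **block})
    return merged
```

Keeping the route on every block makes it possible to treat the two kinds of output differently later, for example trusting table spans from the Office route more than those inferred from layout.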

With Docling

Use AILANG Parse for Office formats where Docling converts to PDF internally. Use Docling for scientific papers and PDF-native documents where its table detection excels.

With LlamaParse

AILANG Parse handles deterministic Office parsing locally — no API calls. Send only PDFs and images to LlamaParse's cloud API to reduce cost and latency.

Standalone

AILANG Parse includes its own AI-backed PDF parsing via any model (Gemini, Claude, Ollama). One tool, every format, pluggable AI.

See for Yourself

Upload a DOCX with track changes, or an XLSX with merged cells, or a PPTX with speaker notes. Compare the output to what your current parser returns. The structural data is either there or it is not.


Frequently Asked Questions

Why does converting DOCX to PDF lose data?

A DOCX is structured XML. Converting to PDF flattens it into visual rendering, destroying track changes, merge attributes, and comments. AILANG Parse reads the XML directly.

How should I parse Office documents for a RAG pipeline?

AILANG Parse extracts a typed Block ADT that preserves heading hierarchy, table structure, and metadata — producing cleaner chunks than PDF-first parsers that flatten everything into paragraphs.

Is DOCX or PDF better for AI ingestion?

DOCX is dramatically better when the source was authored in Word or Google Docs. AILANG Parse reads DOCX XML directly — no AI calls, no per-page billing — while PDF conversion pipelines re-OCR every page and lose structural data like merged cells, track changes and comments.

What problems does PDF conversion cause for LLM document processing?

PDF conversion flattens tables, strips revision metadata, and makes reading order unreliable. AILANG Parse eliminates all three by reading Office XML directly.

What is structural document extraction?

Structural document extraction preserves semantic structure (headings, table cells, merged cells, tracked changes) rather than just flat text. AILANG Parse produces a typed Block ADT where each element is a distinct structured type.
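To show why typed blocks help downstream, here is a hypothetical sketch of a small block union and a heading-based chunker of the kind a RAG pipeline might use. The types `Heading`, `Paragraph`, and `TrackedChange` and their fields are invented for illustration; they are not AILANG Parse's actual Block definition.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Heading:
    level: int
    text: str


@dataclass(frozen=True)
class Paragraph:
    text: str


@dataclass(frozen=True)
class TrackedChange:
    kind: str    # "insert" or "delete"
    author: str
    text: str


# Hypothetical stand-in for a typed block ADT: a closed union of block types.
Block = Union[Heading, Paragraph, TrackedChange]


def chunk_by_heading(blocks: List[Block]) -> List[dict]:
    """Group blocks into chunks, starting a new chunk at every heading."""
    chunks = []
    for block in blocks:
        is_heading = isinstance(block, Heading)
        if is_heading or not chunks:
            # Content before the first heading goes into a chunk with no heading.
            chunks.append({"heading": block.text if is_heading else None, "body": []})
            if is_heading:
                continue
        if isinstance(block, Paragraph):
            chunks[-1]["body"].append(block.text)
        elif isinstance(block, TrackedChange):
            # Keep revision data visible in the chunk instead of dropping it.
            chunks[-1]["body"].append(f"[{block.kind} by {block.author}: {block.text}]")
    return chunks
```

Because each block announces its type, the chunker never has to guess whether a line of text is a heading, and revision data stays attached to the section it belongs to.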