Stop Parsing Photos of Your Documents

Your DOCX is already structured XML. Why are you destroying it?

The Invisible Assumption

Most business documents originate as Office files: DOCX, PPTX, XLSX. Yet the default instinct is to export to PDF, then parse. Converting DOCX to PDF before parsing destroys structured data and then spends compute trying to recover it.

It's like taking a photograph of a spreadsheet and then using computer vision to read the numbers — when you could just open the file.

A DOCX file is not opaque binary. It's a ZIP archive containing well-structured XML, governed by the ECMA-376 standard. Track changes have author attribution and timestamps. Tables have explicit merge spans. Comments are threaded. Headers and footers are section-scoped. All of this information is right there in the XML.
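To make the "ZIP archive containing XML" point concrete, here is a minimal sketch using only the Python standard library. It builds a tiny in-memory package (just word/document.xml, so not a complete DOCX that Word would open) and reads the tracked insertions and deletions with their author and date. The function names and the sample sentence are made up for illustration; this is not how AILANG Parse is implemented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace defined by ECMA-376.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def build_sample_docx() -> bytes:
    """Build an in-memory ZIP holding only word/document.xml (illustration only)."""
    document_xml = (
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        '<w:r><w:t xml:space="preserve">Payment is due in </w:t></w:r>'
        '<w:del w:id="1" w:author="Alice" w:date="2024-03-01T10:00:00Z">'
        "<w:r><w:delText>30</w:delText></w:r></w:del>"
        '<w:ins w:id="2" w:author="Bob" w:date="2024-03-02T09:30:00Z">'
        "<w:r><w:t>45</w:t></w:r></w:ins>"
        '<w:r><w:t xml:space="preserve"> days.</w:t></w:r>'
        "</w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def tracked_changes(docx_bytes: bytes) -> list[dict]:
    """Return tracked insertions and deletions in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    changes = []
    for el in root.iter():
        if el.tag == f"{{{W}}}ins":
            kind, text_tag = "insert", "w:t"
        elif el.tag == f"{{{W}}}del":
            kind, text_tag = "delete", "w:delText"  # deleted text lives in w:delText
        else:
            continue
        text = "".join(t.text or "" for t in el.findall(f".//{text_tag}", NS))
        changes.append({
            "kind": kind,
            "author": el.get(f"{{{W}}}author"),
            "date": el.get(f"{{{W}}}date"),
            "text": text,
        })
    return changes


for change in tracked_changes(build_sample_docx()):
    print(change)
```

The sketch ignores move tracking, formatting revisions, and changes that span paragraphs, which is where most of the real complexity sits.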

What Gets Lost

Feature | Direct XML Parsing | After PDF Conversion
--- | --- | ---
Track changes | Author, date, and change type preserved | Completely lost; only the accepted or rejected view remains
Merged cells | Structural colspan/rowspan from XML | Flattened or mangled by layout detection
Comments | Author-attributed, threaded, with anchors | Dropped entirely in PDF rendering
Headers / footers | Per-section, first/even/odd distinct | Mixed into body text or ignored
Text boxes | Position, content, and z-order preserved | Lost or garbled by spatial analysis
Metadata | Title, author, dates, keywords, revision count | Stripped or partially preserved
Footnotes | Numbered, linked to reference points | Detached from context, misattributed
Hyperlinks | URL targets from relationship XML | Visual blue text; URL lost

Every row in this table represents real-world data that legal, compliance, and audit teams rely on. Track changes in a contract are not cosmetic: they are the negotiation history. Losing them is not a quality tradeoff; it is data destruction.
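The hyperlink row is a good illustration of how the information is stored. In a DOCX, the link text sits in word/document.xml and carries only a relationship id; the URL itself is in the part's relationship file (word/_rels/document.xml.rels). The sketch below resolves one against the other using inline sample XML. The sample URL, the `rId5` id, and the function name are placeholders for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

# Stand-in for word/document.xml: the hyperlink only carries r:id.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body><w:p>
<w:hyperlink r:id="rId5"><w:r><w:t>our terms</w:t></w:r></w:hyperlink>
</w:p></w:body></w:document>"""

# Stand-in for word/_rels/document.xml.rels: the id maps to the real target.
RELS_XML = f"""<Relationships xmlns="{PKG_REL}">
<Relationship Id="rId5" Type="{R}/hyperlink"
              Target="https://example.com/terms" TargetMode="External"/>
</Relationships>"""


def hyperlinks(document_xml: str, rels_xml: str) -> list[tuple[str, str]]:
    """Pair each hyperlink's visible text with its resolved URL."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL}}}Relationship")
    }
    links = []
    for link in ET.fromstring(document_xml).iter(f"{{{W}}}hyperlink"):
        rid = link.get(f"{{{R}}}id")
        if rid in targets:  # internal bookmark links carry no r:id and are skipped
            text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
            links.append((text, targets[rid]))
    return links


print(hyperlinks(DOCUMENT_XML, RELS_XML))
```

After PDF conversion, a parser working from page layout alone has only the styled text to go on, which is why this mapping matters.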

Why Nobody Did This Before

The OOXML specification is over 5,000 pages. Track changes alone involve multiple XML namespaces with move tracking and range markers that span paragraph boundaries. Merged cells in DOCX tables use a different model than HTML colspan: a horizontal merge is a gridSpan count in the cell's properties, a vertical merge is a vMerge marker repeated in each cell down the column, and both are resolved against a separate grid definition for the table. Numbered lists resolve through a chain of abstract numbering definitions, numbering instances, and style overrides.
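As a rough sketch of the merged-cell model, the code below walks a small sample table and converts gridSpan and vMerge into HTML-style colspan and rowspan values. The table content and the `resolve_merges` name are invented for illustration, and the sketch deliberately skips cases such as gridBefore/gridAfter and nested tables that a real parser would have to handle.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# "Region" spans two rows (vMerge); "Sales" spans two grid columns (gridSpan).
TABLE_XML = f"""<w:tbl xmlns:w="{W}">
<w:tblGrid><w:gridCol/><w:gridCol/><w:gridCol/></w:tblGrid>
<w:tr>
  <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p><w:r><w:t>Region</w:t></w:r></w:p></w:tc>
  <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p><w:r><w:t>Sales</w:t></w:r></w:p></w:tc>
</w:tr>
<w:tr>
  <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
  <w:tc><w:p><w:r><w:t>Q1</w:t></w:r></w:p></w:tc>
  <w:tc><w:p><w:r><w:t>Q2</w:t></w:r></w:p></w:tc>
</w:tr>
</w:tbl>"""


def resolve_merges(table_xml: str) -> list[dict]:
    """Turn gridSpan/vMerge markup into cells with colspan and rowspan."""
    val = f"{{{W}}}val"
    cells = []
    open_by_col = {}  # grid column -> cell whose vertical merge is still open
    for row_idx, tr in enumerate(ET.fromstring(table_xml).findall("w:tr", NS)):
        col = 0
        for tc in tr.findall("w:tc", NS):
            span_el = tc.find("w:tcPr/w:gridSpan", NS)
            colspan = int(span_el.get(val)) if span_el is not None else 1
            vmerge = tc.find("w:tcPr/w:vMerge", NS)
            # A vMerge element without w:val continues the merge started above.
            continues = vmerge is not None and vmerge.get(val, "continue") == "continue"
            if continues and col in open_by_col:
                open_by_col[col]["rowspan"] += 1
            else:
                text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
                cell = {"row": row_idx, "col": col, "colspan": colspan,
                        "rowspan": 1, "text": text}
                cells.append(cell)
                if vmerge is not None:
                    open_by_col[col] = cell   # a "restart" opens a new vertical merge
                else:
                    open_by_col.pop(col, None)
            col += colspan  # advance by grid columns, not by cell count
    return cells


for cell in resolve_merges(TABLE_XML):
    print(cell)
```

Note that the continuation cell in the second row produces no output cell of its own; it only extends the rowspan of the cell above it.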

Nobody wanted to write a correct parser from scratch. The rational economic choice was: convert to PDF (which LibreOffice already handles) and parse the PDF (which has a mature ecosystem). The structured data loss was the price of tractability.

What Changed

AILANG made it tractable. AI agents write the extraction logic from the ECMA-376 spec, and Z3 formal verification catches edge cases that manual testing would miss. The result: 63 verified contracts across 31 modules, covering filter bounds, structural invariants, and 1:1 mapper preservation.

The moat is not the parser itself — it is the parser-building system. When a new OOXML edge case appears, an AI agent reads the relevant spec section, writes the extraction logic, and Z3 proves it correct for all inputs. The gap between "discovered" and "fixed" collapses from weeks to hours.

The Numbers

OfficeDocBench is the first structural benchmark specifically for Office document parsing — testing track changes, merged cells, comments, headers/footers, text boxes, images, and metadata across 69 test files in 11 formats.

How parsers compare:

Coverage-adjusted composite (penalizes tools that skip formats):

Parser | Coverage-adjusted composite
--- | ---
AILANG Parse | 93.9%
Kreuzberg | 68.0%
Raw OOXML | 52.6%
Pandoc | 48.2%
Docling | 38.5%
Unstructured | 38.7%
MarkItDown | 51.2%

The gap comes almost entirely from track changes, merged cells, comments, and the long tail of formats most tools skip — structural features that don't survive PDF conversion.

These scores are also our development targets: the AI agent optimizes against OfficeDocBench. When you see 93.9%, you're seeing eval-driven development, not cherry-picked benchmarks.

How it runs: Zero AI calls, zero network requests, zero per-page billing. Deterministic output: the same input always produces the same blocks.

Help us raise the bar. Submit documents that challenge our parser to docparse@sunholo.com. If they improve our eval corpus, you get 1 month of Business tier free.

Works Alongside Your Existing Pipeline

AILANG Parse reads Office formats directly from XML. For PDFs, scanned documents, and images, tools like Unstructured, Docling, and LlamaParse have mature OCR pipelines and layout analysis.

The typical approach elsewhere is to convert to PDF (losing structure), use basic text extractors (missing track changes), or skip Office formats entirely. AILANG Parse handles Office formats with full structural fidelity — use it standalone or alongside your existing PDF pipeline.

With Unstructured

Route DOCX/PPTX/XLSX through AILANG Parse for structural extraction. Route PDFs through Unstructured's layout analysis. Merge into a unified block stream.
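A minimal sketch of that routing step follows. The two parser functions are stubs standing in for the real tools, and the dictionary-shaped blocks are an assumed placeholder format; neither reflects the actual API of AILANG Parse or Unstructured.

```python
from pathlib import Path
from typing import Callable, Iterable

Block = dict  # placeholder block type for this sketch
Parser = Callable[[Path], list]

OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def parse_office_stub(path: Path) -> list:
    """Stand-in for a direct Office XML parser (hypothetical)."""
    return [{"source": path.name, "parser": "office-xml", "type": "paragraph"}]


def parse_pdf_stub(path: Path) -> list:
    """Stand-in for a PDF layout-analysis parser (hypothetical)."""
    return [{"source": path.name, "parser": "pdf-layout", "type": "paragraph"}]


def route(paths: Iterable[str],
          office_parser: Parser = parse_office_stub,
          pdf_parser: Parser = parse_pdf_stub) -> list:
    """Send each file to the matching parser and merge results into one stream."""
    blocks = []
    for path in map(Path, paths):
        suffix = path.suffix.lower()
        if suffix in OFFICE_SUFFIXES:
            blocks.extend(office_parser(path))
        elif suffix == ".pdf":
            blocks.extend(pdf_parser(path))
        else:
            raise ValueError(f"no parser registered for {path.name}")
    return blocks


stream = route(["contract.DOCX", "appendix.pdf", "budget.xlsx"])
for block in stream:
    print(block["source"], "->", block["parser"])
```

Keeping the parsers as injectable callables means the same routing code works whichever PDF tool sits on the other side, as long as both emit a common block shape.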

With Docling

Use AILANG Parse for Office formats where Docling converts to PDF internally. Use Docling for scientific papers and PDF-native documents where its table detection excels.

With LlamaParse

AILANG Parse handles deterministic Office parsing locally — no API calls. Send only PDFs and images to LlamaParse's cloud API to reduce cost and latency.

Standalone

AILANG Parse includes its own AI-backed PDF parsing, which can run on the model backend of your choice (Gemini, Claude, Ollama). One tool, every format, pluggable AI.

See for Yourself

Upload a DOCX with track changes, or an XLSX with merged cells, or a PPTX with speaker notes. Compare the output to what your current parser returns. The structural data is either there or it is not.

Frequently Asked Questions

Why does converting DOCX to PDF lose data?

A DOCX is structured XML. Converting to PDF flattens it into a visual rendering, destroying track changes, merge attributes, and comments. AILANG Parse reads the XML directly.

How should I parse Office documents for a RAG pipeline?

AILANG Parse extracts a typed Block ADT that preserves heading hierarchy, table structure, and metadata — producing cleaner chunks than PDF-first parsers that flatten everything into paragraphs.

Is DOCX or PDF better for AI ingestion?

DOCX is dramatically better when the source was authored in Word or Google Docs. AILANG Parse reads DOCX XML directly, with no AI calls and no per-page billing, while PDF conversion pipelines have to re-run layout analysis or OCR on every page and still lose structural data such as merged cells, track changes, and comments.

What problems does PDF conversion cause for LLM document processing?

PDF conversion flattens tables, strips revision metadata, and makes reading order unreliable. AILANG Parse eliminates all three by reading Office XML directly.

What is structural document extraction?

Preserving semantic structure — headings, table cells, merged cells, tracked changes — not just flat text. AILANG Parse produces a typed Block ADT where each element is a distinct structured type.
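To show what "each element is a distinct structured type" can look like in practice, here is a small sketch of a block type family written as Python dataclasses, plus a heading-based chunker of the kind a RAG pipeline might use. The type names, fields, and sample content are assumptions made for illustration; they are not AILANG Parse's actual Block ADT.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Heading:
    level: int
    text: str


@dataclass(frozen=True)
class Paragraph:
    text: str


@dataclass(frozen=True)
class TableCell:
    text: str
    colspan: int = 1
    rowspan: int = 1


@dataclass(frozen=True)
class Table:
    rows: tuple  # tuple of rows, each a tuple of TableCell


@dataclass(frozen=True)
class TrackedChange:
    kind: str  # "insert" or "delete"
    author: str
    date: str
    text: str


# A block is exactly one of these types, never untyped text.
Block = Union[Heading, Paragraph, Table, TrackedChange]


def chunk_by_heading(blocks: list) -> list:
    """Group blocks under the nearest preceding heading."""
    chunks = []
    current = None
    for block in blocks:
        if isinstance(block, Heading):
            current = {"heading": block.text, "blocks": []}
            chunks.append(current)
        else:
            if current is None:  # content that appears before any heading
                current = {"heading": None, "blocks": []}
                chunks.append(current)
            current["blocks"].append(block)
    return chunks


doc = [
    Heading(1, "Payment Terms"),
    Paragraph("Payment is due in 45 days."),
    TrackedChange("delete", "Alice", "2024-03-01T10:00:00Z", "30"),
    Heading(1, "Pricing"),
    Table(rows=((TableCell("Region", rowspan=2), TableCell("Sales", colspan=2)),)),
]
for chunk in chunk_by_heading(doc):
    print(chunk["heading"], [type(b).__name__ for b in chunk["blocks"]])
```

Because tables and tracked changes arrive as their own types, a downstream consumer can decide how to serialize each one instead of guessing structure back out of flat paragraphs.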