Universal Document Parsing

Stop parsing photos of your documents.

Your DOCX is already structured XML. AILANG reads it directly — full structure preserved, no AI calls, no per-page bill.

Parse a document → View benchmarks

Free, in your browser — no account required

Everyone else

DOCX structured XML

→

PDF

track changes · merged cells · comments

→

ML rebuild 2–5 sec

→

Flat text

53–73% OfficeDocBench

AILANG Parse

DOCX structured XML

→

Direct XML instant

→

Block ADT

track changes · merged cells · comments

→

Your format

JSON · Markdown · HTML · Quarto · +5

93.9% OfficeDocBench — eval-driven

The Problem

PDF-first parsers destroy what's already there

Converting DOCX to PDF is like photographing a spreadsheet. Track changes, merged cells, comments — all gone. Then ML spends seconds reconstructing what the XML already had.

HeadingBlock

TextBlock

TableBlock

ChangeBlock

ImageBlock

SectionBlock

Only AILANG Parse

Track changes — preserved, not flattened

See who changed what and when. Author, timestamp, original vs revised text. The audit trail that vanishes when you convert to PDF.

0 of 5 other parsers tested extract track changes. AILANG Parse gets 3/3.

Payment terms: net-30 → net-60 days from invoice date. (Alice Chen · Mar 15)

Delivery deadline: Q2 2026 → Q3 2026 (Bob Martinez · Mar 16)

Liability cap: $500,000 → $1,000,000 (Alice Chen · Mar 17)

Only AILANG Parse

Merged cells — structural, not flattened

colspan and rowspan preserved as typed metadata. Other parsers atomize merged cells into individual elements, breaking table structure.

Others: Flattened

Region		Q1	Q2
EMEA	UK	$125K	$140K
	DE	$98K	$112K

→

AILANG Parse: Structural

Region colspan:2		Q1	Q2
EMEA rowspan:2	UK	$125K	$140K
EMEA rowspan:2	DE	$98K	$112K

OfficeDocBench

69 files. 11 formats. Open-source.

Same files, same metrics. Coverage-adjusted scores penalise tools that skip formats — only AILANG Parse handles all 69 files.

AILANG Parse: 93.9%
Kreuzberg: 68.0%
Raw OOXML: 52.6%
Pandoc: 48.2%
Docling: 38.5%
Unstructured: 38.7%
MarkItDown: 51.2%

Full methodology and results →

Development Model

Benchmarks first, code second

Traditional parsers are hand-coded then measured. We define structural tests with hand-verified ground truth, then AI writes code to pass them. High scores are a design outcome, not a coincidence.

How eval-driven development works →

Define ground truth

28 structural tests
hand-verified outputs

→

AI writes parser code

AILANG modules
targeting each test

→

Run benchmarks

tables, track changes
merged cells, comments

93.9%

Composite score

Privacy by Default

Office files stay in your browser

DOCX, PPTX, XLSX and other Office formats are parsed entirely client-side via WebAssembly or locally via the CLI. No upload, no server, zero bytes sent.

PDFs and images require AI — your key, your provider, your choice. The API processes server-side but never stores your files. Privacy policy →

Try it now ↓

100% Local Browser & CLI

.docx .pptx .xlsx

→

WASM / CLI

→

Block ADT

Zero bytes leave your machine. No server, no upload, no AI needed.

Encrypted API — all formats

.pdf .docx .png

→

API Server

→

Block ADT

Processed in memory, never stored. Office formats don't use AI even via API.

Universal

15 formats in, one structure out

Every format produces the same typed Block ADT. Office is deterministic. PDFs use any AI — Gemini, Claude, OpenAI, or Ollama locally.

.docx

.pptx

.xlsx

.odt

.html

.md

.csv

.tex

.pdf

.png

TextBlock

HeadingBlock

TableBlock

ImageBlock

ChangeBlock

93.9%

OfficeDocBench

~100×

Cheaper

Input formats

Dependency

Try It Now in your browser


All parsing happens locally via WebAssembly. Your files never leave the browser.


Local: DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, MD, CSV, EPUB, EML, TEX, RTF

AI key: PDF*, PNG*, JPG*, MP4*, MP3*, WAV*

AI Settings

Unlocks PDF, image, audio, and video parsing. Also describes embedded images in DOCX/PPTX files.

Google API Key Model

Stored in localStorage only — never sent to our servers. Get a free key at aistudio.google.com.


Need to parse more than one file?

Open the Workbench for a multi-file library, persistent insights, and CLI / SDK snippet copy — same WASM engine, more elbow room.

Open Workbench →

Try it

Parse documents in your browser. No signup, no install.

Open Workbench

Use the API

1,000 requests/month free. Python, JS, Go SDKs.

Get API Key

Install locally

Unlimited local parsing. Only requires AILANG.

Install CLI

Free

€0

1,000 req/mo · 50 AI parses

Get API Key

Pro

€29/mo

100,000 req/mo · 500 AI parses

Start Pro

Business

€99/mo

500,000 req/mo · 2,000 AI parses

Start Business

Custom

Let’s talk

Higher volumes · on-prem · SLAs

Get in touch

Per document, not per page

Each request parses an entire document: a 200-page DOCX costs the same as a 1-page DOCX.
Pro: ~100× cheaper. At €29 for 100,000 requests, 1,000 docs cost about €0.29, while per-page APIs charge $25–$250 for the same 1,000 docs (at 25 pages each). No per-page fees. See full pricing →
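The per-document versus per-page comparison can be checked with a few lines of arithmetic. The sketch below uses only figures quoted on this page (Pro at €29 for 100,000 requests, per-page rates of $0.001 to $0.01, 25 pages per document). The function names are illustrative, and currencies are left unconverted.

```python
# Figures taken from the pricing text on this page; EUR and USD are not converted.
PRO_MONTHLY_EUR = 29.0
PRO_REQUESTS_PER_MONTH = 100_000
PER_PAGE_RATES_USD = (0.001, 0.01)  # low and high per-page rates quoted on this page


def per_document_cost(docs: int) -> float:
    """Cost in EUR of parsing `docs` documents at the effective Pro per-request rate."""
    return docs * PRO_MONTHLY_EUR / PRO_REQUESTS_PER_MONTH


def per_page_cost(docs: int, pages_per_doc: int, rate_per_page: float) -> float:
    """Cost in USD of parsing the same documents on a per-page API."""
    return docs * pages_per_doc * rate_per_page


flat = per_document_cost(1_000)
low, high = (per_page_cost(1_000, 25, rate) for rate in PER_PAGE_RATES_USD)
print(f"per document: EUR {flat:.2f}")
print(f"per page:     USD {low:.2f} to {high:.2f}")
```

Because the per-document price does not depend on page count, the gap widens as documents get longer.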

Pro tip: Keep source Office formats (DOCX, PPTX, XLSX) instead of converting to PDF. Office parsing is deterministic, instant, and doesn’t count against your AI quota. A 5 MB DOCX typically becomes a 10–30 MB PDF due to font embedding — costing more, taking longer, and using AI when none is needed. Send the original and save your AI parses for scanned documents and images.

CLI, WASM, and bring-your-own-key are free and unlimited. Pricing is for the hosted API only. Full pricing details →

How Document Parsers Work

	AILANG Parse	PDF-first parsers	Wrapper libraries
Method	Direct XML extraction	Convert to PDF → ML reconstruction	mammoth / pandas text extraction
Track changes	Full (author, date, type)	Lost in PDF conversion	Not extracted
Merged cells	Structural (colspan/rowspan)	Flattened	Flattened
Comments	Author-attributed	Dropped	Dropped
Dependencies	AILANG only	Python + ML libs	Python + wrappers
Browser / WASM	Yes	No	No
Privacy	Client-side only	Server required	Server required
OfficeDocBench	93.9%	61–79%	71–84%
Cost per DOCX (25pg avg)	€0.00029 (per doc)	$0.025–$0.25 (25 pages at per-page rates)	Free (self-hosted)
Pricing model	Per document (any page count)	Per page ($0.001–$0.01/pg)	Open source

PDF-first: Unstructured, Docling, LlamaParse. Wrappers: MarkItDown.
Full benchmark methodology and results →

Supported Formats 15+ formats

Office formats parsed deterministically from XML. Plain text formats parsed natively. For PDFs, images, audio, and video: bring any AI provider — Gemini, Claude, OpenAI, Ollama (fully local), even LlamaParse or Unstructured. Structured output regardless of provider. No vendor lock-in.

.csv / .tsv

Native

.md

Native

.qmd

Generate only

Gemini · Claude · OpenAI · Ollama

Learn more →

.png / .jpg

AI-powered

.wav / .mp3

AI-powered

.mp4 / .mov

AI-powered

Deterministic (XML parsing) · Native (text parsing) · AI-powered (any provider) · Generate only (output format)

Structural Extraction Our moat

Most parsers convert Office files to PDF first, losing structure. AILANG Parse reads the XML directly. These features require direct XML access.

Track Changes

Insert, delete, and move blocks with author attribution and timestamps. Structured ChangeBlock data, not rendered accept/reject text.

Exclusive

Merged Cells

Table headers, rows, and cells with colspan/rowspan merge info preserved. Not atomized into individual cell elements.

Exclusive

Headers & Footers

Extracted as semantic SectionBlocks with typed sub-blocks. Not flattened into body text.

Better

Text Boxes & Shapes

Content extracted from DrawingML and VML shapes, including legacy VML images that other parsers silently drop.

Exclusive

Embedded Images

Images detected with base64 data and MIME type. Optional AI descriptions via Gemini, Claude, or Ollama multimodal.

Better

Guaranteed Correctness

No silent data loss. 63 structural contracts guarantee every block, merged cell, and track change from the source document appears in the output. Filter bounds, 1:1 mapper preservation, and structural invariants — verified at compile time.

Unique

What You Get real output from real documents

Every format produces the same structured Block ADT — track changes, merged cells, comments, and more preserved as typed data, not flattened text.

track_changes_move.docx Exclusive Moved paragraphs preserved with author & timestamp

Author: Jesse Rosenthal Created: 2016-04-16 Changes: 2 tracked

Here is some text.

Here is the text to be moved.

↘ Moved here · Jesse Rosenthal · 2016-04-16

Here is the text to be moved.

Here is some more text.

↖ Moved from here · Jesse Rosenthal · 2016-04-16

Here is the text to be moved.

View JSON

{
  "type": "change",
  "changeType": "move-to",
  "author": "Jesse Rosenthal",
  "date": "2016-04-16T08:20:00Z",
  "text": "Here is the text to be moved."
}
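To show how a consumer might work with this output, here is a minimal sketch that loads the change block above into a typed record. The `ChangeBlock` dataclass and `parse_change` helper are hypothetical names for illustration, not part of any published SDK. Only the five JSON fields shown in the sample are assumed to exist.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeBlock:
    """Hypothetical typed view of one "change" block; not an official SDK class."""
    change_type: str
    author: str
    date: datetime
    text: str


def parse_change(raw: str) -> ChangeBlock:
    """Turn one JSON object shaped like the sample above into a ChangeBlock."""
    data = json.loads(raw)
    if data.get("type") != "change":
        raise ValueError(f"expected a change block, got {data.get('type')!r}")
    return ChangeBlock(
        change_type=data["changeType"],
        author=data["author"],
        # Swap the trailing "Z" so fromisoformat also works before Python 3.11.
        date=datetime.fromisoformat(data["date"].replace("Z", "+00:00")),
        text=data["text"],
    )


sample = """{
  "type": "change",
  "changeType": "move-to",
  "author": "Jesse Rosenthal",
  "date": "2016-04-16T08:20:00Z",
  "text": "Here is the text to be moved."
}"""

block = parse_change(sample)
print(block.author, block.date.isoformat(), block.change_type)
```

Keeping the timestamp as a timezone-aware `datetime` lets an audit-trail view sort changes chronologically without reparsing strings.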

merged_cells.docx Exclusive Column spans & merged cells preserved structurally

Author: Shay Hill Created: 2023-01-23

0-0	0-12 colspan: 2		0-3
12-0	1-1	1-2	1-3
merged ↑	2-1	2-2	2-3
3-0	34-123 colspan: 3
4-0	merged ↑ · colspan: 3

View JSON

{
  "type": "table",
  "headers": ["0-0", { "text": "0-12", "colSpan": 2 }, "0-3"],
  "rows": [
    ["12-0", "1-1", "1-2", "1-3"],
    [{ "colSpan": 1, "merged": true }, "2-1", "2-2", "2-3"],
    ["3-0", { "text": "34-123", "colSpan": 3 }]
  ]
}
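One way to use the span metadata is to check that every row covers the same number of columns. The sketch below assumes the cell shapes visible in the sample (a plain string, or an object with optional `text`, `colSpan` and `merged` keys). The helper names are illustrative and not part of the parser.

```python
from typing import Union

Cell = Union[str, dict]


def cell_span(cell: Cell) -> int:
    """Columns occupied by a cell: plain strings take one, dicts may carry "colSpan"."""
    if isinstance(cell, dict):
        return int(cell.get("colSpan", 1))
    return 1


def cell_text(cell: Cell) -> str:
    """Visible text of a cell; merged continuation cells in the sample have none."""
    if isinstance(cell, dict):
        return cell.get("text", "")
    return cell


def expand_row(row: list) -> list:
    """Repeat each cell's text across the columns it spans, giving a rectangular row."""
    out = []
    for cell in row:
        out.extend([cell_text(cell)] * cell_span(cell))
    return out


table = {
    "type": "table",
    "headers": ["0-0", {"text": "0-12", "colSpan": 2}, "0-3"],
    "rows": [
        ["12-0", "1-1", "1-2", "1-3"],
        [{"colSpan": 1, "merged": True}, "2-1", "2-2", "2-3"],
        ["3-0", {"text": "34-123", "colSpan": 3}],
    ],
}

grid = [expand_row(table["headers"])] + [expand_row(r) for r in table["rows"]]
for line in grid:
    print(line)
```

Every expanded row here is four columns wide. That consistency check is only possible because the spans survive as data.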

tables.docx Structural Headers, rows, and multi-paragraph cells

A table, with and without a header row

Name	Game	Fame	Blame
Lebron James	Basketball	Very High	Leaving Cleveland
Ryan Braun	Baseball	Moderate	Steroids
Russell Wilson	Football	High	Tacky uniform

Sinple	Table
Without	Header

Simple Multiparagraph	Table Full
Of Paragraphs	In each Cell.

View JSON

{
  "type": "table",
  "headers": ["Name", "Game", "Fame", "Blame"],
  "rows": [
    ["Lebron James", "Basketball", "Very High", "Leaving Cleveland"],
    ["Ryan Braun", "Baseball", "Moderate", "Steroids"],
    ["Russell Wilson", "Football", "High", "Tacky uniform"]
  ]
}

comments.docx Exclusive Comments preserved with author attribution

Author: Jesse Rosenthal Created: 2016-05-09 Comments: 5

I want some text to have a comment on it.

This is a new paragraph.

And so is this.

One more. And this is one with a comment in a comment.

Jesse Rosenthal

I left a comment.

Jesse Rosenthal

A comment across paragraphs.

Jesse Rosenthal

This one has multiple paragraphs. See?

Jesse Rosenthal

Do something.

Jesse Rosenthal

Do something else.

View JSON

{
  "type": "section",
  "kind": "comment",
  "blocks": [
    { "text": "[Jesse Rosenthal] I left a comment." }
  ]
}
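In the sample above, the comment author appears as a bracketed prefix on the block text. Assuming that shape holds (it is inferred from this one sample, not from a documented schema), a consumer could split author from body as sketched here. `split_comment` is a hypothetical helper.

```python
import re

# Assumed text shape, taken from the sample above: "[Author] comment body".
COMMENT_RE = re.compile(r"^\[(?P<author>[^\]]+)\]\s*(?P<body>.*)$", re.DOTALL)


def split_comment(text: str) -> tuple:
    """Return (author, body); author is None when the text has no "[...]" prefix."""
    match = COMMENT_RE.match(text)
    if match is None:
        return None, text
    return match.group("author"), match.group("body")


section = {
    "type": "section",
    "kind": "comment",
    "blocks": [{"text": "[Jesse Rosenthal] I left a comment."}],
}

for block in section["blocks"]:
    print(split_comment(block["text"]))
```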

interview.mp3 AI-powered Transcription with speaker detection & language ID

interview.mp3

audio/mp3

2 speakers Language: en 4 blocks extracted

“Welcome to the show. Today we're discussing document parsing and why structure matters more than text extraction...”

Summary: Discussion about document parsing approaches

View JSON

{
  "type": "audio",
  "transcription": "Welcome to the show. Today we're discussing...",
  "mime": "audio/mp3"
}

tutorial.mp4 AI-powered Visual + audio extraction with structured content blocks

tutorial.mp4

video/mp4 · 4 blocks extracted

Technical tutorial showing a spreadsheet with quarterly sales data, followed by a presenter explaining trends

Q1 Sales Review

Region	Revenue	Growth
North America	$2.4M	+12%
EMEA	$1.8M	+8%

“As you can see from the data, North America led growth this quarter...”

View JSON

{
  "type": "video",
  "description": "Technical tutorial showing a spreadsheet...",
  "mime": "video/mp4"
}

Real output from actual test documents. Track changes, merged cells, and comments are preserved structurally — not flattened to text. Expand “View JSON” on any tab to see the raw Block ADT.

Benchmarks OfficeDocBench — eval-driven

93.9%

AILANG Parse composite

100%

Format coverage

68.0%

Kreuzberg (adjusted)

52.6%

Raw OOXML (adjusted)

48.2%

Pandoc (adjusted)

Metric	AILANG Parse v0.3.0	Raw OOXML	Pandoc v3.9	Kreuzberg v4.7	MarkItDown v0.1.5
Composite score	93.9%	84.4%	74.0%	71.1%	67.9%
Coverage-adjusted	93.9%	52.6%	48.2%	68.0%	51.2%
Format coverage	100%	62%	65%	96%	35%
Track changes	3/3	2/3	3/3	0/3	—
Comments	2/2	2/2	0/2	0/2	—
Headers & footers	3/3	2/2	0/3	2/3	—
Text boxes	2/2	1/2	0/2	0/2	—
Equations (§22.1)	1/1	0/1	0/1	0/1	—
Formats supported	10	3	5	9	5
Runtime dependencies	AILANG only	Python stdlib	Pandoc binary	Python + libs	Python + libs

Eval-driven — open-source, reproducible
AILANG Parse uses these benchmarks as eval targets during development — the AI writes code specifically to pass them. High scores are a design feature, not a coincidence. 69 test files (11 formats, 17 features) sourced from Pandoc, Apache POI, LibreOffice, Unstructured, python-pptx, and Project Gutenberg — not cherry-picked. Ground truth, 7-metric scoring code, and all 8 adapter implementations are published.
Run date: 2026-04-07. Coverage-adjusted = composite × (files parsed / total files) — AILANG Parse is the only parser with 100% coverage.
View on GitHub · How eval-driven development works
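The coverage adjustment stated above is a single multiplication, sketched here. The function name is illustrative. The first call uses AILANG Parse's published figures (93.9% on all 69 files), and the second uses made-up numbers for a hypothetical tool.

```python
def coverage_adjusted(composite: float, files_parsed: int, total_files: int) -> float:
    """Composite score scaled by the fraction of benchmark files the tool parsed."""
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    if not 0 <= files_parsed <= total_files:
        raise ValueError("files_parsed must be between 0 and total_files")
    return composite * (files_parsed / total_files)


# Full coverage leaves the composite unchanged (AILANG Parse: 93.9% on all 69 files).
full = coverage_adjusted(93.9, 69, 69)

# Hypothetical tool: 80.0% composite, but only half of a 100-file corpus parsed.
partial = coverage_adjusted(80.0, 50, 100)

print(f"{full:.1f}% {partial:.1f}%")
```

A tool that scores well on the files it accepts but skips many formats is pulled down in proportion to what it skipped.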

Found a document we don't parse well? Submit it to docparse@sunholo.com. If it improves our eval corpus, you get 1 month of Business tier free (€99 value). Details →

How It Works & CLI Usage

A single pipeline handles all formats. The format router detects the file type, selects the appropriate parser, and produces a unified Block ADT.

Document

.docx .pdf .epub ...

→

Format Router

Extension detection

→

Parser

16 AILANG modules

→

Block ADT

9 typed variants

→

Output

JSON + Markdown

TextBlock HeadingBlock TableBlock ImageBlock ListBlock SectionBlock ChangeBlock AudioBlock VideoBlock
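The routing step can be pictured as a lookup from file extension to parser family. This is a minimal sketch under that assumption. The extension groups follow the categories named on this page, but the table and the `route` function are illustrative, not the parser's actual routing code.

```python
from pathlib import Path

# Extension groups follow the categories named on this page; the mapping is
# illustrative and not the parser's real routing table.
DETERMINISTIC = {".docx", ".pptx", ".xlsx"}  # Office XML, no AI
NATIVE = {".md", ".csv", ".tsv"}  # plain-text formats
AI_POWERED = {".pdf", ".png", ".jpg", ".wav", ".mp3", ".mp4", ".mov"}


def route(filename: str) -> str:
    """Pick a parser family from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    if ext in DETERMINISTIC:
        return "deterministic"
    if ext in NATIVE:
        return "native"
    if ext in AI_POWERED:
        return "ai"
    raise ValueError(f"unsupported extension: {ext or '(none)'}")


for name in ("report.docx", "data.CSV", "invoice.pdf"):
    print(name, "->", route(name))
```

Whichever family handles the file, the result is the same set of block variants, so downstream code does not need to know which branch ran.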

Usage

One command. Every format. CLI or library.

Office Documents (no AI needed)

# Parse any document (Office formats — instant, no AI)
docparse report.docx
docparse slides.pptx
docparse data.xlsx
docparse document.odt
docparse book.epub

PDFs and Images (AI-powered)

# PDF and image extraction (auto-selects AI backend)
docparse invoice.pdf
docparse scan.png

# Choose your AI backend
docparse doc.pdf --ai gemini-2.5-flash       # Google (default)
docparse doc.pdf --ai claude-haiku-4-5       # Anthropic
docparse doc.pdf --ai granite-docling        # Local Ollama (free)

Dev Tools

docparse --check                         # Type-check all 18 modules
docparse --test                          # Run 51 inline tests
docparse --prove                         # Z3 contract verification
docparse report.docx --verify           # Runtime contract checks
docparse report.docx --describe         # AI image descriptions
docparse report.docx --summarize        # AI document summary

Parse a document. Keep the structure.

Free in your browser, no account required. 1,000 API requests/month free. CLI and WASM are unlimited.

Try AILANG Parse →
