Universal Document Parsing

Stop parsing photos of your documents.

Your DOCX is already structured XML. AILANG reads it directly — full structure preserved, no AI calls, no per-page bill.

Free, in your browser — no account required

Everyone else: DOCX (structured XML) → PDF (track changes, merged cells, comments lost) → ML rebuild (2–5 sec) → flat text. 53–73% on OfficeDocBench.
AILANG Parse: DOCX (structured XML) → direct XML read (instant) → Block ADT (track changes, merged cells, comments kept) → your format (JSON, Markdown, HTML, Quarto, +5). 93.9% on OfficeDocBench, eval-driven.
The Problem

PDF-first parsers destroy what's already there

Converting DOCX to PDF is like photographing a spreadsheet. Track changes, merged cells, comments — all gone. Then ML spends seconds reconstructing what the XML already had.

HeadingBlock
TextBlock
TableBlock
ChangeBlock
ImageBlock
SectionBlock
Only AILANG Parse

Track changes — preserved, not flattened

See who changed what and when. Author, timestamp, original vs revised text. The audit trail that vanishes when you convert to PDF.

0 of 5 other parsers tested extract track changes. AILANG Parse gets 3/3.

Payment terms: net-30 → net-60 days from invoice date. (Alice Chen · Mar 15)
Delivery deadline: Q2 2026 → Q3 2026 (Bob Martinez · Mar 16)
Liability cap: $500,000 → $1,000,000 (Alice Chen · Mar 17)
Only AILANG Parse

Merged cells — structural, not flattened

colspan and rowspan preserved as typed metadata. Other parsers atomize merged cells into individual elements, breaking table structure.

Others: Flattened
Region | Q1 | Q2
EMEA | UK | $125K | $140K
DE | $98K | $112K
AILANG Parse: Structural
Region (colspan: 2) | Q1 | Q2
EMEA (rowspan: 2) | UK | $125K | $140K
DE | $98K | $112K
OfficeDocBench

69 files. 11 formats. Open-source.

Same files, same metrics. Coverage-adjusted scores penalise tools that skip formats — only AILANG Parse handles all 69 files.

AILANG Parse: 93.9%
Kreuzberg: 68.0%
Raw OOXML: 52.6%
Pandoc: 48.2%
Docling: 38.5%
Unstructured: 38.7%
MarkItDown: 51.2%

Full methodology and results →

Development Model

Benchmarks first, code second

Traditional parsers are hand-coded then measured. We define structural tests with hand-verified ground truth, then AI writes code to pass them. High scores are a design outcome, not a coincidence.

How eval-driven development works →

01 Define ground truth: 28 structural tests, hand-verified outputs
02 AI writes parser code: AILANG modules targeting each test
03 Run benchmarks: tables, track changes, merged cells, comments
= 93.9% composite score
Privacy by Default

Office files stay in your browser

DOCX, PPTX, XLSX and other Office formats are parsed entirely client-side via WebAssembly or locally via the CLI. No upload, no server, zero bytes sent.

PDFs and images require AI — your key, your provider, your choice. The API processes server-side but never stores your files. Privacy policy →

Try it now ↓

100% Local (Browser & CLI): .docx .pptx .xlsx → WASM / CLI → Block ADT. Zero bytes leave your machine. No server, no upload, no AI needed.
Encrypted API (all formats): .pdf .docx .png → API Server → Block ADT. Processed in memory, never stored. Office formats don't use AI even via API.
Universal

14 formats in, one structure out

Every format produces the same typed Block ADT. Office is deterministic. PDFs use any AI — Gemini, Claude, OpenAI, or Ollama locally.

.docx .pptx .xlsx .odt .html .md .csv .pdf .png → AILANG Parse → TextBlock, HeadingBlock, TableBlock, ImageBlock, ChangeBlock
93.9% OfficeDocBench · ~100× cheaper · 14 input formats · 1 dependency
Try It Now in your browser
All parsing happens locally via WebAssembly. Your files never leave the browser.

Local: DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, MD, CSV, EPUB, EML
AI key: PDF*, PNG*, JPG*, MP4*, MP3*, WAV*

AI Settings

Unlocks PDF, image, audio, and video parsing. Also describes embedded images in DOCX/PPTX files.

Stored in localStorage only — never sent to our servers. Get a free key at aistudio.google.com.
Need to parse more than one file?
Open the Workbench for a multi-file library, persistent insights, and CLI / SDK snippet copy — same WASM engine, more elbow room.
Open Workbench →

1. Try it: Parse documents in your browser. No signup, no install. (Open Workbench)
2. Use the API: 1,000 requests/month free. Python, JS, Go SDKs. (Get API Key)
3. Install locally: Unlimited local parsing. Only requires AILANG. (Install CLI)
Free: €0 · 1,000 req/mo · 50 AI parses (Get API Key)
Pro: €29/mo · 100,000 req/mo · 500 AI parses (Start Pro)
Business: €99/mo · 500,000 req/mo · 2,000 AI parses (Start Business)
Custom: Let’s talk · Higher volumes · on-prem · SLAs (Get in touch)
Per document, not per page
Each request parses an entire document — a 200-page DOCX costs the same as a 1-page DOCX.
Pro: ~100× cheaper — per-page APIs charge $25–$250 for 1,000 docs (at 25 pages each). No per-page fees. See full pricing →
Pro tip: Keep source Office formats (DOCX, PPTX, XLSX) instead of converting to PDF. Office parsing is deterministic, instant, and doesn’t count against your AI quota. A 5 MB DOCX typically becomes a 10–30 MB PDF due to font embedding — costing more, taking longer, and using AI when none is needed. Send the original and save your AI parses for scanned documents and images.
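
The "~100×" comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the per-document figure (€0.00029) and the per-page range ($0.001–$0.01) quoted on this page, uses hypothetical helper names, and ignores the EUR/USD exchange rate.

```python
# Figures quoted on this page. Currencies are mixed (EUR for AILANG Parse,
# USD for per-page APIs); this sketch assumes they are roughly comparable.
COST_PER_DOC = 0.00029           # Pro tier cost per document
PER_PAGE_RATES = (0.001, 0.01)   # per-page price range for other APIs


def per_document_total(docs, cost_per_doc=COST_PER_DOC):
    """Total cost when each document is billed once, whatever its length."""
    return docs * cost_per_doc


def per_page_total(docs, pages_per_doc, rate_per_page):
    """Total cost when every page of every document is billed."""
    return docs * pages_per_doc * rate_per_page


if __name__ == "__main__":
    docs, pages = 1_000, 25
    ours = per_document_total(docs)
    for rate in PER_PAGE_RATES:
        theirs = per_page_total(docs, pages, rate)
        print(f"rate {rate}/page: {theirs:.2f} vs {ours:.2f} "
              f"({theirs / ours:.0f}x)")
```

Under these assumptions the ratio runs from about 86× at the cheapest per-page rate to about 862× at the most expensive, so "~100×" sits near the conservative end of the range.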

CLI, WASM, and bring-your-own-key are free and unlimited. Pricing is for the hosted API only. Full pricing details →


How Document Parsers Work
Feature | AILANG Parse | PDF-first parsers | Wrapper libraries
Method | Direct XML extraction | Convert to PDF → ML reconstruction | mammoth / pandas text extraction
Track changes | Full (author, date, type) | Lost in PDF conversion | Not extracted
Merged cells | Structural (colspan/rowspan) | Flattened | Flattened
Comments | Author-attributed | Dropped | Dropped
Speed | <1s (CLI: <1ms) | 2–5 seconds | ~500ms
Dependencies | AILANG only | Python + ML libs | Python + wrappers
Browser / WASM | Yes | No | No
Privacy | Client-side only | Server required | Server required
OfficeDocBench | 93.9% | 61–79% | 71–84%
Cost per DOCX (25pg avg) | €0.00029 (per doc) | $0.025–$0.25 (per page) | Free (self-hosted)
Pricing model | Per document (any page count) | Per page ($0.001–$0.01/pg) | Open source

PDF-first: Unstructured, Docling, LlamaParse. Wrappers: MarkItDown.
Full benchmark methodology and results →


Supported Formats 15+ formats

Office formats parsed deterministically from XML. Plain text formats parsed natively. For PDFs, images, audio, and video: bring any AI provider — Gemini, Claude, OpenAI, Ollama (fully local), even LlamaParse or Unstructured. Structured output regardless of provider. No vendor lock-in.

Legend: Deterministic (XML parsing) · Native (text parsing) · AI-powered (any provider) · Generate only (output format)

Structural Extraction Our moat

Most parsers convert Office files to PDF first, losing structure. AILANG Parse reads the XML directly. These features require direct XML access.

Track Changes (Exclusive): Insert, delete, and move blocks with author attribution and timestamps. Structured ChangeBlock data, not rendered accept/reject text.
Merged Cells (Exclusive): Table headers, rows, and cells with colspan/rowspan merge info preserved. Not atomized into individual cell elements.
Headers & Footers (Better): Extracted as semantic SectionBlocks with typed sub-blocks. Not flattened into body text.
Text Boxes & Shapes (Exclusive): Content extracted from DrawingML and VML shapes, including legacy VML images that other parsers silently drop.
Embedded Images (Better): Images detected with base64 data and MIME type. Optional AI descriptions via Gemini, Claude, or Ollama multimodal.
Guaranteed Correctness (Unique): No silent data loss. 63 structural contracts guarantee that every block, merged cell, and track change from the source document appears in the output. Filter bounds, 1:1 mapper preservation, and structural invariants are verified at compile time.

What You Get real output from real documents

Every format produces the same structured Block ADT — track changes, merged cells, comments, and more preserved as typed data, not flattened text.

track_changes_move.docx Exclusive Moved paragraphs preserved with author & timestamp
Author: Jesse Rosenthal Created: 2016-04-16 Changes: 2 tracked

Here is some text.

Here is the text to be moved.

↘ Moved here (Jesse Rosenthal · 2016-04-16)
Here is the text to be moved.

Here is some more text.

↖ Moved from here (Jesse Rosenthal · 2016-04-16)
Here is the text to be moved.
View JSON
{
  "type": "change",
  "changeType": "move-to",
  "author": "Jesse Rosenthal",
  "date": "2016-04-16T08:20:00Z",
  "text": "Here is the text to be moved."
}
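
As a minimal sketch of consuming this output, the Python below filters change blocks out of a list of parsed blocks and groups them by author. It assumes the parser's JSON is a list of block objects shaped like the sample above. The `changes_by_author` helper and the second sample block (including its `"type": "text"` name) are hypothetical illustrations, not part of AILANG Parse.

```python
import json
from collections import defaultdict

# First block copied from the sample above; the second is a made-up
# companion entry so the filter has something to skip.
SAMPLE = json.loads("""
[
  {"type": "change", "changeType": "move-to", "author": "Jesse Rosenthal",
   "date": "2016-04-16T08:20:00Z", "text": "Here is the text to be moved."},
  {"type": "text", "text": "Here is some text."}
]
""")


def changes_by_author(blocks):
    """Group change blocks by author as (changeType, date, text) tuples."""
    grouped = defaultdict(list)
    for block in blocks:
        if block.get("type") != "change":
            continue  # ignore every non-change block
        grouped[block.get("author", "unknown")].append(
            (block.get("changeType"), block.get("date"), block.get("text"))
        )
    return dict(grouped)


if __name__ == "__main__":
    for author, changes in changes_by_author(SAMPLE).items():
        for change_type, date, text in changes:
            print(f"{author} [{change_type}] {date}: {text}")
```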
merged_cells.docx Exclusive Column spans & merged cells preserved structurally
Author: Shay Hill Created: 2023-01-23
0-0 | 0-12 (colspan: 2) | 0-3
12-0 | 1-1 | 1-2 | 1-3
(merged ↑) | 2-1 | 2-2 | 2-3
3-0 | 34-123 (colspan: 3)
4-0 | (merged ↑ · colspan: 3)
View JSON
{
  "type": "table",
  "headers": ["0-0", { "text": "0-12", "colSpan": 2 }, "0-3"],
  "rows": [
    ["12-0", "1-1", "1-2", "1-3"],
    [{ "colSpan": 1, "merged": true }, "2-1", "2-2", "2-3"],
    ["3-0", { "text": "34-123", "colSpan": 3 }]
  ]
}
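
One way to use the span metadata is to expand each row back into a rectangular grid. The sketch below assumes the JSON shape shown above, where a cell is either a plain string or an object with optional `text`, `colSpan`, and `merged` keys. The `expand_row` helper and the `^` marker for a cell continued from the row above are hypothetical choices for illustration.

```python
# Table copied from the sample above.
TABLE = {
    "type": "table",
    "headers": ["0-0", {"text": "0-12", "colSpan": 2}, "0-3"],
    "rows": [
        ["12-0", "1-1", "1-2", "1-3"],
        [{"colSpan": 1, "merged": True}, "2-1", "2-2", "2-3"],
        ["3-0", {"text": "34-123", "colSpan": 3}],
    ],
}


def expand_row(cells):
    """Expand one row so that every grid column has exactly one entry."""
    out = []
    for cell in cells:
        if isinstance(cell, str):
            out.append(cell)
            continue
        span = cell.get("colSpan", 1)
        # Assumption: a cell flagged "merged" continues the cell above it.
        text = "^" if cell.get("merged") else cell.get("text", "")
        out.append(text)
        out.extend([""] * (span - 1))  # placeholders for spanned columns
    return out


def to_grid(table):
    """Return the header row followed by all body rows, fully expanded."""
    return [expand_row(table["headers"])] + [expand_row(r) for r in table["rows"]]


if __name__ == "__main__":
    for row in to_grid(TABLE):
        print(" | ".join(row))
```

Because the spans are typed data rather than rendered layout, every expanded row in this sample comes out four columns wide.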
tables.docx Structural Headers, rows, and multi-paragraph cells

A table, with and without a header row

Name | Game | Fame | Blame
Lebron James | Basketball | Very High | Leaving Cleveland
Ryan Braun | Baseball | Moderate | Steroids
Russell Wilson | Football | High | Tacky uniform

Sinple | Table
Without | Header
Simple

Multiparagraph
Table

Full
Of

Paragraphs
In each

Cell.
View JSON
{
  "type": "table",
  "headers": ["Name", "Game", "Fame", "Blame"],
  "rows": [
    ["Lebron James", "Basketball", "Very High", "Leaving Cleveland"],
    ["Ryan Braun", "Baseball", "Moderate", "Steroids"],
    ["Russell Wilson", "Football", "High", "Tacky uniform"]
  ]
}
comments.docx Exclusive Comments preserved with author attribution
Author: Jesse Rosenthal Created: 2016-05-09 Comments: 5

I want some text to have a comment on it.

This is a new paragraph.

And so is this.

One more. And this is one with a comment in a comment.

Jesse Rosenthal: I left a comment.
Jesse Rosenthal: A comment across paragraphs.
Jesse Rosenthal: This one has multiple paragraphs. See?
Jesse Rosenthal: Do something.
Jesse Rosenthal: Do something else.
View JSON
{
  "type": "section",
  "kind": "comment",
  "blocks": [
    { "text": "[Jesse Rosenthal] I left a comment." }
  ]
}
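
In this sample the author is carried as a bracketed prefix on the comment text. A rough sketch of splitting it back out follows. It assumes the `[Author] text` convention holds for every comment block, which the sample suggests but does not guarantee, and the helper names are hypothetical.

```python
import re

# Matches the "[Author] comment text" convention seen in the sample above.
COMMENT_RE = re.compile(r"^\[(?P<author>[^\]]+)\]\s*(?P<text>.*)$", re.DOTALL)

# Section block copied from the sample above.
SAMPLE = {
    "type": "section",
    "kind": "comment",
    "blocks": [{"text": "[Jesse Rosenthal] I left a comment."}],
}


def split_comment(raw):
    """Return (author, text); author is None when no prefix is present."""
    match = COMMENT_RE.match(raw)
    if match is None:
        return None, raw
    return match.group("author"), match.group("text")


def comments_from_section(section):
    """Extract (author, text) pairs from a section block of kind 'comment'."""
    if section.get("type") != "section" or section.get("kind") != "comment":
        return []
    return [split_comment(b.get("text", "")) for b in section.get("blocks", [])]


if __name__ == "__main__":
    for author, text in comments_from_section(SAMPLE):
        print(f"{author}: {text}")
```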
interview.mp3 AI-powered Transcription with speaker detection & language ID
interview.mp3
audio/mp3
2 speakers Language: en 4 blocks extracted
“Welcome to the show. Today we're discussing document parsing and why structure matters more than text extraction...”
Summary: Discussion about document parsing approaches
View JSON
{
  "type": "audio",
  "transcription": "Welcome to the show. Today we're discussing...",
  "mime": "audio/mp3"
}
tutorial.mp4 AI-powered Visual + audio extraction with structured content blocks
tutorial.mp4
video/mp4 · 4 blocks extracted
Technical tutorial showing a spreadsheet with quarterly sales data, followed by a presenter explaining trends

Q1 Sales Review

Region | Revenue | Growth
North America | $2.4M | +12%
EMEA | $1.8M | +8%
“As you can see from the data, North America led growth this quarter...”
View JSON
{
  "type": "video",
  "description": "Technical tutorial showing a spreadsheet...",
  "mime": "video/mp4"
}
Real output from actual test documents. Track changes, merged cells, and comments are preserved structurally — not flattened to text. Expand “View JSON” on any tab to see the raw Block ADT.
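
Since every block carries a `type` field, downstream code can dispatch on it. The following is a small sketch under that assumption, using only the field names visible in the samples above. The `summarize` function is a hypothetical helper, and real output may contain fields or block types not handled here.

```python
def summarize(block):
    """Return a one-line description based on the block's "type" field."""
    kind = block.get("type")
    if kind == "change":
        return f"{block.get('changeType')} by {block.get('author')}"
    if kind == "table":
        headers = block.get("headers", [])
        rows = block.get("rows", [])
        return f"table: {len(headers)} header cells, {len(rows)} rows"
    if kind == "section":
        return f"{block.get('kind')} section: {len(block.get('blocks', []))} block(s)"
    if kind in ("audio", "video"):
        return f"{kind} ({block.get('mime')})"
    return f"unhandled block type: {kind!r}"


# Abbreviated copies of the samples shown on this page.
BLOCKS = [
    {"type": "change", "changeType": "move-to", "author": "Jesse Rosenthal"},
    {"type": "table", "headers": ["Name", "Game", "Fame", "Blame"],
     "rows": [["Lebron James", "Basketball", "Very High", "Leaving Cleveland"]]},
    {"type": "section", "kind": "comment",
     "blocks": [{"text": "[Jesse Rosenthal] I left a comment."}]},
    {"type": "audio", "mime": "audio/mp3"},
]

if __name__ == "__main__":
    for block in BLOCKS:
        print(summarize(block))
```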

Benchmarks OfficeDocBench — eval-driven
93.9% AILANG Parse composite · 100% format coverage · 68.0% Kreuzberg (adjusted) · 52.6% Raw OOXML (adjusted) · 48.2% Pandoc (adjusted)
Metric | AILANG Parse v0.3.0 | Raw OOXML | Pandoc v3.9 | Kreuzberg v4.7 | MarkItDown v0.1.5
Composite score | 93.9% | 84.4% | 74.0% | 71.1% | 67.9%
Coverage-adjusted | 93.9% | 52.6% | 48.2% | 68.0% | 51.2%
Format coverage | 100% | 62% | 65% | 96% | 35%
Track changes 3/3 2/3 3/3 0/3
Comments 2/2 2/2 0/2 0/2
Headers & footers 3/3 2/2 0/3 2/3
Text boxes 2/2 1/2 0/2 0/2
Equations (§22.1) 1/1 0/1 0/1 0/1
Formats supported | 10 | 3 | 5 | 9 | 5
Runtime dependencies | AILANG only | Python stdlib | Pandoc binary | Python + libs | Python + libs
Eval-driven — open-source, reproducible
AILANG Parse uses these benchmarks as eval targets during development — the AI writes code specifically to pass them. High scores are a design feature, not a coincidence. 69 test files (11 formats, 17 features) sourced from Pandoc, Apache POI, LibreOffice, Unstructured, python-pptx, and Project Gutenberg — not cherry-picked. Ground truth, 7-metric scoring code, and all 8 adapter implementations are published.
Run date: 2026-04-07. Coverage-adjusted = composite × (files parsed / total files) — AILANG Parse is the only parser with 100% coverage.
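
The coverage adjustment stated above is a single multiplication. The sketch below applies it to two rows of the table. The file count for Kreuzberg (66 of 69) is an assumption inferred from its reported 96% format coverage and is not a published figure. The function name is hypothetical.

```python
def coverage_adjusted(composite, files_parsed, total_files):
    """Scale a composite score by the fraction of corpus files parsed."""
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    if not 0 <= files_parsed <= total_files:
        raise ValueError("files_parsed must be between 0 and total_files")
    return composite * files_parsed / total_files


if __name__ == "__main__":
    TOTAL = 69  # corpus size stated on this page

    # AILANG Parse: every file parsed, so the score is unchanged.
    print(round(coverage_adjusted(93.9, 69, TOTAL), 1))

    # Kreuzberg: 66 parsed files is an assumed count (about 96% of 69).
    print(round(coverage_adjusted(71.1, 66, TOTAL), 1))
```

A parser that skips files cannot keep its composite score: the adjustment lowers it in proportion to the files it did not handle.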
View on GitHub  ·  How eval-driven development works
Found a document we don't parse well? Submit it to docparse@sunholo.com. If it improves our eval corpus, you get 1 month of Business tier free (€99 value). Details →

How It Works & CLI Usage

A single pipeline handles all formats. The format router detects the file type, selects the appropriate parser, and produces a unified Block ADT.

Document (.docx .pdf .epub ...) → Format Router (extension detection) → Parser (16 AILANG modules) → Block ADT (9 typed variants) → Output (JSON + Markdown)
TextBlock HeadingBlock TableBlock ImageBlock ListBlock SectionBlock ChangeBlock AudioBlock VideoBlock
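
Extension-based routing can be pictured with a short sketch. This is not the AILANG router, only a Python illustration. The two groups mirror the "Local" and "AI key" format lists from the browser demo, and the `route` function and its return labels are hypothetical.

```python
from pathlib import Path

# Formats the page lists as parsed locally, with no AI involved.
LOCAL_EXTENSIONS = {
    "docx", "pptx", "xlsx", "odt", "odp", "ods",
    "html", "md", "csv", "epub", "eml",
}
# Formats the page lists as needing an AI provider key.
AI_EXTENSIONS = {"pdf", "png", "jpg", "mp4", "mp3", "wav"}


def route(filename):
    """Pick a parsing path from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in LOCAL_EXTENSIONS:
        return "local"
    if ext in AI_EXTENSIONS:
        return "ai"
    raise ValueError(f"unsupported format: {ext or filename}")


if __name__ == "__main__":
    for name in ("report.docx", "slides.PPTX", "invoice.pdf", "scan.png"):
        print(name, "->", route(name))
```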

Usage

One command. Every format. CLI or library.

Office Documents (no AI needed)
# Parse any document (Office formats: instant, no AI)
docparse report.docx
docparse slides.pptx
docparse data.xlsx
docparse document.odt
docparse book.epub
PDFs and Images (AI-powered)
# PDF and image extraction (auto-selects AI backend)
docparse invoice.pdf
docparse scan.png

# Choose your AI backend
docparse doc.pdf --ai gemini-2.5-flash    # Google (default)
docparse doc.pdf --ai claude-haiku-4-5    # Anthropic
docparse doc.pdf --ai granite-docling     # Local Ollama (free)
Dev Tools
docparse --check                    # Type-check all 18 modules
docparse --test                     # Run 51 inline tests
docparse --prove                    # Z3 contract verification
docparse report.docx --verify       # Runtime contract checks
docparse report.docx --describe     # AI image descriptions
docparse report.docx --summarize    # AI document summary

Parse a document.
Keep the structure.

Free in your browser, no account required. 1,000 API requests/month free. CLI and WASM are unlimited.

Try AILANG Parse →