Your DOCX is already structured XML. AILANG reads it directly — full structure preserved, no AI calls, no per-page bill.
Free, in your browser — no account required
Converting DOCX to PDF is like photographing a spreadsheet. Track changes, merged cells, comments — all gone. Then ML spends seconds reconstructing what the XML already had.
See who changed what and when. Author, timestamp, original vs revised text. The audit trail that vanishes when you convert to PDF.
0 of 5 other parsers tested extract track changes. AILANG Parse gets 3/3.
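To make the track-changes claim concrete, here is a minimal sketch of reading revision marks straight out of a DOCX using only the Python standard library. It is not AILANG Parse's implementation. The helper names (`q`, `extract_revisions`, `make_sample_docx`) and the `insert`, `delete` and `move-from` labels are illustrative assumptions. The element names `w:ins`, `w:del`, `w:moveTo` and `w:moveFrom` are standard WordprocessingML. The sample archive is stripped down to a single `word/document.xml` part, so it is enough for this parser but not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name: str) -> str:
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


# Revision elements mapped to change-type labels (labels are illustrative).
REVISION_TAGS = {
    q("ins"): "insert",
    q("del"): "delete",
    q("moveTo"): "move-to",
    q("moveFrom"): "move-from",
}


def extract_revisions(docx_bytes: bytes) -> list:
    """Return one record per tracked revision that carries text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    records = []
    for elem in root.iter():
        change_type = REVISION_TAGS.get(elem.tag)
        if change_type is None:
            continue
        # Inserted and moved text sits in w:t; deleted text sits in w:delText.
        text = "".join(
            t.text or "" for t in elem.iter() if t.tag in (q("t"), q("delText"))
        )
        if not text:
            continue  # e.g. paragraph-mark revisions, which hold no text
        records.append({
            "type": "change",
            "changeType": change_type,
            "author": elem.get(q("author")),
            "date": elem.get(q("date")),
            "text": text,
        })
    return records


def make_sample_docx() -> bytes:
    """Build an in-memory archive containing only word/document.xml."""
    document_xml = (
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        '<w:r><w:t>Here is some text. </w:t></w:r>'
        '<w:moveTo w:id="1" w:author="Jesse Rosenthal" '
        'w:date="2016-04-16T08:20:00Z">'
        '<w:r><w:t>Here is the text to be moved.</w:t></w:r>'
        '</w:moveTo>'
        '</w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()


revisions = extract_revisions(make_sample_docx())
```

Author and timestamp are plain attributes on the revision element, which is why none of this needs a model: once the file is converted to PDF, those attributes no longer exist to be recovered.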
colspan and rowspan preserved as typed metadata. Other parsers atomize merged cells into individual elements, breaking table structure.
| Region | Q1 | Q2 | |
| EMEA | UK | $125K | $140K |
| DE | $98K | $112K |
| Region colspan:2 | Q1 | Q2 | |
| EMEA rowspan:2 | UK | $125K | $140K |
| DE | $98K | $112K | |
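The spans shown above are stored explicitly in the DOCX: `w:gridSpan` holds the column span, and `w:vMerge` marks the start (`w:val="restart"`) and continuation of a vertical merge. The sketch below recovers `colspan` and `rowspan` from those two elements. It is an illustration, not AILANG Parse's code. The function names and the output dictionary shape are assumptions, and the table XML is hand-built to mirror the Region/EMEA example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name: str) -> str:
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def parse_table(tbl: ET.Element) -> list:
    """Return rows of cells, each with text, colspan and rowspan."""
    rows = []
    open_merges = {}  # grid column -> cell that is still spanning rows
    for tr in tbl.findall(q("tr")):
        row, col = [], 0
        for tc in tr.findall(q("tc")):
            props = tc.find(q("tcPr"))
            span_el = props.find(q("gridSpan")) if props is not None else None
            merge_el = props.find(q("vMerge")) if props is not None else None
            colspan = int(span_el.get(q("val"))) if span_el is not None else 1
            text = "".join(t.text or "" for t in tc.iter(q("t")))

            is_continuation = (
                merge_el is not None and merge_el.get(q("val")) != "restart"
            )
            if is_continuation:
                # Extend the cell that started the merge in this column.
                if col in open_merges:
                    open_merges[col]["rowspan"] += 1
            else:
                cell = {"text": text, "colspan": colspan, "rowspan": 1}
                row.append(cell)
                if merge_el is not None:
                    open_merges[col] = cell  # a new vertical merge starts here
                else:
                    open_merges.pop(col, None)
            col += colspan
        rows.append(row)
    return rows


def cell_xml(text: str, props: str = "") -> str:
    """Build one w:tc element for the sample table."""
    return (
        f"<w:tc><w:tcPr>{props}</w:tcPr>"
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"
    )


sample_rows = [
    cell_xml("Region", '<w:gridSpan w:val="2"/>') + cell_xml("Q1") + cell_xml("Q2"),
    cell_xml("EMEA", '<w:vMerge w:val="restart"/>')
    + cell_xml("UK") + cell_xml("$125K") + cell_xml("$140K"),
    cell_xml("", "<w:vMerge/>")
    + cell_xml("DE") + cell_xml("$98K") + cell_xml("$112K"),
]
table_xml = (
    f'<w:tbl xmlns:w="{W}">'
    + "".join(f"<w:tr>{r}</w:tr>" for r in sample_rows)
    + "</w:tbl>"
)
table = parse_table(ET.fromstring(table_xml))
```

The continuation cell in the third row never becomes its own element. It only increments the `rowspan` of the EMEA cell, so the merge survives as metadata.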
Same files, same metrics. Coverage-adjusted scores penalise tools that skip formats — only AILANG Parse handles all 69 files.
Traditional parsers are hand-coded then measured. We define structural tests with hand-verified ground truth, then AI writes code to pass them. High scores are a design outcome, not a coincidence.
DOCX, PPTX, XLSX and other Office formats are parsed entirely client-side via WebAssembly or locally via the CLI. No upload, no server, zero bytes sent.
PDFs and images require AI — your key, your provider, your choice. The API processes server-side but never stores your files. Privacy policy →
Every format produces the same typed Block ADT. Office is deterministic. PDFs use any AI — Gemini, Claude, OpenAI, or Ollama locally.
Unlocks PDF, image, audio, and video parsing. Also describes embedded images in DOCX/PPTX files.
CLI, WASM, and bring-your-own-key are free and unlimited. Pricing is for the hosted API only. Full pricing details →
| | AILANG Parse | PDF-first parsers | Wrapper libraries |
|---|---|---|---|
| Method | Direct XML extraction | Convert to PDF → ML reconstruction | mammoth / pandas text extraction |
| Track changes | Full (author, date, type) | Lost in PDF conversion | Not extracted |
| Merged cells | Structural (colspan/rowspan) | Flattened | Flattened |
| Comments | Author-attributed | Dropped | Dropped |
| Speed | <1s (CLI: <1ms) | 2–5 seconds | ~500ms |
| Dependencies | AILANG only | Python + ML libs | Python + wrappers |
| Browser / WASM | Yes | No | No |
| Privacy | Client-side only | Server required | Server required |
| OfficeDocBench | 93.9% | 61–79% | 71–84% |
| Cost per DOCX (25pg avg) | €0.00029 (per doc) | $0.025–$0.25 (per page) | Free (self-hosted) |
| Pricing model | Per document (any page count) | Per page ($0.001–$0.01/pg) | Open source |
PDF-first: Unstructured, Docling, LlamaParse. Wrappers: MarkItDown.
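The per-page range in the cost row follows directly from the per-page rates and the 25-page average. A small worked check, using only figures from the table (the variable and function names are illustrative):

```python
from decimal import Decimal

PAGES_PER_DOC = 25                                  # average from the table
PER_PAGE_USD = (Decimal("0.001"), Decimal("0.01"))  # per-page rate range
PER_DOC_EUR = Decimal("0.00029")                    # flat per-document price


def per_page_cost(pages: int, rate: Decimal) -> Decimal:
    """Cost of one document under per-page pricing."""
    return pages * rate


low = per_page_cost(PAGES_PER_DOC, PER_PAGE_USD[0])
high = per_page_cost(PAGES_PER_DOC, PER_PAGE_USD[1])
print(low, high)  # 0.025 0.25


def per_doc_cost(pages: int) -> Decimal:
    """Per-document pricing ignores the page count entirely."""
    return PER_DOC_EUR


flat_short = per_doc_cost(1)
flat_long = per_doc_cost(500)
```

The two models are quoted in different currencies, so the sketch does not compare them directly. What it shows is the shape of each model: per-page cost grows linearly with length, while the per-document price stays the same for a 1-page or a 500-page file.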
Full benchmark methodology and results →
Office formats parsed deterministically from XML. Plain text formats parsed natively. For PDFs, images, audio, and video: bring any AI provider — Gemini, Claude, OpenAI, Ollama (fully local), even LlamaParse or Unstructured. Structured output regardless of provider. No vendor lock-in.
Most parsers convert Office files to PDF first, losing structure. AILANG Parse reads the XML directly. These features require direct XML access.
Every format produces the same structured Block ADT — track changes, merged cells, comments, and more preserved as typed data, not flattened text.
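For readers consuming the JSON output, one way to model a "typed Block ADT" on the client side is a tagged union keyed on the `type` field. The sketch below is an assumption built only from the field names visible in the sample outputs on this page (`changeType`, `headers`, `rows`, `kind`, `blocks`, `transcription`, `description`, `mime`). The class names and `decode_block` are hypothetical, not part of any published AILANG Parse library, and the real ADT may have more variants.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Change:
    change_type: str
    author: str
    date: str
    text: str


@dataclass
class Table:
    headers: list
    rows: list


@dataclass
class Section:
    kind: str
    blocks: list


@dataclass
class Media:
    media_type: str  # "audio" or "video"
    mime: str
    text: str        # transcription (audio) or description (video)


Block = Union[Change, Table, Section, Media]


def decode_block(raw: dict) -> Block:
    """Turn one JSON object into a typed block, dispatching on 'type'."""
    kind = raw["type"]
    if kind == "change":
        return Change(raw["changeType"], raw["author"], raw["date"], raw["text"])
    if kind == "table":
        return Table(raw["headers"], raw["rows"])
    if kind == "section":
        return Section(raw["kind"], raw["blocks"])
    if kind in ("audio", "video"):
        text = raw.get("transcription") or raw.get("description", "")
        return Media(kind, raw["mime"], text)
    raise ValueError(f"unknown block type: {kind!r}")


audio = decode_block({
    "type": "audio",
    "transcription": "Welcome to the show.",
    "mime": "audio/mp3",
})
comment = decode_block({
    "type": "section",
    "kind": "comment",
    "blocks": [{"text": "[Jesse Rosenthal] I left a comment."}],
})
```

Because every format decodes into the same small set of variants, downstream code can branch on the block type without caring whether the source was a DOCX, a PDF, or an audio file.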
Here is some text.
Here is the text to be moved.
Here is some more text.
{
"type": "change",
"changeType": "move-to",
"author": "Jesse Rosenthal",
"date": "2016-04-16T08:20:00Z",
"text": "Here is the text to be moved."
}
| 0-0 | 0-12 colspan: 2 | 0-3 | |
|---|---|---|---|
| 12-0 | 1-1 | 1-2 | 1-3 |
| merged ↑ | 2-1 | 2-2 | 2-3 |
| 3-0 | 34-123 colspan: 3 | ||
| 4-0 | merged ↑ · colspan: 3 | ||
{
"type": "table",
"headers": ["0-0", { "text": "0-12", "colSpan": 2 }, "0-3"],
"rows": [
["12-0", "1-1", "1-2", "1-3"],
[{ "colSpan": 1, "merged": true }, "2-1", "2-2", "2-3"],
["3-0", { "text": "34-123", "colSpan": 3 }]
]
}
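A quick way to see that the span metadata keeps the table rectangular is to expand each `colSpan` back into grid columns. The sketch below does that for the JSON above. The helper names are illustrative, and it assumes a cell is either a plain string or an object with optional `text` and `colSpan` keys, as in the sample.

```python
import json

TABLE_JSON = """
{
  "type": "table",
  "headers": ["0-0", { "text": "0-12", "colSpan": 2 }, "0-3"],
  "rows": [
    ["12-0", "1-1", "1-2", "1-3"],
    [{ "colSpan": 1, "merged": true }, "2-1", "2-2", "2-3"],
    ["3-0", { "text": "34-123", "colSpan": 3 }]
  ]
}
"""


def cell_span(cell) -> int:
    """Number of grid columns a cell occupies (plain strings occupy one)."""
    return cell.get("colSpan", 1) if isinstance(cell, dict) else 1


def cell_text(cell) -> str:
    """Text of a cell; merged placeholders have none."""
    return cell.get("text", "") if isinstance(cell, dict) else cell


def expand_row(row: list) -> list:
    """Pad each spanning cell with empty strings so every row has grid width."""
    out = []
    for cell in row:
        out.append(cell_text(cell))
        out.extend([""] * (cell_span(cell) - 1))
    return out


table = json.loads(TABLE_JSON)
grid = [expand_row(table["headers"])] + [expand_row(r) for r in table["rows"]]
widths = [len(r) for r in grid]
```

Every expanded row comes out four columns wide, even though the last row holds only two cells. A parser that drops `colSpan` would emit rows of unequal length and leave the consumer to guess the alignment.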
| Name | Game | Fame | Blame |
|---|---|---|---|
| Lebron James | Basketball | Very High | Leaving Cleveland |
| Ryan Braun | Baseball | Moderate | Steroids |
| Russell Wilson | Football | High | Tacky uniform |
| Sinple | Table |
| Without | Header |
| Simple Multiparagraph | Table Full |
|---|---|
| Of Paragraphs | In each Cell. |
{
"type": "table",
"headers": ["Name", "Game", "Fame", "Blame"],
"rows": [
["Lebron James", "Basketball", "Very High", "Leaving Cleveland"],
["Ryan Braun", "Baseball", "Moderate", "Steroids"],
["Russell Wilson", "Football", "High", "Tacky uniform"]
]
}
{
"type": "section",
"kind": "comment",
"blocks": [
{ "text": "[Jesse Rosenthal] I left a comment." }
]
}
{
"type": "audio",
"transcription": "Welcome to the show. Today we're discussing...",
"mime": "audio/mp3"
}
| Region | Revenue | Growth |
|---|---|---|
| North America | $2.4M | +12% |
| EMEA | $1.8M | +8% |
{
"type": "video",
"description": "Technical tutorial showing a spreadsheet...",
"mime": "video/mp4"
}
| Metric | AILANG Parse v0.3.0 | Raw OOXML | Pandoc v3.9 | Kreuzberg v4.7 | MarkItDown v0.1.5 |
|---|---|---|---|---|---|
| Composite score | 93.9% | 84.4% | 74.0% | 71.1% | 67.9% |
| Coverage-adjusted | 93.9% | 52.6% | 48.2% | 68.0% | 51.2% |
| Format coverage | 100% | 62% | 65% | 96% | 35% |
| Track changes | 3/3 | 2/3 | 3/3 | 0/3 | — |
| Comments | 2/2 | 2/2 | 0/2 | 0/2 | — |
| Headers & footers | 3/3 | 2/2 | 0/3 | 2/3 | — |
| Text boxes | 2/2 | 1/2 | 0/2 | 0/2 | — |
| Equations (§22.1) | 1/1 | 0/1 | 0/1 | 0/1 | — |
| Formats supported | 10 | 3 | 5 | 9 | 5 |
| Runtime dependencies | AILANG only | Python stdlib | Pandoc binary | Python + libs | Python + libs |
A single pipeline handles all formats. The format router detects the file type, selects the appropriate parser, and produces a unified Block ADT.
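The routing step can be pictured as a lookup from file type to backend. The sketch below is a hypothetical illustration, not the actual router. The table of extensions is assembled from formats named on this page, the `deterministic` and `ai` labels are invented for the example, and the magic-byte checks are the standard ZIP, PDF and PNG signatures.

```python
from pathlib import Path

# Extension -> backend. Office, ODT and EPUB are parsed without AI;
# PDFs, images, audio and video need an AI provider.
ROUTES = {
    ".docx": "deterministic",
    ".pptx": "deterministic",
    ".xlsx": "deterministic",
    ".odt": "deterministic",
    ".epub": "deterministic",
    ".pdf": "ai",
    ".png": "ai",
    ".mp3": "ai",
    ".mp4": "ai",
}


def select_backend(filename: str) -> str:
    """Pick a backend from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix or filename!r}") from None


def sniff(header: bytes) -> str:
    """Identify a container from its first bytes, for misnamed files."""
    if header.startswith(b"PK\x03\x04"):
        return "zip-container"  # DOCX, PPTX, XLSX, ODT and EPUB are ZIP archives
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


backends = [select_backend(n) for n in ("report.docx", "invoice.pdf", "Scan.PNG")]
```

Checking the leading bytes as well as the extension is a common safeguard, since a renamed file would otherwise be sent to the wrong parser.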
One command. Every format. CLI or library.
# Parse any document (Office formats — instant, no AI)
docparse report.docx
docparse slides.pptx
docparse data.xlsx
docparse document.odt
docparse book.epub
# PDF and image extraction (auto-selects AI backend)
docparse invoice.pdf
docparse scan.png
# Choose your AI backend
docparse doc.pdf --ai gemini-2.5-flash # Google (default)
docparse doc.pdf --ai claude-haiku-4-5 # Anthropic
docparse doc.pdf --ai granite-docling # Local Ollama (free)
docparse --check # Type-check all 18 modules
docparse --test # Run 51 inline tests
docparse --prove # Z3 contract verification
docparse report.docx --verify # Runtime contract checks
docparse report.docx --describe # AI image descriptions
docparse report.docx --summarize # AI document summary
Free in your browser, no account required. 1,000 API requests/month free. CLI and WASM are unlimited.
Try AILANG Parse →
I want some text to have a comment on it.
This is a new paragraph.
And so is this.
One more. And this is one with a comment in a comment.