10 modules, ~2,500 lines
Structural document parsing that preserves what matters
DOCX, PPTX, and XLSX are parsed deterministically; PDFs and images use Gemini multimodal AI
DocParse extracts structured content from documents — not flat text. Headers, footers, track changes, merged cells, text boxes, and images are all preserved as typed blocks with metadata.
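To make "typed blocks with metadata" concrete, here is a minimal Python sketch of what such a block model could look like. It is not DocParse's actual schema: every class and field name below (`TextBlock`, `ChangeBlock`, `TableBlock`, `Cell`, `plain_text`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


# NOTE: all names here are hypothetical, not DocParse's real output schema.
@dataclass
class TextBlock:
    text: str
    # Where the text sat in the source: "body", "header", "footer", "textbox".
    region: str = "body"


@dataclass
class ChangeBlock:
    # A tracked revision: "insert", "delete", or "move".
    kind: str
    text: str
    author: Optional[str] = None


@dataclass
class Cell:
    text: str
    row_span: int = 1
    col_span: int = 1  # a value above 1 marks a cell merged across columns


@dataclass
class TableBlock:
    headers: List[str]
    rows: List[List[Cell]] = field(default_factory=list)


Block = Union[TextBlock, ChangeBlock, TableBlock]


def plain_text(blocks: List[Block]) -> str:
    """Flatten blocks to text, dropping deleted revisions."""
    parts: List[str] = []
    for b in blocks:
        if isinstance(b, TextBlock):
            parts.append(b.text)
        elif isinstance(b, ChangeBlock):
            if b.kind != "delete":
                parts.append(b.text)
        elif isinstance(b, TableBlock):
            parts.append(" | ".join(b.headers))
            for row in b.rows:
                parts.append(" | ".join(c.text for c in row))
    return "\n".join(parts)
```

Because each block carries its own type, a consumer can decide per block what to keep. A flat-text extractor has already made that choice, for example by silently merging or dropping deleted revisions.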
The table below summarizes a head-to-head comparison against Unstructured, a VC-funded Python document-parsing library ($65M+ raised), run on 21 test files across DOCX, PPTX, XLSX, and PDF.
| Capability | AILANG DocParse | Unstructured (open-source) |
|---|---|---|
| Track changes | Structured (insert/delete/move) | Not supported |
| Headers & footers | 6 elements (semantic) | 2 elements (flat text) |
| Text boxes / shapes | 8 elements extracted | 3 elements (partial) |
| VML images | Detected + extracted | Not detected |
| Table structure | Preserved (headers, rows, merge info) | Atomized into individual cells |
| Image extraction (Office) | Detected + optional AI descriptions | Not extracted |
AILANG captures 99.8% of Unstructured's content, while Unstructured captures only 95.3% of AILANG's. The 4.7% gap consists of track changes, headers/footers, sheet names, and VML images.
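The two percentages differ because coverage is directional: it asks how much of one parser's output also appears in the other's. The source does not state how the figures were computed, so the sketch below shows one plausible token-overlap definition in Python, not the method used for the benchmark. The function names `tokens` and `coverage` are assumptions.

```python
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercased word counts for a parser's text output."""
    return Counter(re.findall(r"\w+", text.lower()))


def coverage(reference: str, candidate: str) -> float:
    """Fraction of the reference's tokens that also appear in the candidate.

    Assumed definition for illustration only; the real eval harness may
    normalize or match content differently.
    """
    ref, cand = tokens(reference), tokens(candidate)
    total = sum(ref.values())
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count per token.
    matched = sum((ref & cand).values())
    return matched / total
```

With this definition, a parser that emits everything the other does plus extra content (say, sheet names and inserted revisions) scores 100% in one direction while the other parser scores lower in the reverse direction.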
All demos run entirely in your browser via WebAssembly. No server needed for Office formats.
The production CLI ships in the ailang-parse repo. It supports 15 formats (adding ODT/ODP/ODS, HTML, Markdown, CSV, EPUB, EML, MBOX, and TEX) and provides document generation (--generate), Z3 contract verification (--prove), batch mode, and an eval harness. This browser demo runs a subset of the same modules client-side.
```bash
# Install (see ailang-parse README for full steps)
git clone https://github.com/sunholo-data/ailang-parse
# Quote the path so the symlink works even if the current directory contains spaces
ln -s "$(pwd)/ailang-parse/bin/docparse" ~/.local/bin/docparse

# Parse any Office doc (no API key needed)
docparse report.docx

# Batch mode: compile once, parse many (10x faster)
docparse ~/inbox/

# With AI image descriptions
docparse presentation.pptx --describe

# PDF/image (auto-enables Gemini multimodal)
docparse invoice.pdf

# Z3 contract verification
docparse --prove
```