10 modules ~2,500 lines
Structural document parsing that preserves what matters
DOCX, PPTX, XLSX parsed deterministically — PDF and images use Gemini multimodal AI
DocParse extracts structured content from documents — not flat text. Headers, footers, track changes, merged cells, text boxes, and images are all preserved as typed blocks with metadata.
Head-to-head comparison against Unstructured, a VC-funded Python document parsing library ($65M+ raised). 21 test files across DOCX, PPTX, XLSX, and PDF.
| Capability | AILANG DocParse | Unstructured (open-source) |
|---|---|---|
| Track changes | Structured (insert/delete/move) | Not supported |
| Headers & footers | 6 elements (semantic) | 2 elements (flat text) |
| Text boxes / shapes | 8 elements extracted | 3 elements (partial) |
| VML images | Detected + extracted | Not detected |
| Table structure | Preserved (headers, rows, merge info) | Atomized into individual cells |
| Image extraction (Office) | Detected + optional AI descriptions | Not extracted |
AILANG captures 99.8% of Unstructured's content. Unstructured captures only 95.3% of AILANG's — the 5% gap is track changes, headers/footers, sheet names, and VML images.
All demos run entirely in your browser via WebAssembly. No server needed for Office formats.
DocParse is also available as a command-line tool for batch processing and integration into pipelines.
# Install (symlink to PATH)
ln -s $(pwd)/docparse/docparse ~/.local/bin/docparse
# Parse a document (Office formats — no API key needed)
docparse report.docx
# Parse with AI image descriptions (needs Google Cloud ADC)
docparse presentation.pptx describe
# Parse a PDF (uses Gemini multimodal)
docparse invoice.pdf
# Output: JSON to stdout, structured blocks