AILANG DocParse · 10 modules · ~2,500 lines

DocParse

Structural document parsing that preserves what matters

5 formats · 28 contracts · AI-powered PDF

DOCX, PPTX, XLSX parsed deterministically — PDF and images use Gemini multimodal AI

Document Intelligence
Capabilities

DocParse extracts structured content from documents — not flat text. Headers, footers, track changes, merged cells, text boxes, and images are all preserved as typed blocks with metadata.
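
To make "typed blocks with metadata" concrete, here is a minimal Python sketch of what such a block model could look like. It is an illustration only: the class names, field names, and the `changes_by_author` helper are hypothetical and are not taken from the DocParse modules, which are written in AILANG.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class TextBlock:
    text: str
    section: str = "body"  # hypothetical values: "body", "header", "footer", "textbox"


@dataclass
class ChangeBlock:
    kind: str        # "insert", "delete", or "move"
    text: str
    author: str
    timestamp: str   # ISO 8601 string (an assumed representation)


@dataclass
class Cell:
    text: str
    row_span: int = 1  # merge info stays attached to the cell
    col_span: int = 1


@dataclass
class TableBlock:
    headers: List[str]
    rows: List[List[Cell]]


@dataclass
class ImageBlock:
    name: str
    description: Optional[str] = None  # filled only when AI descriptions are requested


Block = Union[TextBlock, ChangeBlock, TableBlock, ImageBlock]


def changes_by_author(blocks: List[Block], author: str) -> List[ChangeBlock]:
    """Return the tracked changes attributed to one author."""
    return [b for b in blocks if isinstance(b, ChangeBlock) and b.author == author]


doc: List[Block] = [
    TextBlock("Quarterly Report", section="header"),
    ChangeBlock("insert", "revised figures", author="alice",
                timestamp="2024-01-15T09:30:00Z"),
    TableBlock(headers=["Region", "Q1"], rows=[[Cell("EMEA"), Cell("42")]]),
    ImageBlock("chart.png"),
]

alice_changes = changes_by_author(doc, "alice")
```

Because each block carries its own type, a consumer can select tracked changes, tables, or header text directly instead of re-deriving them from flat text.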

Track Changes
Insert, delete, and move blocks with author attribution and timestamps. Structured data, not rendered text.
Table Structure
Preserved headers, rows, cells, and merge info. Not atomized into individual cell elements.
Headers & Footers
Extracted as semantic sections with typed blocks — not flattened into body text.
Text Boxes & Shapes
Content extracted from DrawingML and VML shapes, including legacy VML images.
Image Extraction
Embedded images detected with optional AI-generated descriptions via Gemini multimodal.
Contract Verified
28 ensures contracts + 51 inline tests. Filter bounds, 1:1 mapper preservation, structural invariants.
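
The contract categories named above (filter bounds, 1:1 mapper preservation) can be illustrated with a small Python sketch. This is a runtime analogue under stated assumptions, not the AILANG `ensures` syntax and not the static Z3 verification mentioned for the CLI; the `ensures` decorator and the two example functions are hypothetical.

```python
from functools import wraps
from typing import Callable, List


def ensures(post: Callable[..., bool]):
    """Check the postcondition post(result, *args) after every call."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args):
            result = fn(*args)
            if not post(result, *args):
                raise AssertionError(f"contract violated in {fn.__name__}")
            return result
        return wrapper
    return decorate


# Filter bound: a filter may drop blocks but never invent new ones.
@ensures(lambda result, blocks: len(result) <= len(blocks))
def drop_empty(blocks: List[str]) -> List[str]:
    return [b for b in blocks if b.strip()]


# 1:1 mapper preservation: a mapper returns exactly one output per input.
@ensures(lambda result, blocks: len(result) == len(blocks))
def normalize(blocks: List[str]) -> List[str]:
    return [" ".join(b.split()) for b in blocks]


# A deliberately broken mapper that violates the 1:1 contract.
@ensures(lambda result, blocks: len(result) == len(blocks))
def broken_mapper(blocks: List[str]) -> List[str]:
    return blocks[:-1]
```

Calling `broken_mapper(["a", "b"])` raises `AssertionError`, which is the kind of structural regression these contracts are meant to catch.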
Benchmark: AILANG vs Unstructured

Head-to-head comparison against Unstructured, a VC-funded Python document parsing library ($65M+ raised). 21 test files across DOCX, PPTX, XLSX, and PDF.

AILANG recall: 99.8%
Unstructured recall: 95.3%
More elements: 11/21
Lines of AILANG: ~2,500

Capability | AILANG DocParse | Unstructured (open-source)
Track changes | Structured (insert/delete/move) | Not supported
Headers & footers | 6 elements (semantic) | 2 elements (flat text)
Text boxes / shapes | 8 elements extracted | 3 elements (partial)
VML images | Detected + extracted | Not detected
Table structure | Preserved (headers, rows, merge info) | Atomized into individual cells
Image extraction (Office) | Detected + optional AI descriptions | Not extracted

AILANG captures 99.8% of Unstructured's content, while Unstructured captures only 95.3% of AILANG's. The 4.7% gap consists of track changes, headers/footers, sheet names, and VML images.
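
The two recall figures are directional: each measures how much of one parser's output also appears in the other's. The source does not say how the benchmark computes recall, so the Python sketch below assumes a simple token-overlap definition purely to show why the two directions can differ. The function names and sample data are hypothetical.

```python
from collections import Counter
from typing import List


def tokens(elements: List[str]) -> Counter:
    """Lower-cased word counts across all extracted elements."""
    return Counter(word.lower() for text in elements for word in text.split())


def recall(candidate: List[str], reference: List[str]) -> float:
    """Fraction of the reference's tokens that the candidate also produced."""
    ref, cand = tokens(reference), tokens(candidate)
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to recover, so nothing was missed
    matched = sum((ref & cand).values())  # multiset intersection
    return matched / total


# Hypothetical outputs: parser_a also emits a tracked insertion that parser_b drops.
parser_a = ["Quarterly Report", "Revenue grew", "Inserted by Alice"]
parser_b = ["Quarterly Report", "Revenue grew"]

a_covers_b = recall(parser_a, parser_b)  # how much of parser_b's content parser_a has
b_covers_a = recall(parser_b, parser_a)  # how much of parser_a's content parser_b has
```

In this toy data, `parser_a` recovers everything `parser_b` emits, while `parser_b` misses the three tokens of the tracked insertion. The benchmark reports the same kind of asymmetry.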

Try It

All demos run entirely in your browser via WebAssembly. No server needed for Office formats.

The production CLI ships in the ailang-parse repo — 15 formats (adds ODT/ODP/ODS, HTML, Markdown, CSV, EPUB, EML, MBOX, TEX), document generation (--generate), Z3 contract verification (--prove), batch mode, and an eval harness. This browser demo runs a subset of the same modules client-side.

# Install (see ailang-parse README for full steps)
git clone https://github.com/sunholo-data/ailang-parse
ln -s $(pwd)/ailang-parse/bin/docparse ~/.local/bin/docparse

# Parse any Office doc (no API key needed)
docparse report.docx

# Batch mode: compile once, parse many (10x faster)
docparse ~/inbox/

# With AI image descriptions
docparse presentation.pptx --describe

# PDF/image (auto-enables Gemini multimodal)
docparse invoice.pdf

# Z3 contract verification
docparse --prove

Coming Soon
Standalone Binary (planned)
Pre-built binary distribution with no AILANG CLI installation required. Single executable for macOS, Linux, and Windows.
Hosted API (planned)
Cloud-hosted DocParse API on Multivac. Send a document, get structured JSON back. No local setup needed.