AILANG DocParse 10 modules ~2,500 lines

DocParse

Structural document parsing that preserves what matters

5 formats 28 contracts AI-powered PDF

DOCX, PPTX, XLSX parsed deterministically — PDF and images use Gemini multimodal AI

Document Intelligence
Capabilities

DocParse extracts structured content from documents — not flat text. Headers, footers, track changes, merged cells, text boxes, and images are all preserved as typed blocks with metadata.

Track Changes
Insert, delete, and move blocks with author attribution and timestamps. Structured data, not rendered text.
Table Structure
Preserved headers, rows, cells, and merge info. Not atomized into individual cell elements.
Headers & Footers
Extracted as semantic sections with typed blocks — not flattened into body text.
Text Boxes & Shapes
Content extracted from DrawingML and VML shapes, including legacy VML images.
Image Extraction
Embedded images detected with optional AI-generated descriptions via Gemini multimodal.
Contract Verified
28 ensures contracts + 51 inline tests. Filter bounds, 1:1 mapper preservation, structural invariants.
Benchmark: AILANG vs Unstructured

Head-to-head comparison against Unstructured, a VC-funded Python document parsing library ($65M+ raised). 21 test files across DOCX, PPTX, XLSX, and PDF.

99.8%
AILANG recall
95.3%
Unstructured recall
11/21
More elements
~2,500
Lines of AILANG
Capability AILANG DocParse Unstructured (open-source)
Track changes Structured (insert/delete/move) Not supported
Headers & footers 6 elements (semantic) 2 elements (flat text)
Text boxes / shapes 8 elements extracted 3 elements (partial)
VML images Detected + extracted Not detected
Table structure Preserved (headers, rows, merge info) Atomized into individual cells
Image extraction (Office) Detected + optional AI descriptions Not extracted

AILANG captures 99.8% of Unstructured's content. Unstructured captures only 95.3% of AILANG's — the 5% gap is track changes, headers/footers, sheet names, and VML images.

Try It

All demos run entirely in your browser via WebAssembly. No server needed for Office formats.

CLI Tool

DocParse is also available as a command-line tool for batch processing and integration into pipelines.

# Install (symlink to PATH) ln -s $(pwd)/docparse/docparse ~/.local/bin/docparse # Parse a document (Office formats — no API key needed) docparse report.docx # Parse with AI image descriptions (needs Google Cloud ADC) docparse presentation.pptx describe # Parse a PDF (uses Gemini multimodal) docparse invoice.pdf # Output: JSON to stdout, structured blocks
Coming Soon
planned
Standalone Binary
Pre-built binary distribution — no AILANG CLI installation required. Single executable for macOS, Linux, and Windows.
planned
Hosted API
Cloud-hosted DocParse API on Multivac. Send a document, get structured JSON back. No local setup needed.