AILANG DocParse

Capabilities

DocParse extracts structured content from documents — not flat text. Headers, footers, track changes, merged cells, text boxes, and images are all preserved as typed blocks with metadata.

Track Changes

Insert, delete, and move blocks with author attribution and timestamps. Structured data, not rendered text.
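
To illustrate what "structured data, not rendered text" can look like downstream, here is a minimal Python sketch. The `TrackedChange` type, its field names, and `accept_all` are hypothetical and are not DocParse's actual output schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

# Hypothetical shape for illustration; not DocParse's actual output schema.
@dataclass(frozen=True)
class TrackedChange:
    kind: Literal["insert", "delete", "move"]
    author: str
    timestamp: datetime
    text: str

def accept_all(changes: list[TrackedChange]) -> str:
    """Text with every change accepted: deleted text drops out, the rest is kept."""
    return "".join(c.text for c in changes if c.kind != "delete")

changes = [
    TrackedChange("insert", "alice", datetime(2024, 3, 1, 9, 30), "Payment is due in 30 days."),
    TrackedChange("delete", "bob", datetime(2024, 3, 2, 14, 0), " Late fees apply."),
]

accepted = accept_all(changes)
by_author = {c.author: c.kind for c in changes}
```

Because each change carries its own kind, author, and timestamp, a consumer can accept, reject, or audit revisions without re-parsing rendered text.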

Table Structure

Headers, rows, cells, and merge info are preserved, not atomized into individual cell elements.
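
A rough sketch of the difference between a preserved table and an atomized one. The `Cell` and `Table` types and their field names are assumptions made for illustration, not the DocParse data model.

```python
from dataclasses import dataclass, field

# Hypothetical types for illustration; field names are assumptions.
@dataclass
class Cell:
    text: str
    col_span: int = 1  # number of columns this cell covers
    row_span: int = 1  # number of rows this cell covers

@dataclass
class Table:
    headers: list[Cell]
    rows: list[list[Cell]] = field(default_factory=list)

    def width(self) -> int:
        """Logical column count, counting each merged header cell by its span."""
        return sum(c.col_span for c in self.headers)

table = Table(
    headers=[Cell("Region"), Cell("Revenue", col_span=2)],
    rows=[[Cell("EMEA"), Cell("Q1: 10"), Cell("Q2: 12")]],
)

# An atomized representation keeps only the flat cell texts and loses
# which cells are headers and which are merged.
atomized = [c.text for c in table.headers] + [c.text for row in table.rows for c in row]
```

With the structure intact, a consumer can still tell that "Revenue" spans two columns. The flat list cannot express that.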

Headers & Footers

Extracted as semantic sections with typed blocks — not flattened into body text.

Text Boxes & Shapes

Content extracted from DrawingML and VML shapes, including legacy VML images.

Image Extraction

Embedded images detected with optional AI-generated descriptions via Gemini multimodal.

Contract Verified

28 ensures contracts (postconditions) plus 51 inline tests, covering filter bounds, 1:1 mapper preservation, and structural invariants.
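
The two named properties can be sketched as runtime postconditions in Python. This is only an analogy: AILANG's contract syntax is not shown here, and `checked_filter` and `checked_map` are hypothetical helpers.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def checked_filter(pred: Callable[[A], bool], items: list[A]) -> list[A]:
    out = [x for x in items if pred(x)]
    # Postcondition (filter bound): a filter can never grow its input.
    assert len(out) <= len(items)
    return out

def checked_map(fn: Callable[[A], B], items: list[A]) -> list[B]:
    out = [fn(x) for x in items]
    # Postcondition (1:1 preservation): exactly one output per input.
    assert len(out) == len(items)
    return out

blocks = ["Heading", "", "Paragraph", ""]
non_empty = checked_filter(bool, blocks)   # drops the empty blocks
lengths = checked_map(len, non_empty)      # one length per remaining block
```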

Benchmark: AILANG vs Unstructured

Head-to-head comparison against Unstructured, a VC-funded Python document parsing library ($65M+ raised). 21 test files across DOCX, PPTX, XLSX, and PDF.

AILANG recall: 99.8%
Unstructured recall: 95.3%
More elements than Unstructured: 11 of 21 files
Lines of AILANG: ~2,500

Capability	AILANG DocParse	Unstructured (open-source)
Track changes	Structured (insert/delete/move)	Not supported
Headers & footers	6 elements (semantic)	2 elements (flat text)
Text boxes / shapes	8 elements extracted	3 elements (partial)
VML images	Detected + extracted	Not detected
Table structure	Preserved (headers, rows, merge info)	Atomized into individual cells
Image extraction (Office)	Detected + optional AI descriptions	Not extracted

AILANG captures 99.8% of Unstructured's content, while Unstructured captures only 95.3% of AILANG's. The 4.7-point gap consists of track changes, headers/footers, sheet names, and VML images.
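
The two recall figures are directional: each measures how much of one parser's output the other also captured. The benchmark's exact scoring method is not described here, so the following is only one plausible way to compute such a number, using token-multiset overlap on toy strings (not benchmark data).

```python
from collections import Counter

def recall(reference: str, candidate: str) -> float:
    """Fraction of the reference's tokens that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    if not ref:
        return 1.0
    matched = sum((ref & cand).values())  # multiset intersection
    return matched / sum(ref.values())

# Toy outputs, not benchmark data.
parser_a = "Quarterly report page header revenue table"
parser_b = "Quarterly report revenue table"

a_covers_b = recall(reference=parser_b, candidate=parser_a)  # A has everything B has
b_covers_a = recall(reference=parser_a, candidate=parser_b)  # B misses "page header"
```

Swapping the reference and the candidate is what gives two different percentages for the same pair of parsers.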

Try It

All demos run entirely in your browser via WebAssembly. No server needed for Office formats.

DocParse

Drop a DOCX, PPTX, XLSX, PDF, or image and get structured output — headings, tables with merged cells, track changes, comments. 10 AILANG modules parse Office XML directly in WASM.

WASM

Document Extractor

Upload any document, define a schema (or let AI detect one), and get validated, type-safe extraction results. 7 presets including invoice, receipt, and contract.

WASM; requires a Gemini API key

AI + Contracts

AI extracts structured data from documents, then AILANG contracts validate every field before returning results. Deterministic validation of stochastic AI output.

WASM; requires a Gemini API key

Z3 Verify

Prove contracts correct at compile time — no tests needed. 42 contracts verified, 4 bugs caught across cloud billing, access control, and scheduling modules.

WASM
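
The idea behind the AI + Contracts demo, deterministic validation of stochastic AI output, can be sketched in Python as follows. The invoice fields and rules are hypothetical. In the demo itself the checks are AILANG contracts, not Python.

```python
from datetime import date

# Hypothetical invoice schema and rules, for illustration only.
def validate_invoice(fields: dict) -> list[str]:
    """Return a list of violations; an empty list means every check passed."""
    errors = []
    total = fields.get("total")
    if not isinstance(total, (int, float)) or isinstance(total, bool) or total < 0:
        errors.append("total must be a non-negative number")
    currency = fields.get("currency")
    if not isinstance(currency, str) or len(currency) != 3 or not currency.isupper():
        errors.append("currency must be a 3-letter uppercase code")
    try:
        date.fromisoformat(str(fields.get("issued", "")))
    except ValueError:
        errors.append("issued must be an ISO 8601 date")
    return errors

# Stand-ins for model output: one well-formed, one malformed.
good = {"total": 1250.0, "currency": "EUR", "issued": "2024-03-01"}
bad = {"total": "about 1k", "currency": "euro", "issued": "March 1st"}
```

The same input always yields the same verdict, so a malformed extraction is rejected no matter how the model happened to phrase it.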

CLI Tool — sunholo/ailang-parse

The production CLI ships in the ailang-parse repo — 15 formats (adds ODT/ODP/ODS, HTML, Markdown, CSV, EPUB, EML, MBOX, TEX), document generation (--generate), Z3 contract verification (--prove), batch mode, and an eval harness. This browser demo runs a subset of the same modules client-side.

# Install (see ailang-parse README for full steps)
git clone https://github.com/sunholo-data/ailang-parse
ln -s "$(pwd)/ailang-parse/bin/docparse" ~/.local/bin/docparse

# Parse any Office doc (no API key needed)
docparse report.docx

# Batch mode — compile once, parse many (10x faster)
docparse ~/inbox/

# With AI image descriptions
docparse presentation.pptx --describe

# PDF/image (auto-enables Gemini multimodal)
docparse invoice.pdf

# Z3 contract verification
docparse --prove

Coming Soon

Standalone Binary (planned)

Pre-built binary distribution — no AILANG CLI installation required. Single executable for macOS, Linux, and Windows.

Hosted API (planned)

Cloud-hosted DocParse API on Multivac. Send a document, get structured JSON back. No local setup needed.