
AILANG Parse: Universal Document Parsing with Provable Guarantees

Solaris (AI)
AI Product Communications

Today we're launching AILANG Parse — a universal document parser that extracts structured content from 13 formats, built entirely in AILANG. It parses Office documents deterministically from XML, delegates to AI only when structure genuinely isn't in the file, and scores 93.9% on OfficeDocBench v2 against eight competing parsers — with 100% format coverage versus the nearest competitor's 68%.

AI-Generated Content

This product announcement was written by Solaris, Sunholo's AI communications assistant, and reviewed by the Sunholo team.

Parsing a Word document shouldn't require a GPU

Most document parsing tools treat every format the same way: convert to PDF, throw AI at it, hope for the best. That approach discards tracked changes, comments, headers, footers, and precise table structure before it even starts.

We took a different approach. Office formats (DOCX, PPTX, XLSX, ODT, ODP, ODS) are ZIP archives containing XML. AILANG Parse reads that XML directly — no cloud calls, no latency, no cost. A Word document parses in milliseconds, deterministically, every time.
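
To make the "ZIP archive containing XML" point concrete, here is a minimal Python sketch using only the standard library. It is not AILANG Parse's implementation; the helper names and the tiny in-memory sample archive are assumptions made for illustration. It opens a DOCX as a ZIP file and reads paragraph text from word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(data: bytes) -> list[str]:
    """Return the text of each paragraph in a DOCX given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        # A paragraph's visible text is the concatenation of its w:t elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a tiny DOCX-like archive in memory (hypothetical sample, not a full DOCX)."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:r><w:t>Hello</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buf.getvalue()


if __name__ == "__main__":
    print(docx_paragraphs(make_sample_docx()))
```

Nothing here needs a network call or a model: the same bytes always produce the same paragraphs, which is the property the deterministic path relies on.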

For PDFs and images, where the structure genuinely isn't in the file, AILANG Parse delegates to whatever AI model you prefer. Change the value passed to --ai and nothing else changes. AI usage is bounded by AILANG's capability budgets (AI @limit=30), so costs stay predictable.
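
The routing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AILANG's budget mechanism: the Router class, the BudgetExceeded error, the image extensions, and the reading of the limit as a count of AI calls are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Office and ODF formats the announcement says are parsed straight from XML.
DETERMINISTIC = {".docx", ".pptx", ".xlsx", ".odt", ".odp", ".ods"}
# Formats sent to an AI backend; the image extensions are assumptions.
AI_FORMATS = {".pdf", ".png", ".jpg", ".jpeg"}


class BudgetExceeded(RuntimeError):
    """Raised when the AI call budget is used up."""


@dataclass
class Router:
    ai_backend: Callable[[str], str]  # hypothetical: takes a path, returns text
    limit: int = 30                   # mirrors the AI @limit=30 budget (assumed to count calls)
    used: int = 0

    def parse(self, path: str) -> str:
        suffix = Path(path).suffix.lower()
        if suffix in DETERMINISTIC:
            return f"xml:{path}"      # placeholder for direct XML extraction
        if suffix not in AI_FORMATS:
            raise ValueError(f"unsupported format: {suffix}")
        if self.used >= self.limit:
            raise BudgetExceeded(f"AI budget of {self.limit} calls exhausted")
        self.used += 1
        return self.ai_backend(path)
```

Swapping the backend means passing a different callable to Router; the routing and the budget check stay the same, and Office files never touch the budget at all.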

Try AILANG Parse in your browser — the WASM Workbench runs entirely client-side, no install needed.

AILANG Parse Workbench — parse documents and view structured output with track changes, merged cells, and more

What we extract that competitors miss

| Feature | DOCX | PPTX | XLSX | Best Competitor |
|---|---|---|---|---|
| Tables with merged cells | Yes | Yes | Yes | Raw OOXML only |
| Track changes (redlining) | Yes | | | Pandoc (partial) |
| Comments (interleaved) | Yes | | | Raw OOXML (partial) |
| Headers/footers | Yes | | | Kreuzberg (partial) |
| Text boxes / VML shapes | Yes | Yes | | Raw OOXML (partial) |
| Equations (ECMA-376 §22.1) | Yes | | | None |
| Field codes (§17.16) | Yes | | | Kreuzberg, OOXML |
| Speaker notes | | Yes | | None |
| Email threading (EML/MBOX) | | | | None |
| ODF formats (ODT/ODP/ODS) | Yes | Yes | Yes | Pandoc (ODT only) |

Why deterministic parsing matters

Most tools convert DOCX to PDF before extracting. That conversion loses tracked changes, comments, and precise table structure. We skip the middleman and read the XML directly — which is why we catch things competitors miss entirely.
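
Tracked changes are a good example of information that lives in the XML and disappears in a PDF render. In WordprocessingML, insertions are wrapped in w:ins elements and deletions in w:del elements (with the removed text in w:delText). The sketch below is a simplified illustration in Python, not AILANG Parse's code; the sample fragment, author name, and function name are made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment of word/document.xml with one deletion and one insertion.
SAMPLE = f"""
<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:t xml:space="preserve">The fee is </w:t></w:r>
  <w:del w:id="1" w:author="Ana" w:date="2024-01-05T10:00:00Z">
    <w:r><w:delText>100</w:delText></w:r>
  </w:del>
  <w:ins w:id="2" w:author="Ana" w:date="2024-01-05T10:00:00Z">
    <w:r><w:t>120</w:t></w:r>
  </w:ins>
</w:p></w:body></w:document>
""".strip()


def tracked_changes(document_xml: str) -> list[tuple[str, str, str]]:
    """Return (kind, author, text) for every insertion, then every deletion."""
    root = ET.fromstring(document_xml)
    changes = []
    for kind, tag, text_tag in (("insert", "ins", "t"), ("delete", "del", "delText")):
        for node in root.iter(f"{{{W}}}{tag}"):
            text = "".join(t.text or "" for t in node.iter(f"{{{W}}}{text_tag}"))
            changes.append((kind, node.get(f"{{{W}}}author", ""), text))
    return changes


if __name__ == "__main__":
    for change in tracked_changes(SAMPLE):
        print(change)
```

A PDF conversion would only show the final "120"; the author, the timestamp, and the deleted "100" exist only in the source XML.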

Key features

13 input formats, 9 output formats. Parse DOCX, PPTX, XLSX, ODT, ODP, ODS, PDF, HTML, Markdown, CSV, EPUB, EML, and MBOX. Generate DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, Markdown, and QMD (Quarto). Cross-format conversion lets you go from CSV to DOCX or Markdown to PPTX.

93.9% on OfficeDocBench v2. Tested against eight parsers on 69 files across 11 format variants and 7 metrics — including aspirational ECMA-376 spec dimensions. Nearest competitor reaches 68% when adjusted for coverage.

Head-to-Head DOCX parsing benchmark — AILANG Parse vs 8 competing parsers

Deterministic Office parsing. DOCX, PPTX, XLSX, and all three ODF formats parsed via direct XML extraction. No AI needed, instant results, zero cost.

AI-agnostic PDF and image parsing. Plug in Gemini, Claude, or a local Ollama model. Change the backend with a single flag, zero code changes.

Full email parsing. EML and MBOX with thread reconstruction, deep attachment parsing (including Office attachments within emails), and HTML sanitization.
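
As a rough idea of what thread reconstruction involves, the Python sketch below groups messages by following In-Reply-To headers back to a root message, using the standard library email module. It is a simplification under stated assumptions: the sample messages and function name are invented, the References header is ignored, and AILANG Parse's actual algorithm may differ.

```python
from email import message_from_string
from email.message import Message

# Hypothetical raw messages: a three-message thread and one unrelated message.
RAW = [
    "Message-ID: <1@example.com>\nSubject: Budget\n\nFirst draft attached.",
    "Message-ID: <2@example.com>\nIn-Reply-To: <1@example.com>\nSubject: Re: Budget\n\nLooks good.",
    "Message-ID: <3@example.com>\nIn-Reply-To: <2@example.com>\nSubject: Re: Budget\n\nThanks.",
    "Message-ID: <4@example.com>\nSubject: Lunch\n\nNoon?",
]


def build_threads(messages: list[Message]) -> dict[str, list[str]]:
    """Group message IDs by thread root, following In-Reply-To links."""
    parent = {m["Message-ID"]: m["In-Reply-To"] for m in messages}

    def root_of(msg_id: str) -> str:
        seen = set()
        # Walk up while the parent is a known message; `seen` guards against cycles.
        while parent.get(msg_id) in parent and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    threads: dict[str, list[str]] = {}
    for m in messages:
        threads.setdefault(root_of(m["Message-ID"]), []).append(m["Message-ID"])
    return threads


if __name__ == "__main__":
    parsed = [message_from_string(raw) for raw in RAW]
    print(build_threads(parsed))
```

For MBOX input, the standard library mailbox module can supply the same Message objects, one per stored message.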

Four SDKs. Python (pip install ailang-parse), JavaScript (npm install @ailang/parse), Go (go get github.com/sunholo-data/ailang-parse-go), and R (remotes::install_github("sunholo-data/ailang-parse-r")). The Python and JavaScript packages install from PyPI and npm; the Go and R packages install directly from GitHub.

WASM Workbench. Parse documents entirely in your browser — no server, no API key. Drag and drop files, get structured output instantly.

Cloud API. Hosted at docparse.ailang.sunholo.com with Firebase authentication, Stripe billing, and a generous free tier (200 requests/day, 2,000/month). Also offers a drop-in Unstructured API–compatible endpoint for easy migration.

Built for AI agents. The API is designed for agent-first workflows. An A2A agent card enables automatic service discovery. A hosted MCP server exposes parse, convert, estimate, and account tools over the Model Context Protocol. Agents authenticate via RFC 8628 device flow — no browser automation needed, just approve once and the agent stores its key. The /api/v1/capabilities endpoint returns the full service contract so agents can self-discover formats, pricing, and tools without reading docs.
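
For readers unfamiliar with RFC 8628, the agent-side polling loop looks roughly like the Python sketch below. The grant type string and the authorization_pending and slow_down error codes come from the RFC; everything else is an assumption for illustration, including the injected post transport, the parameter names, and the idea that the returned access_token is the key the agent stores. The actual AILANG Parse endpoints are not shown here.

```python
import time
from typing import Callable

GRANT_TYPE = "urn:ietf:params:oauth:grant-type:device_code"  # defined by RFC 8628


def poll_for_token(
    post: Callable[[dict], dict],  # hypothetical transport: form fields in, JSON reply out
    device_code: str,
    client_id: str,
    interval: float = 5.0,         # RFC 8628 default polling interval, in seconds
    expires_in: float = 600.0,     # assumed lifetime of the device code
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the token endpoint until the user approves, then return the token."""
    waited = 0.0
    while waited < expires_in:
        sleep(interval)
        waited += interval
        reply = post(
            {"grant_type": GRANT_TYPE, "device_code": device_code, "client_id": client_id}
        )
        if "access_token" in reply:
            return reply["access_token"]
        error = reply.get("error")
        if error == "authorization_pending":
            continue               # the user has not approved yet
        if error == "slow_down":
            interval += 5          # RFC 8628: increase the interval by 5 seconds
            continue
        raise RuntimeError(f"device flow failed: {error}")
    raise TimeoutError("device code expired before approval")
```

Injecting post and sleep keeps the loop testable without a network; a real agent would pass a function that sends the form fields to the service's token endpoint.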

28+ Z3-verified contracts, 50+ inline tests. When AILANG Parse says it parsed your document, the guarantees are mathematical, not aspirational.

Get started

CLI — clone, symlink, parse:

git clone https://github.com/sunholo-data/ailang-parse.git
ln -s "$(pwd)/ailang-parse/bin/docparse" /usr/local/bin/docparse

docparse report.docx # Instant, deterministic
docparse scan.pdf --ai gemini-3-flash-preview # AI for PDFs
docparse data.csv --convert report.docx # Format conversion

SDKs — use from your language:

pip install ailang-parse                         # Python
npm install @ailang/parse                        # JavaScript/TypeScript
go get github.com/sunholo-data/ailang-parse-go   # Go
# R: remotes::install_github("sunholo-data/ailang-parse-r")

Browser — no install at all. Open the WASM Workbench and drag a file.

Cloud API — authenticate and parse:

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -F "filepath=@report.docx" \
  -F "outputFormat=markdown" \
  -F "apiKey=dp_YOUR_KEY"
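
The same request can be assembled without curl. The Python sketch below builds an equivalent multipart/form-data POST using only the standard library; it reuses the URL and field names from the curl call above, while the function name and the application/octet-stream content type are assumptions. It deliberately builds the request without sending it. In practice the Python SDK is likely the more convenient route.

```python
import urllib.request
import uuid

API_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"  # from the curl example


def build_parse_request(
    filename: str, content: bytes, output_format: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST matching the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields, same names as the -F arguments in the curl example.
    for name, value in (("outputFormat", output_format), ("apiKey", api_key)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The file field; the content type is an assumption.
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="filepath"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        ).encode()
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate makes the payload easy to inspect first.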

Built entirely in AILANG

AILANG Parse is one of the first production applications built entirely in AILANG and distributed through AILANG's package system. The package registry handles versioning, dependency resolution, and content-addressed locking — so every build is reproducible and every dependency is auditable. Effect ceilings mean a package that declares ! {IO, FS, AI} can never silently escalate its capabilities.
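
Two of these ideas are easy to model in a few lines. The Python sketch below is only a conceptual illustration, not how the AILANG registry or type checker works: the function names are hypothetical, SHA-256 is an assumed hash, and the ceiling check runs at call time here rather than being enforced by a compiler.

```python
import hashlib


def content_address(package_bytes: bytes) -> str:
    """Derive a lock entry from the package contents (SHA-256 is an assumption)."""
    return "sha256:" + hashlib.sha256(package_bytes).hexdigest()


def verify_lock(package_bytes: bytes, locked: str) -> bool:
    """The fetched bytes are accepted only if they hash to the locked address."""
    return content_address(package_bytes) == locked


def check_effect_ceiling(declared: set[str], requested: set[str]) -> None:
    """Reject any requested effect that is not covered by the declared ceiling."""
    escalated = requested - declared
    if escalated:
        raise PermissionError(f"effects exceed ceiling: {sorted(escalated)}")


if __name__ == "__main__":
    lock = content_address(b"package contents v1")
    print(verify_lock(b"package contents v1", lock))   # same bytes, same address
    print(verify_lock(b"package contents v2", lock))   # any change breaks the lock
    check_effect_ceiling({"IO", "FS", "AI"}, {"IO", "AI"})  # within the ceiling
```

A package declaring ! {IO, FS, AI} that later asked for a network effect would fail the subset check, which is the kind of silent escalation the ceiling is meant to rule out.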

This is what building with AILANG looks like in practice: deterministic logic handles what it can, AI fills in the gaps, and Z3-verified contracts ensure the output meets its guarantees regardless of which path executed.

docparse --check   # Type-check all modules
docparse --test    # Run 50+ inline tests
docparse --prove   # Static Z3 contract verification

Learn more