Why Migrate
Unstructured converts Office documents to PDF before parsing, destroying structural data — track changes, merged cells, comments, headers, footers, text boxes — that exists in the source XML. AILANG Parse reads the XML directly. Deterministic, structural, fast.
On the OfficeDocBench structural benchmark, AILANG Parse scores 93.9% with 100% format coverage, compared to Unstructured's 62.1% composite (38.7% coverage-adjusted — Unstructured only handles 62% of the benchmark files).
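The relationship between the composite and coverage-adjusted figures can be checked with a little arithmetic. The sketch below assumes that the coverage-adjusted score is the composite score multiplied by the fraction of benchmark files a parser handles. That formula is an assumption made for illustration, not something the benchmark description states, and the function name is hypothetical.

```python
def coverage_adjusted(composite_pct: float, coverage: float) -> float:
    """Scale a composite score (in percent) by the fraction of files handled.

    Assumption: coverage-adjusted = composite x coverage.
    """
    return composite_pct * coverage


# AILANG Parse: 93.9% composite with 100% coverage leaves the score unchanged.
ailang = coverage_adjusted(93.9, 1.00)

# Unstructured: 62.1% composite with roughly 62% of the files handled.
unstructured = coverage_adjusted(62.1, 0.62)

print(f"AILANG Parse: {ailang:.1f}%")
print(f"Unstructured: {unstructured:.1f}%")
```

Under this assumption the Unstructured result lands within a few tenths of a point of the reported 38.7%, which is consistent with the 62% coverage figure being rounded.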
Quick Migration
For Office documents, AILANG Parse is a drop-in replacement. Change the URL and API key; the rest of your existing code works unchanged:
curl — One URL change
# Before: Unstructured
curl -X POST "https://api.unstructured.io/general/v0/general" \
  -H "unstructured-api-key: $UNSTRUCTURED_KEY" \
  -F "files=@report.docx"

# After: AILANG Parse (same endpoint, same field name, same header)
curl -X POST "https://docparse.ailang.sunholo.com/general/v0/general" \
  -H "unstructured-api-key: $DOCPARSE_API_KEY" \
  -F "files=@report.docx"
Python
import requests

# UNSTRUCTURED_KEY and DOCPARSE_API_KEY are assumed to hold your API keys.

# Before: Unstructured
with open("report.docx", "rb") as f:
    resp = requests.post(
        "https://api.unstructured.io/general/v0/general",
        headers={"unstructured-api-key": UNSTRUCTURED_KEY},
        files={"files": f},
    )

# After: AILANG Parse (change the URL and key; everything else stays the same)
with open("report.docx", "rb") as f:
    resp = requests.post(
        "https://docparse.ailang.sunholo.com/general/v0/general",
        headers={"unstructured-api-key": DOCPARSE_API_KEY},
        files={"files": f},
    )
The /general/v0/general endpoint accepts both multipart file uploads (-F "files=@file.docx") and JSON body requests ({"filepath": "sample_id"}). The API key can be passed via the unstructured-api-key header or as an apiKey form field. The response format is fully compatible with existing Unstructured client code.
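The three request shapes described above can be sketched as keyword arguments for requests.post. This is a minimal sketch, not a reference client: the helper names are hypothetical, and how the service interprets the "filepath" value in a JSON body is not described here, so it is treated as an opaque identifier.

```python
from typing import Any, Dict

ENDPOINT = "https://docparse.ailang.sunholo.com/general/v0/general"


def header_auth_upload(fileobj: Any, api_key: str) -> Dict[str, Any]:
    """Multipart upload with the key in the unstructured-api-key header."""
    return {
        "url": ENDPOINT,
        "headers": {"unstructured-api-key": api_key},
        "files": {"files": fileobj},
    }


def form_auth_upload(fileobj: Any, api_key: str) -> Dict[str, Any]:
    """Multipart upload with the key sent as an apiKey form field instead."""
    return {
        "url": ENDPOINT,
        "files": {"files": fileobj},
        "data": {"apiKey": api_key},
    }


def json_body_request(sample_id: str, api_key: str) -> Dict[str, Any]:
    """JSON body request that refers to a document by identifier."""
    return {
        "url": ENDPOINT,
        "headers": {"unstructured-api-key": api_key},
        "json": {"filepath": sample_id},
    }


# Usage (requires the requests package and a real key):
#   import requests
#   with open("report.docx", "rb") as f:
#       resp = requests.post(**form_auth_upload(f, DOCPARSE_API_KEY))
```

Keeping request construction separate from sending makes each variant easy to inspect or unit-test without network access.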
Routing Pattern
For mixed-format pipelines, the recommended approach is to route Office formats through AILANG Parse and PDFs through Unstructured. A simple file-extension check is all you need:
import os
import requests
from pathlib import Path

AILANG_PARSE_URL = "https://docparse.ailang.sunholo.com"
UNSTRUCTURED_URL = "https://api.unstructured.io"

# API keys are read from the environment here; load them however your pipeline does
DOCPARSE_API_KEY = os.environ["DOCPARSE_API_KEY"]
UNSTRUCTURED_KEY = os.environ["UNSTRUCTURED_KEY"]

OFFICE_EXTENSIONS = {
    ".docx", ".pptx", ".xlsx",
    ".odt", ".odp", ".ods",
    ".csv", ".html", ".md", ".epub",
}

def parse_document(filepath: str) -> dict:
    """Route documents to the best parser for their format."""
    ext = Path(filepath).suffix.lower()
    if ext in OFFICE_EXTENSIONS:
        # Deterministic structural parsing: no AI needed, no per-page billing
        # Same endpoint format as Unstructured; only the URL and key change
        base_url, api_key = AILANG_PARSE_URL, DOCPARSE_API_KEY
    else:
        # PDFs, scanned images: Unstructured ML pipeline
        base_url, api_key = UNSTRUCTURED_URL, UNSTRUCTURED_KEY
    # Open the file in a context manager so the handle is closed after upload
    with open(filepath, "rb") as f:
        resp = requests.post(
            f"{base_url}/general/v0/general",
            headers={"unstructured-api-key": api_key},
            files={"files": f},
        )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json()
This gives you the best of both worlds: structural fidelity for Office documents and ML-powered extraction for PDFs and scans.
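For batch jobs it can help to decide the routing for a whole set of files before uploading anything. The sketch below applies the same extension check to a list of paths and groups them by backend; the function names and backend labels are hypothetical, and the file names are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

OFFICE_EXTENSIONS = {
    ".docx", ".pptx", ".xlsx",
    ".odt", ".odp", ".ods",
    ".csv", ".html", ".md", ".epub",
}


def choose_backend(filepath: str) -> str:
    """Return a label for the parser that should handle this file."""
    ext = Path(filepath).suffix.lower()  # case-insensitive extension match
    return "ailang-parse" if ext in OFFICE_EXTENSIONS else "unstructured"


def plan_batch(filepaths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by backend without making any network calls."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for path in filepaths:
        plan[choose_backend(path)].append(path)
    return dict(plan)


plan = plan_batch(["q3-report.DOCX", "scan.pdf", "budget.xlsx", "photo.png", "README"])
for backend, files in plan.items():
    print(backend, files)
```

Files with no extension, such as README here, fall through to Unstructured, mirroring the else branch of the routing function above.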
What You Gain
| Feature | Unstructured | AILANG Parse |
|---|---|---|
| Track Changes | None (lost in PDF conversion) | Full — insertions, deletions, author, date |
| Merged Cells | Flattened to text | Structural — row/column spans preserved |
| Comments | Dropped | Author-attributed with anchor positions |
| Headers / Footers | Dropped or mixed into body | Separate blocks with page context |
| Text Boxes | Dropped | Extracted as positioned blocks |
| Footnotes / Endnotes | Dropped | Preserved with reference markers |
| Parse Speed | 2–10 seconds | 11 milliseconds |
| Dependencies | Python + heavy libraries | Zero — single binary, no runtime deps |
| Determinism | Non-deterministic (ML inference) | Deterministic — same input always gives same output |
| OfficeDocBench (composite) | 62.1% | 93.9% |
| OfficeDocBench (coverage-adjusted) | 38.7% | 93.9% |
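The determinism row is straightforward to verify in your own pipeline: parse the same file more than once and compare fingerprints of the responses. The following is a minimal sketch with hypothetical helper names; the sample responses are made-up placeholders rather than real parser output.

```python
import hashlib
import json
from typing import Any, Iterable


def response_fingerprint(parsed: Any) -> str:
    """Hash a parsed JSON response in a canonical form (sorted keys, fixed separators)."""
    canonical = json.dumps(
        parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(responses: Iterable[Any]) -> bool:
    """True if every response hashes to the same fingerprint."""
    return len({response_fingerprint(r) for r in responses}) <= 1


# Placeholder responses standing in for repeated parses of one document.
run_a = [{"type": "Title", "text": "Quarterly Report"}]
run_b = [{"text": "Quarterly Report", "type": "Title"}]  # same content, different key order
run_c = [{"type": "Title", "text": "Quarterly report"}]  # differs by one character

print(is_deterministic([run_a, run_b]))  # True
print(is_deterministic([run_a, run_c]))  # False
```

If a response carries per-request fields such as timestamps or request IDs, strip them before hashing. Whether either service returns such fields is not stated here.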
Self-Hosting
Run AILANG Parse locally with Docker — no external dependencies for Office formats:
# Clone and build
git clone https://github.com/sunholo-data/ailang-parse.git
cd ailang-parse
docker build -t docparse .
# Parse a document (mount your files into /data)
docker run -v "$(pwd)":/data docparse /data/report.docx

# Parse with AI (pass your API key)
docker run -e GOOGLE_API_KEY="your-key" \
  -v "$(pwd)":/data docparse --ai gemini-2.5-flash /data/document.pdf
The Dockerfile builds AILANG from source and includes all 31 parser modules. Office parsing works immediately with zero configuration. Add a GOOGLE_API_KEY environment variable to enable AI-powered PDF and image parsing.
For detailed configuration options, capability budgets, and AI model setup, see the Self-Host Guide.
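To call the self-hosted container from a Python pipeline, one option is to build the same docker run command as an argument list and hand it to subprocess. This is a sketch under assumptions: the helper name is hypothetical, the image is assumed to be tagged docparse as in the build step above, and what the container writes to stdout is not specified here.

```python
import os
import shlex
from typing import List, Optional


def docparse_command(
    host_dir: str, filename: str, ai_model: Optional[str] = None
) -> List[str]:
    """Build the docker run argument list for parsing one file under host_dir."""
    cmd = ["docker", "run"]
    if ai_model is not None:
        # "-e NAME" with no value forwards GOOGLE_API_KEY from the caller's
        # environment, so the key never appears on the command line.
        cmd += ["-e", "GOOGLE_API_KEY"]
    # Mount the host directory at /data, as in the shell examples.
    cmd += ["-v", f"{os.path.abspath(host_dir)}:/data", "docparse"]
    if ai_model is not None:
        cmd += ["--ai", ai_model]
    cmd.append(f"/data/{filename}")
    return cmd


office_cmd = docparse_command("/srv/docs", "report.docx")
ai_cmd = docparse_command("/srv/docs", "document.pdf", ai_model="gemini-2.5-flash")
print(shlex.join(office_cmd))
print(shlex.join(ai_cmd))

# Usage (requires Docker and the built image):
#   import subprocess
#   result = subprocess.run(office_cmd, check=True, capture_output=True, text=True)
```

Passing an argument list rather than a shell string avoids quoting problems with directory names that contain spaces.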
Frequently Asked Questions
How do I migrate from Unstructured to AILANG Parse?
Change the URL from api.unstructured.io to docparse.ailang.sunholo.com and swap your API key. The /general/v0/general endpoint accepts the same multipart upload (-F "files=@file.docx") and the same unstructured-api-key header. See the Quick Migration section for curl and Python examples.
Is AILANG Parse a drop-in replacement for Unstructured?
For Office document parsing, yes. AILANG Parse implements the same /general/v0/general endpoint with the same multipart upload format and the same unstructured-api-key header. Change the URL and API key — your existing code works unchanged. For PDFs and scanned documents, keep using Unstructured — its ML pipeline is purpose-built for those formats.
Can I use both Unstructured and AILANG Parse together?
Yes, and this is the recommended pattern. Route Office formats (DOCX, PPTX, XLSX, ODT, ODP, ODS, CSV, HTML, Markdown, EPUB) through AILANG Parse for deterministic structural parsing, and route PDFs and scanned images through Unstructured where its ML pipeline excels. A simple file-extension check in your routing layer is all you need. See the Routing Pattern section for a complete example.
What Office formats does AILANG Parse support that Unstructured doesn't parse well?
DOCX, PPTX, XLSX, ODT, ODP, ODS, CSV, HTML, Markdown, and EPUB — with full structural fidelity including track changes, merged cells, and comments.