Migrate from Unstructured

Q: How do I migrate from Unstructured to AILANG Parse?

Change the URL from api.unstructured.io to docparse.ailang.sunholo.com and swap in your AILANG Parse API key. The /general/v0/general endpoint accepts the same multipart file upload (-F files=@file.docx) and the same header name (unstructured-api-key), and it returns the same response format. See the Quick Migration section for curl and Python examples.

Why Migrate

Unstructured converts Office documents to PDF before parsing, which destroys structural data that exists in the source XML: track changes, merged cells, comments, headers, footers, and text boxes. AILANG Parse reads the XML directly, so parsing is deterministic, structure-preserving, and fast.

On the OfficeDocBench structural benchmark, AILANG Parse scores 93.9% with 100% format coverage, compared with Unstructured's 62.1% composite (38.7% coverage-adjusted, because Unstructured handles only 62% of the benchmark files).
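The two Unstructured figures can be cross-checked with a little arithmetic. The sketch below assumes the coverage-adjusted score is the composite score multiplied by the fraction of benchmark files handled; that definition is an assumption, not something stated here, and implied_coverage is a hypothetical helper name.

```python
# Assumption: coverage_adjusted = composite * coverage_fraction.
COMPOSITE = 62.1          # Unstructured composite score (%), from the text
COVERAGE_ADJUSTED = 38.7  # Unstructured coverage-adjusted score (%), from the text


def implied_coverage(composite: float, adjusted: float) -> float:
    """Return the coverage fraction implied by the two published scores."""
    return adjusted / composite


coverage = implied_coverage(COMPOSITE, COVERAGE_ADJUSTED)
print(f"Implied coverage: {coverage:.1%}")
```

Under that assumption the implied coverage rounds to the 62% quoted above.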

Keep Unstructured for PDFs and scanned documents. Use AILANG Parse for Office formats.

Quick Migration

AILANG Parse is a drop-in replacement for Office document parsing. Change the URL and API key, and your existing code works unchanged:

curl — One URL change

# Before: Unstructured
curl -X POST "https://api.unstructured.io/general/v0/general" \
  -H "unstructured-api-key: $UNSTRUCTURED_KEY" \
  -F "files=@report.docx"

# After: AILANG Parse (same endpoint, same field name, same header)
curl -X POST "https://docparse.ailang.sunholo.com/general/v0/general" \
  -H "unstructured-api-key: $DOCPARSE_API_KEY" \
  -F "files=@report.docx"

Python

import os
import requests

# Before: Unstructured
with open("report.docx", "rb") as f:
    resp = requests.post(
        "https://api.unstructured.io/general/v0/general",
        headers={"unstructured-api-key": os.environ["UNSTRUCTURED_KEY"]},
        files={"files": f},
    )

# After: AILANG Parse (change URL and key; everything else stays the same)
with open("report.docx", "rb") as f:
    resp = requests.post(
        "https://docparse.ailang.sunholo.com/general/v0/general",
        headers={"unstructured-api-key": os.environ["DOCPARSE_API_KEY"]},
        files={"files": f},
    )
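Because the response format is the same, code that consumes Unstructured output should carry over. The sketch below assumes the response is an Unstructured-style JSON list of elements, each carrying "type" and "text" keys; texts_by_type and the sample data are hypothetical and only illustrate the shape.

```python
from collections import defaultdict


def texts_by_type(elements: list[dict]) -> dict[str, list[str]]:
    """Group element text by element type (for example "Title")."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for element in elements:
        # .get() tolerates elements that lack either key
        grouped[element.get("type", "Unknown")].append(element.get("text", ""))
    return dict(grouped)


# Hypothetical sample shaped like resp.json() from either service
sample = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew."},
    {"type": "NarrativeText", "text": "Costs fell."},
]
print(texts_by_type(sample))
```

If this helper works against your current Unstructured responses, it is a quick way to confirm that the migrated endpoint returns the fields your pipeline relies on.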

The /general/v0/general endpoint accepts both multipart file uploads (-F "files=@file.docx") and JSON body requests ({"filepath": "sample_id"}). The API key can be passed via the unstructured-api-key header or as an apiKey form field. The response format is fully compatible with existing Unstructured client code.
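The request shapes described above can be sketched as keyword arguments for requests.post. The helper names are hypothetical, and sending the apiKey form field through the data= parameter alongside files= is an assumption about how the service reads form fields.

```python
ENDPOINT = "https://docparse.ailang.sunholo.com/general/v0/general"


def multipart_with_header(file_obj, api_key: str) -> dict:
    """Multipart upload with the key in the unstructured-api-key header."""
    return {
        "headers": {"unstructured-api-key": api_key},
        "files": {"files": file_obj},
    }


def multipart_with_form_key(file_obj, api_key: str) -> dict:
    """Multipart upload with the key sent as an apiKey form field."""
    return {
        "data": {"apiKey": api_key},
        "files": {"files": file_obj},
    }


def json_body(sample_id: str, api_key: str) -> dict:
    """JSON body request that names a sample instead of uploading a file."""
    return {
        "headers": {"unstructured-api-key": api_key},
        "json": {"filepath": sample_id},
    }
```

Each helper returns a dict that can be unpacked into a call such as requests.post(ENDPOINT, **json_body("sample_id", key)).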

Routing Pattern

For mixed-format pipelines, the recommended approach is to route Office formats through AILANG Parse and PDFs through Unstructured. A simple file-extension check is all you need:

import os
from pathlib import Path

import requests

AILANG_PARSE_URL = "https://docparse.ailang.sunholo.com"
UNSTRUCTURED_URL = "https://api.unstructured.io"

# API keys come from the same environment variables used in the curl examples
DOCPARSE_API_KEY = os.environ["DOCPARSE_API_KEY"]
UNSTRUCTURED_KEY = os.environ["UNSTRUCTURED_KEY"]

OFFICE_EXTENSIONS = {
    ".docx", ".pptx", ".xlsx",
    ".odt", ".odp", ".ods",
    ".csv", ".html", ".md", ".epub"
}

def parse_document(filepath: str) -> list[dict]:
    """Route documents to the best parser for their format."""
    ext = Path(filepath).suffix.lower()

    if ext in OFFICE_EXTENSIONS:
        # Deterministic structural parsing: no AI needed, no per-page billing.
        # Same endpoint format as Unstructured; only the URL and key differ.
        base_url, api_key = AILANG_PARSE_URL, DOCPARSE_API_KEY
    else:
        # PDFs, scanned images: Unstructured ML pipeline
        base_url, api_key = UNSTRUCTURED_URL, UNSTRUCTURED_KEY

    # The with-block closes the file handle once the upload finishes
    with open(filepath, "rb") as f:
        resp = requests.post(
            f"{base_url}/general/v0/general",
            headers={"unstructured-api-key": api_key},
            files={"files": f},
        )

    # Surface HTTP errors instead of trying to decode an error body
    resp.raise_for_status()
    return resp.json()

This gives you the best of both worlds: structural fidelity for Office documents and ML-powered extraction for PDFs and scans.
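To test the routing rule without making network calls, the extension check can be isolated in a pure function. This is a sketch: choose_backend and the backend labels are hypothetical names, and the extension set is copied from the example above.

```python
from pathlib import Path

OFFICE_EXTENSIONS = {
    ".docx", ".pptx", ".xlsx",
    ".odt", ".odp", ".ods",
    ".csv", ".html", ".md", ".epub",
}


def choose_backend(filepath: str) -> str:
    """Return which service a file would be routed to, based on its extension."""
    # Only the final suffix counts, compared case-insensitively
    ext = Path(filepath).suffix.lower()
    return "ailang-parse" if ext in OFFICE_EXTENSIONS else "unstructured"


for name in ["Report.DOCX", "scan.pdf", "notes.md", "archive.tar.gz", "README"]:
    print(f"{name} -> {choose_backend(name)}")
```

Under this rule, anything that is not a listed Office extension, including files with no extension at all, falls through to Unstructured. Whether that default suits your pipeline is a design choice worth making explicitly.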

What You Gain

Feature	Unstructured	AILANG Parse
Track Changes	None (lost in PDF conversion)	Full — insertions, deletions, author, date
Merged Cells	Flattened to text	Structural — row/column spans preserved
Comments	Dropped	Author-attributed with anchor positions
Headers / Footers	Dropped or mixed into body	Separate blocks with page context
Text Boxes	Dropped	Extracted as positioned blocks
Footnotes / Endnotes	Dropped	Preserved with reference markers
Dependencies	Python + heavy libraries	Zero — single binary, no runtime deps
Determinism	Non-deterministic (ML inference)	Deterministic — same input always gives same output
OfficeDocBench (composite)	62.1%	93.9%
Coverage-adjusted	38.7%	93.9%
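One way to check the determinism row in your own pipeline is to parse the same file twice and compare fingerprints of the two responses. The sketch below is only one possible approach: fingerprint is a hypothetical helper, and it hashes a canonical JSON encoding so that key order does not affect the result.

```python
import hashlib
import json


def fingerprint(parsed) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the response."""
    canonical = json.dumps(
        parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical responses: same content with different key order, then a real change
first = [{"type": "Title", "text": "Report"}]
second = [{"text": "Report", "type": "Title"}]
changed = [{"type": "Title", "text": "Report v2"}]

print(fingerprint(first) == fingerprint(second))   # same content
print(fingerprint(first) == fingerprint(changed))  # different content
```

If a response carries fields that legitimately vary between calls, such as request identifiers, strip them before fingerprinting.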

Self-Hosting

Run AILANG Parse locally with Docker — no external dependencies for Office formats:

# Clone and build
git clone https://github.com/sunholo-data/ailang-parse.git
cd ailang-parse
docker build -t docparse .

# Parse a document (mount your files into /data; the quotes handle paths with spaces)
docker run -v "$(pwd)":/data docparse /data/report.docx

# Parse with AI (pass your API key)
docker run -e GOOGLE_API_KEY="your-key" \
  -v "$(pwd)":/data docparse --ai gemini-2.5-flash /data/document.pdf

The Dockerfile builds AILANG from source and includes all 31 parser modules. Office parsing works immediately with zero configuration. Add a GOOGLE_API_KEY environment variable to enable AI-powered PDF and image parsing.
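For scripting the self-hosted container from Python, the docker run invocation above can be assembled as an argument list. This is a sketch: docker_parse_command and run_parse are hypothetical helpers, the image is assumed to be tagged docparse as in the build step, and reading the parse result from standard output is an assumption.

```python
import subprocess
from pathlib import Path


def docker_parse_command(filepath: str, image: str = "docparse") -> list[str]:
    """Build the docker run argument list for parsing one local file."""
    path = Path(filepath).resolve()
    return [
        "docker", "run",
        "-v", f"{path.parent}:/data",  # mount the file's directory at /data
        image,
        f"/data/{path.name}",          # path as seen inside the container
    ]


def run_parse(filepath: str) -> str:
    """Run the container and return whatever it writes to stdout (assumed)."""
    result = subprocess.run(
        docker_parse_command(filepath),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(docker_parse_command("report.docx"))
```

Passing the command as a list avoids shell quoting issues when the directory name contains spaces.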

For detailed configuration options, capability budgets, and AI model setup, see the Self-Host Guide.

Frequently Asked Questions

How do I migrate from Unstructured to AILANG Parse?

Change the URL from api.unstructured.io to docparse.ailang.sunholo.com and swap your API key. The /general/v0/general endpoint accepts the same multipart upload (-F "files=@file.docx") and the same unstructured-api-key header. See the Quick Migration section for curl and Python examples.

Is AILANG Parse a drop-in replacement for Unstructured?

For Office document parsing, yes. AILANG Parse implements the same /general/v0/general endpoint with the same multipart upload format and the same unstructured-api-key header. Change the URL and API key — your existing code works unchanged. For PDFs and scanned documents, keep using Unstructured — its ML pipeline is purpose-built for those formats.

Can I use both Unstructured and AILANG Parse together?

Yes, and this is the recommended pattern. Route Office formats (DOCX, PPTX, XLSX, ODT, ODP, ODS, CSV, HTML, Markdown, EPUB, EML, TEX, RTF) through AILANG Parse for deterministic structural parsing, and route PDFs and scanned images through Unstructured where its ML pipeline excels. A simple file-extension check in your routing layer is all you need. See the Routing Pattern section for a complete example.

What Office formats does AILANG Parse support that Unstructured doesn't parse well?

DOCX, PPTX, XLSX, ODT, ODP, ODS, CSV, HTML, Markdown, and EPUB — with full structural fidelity including track changes, merged cells, and comments.