PDF Parsing API

Parse PDFs to structured JSON via AI (Gemini, Claude, OpenAI, Ollama) or local deterministic backends (Docling, LiteParse). On born-digital papers, the local backends run up to ~40× faster than AI. Output uses the same typed blocks as Office formats, with no vendor lock-in.

PDFs Have No Semantic Structure

A PDF encodes where to draw glyphs on a page. It has no concept of "heading", "paragraph", or "table cell". Extracting document structure from PDF requires visual understanding — recognizing that larger bold text at the top of a page is a heading, that aligned columns of text form a table, that reading order flows left-to-right then top-to-bottom.

AILANG Parse delegates this visual understanding to the AI provider of your choice, then normalizes the output into the same Block ADT used for DOCX, XLSX, and PPTX parsing. Your downstream code handles all formats identically.

If your source document is a DOCX that was converted to PDF, parse the DOCX directly instead. You'll get track changes, comments, merged cells, and metadata that the PDF lost. See why DOCX > PDF for parsing.

How It Works

AILANG Parse sends the PDF to the configured AI provider with a structured extraction prompt. The AI returns content in a normalized format, which AILANG Parse validates and converts into typed blocks. The result is the same Block ADT — headings, text, tables, images — regardless of which AI provider performed the extraction.

Switch providers with a single flag. No code changes, no output format differences.

# Gemini (fastest, most cost-effective)
ailang run --entry main --caps IO,FS,Env,AI --ai gemini-2.5-flash \
  docparse/main.ail document.pdf

# Claude
ailang run --entry main --caps IO,FS,Env,AI --ai claude-sonnet-4-5-20250514 \
  docparse/main.ail document.pdf

# Ollama (fully local)
ailang run --entry main --caps IO,FS,Env,AI --ai ollama/llama3.2-vision \
  docparse/main.ail document.pdf
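To make the "single flag" point concrete, here is a minimal Python sketch that builds the commands shown above. The helper name `build_ailang_command` and the `PROVIDER_MODELS` table are illustrative assumptions, not part of AILANG Parse; the flags and model identifiers are copied from the examples.

```python
import shlex

# Model identifiers exactly as they appear in the CLI examples above.
PROVIDER_MODELS = {
    "gemini": "gemini-2.5-flash",
    "claude": "claude-sonnet-4-5-20250514",
    "ollama": "ollama/llama3.2-vision",
}


def build_ailang_command(provider: str, pdf_path: str) -> list[str]:
    """Return the argv for one parse run; only the --ai value varies."""
    model = PROVIDER_MODELS[provider]
    return [
        "ailang", "run", "--entry", "main",
        "--caps", "IO,FS,Env,AI",
        "--ai", model,
        "docparse/main.ail", pdf_path,
    ]


if __name__ == "__main__":
    # Print the three commands; pass the list to subprocess.run to execute one.
    for provider in PROVIDER_MODELS:
        print(shlex.join(build_ailang_command(provider, "document.pdf")))
```

Keeping the command as a list rather than a shell string avoids quoting problems when file names contain spaces.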

AI Providers

Gemini

Best for high-volume processing. gemini-2.5-flash offers the fastest response times and lowest per-page cost. Use gemini-2.5-pro for complex documents with dense tables or multi-column layouts.

Claude

Strong on nuanced document understanding. claude-sonnet-4-5-20250514 excels at extracting meaning from complex layouts and academic papers. Good for documents where context matters.

OpenAI

Solid general-purpose extraction. gpt-4o handles standard business documents well. Use when your team already has an OpenAI API key and wants a single provider.

Ollama (Local)

Fully local processing: no data leaves your machine. Use any vision-capable model, such as llama3.2-vision or llava. Ideal for sensitive documents, air-gapped environments, or zero-cost development.

Local Deterministic Backends (v0.19+)

As of v0.19.0, the CLI can dispatch PDF parsing to local backends instead of an AI provider. Two options ship today:

Docling (structural)

IBM Docling: local layout analysis with table detection and reading order. It matches AI structural quality on real arXiv papers and runs in ~28 s on a 100 KB paper (vs ~127 s for Gemini Flash). Free, MIT-licensed, no API key. Best when you want the headings/tables structure without an AI round-trip.

LiteParse (fastest)

run-llama LiteParse — text extraction with bounding boxes, no layout model. End-to-end ~3 s on a 100 KB PDF via the CLI (the raw parse is under a second, the rest is AILANG startup + Python import). Heading detection via a font-size heuristic. Best for RAG ingest, search indexing, or anywhere raw text is what you need.

Both run as Python subprocesses dispatched from inside AILANG via the sunholo/external_backend package, with typed error handling and stderr capture. Selection is per call — there's no global config to drift.
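The speedup figures quoted on this page follow directly from the timings above. A short worked calculation, treating the approximate timings as exact for the sake of the arithmetic:

```python
# Approximate timings quoted above for a ~100 KB born-digital paper, in seconds.
GEMINI_FLASH_S = 127
DOCLING_S = 28
LITEPARSE_END_TO_END_S = 3


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


docling_speedup = speedup(GEMINI_FLASH_S, DOCLING_S)
liteparse_speedup = speedup(GEMINI_FLASH_S, LITEPARSE_END_TO_END_S)

# 127 / 28 is roughly 4.5 (quoted as ~5x); 127 / 3 is roughly 42 (quoted as ~40x).
print(f"Docling: {docling_speedup:.1f}x, LiteParse: {liteparse_speedup:.1f}x")
```

These ratios come from a single document size, so they are best read as an order-of-magnitude guide rather than a benchmark.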

Install

The ai backend works out of the box. The local backends need their Python package installed in the same Python environment your CLI uses:

# Docling (see docling-project/docling for the full install matrix)
uv pip install docling

# LiteParse (see run-llama/liteparse)
uv pip install liteparse

# Or both
uv pip install docling liteparse

No service to run, no daemon to start — both adapters launch on demand when you pass --pdf-backend docling or --pdf-backend liteparse. If the package isn't installed, the adapter exits non-zero and AILANG surfaces a typed NonZeroExit error with the underlying ImportError captured from stderr.
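The failure mode described above (non-zero exit, stderr captured, typed error returned) can be sketched in plain Python. This illustrates the pattern only; it is not the sunholo/external_backend implementation. The `Ok` and `NonZeroExit` dataclasses and the `run_adapter` helper are hypothetical names chosen to mirror the error named in the text.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Union


@dataclass
class Ok:
    stdout: str


@dataclass
class NonZeroExit:
    code: int
    stderr: str


def run_adapter(argv: list[str], timeout_s: float = 60.0) -> Union[Ok, NonZeroExit]:
    """Run a subprocess and return a typed result instead of raising."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        # Keep stderr so the caller can surface the underlying Python error.
        return NonZeroExit(code=proc.returncode, stderr=proc.stderr)
    return Ok(stdout=proc.stdout)


if __name__ == "__main__":
    # Simulate a missing backend package by importing a module that does not exist.
    result = run_adapter([sys.executable, "-c", "import package_that_is_not_installed"])
    if isinstance(result, NonZeroExit):
        print(f"adapter failed with exit code {result.code}")
        print(result.stderr.strip().splitlines()[-1])
```

Returning a value for the failure case forces the caller to handle it explicitly, which is the same idea as the typed error the CLI surfaces.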

Usage

# AI (default — unchanged behavior)
./bin/docparse paper.pdf --pdf-backend ai --ai gemini-2.5-flash

# Docling — local, structural, ~5× faster than Gemini
./bin/docparse paper.pdf --pdf-backend docling

# LiteParse — local, fastest text extraction (~40× faster than AI end-to-end)
./bin/docparse paper.pdf --pdf-backend liteparse

Output is the same Block ADT — your pipeline code doesn't change when you switch backend. When --pdf-backend is anything other than ai, AILANG drops the AI capability and swaps in Process — no API key required, no network calls.
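The capability swap described above can be pictured as a small lookup. This is a sketch under assumptions: `caps_for_backend` is a hypothetical helper, and the base capabilities IO, FS, and Env are taken from the `--caps` lists in the earlier examples rather than from a documented default.

```python
def caps_for_backend(pdf_backend: str) -> list[str]:
    """Return the capability list a run would need for the chosen PDF backend."""
    base = ["IO", "FS", "Env"]  # assumption: mirrors --caps in the CLI examples
    if pdf_backend == "ai":
        return base + ["AI"]
    if pdf_backend in ("docling", "liteparse"):
        # Local backends run as subprocesses, so AI is replaced by Process.
        return base + ["Process"]
    raise ValueError(f"unknown PDF backend: {pdf_backend!r}")


def uses_ai_provider(pdf_backend: str) -> bool:
    """True when the run talks to an AI provider (cloud or Ollama)."""
    return "AI" in caps_for_backend(pdf_backend)


if __name__ == "__main__":
    for backend in ("ai", "docling", "liteparse"):
        print(backend, ",".join(caps_for_backend(backend)), uses_ai_provider(backend))
```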

Which Backend to Pick

| Need | Recommended Backend | Why |
| --- | --- | --- |
| Scanned PDFs, handwriting, complex multimodal | ai (Gemini / Claude / Ollama) | Only AI handles non-text content; deterministic backends can't read scans. |
| Born-digital PDF, want structure (headings, tables) | docling | Structural quality on par with AI; ~5× faster; free; offline. |
| Large batch / RAG ingest where you just need text | liteparse | Sub-second raw parse (~3 s end-to-end via the CLI); highest character recall; trivial cost. |
| Cost-sensitive production with mixed PDFs | Tier: try liteparse first, fall back to ai if confidence is low | Most born-digital PDFs parse cleanly without AI; reserve AI budget for the hard ones. |
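The tiered strategy in the last row can be sketched as follows. The page does not define how "confidence" is measured, so `text_confidence` below is a hypothetical heuristic (share of extracted characters that are alphanumeric or whitespace, with a minimum length), and the two parser callables stand in for real liteparse and ai invocations.

```python
from typing import Callable

Parser = Callable[[str], list]  # takes a PDF path, returns a list of block dicts


def text_confidence(blocks: list, min_chars: int = 200) -> float:
    """Hypothetical score in [0, 1]; 0.0 when too little text was extracted."""
    text = "".join(b.get("text", "") for b in blocks)
    if len(text) < min_chars:
        return 0.0
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    return readable / len(text)


def parse_tiered(pdf_path: str, cheap: Parser, fallback: Parser,
                 threshold: float = 0.8) -> tuple:
    """Try the cheap backend first; use the fallback only when the score is low."""
    blocks = cheap(pdf_path)
    if text_confidence(blocks) >= threshold:
        return "cheap", blocks
    return "fallback", fallback(pdf_path)


if __name__ == "__main__":
    # Stub parsers standing in for liteparse and ai runs.
    def born_digital(path):
        return [{"type": "text", "text": "plain words " * 30}]

    def scanned(path):
        return []  # a text extractor finds nothing in a scan

    def ai(path):
        return [{"type": "text", "text": "recovered by AI"}]

    print(parse_tiered("paper.pdf", born_digital, ai)[0])
    print(parse_tiered("scan.pdf", scanned, ai)[0])
```

The threshold and minimum length are placeholders to tune against your own corpus; a scanned PDF typically yields little or no text from a text extractor, which is what routes it to the AI tier here.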

What Gets Extracted

Headings & Text

The AI identifies heading hierarchy and body text from visual cues (font size, weight, position). Extracted as heading and text blocks with inferred levels.
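For intuition about how font size alone can map to inferred heading levels, here is a deterministic sketch. It does not describe what any AI provider or LiteParse does internally; the `Span` type, the `infer_heading_levels` helper, and the 1.15 ratio are all illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    font_size: float


def infer_heading_levels(spans: list, min_ratio: float = 1.15) -> list:
    """Return (level, span) pairs: 0 is body text, 1 is the largest heading size."""
    if not spans:
        return []
    # Treat the font size that covers the most characters as the body size.
    weights = Counter()
    for s in spans:
        weights[s.font_size] += len(s.text)
    body_size = weights.most_common(1)[0][0]
    # Sizes clearly larger than body text become heading levels, largest first.
    heading_sizes = sorted(
        {s.font_size for s in spans if s.font_size >= body_size * min_ratio},
        reverse=True,
    )
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [(level_of.get(s.font_size, 0), s) for s in spans]


if __name__ == "__main__":
    page = [
        Span("PDF Parsing", 24.0),
        Span("Overview", 16.0),
        Span("Body text that makes up most of the page. " * 5, 10.0),
        Span("Footnote", 8.0),
    ]
    for level, span in infer_heading_levels(page):
        print(level, span.text[:30])
```

A heuristic like this ignores weight and position, which is one reason visually complex documents benefit from a model that sees the whole page.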

Tables

Visually aligned data is extracted as structured tables with headers and rows. Tables that span page breaks are handled within the AI's visual context window.

Images & Figures

Embedded images are identified with their captions and alt text. The AI describes figure content for downstream indexing or accessibility.

Uniform Output

Regardless of AI provider, the output is the same Block ADT: heading, text, table, image. Your pipeline code never changes when you switch providers.
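One way to picture the four block kinds is as a small set of Python dataclasses with a single consumer that handles all of them. This models the idea only; the field names `level`, `text`, `rows`, and `type` follow the Python SDK example on this page, while `headers`, `caption`, and the `to_markdown` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Heading:
    level: int
    text: str
    type: str = "heading"


@dataclass
class Text:
    text: str
    type: str = "text"


@dataclass
class Table:
    headers: list
    rows: list
    type: str = "table"


@dataclass
class Image:
    caption: str = ""
    type: str = "image"


Block = Union[Heading, Text, Table, Image]


def to_markdown(blocks: list) -> str:
    """Render blocks to Markdown; the same code works for any source format."""
    lines = []
    for b in blocks:
        if isinstance(b, Heading):
            lines.append("#" * b.level + " " + b.text)
        elif isinstance(b, Text):
            lines.append(b.text)
        elif isinstance(b, Table):
            lines.append("| " + " | ".join(b.headers) + " |")
            lines.append("| " + " | ".join("---" for _ in b.headers) + " |")
            for row in b.rows:
                lines.append("| " + " | ".join(row) + " |")
        elif isinstance(b, Image):
            lines.append(f"[image: {b.caption}]")
    return "\n".join(lines)


if __name__ == "__main__":
    doc = [Heading(2, "Results"), Text("Body."), Table(["a", "b"], [["1", "2"]])]
    print(to_markdown(doc))
```

Because the consumer dispatches on block kind rather than on provider or file format, switching backends leaves this code untouched.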

Local Processing with Ollama

For sensitive documents that can't leave your network, use Ollama as the AI backend. Install Ollama, pull a vision-capable model, and point AILANG Parse at it. Zero cloud calls.

# Install Ollama and pull a vision model
ollama pull llama3.2-vision

# Parse PDF fully locally
ailang run --entry main --caps IO,FS,Env,AI \
  --ai ollama/llama3.2-vision \
  docparse/main.ail confidential-report.pdf

The output format is identical to cloud providers. You can develop locally with Ollama and deploy to production with Gemini or Claude — same code, same output structure.

Use Cases

Scanned Document Digitization

OCR-level extraction from scanned PDFs using AI vision. The model reads handwritten notes, stamp marks, and degraded print that rule-based OCR struggles with. Output is structured JSON, not raw text.

Invoice Processing

Extract line items, totals, dates, and vendor details from PDF invoices. The AI identifies table structure from visual layout, handling the wide variation in invoice formats across vendors.

Academic Paper Ingestion

Parse research papers into structured sections: abstract, methodology, results, references. The AI handles multi-column layouts, figure captions, and equation blocks that trip up rule-based PDF parsers.

Legacy Archive Migration

Convert PDF archives to structured JSON for database ingestion or knowledge base population. Process thousands of documents with batch mode, using the most cost-effective AI provider for your volume.

Compliance Document Review

Parse regulatory filings, audit reports, and policy documents from PDF into structured blocks. Feed to an LLM for automated compliance checking, clause extraction, or cross-document comparison.

Air-Gapped Environments

Process classified or restricted documents entirely on-premises using Ollama. No internet connection required after initial model download. The same Block ADT output integrates with your existing pipeline.

Try It

CLI

# Parse with Gemini (set GOOGLE_API_KEY)
GOOGLE_API_KEY="your-key" ailang run --entry main \
  --caps IO,FS,Env,AI --ai gemini-2.5-flash \
  docparse/main.ail report.pdf

# Parse with Ollama (fully local)
ailang run --entry main --caps IO,FS,Env,AI \
  --ai ollama/llama3.2-vision \
  docparse/main.ail report.pdf

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_pdf","outputFormat":"markdown","apiKey":"YOUR_API_KEY","ai":"gemini-2.5-flash"}'
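The same request can be assembled with only the Python standard library. The sketch below builds the request object without sending it; the endpoint and JSON field names are copied from the curl example, the helper name `build_parse_request` is an assumption, and the shape of the response is not documented here, so it is left unparsed.

```python
import json
import urllib.request

PARSE_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath: str, api_key: str,
                        ai: str = "gemini-2.5-flash",
                        output_format: str = "markdown") -> urllib.request.Request:
    """Build the POST request from the curl example; does not send it."""
    payload = {
        "filepath": filepath,
        "outputFormat": output_format,
        "apiKey": api_key,
        "ai": ai,
    }
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_parse_request("sample_pdf", "YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=120) and read the body.
```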

Python SDK

from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")

# Parse with AI provider
result = client.parse("report.pdf", output_format="json", ai="gemini-2.5-flash")

# Same Block ADT as Office formats
for block in result.blocks:
    if block.type == "table":
        print(f"Table: {len(block.rows)} rows")
    if block.type == "heading":
        print(f"H{block.level}: {block.text}")


Frequently Asked Questions

How do I parse a PDF to JSON?

Send the PDF to the AILANG Parse API with an AI provider specified. The AI extracts content into the same structured JSON used for Office formats.

Which AI providers are supported for PDF parsing?

Gemini (gemini-2.5-flash, gemini-2.5-pro), Claude (claude-sonnet-4-5-20250514), OpenAI (gpt-4o), and Ollama for fully local processing. Switch with a single flag.

Can I parse PDFs locally without sending data to the cloud?

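Yes. Use Ollama as the AI backend: set --ai ollama/llama3.2-vision or any vision-capable model. For born-digital PDFs, the local Docling and LiteParse backends (--pdf-backend docling or --pdf-backend liteparse) also run without an API key or network calls. No data leaves your machine.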

Why does PDF parsing require AI?

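PDFs encode glyph positions, not document structure. Extracting headings, tables, and reading order requires visual understanding or layout analysis. Scans and handwriting need AI; born-digital PDFs can also go through the local Docling or LiteParse backends. If the source is a DOCX, parse the DOCX directly.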

Is the output format the same as Office parsing?

Yes. Same Block ADT (headings, text, tables, images) regardless of provider. Your downstream code handles all formats identically.

How much does PDF parsing cost?

AILANG Parse charges per API request. AI provider costs are separate (your own key). Gemini Flash is the most cost-effective for volume. Ollama is free (your hardware).