Document Parsing Performance Guide

58 files (22.9MB, 11 formats) parsed in 2.5 seconds. This guide shows how to get the best throughput from AILANG Parse.

Batch Mode: Compile Once, Parse Many

AILANG Parse compiles 38 modules before parsing starts. In batch mode, this happens once, and every file is then parsed with no recompilation. This is the single most effective optimisation for parsing throughput.

Batch (multiple files or folder)

# Pass multiple files
docparse *.docx *.pptx *.xlsx

# Or an entire folder
docparse ~/Documents/

# Both auto-enable batch mode

58 files in 2.5s (44ms/file)

Loop (recompiles each time)

# Don't do this — each invocation
# recompiles all 38 modules
for f in *.docx; do
  docparse "$f"
done

58 files in ~25s (~430ms/file)

The docparse CLI automatically detects multiple file arguments or a folder argument and passes --batch to AILANG. No extra flags are needed.
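When the files are spread across nested directories, or need filtering first, the compile-once benefit can be kept by collecting the paths and handing them to a single invocation. The following is a minimal sketch, not part of docparse itself. It assumes a find and xargs that support -print0 and -0 (GNU and BSD both do). The function name parse_tree and the DOCPARSE override variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: parse every .docx under a directory tree with as few
# docparse invocations as possible, so module compilation is paid once.
# DOCPARSE is an illustrative override for the command name, not a docparse feature.
parse_tree() {
  root="${1:-.}"
  # -print0 / -0 keep paths containing spaces or newlines intact.
  # xargs appends all paths to one command line; it only splits the work into
  # several invocations if the list exceeds the system argument-length limit.
  find "$root" -type f -name '*.docx' -print0 |
    xargs -0 "${DOCPARSE:-docparse}"
}
```

Calling parse_tree ~/Documents would then behave like passing every matching file on one command line. Note that when nothing matches, GNU xargs still runs the command once with no arguments, whereas BSD xargs does not run it at all.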

Tracing Control

AILANG's auto-trace collector creates ~2.7M objects per run when OTEL_EXPORTER_OTLP_ENDPOINT or GOOGLE_CLOUD_PROJECT is set. The docparse CLI disables this by default for a 2–5x speedup.

| File | With Trace | No Trace | Speedup |
| --- | --- | --- | --- |
| Alice EPUB (185KB) | 2.59s | 1.26s | 2.1x |
| Moby Dick EPUB (797KB) | 9.79s | 2.82s | 3.5x |
| 10MB DOCX | 1.95s | 0.40s | 4.9x |

If you call ailang run directly (bypassing the CLI wrapper), set the environment variable yourself:

# Direct ailang invocation with tracing disabled
AILANG_NO_TRACE=1 ailang run --batch --entry main --caps IO,FS,Env \
  --max-recursion-depth 50000 docparse/main.ail file1.docx file2.pptx

# Re-enable tracing for debugging
AILANG_NO_TRACE=0 docparse report.docx
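
For scripts that call ailang run directly in several places, the default can be centralised in a small wrapper. This is a sketch under the assumption that only the AILANG_NO_TRACE variable described above is involved. The function name run_untraced is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run any command with tracing disabled, unless the
# caller has already set AILANG_NO_TRACE (for example to 0 while debugging).
run_untraced() {
  # ${VAR:-1} falls back to 1 when the variable is unset or empty.
  AILANG_NO_TRACE="${AILANG_NO_TRACE:-1}" "$@"
}
```

For example, run_untraced ailang run --batch --entry main --caps IO,FS,Env --max-recursion-depth 50000 docparse/main.ail file1.docx file2.pptx runs with tracing off, while exporting AILANG_NO_TRACE=0 beforehand keeps tracing on for that session.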

Folder Benchmark

58 files across 11 formats (DOCX, PPTX, XLSX, ODT, ODP, ODS, EPUB, HTML, Markdown, CSV, TSV), including a 10MB DOCX and 11MB PPTX. Same files, same machine, measured wall-clock time.

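| Tool | Files Parsed | Total Time | Per File | Quality |
| --- | --- | --- | --- | --- |
| Kreuzberg v4.7.2 | 55/58 | 318ms | 6ms | 71.3% |
| MarkItDown v0.1.5 | 46/58 | 905ms | 20ms | 67.9% |
| AILANG Parse v0.3.0 | 58/58 | 2.54s | 44ms | 92.2% |
| Unstructured v0.22.16 | 39/58 | 3.46s | 89ms | 62.1% |
| Pandoc v3.9.0 | 51/58 | 3.64s | 71ms | 74.6% |
| Docling v2.84.0 | 42/58 | 8.64s | 206ms | 64.0% |

No quality compromise. The two faster tools, Kreuzberg and MarkItDown, skip 3 and 12 files and score more than 20 points lower on structural quality; across all five alternatives, 3 to 19 files go unparsed. AILANG Parse parses every file at a 92.2% composite score: competitive speed, best extraction. See full benchmark results.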
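
The Per File figures above are consistent with dividing total wall-clock time by the number of files each tool actually parsed, not by the 58 files attempted. The sketch below recomputes a few of them; per_file_ms is a hypothetical helper name, and the inputs are taken from the table.

```shell
#!/bin/sh
# Recompute "Per File": total wall-clock milliseconds divided by the number
# of files actually parsed, rounded to the nearest millisecond.
per_file_ms() {
  awk -v total_ms="$1" -v parsed="$2" 'BEGIN { printf "%.0f\n", total_ms / parsed }'
}

per_file_ms 318 55    # Kreuzberg: prints 6
per_file_ms 2540 58   # AILANG Parse: prints 44
per_file_ms 3460 39   # Unstructured: prints 89
```

Because the divisor is the parsed count, a tool that skips many files is not penalised in this column for the files it skipped, which is worth keeping in mind when comparing rows.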

CLI vs API Server

CLI (docparse)

Best for: batch processing folders, CI/CD pipelines, one-off conversions.

  • Batch mode amortises compilation
  • Zero network overhead
  • Processes local files directly

API Server

Best for: web apps, real-time parsing, multi-user access.

  • AILANG modules stay compiled in memory
  • No startup cost per request
  • Sub-100ms response for most files

For sustained throughput (e.g., processing uploads), the API server avoids compilation overhead entirely. For batch jobs against local files, the CLI with folder input is optimal.

Single-File Overhead

Parsing a single file via docparse report.docx carries ~400–500ms of overhead regardless of file size. This is AILANG runtime startup (module compilation), not parse time: as the table below shows, the actual parse of a 5KB DOCX takes about 2ms in batch mode.

| File | Single File | In Batch | Startup Overhead |
| --- | --- | --- | --- |
| sample.docx (5KB) | 411ms | ~2ms | ~409ms |
| tables.docx (31KB) | 466ms | ~5ms | ~461ms |
| 10MB DOCX | 472ms | ~70ms | ~402ms |
| Moby Dick EPUB (797KB) | 2.78s | ~2.3s | ~480ms |

AILANG incremental compilation caching (M-INCREMENTAL-TYPECHECK) will reduce startup further in a future release.
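
The effect of startup overhead on a whole job can be estimated with a rough cost model. This is a simplifying assumption for illustration, not a formula published for AILANG: a loop pays startup on every file, while batch mode pays it once. The function name estimate_ms is hypothetical, and the example inputs are the sample.docx figures from the table above.

```shell
#!/bin/sh
# Rough cost model (an assumption, not a measured AILANG formula):
#   loop  = n * (startup + parse)   (every invocation recompiles)
#   batch = startup + n * parse     (compilation is paid once)
# All times are integer milliseconds.
estimate_ms() {
  n=$1; startup_ms=$2; parse_ms=$3
  echo "loop=$(( n * (startup_ms + parse_ms) )) batch=$(( startup_ms + n * parse_ms ))"
}

# 58 small files, ~409ms startup, ~2ms parse each
estimate_ms 58 409 2   # prints loop=23838 batch=525
```

The loop estimate of roughly 24 seconds lines up with the ~25s measured for the per-file loop. The measured batch total of 2.5s is higher than this estimate because the benchmark set includes large files whose parse time is far above 2ms.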

Quick Performance Tips

  1. Always batch. Pass folders or globs: docparse ~/inbox/ or docparse *.docx *.xlsx
  2. Use the CLI wrapper. docparse sets AILANG_NO_TRACE=1 and auto-enables batch mode.
  3. Use the API for real-time. Compiled modules stay in memory, so every request is pure parse time.
  4. Skip AI for Office files. DOCX/PPTX/XLSX/ODF parsing is deterministic and needs no AI model. Images still require --ai; PDFs default to --ai but can opt into local deterministic backends via --pdf-backend docling (~5× faster) or --pdf-backend liteparse (~40× faster) — see PDF parsing.
  5. Profile if needed. Pass -cpuprofile profile.out or -memprofile mem.out to ailang run for Go pprof analysis.

Frequently Asked Questions

How fast is AILANG Parse compared to other document parsers?

AILANG Parse handles 58 mixed-format files (22.9MB, 11 formats) in 2.5 seconds using batch mode. That is faster than Unstructured (3.5s), Pandoc (3.6s), and Docling (8.6s), and it is the only tool in the benchmark that parses every file.

Why is AILANG Parse slow on a single file?

Single-file invocations pay ~400–500ms of AILANG runtime startup. Use batch mode (pass multiple files or a folder) to amortise this; on the 58-file benchmark, the average drops to ~44ms per file.

How do I enable batch mode?

The docparse CLI enables it automatically when you pass multiple files or a folder. Just run docparse ~/Documents/. No flags needed.

What does AILANG_NO_TRACE do?

It disables AILANG's auto-trace collector, which creates ~2.7M objects per run when OTEL_EXPORTER_OTLP_ENDPOINT or GOOGLE_CLOUD_PROJECT is set. Disabling it gives a 2–5x speedup. The docparse CLI sets AILANG_NO_TRACE=1 by default.