The Round-Trip Problem
A DOCX file is a ZIP archive of XML. Track changes, comments, merged cells, headers, and footers (and, in PPTX, speaker notes) are all right there in the markup, addressable and typed. Converting to PDF flattens this into positioned glyphs on a page. An ML model then has to guess the structure back from glyph coordinates.
Two lossy steps (PDF conversion, then ML reconstruction) versus zero. That is a categorically different approach.
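To make "addressable and typed" concrete, here is a minimal sketch using only the Python standard library. It is not AILANG Parse's implementation. It builds a tiny, hand-written `word/document.xml` inside an in-memory ZIP (a stand-in for a real DOCX, which would also carry content types and relationships) and reads the WordprocessingML `w:ins` and `w:del` revision elements directly. The sample text, authors, and dates are invented for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical document body with one tracked deletion and one tracked insertion.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">The fee is </w:t></w:r>
      <w:del w:id="1" w:author="Alice" w:date="2026-03-01T10:00:00Z">
        <w:r><w:delText>100</w:delText></w:r>
      </w:del>
      <w:ins w:id="2" w:author="Bob" w:date="2026-03-02T09:30:00Z">
        <w:r><w:t>120</w:t></w:r>
      </w:ins>
    </w:p>
  </w:body>
</w:document>"""


def build_docx_bytes() -> bytes:
    """Pack the sample XML into an in-memory ZIP at the path a DOCX uses."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML.encode("utf-8"))
    return buf.getvalue()


def tracked_changes(docx_bytes: bytes) -> list[dict]:
    """Return insertions, then deletions, with author, date, and text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    changes = []
    # Inserted runs keep their text in w:t; deleted runs use w:delText.
    for kind, tag, text_tag in (("insert", "ins", "t"), ("delete", "del", "delText")):
        for el in root.iter(f"{{{W}}}{tag}"):
            text = "".join(t.text or "" for t in el.iter(f"{{{W}}}{text_tag}"))
            changes.append({
                "type": kind,
                "author": el.get(f"{{{W}}}author"),
                "date": el.get(f"{{{W}}}date"),
                "text": text,
            })
    return changes


changes = tracked_changes(build_docx_bytes())
for change in changes:
    print(change)
```

Nothing here is inferred: author, date, and change type are attributes and element names in the file. Once the document is rendered to PDF, none of them exist to be recovered.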
Feature-by-Feature Comparison
What each tool preserves when parsing a DOCX, PPTX, or XLSX file. Data from OfficeDocBench (March 2026).
| Feature | AILANG Parse | Unstructured | Docling | MarkItDown |
|---|---|---|---|---|
| Parsing method | Direct XML | PDF conversion + ML | PDF conversion + ML | mammoth / pandas wrappers |
| Track changes | ✓ Full (author, date, type) | ✕ None | ✕ None | ✕ None |
| Merged cells | ✓ Structural (colspan) | ✕ Flattened | ✕ Flattened | ✕ Flattened |
| Comments | ✓ Author-attributed | ✕ Dropped | ✕ Dropped | ✕ Dropped |
| Headers / footers | ✓ Per-section | ✕ Lost | ✕ Lost | ~ Basic |
| Text boxes | ✓ Preserved | ✕ Dropped | ✕ Dropped | ✕ Dropped |
| Speaker notes (PPTX) | ✓ Preserved | ✕ Dropped | ✕ Dropped | ✕ Dropped |
| Runtime deps | ✓ None | Python + heavy libs | Python + PyTorch | Python + mammoth |
| Browser / WASM | ✓ Yes | ✕ No | ✕ No | ✕ No |
| Speed (DOCX) | <11 ms | ~2 s | ~3 s | ~500 ms |
| OfficeDocBench (composite) | 93.9% | 62.1% | 63.3% | 67.9% |
| Coverage-adjusted | 93.9% | 38.7% | 38.5% | 51.2% |
Every ✕ in the feature rows represents data that existed in the source file and was thrown away during conversion. For legal, compliance, and audit workflows, this is not acceptable.
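As an illustration of what "structural" means for merged cells, the sketch below reads the horizontal span of each cell from WordprocessingML's `w:gridSpan` element. It is a simplified, assumption-laden example rather than any tool's actual code: the table fragment is invented, vertical merges (`w:vMerge`) are ignored, and only top-level rows are visited.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag: str) -> str:
    """Qualify a local tag name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


# Hypothetical table: the first cell of the header row spans two grid columns.
TABLE_XML = f"""<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc>
      <w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
      <w:p><w:r><w:t>Q1 totals</w:t></w:r></w:p>
    </w:tc>
    <w:tc><w:p><w:r><w:t>Notes</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>Jan</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Feb</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>n/a</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""


def table_cells(tbl: ET.Element) -> list[list[tuple[str, int]]]:
    """Return each row as a list of (text, colspan) pairs."""
    rows = []
    for tr in tbl.findall(q("tr")):
        row = []
        for tc in tr.findall(q("tc")):
            span = tc.find(f"{q('tcPr')}/{q('gridSpan')}")
            # A cell without w:gridSpan occupies exactly one grid column.
            colspan = int(span.get(q("val"))) if span is not None else 1
            text = "".join(t.text or "" for t in tc.iter(q("t")))
            row.append((text, colspan))
        rows.append(row)
    return rows


rows = table_cells(ET.fromstring(TABLE_XML))
print(rows)
```

The span is an explicit integer in the markup. A PDF-first pipeline has to estimate it from the geometry of ruled lines and text positions.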
Using Alongside Other Tools
PDFs need ML — there is no XML to read. But Office documents are not PDFs, and routing them through a PDF pipeline destroys information for no reason.
The practical architecture: use AILANG Parse for Office formats where it reads the XML directly, and your preferred tool for PDFs where ML is genuinely needed.
Before: everything through one ML pipeline
# Everything through a PDF-first pipeline
from unstructured.partition.auto import partition
# Converts DOCX to PDF internally,
# then runs ML on the PDF.
# Track changes, comments, merged cells: gone.
elements = partition(filename="report.docx")
After: right tool for each format
from ailang_parse import AilangParse
from unstructured.partition.pdf import partition_pdf

filename = "report.docx"  # example input path

if filename.lower().endswith(('.docx', '.pptx', '.xlsx')):
    # Direct XML parse, no per-page billing, full structure
    blocks = AilangParse().parse(filename)
else:
    # ML pipeline where it's actually needed
    elements = partition_pdf(filename)
Keep your PDF pipeline exactly as it is. Add AILANG Parse for the formats where deterministic parsing produces categorically better results.
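One way to keep that split tidy is to isolate the routing decision from the parsers themselves. The sketch below is a hypothetical helper, not part of either library: `pick_route` and `parse_document` are names invented here, and the parsers are passed in as plain callables so the example runs without AILANG Parse or Unstructured installed.

```python
from pathlib import Path
from typing import Any, Callable

# Office Open XML formats that can be read directly as XML.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def pick_route(filename: str) -> str:
    """Return "office" for Office formats, "pdf" for everything else."""
    # Lower-casing the suffix means "Report.DOCX" is routed like "report.docx".
    return "office" if Path(filename).suffix.lower() in OFFICE_SUFFIXES else "pdf"


def parse_document(
    filename: str,
    parse_office: Callable[[str], Any],
    parse_pdf: Callable[[str], Any],
) -> Any:
    """Dispatch to the injected parser that matches the file's route."""
    handler = parse_office if pick_route(filename) == "office" else parse_pdf
    return handler(filename)


# Stand-in parsers for demonstration; in practice these would wrap
# the Office parser and the existing PDF pipeline respectively.
office_result = parse_document("Report.DOCX", lambda f: ("office", f), lambda f: ("pdf", f))
pdf_result = parse_document("scan.pdf", lambda f: ("office", f), lambda f: ("pdf", f))
print(office_result, pdf_result)
```

Injecting the parsers keeps the existing PDF pipeline untouched and makes the routing rule easy to unit test on its own.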
See for Yourself
Upload a DOCX with track changes, a PPTX with speaker notes, or an XLSX with merged cells. Compare the output with whatever you are currently using.
Frequently Asked Questions
What exactly is lost when converting DOCX to PDF before parsing?
Converting DOCX to PDF destroys five categories of data: (1) track changes and revision history, (2) merge attributes on table cells, (3) comments and annotations with author metadata, (4) semantic heading levels (PDF has no heading concept), and (5) document metadata like custom properties. AILANG Parse preserves all five by reading Office XML directly.
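Category (4) is easy to see in the markup. The sketch below, again using only the standard library, recovers a heading outline from paragraph style references (`w:pStyle`). It rests on an assumption worth stating: it treats style IDs of the form `Heading1`, `Heading2`, and so on as headings, which matches Word's built-in English styles but is not guaranteed for every document. A more robust reader would resolve outline levels through `word/styles.xml`. The sample body is invented.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag: str) -> str:
    """Qualify a local tag name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


# Hypothetical body: two headings with one unstyled paragraph between them.
BODY_XML = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Scope</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>This agreement covers the services listed below.</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
    <w:r><w:t>Definitions</w:t></w:r>
  </w:p>
</w:body>"""


def heading_outline(body: ET.Element) -> list[tuple[int, str]]:
    """Return (level, text) for paragraphs whose style ID looks like HeadingN."""
    outline = []
    for p in body.iter(q("p")):
        style = p.find(f"{q('pPr')}/{q('pStyle')}")
        if style is None:
            continue  # Unstyled paragraph: body text, not a heading.
        match = re.fullmatch(r"Heading(\d)", style.get(q("val"), ""))
        if match:
            text = "".join(t.text or "" for t in p.iter(q("t")))
            outline.append((int(match.group(1)), text))
    return outline


outline = heading_outline(ET.fromstring(BODY_XML))
print(outline)
```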
Why do wrapper-based parsers have quality problems with DOCX parsing?
Most wrapper-based parsers use libraries designed for display rendering, not structural extraction. They miss track changes, flatten merged cells, and drop comments. On OfficeDocBench, the next-best alternative scores 71.1% composite (68.0% coverage-adjusted) versus AILANG Parse's 93.9% with 100% format coverage.
Is PDF parsing always worse than direct DOCX parsing?
When the original source is a DOCX, direct XML parsing is strictly better: it is more accurate, involves no per-page billing, and preserves metadata that PDF rendering destroys. But if your source is a born-digital PDF or a scanned document, you need an ML-based PDF parser. AILANG Parse handles both: deterministic XML extraction for Office formats and pluggable AI backends (Gemini, Claude, Ollama) for PDFs.