DOCX Parsing API — Parse Word Documents to Structured JSON

Q: How do I parse a DOCX file to JSON?

Send the .docx file to the AILANG Parse API or use the Python/JS/Go SDK. The parser reads Office Open XML directly and returns structured JSON with typed blocks for paragraphs, tables, images, and metadata.

Q: Are comments extracted from DOCX files?

Yes. Each comment is extracted with author, date, the anchored text range, and reply threads. Comments appear as typed blocks alongside the paragraphs they reference.

Q: How are tables with merged cells handled?

Merged cells (gridSpan for horizontal, vMerge for vertical) are resolved into a clean row/column structure with explicit merge metadata. No guessing from visual layout.

Q: Does AILANG Parse convert DOCX to PDF first?

No. AILANG Parse reads the Office Open XML directly — the .docx zip contains document.xml, styles.xml, and relationship files that encode the full document structure. No rendering or PDF conversion step.

Q: How fast is DOCX parsing?

Typical DOCX files parse in under 50ms because the parser reads XML directly, with no external dependencies, no subprocess spawning, and no rendering engine. On accuracy, AILANG Parse scores 93.9% composite on OfficeDocBench.

Your DOCX Is Already Structured Data

A .docx file is a zip archive containing XML. Paragraphs, styles, tables, revisions, comments — all encoded in document.xml, styles.xml, and relationship files. Most parsers ignore this and render the document to PDF or plain text first, destroying the structure in the process.
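As a rough illustration of the archive layout described above, the sketch below uses only Python's standard library to list the XML parts inside a .docx. The helper names `list_docx_parts` and `build_sample_docx` are hypothetical, and the sample archive is a stand-in with the usual part names, not a complete Word document. This is not how AILANG Parse itself is implemented.

```python
import io
import zipfile


def list_docx_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of the XML and relationship parts in a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(
            name for name in archive.namelist() if name.endswith((".xml", ".rels"))
        )


def build_sample_docx() -> bytes:
    """Build a tiny stand-in archive with typical part names (hypothetical sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/styles.xml", "<styles/>")
        archive.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    return buffer.getvalue()


parts = list_docx_parts(build_sample_docx())
for part in parts:
    print(part)
```

For a real file, pass the result of `open("report.docx", "rb").read()` to `list_docx_parts` instead of the sample bytes.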

AILANG Parse reads the Office Open XML directly. Track changes get author and timestamp. Merged cells get explicit gridSpan/vMerge metadata. Comments stay anchored to their text ranges. Nothing is flattened, nothing is lost.

AILANG Parse scores 93.9% on OfficeDocBench (69 real-world files, 100% format coverage). PDF-first parsers score 33-68% coverage-adjusted on the same benchmark because rendering destroys structural metadata. See benchmarks.

Raw Office XML vs Structured Output

Raw document.xml

<w:p>
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Q1 Revenue</w:t></w:r>
</w:p>
<w:p>
  <w:r><w:rPr><w:b/></w:rPr>
    <w:t>Total: </w:t></w:r>
  <w:ins w:id="1" w:author="Alice"
         w:date="2026-03-15T10:30:00Z">
    <w:r><w:t>$2.4M</w:t></w:r>
  </w:ins>
  <w:del w:id="2" w:author="Alice"
         w:date="2026-03-15T10:30:00Z">
    <w:r><w:delText>$2.1M</w:delText></w:r>
  </w:del>
</w:p>
<w:tbl>
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/>
    </w:tcPr>
    <w:p><w:r><w:t>Region</w:t>...
  </w:tc></w:tr>
</w:tbl>

Structured output

{
  "metadata": {
    "title": "Q1 Revenue Report",
    "author": "Finance Team",
    "created": "2026-03-01T09:00:00Z"
  },
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Q1 Revenue"
    },
    {
      "type": "text",
      "text": "Total: $2.4M",
      "style": "Normal",
      "changes": [
        {
          "type": "insert",
          "author": "Alice",
          "date": "2026-03-15T10:30:00Z",
          "text": "$2.4M"
        },
        {
          "type": "delete",
          "author": "Alice",
          "date": "2026-03-15T10:30:00Z",
          "text": "$2.1M"
        }
      ]
    },
    {
      "type": "table",
      "headers": [{"text":"Region","gridSpan":2}],
      "rows": [...]
    }
  ]
}

What Gets Extracted

Paragraphs & Styles

Every paragraph preserves its style name (Heading 1, Normal, ListBullet, etc.) and inline formatting. Run-level bold, italic, underline, and font changes are captured.

Track Changes

Insertions, deletions, and format changes with author name, timestamp, and original vs revised text. Full revision history, not just the accepted result.

Comments & Replies

Each comment extracted with author, date, anchored text range, and threaded replies. Comments stay associated with the paragraph they reference.

Tables & Merged Cells

Full table structure with gridSpan (horizontal merge) and vMerge (vertical merge) resolved into explicit metadata. Header rows identified.

Headers, Footers & Text Boxes

Header and footer content extracted as separate section blocks. Text boxes (w:txbxContent) extracted inline at their anchor position in the document flow.

Images & Metadata

Embedded images identified by relationship ID with dimensions and alt text. Document metadata (title, author, created date, revision count) extracted from the core properties part (docProps/core.xml).
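To make the metadata source concrete, here is a minimal sketch of reading the core properties part with Python's standard library. The function name `read_core_properties` is hypothetical, and the sample XML is illustrative: its title, author, and created values echo the example output above, while the revision value is made up. The namespaces are the standard OPC core-properties and Dublin Core ones. The parser's real implementation may differ.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard namespaces used by docProps/core.xml
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative sample; the revision number is invented for the example
SAMPLE_CORE_XML = """<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Q1 Revenue Report</dc:title>
  <dc:creator>Finance Team</dc:creator>
  <dcterms:created xsi:type="dcterms:W3CDTF">2026-03-01T09:00:00Z</dcterms:created>
  <cp:revision>7</cp:revision>
</cp:coreProperties>"""


def read_core_properties(core_xml: str) -> dict[str, Optional[str]]:
    """Pull title, author, created date, and revision count out of core.xml text."""
    root = ET.fromstring(core_xml)
    return {
        "title": root.findtext("dc:title", namespaces=NS),
        "author": root.findtext("dc:creator", namespaces=NS),
        "created": root.findtext("dcterms:created", namespaces=NS),
        "revision": root.findtext("cp:revision", namespaces=NS),
    }


metadata = read_core_properties(SAMPLE_CORE_XML)
print(metadata)
```

Missing elements come back as `None`, so a document without a title does not raise.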

Track Changes

DOCX track changes encode the full editing history: who changed what, when, and what the text looked like before. AILANG Parse extracts every revision as a structured change block with type (insert/delete/formatChange), author, date, and the affected text.

This is the metadata that disappears when you render to PDF. A legal team reviewing a contract needs to see that "Alice deleted net-30 and inserted net-60 on March 15th" — not just the final text.

{
  "type": "text",
  "text": "Payment terms: net-60 days from invoice date.",
  "changes": [
    {"type": "delete", "author": "Alice Chen", "date": "2026-03-15T14:22:00Z", "text": "net-30"},
    {"type": "insert", "author": "Alice Chen", "date": "2026-03-15T14:22:00Z", "text": "net-60"}
  ]
}

See Track Changes extraction for the full specification.
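The mapping from revision markup to change records can be sketched with Python's standard library. This is an illustration under stated assumptions, not the parser's actual code: the function name `extract_changes` is hypothetical, the sample paragraph mirrors the raw document.xml excerpt shown earlier, and only w:ins and w:del elements that are direct children of the paragraph are handled.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

SAMPLE_PARAGRAPH = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>Total: </w:t></w:r>
  <w:ins w:id="1" w:author="Alice" w:date="2026-03-15T10:30:00Z">
    <w:r><w:t>$2.4M</w:t></w:r>
  </w:ins>
  <w:del w:id="2" w:author="Alice" w:date="2026-03-15T10:30:00Z">
    <w:r><w:delText>$2.1M</w:delText></w:r>
  </w:del>
</w:p>"""


def extract_changes(paragraph_xml: str) -> list[dict]:
    """Collect insertions and deletions that sit directly under one paragraph."""
    paragraph = ET.fromstring(paragraph_xml)
    changes = []
    for element in paragraph:
        if element.tag == f"{{{W}}}ins":
            kind, text_path = "insert", ".//w:t"
        elif element.tag == f"{{{W}}}del":
            # Deleted runs store their text in w:delText rather than w:t
            kind, text_path = "delete", ".//w:delText"
        else:
            continue
        text = "".join(node.text or "" for node in element.iterfind(text_path, NS))
        changes.append(
            {
                "type": kind,
                "author": element.get(f"{{{W}}}author"),
                "date": element.get(f"{{{W}}}date"),
                "text": text,
            }
        )
    return changes


changes = extract_changes(SAMPLE_PARAGRAPH)
for change in changes:
    print(change)
```

Format changes and revisions nested deeper in the tree would need additional handling that this sketch leaves out.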

Tables & Merged Cells

Office tables use gridSpan for horizontal merges and vMerge for vertical merges. AILANG Parse resolves these into a clean row/column structure with explicit merge metadata — no heuristic guessing from visual alignment.

{
  "type": "table",
  "headers": [
    {"text": "Region", "gridSpan": 2},
    {"text": "Q1"},
    {"text": "Q2"}
  ],
  "rows": [
    [{"text": "EMEA"}, {"text": "UK"}, {"text": "$125K"}, {"text": "$140K"}],
    [{"text": "EMEA", "vMerge": "continue"}, {"text": "DE"}, {"text": "$98K"}, {"text": "$112K"}]
  ]
}

See Tables & Merged Cells for the full specification.
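One way downstream code might consume the merge metadata is to expand each spanning cell into a rectangular grid. The sketch below assumes the table shape shown in the example above (a `headers` list, a `rows` list, an optional `gridSpan` per cell); the helper names `expand_row` and `to_grid` are hypothetical and not part of any SDK.

```python
def expand_row(cells: list[dict]) -> list[str]:
    """Repeat each cell's text across the columns it spans (gridSpan defaults to 1)."""
    expanded = []
    for cell in cells:
        expanded.extend([cell["text"]] * cell.get("gridSpan", 1))
    return expanded


def to_grid(table: dict) -> list[list[str]]:
    """Turn a table block into a rectangular list of rows, header row first."""
    grid = [expand_row(table["headers"])]
    grid.extend(expand_row(row) for row in table["rows"])
    widths = {len(row) for row in grid}
    if len(widths) != 1:
        raise ValueError(f"rows have inconsistent widths: {sorted(widths)}")
    return grid


table = {
    "type": "table",
    "headers": [{"text": "Region", "gridSpan": 2}, {"text": "Q1"}, {"text": "Q2"}],
    "rows": [
        [{"text": "EMEA"}, {"text": "UK"}, {"text": "$125K"}, {"text": "$140K"}],
        [{"text": "EMEA", "vMerge": "continue"}, {"text": "DE"}, {"text": "$98K"}, {"text": "$112K"}],
    ],
}

grid = to_grid(table)
for row in grid:
    print(row)
```

Because the spans are explicit, the width check can catch malformed rows without inspecting any visual layout.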

Comments

DOCX comments are anchored to specific text ranges via w:commentRangeStart/w:commentRangeEnd markers. AILANG Parse extracts each comment with its author, timestamp, the anchored text, and any reply threads.

{
  "type": "text",
  "text": "The delivery timeline is aggressive.",
  "comments": [
    {
      "author": "Bob Martinez",
      "date": "2026-03-20T09:15:00Z",
      "text": "Can we push this to Q3?",
      "replies": [
        {"author": "Alice Chen", "date": "2026-03-20T10:02:00Z", "text": "Agreed, updated."}
      ]
    }
  ]
}

See Comment extraction for the full specification.
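A short sketch of walking a comment thread, assuming the block shape shown in the example above (a `comments` list whose entries may carry `replies`). The function name `flatten_comments` is hypothetical; it simply flattens each thread into (depth, author, text) tuples for display or export.

```python
def flatten_comments(block: dict) -> list[tuple[int, str, str]]:
    """Return (depth, author, text) for every comment and reply attached to a block."""
    flat = []

    def walk(comment: dict, depth: int) -> None:
        flat.append((depth, comment["author"], comment["text"]))
        for reply in comment.get("replies", []):
            walk(reply, depth + 1)

    for comment in block.get("comments", []):
        walk(comment, 0)
    return flat


block = {
    "type": "text",
    "text": "The delivery timeline is aggressive.",
    "comments": [
        {
            "author": "Bob Martinez",
            "date": "2026-03-20T09:15:00Z",
            "text": "Can we push this to Q3?",
            "replies": [
                {"author": "Alice Chen", "date": "2026-03-20T10:02:00Z", "text": "Agreed, updated."}
            ],
        }
    ],
}

thread = flatten_comments(block)
for depth, author, text in thread:
    print("  " * depth + f"{author}: {text}")
```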

Use Cases

Legal Document Review

Extract redlines from contracts with full author/date attribution. Track changes show exactly who modified which clause and when — the audit trail that disappears in PDF. Feed structured revisions to an LLM for clause-by-clause comparison.

Contract Analysis

Parse contracts into typed blocks: headings map to clause structure, tables capture obligation matrices, comments surface negotiation context. Downstream code identifies key terms, dates, and parties without regex on raw text.

Regulatory Compliance

Audit trails require knowing who changed what and when. Track changes metadata provides this directly. Parse submission documents, compare revisions programmatically, and generate compliance reports from structured data instead of manual review.

Academic Paper Processing

Extract heading hierarchy, citation tables, figure references, and reviewer comments from manuscript DOCX files. The structured output feeds directly into reference management, plagiarism detection, or automated formatting pipelines.

Document Migration

Convert DOCX archives to any output format: Markdown for docs-as-code, HTML for web publishing, Quarto for reproducible reports. Structure preservation means headings, tables, and images survive the conversion intact.

AI Document Understanding

Feed structured JSON to LLMs instead of raw text. The model sees typed blocks (heading, table, comment) with metadata — not a flat string. Token efficiency improves and the model can reason about document structure, not just content.
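As a sketch of how typed blocks might be turned into model-ready context, the function below renders heading and text blocks as Markdown-style lines and annotates revisions inline. The name `render_for_prompt` and the annotation format are assumptions made for illustration; the block shapes follow the structured output example earlier on this page.

```python
def render_for_prompt(blocks: list[dict]) -> str:
    """Render heading and text blocks as Markdown-like lines, with revisions annotated."""
    lines = []
    for block in blocks:
        if block["type"] == "heading":
            lines.append("#" * block["level"] + " " + block["text"])
        elif block["type"] == "text":
            lines.append(block["text"])
            for change in block.get("changes", []):
                lines.append(
                    f"  [{change['type']} by {change['author']} "
                    f"on {change['date']}: {change['text']}]"
                )
        else:
            # Other block types are only named here; rendering them is out of scope
            lines.append(f"[{block['type']} block omitted]")
    return "\n".join(lines)


sample_blocks = [
    {"type": "heading", "level": 1, "text": "Q1 Revenue"},
    {
        "type": "text",
        "text": "Total: $2.4M",
        "changes": [
            {"type": "insert", "author": "Alice", "date": "2026-03-15T10:30:00Z", "text": "$2.4M"},
            {"type": "delete", "author": "Alice", "date": "2026-03-15T10:30:00Z", "text": "$2.1M"},
        ],
    },
]

prompt_text = render_for_prompt(sample_blocks)
print(prompt_text)
```

Keeping the block type visible in the rendered text lets the model distinguish a heading from body text without guessing from formatting.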

Try It

CLI

# Parse a DOCX file
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail report.docx

# Convert DOCX to Markdown
./bin/docparse report.docx --convert output.md

# Convert DOCX to HTML
./bin/docparse report.docx --convert output.html

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_docx_formatting","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'
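The same request can be assembled from Python with only the standard library. This sketch mirrors the curl call above field for field; the helper name `build_parse_request` is hypothetical, and the request is built but not sent, since the response format is not described here.

```python
import json
import urllib.request

# Endpoint and payload fields copied from the curl example above
PARSE_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath: str, output_format: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example."""
    payload = {"filepath": filepath, "outputFormat": output_format, "apiKey": api_key}
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_parse_request("sample_docx_formatting", "markdown", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it; replace `YOUR_API_KEY` with a real key first.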

Python SDK

from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("contract.docx", output_format="json")

# Document metadata
print(result.metadata.title)
print(result.metadata.author)

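# Past-tense labels for the change types (appending "d" would print "insertd")
VERBS = {"insert": "inserted", "delete": "deleted", "formatChange": "reformatted"}

# Iterate structured blocks
for block in result.blocks:
    if block.type == "table":
        print(f"Table: {len(block.rows)} rows, merged={block.has_merges}")
    # Not every block carries revisions, so fall back to an empty list
    for change in getattr(block, "changes", None) or []:
        verb = VERBS.get(change.type, change.type)
        print(f"{change.author} {verb}: {change.text}")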

Frequently Asked Questions

How do I parse a DOCX file to JSON?

Send the .docx file to the AILANG Parse API or use the Python/JS/Go SDK. The parser reads Office Open XML directly and returns structured JSON with typed blocks.

Does DOCX parsing preserve track changes?

Yes. Every insertion, deletion, and format change is extracted with author, timestamp, and original vs revised text. See track changes docs.

Are comments extracted from DOCX files?

Yes. Each comment includes author, date, anchored text range, and reply threads. See comment extraction docs.

How are tables with merged cells handled?

Merged cells (gridSpan for horizontal, vMerge for vertical) are resolved into a clean row/column structure with explicit merge metadata.

Does AILANG Parse convert DOCX to PDF first?

No. It reads Office Open XML directly from the .docx zip archive. No rendering, no PDF conversion, no LibreOffice dependency. See why this matters.

What about headers, footers, and text boxes?

Headers and footers are extracted as separate section blocks. Text boxes (w:txbxContent) are extracted inline at their anchor position.

How fast is DOCX parsing?

Typical files parse in under 50ms, with no external dependencies and no subprocess spawning. On accuracy, AILANG Parse scores 93.9% composite on OfficeDocBench.