Your DOCX Is Already Structured Data
A .docx file is a zip archive containing XML. Paragraphs, styles, tables, revisions, and comments are all encoded in parts such as document.xml, styles.xml, comments.xml, and the relationship files. Most parsers ignore this and render the document to PDF or plain text first, destroying the structure in the process.
AILANG Parse reads the Office Open XML directly. Track changes get author and timestamp. Merged cells get explicit gridSpan/vMerge metadata. Comments stay anchored to their text ranges. Nothing is flattened, nothing is lost.
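The container format is easy to inspect with nothing but the Python standard library. The sketch below builds a tiny stand-in archive in memory and lists its parts. The part names follow the usual OOXML layout, but the content is a minimal illustration (not a file Word would open), and the helper names are hypothetical. It shows the container only and says nothing about how AILANG Parse is implemented.

```python
import io
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Minimal stand-in for word/document.xml (illustrative content only).
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Q1 Revenue</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_sample_archive() -> bytes:
    """Write a tiny zip with DOCX-style part names into memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
        zf.writestr("word/styles.xml", f'<w:styles xmlns:w="{W_NS}"/>')
    return buf.getvalue()


def list_parts(data: bytes) -> list[str]:
    """Return the sorted names of the parts stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return sorted(zf.namelist())


def read_part(data: bytes, name: str) -> str:
    """Return one part of the archive as decoded XML text."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name).decode("utf-8")


sample = build_sample_archive()
print(list_parts(sample))
print(read_part(sample, "word/document.xml"))
```

Running the same two helpers against the bytes of a real .docx should list the additional parts (relationships, comments, core properties) that a given file happens to contain.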
Raw Office XML vs Structured Output
<w:p>
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Q1 Revenue</w:t></w:r>
</w:p>
<w:p>
  <w:r><w:rPr><w:b/></w:rPr>
    <w:t>Total: </w:t></w:r>
  <w:ins w:id="1" w:author="Alice"
         w:date="2026-03-15T10:30:00Z">
    <w:r><w:t>$2.4M</w:t></w:r>
  </w:ins>
  <w:del w:id="2" w:author="Alice"
         w:date="2026-03-15T10:30:00Z">
    <w:r><w:delText>$2.1M</w:delText></w:r>
  </w:del>
</w:p>
<w:tbl>
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p><w:r><w:t>Region</w:t>...
    </w:tc></w:tr>
</w:tbl>
{
  "metadata": {
    "title": "Q1 Revenue Report",
    "author": "Finance Team",
    "created": "2026-03-01T09:00:00Z"
  },
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Q1 Revenue"
    },
    {
      "type": "text",
      "text": "Total: $2.4M",
      "style": "Normal",
      "changes": [
        {
          "type": "insert",
          "author": "Alice",
          "date": "2026-03-15T10:30:00Z",
          "text": "$2.4M"
        },
        {
          "type": "delete",
          "author": "Alice",
          "date": "2026-03-15T10:30:00Z",
          "text": "$2.1M"
        }
      ]
    },
    {
      "type": "table",
      "headers": [{"text":"Region","gridSpan":2}],
      "rows": [...]
    }
  ]
}
What Gets Extracted
Paragraphs & Styles
Every paragraph preserves its style name (Heading 1, Normal, ListBullet, etc.) and inline formatting. Run-level bold, italic, underline, and font changes are captured.
Track Changes
Insertions, deletions, and format changes with author name, timestamp, and original vs revised text. Full revision history, not just the accepted result.
Comments & Replies
Each comment extracted with author, date, anchored text range, and threaded replies. Comments stay associated with the paragraph they reference.
Tables & Merged Cells
Full table structure with gridSpan (horizontal merge) and vMerge (vertical merge) resolved into explicit metadata. Header rows identified.
Headers, Footers & Text Boxes
Header and footer content extracted as separate section blocks. Text boxes (w:txbxContent) extracted inline at their anchor position in the document flow.
Images & Metadata
Embedded images identified by relationship ID with dimensions and alt text. Document metadata (title, author, created date, revision count) extracted from the core properties part, docProps/core.xml.
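As a rough illustration of the metadata item, the core properties are plain XML that the standard library can read. This is a minimal sketch, not AILANG Parse's implementation: the sample values, the in-memory package, and the helper names are all made up for the example, and the returned key names are an assumption modelled on the JSON shown earlier.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Sample part content; the values are illustrative, not from a real file.
CORE_XML = (
    f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}" '
    f'xmlns:dcterms="{NS["dcterms"]}">'
    "<dc:title>Q1 Revenue Report</dc:title>"
    "<dc:creator>Finance Team</dc:creator>"
    "<dcterms:created>2026-03-01T09:00:00Z</dcterms:created>"
    "<cp:revision>4</cp:revision>"
    "</cp:coreProperties>"
)


def make_package() -> bytes:
    """Build an in-memory zip holding only the core-properties part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", CORE_XML)
    return buf.getvalue()


def read_core_properties(data: bytes) -> dict:
    """Read title, author, created date, and revision from docProps/core.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", namespaces=NS),
        "author": root.findtext("dc:creator", namespaces=NS),
        "created": root.findtext("dcterms:created", namespaces=NS),
        "revision": root.findtext("cp:revision", namespaces=NS),
    }


props = read_core_properties(make_package())
print(props)
```

Any property missing from a file comes back as None, so callers should treat every field as optional.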
Track Changes
DOCX track changes encode the full editing history: who changed what, when, and what the text looked like before. AILANG Parse extracts every revision as a structured change block with type (insert/delete/formatChange), author, date, and the affected text.
This is the metadata that disappears when you render to PDF. A legal team reviewing a contract needs to see that "Alice deleted net-30 and inserted net-60 on March 15th" — not just the final text.
{
  "type": "text",
  "text": "Payment terms: net-60 days from invoice date.",
  "changes": [
    {"type": "delete", "author": "Alice Chen", "date": "2026-03-15T14:22:00Z", "text": "net-30"},
    {"type": "insert", "author": "Alice Chen", "date": "2026-03-15T14:22:00Z", "text": "net-60"}
  ]
}
See Track Changes extraction for the full specification.
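To make the mapping from markup to change blocks concrete, here is a minimal sketch of reading w:ins and w:del elements with the standard library. It is an illustration under simplifying assumptions, not the AILANG Parse implementation: it only looks at revisions that are direct children of a paragraph, ignores format changes, and the sample XML and function names are invented for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name: str) -> str:
    """Expand a local name into ElementTree's {namespace}name form."""
    return f"{{{W}}}{name}"


# Hand-written sample paragraph mirroring the net-30 / net-60 example.
PARAGRAPH_XML = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Payment terms: </w:t></w:r>
  <w:del w:id="1" w:author="Alice Chen" w:date="2026-03-15T14:22:00Z">
    <w:r><w:delText>net-30</w:delText></w:r>
  </w:del>
  <w:ins w:id="2" w:author="Alice Chen" w:date="2026-03-15T14:22:00Z">
    <w:r><w:t>net-60</w:t></w:r>
  </w:ins>
  <w:r><w:t xml:space="preserve"> days from invoice date.</w:t></w:r>
</w:p>"""

# Revision element -> (change type, element that carries its text).
REVISIONS = {q("ins"): ("insert", q("t")), q("del"): ("delete", q("delText"))}


def paragraph_to_block(p: ET.Element) -> dict:
    """Build a text block with the revised text and a list of changes."""
    changes = []
    for child in p:
        if child.tag not in REVISIONS:
            continue
        kind, text_tag = REVISIONS[child.tag]
        changes.append({
            "type": kind,
            "author": child.get(q("author")),
            "date": child.get(q("date")),
            "text": "".join(t.text or "" for t in child.iter(text_tag)),
        })
    # Deleted text lives in w:delText, so joining w:t gives the revised text.
    revised = "".join(t.text or "" for t in p.iter(q("t")))
    return {"type": "text", "text": revised, "changes": changes}


block = paragraph_to_block(ET.fromstring(PARAGRAPH_XML))
print(block)
```

Because deletions are stored in w:delText rather than w:t, the revised text and the removed text never mix, which is what lets both survive extraction.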
Tables & Merged Cells
Office tables use gridSpan for horizontal merges and vMerge for vertical merges. AILANG Parse resolves these into a clean row/column structure with explicit merge metadata — no heuristic guessing from visual alignment.
{
  "type": "table",
  "headers": [
    {"text": "Region", "gridSpan": 2},
    {"text": "Q1"},
    {"text": "Q2"}
  ],
  "rows": [
    [{"text": "EMEA"}, {"text": "UK"}, {"text": "$125K"}, {"text": "$140K"}],
    [{"text": "EMEA", "vMerge": "continue"}, {"text": "DE"}, {"text": "$98K"}, {"text": "$112K"}]
  ]
}
See Tables & Merged Cells for the full specification.
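One way to picture the merge resolution is to walk the table row by row while tracking the grid column of each cell. The sketch below does that for a hand-written sample table. It is an assumption-laden illustration rather than the AILANG Parse algorithm: the function names are hypothetical, header rows are not separated out, and the output only approximates the JSON shape above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def tc(text: str, props: str = "") -> str:
    """Build one sample table cell (helper for the example data only)."""
    pr = f"<w:tcPr>{props}</w:tcPr>" if props else ""
    return f"<w:tc>{pr}<w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


TABLE_XML = (
    f'<w:tbl xmlns:w="{W}">'
    "<w:tr>" + tc("Region", '<w:gridSpan w:val="2"/>') + tc("Q1") + tc("Q2") + "</w:tr>"
    "<w:tr>" + tc("EMEA", '<w:vMerge w:val="restart"/>') + tc("UK")
    + tc("$125K") + tc("$140K") + "</w:tr>"
    "<w:tr>" + tc("", "<w:vMerge/>") + tc("DE") + tc("$98K") + tc("$112K") + "</w:tr>"
    "</w:tbl>"
)


def resolve_table(tbl: ET.Element) -> list:
    """Return rows of cell dicts with explicit gridSpan / vMerge metadata."""
    rows = []
    merged_text = {}  # grid column -> text of the cell that started a vertical merge
    for tr in tbl.findall("w:tr", NS):
        row, col = [], 0
        for cell_el in tr.findall("w:tc", NS):
            span_el = cell_el.find("w:tcPr/w:gridSpan", NS)
            span = int(span_el.get(f"{{{W}}}val")) if span_el is not None else 1
            cell = {"text": "".join(t.text or "" for t in cell_el.iter(f"{{{W}}}t"))}
            if span > 1:
                cell["gridSpan"] = span
            vmerge = cell_el.find("w:tcPr/w:vMerge", NS)
            if vmerge is None:
                merged_text.pop(col, None)  # any merge in this column has ended
            else:
                # An omitted w:val means "continue" in the OOXML schema.
                mode = vmerge.get(f"{{{W}}}val", "continue")
                cell["vMerge"] = mode
                if mode == "continue":
                    cell["text"] = merged_text.get(col, "")
                else:
                    merged_text[col] = cell["text"]
            row.append(cell)
            col += span  # a spanning cell occupies several grid columns
        rows.append(row)
    return rows


rows = resolve_table(ET.fromstring(TABLE_XML))
for r in rows:
    print(r)
```

Tracking the grid column rather than the cell index is the important design choice: a continuation cell must be matched with the cell above it in the same grid column, which differs from its position in the row whenever an earlier cell spans several columns.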
Comments
DOCX comments are anchored to specific text ranges via w:commentRangeStart/w:commentRangeEnd markers. AILANG Parse extracts each comment with its author, timestamp, the anchored text, and any reply threads.
{
  "type": "text",
  "text": "The delivery timeline is aggressive.",
  "comments": [
    {
      "author": "Bob Martinez",
      "date": "2026-03-20T09:15:00Z",
      "text": "Can we push this to Q3?",
      "replies": [
        {"author": "Alice Chen", "date": "2026-03-20T10:02:00Z", "text": "Agreed, updated."}
      ]
    }
  ]
}
See Comment extraction for the full specification.
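The range markers can be resolved with a single pass over the document body: remember which comment ids are currently open and append every w:t seen in between. The following is a minimal sketch under stated assumptions, not the AILANG Parse implementation. The sample XML is hand-written, the "anchor" key and function names are hypothetical, and reply threading (which recent Word versions store separately, in word/commentsExtended.xml) is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name: str) -> str:
    """Expand a local name into ElementTree's {namespace}name form."""
    return f"{{{W}}}{name}"


# Sample body: only "delivery timeline" sits inside the comment range.
DOCUMENT_XML = f"""<w:body xmlns:w="{W}"><w:p>
  <w:r><w:t xml:space="preserve">The </w:t></w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r><w:t>delivery timeline</w:t></w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r><w:commentReference w:id="0"/></w:r>
  <w:r><w:t xml:space="preserve"> is aggressive.</w:t></w:r>
</w:p></w:body>"""

# Sample stand-in for word/comments.xml.
COMMENTS_XML = f"""<w:comments xmlns:w="{W}">
  <w:comment w:id="0" w:author="Bob Martinez" w:date="2026-03-20T09:15:00Z">
    <w:p><w:r><w:t>Can we push this to Q3?</w:t></w:r></w:p>
  </w:comment>
</w:comments>"""


def anchored_ranges(body: ET.Element) -> dict:
    """Map each comment id to the text between its range start and end."""
    open_ids, ranges = set(), {}
    for el in body.iter():  # document order
        if el.tag == q("commentRangeStart"):
            cid = el.get(q("id"))
            open_ids.add(cid)
            ranges[cid] = ""
        elif el.tag == q("commentRangeEnd"):
            open_ids.discard(el.get(q("id")))
        elif el.tag == q("t"):
            for cid in open_ids:
                ranges[cid] += el.text or ""
    return ranges


def extract_comments(body: ET.Element, comments: ET.Element) -> list:
    """Join comment bodies with the text ranges they are anchored to."""
    ranges = anchored_ranges(body)
    return [
        {
            "author": c.get(q("author")),
            "date": c.get(q("date")),
            "text": "".join(t.text or "" for t in c.iter(q("t"))),
            "anchor": ranges.get(c.get(q("id")), ""),
        }
        for c in comments.iter(q("comment"))
    ]


result = extract_comments(ET.fromstring(DOCUMENT_XML), ET.fromstring(COMMENTS_XML))
print(result)
```

Keeping a set of open ids, rather than a single current id, lets overlapping comment ranges each collect their own anchored text.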
Use Cases
Legal Document Review
Extract redlines from contracts with full author/date attribution. Track changes show exactly who modified which clause and when — the audit trail that disappears in PDF. Feed structured revisions to an LLM for clause-by-clause comparison.
Contract Analysis
Parse contracts into typed blocks: headings map to clause structure, tables capture obligation matrices, comments surface negotiation context. Downstream code identifies key terms, dates, and parties without regex on raw text.
Regulatory Compliance
Audit trails require knowing who changed what and when. Track changes metadata provides this directly. Parse submission documents, compare revisions programmatically, and generate compliance reports from structured data instead of manual review.
Academic Paper Processing
Extract heading hierarchy, citation tables, figure references, and reviewer comments from manuscript DOCX files. The structured output feeds directly into reference management, plagiarism detection, or automated formatting pipelines.
Document Migration
Convert DOCX archives to any output format: Markdown for docs-as-code, HTML for web publishing, Quarto for reproducible reports. Structure preservation means headings, tables, and images survive the conversion intact.
AI Document Understanding
Feed structured JSON to LLMs instead of raw text. The model sees typed blocks (heading, table, comment) with metadata — not a flat string. Token efficiency improves and the model can reason about document structure, not just content.
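Several of these use cases come down to walking the typed blocks and dispatching on their type. As a rough sketch of what such downstream code might look like, the function below renders a document dict in the JSON shape shown earlier as Markdown. The renderer and its output conventions are assumptions made for illustration. It is not the converter that ships with AILANG Parse, and it only handles the three block types used in the examples on this page.

```python
def render_block(block: dict) -> str:
    """Render one typed block as a Markdown chunk."""
    kind = block["type"]
    if kind == "heading":
        return "#" * block["level"] + " " + block["text"]
    if kind == "text":
        lines = [block["text"]]
        # Surface tracked changes as quoted notes under the paragraph.
        for ch in block.get("changes", []):
            lines.append(f"> {ch['type']} by {ch['author']} on {ch['date']}: {ch['text']}")
        return "\n".join(lines)
    if kind == "table":
        header = []
        for cell in block["headers"]:
            # Markdown has no colspan, so pad a merged header with empty cells.
            header += [cell["text"]] + [""] * (cell.get("gridSpan", 1) - 1)
        lines = ["| " + " | ".join(header) + " |", "|" + " --- |" * len(header)]
        for row in block["rows"]:
            lines.append("| " + " | ".join(c["text"] for c in row) + " |")
        return "\n".join(lines)
    return ""  # unknown block types are skipped in this sketch


def render_markdown(doc: dict) -> str:
    """Join the rendered blocks with blank lines between them."""
    chunks = [render_block(b) for b in doc["blocks"]]
    return "\n\n".join(c for c in chunks if c)


# Sample input assembled from the JSON examples on this page.
sample_doc = {
    "blocks": [
        {"type": "heading", "level": 1, "text": "Q1 Revenue"},
        {"type": "text", "text": "Total: $2.4M", "changes": [
            {"type": "insert", "author": "Alice",
             "date": "2026-03-15T10:30:00Z", "text": "$2.4M"},
        ]},
        {"type": "table",
         "headers": [{"text": "Region", "gridSpan": 2}, {"text": "Q1"}, {"text": "Q2"}],
         "rows": [[{"text": "EMEA"}, {"text": "UK"}, {"text": "$125K"}, {"text": "$140K"}]]},
    ]
}

markdown = render_markdown(sample_doc)
print(markdown)
```

The same dispatch-on-type loop works for building LLM prompts or compliance reports: only the per-type rendering changes.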
Try It
CLI
# Parse a DOCX file
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail report.docx

# Convert DOCX to Markdown
./bin/docparse report.docx --convert output.md

# Convert DOCX to HTML
./bin/docparse report.docx --convert output.html
API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_docx_formatting","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'
Python SDK
from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("contract.docx", output_format="json")

# Document metadata
print(result.metadata.title)
print(result.metadata.author)

# Iterate structured blocks
for block in result.blocks:
    if block.type == "table":
        print(f"Table: {len(block.rows)} rows, merged={block.has_merges}")
    # Not every block carries tracked changes, so check before iterating
    if hasattr(block, "changes") and block.changes:
        for change in block.changes:
            # change.type is "insert", "delete", or "formatChange"
            print(f"{change.author} ({change.type}): {change.text}")
Frequently Asked Questions
How do I parse a DOCX file to JSON?
Send the .docx file to the AILANG Parse API or use the Python/JS/Go SDK. The parser reads Office Open XML directly and returns structured JSON with typed blocks.
Does DOCX parsing preserve track changes?
Yes. Every insertion, deletion, and format change is extracted with author, timestamp, and original vs revised text. See track changes docs.
Are comments extracted from DOCX files?
Yes. Each comment includes author, date, anchored text range, and reply threads. See comment extraction docs.
How are tables with merged cells handled?
Merged cells (gridSpan for horizontal, vMerge for vertical) are resolved into a clean row/column structure with explicit merge metadata.
Does AILANG Parse convert DOCX to PDF first?
No. It reads Office Open XML directly from the .docx zip archive. No rendering, no PDF conversion, no LibreOffice dependency. See why this matters.
What about headers, footers, and text boxes?
Headers and footers are extracted as separate section blocks. Text boxes (w:txbxContent) are extracted inline at their anchor position.
How fast is DOCX parsing?
Typical files parse in under 50 ms, with no external dependencies and no subprocess spawning. AILANG Parse scores 93.9% composite on OfficeDocBench.