Table & Merged Cell Extraction

Q: How do I parse DOCX tables and preserve merged cells?

AILANG Parse reads the w:gridSpan (horizontal merge) and w:vMerge (vertical merge) elements directly from the DOCX XML. The output uses standard colspan and rowspan semantics, so the table structure comes from the file's own markup rather than being inferred from visual layout, as PDF-first parsers do.

The Problem

Tables are the hardest part of document parsing. Cells contain nested content, columns vary in width, and headers span multiple rows. But the real challenge is merged cells — a single cell that spans two columns, three rows, or both. Get the merge wrong and the data shifts: a revenue figure lands under the wrong quarter, a compliance field maps to the wrong entity.

PDF-first parsers infer table borders from rendered pixels, which fails on complex merges. A cell spanning columns 2–4 renders as one wide box, with nothing in the image to say how many grid columns it covers. A vertically merged cell that extends across five rows has no interior borders to detect.

AILANG Parse takes a different approach: it reads the merge markup directly from the Office XML. No guessing, no spatial analysis, no image processing. The structure is already encoded in gridSpan, vMerge, and mergeCells; the parser just reads it.

Why this matters for financial and compliance documents: Merged cells in balance sheets, income statements, and regulatory filings carry semantic meaning. A wrong merge span changes which column a number belongs to — turning a quarterly figure into an annual total, or attributing a liability to the wrong subsidiary. Deterministic extraction is not a nice-to-have here; it is a correctness requirement.

DOCX Tables

DOCX files use the OOXML standard (ECMA-376). Tables are defined by a grid model: the <w:tblGrid> element declares the column structure, and each cell's position within that grid follows from its order in the row and the spans of the cells before it.

Horizontal Merges (gridSpan)

When a cell spans multiple columns, it carries a <w:gridSpan w:val="3"/> element inside its <w:tcPr> cell properties. AILANG Parse reads this directly from the XML and emits it as a colspan value on the cell. No heuristics, no guessing from rendered width.

<!-- OOXML: cell spanning 3 columns -->
<w:tc>
  <w:tcPr>
    <w:gridSpan w:val="3"/>
  </w:tcPr>
  <w:p><w:r><w:t>Total Revenue</w:t></w:r></w:p>
</w:tc>
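
As a minimal sketch of this lookup (an illustration, not AILANG Parse's actual implementation), the snippet below reads the span from the cell above with Python's standard xml.etree.ElementTree. The helper names cell_colspan and cell_text are hypothetical, and the xmlns:w declaration is added only so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"  # ElementTree's "{namespace}tag" prefix

CELL_XML = f"""
<w:tc xmlns:w="{W_NS}">
  <w:tcPr>
    <w:gridSpan w:val="3"/>
  </w:tcPr>
  <w:p><w:r><w:t>Total Revenue</w:t></w:r></w:p>
</w:tc>
"""


def cell_colspan(tc):
    """Return the colspan of a <w:tc> element (1 when gridSpan is absent)."""
    grid_span = tc.find(f"{W}tcPr/{W}gridSpan")
    if grid_span is None:
        return 1
    return int(grid_span.get(f"{W}val", "1"))


def cell_text(tc):
    """Concatenate every <w:t> text run inside the cell."""
    return "".join(t.text or "" for t in tc.iter(f"{W}t"))


tc = ET.fromstring(CELL_XML.strip())
print(cell_text(tc), cell_colspan(tc))
```

The span is an integer stored in the file, so the same input always yields the same colspan.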

Vertical Merges (vMerge)

Vertical merges use a two-part mechanism. The first cell in a vertical span has <w:vMerge w:val="restart"/>, and subsequent cells in the same column carry <w:vMerge/> (with no val attribute, meaning "continue"). AILANG Parse tracks the merge state across rows and computes the final rowspan.

<!-- Row 1: start of vertical merge -->
<w:tc>
  <w:tcPr><w:vMerge w:val="restart"/></w:tcPr>
  <w:p><w:r><w:t>Category A</w:t></w:r></w:p>
</w:tc>

<!-- Row 2: continuation (empty vMerge = continue) -->
<w:tc>
  <w:tcPr><w:vMerge/></w:tcPr>
  <w:p/>
</w:tc>

<!-- Row 3: continuation -->
<w:tc>
  <w:tcPr><w:vMerge/></w:tcPr>
  <w:p/>
</w:tc>

This produces a cell with rowspan: 3 in the parsed output. The continuation cells are omitted — they carry no content.
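
The merge-state tracking can be sketched for a single grid column as follows. This is an assumption about how such tracking might look, not the parser's own code: rowspans_for_column is a hypothetical helper that takes one marker per row ("restart", "continue", or None for a cell outside any vertical merge).

```python
def rowspans_for_column(markers):
    """Compute rowspans for one grid column.

    Returns one entry per row: the rowspan for a cell that is emitted,
    or None for a continuation cell that is folded into the cell above.
    """
    spans = []
    anchor = None  # index of the "restart" cell currently being extended
    for i, marker in enumerate(markers):
        if marker == "continue" and anchor is not None:
            spans[anchor] += 1   # extend the open merge by one row
            spans.append(None)   # continuation cell is omitted from output
        else:
            spans.append(1)
            # Only a "restart" opens a merge; a plain cell (or an orphan
            # "continue" with nothing above it) closes any open merge.
            anchor = i if marker == "restart" else None
    return spans


print(rowspans_for_column(["restart", "continue", "continue"]))
```

For the three rows shown above, the first entry is 3 and the two continuation rows are None.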

Combined Merges

A cell can have both gridSpan and vMerge, creating a rectangular region that spans multiple rows and columns simultaneously. AILANG Parse handles this correctly because it reads both elements from the same <w:tcPr> element.
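
A sketch of how both values might be read in one pass is shown below. It assumes a simplified table (cells are direct children of <w:tr>, with no w:gridBefore or content-control wrappers), and parse_table is a hypothetical name, not an AILANG Parse API. The key detail is that vertical merges are tracked by grid column index, which advances by each cell's colspan.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

TABLE_XML = f"""
<w:tbl xmlns:w="{W_NS}">
  <w:tr>
    <w:tc>
      <w:tcPr><w:gridSpan w:val="2"/><w:vMerge w:val="restart"/></w:tcPr>
      <w:p><w:r><w:t>Region</w:t></w:r></w:p>
    </w:tc>
    <w:tc><w:p><w:r><w:t>Q1</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc>
      <w:tcPr><w:gridSpan w:val="2"/><w:vMerge/></w:tcPr>
      <w:p/>
    </w:tc>
    <w:tc><w:p><w:r><w:t>Q2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
"""


def parse_table(tbl):
    """Return rows of {"text", "colspan", "rowspan"} dicts for a <w:tbl>."""
    rows = []
    open_merges = {}  # grid column index -> anchor cell still being extended
    for tr in tbl.findall(f"{W}tr"):
        cells, col = [], 0
        for tc in tr.findall(f"{W}tc"):
            tc_pr = tc.find(f"{W}tcPr")
            span_el = tc_pr.find(f"{W}gridSpan") if tc_pr is not None else None
            merge_el = tc_pr.find(f"{W}vMerge") if tc_pr is not None else None
            colspan = int(span_el.get(f"{W}val", "1")) if span_el is not None else 1
            # <w:vMerge/> with no w:val means "continue"
            merge = None if merge_el is None else merge_el.get(f"{W}val", "continue")
            if merge == "continue" and col in open_merges:
                open_merges[col]["rowspan"] += 1  # fold into the anchor cell
            else:
                text = "".join(t.text or "" for t in tc.iter(f"{W}t"))
                cell = {"text": text, "colspan": colspan, "rowspan": 1}
                cells.append(cell)
                if merge == "restart":
                    open_merges[col] = cell
                else:
                    open_merges.pop(col, None)  # a plain cell ends any merge here
            col += colspan  # advance by grid columns, not by cell count
        rows.append(cells)
    return rows


rows = parse_table(ET.fromstring(TABLE_XML.strip()))
print(rows)
```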

XLSX Tables

Excel spreadsheets store merge information in a dedicated <mergeCells> element within each sheet's XML. Each merge is defined as a range like B2:D4, meaning columns B through D and rows 2 through 4 are merged into a single cell.

<!-- XLSX: sheet1.xml merge declarations -->
<mergeCells count="2">
  <mergeCell ref="B2:D4"/>
  <mergeCell ref="A1:C1"/>
</mergeCells>

AILANG Parse reads these ranges, resolves the row/column coordinates, and applies them to the cell data from the sheet XML, with string values looked up in sharedStrings.xml. The result is a table block where each merged region is represented by a single cell with the correct colspan and rowspan values.
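
Resolving a range such as B2:D4 into spans is simple arithmetic. The sketch below is illustrative only; column_index, parse_cell_ref, and merge_span are hypothetical helper names, and it assumes plain A1-style references without $ anchors.

```python
import re

REF_RE = re.compile(r"^([A-Z]+)(\d+)$")


def column_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_cell_ref(ref):
    """Split a reference like 'B2' into (row, column), both 1-based."""
    match = REF_RE.match(ref)
    if match is None:
        raise ValueError(f"unsupported cell reference: {ref!r}")
    letters, row = match.groups()
    return int(row), column_index(letters)


def merge_span(ref):
    """Turn a mergeCell ref such as 'B2:D4' into an anchor position plus spans."""
    start, _, end = ref.partition(":")
    r1, c1 = parse_cell_ref(start)
    r2, c2 = parse_cell_ref(end or start)
    return {"row": r1, "col": c1, "rowspan": r2 - r1 + 1, "colspan": c2 - c1 + 1}


print(merge_span("B2:D4"))
print(merge_span("A1:C1"))
```

B2:D4 covers rows 2 through 4 and columns B through D, so it resolves to a 3 by 3 region anchored at row 2, column 2.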

Shared Strings

XLSX files store string values in a shared string table (sharedStrings.xml) and reference them by index in the cell data. AILANG Parse resolves these references transparently, so the output contains the actual text, not numeric indices.
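
The lookup can be sketched with the standard library as below. The two XML fragments are made-up samples, and load_shared_strings and cell_values are hypothetical names. As a simplification, the sketch joins every <t> under an <si>, which would also pick up phonetic runs if a file contained them.

```python
import xml.etree.ElementTree as ET

S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
S = f"{{{S_NS}}}"

SHARED_STRINGS_XML = f"""<sst xmlns="{S_NS}" count="2" uniqueCount="2">
  <si><t>Revenue</t></si>
  <si><r><t>Q1 </t></r><r><t>2024</t></r></si>
</sst>"""

SHEET_XML = f"""<worksheet xmlns="{S_NS}"><sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
    <c r="C1"><v>1200000</v></c>
  </row>
</sheetData></worksheet>"""


def load_shared_strings(sst_root):
    """Return the shared string table as a list indexed by position."""
    # An <si> holds either a single <t> or several rich-text runs (<r><t>..</t></r>).
    return [
        "".join(t.text or "" for t in si.iter(f"{S}t"))
        for si in sst_root.findall(f"{S}si")
    ]


def cell_values(sheet_root, strings):
    """Map cell references to text, resolving t="s" cells through the table."""
    values = {}
    for c in sheet_root.iter(f"{S}c"):
        v = c.find(f"{S}v")
        if v is None:
            continue  # empty cell
        if c.get("t") == "s":
            values[c.get("r")] = strings[int(v.text)]  # <v> is an index
        else:
            values[c.get("r")] = v.text  # stored inline (numbers, for example)
    return values


strings = load_shared_strings(ET.fromstring(SHARED_STRINGS_XML))
values = cell_values(ET.fromstring(SHEET_XML), strings)
print(values)
```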

PPTX Tables

PowerPoint tables use the DrawingML namespace (a:) rather than WordprocessingML (w:). The grid is declared via <a:tblGrid> with <a:gridCol> elements, and horizontal merges use the gridSpan attribute directly on <a:tc> cells.

<!-- PPTX: table cell spanning 2 columns -->
<a:tc gridSpan="2">
  <a:txBody>
    <a:p><a:r><a:t>Merged Header</a:t></a:r></a:p>
  </a:txBody>
</a:tc>

Vertical merges in PPTX use the rowSpan attribute on the starting cell and vMerge="1" on continuation cells. AILANG Parse reads both and computes the correct span.
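
Because DrawingML stores the spans as plain attributes on the anchor cell, reading them needs no cross-row state. The sketch below is an illustration under assumptions: parse_pptx_row is a hypothetical name, the row is a made-up sample, and it also skips cells flagged hMerge (the horizontal counterpart of vMerge in DrawingML, which the text above does not cover).

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
A = f"{{{A_NS}}}"

ROW_XML = f"""
<a:tr xmlns:a="{A_NS}">
  <a:tc gridSpan="2">
    <a:txBody><a:p><a:r><a:t>Merged Header</a:t></a:r></a:p></a:txBody>
  </a:tc>
  <a:tc hMerge="1"><a:txBody><a:p/></a:txBody></a:tc>
  <a:tc rowSpan="2">
    <a:txBody><a:p><a:r><a:t>Notes</a:t></a:r></a:p></a:txBody>
  </a:tc>
</a:tr>
"""

TRUE_VALUES = ("1", "true")  # both spellings are valid XML booleans


def parse_pptx_row(tr):
    """Return the anchor cells of an <a:tr> with their colspan and rowspan."""
    cells = []
    for tc in tr.findall(f"{A}tc"):
        # Continuation cells are placeholders covered by an anchor cell.
        if tc.get("hMerge") in TRUE_VALUES or tc.get("vMerge") in TRUE_VALUES:
            continue
        cells.append({
            "text": "".join(t.text or "" for t in tc.iter(f"{A}t")),
            "colspan": int(tc.get("gridSpan", "1")),
            "rowspan": int(tc.get("rowSpan", "1")),
        })
    return cells


cells = parse_pptx_row(ET.fromstring(ROW_XML.strip()))
print(cells)
```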

Example Output

Given a DOCX file with a table where "Total Revenue" spans columns 2–4 and "Category A" spans rows 1–3, AILANG Parse produces:

{
  "type": "Table",
  "rows": [
    {
      "cells": [
        { "text": "Category A", "rowspan": 3 },
        { "text": "Total Revenue", "colspan": 3 }
      ]
    },
    {
      "cells": [
        { "text": "Q1", "colspan": 1 },
        { "text": "Q2", "colspan": 1 },
        { "text": "Q3", "colspan": 1 }
      ]
    },
    {
      "cells": [
        { "text": "$1.2M", "colspan": 1 },
        { "text": "$1.5M", "colspan": 1 },
        { "text": "$1.8M", "colspan": 1 }
      ]
    }
  ]
}

The rowspan: 3 on "Category A" is computed from the three consecutive vMerge elements in the source XML (one restart followed by two continuations). The colspan: 3 on "Total Revenue" comes directly from <w:gridSpan w:val="3"/>. No AI, no layout detection, no guessing.

This output is a Table block in the AILANG Parse Block ADT. The same structure is produced regardless of whether the source is DOCX, XLSX, or PPTX — all three formats normalize to the same representation with colspan and rowspan on each cell.
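
To see why the spans matter downstream, a consumer can expand the JSON above into a dense grid in which every position holds the text of the cell covering it. The sketch below is one possible way to do that; expand_to_grid is a hypothetical helper, not part of AILANG Parse, and it treats a missing colspan or rowspan as 1.

```python
def expand_to_grid(table):
    """Expand a Table block with colspan/rowspan into a dense list of rows."""
    occupied = {}  # (row, col) -> text of the cell covering that position
    for r, row in enumerate(table["rows"]):
        c = 0
        for cell in row["cells"]:
            # Skip positions already covered by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            rowspan = cell.get("rowspan", 1)
            colspan = cell.get("colspan", 1)
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = cell["text"]
            c += colspan
    if not occupied:
        return []
    n_rows = max(r for r, _ in occupied) + 1
    n_cols = max(c for _, c in occupied) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


table = {
    "type": "Table",
    "rows": [
        {"cells": [{"text": "Category A", "rowspan": 3},
                   {"text": "Total Revenue", "colspan": 3}]},
        {"cells": [{"text": "Q1", "colspan": 1},
                   {"text": "Q2", "colspan": 1},
                   {"text": "Q3", "colspan": 1}]},
        {"cells": [{"text": "$1.2M", "colspan": 1},
                   {"text": "$1.5M", "colspan": 1},
                   {"text": "$1.8M", "colspan": 1}]},
    ],
}

grid = expand_to_grid(table)
for line in grid:
    print(line)
```

If the rowspan on "Category A" were dropped, "Q1" would land in the first column and every figure would shift one column to the left, which is the kind of misalignment described in The Problem.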

How Other Parsers Handle This

Every major document parsing tool handles tables differently. Here is how they compare on merged cell extraction:

Tool	Approach	Merged Cell Result
AILANG Parse	Reads gridSpan, vMerge, mergeCells directly from Office XML	Structurally correct colspan/rowspan on every cell
Unstructured	Converts DOCX to PDF via LibreOffice, then uses layout analysis	Merged cells become separate cells or get mangled — merge semantics lost
Docling	Converts to PDF, runs table detection model (TableFormer)	Table detection sometimes works, but merge semantics are not recovered from the PDF render
LlamaParse	Sends document to cloud API; LLM reconstructs table structure	Non-deterministic — same document can produce different merge results on different runs
MarkItDown	Uses python-docx to extract cells	Partial gridSpan support, but misses vMerge entirely — vertical merges become empty rows

Why PDF Conversion Destroys Merge Information

When a DOCX is rendered to PDF, the table becomes a visual layout: lines on a page. The gridSpan and vMerge attributes are consumed by the rendering engine to draw the table, then discarded. The PDF contains no record that cells were merged — only the visual result.

Tools that work from the PDF must then reverse-engineer the table structure from line positions and text coordinates. This works for simple tables with visible borders but fails when:

Cells are merged across columns (the interior border is removed, so the detection model sees one wide cell but cannot determine how many columns it spans)
Cells are merged vertically (no horizontal border means the rows blend together)
Tables have no visible borders at all (common in financial statements)
Tables are nested, or cells contain complex content such as lists or images

AILANG Parse avoids all of these failure modes by never converting to PDF in the first place. The merge information is right there in the XML. We just read it.

Try It

Parse a document with tables and see the structural output:

# Parse a DOCX with merged cells
./bin/docparse data/test_files/sample.docx

# Parse an XLSX spreadsheet
./bin/docparse data/test_files/sample.xlsx

# Parse a PPTX with tables
./bin/docparse data/test_files/sample.pptx

# Convert a table-heavy DOCX to HTML (preserves colspan/rowspan)
./bin/docparse report.docx --convert output.html

Frequently Asked Questions

Why do most document parsers break table structures from DOCX files?

Most parsers convert DOCX to PDF first, which renders tables as positioned text blocks. ML models then try to reconstruct the table grid from visual coordinates, but merged cells, nested tables, and cells with multiple paragraphs create ambiguity that heuristics cannot resolve. AILANG Parse bypasses this entirely by reading the table XML directly.

How does AILANG Parse handle complex tables with nested content?

Each table cell in AILANG Parse's Block ADT can contain any block type — paragraphs, lists, nested tables, or even images. The parser reads the cell's XML subtree recursively, preserving the full content hierarchy. No flattening, no one-paragraph-per-cell limitation.

How does AILANG Parse compare to other parsers for DOCX table extraction?

On OfficeDocBench, AILANG Parse scores 93.9% with 100% format coverage versus 60-71% composite (32-68% coverage-adjusted) for the next-best alternatives. The gap is largest on merged cells: PDF conversion destroys merge attributes that AILANG Parse reads directly from the XML.

Format Guides

Tables are extracted from DOCX, XLSX, PPTX, and HTML files. See the full format guides for everything AILANG Parse extracts:

DOCX Parsing → · XLSX Parsing → · PPTX Parsing → · HTML Parsing →
