XLSX Parsing API — Parse Excel Spreadsheets to Structured JSON

Spreadsheets Are Structured Data Trapped in a Visual Layout

An .xlsx file is a zip archive containing SpreadsheetML — XML that encodes cells, types, formulas, merge ranges, and shared strings. Most parsers flatten this into CSV or plain text, destroying merged cells, multi-sheet structure, and cell type information.
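To see the container structure for yourself, the standard-library `zipfile` module is enough. The sketch below is illustrative only: it builds a toy archive in memory with the conventional part names (it is not a valid workbook) and lists its contents. The helper names `list_parts` and `build_toy_xlsx` are made up for this example and are not part of AILANG Parse.

```python
import io
import zipfile


def list_parts(xlsx_bytes: bytes) -> list[str]:
    """Return the names of the parts stored in an .xlsx zip container."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        return archive.namelist()


def build_toy_xlsx() -> bytes:
    """Build a stand-in archive with typical part names (not a valid workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr("xl/sharedStrings.xml", "<sst/>")
        archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    return buffer.getvalue()


parts = list_parts(build_toy_xlsx())
for name in parts:
    print(name)
```

With a real file, pass the bytes read from disk to `list_parts` instead of the toy archive.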

AILANG Parse reads SpreadsheetML directly. Merged cells get explicit gridSpan/vMerge metadata from the mergeCell elements. Shared string references are resolved. Each sheet becomes a separate section with its full table structure intact.

XLSX merged cells are the #1 structural feature that other parsers drop. A financial report with merged region headers becomes an unreadable flat grid. AILANG Parse preserves the merge topology. See table extraction details.

Raw SpreadsheetML vs Structured Output

Raw sheet1.xml + sharedStrings.xml

<!-- sharedStrings.xml -->
<sst count="6">
  <si><t>Region</t></si>
  <si><t>Q1</t></si>
  <si><t>Q2</t></si>
  <si><t>EMEA</t></si>
  <si><t>APAC</t></si>
</sst>

<!-- sheet1.xml -->
<mergeCells count="1">
  <mergeCell ref="A1:A2"/>
</mergeCells>
<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
    <c r="C1" t="s"><v>2</v></c>
  </row>
  <row r="2">
    <c r="B2"><v>125000</v></c>
    <c r="C2"><v>140000</v></c>
  </row>
  <row r="3">
    <c r="A3" t="s"><v>4</v></c>
    <c r="B3"><v>143000</v></c>
    <c r="C3">
      <f>SUM(B2:B3)</f>
      <v>268000</v>
    </c>
  </row>
</sheetData>

Structured output

{
  "metadata": {
    "title": "Q1-Q2 Revenue",
    "sheets": ["Revenue", "Summary"]
  },
  "blocks": [
    {
      "type": "heading",
      "level": 2,
      "text": "Revenue"
    },
    {
      "type": "table",
      "headers": [
        {"text": "Region", "vMerge": "restart"},
        {"text": "Q1"},
        {"text": "Q2"}
      ],
      "rows": [
        [
          {"text": ""},
          {"text": "125000"},
          {"text": "140000"}
        ],
        [
          {"text": "APAC"},
          {"text": "143000"},
          {"text": "268000",
           "formula": "SUM(B2:B3)"}
        ]
      ]
    }
  ]
}

What Gets Extracted

Tables with Cell Types

Every cell preserves its type: string, number, boolean, date, or formula. Shared string references are resolved to actual text. Empty cells are handled correctly in sparse sheets.
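As a rough illustration of where these types come from, SpreadsheetML marks each `<c>` element with a `t` attribute (`s` for a shared string index, `b` for boolean, and number when the attribute is absent), and a formula cell carries an `<f>` child next to its cached `<v>`. The sketch below decodes those cases with `xml.etree.ElementTree`. The function name and the returned dict shape are assumptions made for this example, not the AILANG Parse output schema. Dates are not decoded here, because they are usually stored as numbers and identified through the cell's number format.

```python
import xml.etree.ElementTree as ET

URI = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = "{" + URI + "}"


def classify_cell(cell: ET.Element, shared: list[str]) -> dict:
    """Map one <c> element to a small typed dict (hypothetical shape)."""
    raw = cell.findtext(f"{NS}v")
    formula = cell.findtext(f"{NS}f")
    kind = cell.get("t", "n")  # number is the default when t is absent
    if raw is None:
        type_name, value = "empty", None
    elif kind == "s":
        type_name, value = "string", shared[int(raw)]
    elif kind == "b":
        type_name, value = "boolean", raw == "1"
    elif kind == "n":
        type_name, value = "number", float(raw)
    else:
        # Other t values (str, inlineStr, e, d) are not decoded in this sketch.
        type_name, value = kind, raw
    out = {"ref": cell.get("r"), "type": type_name, "value": value}
    if formula is not None:
        out["formula"] = formula
    return out


shared = ["Region", "Q1", "Q2", "EMEA", "APAC"]
row = ET.fromstring(
    f'<row xmlns="{URI}" r="3">'
    '<c r="A3" t="s"><v>4</v></c>'
    '<c r="B3"><v>143000</v></c>'
    '<c r="C3"><f>SUM(B2:B3)</f><v>268000</v></c>'
    '<c r="D3" t="b"><v>1</v></c>'
    "</row>"
)
cells = [classify_cell(c, shared) for c in row]
for cell in cells:
    print(cell)
```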

Merged Cells

Horizontal and vertical merges from mergeCell elements are resolved into gridSpan and vMerge metadata. The merge topology comes directly from the XML, not visual heuristics.

Multi-Sheet Workbooks

Each worksheet becomes a separate SectionBlock with the sheet name as a heading. All sheets are parsed in workbook order. Sheet-level metadata (visibility, tab color) is preserved.

Formulas & Cached Values

Formula expressions are extracted alongside their last computed values. You get both "formula": "SUM(B2:B3)" and "text": "268000" for downstream processing.

Shared Strings

XLSX stores repeated text values in sharedStrings.xml and references them by index. AILANG Parse resolves all references automatically — you see the actual text, not index numbers.
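The lookup itself is simple enough to sketch with the standard library. The example below, which assumes the standard SpreadsheetML namespace, loads a small shared string table into a list so that a cell value such as `2` can be resolved by index. `load_shared_strings` is a hypothetical helper written for this page, not the parser's internal code, and it does not skip phonetic runs (`<rPh>`), which a production implementation would need to exclude.

```python
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

SST_XML = """<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <si><t>Region</t></si>
  <si><t>Q1</t></si>
  <si><r><t>Q</t></r><r><t>2</t></r></si>
</sst>"""


def load_shared_strings(xml_text: str) -> list[str]:
    """Return shared strings in index order, joining rich-text runs."""
    root = ET.fromstring(xml_text)
    strings = []
    for item in root.findall(f"{NS}si"):
        # A plain entry holds one <t>; a rich-text entry holds several <r><t> runs.
        strings.append("".join(t.text or "" for t in item.iter(f"{NS}t")))
    return strings


table = load_shared_strings(SST_XML)
resolved = table[int("2")]  # a cell with t="s" and <v>2</v>
print(table, resolved)
```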

Comments & Metadata

Cell-level comments are extracted with author attribution. Workbook metadata (title, author, created date) comes from the core properties.

Merged Cells

Financial reports, inventory sheets, and regulatory filings use merged cells extensively for group headers and category labels. XLSX stores these as mergeCell elements with cell range references (e.g., A1:A3). AILANG Parse resolves each range into per-cell gridSpan (horizontal) and vMerge (vertical) attributes.

{
  "type": "table",
  "headers": [
    {"text": "Department", "vMerge": "restart"},
    {"text": "H1", "gridSpan": 2},
    {"text": "H2", "gridSpan": 2}
  ],
  "rows": [
    [{"text": ""}, {"text": "Q1"}, {"text": "Q2"}, {"text": "Q3"}, {"text": "Q4"}],
    [{"text": "Engineering"}, {"text": "$450K"}, {"text": "$480K"}, {"text": "$510K"}, {"text": "$520K"}],
    [{"text": "Sales"}, {"text": "$320K"}, {"text": "$340K"}, {"text": "$355K"}, {"text": "$370K"}]
  ]
}

See Tables & Merged Cells for the full specification.
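The range-to-attribute step can be sketched in a few lines of Python. The code below expands a mergeCell reference into attributes keyed by zero-based (row, column) position. It is a minimal sketch, not the parser's implementation: the function names are invented for this example, and the "continue" value for non-anchor cells is an assumption, since this page only shows "restart".

```python
import re


def col_to_index(letters: str) -> int:
    """Convert column letters to a zero-based index (A -> 0, Z -> 25, AA -> 26)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def parse_ref(ref: str) -> tuple[int, int]:
    """Split a cell reference such as 'B3' into zero-based (row, col)."""
    match = re.fullmatch(r"([A-Z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    return int(match.group(2)) - 1, col_to_index(match.group(1))


def merge_attrs(merge_ref: str) -> dict[tuple[int, int], dict]:
    """Expand one mergeCell range into per-cell attributes on its first column."""
    start, end = merge_ref.split(":")
    (r0, c0), (r1, c1) = parse_ref(start), parse_ref(end)
    attrs = {}
    for row in range(r0, r1 + 1):
        cell = {}
        if c1 > c0:
            cell["gridSpan"] = c1 - c0 + 1  # number of columns covered
        if r1 > r0:
            # "continue" is an assumed marker for rows below the anchor cell.
            cell["vMerge"] = "restart" if row == r0 else "continue"
        attrs[(row, c0)] = cell
    return attrs


vertical = merge_attrs("A1:A2")
horizontal = merge_attrs("B1:C1")
print(vertical)
print(horizontal)
```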

Multi-Sheet Workbooks

A single XLSX file often contains multiple worksheets: raw data, summaries, pivot tables, configuration. AILANG Parse processes all sheets in workbook order, each becoming a SectionBlock with the sheet name as a heading and the sheet's table data as nested blocks.

{
  "blocks": [
    {"type": "heading", "level": 2, "text": "Raw Data"},
    {"type": "table", "headers": [...], "rows": [...]},
    {"type": "heading", "level": 2, "text": "Summary"},
    {"type": "table", "headers": [...], "rows": [...]},
    {"type": "heading", "level": 2, "text": "Config"},
    {"type": "table", "headers": [...], "rows": [...]}
  ]
}
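A flat block list like the one above is easy to regroup per sheet on the client side. The sketch below works on plain dicts loaded from the JSON output and assumes that every level-2 heading marks the start of a new sheet. `group_by_sheet` is a hypothetical helper, not an SDK function.

```python
def group_by_sheet(blocks: list[dict]) -> dict[str, list[dict]]:
    """Group blocks under the most recent level-2 heading (the sheet name)."""
    sheets: dict[str, list[dict]] = {}
    current = None
    for block in blocks:
        if block["type"] == "heading" and block.get("level") == 2:
            current = block["text"]
            sheets[current] = []
        elif current is not None:
            sheets[current].append(block)
    return sheets


blocks = [
    {"type": "heading", "level": 2, "text": "Raw Data"},
    {"type": "table", "headers": [{"text": "Region"}], "rows": [[{"text": "APAC"}]]},
    {"type": "heading", "level": 2, "text": "Summary"},
    {"type": "table", "headers": [{"text": "Total"}], "rows": []},
]

sheets = group_by_sheet(blocks)
for name, sheet_blocks in sheets.items():
    print(name, len(sheet_blocks))
```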

Use Cases

Financial Report Extraction

Parse budget spreadsheets, P&L statements, and revenue models with merged category headers intact. Formulas are preserved alongside computed values, so downstream systems can verify calculations or flag anomalies.
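As one way to picture such a check, the sketch below recomputes a simple single-column SUM and compares it with the cached value. It assumes you have already built a mapping from cell references to numeric values, which is an assumption of this example because the JSON shown on this page does not include cell references. It handles only the `SUM(X1:X9)` form and is not a general formula evaluator.

```python
import math
import re

CELL = r"([A-Z]+)(\d+)"


def expand_column_range(start: str, end: str) -> list[str]:
    """Expand a single-column range such as B2:B3 into ['B2', 'B3']."""
    col_a, row_a = re.fullmatch(CELL, start).groups()
    col_b, row_b = re.fullmatch(CELL, end).groups()
    if col_a != col_b:
        raise ValueError("this sketch only handles single-column ranges")
    return [f"{col_a}{row}" for row in range(int(row_a), int(row_b) + 1)]


def check_sum(formula: str, cached: float, values: dict[str, float]) -> bool:
    """Return True when the cached value matches the recomputed SUM."""
    match = re.fullmatch(r"SUM\(([A-Z]+\d+):([A-Z]+\d+)\)", formula)
    if match is None:
        raise ValueError(f"unsupported formula: {formula!r}")
    refs = expand_column_range(match.group(1), match.group(2))
    return math.isclose(sum(values[ref] for ref in refs), cached)


values = {"B2": 125000.0, "B3": 143000.0}  # hypothetical ref -> value mapping
ok = check_sum("SUM(B2:B3)", 268000.0, values)
stale = check_sum("SUM(B2:B3)", 250000.0, values)
print(ok, stale)
```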

Data Pipeline Ingestion

Convert XLSX uploads to structured JSON for database ingestion. Cell types (number, date, string) are preserved, so your pipeline doesn't need to guess whether "42736" is a number or an Excel date serial. Multi-sheet workbooks are handled in a single call.
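To make the ambiguity concrete: if 42736 is a date serial in the default 1900 date system, it converts to a calendar date as sketched below. This is a minimal illustration that assumes the 1900 date system and a serial of 61 or higher; it does not handle workbooks that use the 1904 date system. `serial_to_date` is a name chosen for this example.

```python
from datetime import date, timedelta

# Day zero for the 1900 date system; the offset absorbs Excel's phantom 1900-02-29.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial: int) -> date:
    """Convert a 1900-system date serial (61 or higher) to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


converted = serial_to_date(42736)
print(converted)  # 2017-01-01
```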

Regulatory Filing Analysis

Financial regulatory filings (XBRL supplements, risk reports, capital adequacy schedules) rely heavily on merged cells for hierarchical categorization. Flat CSV export destroys this structure. Structured parsing preserves it.

Inventory & ERP Data

Warehouse inventory sheets, BOM spreadsheets, and ERP exports use multi-sheet workbooks with cross-references. Parse the entire workbook into typed blocks, then feed to an LLM for procurement analysis, stock reconciliation, or demand forecasting.

AI Spreadsheet Understanding

Feed structured table JSON to LLMs instead of CSV dumps. The model sees typed cells, merge topology, and formula expressions — enabling it to reason about table structure, not just cell values. Token efficiency improves when the model doesn't need to infer column boundaries.

Format Conversion

Convert XLSX to Markdown tables, HTML, or Quarto for documentation and reporting. Merged cells are expanded or annotated in the output format. Multi-sheet workbooks produce sectioned output with clear sheet boundaries.
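If you prefer to render Markdown yourself from the JSON output, a table block without merged cells maps onto a pipe table directly. The sketch below assumes the block shape shown earlier on this page (`headers` and `rows` holding cells with a `text` field) and ignores gridSpan and vMerge entirely. `table_to_markdown` is a hypothetical helper, not part of the SDK.

```python
def table_to_markdown(table: dict) -> str:
    """Render a merge-free table block as a Markdown pipe table."""

    def render(cells: list[dict]) -> str:
        # Escape literal pipes so they do not break the column layout.
        texts = [cell["text"].replace("|", "\\|") for cell in cells]
        return "| " + " | ".join(texts) + " |"

    lines = [render(table["headers"])]
    lines.append("| " + " | ".join("---" for _ in table["headers"]) + " |")
    lines.extend(render(row) for row in table["rows"])
    return "\n".join(lines)


table = {
    "type": "table",
    "headers": [{"text": "Region"}, {"text": "Q1"}, {"text": "Q2"}],
    "rows": [[{"text": "APAC"}, {"text": "143000"}, {"text": "268000"}]],
}

markdown = table_to_markdown(table)
print(markdown)
```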

Try It

CLI

# Parse an XLSX file
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail financials.xlsx

# Convert XLSX to Markdown tables
./bin/docparse financials.xlsx --convert output.md

# Convert XLSX to HTML
./bin/docparse financials.xlsx --convert output.html

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_xlsx_formatting","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'

Python SDK

from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("financials.xlsx", output_format="json")

# Iterate sheets (each is a section block)
for block in result.blocks:
    if block.type == "heading":
        print(f"Sheet: {block.text}")
    elif block.type == "table":
        print(f"  {len(block.rows)} rows, {len(block.headers)} columns")
        if block.has_merges:
            print("  Contains merged cells")

Frequently Asked Questions

How do I parse an XLSX file to JSON?

Send the .xlsx file to the AILANG Parse API or use the Python/JS/Go SDK. The parser reads SpreadsheetML directly and returns structured JSON with tables, merged cells, and metadata per sheet.

How are merged cells handled in XLSX parsing?

Merged cell ranges are resolved into explicit gridSpan and vMerge metadata on each cell. The merge topology comes directly from the XML. See table extraction docs.

Does XLSX parsing support multi-sheet workbooks?

Yes. Each sheet becomes a separate section block with the sheet name as a heading. All sheets are parsed in workbook order.

Are formulas extracted from XLSX files?

Yes. Formula expressions are captured alongside their cached values. You get both the formula string and the last computed result.

What about shared strings in XLSX?

XLSX stores repeated text in a shared string table. AILANG Parse resolves all references to their actual text values automatically.

How fast is XLSX parsing?

Typical spreadsheets parse in under 50ms, with no external dependencies and no subprocess spawning. The parser scores 93.9% composite on OfficeDocBench.