Spreadsheets Are Structured Data Trapped in a Visual Layout
An .xlsx file is a zip archive containing SpreadsheetML — XML that encodes cells, types, formulas, merge ranges, and shared strings. Most parsers flatten this into CSV or plain text, destroying merged cells, multi-sheet structure, and cell type information.
AILANG Parse reads SpreadsheetML directly. Merged cells get explicit gridSpan/vMerge metadata from the mergeCell elements. Shared string references are resolved. Each sheet becomes a separate section with its full table structure intact.
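Because the container is an ordinary zip archive, Python's standard zipfile module is enough to see which parts a parser has to read. The sketch below builds a small stand-in archive in memory and lists its contents. The three part names follow the usual XLSX layout, but the XML bodies are empty placeholders rather than a valid workbook, and `make_demo_archive` and `list_parts` are hypothetical helper names.

```python
import io
import zipfile


def make_demo_archive() -> bytes:
    """Build a tiny zip with XLSX-style part names (placeholder XML bodies)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/workbook.xml", "<workbook/>")
        zf.writestr("xl/sharedStrings.xml", "<sst/>")
        zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    return buf.getvalue()


def list_parts(xlsx_bytes: bytes) -> list[str]:
    """Return the sorted names of all parts stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        return sorted(zf.namelist())


parts = list_parts(make_demo_archive())
for name in parts:
    print(name)
```

The same `list_parts` call works on the bytes of a real .xlsx file, which is a quick way to check which sheets and auxiliary parts a workbook contains.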
Raw SpreadsheetML vs Structured Output
<!-- sharedStrings.xml -->
<sst count="6">
  <si><t>Region</t></si>
  <si><t>Q1</t></si>
  <si><t>Q2</t></si>
  <si><t>EMEA</t></si>
  <si><t>APAC</t></si>
</sst>
<!-- sheet1.xml -->
<mergeCells count="1">
  <mergeCell ref="A1:A2"/>
</mergeCells>
<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
    <c r="C1" t="s"><v>2</v></c>
  </row>
  <row r="2">
    <c r="B2"><v>125000</v></c>
    <c r="C2"><v>140000</v></c>
  </row>
  <row r="3">
    <c r="A3" t="s"><v>4</v></c>
    <c r="B3"><v>143000</v></c>
    <c r="C3">
      <f>SUM(B2:B3)</f>
      <v>268000</v>
    </c>
  </row>
</sheetData>
{
  "metadata": {
    "title": "Q1-Q2 Revenue",
    "sheets": ["Revenue", "Summary"]
  },
  "blocks": [
    {
      "type": "heading",
      "level": 2,
      "text": "Revenue"
    },
    {
      "type": "table",
      "headers": [
        {"text": "Region", "vMerge": "restart"},
        {"text": "Q1"},
        {"text": "Q2"}
      ],
      "rows": [
        [
          {"text": "EMEA"},
          {"text": "125000"},
          {"text": "140000"}
        ],
        [
          {"text": "APAC"},
          {"text": "143000"},
          {"text": "268000",
           "formula": "SUM(B2:B3)"}
        ]
      ]
    }
  ]
}
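The step from raw XML to resolved cell text can be approximated with the standard library. This is a minimal sketch, not AILANG Parse's implementation. It assumes the namespace-free XML of the sample (real SpreadsheetML parts declare an XML namespace that would have to be included in the tag names), and `resolve_cells` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SHARED_STRINGS_XML = """<sst count="6">
  <si><t>Region</t></si>
  <si><t>Q1</t></si>
  <si><t>Q2</t></si>
  <si><t>EMEA</t></si>
  <si><t>APAC</t></si>
</sst>"""

SHEET_DATA_XML = """<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
    <c r="C1" t="s"><v>2</v></c>
  </row>
  <row r="3">
    <c r="A3" t="s"><v>4</v></c>
    <c r="B3"><v>143000</v></c>
    <c r="C3"><f>SUM(B2:B3)</f><v>268000</v></c>
  </row>
</sheetData>"""


def resolve_cells(shared_xml: str, sheet_xml: str) -> dict[str, dict]:
    """Map each cell reference to its text, resolving shared-string indexes."""
    strings = [si.findtext("t", default="")
               for si in ET.fromstring(shared_xml).iter("si")]
    cells = {}
    for c in ET.fromstring(sheet_xml).iter("c"):
        raw = c.findtext("v")
        if raw is None:
            continue  # cell has no stored value
        # t="s" means <v> holds an index into the shared string table
        cell = {"text": strings[int(raw)] if c.get("t") == "s" else raw}
        formula = c.findtext("f")
        if formula is not None:
            cell["formula"] = formula  # keep the expression next to its cached value
        cells[c.get("r")] = cell
    return cells


cells = resolve_cells(SHARED_STRINGS_XML, SHEET_DATA_XML)
print(cells["A3"])
print(cells["C3"])
```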
What Gets Extracted
Tables with Cell Types
Every cell preserves its type: string, number, boolean, date, or formula. Shared string references are resolved to actual text. Empty cells are handled correctly in sparse sheets.
Merged Cells
Horizontal and vertical merges from mergeCell elements are resolved into gridSpan and vMerge metadata. The merge topology comes directly from the XML, not visual heuristics.
Multi-Sheet Workbooks
Each worksheet becomes a separate SectionBlock with the sheet name as a heading. All sheets are parsed in workbook order. Sheet-level metadata (visibility, tab color) is preserved.
Formulas & Cached Values
Formula expressions are extracted alongside their last computed values. For example, a cell can carry both "formula": "SUM(B2:B3)" and "text": "268000" for downstream processing.
Shared Strings
XLSX stores repeated text values in sharedStrings.xml and references them by index. AILANG Parse resolves all references automatically — you see the actual text, not index numbers.
Comments & Metadata
Cell-level comments are extracted with author attribution. Workbook metadata (title, author, created date) comes from the core properties.
Merged Cells
Financial reports, inventory sheets, and regulatory filings use merged cells extensively for group headers and category labels. XLSX stores these as mergeCell elements with cell range references (e.g., A1:A3). AILANG Parse resolves each range into per-cell gridSpan (horizontal) and vMerge (vertical) attributes.
{
  "type": "table",
  "headers": [
    {"text": "Department", "vMerge": "restart"},
    {"text": "H1", "gridSpan": 2},
    {"text": "H2", "gridSpan": 2}
  ],
  "rows": [
    [{"text": ""}, {"text": "Q1"}, {"text": "Q2"}, {"text": "Q3"}, {"text": "Q4"}],
    [{"text": "Engineering"}, {"text": "$450K"}, {"text": "$480K"}, {"text": "$510K"}, {"text": "$520K"}],
    [{"text": "Sales"}, {"text": "$320K"}, {"text": "$340K"}, {"text": "$355K"}, {"text": "$370K"}]
  ]
}
See Tables & Merged Cells for the full specification.
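One way to picture the range-to-attribute step is the sketch below, which expands a mergeCell ref into per-cell metadata. It illustrates the idea only and is not the parser's actual code: `merge_metadata` is a hypothetical helper, and the "continue" value for non-leading rows of a vertical merge is an assumption borrowed from WordprocessingML, since the examples on this page only show "restart".

```python
import re


def col_to_index(col: str) -> int:
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in col:
        n = n * 26 + ord(ch) - ord("A") + 1
    return n


def index_to_col(n: int) -> str:
    """Inverse of col_to_index."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def parse_ref(ref: str) -> tuple[int, int]:
    """Split a cell reference such as 'B12' into (column index, row number)."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", ref)
    if m is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    return col_to_index(m.group(1)), int(m.group(2))


def merge_metadata(range_ref: str) -> dict[str, dict]:
    """Return merge attributes for the leftmost cell of each row in the range."""
    start, end = range_ref.split(":")
    c1, r1 = parse_ref(start)
    c2, r2 = parse_ref(end)
    span = c2 - c1 + 1
    meta = {}
    for row in range(r1, r2 + 1):
        cell = {}
        if span > 1:
            cell["gridSpan"] = span  # horizontal extent in columns
        if r2 > r1:
            # assumed convention: first row restarts, later rows continue
            cell["vMerge"] = "restart" if row == r1 else "continue"
        meta[f"{index_to_col(c1)}{row}"] = cell
    return meta


print(merge_metadata("A1:A3"))
print(merge_metadata("B1:C1"))
```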
Multi-Sheet Workbooks
A single XLSX file often contains multiple worksheets: raw data, summaries, pivot tables, configuration. AILANG Parse processes all sheets in workbook order, each becoming a SectionBlock with the sheet name as a heading and the sheet's table data as nested blocks.
{
  "blocks": [
    {"type": "heading", "level": 2, "text": "Raw Data"},
    {"type": "table", "headers": [...], "rows": [...]},
    {"type": "heading", "level": 2, "text": "Summary"},
    {"type": "table", "headers": [...], "rows": [...]},
    {"type": "heading", "level": 2, "text": "Config"},
    {"type": "table", "headers": [...], "rows": [...]}
  ]
}
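Downstream code often wants the blocks regrouped per sheet. A minimal sketch, assuming the output has been loaded into plain Python dicts shaped like the JSON above (the sample table contents and the `group_by_sheet` name are invented for illustration):

```python
def group_by_sheet(blocks: list[dict]) -> dict[str, list[dict]]:
    """Collect the blocks that follow each heading under that heading's text."""
    sheets: dict[str, list[dict]] = {}
    current = None
    for block in blocks:
        if block["type"] == "heading":
            current = block["text"]
            sheets[current] = []
        elif current is not None:
            sheets[current].append(block)
    return sheets


# Illustrative stand-in for parsed output; real tables carry full cell data.
sample_blocks = [
    {"type": "heading", "level": 2, "text": "Raw Data"},
    {"type": "table", "headers": [{"text": "id"}], "rows": [[{"text": "1"}]]},
    {"type": "heading", "level": 2, "text": "Summary"},
    {"type": "table", "headers": [{"text": "total"}], "rows": [[{"text": "1"}]]},
]

by_sheet = group_by_sheet(sample_blocks)
for name, sheet_blocks in by_sheet.items():
    print(name, len(sheet_blocks))
```

Dicts preserve insertion order, so iterating `by_sheet` yields the sheets in workbook order.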
Use Cases
Financial Report Extraction
Parse budget spreadsheets, P&L statements, and revenue models with merged category headers intact. Formulas are preserved alongside computed values, so downstream systems can verify calculations or flag anomalies.
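As a rough illustration of such a check, the sketch below recomputes a single-column SUM from extracted cell values and compares it with the cached result. It assumes cells keyed by reference with "text" and "formula" fields as in the sample output. `check_sum` is a hypothetical helper that handles only the SUM(X1:Xn) pattern, not general formulas.

```python
import re

SUM_PATTERN = re.compile(r"SUM\(([A-Z]+)(\d+):([A-Z]+)(\d+)\)")


def check_sum(cells: dict[str, dict], ref: str) -> bool:
    """Return True if the cached value at `ref` equals the recomputed SUM."""
    cell = cells[ref]
    m = SUM_PATTERN.fullmatch(cell["formula"])
    if m is None or m.group(1) != m.group(3):
        raise ValueError("only single-column SUM ranges are supported")
    col, first, last = m.group(1), int(m.group(2)), int(m.group(4))
    total = sum(float(cells[f"{col}{row}"]["text"])
                for row in range(first, last + 1))
    return total == float(cell["text"])


# Values taken from the sample sheet earlier on this page.
cells = {
    "B2": {"text": "125000"},
    "B3": {"text": "143000"},
    "C3": {"text": "268000", "formula": "SUM(B2:B3)"},
}
print(check_sum(cells, "C3"))

# A stale cached value is flagged as a mismatch.
stale = dict(cells, C3={"text": "260000", "formula": "SUM(B2:B3)"})
print(check_sum(stale, "C3"))
```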
Data Pipeline Ingestion
Convert XLSX uploads to structured JSON for database ingestion. Cell types (number, date, string) are preserved, so your pipeline doesn't need to guess whether "42736" is a number or an Excel date serial. Multi-sheet workbooks are handled in a single call.
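To see why the type matters, here is how a date serial converts once the cell is known to be a date. This is a sketch assuming the workbook uses the default 1900 date system and a serial of 61 or later (earlier serials are affected by Excel's fictitious 29 February 1900). `serial_to_date` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Day zero that makes 1900-system serials line up for dates from 1 March 1900 on.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial: int) -> date:
    """Convert a whole-number Excel date serial to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(serial_to_date(42736))  # 2017-01-01
```

Without the type information, the same stored value would be indistinguishable from the plain number 42736.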
Regulatory Filing Analysis
Financial regulatory filings (XBRL supplements, risk reports, capital adequacy schedules) rely heavily on merged cells for hierarchical categorization. Flat CSV export destroys this structure. Structured parsing preserves it.
Inventory & ERP Data
Warehouse inventory sheets, BOM spreadsheets, and ERP exports use multi-sheet workbooks with cross-references. Parse the entire workbook into typed blocks, then feed to an LLM for procurement analysis, stock reconciliation, or demand forecasting.
AI Spreadsheet Understanding
Feed structured table JSON to LLMs instead of CSV dumps. The model sees typed cells, merge topology, and formula expressions — enabling it to reason about table structure, not just cell values. Token efficiency improves when the model doesn't need to infer column boundaries.
Format Conversion
Convert XLSX to Markdown tables, HTML, or Quarto for documentation and reporting. Merged cells are expanded or annotated in the output format. Multi-sheet workbooks produce sectioned output with clear sheet boundaries.
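For a sense of what the Markdown conversion involves, the sketch below renders a table block, loaded as a plain dict in the shape shown earlier, as a pipe table. It is a simplified illustration and not the converter behind `--convert`: it ignores gridSpan and vMerge and does not escape pipe characters inside cell text. `table_to_markdown` is a hypothetical helper name.

```python
def table_to_markdown(table: dict) -> str:
    """Render a table block as a Markdown pipe table (merge metadata ignored)."""
    header = [cell["text"] for cell in table["headers"]]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in table["rows"]:
        lines.append("| " + " | ".join(cell["text"] for cell in row) + " |")
    return "\n".join(lines)


# Table block copied from the structured output sample on this page.
revenue_table = {
    "type": "table",
    "headers": [
        {"text": "Region", "vMerge": "restart"},
        {"text": "Q1"},
        {"text": "Q2"},
    ],
    "rows": [
        [{"text": "EMEA"}, {"text": "125000"}, {"text": "140000"}],
        [{"text": "APAC"}, {"text": "143000"},
         {"text": "268000", "formula": "SUM(B2:B3)"}],
    ],
}

markdown = table_to_markdown(revenue_table)
print(markdown)
```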
Try It
CLI
# Parse an XLSX file
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail financials.xlsx

# Convert XLSX to Markdown tables
./bin/docparse financials.xlsx --convert output.md

# Convert XLSX to HTML
./bin/docparse financials.xlsx --convert output.html
API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_xlsx_formatting","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'
Python SDK
from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("financials.xlsx", output_format="json")

# Iterate sheets (each is a section block)
for block in result.blocks:
    if block.type == "heading":
        print(f"Sheet: {block.text}")
    if block.type == "table":
        print(f"  {len(block.rows)} rows, {len(block.headers)} columns")
        if block.has_merges:
            print("    Contains merged cells")
Frequently Asked Questions
How do I parse an XLSX file to JSON?
Send the .xlsx file to the AILANG Parse API or use the Python/JS/Go SDK. The parser reads SpreadsheetML directly and returns structured JSON with tables, merged cells, and metadata per sheet.
How are merged cells handled in XLSX parsing?
Merged cell ranges are resolved into explicit gridSpan and vMerge metadata on each cell. The merge topology comes directly from the XML. See table extraction docs.
Does XLSX parsing support multi-sheet workbooks?
Yes. Each sheet becomes a separate section block with the sheet name as a heading. All sheets are parsed in workbook order.
Are formulas extracted from XLSX files?
Yes. Formula expressions are captured alongside their cached values. You get both the formula string and the last computed result.
What about shared strings in XLSX?
XLSX stores repeated text in a shared string table. AILANG Parse resolves all references to their actual text values automatically.
How fast is XLSX parsing?
Typical spreadsheets parse in under 50 ms. The parser has no external dependencies and spawns no subprocesses. AILANG Parse scores 93.9% composite on OfficeDocBench.