Native ODT, Not "DOCX-ish"
An .odt file is a zip archive containing XML — specifically, the OpenDocument Format. Paragraphs live in content.xml as text:p elements. Tables use the table: namespace. Headers and footers sit inside style:master-page in styles.xml. None of this maps cleanly to Word's w: schema.
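The zip-plus-XML layout described above can be seen with nothing but the Python standard library. The following is a minimal sketch, not AILANG Parse's implementation: `read_paragraphs` and `make_sample_odt` are hypothetical helper names, and the sample archive is built in memory purely for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_paragraphs(odt_bytes: bytes) -> list[str]:
    """Return the text of every text:p element found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()).strip() for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_odt() -> bytes:
    """Build a tiny ODT-like archive in memory (illustrative sample only)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>Revenue target: 2.4M EUR</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", content)
    return buffer.getvalue()


print(read_paragraphs(make_sample_odt()))
```

For a file on disk, passing `open("report.odt", "rb").read()` to `read_paragraphs` should work the same way.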
Most parsers handle ODT by shelling out to LibreOffice to convert to DOCX or PDF first — losing whatever ODF-specific structure won't survive the trip. AILANG Parse reads ODF XML directly, with the same Block ADT output you get from DOCX. No subprocess, no rendering, no conversion.
Converter-based pipelines call soffice --headless --convert-to, adding 2–5 seconds per file and losing structure. AILANG Parse runs in milliseconds with no external dependency.
Raw ODF XML vs Structured Output
<office:body>
  <office:text>
    <text:h text:outline-level="1">
      Q1 Plan
    </text:h>
    <text:p text:style-name="Normal">
      Revenue target: 2.4M EUR
    </text:p>
    <table:table table:name="Targets">
      <table:table-row>
        <table:table-cell
            table:number-columns-spanned="2">
          <text:p>Region</text:p>
        </table:table-cell>
        <table:table-cell>
          <text:p>Q1</text:p>
        </table:table-cell>
      </table:table-row>
    </table:table>
    <text:list>
      <text:list-item>
        <text:p>EMEA</text:p>
      </text:list-item>
    </text:list>
  </office:text>
</office:body>
{
  "metadata": {
    "title": "Q1 Plan",
    "author": "Alice Chen",
    "created": "2026-03-15T09:00:00Z"
  },
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Q1 Plan"
    },
    {
      "type": "text",
      "text": "Revenue target: 2.4M EUR",
      "style": "normal"
    },
    {
      "type": "table",
      "headers": [
        {"text": "Region", "colSpan": 2},
        {"text": "Q1"}
      ],
      "rows": [...]
    },
    {
      "type": "list",
      "ordered": false,
      "items": ["EMEA"]
    }
  ]
}
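Because the output is plain typed blocks, downstream code can dispatch on the type field. As a rough sketch of that idea, the hypothetical `blocks_to_markdown` helper below renders the heading, text, and list blocks from the example as Markdown. It assumes only the field names shown in the JSON above and is not part of AILANG Parse.

```python
def blocks_to_markdown(blocks: list[dict]) -> str:
    """Render heading, text, and list blocks as Markdown paragraphs."""
    parts = []
    for block in blocks:
        kind = block.get("type")
        if kind == "heading":
            parts.append("#" * block["level"] + " " + block["text"])
        elif kind == "text":
            parts.append(block["text"])
        elif kind == "list":
            lines = []
            for index, item in enumerate(block["items"], start=1):
                marker = f"{index}." if block.get("ordered") else "-"
                lines.append(f"{marker} {item}")
            parts.append("\n".join(lines))
        else:
            # Other block types (tables, sections) are left out of this sketch.
            parts.append(f"[{kind} block omitted]")
    return "\n\n".join(parts)


sample_blocks = [
    {"type": "heading", "level": 1, "text": "Q1 Plan"},
    {"type": "text", "text": "Revenue target: 2.4M EUR", "style": "normal"},
    {"type": "list", "ordered": False, "items": ["EMEA"]},
]
print(blocks_to_markdown(sample_blocks))
```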
What Gets Extracted
Headings & Outline
Every text:h element becomes a typed heading block with its outline level (1–6) preserved from text:outline-level. The document outline is reconstructable without rendering.
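To illustrate how an outline can be rebuilt from text:h elements alone, here is a small standard-library sketch. The `extract_outline` function is a hypothetical name, the second heading in the sample is invented for the demonstration, and defaulting a missing text:outline-level to 1 is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_outline(content_xml: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every text:h, in document order."""
    root = ET.fromstring(content_xml)
    outline = []
    for heading in root.iter(f"{{{TEXT_NS}}}h"):
        # Assumption: treat a missing outline level as level 1.
        level = int(heading.get(f"{{{TEXT_NS}}}outline-level", "1"))
        outline.append((level, "".join(heading.itertext()).strip()))
    return outline


SAMPLE_CONTENT = f"""
<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <text:h text:outline-level="1">Q1 Plan</text:h>
  <text:p>Revenue target: 2.4M EUR</text:p>
  <text:h text:outline-level="2">Targets</text:h>
</office:text>
""".strip()

for level, title in extract_outline(SAMPLE_CONTENT):
    print("  " * (level - 1) + title)
```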
Paragraphs & Styles
text:p elements become text blocks with their style name preserved. Inline runs and embedded frames (images anchored to text) come through inline at their anchor position.
Tables & Merged Cells
The full table:table structure is preserved, with table:number-columns-spanned resolved into explicit colSpan metadata. Header row identification matches the DOCX parser's contract.
Lists
text:list elements become typed list blocks with their items extracted and empty entries filtered. Nested list items come through as individual entries.
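The behaviour described for lists (items extracted, empty entries filtered, nested items flattened into individual entries) can be approximated as follows. This is a sketch under assumptions: `extract_lists` is a hypothetical helper, the nested "UK" item is invented sample data, and the ordered flag is left out because in ODF it depends on the list style rather than on the element itself.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_lists(office_text_xml: str) -> list[dict]:
    """Turn each top-level text:list into a list block with flattened items."""
    root = ET.fromstring(office_text_xml)
    blocks = []
    for child in root:  # direct children only, so nested lists are not emitted twice
        if child.tag != f"{{{TEXT_NS}}}list":
            continue
        items = ["".join(p.itertext()).strip() for p in child.iter(f"{{{TEXT_NS}}}p")]
        items = [item for item in items if item]  # drop empty entries
        blocks.append({"type": "list", "items": items})
    return blocks


SAMPLE_LIST = f"""
<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <text:list>
    <text:list-item>
      <text:p>EMEA</text:p>
      <text:list>
        <text:list-item><text:p>UK</text:p></text:list-item>
      </text:list>
    </text:list-item>
    <text:list-item><text:p/></text:list-item>
  </text:list>
</office:text>
""".strip()

print(extract_lists(SAMPLE_LIST))
```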
Headers, Footers & Text Boxes
Headers and footers from style:master-page in styles.xml are extracted as section blocks. Text boxes (draw:text-box) come through as their own SectionBlocks, placed at their anchor position in the flow.
Images & Metadata
Embedded images (draw:image) are extracted with their xlink:href path and an inferred MIME type. Metadata comes from meta.xml: dc:title, dc:creator, dc:date, and the page count from meta:document-statistic.
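As an illustration of where those metadata fields sit in meta.xml, the sketch below reads them with the standard library. The `extract_metadata` name and the returned key names are assumptions of this example, and the page count in the sample is made up.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_metadata(meta_xml: str) -> dict:
    """Read dc:title, dc:creator, dc:date and the page count from meta.xml."""
    root = ET.fromstring(meta_xml)

    def text_of(name: str):
        element = root.find(f".//{{{DC_NS}}}{name}")
        return element.text if element is not None else None

    stats = root.find(f".//{{{META_NS}}}document-statistic")
    pages = stats.get(f"{{{META_NS}}}page-count") if stats is not None else None
    return {
        "title": text_of("title"),
        "creator": text_of("creator"),
        "date": text_of("date"),
        "page_count": int(pages) if pages is not None else None,
    }


SAMPLE_META = f"""
<office:document-meta xmlns:office="{OFFICE_NS}" xmlns:meta="{META_NS}" xmlns:dc="{DC_NS}">
  <office:meta>
    <dc:title>Q1 Plan</dc:title>
    <dc:creator>Alice Chen</dc:creator>
    <dc:date>2026-03-15T09:00:00Z</dc:date>
    <meta:document-statistic meta:page-count="3"/>
  </office:meta>
</office:document-meta>
""".strip()

print(extract_metadata(SAMPLE_META))
```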
Headers & Footers
Unlike DOCX, ODT puts headers and footers inside the style definitions, not the body. They live in style:master-page elements inside styles.xml. AILANG Parse reads styles.xml as a second pass, walks every master-page, and emits any header/footer content as a SectionBlock with kind: "header" or "footer".
{
  "type": "section",
  "kind": "header",
  "blocks": [
    {"type": "text", "text": "Confidential — Internal Only", "style": "normal"}
  ]
}
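The second pass over styles.xml could look roughly like the sketch below, which walks every style:master-page and collects the paragraphs under style:header and style:footer. `extract_headers_footers` is a hypothetical name, the sample text is invented, and the emitted dictionaries only imitate the section shape shown above.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_headers_footers(styles_xml: str) -> list[dict]:
    """Emit one section per non-empty header or footer of each master page."""
    root = ET.fromstring(styles_xml)
    sections = []
    for page in root.iter(f"{{{STYLE_NS}}}master-page"):
        for kind in ("header", "footer"):
            for region in page.findall(f"{{{STYLE_NS}}}{kind}"):
                blocks = []
                for paragraph in region.iter(f"{{{TEXT_NS}}}p"):
                    text = "".join(paragraph.itertext()).strip()
                    if text:
                        blocks.append({"type": "text", "text": text})
                if blocks:
                    sections.append({"type": "section", "kind": kind, "blocks": blocks})
    return sections


SAMPLE_STYLES = f"""
<office:document-styles xmlns:office="{OFFICE_NS}" xmlns:style="{STYLE_NS}" xmlns:text="{TEXT_NS}">
  <office:master-styles>
    <style:master-page style:name="Standard">
      <style:header><text:p>Internal Only</text:p></style:header>
      <style:footer><text:p>Page footer</text:p></style:footer>
    </style:master-page>
  </office:master-styles>
</office:document-styles>
""".strip()

for section in extract_headers_footers(SAMPLE_STYLES):
    print(section["kind"], section["blocks"])
```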
Tables & Merged Cells
ODF tables encode horizontal merges with table:number-columns-spanned on the cell that owns the merge. AILANG Parse resolves these into the same colSpan/rowSpan shape used by the DOCX parser, so downstream code can treat ODT and DOCX tables identically.
{
  "type": "table",
  "headers": [
    {"text": "Region", "colSpan": 2, "rowSpan": 1, "merged": false},
    {"text": "Total", "colSpan": 1, "rowSpan": 1, "merged": false}
  ],
  "rows": [
    [{"text": "EMEA"}, {"text": "UK"}, {"text": "€125K"}],
    [{"text": "EMEA"}, {"text": "DE"}, {"text": "€98K"}]
  ]
}
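To make the span resolution concrete, the sketch below reads one table:table-row and turns the span attributes into colSpan and rowSpan values. It is an approximation under assumptions: `parse_row` is a hypothetical helper, cells without a span attribute default to 1, and table:covered-table-cell placeholders are simply skipped.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def parse_row(row: ET.Element) -> list[dict]:
    """Return one dict per table:table-cell with explicit span metadata."""
    cells = []
    for cell in row.findall(f"{{{TABLE_NS}}}table-cell"):
        paragraphs = ["".join(p.itertext()).strip() for p in cell.iter(f"{{{TEXT_NS}}}p")]
        cells.append({
            "text": " ".join(t for t in paragraphs if t),
            "colSpan": int(cell.get(f"{{{TABLE_NS}}}number-columns-spanned", "1")),
            "rowSpan": int(cell.get(f"{{{TABLE_NS}}}number-rows-spanned", "1")),
        })
    return cells


SAMPLE_TABLE = f"""
<table:table xmlns:table="{TABLE_NS}" xmlns:text="{TEXT_NS}" table:name="Targets">
  <table:table-row>
    <table:table-cell table:number-columns-spanned="2">
      <text:p>Region</text:p>
    </table:table-cell>
    <table:table-cell>
      <text:p>Q1</text:p>
    </table:table-cell>
  </table:table-row>
</table:table>
""".strip()

table = ET.fromstring(SAMPLE_TABLE)
header_row = table.find(f"{{{TABLE_NS}}}table-row")
print(parse_row(header_row))
```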
Use Cases
European Public Sector
EU government bodies and many national administrations standardise on ODF for procurement, contracts, and policy documents. Native ODT parsing means you can ingest these documents without setting up a LibreOffice fleet.
Mixed-Office Environments
Organisations with a Linux/LibreOffice contingent end up with mixed DOCX/ODT archives. One parser, one Block ADT, one downstream pipeline — instead of branching on file extension and hoping the conversion didn't lose structure.
Academic Manuscripts
Researchers using LaTeX or Zotero often produce ODT exports. Extract heading hierarchy, citation tables, and figure references without round-tripping through Word format.
Document Migration
Convert legacy ODT archives to Markdown for docs-as-code, HTML for web publishing, or DOCX for collaborators who insist on Word. Structure preservation means headings, tables, and images survive intact.
Compliance & Records
ODF is the standard for long-term document preservation in many jurisdictions because the format is open. Native parsing means you can index, search, and audit ODF archives without converting them to a proprietary format first.
Browser-Based Tools
Because ODT parsing runs in WebAssembly, you can build review tools that never upload files. Drop an ODT into the Workbench — the parser runs in your tab, files never leave your device.
Try It
Workbench (in-browser)
Drop an ODT file into the Workbench — ODT parsing runs in WebAssembly with the same AILANG parser used by the CLI. No upload, no signup.
CLI
# Parse an ODT file
./bin/docparse report.odt
# Convert ODT to Markdown
./bin/docparse report.odt --convert output.md
# Convert ODT to DOCX
./bin/docparse report.odt --convert output.docx
API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample.odt","outputFormat":"json","apiKey":"YOUR_API_KEY"}'
Python SDK
from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse_file("report.odt")
print(result.metadata.title)

# Walk the blocks, reporting headings and tables
for block in result.blocks:
    if block.type == "heading":
        print("H" + str(block.level), block.text)
    elif block.type == "table":
        print("Table:", len(block.rows), "rows")
Frequently Asked Questions
How do I parse an ODT file to JSON?
POST the file path to /api/v1/parse with "outputFormat": "json", as in the API example under Try It. The CLI, Python SDK, and Workbench routes are shown in the same section.
Does AILANG Parse need LibreOffice installed?
No. ODF content.xml, styles.xml, and meta.xml are read directly from the zip archive. There is no soffice subprocess and no headless rendering.
What ODT elements are extracted?
Headings (text:h with outline level), paragraphs (text:p), tables with merged cells (table:number-columns-spanned), lists (text:list), images (draw:image), text boxes (draw:text-box), sections, headers and footers from style:master-page, and metadata from meta.xml.
Are headers and footers extracted?
Yes. ODT stores headers and footers in style:master-page elements inside styles.xml. AILANG Parse reads styles.xml separately and emits each header/footer as a SectionBlock.
Does ODT parsing run in the browser?
Yes. The same AILANG parser used by the CLI compiles to WebAssembly. Drop an .odt into the Workbench and parsing happens locally in your browser tab — the file never leaves your device.
How is ODT different from DOCX parsing?
Both are zip archives with XML inside, but they use different schemas. DOCX uses Office Open XML (w: namespace). ODT uses OpenDocument Format (text:, table:, office: namespaces). AILANG Parse implements both natively, with the same Block ADT output. See DOCX parsing for the Word equivalent.