ODT Parsing API

Native OpenDocument Text parsing for LibreOffice Writer files. Headings, tables, lists, headers/footers, images, and metadata — extracted directly from content.xml. No LibreOffice subprocess, no DOCX conversion step.

Native ODT, Not "DOCX-ish"

An .odt file is a zip archive containing XML — specifically, the OpenDocument Format. Paragraphs live in content.xml as text:p elements. Tables use the table: namespace. Headers and footers sit inside style:master-page in styles.xml. None of this maps cleanly to Word's w: schema.
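To make the container layout concrete, here is a minimal sketch that uses only the Python standard library. It builds a tiny stand-in archive in memory and reads its text:p elements back out of content.xml. The helper names (read_paragraphs, make_sample_odt) are hypothetical and illustrate the file format only; this is not AILANG Parse's implementation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard OpenDocument namespace URIs.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_paragraphs(odt_bytes: bytes) -> list[str]:
    """Return the text of every text:p element found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()).strip() for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_odt() -> bytes:
    """Build a minimal in-memory archive that stands in for a real .odt file."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>Revenue target: 2.4M EUR</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", content)
    return buffer.getvalue()


paragraphs = read_paragraphs(make_sample_odt())
print(paragraphs)
```

With a real file, the same function would take the bytes of the .odt read from disk.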

Most parsers handle ODT by shelling out to LibreOffice to convert to DOCX or PDF first — losing whatever ODF-specific structure won't survive the trip. AILANG Parse reads ODF XML directly, with the same Block ADT output you get from DOCX. No subprocess, no rendering, no conversion.

Native ODF parsing is rare in the ecosystem. Most "universal" document parsers either skip ODT entirely or pipe it through soffice --headless --convert-to, which adds 2–5 seconds per file and loses structure. AILANG Parse runs in milliseconds with no external dependency.

Raw ODF XML vs Structured Output

Raw content.xml
<office:body>
 <office:text>
  <text:h text:outline-level="1">
    Q1 Plan
  </text:h>
  <text:p text:style-name="Normal">
    Revenue target: 2.4M EUR
  </text:p>
  <table:table table:name="Targets">
   <table:table-row>
    <table:table-cell
      table:number-columns-spanned="2">
     <text:p>Region</text:p>
    </table:table-cell>
    <table:table-cell>
     <text:p>Q1</text:p>
    </table:table-cell>
   </table:table-row>
  </table:table>
  <text:list>
    <text:list-item>
      <text:p>EMEA</text:p>
    </text:list-item>
  </text:list>
 </office:text>
</office:body>
Structured output
{
  "metadata": {
    "title": "Q1 Plan",
    "author": "Alice Chen",
    "created": "2026-03-15T09:00:00Z"
  },
  "blocks": [
    {
      "type": "heading",
      "level": 1,
      "text": "Q1 Plan"
    },
    {
      "type": "text",
      "text": "Revenue target: 2.4M EUR",
      "style": "normal"
    },
    {
      "type": "table",
      "headers": [
        {"text": "Region", "colSpan": 2},
        {"text": "Q1"}
      ],
      "rows": [...]
    },
    {
      "type": "list",
      "ordered": false,
      "items": ["EMEA"]
    }
  ]
}
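As a rough illustration of how downstream code might consume this shape, the sketch below walks the blocks array and renders headings, text, and lists as Markdown. The function name blocks_to_markdown is hypothetical, the field names are taken from the sample above, and tables are deliberately skipped to keep the example short.

```python
def blocks_to_markdown(blocks: list[dict]) -> str:
    """Render heading, text, and list blocks as Markdown lines."""
    lines = []
    for block in blocks:
        kind = block.get("type")
        if kind == "heading":
            lines.append("#" * block["level"] + " " + block["text"])
        elif kind == "text":
            lines.append(block["text"])
        elif kind == "list":
            for index, item in enumerate(block["items"], start=1):
                marker = f"{index}." if block.get("ordered") else "-"
                lines.append(f"{marker} {item}")
        # Other block types (tables, sections) are ignored in this sketch.
    return "\n".join(lines)


# Blocks copied from the sample output; table rows are elided in the source.
document = {
    "blocks": [
        {"type": "heading", "level": 1, "text": "Q1 Plan"},
        {"type": "text", "text": "Revenue target: 2.4M EUR", "style": "normal"},
        {"type": "table", "headers": [{"text": "Region", "colSpan": 2}, {"text": "Q1"}]},
        {"type": "list", "ordered": False, "items": ["EMEA"]},
    ]
}

markdown = blocks_to_markdown(document["blocks"])
print(markdown)
```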

What Gets Extracted

Headings & Outline

Every text:h element becomes a typed heading block with its outline level (1–6) preserved from text:outline-level. The document outline is reconstructable without rendering.
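The mapping from text:h to a heading block can be sketched with the standard library's ElementTree. This is an assumption-laden illustration, not the parser's code: extract_headings is a hypothetical name, and defaulting a missing text:outline-level to 1 is a choice made for this example.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def extract_headings(content_xml: str) -> list[dict]:
    """Collect text:h elements in document order with their outline level."""
    root = ET.fromstring(content_xml)
    headings = []
    for node in root.iter(f"{{{TEXT_NS}}}h"):
        # Assumption for this sketch: a missing outline level is treated as 1.
        level = int(node.get(f"{{{TEXT_NS}}}outline-level", "1"))
        text = "".join(node.itertext()).strip()
        headings.append({"type": "heading", "level": level, "text": text})
    return headings


sample_xml = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:body><office:text>"
    '<text:h text:outline-level="1">Q1 Plan</text:h>'
    "<text:p>Revenue target: 2.4M EUR</text:p>"
    '<text:h text:outline-level="2">Targets</text:h>'
    "</office:text></office:body></office:document-content>"
)

headings = extract_headings(sample_xml)
for heading in headings:
    print("  " * (heading["level"] - 1) + heading["text"])
```

Indenting by level, as the loop at the end does, is one simple way to rebuild the outline from the heading blocks alone.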

Paragraphs & Styles

text:p elements become text blocks with their style name preserved. Inline runs and embedded frames (images anchored to text) are emitted at their anchor position within the paragraph.

Tables & Merged Cells

Full table:table structure with table:number-columns-spanned resolved into explicit colSpan metadata. Header row identification matches the DOCX parser's contract.

Lists

text:list elements become typed list blocks with their items extracted and empty entries filtered. Nested list items come through as individual entries.
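One way to picture the flattening and filtering described here is the sketch below. It assumes that "individual entries" means every paragraph inside a top-level text:list, nested or not, becomes one item, and that items with no text are dropped. The function name extract_lists is hypothetical, and the ordered flag is omitted because it depends on list styles that this sketch does not read.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def extract_lists(content_xml: str) -> list[dict]:
    """Turn each top-level text:list into a list block with flattened items."""
    root = ET.fromstring(content_xml)
    blocks = []
    for body in root.iter(f"{{{OFFICE_NS}}}text"):
        # Only direct children, so nested lists are not emitted twice.
        for top_level in body.findall(f"{{{TEXT_NS}}}list"):
            items = []
            for paragraph in top_level.iter(f"{{{TEXT_NS}}}p"):
                text = "".join(paragraph.itertext()).strip()
                if text:  # drop empty entries
                    items.append(text)
            blocks.append({"type": "list", "items": items})
    return blocks


sample_xml = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:body><office:text><text:list>"
    "<text:list-item><text:p>EMEA</text:p></text:list-item>"
    "<text:list-item><text:p/></text:list-item>"
    "<text:list-item><text:p>APAC</text:p><text:list>"
    "<text:list-item><text:p>Japan</text:p></text:list-item>"
    "</text:list></text:list-item>"
    "</text:list></office:text></office:body></office:document-content>"
)

list_blocks = extract_lists(sample_xml)
print(list_blocks)
```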

Headers, Footers & Text Boxes

Headers and footers from style:master-page in styles.xml extracted as section blocks. Text boxes (draw:text-box) come through as their own SectionBlocks, anchored at the right place in the flow.

Images & Metadata

Embedded images (draw:image) extracted with their xlink:href path and inferred MIME type. Metadata from meta.xml: dc:title, dc:creator, dc:date, and page count from meta:document-statistic.
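For the metadata half, a minimal standard-library sketch might look like the following. The function name extract_metadata and the output keys are hypothetical, the sample values (including the page count of 3) are invented for the example, and the real parser's field mapping may differ.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_metadata(meta_xml: str) -> dict:
    """Read dc:title, dc:creator, dc:date and the page count from meta.xml."""
    root = ET.fromstring(meta_xml)

    def first_text(namespace: str, tag: str):
        node = root.find(f".//{{{namespace}}}{tag}")
        return node.text if node is not None else None

    stats = root.find(f".//{{{META_NS}}}document-statistic")
    raw_pages = stats.get(f"{{{META_NS}}}page-count") if stats is not None else None
    return {
        "title": first_text(DC_NS, "title"),
        "author": first_text(DC_NS, "creator"),
        "date": first_text(DC_NS, "date"),
        "pageCount": int(raw_pages) if raw_pages is not None else None,
    }


# Invented sample values, for illustration only.
sample_meta = (
    f'<office:document-meta xmlns:office="{OFFICE_NS}" '
    f'xmlns:meta="{META_NS}" xmlns:dc="{DC_NS}"><office:meta>'
    "<dc:title>Q1 Plan</dc:title>"
    "<dc:creator>Alice Chen</dc:creator>"
    "<dc:date>2026-03-15T09:00:00Z</dc:date>"
    '<meta:document-statistic meta:page-count="3"/>'
    "</office:meta></office:document-meta>"
)

metadata = extract_metadata(sample_meta)
print(metadata)
```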

Headers & Footers

Unlike DOCX, ODT puts headers and footers inside the style definitions, not the body. They live in style:master-page elements inside styles.xml. AILANG Parse reads styles.xml as a second pass, walks every master-page, and emits any header/footer content as a SectionBlock with kind: "header" or "footer".

{
  "type": "section",
  "kind": "header",
  "blocks": [
    {"type": "text", "text": "Confidential — Internal Only", "style": "normal"}
  ]
}
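The second pass over styles.xml can be approximated as follows. This is a sketch under assumptions: extract_headers_footers is a hypothetical name, only plain text:p content is collected, and the style field shown in the sample above is left out because resolving it requires style lookups this example does not perform.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_headers_footers(styles_xml: str) -> list[dict]:
    """Walk every style:master-page and emit header/footer section blocks."""
    root = ET.fromstring(styles_xml)
    sections = []
    for page in root.iter(f"{{{STYLE_NS}}}master-page"):
        for kind in ("header", "footer"):
            for node in page.findall(f"{{{STYLE_NS}}}{kind}"):
                blocks = []
                for paragraph in node.iter(f"{{{TEXT_NS}}}p"):
                    text = "".join(paragraph.itertext()).strip()
                    if text:
                        blocks.append({"type": "text", "text": text})
                if blocks:  # skip headers or footers with no text
                    sections.append({"type": "section", "kind": kind, "blocks": blocks})
    return sections


sample_styles = (
    f'<office:document-styles xmlns:office="{OFFICE_NS}" '
    f'xmlns:style="{STYLE_NS}" xmlns:text="{TEXT_NS}"><office:master-styles>'
    '<style:master-page style:name="Standard">'
    "<style:header><text:p>Confidential: Internal Only</text:p></style:header>"
    "<style:footer><text:p/></style:footer>"
    "</style:master-page>"
    "</office:master-styles></office:document-styles>"
)

sections = extract_headers_footers(sample_styles)
print(sections)
```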

Tables & Merged Cells

ODF tables encode horizontal merges with table:number-columns-spanned on the cell that owns the merge. AILANG Parse resolves these into the same colSpan/rowSpan shape used by the DOCX parser, so downstream code can treat ODT and DOCX tables identically.

{
  "type": "table",
  "headers": [
    {"text": "Region", "colSpan": 2, "rowSpan": 1, "merged": false},
    {"text": "Total", "colSpan": 1, "rowSpan": 1, "merged": false}
  ],
  "rows": [
    [{"text": "EMEA"}, {"text": "UK"}, {"text": "€125K"}],
    [{"text": "EMEA"}, {"text": "DE"}, {"text": "€98K"}]
  ]
}
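To show where the span values come from, the sketch below reads table:number-columns-spanned (and its ODF counterpart for vertical merges, table:number-rows-spanned) straight off each table:table-cell. The function name extract_table_rows is hypothetical, header detection and the merged flag are not modelled, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_table_rows(content_xml: str) -> list[list[dict]]:
    """Return the first table's rows as lists of cells with explicit spans."""
    root = ET.fromstring(content_xml)
    table = root.find(f".//{{{TABLE_NS}}}table")
    if table is None:
        return []
    rows = []
    for row in table.iter(f"{{{TABLE_NS}}}table-row"):
        cells = []
        # table:covered-table-cell placeholders have a different tag,
        # so they are skipped naturally here.
        for cell in row.findall(f"{{{TABLE_NS}}}table-cell"):
            cells.append({
                "text": "".join(cell.itertext()).strip(),
                "colSpan": int(cell.get(f"{{{TABLE_NS}}}number-columns-spanned", "1")),
                "rowSpan": int(cell.get(f"{{{TABLE_NS}}}number-rows-spanned", "1")),
            })
        rows.append(cells)
    return rows


sample_xml = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" '
    f'xmlns:table="{TABLE_NS}" xmlns:text="{TEXT_NS}">'
    '<office:body><office:text><table:table table:name="Targets"><table:table-row>'
    '<table:table-cell table:number-columns-spanned="2"><text:p>Region</text:p></table:table-cell>'
    "<table:covered-table-cell/>"
    "<table:table-cell><text:p>Q1</text:p></table:table-cell>"
    "</table:table-row></table:table></office:text></office:body>"
    "</office:document-content>"
)

table_rows = extract_table_rows(sample_xml)
print(table_rows)
```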

Use Cases

European Public Sector

EU government bodies and many national administrations standardise on ODF for procurement, contracts, and policy documents. Native ODT parsing means you can ingest this archive without setting up a LibreOffice fleet.

Mixed-Office Environments

Organisations with a Linux/LibreOffice contingent end up with mixed DOCX/ODT archives. One parser, one Block ADT, one downstream pipeline — instead of branching on file extension and hoping the conversion didn't lose structure.

Academic Manuscripts

Researchers using LaTeX or Zotero often produce ODT exports. Extract heading hierarchy, citation tables, and figure references without round-tripping through Word format.

Document Migration

Convert legacy ODT archives to Markdown for docs-as-code, HTML for web publishing, or DOCX for collaborators who insist on Word. Structure preservation means headings, tables, and images survive intact.

Compliance & Records

ODF is the standard for long-term document preservation in many jurisdictions because the format is open. Native parsing means you can index, search, and audit ODF archives without converting them to a proprietary format first.

Browser-Based Tools

Because ODT parsing runs in WebAssembly, you can build review tools that never upload files. Drop an ODT into the Workbench — the parser runs in your tab, files never leave your device.

Try It

Workbench (in-browser)

Drop an ODT file into the Workbench — ODT parsing runs in WebAssembly with the same AILANG parser used by the CLI. No upload, no signup.

CLI

# Parse an ODT file
./bin/docparse report.odt

# Convert ODT to Markdown
./bin/docparse report.odt --convert output.md

# Convert ODT to DOCX
./bin/docparse report.odt --convert output.docx

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample.odt","outputFormat":"json","apiKey":"YOUR_API_KEY"}'

Python SDK

from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse_file("report.odt")

print(result.metadata.title)
for block in result.blocks:
    if block.type == "heading":
        print("H" + str(block.level), block.text)
    elif block.type == "table":
        print("Table:", len(block.rows), "rows")


Frequently Asked Questions

How do I parse an ODT file to JSON?

Drop the .odt file into the Workbench, send it to the API, or use the Python/JS/Go/R SDK. The parser reads OpenDocument XML directly — no LibreOffice subprocess, no DOCX conversion.

Does AILANG Parse need LibreOffice installed?

No. ODF content.xml, styles.xml, and meta.xml are read directly from the zip archive. There is no soffice subprocess and no headless rendering.

What ODT elements are extracted?

Headings (text:h with outline level), paragraphs (text:p), tables with merged cells (table:number-columns-spanned), lists (text:list), images (draw:image), text boxes (draw:text-box), sections, headers and footers from style:master-page, and metadata from meta.xml.

Are headers and footers extracted?

Yes. ODT stores headers and footers in style:master-page elements inside styles.xml. AILANG Parse reads styles.xml separately and emits each header/footer as a SectionBlock.

Does ODT parsing run in the browser?

Yes. The same AILANG parser used by the CLI compiles to WebAssembly. Drop an .odt into the Workbench and parsing happens locally in your browser tab — the file never leaves your device.

How is ODT different from DOCX parsing?

Both are zip archives with XML inside, but they use different schemas. DOCX uses Office Open XML (w: namespace). ODT uses OpenDocument Format (text:, table:, office: namespaces). AILANG Parse implements both natively, with the same Block ADT output. See DOCX parsing for the Word equivalent.