ODP Parsing API

Native OpenDocument Presentation parsing for LibreOffice Impress files. Per-slide blocks with text frames, tables, images, and lists — extracted directly from content.xml. No LibreOffice subprocess, no PPTX conversion step.

Slides as Structured Data

An .odp file is a zip archive with one big content.xml describing every slide as a draw:page element. Inside each slide, text frames, tables, lists, and images sit as draw:frame children with the ODF text:, table:, and draw: namespaces. None of this maps to PowerPoint's p: schema.

Parsers that pipe ODP through LibreOffice to convert to PPTX or PDF lose slide structure, lose master-slide context, and add seconds of latency per file. AILANG Parse reads ODF directly: each draw:page becomes a SectionBlock with a slide name, and the contents come through as typed blocks within it.

The output uses the same Block ADT as the PPTX parser, so one downstream pipeline handles both formats without branching on file extension.
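
The archive layout described above can be inspected with nothing but the Python standard library. The sketch below is an illustration of the file format, not AILANG Parse's implementation: list_slide_names is a hypothetical helper that opens the zip, reads content.xml, and returns the draw:name of each draw:page.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace URIs for the office: and draw: prefixes.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
}
DRAW_NAME = "{%s}name" % NS["draw"]


def list_slide_names(odp_file):
    """Return the draw:name of every draw:page in an .odp archive.

    odp_file may be a path or a binary file-like object. Pages that have
    no draw:name attribute are reported as None.
    """
    with zipfile.ZipFile(odp_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # The root element is office:document-content; slides sit under
    # office:body/office:presentation.
    pages = root.findall("./office:body/office:presentation/draw:page", NS)
    return [page.get(DRAW_NAME) for page in pages]
```

Calling list_slide_names("deck.odp") on a two-slide deck would return a two-element list of names, with None for any slide that has no draw:name.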

Raw ODF XML vs Structured Output

Raw content.xml
<office:body>
 <office:presentation>
  <draw:page draw:name="Slide1">
   <draw:frame>
    <draw:text-box>
     <text:h text:outline-level="1">
       Q1 Roadmap
     </text:h>
    </draw:text-box>
   </draw:frame>
   <draw:frame>
    <draw:text-box>
     <text:list>
      <text:list-item>
       <text:p>Ship v0.9</text:p>
      </text:list-item>
      <text:list-item>
       <text:p>OfficeDocBench</text:p>
      </text:list-item>
     </text:list>
    </draw:text-box>
   </draw:frame>
  </draw:page>
 </office:presentation>
</office:body>

Structured output
{
  "metadata": {
    "title": "Q1 Roadmap",
    "author": "Alice Chen"
  },
  "blocks": [
    {
      "type": "section",
      "kind": "slide:Slide1",
      "blocks": [
        {
          "type": "heading",
          "level": 1,
          "text": "Q1 Roadmap"
        },
        {
          "type": "list",
          "ordered": false,
          "items": [
            "Ship v0.9",
            "OfficeDocBench"
          ]
        }
      ]
    }
  ]
}
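
To make the mapping between the two panels concrete, here is a minimal sketch that turns the raw XML above into the same "blocks" structure using only the standard library. It is an illustration under stated assumptions, not the parser's actual code: the function names are hypothetical, the "Slide<n>" fallback label is invented, the "text" block shape for plain paragraphs is a guess, and every list is reported as unordered because deciding that properly would require reading list styles.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def q(prefix, local):
    """Build the '{namespace}local' form that ElementTree uses for names."""
    return "{%s}%s" % (NS[prefix], local)


def text_of(element):
    """Collect all nested text and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def text_box_blocks(text_box):
    """Map the direct children of one draw:text-box to typed blocks."""
    blocks = []
    for child in text_box:
        if child.tag == q("text", "h"):
            level = int(child.get(q("text", "outline-level"), "1"))
            blocks.append({"type": "heading", "level": level, "text": text_of(child)})
        elif child.tag == q("text", "list"):
            items = [text_of(item) for item in child.findall("text:list-item", NS)]
            # Simplification: always unordered; empty items are dropped.
            blocks.append({"type": "list", "ordered": False,
                           "items": [item for item in items if item]})
        elif child.tag == q("text", "p") and text_of(child):
            # Assumed shape for a plain text block.
            blocks.append({"type": "text", "text": text_of(child)})
    return blocks


def slide_sections(content_root):
    """Return one section dict per draw:page, in document order."""
    sections = []
    for index, page in enumerate(content_root.iter(q("draw", "page")), start=1):
        # Hypothetical fallback label for pages without draw:name.
        name = page.get(q("draw", "name")) or "Slide%d" % index
        blocks = []
        for box in page.iter(q("draw", "text-box")):
            blocks.extend(text_box_blocks(box))
        sections.append({"type": "section", "kind": "slide:" + name, "blocks": blocks})
    return sections


# The example slide from above, wrapped in a root that declares the namespaces.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
 <office:body><office:presentation>
  <draw:page draw:name="Slide1">
   <draw:frame><draw:text-box>
    <text:h text:outline-level="1">
      Q1 Roadmap
    </text:h>
   </draw:text-box></draw:frame>
   <draw:frame><draw:text-box>
    <text:list>
     <text:list-item><text:p>Ship v0.9</text:p></text:list-item>
     <text:list-item><text:p>OfficeDocBench</text:p></text:list-item>
    </text:list>
   </draw:text-box></draw:frame>
  </draw:page>
 </office:presentation></office:body>
</office:document-content>"""

sections = slide_sections(ET.fromstring(SAMPLE))
```

The sketch covers headings, paragraphs, and lists only; tables, images, and metadata are left out to keep it short.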

What Gets Extracted

Per-Slide SectionBlocks

Every draw:page becomes a SectionBlock with kind: "slide:<name>". Slide names come from draw:name attributes when present, falling back to a generic label otherwise.

Text Frames & Headings

Text inside draw:text-box elements comes through as text and heading blocks. text:h elements preserve their outline level (1–6); text:p elements become normal text blocks.

Tables on Slides

Tables can sit directly inside slides or inside frames. Both paths come through as TableBlock with rows, headers, and merged cells preserved — same shape as the DOCX/PPTX parser.
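
As a rough illustration of how merged cells appear in ODF, the sketch below reads a table:table element into rows of cells. In ODF, a merged cell carries table:number-columns-spanned or table:number-rows-spanned, and the positions it covers are filled with table:covered-table-cell placeholders. The output dict shape (text, colspan, rowspan) is an assumption made for this example; the real TableBlock fields are not shown here, and the sample table contents are made up.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"


def t(local):
    """Qualified name in the ODF table: namespace."""
    return "{%s}%s" % (TABLE_NS, local)


def table_rows(table):
    """Return a list of rows; each row is a list of cell dicts."""
    rows = []
    for row in table.iter(t("table-row")):
        cells = []
        for cell in row:
            # Skip covered-table-cell placeholders and any non-cell children.
            if cell.tag != t("table-cell"):
                continue
            cells.append({
                "text": " ".join("".join(cell.itertext()).split()),
                "colspan": int(cell.get(t("number-columns-spanned"), "1")),
                "rowspan": int(cell.get(t("number-rows-spanned"), "1")),
            })
        rows.append(cells)
    return rows


# Made-up sample: a header cell spanning two columns, then one normal row.
SAMPLE_TABLE = """<table:table
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
 <table:table-row>
  <table:table-cell table:number-columns-spanned="2"><text:p>Milestones</text:p></table:table-cell>
  <table:covered-table-cell/>
 </table:table-row>
 <table:table-row>
  <table:table-cell><text:p>Ship v0.9</text:p></table:table-cell>
  <table:table-cell><text:p>March</text:p></table:table-cell>
 </table:table-row>
</table:table>"""

rows = table_rows(ET.fromstring(SAMPLE_TABLE))
```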

Bullet & Numbered Lists

text:list elements become typed list blocks. Each text:list-item becomes an entry in the list; empty items are filtered out.

Images

Embedded images (draw:image) are extracted with their xlink:href path and an inferred MIME type. Image references stay anchored to the slide they appeared on.
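
One plausible way to infer the MIME type is from the file extension of the xlink:href path. The sketch below assumes that approach (the parser may do it differently) and uses Python's mimetypes module; slide_images and its output keys are hypothetical names for this example.

```python
import mimetypes
import xml.etree.ElementTree as ET

DRAW = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK = "http://www.w3.org/1999/xlink"


def slide_images(content_root):
    """List every draw:image together with the slide it belongs to."""
    images = []
    for page in content_root.iter("{%s}page" % DRAW):
        slide = page.get("{%s}name" % DRAW)
        for image in page.iter("{%s}image" % DRAW):
            href = image.get("{%s}href" % XLINK)
            if not href:
                continue  # no link target to report
            # Guess the MIME type from the extension, e.g. .png -> image/png.
            mime, _encoding = mimetypes.guess_type(href)
            images.append({"slide": slide, "href": href, "mime": mime})
    return images


# Made-up sample with one image on one slide.
SAMPLE_IMAGES = """<office:presentation
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
 <draw:page draw:name="Slide1">
  <draw:frame><draw:image xlink:href="Pictures/roadmap.png"/></draw:frame>
 </draw:page>
</office:presentation>"""

images = slide_images(ET.fromstring(SAMPLE_IMAGES))
```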

Metadata

Document metadata from meta.xml: dc:title, dc:creator, dc:date. Useful for indexing slide decks by author or date in bulk pipelines.
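
For reference, these three fields are Dublin Core elements inside meta.xml. A minimal standard-library sketch of reading them follows; read_metadata is a hypothetical helper, the "title" and "author" keys mirror the structured output shown earlier, the "date" key is an assumption, and the sample date is made up.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core namespace used by ODF


def read_metadata(meta_xml):
    """Extract dc:title, dc:creator and dc:date from a meta.xml string."""
    root = ET.fromstring(meta_xml)

    def first(local):
        # First matching element anywhere in the tree, or None if absent.
        element = root.find(".//{%s}%s" % (DC, local))
        if element is None or not element.text:
            return None
        return element.text.strip()

    return {
        "title": first("title"),
        "author": first("creator"),
        "date": first("date"),
    }


SAMPLE_META = """<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
 <office:meta>
  <dc:title>Q1 Roadmap</dc:title>
  <dc:creator>Alice Chen</dc:creator>
  <dc:date>2024-01-15T09:30:00</dc:date>
 </office:meta>
</office:document-meta>"""

metadata = read_metadata(SAMPLE_META)
```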

Slide-by-Slide Structure

Because each slide becomes its own SectionBlock, downstream code can iterate slides cleanly without flattening the deck into one continuous stream. Want just the headings? Filter to type: "section" with kind matching "slide:*", then walk each slide's children for type: "heading". Want just slides with tables? Same shape, different filter.

from ailang_parse import DocParse

# Parse the deck; result.blocks holds the top-level blocks
result = DocParse(api_key="...").parse_file("deck.odp")

for slide in result.blocks:
    # Only section blocks represent slides; skip anything else
    if slide.type != "section":
        continue
    print("---", slide.kind, "---")
    for block in slide.blocks:
        if block.type == "heading":
            # Prefix the heading text with its outline level
            print("H" + str(block.level), block.text)
        elif block.type == "list":
            for item in block.items:
                print("  -", item)
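
The "slides with tables" filter mentioned above can be sketched against the JSON output rather than the SDK objects. This version works on plain dicts shaped like the structured output shown earlier; slides_with is a hypothetical helper, and the "table" type string is an assumption about how a TableBlock is tagged.

```python
def slides_with(blocks, block_type):
    """Return the slide sections that contain at least one block of block_type."""
    return [
        section
        for section in blocks
        if section.get("type") == "section"
        and section.get("kind", "").startswith("slide:")
        and any(child.get("type") == block_type for child in section.get("blocks", []))
    ]


# Made-up deck: only the second slide carries a table.
deck = [
    {"type": "section", "kind": "slide:Slide1",
     "blocks": [{"type": "heading", "level": 1, "text": "Q1 Roadmap"}]},
    {"type": "section", "kind": "slide:Slide2",
     "blocks": [{"type": "table"}]},
]

table_slides = [section["kind"] for section in slides_with(deck, "table")]
heading_slides = [section["kind"] for section in slides_with(deck, "heading")]
```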

Use Cases

Conference Talk Archives

Many academic and open-source conferences publish slide decks in ODP. Index them by speaker, search across hundreds of decks, or feed slide outlines to an LLM without losing slide boundaries.

Mixed-Office Pipelines

Organisations with a Linux/LibreOffice contingent end up with mixed PPTX/ODP archives. One parser, one Block ADT, one downstream pipeline. See PPTX parsing for the PowerPoint counterpart.

Slide-Level RAG

Each slide becomes its own retrievable chunk with explicit boundaries. Embed the slide as a unit instead of a windowed chunk of flattened text — the model gets a coherent slide instead of half of slide 4 plus half of slide 5.
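
A minimal sketch of slide-level chunking, assuming the JSON output shape shown earlier: each section becomes one chunk whose id is the section's kind and whose text is the slide's headings, paragraphs, and list items joined by newlines. slide_chunks is a hypothetical helper, and the embedding step is left out.

```python
def slide_chunks(blocks):
    """Build one {'id', 'text'} chunk per slide section."""
    chunks = []
    for section in blocks:
        if section.get("type") != "section":
            continue
        lines = []
        for block in section.get("blocks", []):
            if block.get("type") == "list":
                # Render list items as dash-prefixed lines.
                lines.extend("- " + item for item in block.get("items", []))
            elif block.get("text"):
                # Headings and plain text blocks both carry a "text" field.
                lines.append(block["text"])
        chunks.append({"id": section.get("kind", ""), "text": "\n".join(lines)})
    return chunks


deck = [
    {"type": "section", "kind": "slide:Slide1", "blocks": [
        {"type": "heading", "level": 1, "text": "Q1 Roadmap"},
        {"type": "list", "ordered": False, "items": ["Ship v0.9", "OfficeDocBench"]},
    ]},
]

chunks = slide_chunks(deck)
```

Each chunk can then be passed to whichever embedding model the pipeline uses, with the id kept as retrieval metadata.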

Outline Extraction

Pull every slide title for an automatic table of contents. Generate PPTX or QMD output from an ODP source for cross-tool collaboration.
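
As a sketch of the table-of-contents idea, the function below takes the first heading on each slide as its title and falls back to the section's kind when a slide has no heading. It assumes the JSON output shape shown earlier; deck_outline is a hypothetical helper, and the fallback choice is only one reasonable option.

```python
def deck_outline(blocks):
    """Return a numbered outline with one line per slide."""
    slides = [block for block in blocks if block.get("type") == "section"]
    lines = []
    for number, slide in enumerate(slides, start=1):
        # First heading on the slide, else the section kind as a fallback.
        title = next(
            (child["text"] for child in slide.get("blocks", [])
             if child.get("type") == "heading"),
            slide.get("kind", ""),
        )
        lines.append("%d. %s" % (number, title))
    return "\n".join(lines)


deck = [
    {"type": "section", "kind": "slide:Slide1",
     "blocks": [{"type": "heading", "level": 1, "text": "Q1 Roadmap"}]},
    {"type": "section", "kind": "slide:Slide2", "blocks": []},
]

outline = deck_outline(deck)
```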

Browser-Based Review

Drop an ODP into the Workbench — the parser runs in WebAssembly, the file never leaves your browser. Useful when slide decks are confidential.

Format Migration

Convert ODP archives to PPTX, HTML, or Markdown. Slide structure, text frames, lists, and images survive the conversion intact — no rendering pass required.

Try It

Workbench (in-browser)

Drop an ODP file into the Workbench — parsing runs in WebAssembly with the same AILANG parser used by the CLI. No upload, no signup.

CLI

# Parse an ODP file
./bin/docparse deck.odp

# Convert ODP to Markdown
./bin/docparse deck.odp --convert outline.md

# Convert ODP to PPTX
./bin/docparse deck.odp --convert deck.pptx

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"deck.odp","outputFormat":"json","apiKey":"YOUR_API_KEY"}'
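
The same request can be built from Python with the standard library. This sketch mirrors the curl call field for field; build_parse_request and parse_remote are hypothetical helper names, and treating the response body as JSON is an assumption based on "outputFormat":"json".

```python
import json
import urllib.request

PARSE_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath, api_key, output_format="json"):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "filepath": filepath,
        "outputFormat": output_format,
        "apiKey": api_key,
    }
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_remote(filepath, api_key):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_parse_request(filepath, api_key)) as response:
        return json.load(response)
```

Calling parse_remote("deck.odp", "YOUR_API_KEY") performs the network request; build_parse_request alone can be used to inspect the payload without sending anything.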


Frequently Asked Questions

How do I parse an ODP file to JSON?

Drop it into the Workbench, send it to the API, or use the Python/JS/Go/R SDK. The parser reads OpenDocument Presentation XML directly — no LibreOffice subprocess.

Are slides extracted individually?

Yes. Every draw:page becomes its own SectionBlock with a slide name. Text frames, tables, lists, and images inside that slide come through as typed blocks within the section.

Does ODP parsing need LibreOffice installed?

No. content.xml is read directly from the .odp zip. No soffice subprocess, no headless rendering, no PPTX conversion.

What slide elements are extracted?

Text frames (draw:text-box), headings (text:h), paragraphs (text:p), tables (table:table), lists (text:list), and images (draw:image) — each anchored to the slide it lives on.

Does ODP parsing run in the browser?

Yes. The same AILANG parser used by the CLI compiles to WebAssembly. Files never leave the browser tab.