EPUB Parsing API — Parse EPUB Ebooks to Structured JSON

Q: How do I parse an EPUB file to JSON?

Drop the .epub file into the AILANG Parse Workbench, send it to the API, or use the Python/JS/Go/R SDK. The parser walks META-INF/container.xml to find the OPF, reads the spine, and parses each XHTML chapter directly — no Calibre or PDF conversion.

Q: Are chapters parsed in reading order?

Yes. The OPF spine defines the canonical reading order. AILANG Parse follows the spine and emits each chapter as a SectionBlock in that order, so downstream code never has to guess at file ordering.

Q: Does EPUB parsing need Calibre or any external tool?

No. The parser is pure AILANG: it reads container.xml, walks the OPF, and parses each XHTML chapter with the same HTML parser used for standalone HTML files. No subprocesses, no PDF conversion.

Q: What metadata is extracted from EPUB?

Dublin Core metadata from the OPF: dc:title, dc:creator (author), dc:language, dc:identifier (often the ISBN), dc:date, and dc:publisher. Useful for indexing libraries by author or ISBN.

Q: Does EPUB parsing run in the browser?

Yes. EPUB parsing runs in WebAssembly via the same AILANG parser used by the CLI. Files never leave the browser tab — try it in the Workbench.

Books as Structured Data

An .epub file is a zip archive that hides a small filesystem inside it: META-INF/container.xml points to a .opf manifest, the OPF lists every resource and a spine defining the canonical reading order, and the actual content lives as XHTML files — one per chapter, sometimes more. Embedded images, fonts, and CSS sit alongside.
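
The container → OPF → spine chain described above can be followed with nothing but the Python standard library. The sketch below is an illustration of the file layout, not AILANG Parse's implementation; the helper names `spine_hrefs` and `demo_epub` are hypothetical, and the demo archive is a stripped-down stand-in (it omits the `mimetype` entry and chapter files a real EPUB carries). It assumes every spine `idref` exists in the manifest and that hrefs are not percent-encoded.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(epub: zipfile.ZipFile) -> list[str]:
    """Return chapter paths (relative to the archive root) in spine order."""
    # Step 1: container.xml names the OPF package document.
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    opf_path = rootfile.get("full-path")

    # Step 2: the manifest maps item ids to hrefs, relative to the OPF's folder.
    package = ET.fromstring(epub.read(opf_path))
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    # Step 3: the spine lists item ids in reading order.
    opf_dir = posixpath.dirname(opf_path)
    return [
        posixpath.join(opf_dir, manifest[ref.get("idref")])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]


def demo_epub() -> bytes:
    """Build a tiny EPUB-like zip whose filenames sort differently from the spine."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    opf = (
        '<package version="3.0" xmlns="http://www.idpf.org/2007/opf">'
        "<manifest>"
        '<item id="ch1" href="zz.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="ch2" href="aa.xhtml" media-type="application/xhtml+xml"/>'
        "</manifest>"
        '<spine><itemref idref="ch1"/><itemref idref="ch2"/></spine>'
        "</package>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/container.xml", container)
        archive.writestr("OEBPS/content.opf", opf)
    return buffer.getvalue()


with zipfile.ZipFile(io.BytesIO(demo_epub())) as epub:
    print(spine_hrefs(epub))
```

In the demo, zz.xhtml comes before aa.xhtml because the spine says so, which is why sorting by filename is not a safe substitute for reading the spine.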

Most ebook tooling treats EPUB as a black box: convert to PDF, render to text, lose chapter boundaries, lose metadata, lose image references. AILANG Parse walks the EPUB structure directly — container → OPF → spine → XHTML — and emits each chapter as a SectionBlock in spine order.

Chapters use the same Block ADT as the HTML parser: chapter XHTML routes through the same code path, so EPUB chapters and standalone HTML pages feed one downstream pipeline.

Raw EPUB Structure vs Structured Output

Raw OPF + chapter XHTML

<!-- content.opf -->
<package version="3.0">
 <metadata>
  <dc:title>A Tale of Two Cities</dc:title>
  <dc:creator>Charles Dickens</dc:creator>
  <dc:language>en</dc:language>
 </metadata>
 <manifest>
  <item id="ch1" href="ch1.xhtml"
        media-type="application/xhtml+xml"/>
  <item id="ch2" href="ch2.xhtml"
        media-type="application/xhtml+xml"/>
 </manifest>
 <spine>
  <itemref idref="ch1"/>
  <itemref idref="ch2"/>
 </spine>
</package>

<!-- ch1.xhtml -->
<h1>Book the First</h1>
<p>It was the best of times,
   it was the worst of times...</p>

Structured output

{
  "metadata": {
    "title": "A Tale of Two Cities",
    "author": "Charles Dickens",
    "language": "en"
  },
  "blocks": [
    {
      "type": "section",
      "kind": "chapter:ch1.xhtml",
      "blocks": [
        {
          "type": "heading",
          "level": 1,
          "text": "Book the First"
        },
        {
          "type": "text",
          "text": "It was the best of times, it was the worst of times..."
        }
      ]
    },
    {
      "type": "section",
      "kind": "chapter:ch2.xhtml",
      "blocks": [ ... ]
    }
  ]
}
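
Because section blocks nest other blocks, consumers of this output usually need a small recursive walk. A minimal sketch, assuming the JSON shape shown above; `iter_blocks` and `SAMPLE` are illustrative names, not part of any SDK, and the second chapter is left empty because its contents are elided in the example.

```python
from collections import Counter
from typing import Iterator

SAMPLE = {
    "metadata": {
        "title": "A Tale of Two Cities",
        "author": "Charles Dickens",
        "language": "en",
    },
    "blocks": [
        {
            "type": "section",
            "kind": "chapter:ch1.xhtml",
            "blocks": [
                {"type": "heading", "level": 1, "text": "Book the First"},
                {
                    "type": "text",
                    "text": "It was the best of times, it was the worst of times...",
                },
            ],
        },
        # Contents of the second chapter are elided in the example, so none are invented here.
        {"type": "section", "kind": "chapter:ch2.xhtml", "blocks": []},
    ],
}


def iter_blocks(blocks: list[dict]) -> Iterator[dict]:
    """Yield every block depth-first, descending into nested "blocks" lists."""
    for block in blocks:
        yield block
        yield from iter_blocks(block.get("blocks", []))


# Count how many blocks of each type the document contains.
counts = Counter(block["type"] for block in iter_blocks(SAMPLE["blocks"]))
print(dict(counts))
```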

What Gets Extracted

Spine-Ordered Chapters

The OPF spine defines canonical reading order. Each spine item becomes a SectionBlock with kind: "chapter:<href>", in the exact order an ebook reader would render them.

OPF Metadata

Dublin Core fields from the OPF: dc:title, dc:creator, dc:language, dc:identifier (often the ISBN), dc:date, dc:publisher. Indexable, searchable, exportable.
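
For readers who want to see where these fields live, here is a standard-library sketch that pulls them straight from an OPF document. It illustrates the file format rather than AILANG Parse's own code; `dublin_core` is a hypothetical helper, and the sample OPF adds the namespace declarations that the abbreviated example earlier on this page leaves out. Values are returned as lists because some fields, such as dc:creator, can appear more than once.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
FIELDS = ("title", "creator", "language", "identifier", "date", "publisher")


def dublin_core(opf_xml: str) -> dict[str, list[str]]:
    """Collect the listed dc:* fields from the OPF <metadata> element."""
    metadata = ET.fromstring(opf_xml).find("opf:metadata", NS)
    found: dict[str, list[str]] = {}
    if metadata is None:
        return found
    for field in FIELDS:
        values = [
            element.text.strip()
            for element in metadata.findall(f"dc:{field}", NS)
            if element.text
        ]
        if values:
            found[field] = values
    return found


OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>A Tale of Two Cities</dc:title>
    <dc:creator>Charles Dickens</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
</package>"""

print(dublin_core(OPF))
```

Fields absent from the OPF are simply missing from the result, so callers can tell "no publisher recorded" apart from an empty string.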

Headings, Paragraphs, Lists

Each chapter XHTML is parsed by the same HTML parser used for standalone HTML files. Headings (h1–h6), paragraphs, ordered/unordered lists, blockquotes, and inline emphasis come through as typed blocks.

Tables & Code Blocks

Reference works, technical books, and cookbooks routinely embed tables and <pre><code> blocks. Both round-trip cleanly through the Block ADT — no flattening.

Embedded Images

Image references (<img src="...">) come through as ImageBlock with their relative path inside the EPUB. Useful for cover extraction or figure captioning pipelines.

No Calibre, No PDF Pipeline

The parser is pure AILANG. No calibre binary, no ebook-convert subprocess, no PDF intermediate. Same code runs in the CLI and in WebAssembly in the browser.

Spine & Reading Order

EPUB chapter files do not have to be ordered by filename — the OPF spine is the only authoritative source of reading order. AILANG Parse follows the spine, so chapters always come out in the order the author intended, even if the underlying files are named x4j2-zb.xhtml.

Each chapter is its own SectionBlock, which means downstream code can chunk by chapter, embed by chapter, or render a table of contents from headings without flattening the book first.

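from ailang_parse import DocParse

# Parse the EPUB; replace "..." with your API key.
result = DocParse(api_key="...").parse_file("book.epub")

print(result.metadata.title, "by", result.metadata.author)

# Top-level blocks arrive in spine order; each chapter is a section block.
for chapter in result.blocks:
    if chapter.type != "section":
        continue
    print("---", chapter.kind, "---")
    # Print only the headings to get a quick per-chapter outline.
    for block in chapter.blocks:
        if block.type == "heading":
            print(f"H{block.level}", block.text)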
Use Cases

Project Gutenberg & Public Domain

Bulk-process tens of thousands of EPUBs from Project Gutenberg, Standard Ebooks, or the Internet Archive. One parser, one Block ADT, no per-source heuristics.

Ebook RAG & Semantic Search

Each chapter becomes a coherent retrievable chunk with explicit boundaries — embed by chapter or by heading section, not by an arbitrary 512-token window that splits a sentence in two.
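
One way to turn parsed output into chapter-level chunks for an embedding pipeline is sketched below. It works on the JSON shape shown in the structured output example; the `Chunk` dataclass and both helper functions are hypothetical, and the sketch assumes that readable text lives in each block's "text" field, which may not hold for block types such as tables or lists.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chapter: str  # the section's "kind", e.g. "chapter:ch1.xhtml"
    heading: str  # first heading in the chapter, or "" if there is none
    text: str     # all text in the chapter, joined with blank lines


def text_of(blocks: list[dict]) -> list[str]:
    """Collect "text" fields depth-first from blocks and their nested blocks."""
    parts: list[str] = []
    for block in blocks:
        if block.get("text"):
            parts.append(block["text"])
        parts.extend(text_of(block.get("blocks", [])))
    return parts


def chapter_chunks(doc: dict) -> list[Chunk]:
    """Produce one chunk per top-level section block, in document order."""
    chunks = []
    for section in doc["blocks"]:
        if section.get("type") != "section":
            continue
        inner = section.get("blocks", [])
        heading = next(
            (b["text"] for b in inner if b.get("type") == "heading"), ""
        )
        chunks.append(
            Chunk(section.get("kind", ""), heading, "\n\n".join(text_of(inner)))
        )
    return chunks


DOC = {
    "blocks": [
        {
            "type": "section",
            "kind": "chapter:ch1.xhtml",
            "blocks": [
                {"type": "heading", "level": 1, "text": "Book the First"},
                {"type": "text", "text": "It was the best of times..."},
            ],
        }
    ]
}

chunks = chapter_chunks(DOC)
for chunk in chunks:
    print(chunk.chapter, "|", chunk.heading, "|", len(chunk.text), "chars")
```

Keeping the chapter identifier and heading on each chunk lets a retrieval layer cite where a passage came from. Long chapters can be split further at heading boundaries using the same walk.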

Accessibility Tooling

Generate audio versions, large-print exports, or simplified-language renderings. Structural blocks survive the conversion, so chapter boundaries and headings remain intact in the output.

Format Migration

Convert EPUB to Markdown, HTML, or DOCX while preserving chapter structure, headings, lists, and image references. See HTML parsing for the inverse direction.
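
The CLI's --convert flag covers the common cases. For custom output, the block tree maps onto Markdown fairly directly, as this sketch suggests. It handles only the section, heading, and text blocks shown in the structured output example; `to_markdown` is a hypothetical helper, and other block types are skipped here because their JSON fields are not documented on this page.

```python
def to_markdown(blocks: list[dict]) -> str:
    """Render section, heading, and text blocks as Markdown paragraphs."""
    lines: list[str] = []
    for block in blocks:
        kind = block.get("type")
        if kind == "heading":
            # Heading level N becomes N leading '#' characters.
            lines.append("#" * block["level"] + " " + block["text"])
        elif kind == "text":
            lines.append(block["text"])
        elif kind == "section":
            nested = to_markdown(block.get("blocks", []))
            if nested:
                lines.append(nested)
        # Any other block type is ignored in this sketch.
    return "\n\n".join(lines)


CHAPTER = [
    {
        "type": "section",
        "kind": "chapter:ch1.xhtml",
        "blocks": [
            {"type": "heading", "level": 1, "text": "Book the First"},
            {"type": "text", "text": "It was the best of times..."},
        ],
    }
]

print(to_markdown(CHAPTER))
```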

Browser-Based Library Tools

Drop an EPUB into the Workbench — the parser runs in WebAssembly, the file never leaves the browser. Useful for personal library curation without uploading purchased books.

Citation & Quote Extraction

Walk the chapter tree to extract every quotation, every blockquote, or every section heading for citation databases, study guides, or literary analysis pipelines.

Try It

Workbench (in-browser)

Drop an EPUB file into the Workbench — parsing runs in WebAssembly with the same AILANG parser used by the CLI. No upload, no signup.

CLI

# Parse an EPUB file
./bin/docparse book.epub

# Convert EPUB to Markdown
./bin/docparse book.epub --convert book.md

# Convert EPUB to HTML
./bin/docparse book.epub --convert book.html

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"book.epub","outputFormat":"json","apiKey":"YOUR_API_KEY"}'
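
The same request can be assembled from Python with the standard library. This sketch mirrors the curl command's endpoint, header, and JSON fields exactly; `build_parse_request` is a hypothetical helper, and the assumption that the response body is JSON follows from the "outputFormat":"json" field rather than from any documented response schema. The network call is left commented out so the snippet runs offline.

```python
import json
import urllib.request

API_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath: str, api_key: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example, without sending it."""
    payload = {"filepath": filepath, "outputFormat": "json", "apiKey": api_key}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_parse_request("book.epub", "YOUR_API_KEY")
print(request.get_method(), request.full_url)

# To send it (assumes the response body is JSON):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```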

