Books as Structured Data
An .epub file is a zip archive with a small filesystem inside it. META-INF/container.xml points to the .opf package document; the OPF lists every resource in its manifest and defines the canonical reading order in its spine; and the actual content lives as XHTML files, usually one per chapter, sometimes several. Embedded images, fonts, and CSS sit alongside.
Most ebook tooling treats EPUB as a black box: convert to PDF, render to text, lose chapter boundaries, lose metadata, lose image references. AILANG Parse walks the EPUB structure directly — container → OPF → spine → XHTML — and emits each chapter as a SectionBlock in spine order.
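The container → OPF → spine walk can be sketched with the Python standard library alone. This is a minimal illustration of the EPUB layout, not AILANG Parse's implementation; the function name `spine_paths` is hypothetical, and the sketch assumes manifest hrefs are not percent-encoded.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the EPUB container and package formats.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_paths(epub):
    """Return chapter paths inside the archive, in spine (reading) order.

    `epub` is a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        # 1. container.xml names the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]

        # 2. The OPF manifest maps item ids to hrefs.
        package = ET.fromstring(zf.read(opf_path))
        hrefs = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", NS)
        }

        # 3. The spine lists item ids in reading order; hrefs are
        #    relative to the directory that holds the OPF file.
        base = posixpath.dirname(opf_path)
        return [
            posixpath.normpath(posixpath.join(base, hrefs[ref.attrib["idref"]]))
            for ref in package.findall("opf:spine/opf:itemref", NS)
        ]
```

Calling `spine_paths("book.epub")` returns archive paths such as `OEBPS/ch1.xhtml`, ordered by the spine rather than by filename.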
Raw EPUB Structure vs Structured Output
<!-- content.opf -->
<package version="3.0">
  <metadata>
    <dc:title>A Tale of Two Cities</dc:title>
    <dc:creator>Charles Dickens</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="ch1" href="ch1.xhtml"
          media-type="application/xhtml+xml"/>
    <item id="ch2" href="ch2.xhtml"
          media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>

<!-- ch1.xhtml -->
<h1>Book the First</h1>
<p>It was the best of times,
it was the worst of times...</p>
{
  "metadata": {
    "title": "A Tale of Two Cities",
    "author": "Charles Dickens",
    "language": "en"
  },
  "blocks": [
    {
      "type": "section",
      "kind": "chapter:ch1.xhtml",
      "blocks": [
        {
          "type": "heading",
          "level": 1,
          "text": "Book the First"
        },
        {
          "type": "text",
          "text": "It was the best of times, it was the worst of times..."
        }
      ]
    },
    {
      "type": "section",
      "kind": "chapter:ch2.xhtml",
      "blocks": [ ... ]
    }
  ]
}
What Gets Extracted
Spine-Ordered Chapters
The OPF spine defines canonical reading order. Each spine item becomes a SectionBlock with kind: "chapter:<href>", in the exact order an ebook reader would render them.
OPF Metadata
Dublin Core fields from the OPF: dc:title, dc:creator, dc:language, dc:identifier (often the ISBN), dc:date, dc:publisher. Indexable, searchable, exportable.
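For readers who want to see where these fields live, here is a rough standard-library sketch that reads the same Dublin Core elements straight from an OPF document. The function name `dublin_core` is hypothetical and this is not the AILANG Parse code path; values are returned as lists because an OPF may repeat a field such as dc:creator.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace used by OPF metadata.
DC = "http://purl.org/dc/elements/1.1/"
FIELDS = ("title", "creator", "language", "identifier", "date", "publisher")


def dublin_core(opf_xml):
    """Map each Dublin Core field present in an OPF document to its text values."""
    root = ET.fromstring(opf_xml)
    found = {}
    for name in FIELDS:
        # iter() with a "{namespace}local" tag matches elements at any depth.
        values = [el.text.strip() for el in root.iter(f"{{{DC}}}{name}") if el.text]
        if values:
            found[name] = values
    return found
```

Fields missing from the OPF are simply absent from the result, so callers can use `found.get("publisher", [])` without special cases.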
Headings, Paragraphs, Lists
Each chapter XHTML is parsed by the same HTML parser used for standalone HTML files. Headings (h1–h6), paragraphs, ordered/unordered lists, blockquotes, and inline emphasis come through as typed blocks.
Tables & Code Blocks
Reference works, technical books, and cookbooks routinely embed tables and <pre><code> blocks. Both round-trip cleanly through the Block ADT — no flattening.
Embedded Images
Image references (<img src="...">) come through as ImageBlock with their relative path inside the EPUB. Useful for cover extraction or figure captioning pipelines.
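To pull the image bytes out of the archive, the relative path has to be resolved against the chapter that references it. The sketch below assumes the path is reported as written in the src attribute, relative to the chapter file; that is an assumption about the output, and `resolve_in_archive` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import unquote, urldefrag


def resolve_in_archive(chapter_path, src):
    """Turn an <img src> value into a path inside the EPUB zip.

    `chapter_path` is the archive path of the XHTML file that contains the
    reference, e.g. "OEBPS/text/ch1.xhtml".
    """
    url, _fragment = urldefrag(src)         # drop any "#fragment"
    relative = unquote(url)                 # "my%20cover.png" -> "my cover.png"
    base = posixpath.dirname(chapter_path)  # src resolves against the chapter's directory
    return posixpath.normpath(posixpath.join(base, relative))
```

The resolved path can be passed to `zipfile.ZipFile("book.epub").read(...)` to get the raw image bytes for a cover or captioning pipeline.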
No Calibre, No PDF Pipeline
The parser is pure AILANG. No calibre binary, no ebook-convert subprocess, no PDF intermediate. Same code runs in the CLI and in WebAssembly in the browser.
Spine & Reading Order
EPUB chapter files do not have to be ordered by filename — the OPF spine is the only authoritative source of reading order. AILANG Parse follows the spine, so chapters always come out in the order the author intended, even if the underlying files are named x4j2-zb.xhtml.
Each chapter is its own SectionBlock, which means downstream code can chunk by chapter, embed by chapter, or render a table of contents from headings without flattening the book first.
from ailang_parse import DocParse

result = DocParse(api_key="...").parse_file("book.epub")
print(result.metadata.title, "by", result.metadata.author)

# Top-level blocks are chapter sections in spine order.
for chapter in result.blocks:
    if chapter.type != "section":
        continue
    print("---", chapter.kind, "---")
    # Print each heading in the chapter with its level, e.g. "H1 Book the First".
    for block in chapter.blocks:
        if block.type == "heading":
            print("H" + str(block.level), block.text)
Use Cases
Project Gutenberg & Public Domain
Bulk-process tens of thousands of EPUBs from Project Gutenberg, Standard Ebooks, or the Internet Archive. One parser, one Block ADT, no per-source heuristics.
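A bulk run could be driven by a small wrapper around the CLI. The sketch below uses only the `--convert` flag shown in the CLI examples; the helper names are hypothetical, and it assumes (not confirmed by this page) that the CLI exits with a non-zero status when a file fails to parse.

```python
import subprocess
from pathlib import Path


def convert_command(epub_path, out_dir, binary="./bin/docparse"):
    """Build the argv for converting one EPUB to Markdown."""
    epub_path = Path(epub_path)
    target = Path(out_dir) / (epub_path.stem + ".md")
    return [binary, str(epub_path), "--convert", str(target)]


def convert_all(src_dir, out_dir):
    """Convert every .epub under src_dir; return the files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for epub_path in sorted(Path(src_dir).rglob("*.epub")):
        # Assumption: a non-zero exit status signals a failed conversion.
        result = subprocess.run(convert_command(epub_path, out_dir))
        if result.returncode != 0:
            failed.append(epub_path)
    return failed
```

Output names are derived from the file stem, so two sources named book.epub in different folders would overwrite each other; a real pipeline would need a collision-free naming scheme.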
Ebook RAG & Semantic Search
Each chapter becomes a coherent retrievable chunk with explicit boundaries — embed by chapter or by heading section, not by an arbitrary 512-token window that splits a sentence in two.
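Chunking by heading section can be sketched over the JSON output shape shown earlier (sections holding "heading" and "text" blocks). The function name `heading_chunks` is hypothetical, and this sketch ignores other block types such as lists and tables, which a real pipeline would also want to keep.

```python
def heading_chunks(section):
    """Split one chapter section into chunks that each start at a heading."""
    chunks = []
    current = {"chapter": section["kind"], "heading": None, "text": []}
    for block in section.get("blocks", []):
        if block.get("type") == "heading":
            # A new heading closes the chunk in progress, if it holds anything.
            if current["heading"] is not None or current["text"]:
                chunks.append(current)
            current = {"chapter": section["kind"], "heading": block["text"], "text": []}
        elif block.get("type") == "text":
            current["text"].append(block["text"])
    # Flush the final chunk.
    if current["heading"] is not None or current["text"]:
        chunks.append(current)
    return chunks
```

Each chunk carries its chapter kind and heading, so retrieved passages can be cited back to a specific place in the book.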
Accessibility Tooling
Generate audio versions, large-print exports, or simplified-language renderings. Structural blocks survive the conversion, so chapter boundaries and headings remain intact in the output.
Format Migration
Convert EPUB to Markdown, HTML, or DOCX while preserving chapter structure, headings, lists, and image references. See HTML parsing for the inverse direction.
Browser-Based Library Tools
Drop an EPUB into the Workbench — the parser runs in WebAssembly, the file never leaves the browser. Useful for personal library curation without uploading purchased books.
Citation & Quote Extraction
Walk the chapter tree to extract every quotation, every blockquote, or every section heading for citation databases, study guides, or literary analysis pipelines.
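Walking the tree amounts to a short recursive generator over the JSON output. This sketch assumes container blocks keep their children under "blocks", as sections do in the sample output; `iter_blocks` is a hypothetical name, and the type string used for blockquotes is not shown on this page, so check it against real output before filtering on it.

```python
def iter_blocks(blocks, wanted_type):
    """Yield every block of `wanted_type`, descending into nested "blocks" lists."""
    for block in blocks:
        if block.get("type") == wanted_type:
            yield block
        # Sections (and any other container) carry their children under "blocks".
        yield from iter_blocks(block.get("blocks", []), wanted_type)
```

For example, `[b["text"] for b in iter_blocks(document["blocks"], "heading")]` collects every heading in reading order.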
Try It
Workbench (in-browser)
Drop an EPUB file into the Workbench — parsing runs in WebAssembly with the same AILANG parser used by the CLI. No upload, no signup.
CLI
# Parse an EPUB file
./bin/docparse book.epub
# Convert EPUB to Markdown
./bin/docparse book.epub --convert book.md
# Convert EPUB to HTML
./bin/docparse book.epub --convert book.html
API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"book.epub","outputFormat":"json","apiKey":"YOUR_API_KEY"}'
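The same request can be built from Python with the standard library. The endpoint and JSON fields below are copied from the curl example; the function names are hypothetical, and the sketch assumes the response body is JSON when outputFormat is "json" without assuming anything about its shape.

```python
import json
import urllib.request

API_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath, api_key, output_format="json"):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"filepath": filepath, "outputFormat": output_format, "apiKey": api_key}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the body as JSON (assumed, see above)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made: `send(build_parse_request("book.epub", "YOUR_API_KEY"))`.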
Frequently Asked Questions
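How do I parse an EPUB file to JSON?
Send the file to the parse API with "outputFormat":"json", as in the curl example under Try It. The CLI and the in-browser Workbench run the same parser.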
Are chapters parsed in reading order?
Yes. The OPF spine defines canonical reading order. AILANG Parse follows the spine and emits each chapter as a SectionBlock in that order, even when the underlying filenames are arbitrary.
Does EPUB parsing need Calibre or any external tool?
No. The parser is pure AILANG. Container, OPF, and chapter XHTML are all read directly from the .epub zip archive.
What metadata is extracted from EPUB?
Dublin Core fields from the OPF: dc:title, dc:creator, dc:language, dc:identifier (often the ISBN), dc:date, dc:publisher.
Does EPUB parsing run in the browser?
Yes. The same AILANG parser used by the CLI compiles to WebAssembly. Files never leave the browser tab.