HTML Parsing API — Parse HTML to Structured JSON

Q: How do I parse HTML to structured JSON?

Send the HTML file or content to the AILANG Parse API or use the Python/JS/Go SDK. The parser extracts semantic structure — headings, paragraphs, tables, lists — into typed blocks.

Q: Does HTML parsing handle messy or malformed HTML?

Yes. The parser handles unclosed tags, nested tables, inline styles, and real-world email HTML. It sanitizes the input while preserving semantic structure.

Q: Are HTML tables extracted with structure?

Yes. HTML tables are extracted with headers (from thead or first row), rows, colspan/rowspan merges, and cell content. Nested layout tables are collapsed into the parent structure, while data tables keep their full structure.

Q: Can I parse email HTML bodies?

Yes. Email HTML (from multipart/alternative emails) is parsed the same way as any other HTML. Combine it with EML parsing for full email extraction: the HTML body becomes structured blocks alongside headers and attachments.

Q: Is the output the same as Office format parsing?

Yes. HTML parsing produces the same Block ADT as DOCX, XLSX, and PPTX parsing. Headings, text, tables, and lists are all typed blocks. Your downstream code handles all formats identically.

Q: How fast is HTML parsing?

HTML parsing is deterministic and runs in milliseconds. No AI needed, no external dependencies. Even complex email HTML with deeply nested tables parses in under 100ms.

HTML Varies Wildly

Clean semantic HTML with proper <h1>, <table>, and <article> tags is easy to parse. Real-world HTML is not. Email clients generate deeply nested table layouts with inline styles. CMS exports mix semantic markup with presentation markup. Web scrapes contain navigation, ads, and boilerplate alongside content.

AILANG Parse strips the presentation layer and extracts semantic structure: headings, paragraphs, tables, and lists become typed blocks. The output uses the same Block ADT as DOCX and XLSX parsing, so your pipeline handles all formats identically.

HTML parsing is fully deterministic — no AI calls, no API keys needed. Even complex email HTML with nested tables and inline styles parses in under 100ms.

New in v0.13.0: tolerant HTML5 parsing. Boolean attributes (crossorigin, disabled, defer, open, …), overlapping or unclosed tags (<a><p>…</a>), inline <script> blocks containing JSX or template literals, and IE conditional comments are all handled deterministically — no AI fallback required. See the CHANGELOG.

Raw HTML vs Structured Output

Raw HTML (email newsletter)

<html>
<body style="margin:0;padding:0">
<table width="100%" cellpadding="0">
  <tr><td align="center">
    <table width="600">
      <tr><td>
        <h2 style="font-family:Arial;
          color:#333;font-size:24px;
          margin:0 0 16px">
          Weekly Report
        </h2>
        <p style="font-family:Arial;
          font-size:14px;color:#666;
          line-height:1.5">
          Here are this week's metrics:
        </p>
        <table border="1" cellpadding="8"
          style="border-collapse:collapse;
          width:100%">
          <tr style="background:#f5f5f5">
            <th>Metric</th>
            <th>Value</th>
            <th>Change</th>
          </tr>
          <tr>
            <td>Revenue</td>
            <td>$142K</td>
            <td>+12%</td>
          </tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
</body></html>

Structured output

{
  "blocks": [
    {
      "type": "heading",
      "level": 2,
      "text": "Weekly Report"
    },
    {
      "type": "text",
      "text": "Here are this week's metrics:"
    },
    {
      "type": "table",
      "headers": [
        {"text": "Metric"},
        {"text": "Value"},
        {"text": "Change"}
      ],
      "rows": [
        [
          {"text": "Revenue"},
          {"text": "$142K"},
          {"text": "+12%"}
        ]
      ]
    }
  ]
}
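One way to consume this output in Python is to map each JSON block onto a small typed class. The sketch below relies only on the field names visible in the sample above (type, level, text, headers, rows). The class names and the load_blocks helper are hypothetical illustrations, not part of any AILANG Parse SDK, and block types not shown in the sample are simply skipped.

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Heading:
    level: int
    text: str


@dataclass
class Text:
    text: str


@dataclass
class Table:
    headers: List[str]
    rows: List[List[str]]


Block = Union[Heading, Text, Table]


def load_blocks(payload: str) -> List[Block]:
    """Map each JSON block onto a typed object; unknown types are skipped."""
    blocks: List[Block] = []
    for raw in json.loads(payload)["blocks"]:
        kind = raw["type"]
        if kind == "heading":
            blocks.append(Heading(raw["level"], raw["text"]))
        elif kind == "text":
            blocks.append(Text(raw["text"]))
        elif kind == "table":
            # Each cell is an object with a "text" field in the sample output.
            headers = [cell["text"] for cell in raw["headers"]]
            rows = [[cell["text"] for cell in row] for row in raw["rows"]]
            blocks.append(Table(headers, rows))
    return blocks


# Same document as the structured output above, condensed.
SAMPLE = """
{"blocks": [
  {"type": "heading", "level": 2, "text": "Weekly Report"},
  {"type": "text", "text": "Here are this week's metrics:"},
  {"type": "table",
   "headers": [{"text": "Metric"}, {"text": "Value"}, {"text": "Change"}],
   "rows": [[{"text": "Revenue"}, {"text": "$142K"}, {"text": "+12%"}]]}
]}
"""

for block in load_blocks(SAMPLE):
    print(block)
```

Downstream code can then branch on the class of each block with isinstance instead of comparing type strings.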

What Gets Extracted

Heading Hierarchy

HTML heading tags (<h1> through <h6>) are extracted with their level preserved. Headings created only with CSS styling (large bold text in a <div> or <span>) are normalized to heading blocks when the parser can identify them.

Tables with Merges

HTML tables are extracted with headers (from <thead> or first row), rows, and colspan/rowspan resolved into gridSpan/vMerge metadata. Nested layout tables are collapsed.
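To make the idea of resolving colspan/rowspan concrete, here is a minimal sketch that expands spanned cells into a rectangular grid. It illustrates the general technique only: the (text, colspan, rowspan) input tuples and the expand_spans helper are assumptions made for this example, and the actual shape of the gridSpan/vMerge metadata is not shown in this document.

```python
def expand_spans(rows):
    """Expand cells given as (text, colspan, rowspan) into a rectangular grid.

    The text of a spanned cell lands in its top-left position; the other
    positions it covers are filled with None.
    """
    grid = {}  # (row, col) -> text or None
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text if (dr, dc) == (0, 0) else None
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# "Region" spans two rows, "Sales" spans two columns.
SPANNED = [
    [("Region", 1, 2), ("Sales", 2, 1)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("EMEA", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]

for line in expand_spans(SPANNED):
    print(line)
```

Once every cell has a fixed grid position, merge metadata can be attached per cell without ambiguity about which column a value belongs to.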

Lists

Ordered (<ol>) and unordered (<ul>) lists are extracted as text blocks with list styling preserved. Nested lists maintain their hierarchy.

Content Extraction

Navigation, page headers and footers, ads, and boilerplate are stripped. The parser identifies main content areas and extracts meaningful text, tables, and structure.

Email HTML

Email HTML is its own challenge. Email clients require table-based layouts with inline styles for compatibility across Outlook, Gmail, Apple Mail, and dozens of other renderers. A simple newsletter can be 500+ lines of nested <table> tags wrapping a few paragraphs and a data table.

AILANG Parse distinguishes layout tables (used for positioning) from data tables (containing actual tabular data). Layout tables are collapsed; data tables are extracted with full structure. Combined with EML parsing, you get headers, metadata, and the HTML body all as structured blocks.

# Parse an HTML file directly
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail newsletter.html

# Parse an EML with HTML body — both are extracted
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail newsletter.eml
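The layout-versus-data distinction described above can be approximated with a simple heuristic. The sketch below is not AILANG Parse's actual logic, which this page does not describe; it assumes that a table wrapping another table and owning no <th> cell is a layout table, and that everything else is a data table. It uses only Python's standard html.parser module, and the TableClassifier and classify_tables names are hypothetical.

```python
from html.parser import HTMLParser


class TableClassifier(HTMLParser):
    """Label each <table> as 'layout' or 'data' with a simple heuristic."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tables, innermost last
        self.results = []  # one label per table, in opening order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1]["has_nested"] = True
            info = {"has_th": False, "has_nested": False,
                    "index": len(self.results)}
            self.results.append(None)  # stays None if the table is never closed
            self.stack.append(info)
        elif tag == "th" and self.stack:
            # A header cell belongs to the innermost open table.
            self.stack[-1]["has_th"] = True

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            info = self.stack.pop()
            is_layout = info["has_nested"] and not info["has_th"]
            self.results[info["index"]] = "layout" if is_layout else "data"


def classify_tables(html):
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return parser.results


NEWSLETTER = """
<table width="100%"><tr><td>
  <table width="600"><tr><td>
    <h2>Weekly Report</h2>
    <table border="1">
      <tr><th>Metric</th><th>Value</th></tr>
      <tr><td>Revenue</td><td>$142K</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

print(classify_tables(NEWSLETTER))
```

Real email HTML often needs more signals than this (cell counts, width attributes, repeated row shapes), so treat the heuristic as a starting point.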

Use Cases

Web Scraping Pipelines

Parse scraped web pages into structured blocks for indexing, analysis, or RAG. Content extraction strips navigation and boilerplate. Tables are structured, not flattened to text. Headings provide document hierarchy for chunking strategies.
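As an illustration of heading-based chunking, the sketch below groups parsed blocks into one chunk per heading. It assumes blocks are plain dictionaries shaped like the structured output shown earlier; the chunk_by_heading helper and the chunk fields (title, level, blocks) are hypothetical names chosen for this example.

```python
def chunk_by_heading(blocks):
    """Group blocks into chunks, starting a new chunk at every heading."""
    chunks = []
    current = None
    for block in blocks:
        if block["type"] == "heading":
            current = {"title": block["text"], "level": block["level"],
                       "blocks": []}
            chunks.append(current)
        else:
            if current is None:
                # Content before the first heading goes into an untitled chunk.
                current = {"title": None, "level": 0, "blocks": []}
                chunks.append(current)
            current["blocks"].append(block)
    return chunks


PAGE = [
    {"type": "text", "text": "Intro paragraph."},
    {"type": "heading", "level": 2, "text": "Pricing"},
    {"type": "text", "text": "Plans start at the free tier."},
    {"type": "table", "headers": [{"text": "Plan"}], "rows": [[{"text": "Free"}]]},
    {"type": "heading", "level": 2, "text": "Support"},
    {"type": "text", "text": "Email us any time."},
]

for chunk in chunk_by_heading(PAGE):
    print(chunk["title"], len(chunk["blocks"]))
```

Keeping the table inside the chunk of its nearest preceding heading preserves the context a retriever needs to interpret the rows.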

CMS Content Migration

Export content from WordPress, Drupal, or other CMS platforms as HTML, then parse into structured JSON for migration to a new system. Heading hierarchy, tables, and lists transfer cleanly. Inline styles and CMS-specific markup are stripped.

Email HTML Body Parsing

Extract content from HTML email bodies: newsletter data tables, receipts, shipping notifications. Combined with EML parsing, you get a complete structured view of the email — headers, plain text, HTML content, and attachments.

Documentation Conversion

Convert HTML documentation to Markdown, Quarto, or other formats. Heading hierarchy, code blocks, tables, and lists transfer with structure preserved. Ideal for docs-as-code migrations or multi-format publishing.

AI Knowledge Ingestion

Feed structured blocks to LLMs instead of raw HTML. The model processes typed headings, paragraphs, and tables rather than tag soup. Stripping wrappers such as <div class="wrapper-outer-inner"> and inline styles means fewer tokens are spent on markup that carries no meaning.

Report Standardization

Normalize HTML reports from different sources into a consistent Block ADT. Whether the input is a well-structured report or an email-style table layout, the output is the same typed blocks your pipeline expects.

Try It

CLI

# Parse an HTML file
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail page.html

# Convert HTML to Markdown
./bin/docparse page.html --convert output.md

# Convert HTML to DOCX
./bin/docparse page.html --convert output.docx

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_html","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'

Python SDK

from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("newsletter.html", output_format="json")

# Walk the typed blocks and report each heading and table
for block in result.blocks:
    if block.type == "heading":
        print(f"H{block.level}: {block.text}")
    elif block.type == "table":
        print(f"Table: {len(block.rows)} rows, {len(block.headers)} columns")


Frequently Asked Questions

How do I parse HTML to structured JSON?

Send the HTML file to the AILANG Parse API or use the Python/JS/Go SDK. The parser extracts semantic structure into typed blocks.

Does HTML parsing handle messy or malformed HTML?

Yes. Unclosed tags, nested tables, inline styles, and real-world email HTML are all handled. The parser sanitizes input while preserving semantic structure.

Are HTML tables extracted with structure?

Yes. Headers, rows, and colspan/rowspan merges are all extracted. Layout tables are collapsed; data tables are preserved. See table extraction docs.

Can I parse email HTML bodies?

Yes. Combine it with EML parsing for full email extraction, or parse HTML files directly. Layout tables from email clients are collapsed automatically.

Is the output the same as Office format parsing?

Yes. Same Block ADT: headings, text, tables, lists. Your downstream code handles DOCX, XLSX, PPTX, and HTML identically.

How fast is HTML parsing?

Deterministic, under 100ms for complex email HTML. No AI needed, no external dependencies.