HTML Varies Wildly
Clean semantic HTML with proper <h1>, <table>, and <article> tags is easy to parse. Real-world HTML is not. Email clients generate deeply nested table layouts with inline styles. CMS exports mix semantic markup with presentation markup. Web scrapes contain navigation, ads, and boilerplate alongside content.
AILANG Parse strips the presentation layer and extracts semantic structure: headings, paragraphs, tables, and lists become typed blocks. The output uses the same Block ADT as DOCX and XLSX parsing, so your pipeline handles all formats identically.
Raw HTML vs Structured Output
<html>
<body style="margin:0;padding:0">
<table width="100%" cellpadding="0">
<tr><td align="center">
<table width="600">
<tr><td>
<h2 style="font-family:Arial;
color:#333;font-size:24px;
margin:0 0 16px">
Weekly Report
</h2>
<p style="font-family:Arial;
font-size:14px;color:#666;
line-height:1.5">
Here are this week's metrics:
</p>
<table border="1" cellpadding="8"
style="border-collapse:collapse;
width:100%">
<tr style="background:#f5f5f5">
<th>Metric</th>
<th>Value</th>
<th>Change</th>
</tr>
<tr>
<td>Revenue</td>
<td>$142K</td>
<td>+12%</td>
</tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
</body></html>
{
  "blocks": [
    {
      "type": "heading",
      "level": 2,
      "text": "Weekly Report"
    },
    {
      "type": "text",
      "text": "Here are this week's metrics:"
    },
    {
      "type": "table",
      "headers": [
        {"text": "Metric"},
        {"text": "Value"},
        {"text": "Change"}
      ],
      "rows": [
        [
          {"text": "Revenue"},
          {"text": "$142K"},
          {"text": "+12%"}
        ]
      ]
    }
  ]
}
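To make the shape of this output concrete, here is a minimal sketch of consuming it with Python's standard library only. It assumes just the three block types and field names visible in the sample above; the `summarize` helper is a hypothetical name, not part of any SDK.

```python
import json

# Sample output copied from the example above.
SAMPLE = """
{
  "blocks": [
    {"type": "heading", "level": 2, "text": "Weekly Report"},
    {"type": "text", "text": "Here are this week's metrics:"},
    {"type": "table",
     "headers": [{"text": "Metric"}, {"text": "Value"}, {"text": "Change"}],
     "rows": [[{"text": "Revenue"}, {"text": "$142K"}, {"text": "+12%"}]]}
  ]
}
"""


def summarize(doc):
    """Return one summary line per block, dispatching on the block type."""
    lines = []
    for block in doc["blocks"]:
        kind = block["type"]
        if kind == "heading":
            lines.append(f"H{block['level']}: {block['text']}")
        elif kind == "text":
            lines.append(f"P: {block['text']}")
        elif kind == "table":
            headers = [cell["text"] for cell in block["headers"]]
            lines.append(f"TABLE: {' | '.join(headers)} ({len(block['rows'])} row(s))")
        else:
            # Block types not shown in the sample are recorded generically.
            lines.append(f"{kind}: (unhandled)")
    return lines


summary = summarize(json.loads(SAMPLE))
print("\n".join(summary))
```

Because every block carries a type field, downstream code can branch on it without inspecting any markup.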
What Gets Extracted
Heading Hierarchy
HTML heading tags (<h1> through <h6>) are extracted with their level preserved. Headings created with CSS styling (large bold text in <div> or <span>) are normalized to heading blocks when the parser can identify them.
Tables with Merges
HTML tables are extracted with headers (from <thead> or the first row), rows, and colspan/rowspan resolved into gridSpan/vMerge metadata. Nested layout tables are collapsed.
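As an illustration of what resolving colspan/rowspan can involve, the sketch below expands spanned cells onto a rectangular grid. It is not AILANG Parse's implementation: the input shape, the `resolve_spans` name, and the exact layout of the `gridSpan`/`vMerge` fields are assumptions made for this example, loosely following the DOCX convention of "restart" and "continue" for vertical merges.

```python
def resolve_spans(rows):
    """Place cells on a grid, turning colspan/rowspan into merge metadata.

    `rows` is a list of rows; each cell is a dict with "text" and optional
    "colspan"/"rowspan" (hypothetical input shape). A None slot in the result
    is covered by a horizontal span (or missing in a ragged table).
    """
    slots = {}  # (row, col) -> cell dict or None
    for r, row in enumerate(rows):
        col = 0
        for cell in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, col) in slots:
                col += 1
            colspan = cell.get("colspan", 1)
            rowspan = cell.get("rowspan", 1)
            for dr in range(rowspan):
                for dc in range(colspan):
                    if dc > 0:
                        slots[(r + dr, col + dc)] = None
                        continue
                    if rowspan == 1:
                        v_merge = None
                    else:
                        v_merge = "restart" if dr == 0 else "continue"
                    slots[(r + dr, col)] = {
                        "text": cell["text"] if dr == 0 else "",
                        "gridSpan": colspan,
                        "vMerge": v_merge,
                    }
            col += colspan
    if not slots:
        return []
    n_rows = max(r for r, _ in slots) + 1
    n_cols = max(c for _, c in slots) + 1
    return [[slots.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Illustrative input: "Region" spans two rows, "Sales" spans two columns.
grid = resolve_spans([
    [{"text": "Region", "rowspan": 2}, {"text": "Sales", "colspan": 2}],
    [{"text": "Q1"}, {"text": "Q2"}],
])
for row in grid:
    print(row)
```

The point of the grid form is that every row ends up with the same number of slots, so consumers can index columns without re-deriving the spans.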
Lists
Ordered (<ol>) and unordered (<ul>) lists are extracted as text blocks with list styling preserved. Nested lists maintain their hierarchy.
Content Extraction
Navigation, headers, footers, ads, and boilerplate are stripped. The parser identifies main content areas and extracts meaningful text, tables, and structure.
Email HTML
Email HTML is its own challenge. Email clients require table-based layouts with inline styles for compatibility across Outlook, Gmail, Apple Mail, and dozens of other renderers. A simple newsletter can be 500+ lines of nested <table> tags wrapping a few paragraphs and a data table.
AILANG Parse distinguishes layout tables (used for positioning) from data tables (containing actual tabular data). Layout tables are collapsed; data tables are extracted with full structure. Combined with EML parsing, you get headers, metadata, and the HTML body all as structured blocks.
# Parse an HTML file directly
ailang run --entry main --caps IO,FS,Env \
docparse/main.ail newsletter.html
# Parse an EML with HTML body — both are extracted
ailang run --entry main --caps IO,FS,Env \
docparse/main.ail newsletter.eml
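The layout-versus-data distinction described above can be approximated with a simple heuristic. The sketch below uses Python's standard `html.parser` and treats a table as a data table only when it has <th> cells and contains no nested table. This rule is an assumption chosen for illustration (real data tables often lack <th>), and the `TableClassifier` and `classify_tables` names are hypothetical; it does not describe how AILANG Parse actually decides.

```python
from html.parser import HTMLParser


class TableClassifier(HTMLParser):
    """Label each <table> as "layout" or "data" using a crude heuristic."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tables, innermost last
        self.results = []  # labels in the order the tables close

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                # The enclosing table wraps another table.
                self.stack[-1]["has_nested"] = True
            self.stack.append({"has_nested": False, "has_th": False})
        elif tag == "th" and self.stack:
            self.stack[-1]["has_th"] = True

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            info = self.stack.pop()
            is_data = info["has_th"] and not info["has_nested"]
            self.results.append("data" if is_data else "layout")


def classify_tables(html):
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return parser.results


# Condensed version of the newsletter markup shown earlier.
EMAIL = """
<table width="100%"><tr><td>
  <table width="600"><tr><td>
    <h2>Weekly Report</h2>
    <table border="1">
      <tr><th>Metric</th><th>Value</th></tr>
      <tr><td>Revenue</td><td>$142K</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

labels = classify_tables(EMAIL)
print(labels)
```

Labels are reported innermost table first, because a table is classified when its closing tag is seen.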
Use Cases
Web Scraping Pipelines
Parse scraped web pages into structured blocks for indexing, analysis, or RAG. Content extraction strips navigation and boilerplate. Tables are structured, not flattened to text. Headings provide document hierarchy for chunking strategies.
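One way to use headings for chunking is to start a new chunk at every heading block. The sketch below assumes the block shape from the JSON sample earlier on this page; `chunk_by_heading` is a hypothetical helper, and the second section in the sample data is made up for illustration.

```python
def chunk_by_heading(blocks):
    """Group blocks into chunks, starting a new chunk at each heading block."""
    chunks = []
    current = {"heading": None, "level": 0, "blocks": []}
    for block in blocks:
        if block["type"] == "heading":
            # Close the previous chunk if it has a heading or any content.
            if current["heading"] is not None or current["blocks"]:
                chunks.append(current)
            current = {"heading": block["text"], "level": block["level"], "blocks": []}
        else:
            current["blocks"].append(block)
    if current["heading"] is not None or current["blocks"]:
        chunks.append(current)
    return chunks


# First section mirrors the sample output; "Next Steps" is illustrative only.
blocks = [
    {"type": "heading", "level": 2, "text": "Weekly Report"},
    {"type": "text", "text": "Here are this week's metrics:"},
    {"type": "table",
     "headers": [{"text": "Metric"}, {"text": "Value"}, {"text": "Change"}],
     "rows": [[{"text": "Revenue"}, {"text": "$142K"}, {"text": "+12%"}]]},
    {"type": "heading", "level": 2, "text": "Next Steps"},
    {"type": "text", "text": "Review the numbers."},
]

chunks = chunk_by_heading(blocks)
for chunk in chunks:
    print(chunk["heading"], len(chunk["blocks"]))
```

Keeping the table inside the chunk of its heading means a retrieval hit on "Weekly Report" returns the metrics along with the surrounding text.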
CMS Content Migration
Export content from WordPress, Drupal, or other CMS platforms as HTML, then parse into structured JSON for migration to a new system. Heading hierarchy, tables, and lists transfer cleanly. Inline styles and CMS-specific markup are stripped.
Email HTML Body Parsing
Extract content from HTML email bodies: newsletter data tables, receipts, shipping notifications. Combined with EML parsing, you get a complete structured view of the email — headers, plain text, HTML content, and attachments.
Documentation Conversion
Convert HTML documentation to Markdown, Quarto, or other formats. Heading hierarchy, code blocks, tables, and lists transfer with structure preserved. Ideal for docs-as-code migrations or multi-format publishing.
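The CLI's --convert option (shown under Try It) covers the common case. If you want custom rendering from the parsed blocks instead, a minimal sketch could look like the following. It assumes only the heading, text, and table blocks from the JSON sample earlier on this page; `to_markdown` and `cell_text` are hypothetical helpers, and merged cells, lists, and code blocks are not handled.

```python
def cell_text(cell):
    # Escape pipes so cell text cannot break the Markdown table layout.
    return cell["text"].replace("|", "\\|")


def to_markdown(blocks):
    """Render heading, text, and table blocks as Markdown."""
    parts = []
    for block in blocks:
        if block["type"] == "heading":
            parts.append("#" * block["level"] + " " + block["text"])
        elif block["type"] == "text":
            parts.append(block["text"])
        elif block["type"] == "table":
            headers = [cell_text(c) for c in block["headers"]]
            lines = [
                "| " + " | ".join(headers) + " |",
                "| " + " | ".join("---" for _ in headers) + " |",
            ]
            for row in block["rows"]:
                lines.append("| " + " | ".join(cell_text(c) for c in row) + " |")
            parts.append("\n".join(lines))
        # Other block types are skipped in this sketch.
    return "\n\n".join(parts) + "\n"


markdown = to_markdown([
    {"type": "heading", "level": 2, "text": "Weekly Report"},
    {"type": "text", "text": "Here are this week's metrics:"},
    {"type": "table",
     "headers": [{"text": "Metric"}, {"text": "Value"}, {"text": "Change"}],
     "rows": [[{"text": "Revenue"}, {"text": "$142K"}, {"text": "+12%"}]]},
])
print(markdown)
```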
AI Knowledge Ingestion
Feed structured blocks to LLMs instead of raw HTML. The model processes typed headings, paragraphs, and tables, not tag soup. Stripping wrappers such as <div class="wrapper-outer-inner"> removes markup tokens that carry no content, which improves token efficiency.
Report Standardization
Normalize HTML reports from different sources into a consistent Block ADT. Whether the input is a well-structured report or an email-style table layout, the output is the same typed blocks your pipeline expects.
Try It
CLI
# Parse an HTML file
ailang run --entry main --caps IO,FS,Env \
docparse/main.ail page.html
# Convert HTML to Markdown
./bin/docparse page.html --convert output.md
# Convert HTML to DOCX
./bin/docparse page.html --convert output.docx
API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
-H "Content-Type: application/json" \
-d '{"filepath":"sample_html","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'
Python SDK
from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("newsletter.html", output_format="json")

for block in result.blocks:
    if block.type == "heading":
        print(f"H{block.level}: {block.text}")
    if block.type == "table":
        print(f"Table: {len(block.rows)} rows, {len(block.headers)} columns")
Frequently Asked Questions
How do I parse HTML to structured JSON?
Send the HTML file to the AILANG Parse API or use the Python/JS/Go SDK. The parser extracts semantic structure into typed blocks.
Does HTML parsing handle messy or malformed HTML?
Yes. Unclosed tags, nested tables, inline styles, and real-world email HTML are all handled. The parser sanitizes input while preserving semantic structure.
Are HTML tables extracted with structure?
Yes. Headers, rows, and colspan/rowspan merges are all extracted. Layout tables are collapsed; data tables are preserved. See table extraction docs.
Can I parse email HTML bodies?
Yes. Use it with EML parsing for full email extraction, or parse HTML files directly. Layout tables from email clients are collapsed automatically.
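Is the output the same as Office format parsing?
Yes. HTML parsing produces the same Block ADT output as DOCX and XLSX parsing, so your pipeline handles all formats identically.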
How fast is HTML parsing?
Parsing is deterministic and takes under 100ms for complex email HTML. It needs no AI and no external dependencies.