Email Parsing API — Parse EML to Structured JSON

Raw Email Is Noise

A typical email from Gmail or Outlook is 2,000+ lines of wire-format RFC 5322: MIME boundaries, base64-encoded attachments, headers folded across multiple lines, and quoted-printable escape sequences. Feed that to an LLM and you're burning tokens on `ARC-Seal: i=1; a=rsa-sha256; t=1775136695; cv=none;` instead of the actual message.

AILANG Parse reads the RFC 5322 structure directly and outputs clean, typed blocks — the same format as DOCX, XLSX, and HTML parsing. Your pipeline handles email identically to every other format.

A 125 KB Google Cloud invoice email: 2,076 lines raw, 4 structured blocks. The parser extracts the message content, identifies the PDF attachment, and maps Subject/From/Date to typed metadata — in under 2 seconds, with zero AI calls.

Raw .eml vs Structured Output

Raw .eml file (2,076 lines)

Delivered-To: user@company.com
Received: by 2002:ad4:5d67:0:b0:89c...
X-Received: by 2002:a53:c084:0:b0:650...
ARC-Seal: i=1; a=rsa-sha256; t=17751...
        d=google.com; s=arc-20240605;
        b=Htpf2+pNpdk7kEWo2yzUieUhVe...
         GUGJNZ20uHE4rPvnq0UW+8R1+v4...
ARC-Message-Signature: i=1; a=rsa-sha...
DKIM-Signature: v=1; a=rsa-sha256; ...
Content-Type: multipart/mixed;
  boundary="000000000000c91f7f064e77bbd9"

--000000000000c91f7f064e77bbd9
Content-Type: text/plain; charset="UTF-8"

Your monthly invoice is available.
...
--000000000000c91f7f064e77bbd9
Content-Type: application/pdf;
  name="5539333924.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjcKCjEgMCBvYmoKPDwgL1R5cG
UgL0NhdGFsb2cgL1BhZ2VzIDIgMCBSIC9N
YXJrSW5mbyA8PCAvTWFya2VkIHRydWUgPj
4gL1N0cnVjdFRyZWVSb290IDMgMCBSID4+
... (1,800+ more lines of base64)

Structured output (4 blocks)

{
  "metadata": {
    "title": "Your invoice is available",
    "author": "Google Payments",
    "created": "Thu, 02 Apr 2026 03:33:38 -0700"
  },
  "blocks": [
    {
      "type": "section",
      "kind": "email-headers",
      "blocks": [
        {"type": "text", "style": "email-header",
         "text": "From: Google Payments <payments-noreply@google.com>"},
        {"type": "text", "style": "email-header",
         "text": "To: user@company.com"},
        {"type": "text", "style": "email-header",
         "text": "Subject: Your invoice is available"}
      ]
    },
    {
      "type": "text",
      "style": "Normal",
      "text": "Your monthly invoice is available. Please find the PDF attached..."
    },
    {
      "type": "text",
      "style": "attachment",
      "text": "[attachment: 5539333924.pdf, application/pdf]"
    }
  ]
}

What Gets Extracted

Headers → Structured Metadata

Subject maps to title, From to author, Date to created. Multi-line folded headers (RFC 5322 §2.2.3) are automatically unfolded into single values.
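
To make the header mapping concrete, here is a minimal sketch using Python's standard `email` package rather than AILANG Parse itself. The function name `headers_to_metadata` and the sample message are illustrative assumptions; only the title/author/created mapping comes from the description above.

```python
from email import message_from_string
from email.policy import default
from email.utils import parseaddr

# Sample message with a Subject header folded across two lines.
RAW = (
    "From: Google Payments <payments-noreply@google.com>\r\n"
    "To: user@company.com\r\n"
    "Subject: Your invoice\r\n"
    " is available\r\n"
    "Date: Thu, 02 Apr 2026 03:33:38 -0700\r\n"
    "\r\n"
    "Your monthly invoice is available.\r\n"
)


def headers_to_metadata(raw: str) -> dict:
    """Hypothetical helper: map Subject/From/Date to title/author/created."""
    # The default policy unfolds continuation lines when a header is read.
    msg = message_from_string(raw, policy=default)
    display_name, address = parseaddr(str(msg["From"]))
    return {
        "title": str(msg["Subject"]),
        "author": display_name or address,  # fall back to the bare address
        "created": str(msg["Date"]),
    }


metadata = headers_to_metadata(RAW)
print(metadata)
```

The folded Subject comes back as a single value, which is the behaviour the unfolding rule describes.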

Body Text

Each plain text body becomes a TextBlock. For multipart/alternative emails, both text and HTML parts are extracted as separate blocks.

MIME Multipart

Multipart/mixed (body + attachments), multipart/alternative (text + HTML), and nested multipart structures are all resolved by boundary splitting.
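
The nesting is easier to see on a small example. The sketch below builds a multipart/mixed message that wraps a multipart/alternative body, then walks the leaf parts with Python's standard `email` package. It illustrates the MIME structure only and is not the AILANG Parse implementation; `build_sample` and `list_parts` are hypothetical names.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def build_sample() -> bytes:
    """Build mixed[alternative[plain, html], pdf] and serialize it."""
    msg = EmailMessage()
    msg["Subject"] = "Your invoice is available"
    msg.set_content("Your monthly invoice is available.")
    msg.add_alternative("<p>Your monthly invoice is available.</p>", subtype="html")
    msg.add_attachment(b"%PDF-1.7\n", maintype="application",
                       subtype="pdf", filename="invoice.pdf")
    return msg.as_bytes()


def list_parts(raw: bytes) -> list:
    """Return (content_type, filename) for every non-container part."""
    msg = message_from_bytes(raw, policy=default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart containers themselves
        parts.append((part.get_content_type(), part.get_filename()))
    return parts


parts = list_parts(build_sample())
for content_type, filename in parts:
    print(content_type, filename)
```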

Attachments — Parsed Inline

Text-based attachments (CSV, HTML, Markdown, nested EML) are decoded and parsed inline as a SectionBlock with kind attachment. Binary attachments are identified by filename and MIME type only. See details below.

Encoding Support

Real-world email uses three encoding layers. AILANG Parse handles all of them:

Encoding	RFC	What It Does	Example
Base64	RFC 2045 §6.8	Body or attachment encoded as ASCII	`VGhpcyBpcyBiYXNlNjQ=` → "This is base64"
Quoted-Printable	RFC 2045 §6.7	Non-ASCII chars as `=XX` hex pairs	`=C3=A9` → é, `=C3=BC` → ü
Encoded-Words	RFC 2047	Non-ASCII in headers (Subject, From)	`=?UTF-8?B?5Lya6K2w?=` → 会議

Multi-byte UTF-8 sequences spanning consecutive =XX pairs are accumulated and decoded together — accented characters (café, résumé, Zürich), CJK, and emoji all work correctly.
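
The three encodings can be checked by hand with Python's standard library. This is a sketch for verifying sample values, not the parser's own decoder; the quoted-printable and Q-encoded inputs are made-up examples.

```python
import base64
import quopri
from email.header import decode_header, make_header

# Base64 (RFC 2045 §6.8): decode to bytes, then interpret as UTF-8.
body = base64.b64decode("VGhpcyBpcyBiYXNlNjQ=").decode("utf-8")

# Quoted-printable (RFC 2045 §6.7): decode every =XX pair to a byte first,
# then decode the whole byte string, so multi-byte characters stay intact.
qp_text = quopri.decodestring(b"caf=C3=A9 in Z=C3=BCrich").decode("utf-8")

# Encoded-word (RFC 2047), here with Q encoding in a header value.
subject = str(make_header(decode_header("=?UTF-8?Q?R=C3=A9sum=C3=A9?=")))

print(body, qp_text, subject, sep="\n")
```

Decoding byte-by-byte and only then applying the charset is what keeps characters like é (two bytes in UTF-8) from being split.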

MBOX Archives

MBOX files (RFC 4155) contain multiple emails separated by From envelope lines. AILANG Parse splits them into individual messages, each wrapped in a SectionBlock with kind mbox-message. Every message gets full header/body/attachment parsing.

This makes MBOX ideal for bulk email processing: export a Gmail label or Thunderbird folder as MBOX, parse it in one call, and get structured JSON for every message in the archive.
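
For a feel of how envelope-line splitting behaves, the sketch below uses Python's standard `mailbox` module on a two-message archive. It is an independent illustration, not AILANG Parse code; the sample data and the helper name `mbox_subjects` are assumptions.

```python
import mailbox
import os
import tempfile

# Two messages, each introduced by a "From " envelope line.
SAMPLE_MBOX = (
    "From alice@example.com Thu Apr  2 03:33:38 2026\n"
    "From: Alice <alice@example.com>\n"
    "Subject: Q2 Budget Planning\n"
    "\n"
    "First draft attached.\n"
    "\n"
    "From bob@example.com Thu Apr  2 04:10:00 2026\n"
    "From: Bob <bob@example.com>\n"
    "Subject: Re: Q2 Budget Planning\n"
    "\n"
    "Looks good to me.\n"
)


def mbox_subjects(text: str) -> list:
    """Write the archive to a temp file and return each message's Subject."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "archive.mbox")
        with open(path, "w", encoding="utf-8", newline="\n") as fh:
            fh.write(text)
        box = mailbox.mbox(path, create=False)
        try:
            return [msg["Subject"] for msg in box]
        finally:
            box.close()


subjects = mbox_subjects(SAMPLE_MBOX)
print(subjects)
```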

Attachment Chain Parsing

Emails are documents that contain other documents. A budget email has a CSV; a contract email has HTML or a forwarded .eml. AILANG Parse decodes and parses text-based attachments inline — one API call extracts the email and its attachment content.

Supported inline formats: CSV, TSV, HTML, Markdown, plain text, and nested EML (message/rfc822). With the --deep flag, DOCX, PPTX, and XLSX attachments are also extracted and parsed using the full Office parser. Binary attachments (PDF, images) are identified by filename and MIME type.

{
  "type": "section",
  "kind": "attachment",
  "blocks": [
    {"type": "text", "style": "attachment-meta",
     "text": "revenue_q1.csv (text/csv)"},
    {"type": "table",
     "headers": [{"text": "Region"}, {"text": "Revenue"}, {"text": "Target"}],
     "rows": [
       [{"text": "EMEA"}, {"text": "125000"}, {"text": "110000"}],
       [{"text": "APAC"}, {"text": "143000"}, {"text": "130000"}]
     ]}
  ]
}
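
As a rough illustration of what inline CSV parsing involves, the sketch below decodes a CSV attachment with Python's standard library and shapes it like the block above. The function `csv_attachment_to_block` is hypothetical, and the real parser may differ in details such as type inference or empty-cell handling.

```python
import csv
import io
from email.message import EmailMessage


def csv_attachment_to_block(part) -> dict:
    """Hypothetical helper: turn a text/csv MIME part into an attachment block."""
    text = part.get_content()  # transfer-decoded str for text/* parts
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]

    def cell(value):
        return {"text": value}

    return {
        "type": "section",
        "kind": "attachment",
        "blocks": [
            {"type": "text", "style": "attachment-meta",
             "text": f"{part.get_filename()} ({part.get_content_type()})"},
            {"type": "table",
             "headers": [cell(h) for h in header],
             "rows": [[cell(v) for v in row] for row in data]},
        ],
    }


# Build a sample email carrying the CSV from the example above.
msg = EmailMessage()
msg.set_content("Q1 numbers attached.")
msg.add_attachment(
    "Region,Revenue,Target\nEMEA,125000,110000\nAPAC,143000,130000\n",
    subtype="csv", filename="revenue_q1.csv",
)
block = csv_attachment_to_block(next(msg.iter_attachments()))
print(block["blocks"][0]["text"])
```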

With --deep, Office attachments get the same treatment — a DOCX attachment is decoded, extracted, and parsed into paragraphs and tables:

# Parse email + resolve Office attachments
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail invoice-email.eml --deep

{
  "type": "section",
  "kind": "attachment",
  "blocks": [
    {"type": "text", "style": "attachment-meta",
     "text": "q1_revenue.docx (application/vnd...wordprocessingml.document)"},
    {"type": "text", "text": "Q1 Revenue Analysis", "style": "Normal"},
    {"type": "table",
     "headers": ["Region", "Revenue"],
     "rows": [["EMEA", "$125,000"], ["APAC", "$143,000"], ["Americas", "$157,000"]]}
  ]
}

Forwarded emails (message/rfc822 attachments) are parsed recursively — the inner email gets full header and body extraction, nested inside the attachment block.
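
The recursion can be sketched with Python's standard `email` package: each message/rfc822 part holds a complete inner message, which is processed the same way as the outer one. The `outline` and `_scan` helpers are illustrative assumptions, not part of AILANG Parse.

```python
from email.message import EmailMessage


def outline(msg, depth=0) -> list:
    """Return (depth, subject) for msg and every forwarded message inside it."""
    entries = [(depth, str(msg["Subject"]))]
    entries.extend(_scan(msg, depth))
    return entries


def _scan(part, depth) -> list:
    entries = []
    if part.get_content_type() == "message/rfc822":
        inner = part.get_payload(0)  # the embedded message object
        entries.extend(outline(inner, depth + 1))
    elif part.is_multipart():
        for child in part.get_payload():
            entries.extend(_scan(child, depth))
    return entries


# Sample: an outer email forwarding an inner one as an attachment.
inner = EmailMessage()
inner["Subject"] = "Signed contract"
inner.set_content("Contract attached as discussed.")

outer = EmailMessage()
outer["Subject"] = "Fwd: Signed contract"
outer.set_content("See the forwarded message.")
outer.add_attachment(inner)  # stored as message/rfc822

entries = outline(outer)
print(entries)
```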

Thread Reconstruction

Individual emails are often unintelligible without their thread. AILANG Parse reconstructs conversation threads from MBOX archives using Message-ID, In-Reply-To, and References headers.

# Threaded mode: groups messages into conversations
ailang run --entry main --caps IO,FS,Env \
  docparse/main.ail archive.mbox --threaded

Output: each thread becomes a SectionBlock with kind thread, containing the normalized subject, participant list, and chronologically ordered messages with quoted text stripped.

{
  "type": "section",
  "kind": "thread",
  "blocks": [
    {"type": "text", "style": "thread-subject", "text": "Q2 Budget Planning"},
    {"type": "text", "style": "thread-participants", "text": "Alice, Bob, Carol"},
    {"type": "section", "kind": "thread-message", "blocks": [...]},
    {"type": "section", "kind": "thread-message", "blocks": [...]},
    {"type": "section", "kind": "thread-message", "blocks": [...]}
  ]
}
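
One way to group messages by these headers is a union-find pass over Message-IDs. The sketch below uses that technique as an illustration and does not claim to be the algorithm AILANG Parse uses; the dict keys (`id`, `in_reply_to`, `references`, `date`) are hypothetical names for already-extracted header values.

```python
from collections import defaultdict


def group_threads(messages: list) -> list:
    """Group message dicts into threads, each sorted by date."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for msg in messages:
        linked = list(msg.get("references", []))
        if msg.get("in_reply_to"):
            linked.append(msg["in_reply_to"])
        for other in linked:
            parent[find(msg["id"])] = find(other)  # merge the two sets
        find(msg["id"])  # register messages that link to nothing

    threads = defaultdict(list)
    for msg in messages:
        threads[find(msg["id"])].append(msg)

    # ISO 8601 strings sort chronologically when they share a timezone.
    ordered = [sorted(t, key=lambda m: m["date"]) for t in threads.values()]
    return sorted(ordered, key=lambda t: t[0]["date"])


MESSAGES = [
    {"id": "<c@x>", "in_reply_to": "<b@x>", "references": ["<a@x>", "<b@x>"],
     "date": "2026-04-02T12:00:00"},
    {"id": "<a@x>", "in_reply_to": None, "references": [],
     "date": "2026-04-02T09:00:00"},
    {"id": "<d@x>", "in_reply_to": None, "references": [],
     "date": "2026-04-03T08:00:00"},
    {"id": "<b@x>", "in_reply_to": "<a@x>", "references": ["<a@x>"],
     "date": "2026-04-02T10:30:00"},
]

thread_ids = [[m["id"] for m in t] for t in group_threads(MESSAGES)]
print(thread_ids)
```

Union-find copes with messages arriving out of order and with missing intermediate replies, since a References entry links the reply to the thread even when the referenced message is absent from the archive.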

Thread reconstruction gives LLMs the full conversation context for summarization, triage, or automated response — without the noise of repeated quoted text.
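
Quote stripping is typically heuristic. The sketch below drops `>`-prefixed lines and a common English "On ... wrote:" attribution line. It is an assumption about one reasonable approach, not the rule set AILANG Parse applies, and it will miss other mail clients' quoting styles.

```python
import re

# Matches attribution lines such as "On Thu, Apr 2, 2026 ... wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:$")


def strip_quoted(body: str) -> str:
    """Remove quoted reply text, keeping only what the sender wrote."""
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # quoted line from an earlier message
        if ATTRIBUTION.match(line.strip()):
            continue  # "On <date> <person> wrote:" lead-in
        kept.append(line)
    return "\n".join(kept).strip()


reply = (
    "Looks good to me.\n"
    "\n"
    "On Thu, Apr 2, 2026 at 3:33 AM Alice <alice@example.com> wrote:\n"
    "> First draft attached.\n"
    "> Please review.\n"
)
clean = strip_quoted(reply)
print(clean)
```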

Use Cases

Structured parsing makes email composable with the rest of your data pipeline.

AI Email Triage & Routing

Feed structured email JSON to an LLM for classification, routing, or auto-response. A 2,000-line raw email becomes 4 typed blocks, so the model reasons about content, not DKIM signatures. Classification can improve because tokens are spent on the actual message rather than on transport headers.

Inbox Monitoring Agents

An AI agent watching a mailbox (IMAP, local .eml files, or Gmail API exports) parses each email to structured JSON, extracts the message + attachment manifest, and takes action. The agent sees [attachment: invoice.pdf, application/pdf] — not 1,500 lines of base64.
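
An agent could build that attachment manifest from the parsed JSON along these lines. The sketch assumes the `[attachment: name, type]` text format and the `style: "attachment"` marker shown in the structured output earlier on this page; `attachment_manifest` is a hypothetical helper, not an SDK function.

```python
import re

# Pattern for the "[attachment: <filename>, <mime type>]" text shown above.
ATTACHMENT = re.compile(r"^\[attachment: (?P<name>.+), (?P<mime>[^,\]]+)\]$")


def attachment_manifest(blocks: list) -> list:
    """Collect filename and MIME type for every attachment block, recursively."""
    manifest = []
    for block in blocks:
        if block.get("type") == "section":
            manifest.extend(attachment_manifest(block.get("blocks", [])))
        elif block.get("style") == "attachment":
            match = ATTACHMENT.match(block.get("text", ""))
            if match:
                manifest.append({"filename": match["name"],
                                 "mime_type": match["mime"].strip()})
    return manifest


parsed_blocks = [
    {"type": "section", "kind": "email-headers", "blocks": [
        {"type": "text", "style": "email-header", "text": "To: user@company.com"},
    ]},
    {"type": "text", "style": "Normal", "text": "Your monthly invoice is available."},
    {"type": "text", "style": "attachment",
     "text": "[attachment: 5539333924.pdf, application/pdf]"},
]

manifest = attachment_manifest(parsed_blocks)
print(manifest)
```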

Email-to-Knowledge-Base

Parse MBOX exports from Gmail labels or Thunderbird folders into structured blocks and embed them in a vector database. Metadata (author, date, subject) becomes filterable dimensions alongside semantic search on body content. One ailang run --batch call processes an entire archive.

Multi-Format Document Pipelines

CSV, HTML, and Markdown attachments are parsed inline automatically. A single API call returns the email body and the table data from the attached spreadsheet — no second parse step needed. Binary attachments (PDF, images) are identified for downstream AI processing.

Compliance & E-Discovery

Bulk MBOX processing preserves the full delivery chain (Received headers, DKIM, SPF results) for forensic analysis, while body content is cleanly separated for legal review. Attachment manifests provide a complete inventory without extracting binary data.

Customer Support Automation

Parse inbound support emails to extract the customer query, thread context, and any attached screenshots or documents. The structured output feeds directly into ticket creation, sentiment analysis, or automated response generation — with the original email metadata preserved for routing.

Batch mode: ailang run --batch compiles once and parses N files, so the AILANG runtime startup is paid once per batch instead of once per file. Useful for processing entire mailbox directories in a single invocation.

Example: GitHub Notification

A real GitHub Actions notification email (48 KB, multipart/alternative with DKIM/ARC/SPF headers):

{
  "metadata": {
    "title": "[sunholo-data/ailang] Run failed: Build and Release - dev",
    "author": "\"Voight-Kampff (bot)\"",
    "created": "Thu, 02 Apr 2026 06:31:33 -0700"
  },
  "blocks": [
    {
      "type": "section",
      "kind": "email-headers",
      "blocks": [
        {"type": "text", "style": "email-header",
         "text": "From: \"Voight-Kampff (bot)\" <notifications@github.com>"},
        {"type": "text", "style": "email-header",
         "text": "Subject: [sunholo-data/ailang] Run failed: Build and Release - dev"},
        {"type": "text", "style": "email-header",
         "text": "X-Github-Reason: ci_activity"}
      ]
    },
    {
      "type": "text",
      "style": "Normal",
      "text": "Repository: sunholo-data/ailang\nWorkflow: Build and Release\nDuration: 5 minutes and 50.0 seconds\n\nJobs:\n  * Build ubuntu-latest succeeded\n  * Build windows-latest failed (2 annotations)\n  * Create Release Bundle skipped"
    }
  ]
}

28 raw headers (DKIM chains, ARC seals, SPF results) reduced to the headers that matter. The body is clean text, ready for an LLM to reason about build failures.

Try It

CLI

# Parse an EML email file
ailang run --entry main --caps IO,FS,Env \
  ~/.ailang/cache/registry/sunholo/ailang_parse/*/docparse/main.ail inbox.eml

# Parse an MBOX archive (multiple messages)
ailang run --entry main --caps IO,FS,Env \
  ~/.ailang/cache/registry/sunholo/ailang_parse/*/docparse/main.ail archive.mbox

API

curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath":"sample_eml_welcome","outputFormat":"markdown","apiKey":"YOUR_API_KEY"}'

Python SDK

from ailang_parse import DocParse

client = DocParse(api_key="YOUR_API_KEY")
result = client.parse("inbox.eml", output_format="json")

# Metadata extracted from headers
print(result.metadata.title)   # Subject line
print(result.metadata.author)  # From display name

# Iterate blocks like any other format
for block in result.blocks:
    if block.style == "attachment":
        print(f"Attachment: {block.text}")


Frequently Asked Questions

How do I parse an EML file to JSON?

Send the .eml file to the AILANG Parse API or use the Python/JS/Go SDK. The parser returns structured JSON with typed blocks for headers, body text, HTML content, and attachments.

Does email parsing require AI or an API key?

No. Email parsing is fully deterministic — RFC 5322 and MIME spec parsing, not AI. It runs in the browser via WASM or server-side with zero external dependencies.

What email encodings are supported?

Base64 body decoding, quoted-printable (including multi-byte UTF-8), and RFC 2047 encoded-words for international headers (both B and Q encoding).

Can I parse MBOX archives with multiple messages?

Yes. MBOX files are split on RFC 4155 envelope lines. Each message becomes a separate section block, parsed independently with full header and body extraction.

What is the output format for parsed emails?

The same Block ADT used for DOCX, XLSX, and HTML parsing: typed blocks with metadata. Downstream code handles all formats identically. See the docs for the full schema.

Are email attachments parsed automatically?

Text-based attachments (CSV, HTML, Markdown, nested EML) are parsed inline automatically. With the --deep flag, DOCX, PPTX, and XLSX attachments are also extracted and parsed using the full Office parser.

Can I reconstruct email threads from an MBOX file?

Yes. Use the --threaded flag. Messages are grouped into conversation threads using Message-ID/In-Reply-To/References headers, with quoted text stripped from replies.