RTF parsing — pure AILANG, no Office required

Government forms, legal templates, and 30+ year-old MailMerge pipelines still hand you .rtf files. AILANG Parse reads them directly — no LibreOffice subprocess, no Word ActiveX, no antiword shell-out.

The Problem

RTF was specified by Microsoft in 1987 and never went away. It's still a default "Save As" option in Word, it's what TextEdit produces on macOS, and it's the lingua franca of legacy government and legal pipelines. Yet most parsing libraries either skip RTF entirely or shell out to a 200 MB LibreOffice install just to read a 50 KB form.

Spawning soffice --headless to read an RTF is a 2 GB memory hit, ~3 seconds of cold-start, and a recurring source of file-descriptor leaks in long-running services. AILANG Parse parses the same file in <50 ms with no subprocess.

How It Works

The RTF parser is a char-by-char state machine that walks the source via std/string.foldChars. It tracks { } group nesting, recognizes control words and control symbols, and decodes Unicode/CP1252 escapes. The output is the same Block ADT as every other AILANG Parse format — TextBlock per paragraph, no new variants.

# Parse an RTF file from the CLI
./bin/docparse contract.rtf

# Or via API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -F "filepath=@contract.rtf" \
  -F "outputFormat=markdown" \
  -F "apiKey=YOUR_API_KEY"
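
The state machine described at the start of this section can be sketched in a few lines. This is an illustrative Python analogue, not the AILANG source: functools.reduce stands in for std/string.foldChars, and the State fields and function names are hypothetical. It only handles group braces, \par, \tab, and the \\ \{ \} control symbols; \uN and \'XX escapes are left out.

```python
from dataclasses import dataclass, field
from functools import reduce

DIGITS = "0123456789"


@dataclass
class State:
    depth: int = 0          # current { } nesting level (tracked, not otherwise used here)
    mode: str = "text"      # "text", "escape" (just saw a backslash), or "word"
    word: str = ""          # control word being accumulated, including its parameter
    text: str = ""          # text of the paragraph in progress
    blocks: list = field(default_factory=list)


def finish_word(s: State) -> None:
    """Act on a completed control word; unknown words are ignored."""
    name = s.word.rstrip("-" + DIGITS)   # drop the numeric parameter, e.g. fs24 -> fs
    if name == "par":
        if s.text:
            s.blocks.append(s.text)
        s.text = ""
    elif name == "tab":
        s.text += "\t"
    s.word, s.mode = "", "text"


def step(s: State, ch: str) -> State:
    if s.mode == "escape":
        if ch.isascii() and ch.isalpha():
            s.mode, s.word = "word", ch          # start of a control word
        else:
            if ch in "\\{}":                     # control symbol: literal \ { }
                s.text += ch
            s.mode = "text"
        return s
    if s.mode == "word":
        in_param = not s.word[-1].isalpha()      # already reading the numeric part?
        is_letter = ch.isascii() and ch.isalpha()
        if ch in DIGITS or (not in_param and (is_letter or ch == "-")):
            s.word += ch
            return s
        finish_word(s)
        if ch == " ":                            # one delimiter space is consumed
            return s
        # any other character ends the word and is processed as normal input
    if ch == "\\":
        s.mode = "escape"
    elif ch == "{":
        s.depth += 1
    elif ch == "}":
        s.depth -= 1
    elif ch not in "\r\n":                       # raw newlines carry no meaning in RTF
        s.text += ch
    return s


def extract_paragraphs(rtf: str) -> list:
    s = reduce(step, rtf, State())               # fold over every character
    if s.mode == "word":
        finish_word(s)
    if s.text:
        s.blocks.append(s.text)
    return s.blocks
```

For example, extract_paragraphs(r"{\rtf1\ansi Name:\tab Jane\par Line \{2\}\par}") returns ["Name:\tJane", "Line {2}"]. Keeping all parser state in one value passed through the fold is what makes the approach deterministic and easy to test.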

Encoding Support

RTF uses two parallel escape mechanisms for non-ASCII characters. AILANG Parse decodes both correctly.

| Escape | Meaning | Example | Decoded |
|---|---|---|---|
| \uNNNN? | Signed 16-bit Unicode code point (RTF 1.5+) | K\u248?bsaftale | Købsaftale |
| \'XX | CP1252 (Windows-1252) byte in hex | 100\'80 | 100€ |
| \\ \{ \} | Literal backslash/brace | 50\% off | 50% off |
| \tab \cell | Tab character / table cell separator | Name:\tab Jane | Name:[TAB]Jane |
| \par \row | Paragraph break | line 1\par line 2 | Two TextBlocks |

The CP1252 decoder includes the full Windows-1252 translation table for the 0x80–0x9F range: euro sign (0x80 → €), smart quotes (0x91–0x94 → ‘ ’ “ ”), en/em dashes (0x96–0x97 → – —), ellipsis (0x85 → …), trademark (0x99 → ™), and all the other punctuation Microsoft tucked into the C1 control space. Bytes 0xA0–0xFF map directly to Unicode U+00A0–U+00FF.
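
The two escape mechanisms can be illustrated with a short Python sketch. It is not the AILANG implementation: decode_escapes and the regular expression are hypothetical, and Python's built-in cp1252 codec stands in for the translation table. It assumes one fallback character after each \uN (the RTF default) and does not combine surrogate pairs.

```python
import re

ESCAPE = re.compile(
    r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|.)"   # \uN followed by one fallback character
    r"|\\'([0-9a-fA-F]{2})",                 # \'XX hex byte
    re.DOTALL,
)


def decode_escapes(s: str) -> str:
    def repl(m: re.Match) -> str:
        if m.group(1) is not None:
            n = int(m.group(1))
            if n < 0:
                n += 0x10000                 # \uN is signed 16-bit; wrap negatives
            return chr(n)
        byte = bytes([int(m.group(2), 16)])
        # The five bytes Windows-1252 leaves undefined become U+FFFD.
        return byte.decode("cp1252", errors="replace")

    return ESCAPE.sub(repl, s)
```

With this sketch, decode_escapes(r"K\u248?bsaftale") yields "Købsaftale" and decode_escapes(r"100\'80") yields "100€", matching the table rows above.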

Skipped Destinations

RTF mixes content with header tables — fonts, colors, stylesheets, list definitions, generator strings, theme data. The parser skips these wholesale so they never pollute extracted text. Nested skip groups (e.g. {\*\panose ...} inside \fonttbl) are tracked correctly: the outer skip context wins, so leaving a nested group doesn't accidentally re-enable text capture.

  • \fonttbl, \filetbl, \colortbl, \stylesheet, \listtable, \listoverridetable — definition tables
  • \info, \generator, \themedata, \latentstyles, \datastore — document metadata blobs
  • \pict, \shppict, \nonshppict — embedded image data (not yet extracted)
  • \field, \fldinst, \fldrslt — field codes and hyperlink internals
  • \bkmkstart, \bkmkend, \panose, \xmlnstbl — minor structural markers
  • Any {\*\foo …} — the RTF specification marks starred destinations as optional, so consumers are free to skip them; we do
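
The "outer skip context wins" rule can be sketched by remembering the depth at which the outermost skipped group opened. The Python below is a hypothetical illustration over a pre-tokenized stream (the token shapes, SKIP subset, and visible_text name are assumptions, not the parser's actual types).

```python
# Subset of the skip list above, for illustration only.
SKIP = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}


def visible_text(tokens) -> str:
    """tokens: iterable of (kind, value), kind in {"open", "close", "word", "text"}."""
    depth = 0
    skip_from = None            # depth of the outermost skipped group, if any
    out = []
    for kind, value in tokens:
        if kind == "open":
            depth += 1
        elif kind == "close":
            if skip_from == depth:
                skip_from = None    # only the group that started the skip can end it
            depth -= 1
        elif kind == "word":
            # "*" marks a starred destination; a nested skip never replaces the outer one
            if skip_from is None and (value in SKIP or value == "*"):
                skip_from = depth
        elif kind == "text" and skip_from is None:
            out.append(value)
    return "".join(out)
```

Because skip_from is only cleared when the closing brace is at the recorded depth, closing a nested {\*\panose ...} group inside \fonttbl leaves the skip in place until \fonttbl itself closes.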

Example Output

Each \par closes a TextBlock. Tab-separated form fields show up as \t-joined text within a single block, which preserves the row layout when re-rendered as Markdown or HTML.

{
  "blocks": [
    {"kind": "TextBlock", "style": "normal", "text": "Sample RTF Document"},
    {"kind": "TextBlock", "style": "normal", "text": "Danish words: Købsaftale, økonomi, smørrebrød, fjærkræ, ål."},
    {"kind": "TextBlock", "style": "normal", "text": "French phrases: café au lait, première classe, naïveté."},
    {"kind": "TextBlock", "style": "normal", "text": "Currency: 50£ pounds sterling, 100€ euro, 250¥ yen, copyright © 1999."},
    {"kind": "TextBlock", "style": "normal", "text": "Smart quotes: “like these” and ‘these’."},
    {"kind": "TextBlock", "style": "normal", "text": "Name:\tJane Doe"},
    {"kind": "TextBlock", "style": "normal", "text": "Email:\tjane@example.com"}
  ]
}
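
Downstream code can consume this output directly. As a minimal sketch, the hypothetical helper below (form_fields is not part of AILANG Parse) turns "Label:\tvalue" blocks into a dictionary, assuming the JSON shape shown above.

```python
import json


def form_fields(parsed: dict) -> dict:
    """Collect 'Label:<TAB>value' TextBlocks into a {label: value} mapping."""
    fields = {}
    for block in parsed["blocks"]:
        if block.get("kind") != "TextBlock":
            continue
        label, sep, value = block["text"].partition("\t")
        if sep and label.endswith(":"):
            fields[label.rstrip(":").strip()] = value.strip()
    return fields


sample = json.loads("""
{"blocks": [
  {"kind": "TextBlock", "style": "normal", "text": "Sample RTF Document"},
  {"kind": "TextBlock", "style": "normal", "text": "Name:\\tJane Doe"},
  {"kind": "TextBlock", "style": "normal", "text": "Email:\\tjane@example.com"}
]}
""")
print(form_fields(sample))
```

Blocks without a tab, such as the title line, are left out of the mapping.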

Use Cases

Government & Legal Forms

Danish Købsaftale (real-estate purchase agreements), French CERFA tax forms, UK court filings, US municipal templates. These are often shipped as .rtf because every operating system since the 1990s can render them without a converter. Parse them in batch without standing up Word.

Legacy ERP & CRM Exports

Older SAP, Oracle, and JD Edwards modules export reports as RTF. So do Salesforce mail merge templates and certain Dynamics flows. Hooking up a parser that doesn't require Office means batch ETL jobs stop needing a Windows host.

Email & MailMerge Pipelines

Outlook still saves drafts as RTF in message/rfc822 attachments. Mass-mailing platforms (Constant Contact, Mailchimp legacy lists) accept RTF templates. Parsing them lets you feed historical campaigns into modern analytics without re-keying.

Document Migration & Archival

Government and academic archives often hold decades of .rtf files alongside .doc and .docx. Migrating to a modern store means parsing all three. Until v0.20, AILANG Parse could only handle two of the three.

RAG over Heterogeneous Corpora

Your knowledge base has DOCX from 2018, PDF from 2024, EML from 2012, and RTF from 2003. A single parser API that handles all of them, deterministically and with the same Block ADT, is worth more than marginal quality gains on any individual format.

TextEdit & Pages Exports

macOS users routinely save notes as RTF — that's TextEdit's default rich-text format. Pages and Numbers export to RTF for cross-platform handoff. If your product takes uploads from Mac users, you'll hit RTF eventually.

Try It

RTF is one of 16 supported input formats. The parser is pure AILANG — zero model calls, zero subprocess, fully deterministic.

# CLI
./bin/docparse contract.rtf --convert contract.md

# API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -F "filepath=@contract.rtf" \
  -F "outputFormat=markdown" \
  -F "apiKey=YOUR_API_KEY"


Frequently Asked Questions

Do I need LibreOffice or Word installed to parse RTF?

No. AILANG Parse handles RTF entirely in-process — no soffice subprocess, no Word ActiveX, no antiword binary. The parser is pure AILANG and runs in the browser via WebAssembly as well as on the server.

Does it handle non-ASCII characters like Danish or German?

Yes. \uNNNN Unicode escapes are decoded via std/bytes UTF-8 reassembly, and \'XX CP1252 hex escapes go through a built-in Windows-1252 translation table that maps 0x80–0x9F to their canonical Unicode code points (euro sign, smart quotes, em dash, etc.).

What does the parser skip?

Header destinations (\fonttbl, \colortbl, \stylesheet, \info, \listtable, \themedata, \panose, picture data, fields, etc.) are skipped as whole groups. Any starred destination ({\*\foo …}) is also skipped including its nested children. Formatting control words (\b, \i, \fs, \cf, …) are recognized and ignored.

Are tables preserved?

Tables are flattened in v0.20: \cell becomes TAB, \row becomes a paragraph break. This preserves the data but not the grid topology. Proper TableBlock emission for RTF is on the roadmap.
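
Until TableBlock emission lands, a caller can approximate the grid by regrouping consecutive tab-joined paragraphs. The Python sketch below is a heuristic under stated assumptions: regroup_tables is a hypothetical helper, it assumes each row arrives as one tab-joined text, and it cannot tell a real table from a run of tabbed form fields.

```python
from itertools import groupby


def regroup_tables(texts):
    """Turn runs of tab-containing paragraphs into lists of rows; keep the rest as strings."""
    result = []
    for has_tab, run in groupby(texts, key=lambda t: "\t" in t):
        if has_tab:
            # A trailing tab (from a final \cell) is dropped; this also drops
            # trailing empty cells, which is a limitation of the heuristic.
            result.append([t.rstrip("\t").split("\t") for t in run])
        else:
            result.extend(run)
    return result
```

For instance, regroup_tables(["Intro", "Item\tQty", "Pen\t2\t", "Outro"]) returns ["Intro", [["Item", "Qty"], ["Pen", "2"]], "Outro"].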

What about embedded images (\pict)?

Image data is skipped in v0.20 (the \pict destination is in the skip list). The pixel data is hex-encoded inside the RTF and would require a separate decoder. If you need RTF image extraction, file a feature request.

Where does RTF still show up in real workflows?

Government forms (Danish Købsaftale, French CERFA), legal templates, court filings, ERP exports, older MailMerge pipelines, and any system that needs a format Word and TextEdit both render without a converter. RTF is still a default "Save As" option in Microsoft Word.