Track Changes Extraction

Author attribution, timestamps, and change types — directly from the XML.

The Problem

Track changes are the negotiation history of a document. In legal contract review, every insertion and deletion carries meaning: who changed what, and when.

Yet the other document parsers we tested throw them away.

Track changes in contracts are legally significant data. Losing them is data destruction, not a quality tradeoff.

How It Works

A DOCX file is a ZIP archive containing XML files governed by the ECMA-376 (Office Open XML) standard. Track changes are stored as inline revision markup within the document body XML (word/document.xml).

AILANG Parse reads these XML elements directly:

  • w:ins — Insertions. The element wraps the inserted content and carries w:author and w:date attributes.
  • w:del — Deletions. Wraps the deleted content (which Word preserves in the XML even though it is struck through in the UI).
  • w:rPrChange — Run property changes. Records formatting modifications (bold, italic, font size, color) with before/after state.

Each revision element includes the author name and an ISO 8601 timestamp, both extracted verbatim. AILANG Parse does not interpret or resolve the changes — it preserves them as structured data so downstream consumers can decide how to handle them.
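As a rough illustration of the approach described above, here is a minimal Python sketch that opens a DOCX as a ZIP archive and collects w:ins and w:del elements from word/document.xml using only the standard library. It is not AILANG Parse's implementation: the function name read_revisions and the output keys are assumptions chosen to mirror the sample output on this page, and the tiny in-memory DOCX exists only to make the sketch runnable.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace defined by ECMA-376.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_revisions(docx_file):
    """Collect w:ins / w:del revisions from a DOCX path or file-like object.

    Hypothetical helper for illustration; not part of any published API.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    revisions = []
    for element in root.iter():  # document order
        if element.tag == f"{{{W}}}ins":
            kind, text_tag = "insertion", f"{{{W}}}t"
        elif element.tag == f"{{{W}}}del":
            # Deleted runs keep their text in w:delText rather than w:t.
            kind, text_tag = "deletion", f"{{{W}}}delText"
        else:
            continue
        # Empty markers (e.g. an inserted paragraph mark) yield content "".
        text = "".join(t.text or "" for t in element.iter(text_tag))
        revisions.append({
            "changeType": kind,
            "author": element.get(f"{{{W}}}author"),  # copied verbatim
            "date": element.get(f"{{{W}}}date"),      # copied verbatim
            "content": text,
        })
    return revisions


# Build a minimal in-memory DOCX so the sketch can run without a real file.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:del w:id="1" w:author="Sarah Chen" w:date="2026-03-15T14:32:00Z">
  <w:r><w:delText>30 calendar days</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Sarah Chen" w:date="2026-03-15T14:32:00Z">
  <w:r><w:t>45 business days</w:t></w:r>
</w:ins>
</w:p></w:body></w:document>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", SAMPLE)
buffer.seek(0)

revisions = read_revisions(buffer)
for revision in revisions:
    print(revision)
```

The sketch deliberately does nothing beyond copying attributes and text: it neither accepts nor rejects a change, which matches the "preserve, do not resolve" behavior described above.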

The Block ADT

AILANG Parse represents all document content as a flat list of typed blocks. The Block ADT has 9 variants: Text, Heading, Table, Image, Audio, Video, List, Section, and Change.

Track changes map to the Change variant, which carries:

  • type — "insertion", "deletion", or "formatChange" (serialized as changeType in the JSON output)
  • author — the author name from the revision markup
  • date — the ISO 8601 timestamp
  • content — the affected text

This means track changes are first-class citizens in the output, not annotations or metadata bolted onto text blocks. They flow through the same pipeline as every other block type — into JSON, into format conversion, into Quarto Markdown (which renders them as CriticMarkup).
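To make the "flat list of typed blocks" idea concrete, the following Python sketch models three of the nine variants as dataclasses and serializes them through one shared function. The class layout, the to_json helper, and the field name changeType (borrowed from the sample JSON on this page) are illustrative assumptions, not the actual AILANG type definitions.

```python
from dataclasses import dataclass, asdict
from typing import Literal, Union


@dataclass(frozen=True)
class Text:
    content: str


@dataclass(frozen=True)
class Heading:
    level: int
    content: str


@dataclass(frozen=True)
class Change:
    # Field name mirrors the JSON key in the sample output (assumption).
    changeType: Literal["insertion", "deletion", "formatChange"]
    author: str
    date: str      # ISO 8601 string, kept verbatim
    content: str


# Only three of the nine variants are modelled in this sketch.
Block = Union[Text, Heading, Change]


def to_json(block: Block) -> dict:
    """Serialize any block the same way: variant name plus its fields."""
    return {"type": type(block).__name__, **asdict(block)}


document = [
    Heading(1, "Service Agreement"),
    Text("This agreement is entered into between Party A and Party B."),
    Change("insertion", "James Park", "2026-03-17T09:15:00Z",
           "Termination for cause requires documented breach with a 15-day cure period."),
]

serialized = [to_json(block) for block in document]
```

Because Change is just another member of the union, to_json needs no special case for it; that is the practical meaning of "first-class" here.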

What Gets Extracted

Change Type      OOXML Element   Extracted Data
---------------  --------------  ----------------------------------------------------
Insertions       w:ins           Author, date, inserted text content
Deletions        w:del           Author, date, deleted text content (preserved from XML)
Formatting       w:rPrChange     Author, date, property change description
Move from        w:moveFrom      Author, date, original location content
Move to          w:moveTo        Author, date, destination content
Section props    w:sectPrChange  Author, date, section property modification
Paragraph props  w:pPrChange     Author, date, paragraph formatting change

All change types preserve the full author string and ISO 8601 date from the XML. Nothing is inferred, approximated, or dropped.

Example Output

Given a DOCX contract with tracked edits from two authors, AILANG Parse produces output like this:

[
  {
    "type": "Heading",
    "level": 1,
    "content": "Service Agreement"
  },
  {
    "type": "Text",
    "content": "This agreement is entered into between Party A and Party B."
  },
  {
    "type": "Change",
    "changeType": "deletion",
    "author": "Sarah Chen",
    "date": "2026-03-15T14:32:00Z",
    "content": "30 calendar days"
  },
  {
    "type": "Change",
    "changeType": "insertion",
    "author": "Sarah Chen",
    "date": "2026-03-15T14:32:00Z",
    "content": "45 business days"
  },
  {
    "type": "Text",
    "content": "written notice to the other party."
  },
  {
    "type": "Change",
    "changeType": "insertion",
    "author": "James Park",
    "date": "2026-03-17T09:15:00Z",
    "content": "Termination for cause requires documented breach with a 15-day cure period."
  },
  {
    "type": "Change",
    "changeType": "formatChange",
    "author": "James Park",
    "date": "2026-03-17T09:16:00Z",
    "content": "Indemnification clause modified (bold applied)"
  }
]

Every change is a discrete block with its own author and timestamp. Downstream consumers can filter by author, reconstruct the document at any point in its revision history, or render the changes visually (as AILANG Parse does when converting to Quarto Markdown with CriticMarkup).
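A short Python sketch of what a downstream consumer might do with output shaped like the sample above: render it with CriticMarkup delimiters, or produce the text as it would read with every change accepted. The function names are hypothetical, joining blocks with a single space is a simplifying assumption, and this is not a description of how AILANG Parse's own Quarto conversion works.

```python
def to_criticmarkup(blocks):
    """Render blocks as one line of text with CriticMarkup delimiters."""
    parts = []
    for block in blocks:
        if block["type"] != "Change":
            parts.append(block["content"])
        elif block["changeType"] == "insertion":
            parts.append("{++" + block["content"] + "++}")
        elif block["changeType"] == "deletion":
            parts.append("{--" + block["content"] + "--}")
        else:
            # Format changes carry a description, shown here as a comment.
            parts.append("{>>" + block["content"] + "<<}")
    return " ".join(parts)


def accepted_text(blocks):
    """Text as it would read if every tracked change were accepted."""
    kept = [
        block["content"]
        for block in blocks
        if block["type"] != "Change" or block["changeType"] == "insertion"
    ]
    return " ".join(kept)


blocks = [
    {"type": "Text",
     "content": "This agreement is entered into between Party A and Party B."},
    {"type": "Change", "changeType": "deletion", "author": "Sarah Chen",
     "date": "2026-03-15T14:32:00Z", "content": "30 calendar days"},
    {"type": "Change", "changeType": "insertion", "author": "Sarah Chen",
     "date": "2026-03-15T14:32:00Z", "content": "45 business days"},
    {"type": "Text", "content": "written notice to the other party."},
]

marked_up = to_criticmarkup(blocks)
accepted = accepted_text(blocks)
```

Both functions are a single pass over the list, which is only possible because each change arrives as its own block rather than being flattened into the surrounding text.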

Other Parsers

No other document parser we tested extracts track changes from DOCX files. This is not a matter of degree — the feature is entirely absent.

Parser        Track Changes                                 Notes
------------  --------------------------------------------  ----------------------------------------------------------------
AILANG Parse  Full extraction: author, date, type, content  Reads w:ins, w:del, w:rPrChange directly from OOXML
Unstructured  Not extracted                                 Converts DOCX to text; revision markup is discarded
Docling       Not extracted                                 Converts DOCX to PDF internally; track changes lost in conversion
LlamaParse    Not extracted                                 Cloud API returns accepted-view text only
MarkItDown    Not extracted                                 Converts DOCX to Markdown; revision markup is stripped

The reason is architectural. PDF-first parsers lose track changes in the rendering step, and text extraction libraries do not handle revision markup. AILANG Parse reads the OOXML revision elements directly.

Try It

Upload a DOCX with track changes and see the Change blocks appear in the output. Compare it to what your current parser returns — the revision data is either there or it is not.

Frequently Asked Questions

How do I extract track changes from a DOCX file programmatically?

AILANG Parse reads w:ins, w:del, and w:moveTo/w:moveFrom revision markup directly from the DOCX XML, returning each revision with author, timestamp, and change type.

Do other document parsers preserve track changes from DOCX files?

No. The parsers we tested either convert to PDF (stripping revisions) or use libraries that ignore revision elements. See the comparison page for details.

Can I get track change author and date information from a DOCX?

Yes. AILANG Parse extracts w:author and w:date from every revision element, with ISO 8601 timestamps for filtering, sorting, or audit trails.
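For example, an audit script might keep only the changes made on or after a cutoff and sort them chronologically. The Python sketch below assumes Change blocks shaped like the sample output on this page; the helper names are hypothetical, and treating a timestamp without an offset as UTC is an assumption you may need to adjust for your documents.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 string such as 2026-03-15T14:32:00Z."""
    # fromisoformat() accepts a trailing "Z" only from Python 3.11 onward,
    # so normalize it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def changes_since(changes, cutoff):
    """Return changes dated on or after cutoff, oldest first."""
    recent = [c for c in changes if parse_timestamp(c["date"]) >= cutoff]
    return sorted(recent, key=lambda c: parse_timestamp(c["date"]))


changes = [
    {"type": "Change", "changeType": "formatChange", "author": "James Park",
     "date": "2026-03-17T09:16:00Z",
     "content": "Indemnification clause modified (bold applied)"},
    {"type": "Change", "changeType": "insertion", "author": "Sarah Chen",
     "date": "2026-03-15T14:32:00Z", "content": "45 business days"},
    {"type": "Change", "changeType": "insertion", "author": "James Park",
     "date": "2026-03-17T09:15:00Z",
     "content": "Termination for cause requires documented breach with a 15-day cure period."},
]

cutoff = datetime(2026, 3, 16, tzinfo=timezone.utc)
recent = changes_since(changes, cutoff)
```

The original date strings are left untouched in the returned blocks; parsing happens only for comparison, so the verbatim values remain available for an audit trail.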

What does deterministic document parsing mean?

Same input always produces the same output. AILANG Parse uses rule-based extraction verified by Z3 formal contracts — no ML models, no variability.

Format Guides

Track changes are supported in DOCX and PPTX files. See the full format guides for everything AILANG Parse extracts:

DOCX Parsing →  ·  PPTX Parsing →