Comment Extraction — AILANG Parse

Q: How do I extract comments with author names from a DOCX file?

AILANG Parse reads comments.xml and matches each comment to its anchor using w:commentRangeStart/w:commentRangeEnd markers, outputting text, author, date, and anchored range.

The Problem

Comments are the collaboration layer of Office documents. They carry the context that matters most: who raised a concern, when they raised it, what specific text they were responding to, and the threaded conversation that followed.

PDF has no equivalent of Office's anchored, threaded comments. Converting to PDF either flattens comments into margin annotations or drops them, so the structured review history is lost.

Comments are stored in dedicated XML parts (word/comments.xml, xl/comments1.xml, ppt/comments/comment1.xml) with range markers linking them to the exact content they annotate. No PDF-based parser can access them because they never survive the conversion.

None of the other parsers in the comparison table below extracts Office comments with full attribution.

Parser	Comments	Author	Timestamp	Anchor Text	Threading
AILANG Parse	Extracted	Preserved	Preserved	Preserved	Preserved
Unstructured	Dropped	Dropped	Dropped	Dropped	Dropped
Docling	Dropped	Dropped	Dropped	Dropped	Dropped
LlamaParse	Dropped	Dropped	Dropped	Dropped	Dropped
MarkItDown	Dropped	Dropped	Dropped	Dropped	Dropped

How It Works

AILANG Parse reads the Office XML directly — no intermediate conversion, no rendering, no PDF step. It opens the OOXML ZIP archive and extracts comment data from the native XML parts.

DOCX (Word)

Word stores comments in word/comments.xml as w:comment elements, each with w:id, w:author, and w:date attributes. In the document body (word/document.xml), w:commentRangeStart and w:commentRangeEnd markers bracket the exact text the comment is attached to. AILANG Parse correlates these ranges with the comment content to produce fully anchored output.
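The correlation described above can be sketched with the Python standard library alone. This is a minimal illustration of the technique, not AILANG Parse's actual implementation; the function name extract_docx_comments and the shape of the returned dictionaries are assumptions made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _attr(el, name):
    """Read a w:-namespaced attribute such as w:id or w:author."""
    return el.get(f"{{{W}}}{name}")


def extract_docx_comments(path):
    """Hypothetical helper: return one dict per comment with its anchor text."""
    with zipfile.ZipFile(path) as z:
        if "word/comments.xml" not in z.namelist():
            return []
        comments_root = ET.fromstring(z.read("word/comments.xml"))
        body_root = ET.fromstring(z.read("word/document.xml"))

    # Walk the body in document order and collect the text runs that fall
    # between each commentRangeStart / commentRangeEnd pair.
    open_ids = set()
    anchors = {}
    for el in body_root.iter():
        if el.tag == f"{{{W}}}commentRangeStart":
            cid = _attr(el, "id")
            open_ids.add(cid)
            anchors.setdefault(cid, [])
        elif el.tag == f"{{{W}}}commentRangeEnd":
            open_ids.discard(_attr(el, "id"))
        elif el.tag == f"{{{W}}}t" and el.text:
            for cid in open_ids:
                anchors[cid].append(el.text)

    results = []
    for c in comments_root.findall("w:comment", NS):
        cid = _attr(c, "id")
        # Comment bodies are ordinary paragraphs; join their text runs.
        text = "".join(t.text or "" for t in c.iter(f"{{{W}}}t"))
        results.append({
            "id": cid,
            "author": _attr(c, "author"),
            "date": _attr(c, "date"),
            "text": text,
            "anchor": "".join(anchors.get(cid, [])),
        })
    return results
```

Because the body is walked once in document order, overlapping comment ranges are handled naturally: a text run is appended to every range that is open at that point.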

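Reply threading is recorded in word/commentsExtended.xml, where a w15:commentEx entry with a w15:paraIdParent attribute points back to the parent comment, so the full conversation chain can be preserved. The w:commentReference element only marks where a comment attaches in the body.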

PPTX (PowerPoint)

PowerPoint stores comments in per-slide XML files (e.g., ppt/comments/comment1.xml) with author references resolved from ppt/commentAuthors.xml. Each comment includes pos coordinates indicating where on the slide it was placed. AILANG Parse extracts the comment text, author, and date, and associates each comment with its slide number.
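As a rough sketch of the author lookup described above, the following reads ppt/commentAuthors.xml into a dictionary and resolves each comment's authorId against it. The function name extract_pptx_comments and the returned fields are assumptions for illustration only. The sketch reports the comment part name rather than a slide number, since mapping a comment part to its slide goes through the slide's relationship part, which is left out here.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

P = "http://schemas.openxmlformats.org/presentationml/2006/main"
NS = {"p": P}


def extract_pptx_comments(path):
    """Hypothetical helper: list comments with resolved author names."""
    with zipfile.ZipFile(path) as z:
        names = z.namelist()

        # Build an authorId -> name lookup from the shared authors part.
        authors = {}
        if "ppt/commentAuthors.xml" in names:
            root = ET.fromstring(z.read("ppt/commentAuthors.xml"))
            for a in root.findall("p:cmAuthor", NS):
                authors[a.get("id")] = a.get("name")

        results = []
        parts = sorted(
            n for n in names
            if re.fullmatch(r"ppt/comments/comment\d+\.xml", n)
        )
        for name in parts:
            root = ET.fromstring(z.read(name))
            for cm in root.findall("p:cm", NS):
                pos = cm.find("p:pos", NS)
                results.append({
                    "part": name,
                    "author": authors.get(cm.get("authorId")),
                    "date": cm.get("dt"),
                    "text": cm.findtext("p:text", default="", namespaces=NS),
                    "pos": (int(pos.get("x")), int(pos.get("y")))
                    if pos is not None else None,
                })
    return results
```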

XLSX (Excel)

Excel comments (now called "Notes" in current Excel) live in per-sheet files (e.g., xl/comments1.xml) keyed to cell references. Each comment element has a ref attribute (e.g., B5) and an authorId resolved from the authors list. The comment text is in r/t sub-elements, often with rich formatting spans. AILANG Parse extracts the comment text, author, and cell reference.
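The cell-keyed layout can be read along the following lines. This is an illustrative sketch under the structure described above, not the parser's own code; extract_xlsx_comments and its part argument are hypothetical names. Rich-text runs are flattened by concatenating every t element inside the comment.

```python
import zipfile
import xml.etree.ElementTree as ET

S = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"s": S}


def extract_xlsx_comments(path, part="xl/comments1.xml"):
    """Hypothetical helper: return cell reference, author, and text."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read(part))

    # authorId is a zero-based index into the <authors> list.
    authors = [a.text or "" for a in root.findall("s:authors/s:author", NS)]

    results = []
    for c in root.findall("s:commentList/s:comment", NS):
        author_id = int(c.get("authorId", "-1"))
        # Join all rich-text runs, dropping their formatting.
        text = "".join(t.text or "" for t in c.iter(f"{{{S}}}t"))
        results.append({
            "cell": c.get("ref"),
            "author": authors[author_id]
            if 0 <= author_id < len(authors) else None,
            "text": text,
        })
    return results
```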

What Gets Extracted

Field	Description	DOCX	PPTX	XLSX
Comment Text	The full text content of the comment	Yes	Yes	Yes
Author Name	Who wrote the comment	Yes	Yes	Yes
Timestamp	When the comment was created (ISO 8601)	Yes	Yes	—
Anchor Text	The document text the comment is attached to	Yes	—	—
Cell Reference	The cell a comment is attached to (e.g., B5)	—	—	Yes
Slide Number	Which slide the comment appears on	—	Yes	—
Comment ID	Unique identifier for cross-referencing	Yes	Yes	—
Reply Threading	Parent-child relationships between comments	Yes	—	—

PPTX & XLSX Comments

Comments work differently in each Office format, but AILANG Parse normalizes them into the same block structure.

PowerPoint Comments

PowerPoint comments are positional — placed at x/y coordinates on a slide rather than anchored to a text range. They are commonly used during presentation review ("Move this chart to slide 3", "Font too small for projection"). AILANG Parse extracts the comment and associates it with the slide where it appears.

Excel Comments

Excel comments (called "Notes" in modern Excel) are anchored to specific cells. They often contain data validation notes, formula explanations, or review feedback on specific values. AILANG Parse extracts the comment text, the author, and the cell reference so you know exactly which data point the comment refers to.

Modern Excel also has "Threaded Comments" (introduced in Office 365), which are stored in a separate xl/threadedComments/ part with richer metadata including timestamps and reply chains.

Example Output

A DOCX with three review comments produces output like this:

{
  "blocks": [
    {
      "type": "Text",
      "content": "The quarterly revenue target of $4.2M was met."
    },
    {
      "type": "Text",
      "content": "[Comment by Sarah Chen on 2026-01-15T09:32:00Z] on 'quarterly revenue target of $4.2M': Should we break this down by region? The EMEA numbers look flat."
    },
    {
      "type": "Text",
      "content": "[Comment by James Park on 2026-01-15T10:14:00Z] (reply to Sarah Chen): Good call. EMEA was $1.1M vs $1.3M target. Adding a regional breakdown table."
    },
    {
      "type": "Text",
      "content": "Operating expenses remained within the approved budget."
    },
    {
      "type": "Text",
      "content": "[Comment by Legal Review on 2026-01-16T14:02:00Z] on 'approved budget': Need to clarify which board approval this refers to. Add the resolution number."
    }
  ]
}

Each comment preserves the author, timestamp, the specific text it was anchored to, and reply threading. This is the raw collaboration history that PDF conversion destroys.
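For downstream triage, the comment blocks can be pulled back out of this output as structured records. The sketch below assumes the exact string layout shown in the example above ("[Comment by AUTHOR on TIMESTAMP]", an optional quoted anchor, an optional reply marker, then the text); the regular expression and the parse_comment_blocks name are illustrative and would need adjusting if the real output differs.

```python
import re

# Matches the comment layout shown in the example output above.
COMMENT_RE = re.compile(
    r"^\[Comment by (?P<author>.+?) on (?P<date>\d{4}-\d{2}-\d{2}T[\d:]+Z)\]"
    r"(?: on '(?P<anchor>.*?)')?"
    r"(?: \(reply to (?P<reply_to>.+?)\))?"
    r": (?P<text>.*)$",
    re.DOTALL,
)


def parse_comment_blocks(result):
    """Hypothetical helper: return one dict per comment block."""
    comments = []
    for block in result.get("blocks", []):
        if block.get("type") != "Text":
            continue
        m = COMMENT_RE.match(block.get("content", ""))
        if m:
            # Fields that are absent (anchor, reply_to) come back as None.
            comments.append(m.groupdict())
    return comments
```

Ordinary body text does not match the pattern, so only comment blocks are returned.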

XLSX Example

{
  "blocks": [
    {
      "type": "Table",
      "content": "| Region | Q1 Revenue | Q1 Target | ... |",
      "metadata": { "sheet": "Revenue" }
    },
    {
      "type": "Text",
      "content": "[Comment by Audit Team on cell D7]: This figure doesn't match the GL export. Verify against SAP report #4821."
    }
  ]
}
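Cell comments in this output can be indexed by cell reference so that each note sits next to the value it questions. The sketch assumes the "[Comment by AUTHOR on cell REF]: TEXT" layout shown in the XLSX example; comments_by_cell is a hypothetical helper name.

```python
import re

# Matches the cell-comment layout shown in the XLSX example above.
CELL_COMMENT_RE = re.compile(
    r"^\[Comment by (?P<author>.+?) on cell (?P<cell>[A-Z]+[0-9]+)\]: "
    r"(?P<text>.*)$",
    re.DOTALL,
)


def comments_by_cell(result):
    """Hypothetical helper: group comment blocks by cell reference."""
    by_cell = {}
    for block in result.get("blocks", []):
        m = CELL_COMMENT_RE.match(block.get("content", ""))
        if m:
            by_cell.setdefault(m["cell"], []).append(
                {"author": m["author"], "text": m["text"]}
            )
    return by_cell
```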

Use Cases

Legal Review

Extract who raised which concern and when. Build an audit trail of legal review comments across contract drafts without manual review of each document version.

Compliance Audit

Prove that documents went through proper review. Extract the full comment chain to show reviewers, timestamps, and responses for regulatory compliance.

Collaborative Editing

Extract all feedback from a reviewed document in one pass. No more clicking through comment bubbles — get every comment as structured data for triage and tracking.

Document Intelligence

Feed comments into an LLM for sentiment analysis on specific passages. Know not just what the document says, but what reviewers thought about each section.

Try It

Extract comments from any Office document:

# Parse a DOCX with comments
./bin/docparse contract_reviewed.docx

# Parse via the API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -H "Content-Type: application/json" \
  -d '{"filepath": "contract_reviewed.docx", "apiKey": "dp_YOUR_KEY"}'

# Parse an XLSX with cell comments
./bin/docparse budget_annotated.xlsx
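
The same API call can be made from Python with the standard library. This sketch mirrors the curl request above (same URL, same filepath and apiKey fields); the function names are made up for this example, and the assumption that the response body is JSON shaped like the example output is not confirmed here.

```python
import json
import urllib.request

API_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath, api_key):
    """Build the POST request shown in the curl example (hypothetical helper)."""
    body = json.dumps({"filepath": filepath, "apiKey": api_key}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_document(filepath, api_key, timeout=30):
    """Send the request and decode the response, assumed here to be JSON."""
    request = build_parse_request(filepath, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before any network call is made.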

Parse in Browser · API Reference

Frequently Asked Questions

How do I extract comments with author names from a DOCX file?

AILANG Parse reads comments.xml and matches each comment to its anchor using w:commentRangeStart/w:commentRangeEnd markers, outputting text, author, date, and anchored range.

Do PDF-based document parsers preserve DOCX comments?

No. PDF conversion either renders comments as margin annotations or drops them entirely. AILANG Parse reads them from the XML with full metadata. See the comparison page.

Can I extract both comments and track changes from the same document?

Yes. AILANG Parse extracts all metadata in a single pass. Each element type is a distinct block type you can filter independently.

Format Guides

Comments are extracted from DOCX, PPTX, and XLSX files. See the full format guides for everything AILANG Parse extracts:

DOCX Parsing → · PPTX Parsing → · XLSX Parsing →