The Problem
Comments are the collaboration layer of Office documents. They carry the context that matters most: who raised a concern, when they raised it, what specific text they were responding to, and the threaded conversation that followed.
PDF conversion does not carry this layer over intact. Comments are either flattened into margin annotations or dropped entirely, so the structured review history is lost.
Comments are stored in dedicated XML parts (word/comments.xml, xl/comments1.xml, ppt/comments/comment1.xml) with range markers linking them to the exact content they annotate. No PDF-based parser can access them because they never survive the conversion.
Of the parsers compared in the table below, only AILANG Parse extracts Office comments with full attribution.
| Parser | Comments | Author | Timestamp | Anchor Text | Threading |
|---|---|---|---|---|---|
| AILANG Parse | Extracted | Preserved | Preserved | Preserved | Preserved |
| Unstructured | Dropped | Dropped | Dropped | Dropped | Dropped |
| Docling | Dropped | Dropped | Dropped | Dropped | Dropped |
| LlamaParse | Dropped | Dropped | Dropped | Dropped | Dropped |
| MarkItDown | Dropped | Dropped | Dropped | Dropped | Dropped |
How It Works
AILANG Parse reads the Office XML directly — no intermediate conversion, no rendering, no PDF step. It opens the OOXML ZIP archive and extracts comment data from the native XML parts.
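As a rough illustration of the "open the ZIP, read the native parts" step, the following Python sketch lists the comment-bearing parts of an Office file. It is not AILANG Parse's implementation; the function name `find_comment_parts` and the prefix list are assumptions based only on the part paths named in this article.

```python
import zipfile

# Part-name prefixes taken from the paths mentioned in this article.
# The list is illustrative, not exhaustive.
COMMENT_PART_PREFIXES = (
    "word/comments",         # DOCX: word/comments.xml and related parts
    "xl/comments",           # XLSX: xl/comments1.xml, xl/comments2.xml, ...
    "xl/threadedComments/",  # XLSX: threaded comments
    "ppt/comments/",         # PPTX: ppt/comments/comment1.xml, ...
)


def find_comment_parts(source):
    """Return the sorted names of comment-bearing XML parts in an OOXML file.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith(COMMENT_PART_PREFIXES) and name.endswith(".xml")
        )
```

Calling `find_comment_parts("contract_reviewed.docx")` on a reviewed document would typically return a list that includes `word/comments.xml`; an empty list means the file carries no comment parts under these prefixes.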
DOCX (Word)
Word stores comments in word/comments.xml as w:comment elements, each with a w:id, w:author, and w:date attribute. In the document body (word/document.xml), w:commentRangeStart and w:commentRangeEnd markers bracket the exact text the comment is attached to. AILANG Parse correlates these ranges with the comment content to produce fully anchored output.
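The w:commentReference element marks where a comment attaches in the body. Threaded replies are linked to their parent through word/commentsExtended.xml, where each w15:commentEx entry can name its parent comment via w15:paraIdParent, preserving the full conversation chain.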
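The correlation between range markers and comment content can be sketched in Python with the standard library. This is a minimal illustration under assumptions, not AILANG Parse's code: the helper names `read_comments` and `read_anchors` are hypothetical, reply threading is not handled, and multi-paragraph comment text is concatenated without separators.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _w(name):
    """Return the fully qualified name for a w:-prefixed tag or attribute."""
    return f"{{{W}}}{name}"


def read_comments(comments_xml):
    """Map comment id -> author, date, and text from word/comments.xml."""
    root = ET.fromstring(comments_xml)
    comments = {}
    for c in root.findall("w:comment", NS):
        comments[c.get(_w("id"))] = {
            "author": c.get(_w("author")),
            "date": c.get(_w("date")),
            "text": "".join(t.text or "" for t in c.iter(_w("t"))),
        }
    return comments


def read_anchors(document_xml):
    """Map comment id -> body text between its range start and end markers."""
    root = ET.fromstring(document_xml)
    open_ids, anchors = set(), {}
    for el in root.iter():  # iter() walks elements in document order
        if el.tag == _w("commentRangeStart"):
            cid = el.get(_w("id"))
            open_ids.add(cid)
            anchors[cid] = ""
        elif el.tag == _w("commentRangeEnd"):
            open_ids.discard(el.get(_w("id")))
        elif el.tag == _w("t"):
            # A text run belongs to every comment range currently open.
            for cid in open_ids:
                anchors[cid] += el.text or ""
    return anchors
```

Joining the two dictionaries on the comment id gives author, date, comment text, and anchor text for each comment. Tracking a set of open ids lets overlapping comment ranges each collect their own anchor text.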
PPTX (PowerPoint)
PowerPoint stores comments in per-slide XML files (for example, ppt/comments/comment1.xml), with author references resolved from ppt/commentAuthors.xml. Each comment includes pos coordinates indicating where on the slide it was placed. AILANG Parse extracts the comment text, author, and date, and associates each comment with its slide number.
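A minimal Python sketch of the author lookup described above, assuming the classic PresentationML comment layout (p:cmAuthorLst with p:cmAuthor entries, p:cmLst with p:cm entries). The function names are hypothetical, and the slide number is supplied by the caller, since mapping a comment part to its slide goes through the slide's relationship file, which is not shown here.

```python
import xml.etree.ElementTree as ET

P = {"p": "http://schemas.openxmlformats.org/presentationml/2006/main"}


def read_authors(authors_xml):
    """Map author id -> display name from ppt/commentAuthors.xml."""
    root = ET.fromstring(authors_xml)
    return {a.get("id"): a.get("name") for a in root.findall("p:cmAuthor", P)}


def read_slide_comments(comments_xml, authors, slide_number):
    """Return one dict per p:cm element in a per-slide comment part."""
    root = ET.fromstring(comments_xml)
    comments = []
    for cm in root.findall("p:cm", P):
        pos = cm.find("p:pos", P)
        position = None
        if pos is not None:
            position = (int(pos.get("x")), int(pos.get("y")))
        comments.append({
            "slide": slide_number,
            "author": authors.get(cm.get("authorId")),
            "date": cm.get("dt"),
            "position": position,
            "text": cm.findtext("p:text", default="", namespaces=P),
        })
    return comments
```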
XLSX (Excel)
Excel comments (now called "notes" in modern Excel) live in per-sheet files (for example, xl/comments1.xml) keyed to cell references. Each comment element has a ref attribute (e.g., B5) and an authorId resolved from the authors list. The comment text is in r/t sub-elements, often with rich formatting spans. AILANG Parse extracts the comment text, author, and cell reference.
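The per-sheet layout can be read with a few lines of Python. This is a sketch under assumptions rather than the parser's actual code: `read_cell_comments` is a hypothetical name, and rich-text runs are flattened to plain text by concatenating every t element under the comment.

```python
import xml.etree.ElementTree as ET

S_URI = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
S = {"s": S_URI}


def read_cell_comments(comments_xml):
    """Return cell reference, author, and plain text for each comment."""
    root = ET.fromstring(comments_xml)
    # authorId is a zero-based index into the authors list.
    authors = [a.text or "" for a in root.findall("s:authors/s:author", S)]
    results = []
    for c in root.findall("s:commentList/s:comment", S):
        author_id = int(c.get("authorId", "0"))
        author = authors[author_id] if author_id < len(authors) else None
        # Concatenate the text of all runs, dropping formatting spans.
        text = "".join(t.text or "" for t in c.iter(f"{{{S_URI}}}t"))
        results.append({"cell": c.get("ref"), "author": author, "text": text})
    return results
```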
What Gets Extracted
| Field | Description | DOCX | PPTX | XLSX |
|---|---|---|---|---|
| Comment Text | The full text content of the comment | Yes | Yes | Yes |
| Author Name | Who wrote the comment | Yes | Yes | Yes |
| Timestamp | When the comment was created (ISO 8601) | Yes | Yes | — |
| Anchor Text | The document text the comment is attached to | Yes | — | — |
| Cell Reference | The cell a comment is attached to (e.g., B5) | — | — | Yes |
| Slide Number | Which slide the comment appears on | — | Yes | — |
| Comment ID | Unique identifier for cross-referencing | Yes | Yes | — |
| Reply Threading | Parent-child relationships between comments | Yes | — | — |
PPTX & XLSX Comments
Comments work differently in each Office format, but AILANG Parse normalizes them into the same block structure.
PowerPoint Comments
PowerPoint comments are positional — placed at x/y coordinates on a slide rather than anchored to a text range. They are commonly used during presentation review ("Move this chart to slide 3", "Font too small for projection"). AILANG Parse extracts the comment and associates it with the slide where it appears.
Excel Comments
Excel comments (called "Notes" in modern Excel) are anchored to specific cells. They often contain data validation notes, formula explanations, or review feedback on specific values. AILANG Parse extracts the comment text, the author, and the cell reference so you know exactly which data point the comment refers to.
Modern Excel also has "Threaded Comments" (introduced in Office 365), which are stored in a separate xl/threadedComments/ part with richer metadata including timestamps and reply chains.
Example Output
A DOCX with three review comments produces output like this:
{
  "blocks": [
    {
      "type": "Text",
      "content": "The quarterly revenue target of $4.2M was met."
    },
    {
      "type": "Text",
      "content": "[Comment by Sarah Chen on 2026-01-15T09:32:00Z] on 'quarterly revenue target of $4.2M': Should we break this down by region? The EMEA numbers look flat."
    },
    {
      "type": "Text",
      "content": "[Comment by James Park on 2026-01-15T10:14:00Z] (reply to Sarah Chen): Good call. EMEA was $1.1M vs $1.3M target. Adding a regional breakdown table."
    },
    {
      "type": "Text",
      "content": "Operating expenses remained within the approved budget."
    },
    {
      "type": "Text",
      "content": "[Comment by Legal Review on 2026-01-16T14:02:00Z] on 'approved budget': Need to clarify which board approval this refers to. Add the resolution number."
    }
  ]
}
Each comment preserves the author, timestamp, the specific text it was anchored to, and reply threading. This is the raw collaboration history that PDF conversion destroys.
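For downstream triage, comment blocks in this output can be separated from body text and split into fields. The Python sketch below assumes the string layout is exactly the one shown in the DOCX example above; that layout is inferred from this single sample and is not a documented contract. `COMMENT_RE` and `split_blocks` are hypothetical names.

```python
import re

# Pattern inferred from the DOCX example output; treat it as an assumption.
COMMENT_RE = re.compile(
    r"\[Comment by (?P<author>.+?) on (?P<date>[^\]\s]+)\]"
    r"(?: on '(?P<anchor>.*?)'| \(reply to (?P<reply_to>.+?)\))?"
    r": (?P<text>.*)",
    re.DOTALL,
)


def split_blocks(blocks):
    """Split parsed blocks into (body_blocks, comment_dicts)."""
    body, comments = [], []
    for block in blocks:
        match = None
        if block.get("type") == "Text":
            match = COMMENT_RE.fullmatch(block.get("content", ""))
        if match:
            # anchor and reply_to are None when that part is absent.
            comments.append(match.groupdict())
        else:
            body.append(block)
    return body, comments
```

Each entry in `comments` carries `author`, `date`, `anchor`, `reply_to`, and `text`, which is enough to build an audit trail or a triage list. Cell-anchored XLSX comments use a different prefix and are left in `body` by this pattern.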
XLSX Example
{
  "blocks": [
    {
      "type": "Table",
      "content": "| Region | Q1 Revenue | Q1 Target | ... |",
      "metadata": { "sheet": "Revenue" }
    },
    {
      "type": "Text",
      "content": "[Comment by Audit Team on cell D7]: This figure doesn't match the GL export. Verify against SAP report #4821."
    }
  ]
}
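Cell comments in this output can be grouped by cell for review. The Python sketch below assumes the "[Comment by AUTHOR on cell REF]: TEXT" layout shown in the XLSX example; the layout is inferred from that one sample, and `comments_by_cell` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern inferred from the XLSX example output; treat it as an assumption.
CELL_COMMENT_RE = re.compile(
    r"\[Comment by (?P<author>.+?) on cell (?P<cell>\$?[A-Za-z]+\$?\d+)\]"
    r": (?P<text>.*)",
    re.DOTALL,
)


def comments_by_cell(blocks):
    """Group cell comments by their cell reference, e.g. 'D7'."""
    grouped = defaultdict(list)
    for block in blocks:
        match = CELL_COMMENT_RE.fullmatch(block.get("content", ""))
        if match:
            grouped[match["cell"]].append(
                {"author": match["author"], "text": match["text"]}
            )
    return dict(grouped)
```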
Use Cases
Legal Review
Extract who raised which concern and when. Build an audit trail of legal review comments across contract drafts without manual review of each document version.
Compliance Audit
Prove that documents went through proper review. Extract the full comment chain to show reviewers, timestamps, and responses for regulatory compliance.
Collaborative Editing
Extract all feedback from a reviewed document in one pass. No more clicking through comment bubbles — get every comment as structured data for triage and tracking.
Document Intelligence
Feed comments into an LLM for sentiment analysis on specific passages. Know not just what the document says, but what reviewers thought about each section.
Try It
Extract comments from any Office document:
# Parse a DOCX with comments
./bin/docparse contract_reviewed.docx
# Parse via the API
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
-H "Content-Type: application/json" \
-d '{"filepath": "contract_reviewed.docx", "apiKey": "dp_YOUR_KEY"}'
# Parse an XLSX with cell comments
./bin/docparse budget_annotated.xlsx
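The same API call can be made from Python with the standard library. This sketch reuses the endpoint and JSON fields from the curl command above; the function names are hypothetical, and the assumption that the response body is JSON is based only on the example output in this article.

```python
import json
import urllib.request

API_URL = "https://docparse.ailang.sunholo.com/api/v1/parse"


def build_parse_request(filepath, api_key):
    """Build the POST request that mirrors the curl example."""
    payload = json.dumps({"filepath": filepath, "apiKey": api_key})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_document(filepath, api_key, timeout=30):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_parse_request(filepath, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or test without network access.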
Frequently Asked Questions
How do I extract comments with author names from a DOCX file?
AILANG Parse reads comments.xml and matches each comment to its anchor using w:commentRangeStart/w:commentRangeEnd markers, outputting text, author, date, and anchored range.
Do PDF-based document parsers preserve DOCX comments?
No. PDF conversion either renders comments as margin annotations or drops them entirely. AILANG Parse reads them from the XML with full metadata. See the comparison page.
Can I extract both comments and track changes from the same document?
Yes. AILANG Parse extracts all metadata in a single pass. Each element type is a distinct block type you can filter independently.
Format Guides
Comments are extracted from DOCX, PPTX, and XLSX files. See the full format guides for everything AILANG Parse extracts: