LaTeX & arXiv Parsing — AILANG Parse

The Problem

Scientific RAG runs on a corpus that's LaTeX-native: arXiv alone holds 2M+ papers, ~89% with .tex source available. Yet every mainstream document parser (Docling, LlamaParse, MarkItDown, Unstructured) works from the rendered PDF and OCRs it. Equations become garbled Unicode. Citations lose their keys. Bibliography entries dissolve into the reference list as plain text.

On the arxivbench corpus of 19 peer-reviewed papers, every PDF-based parser scores 0% on structured equations and 0% on structured bibliography entries. The raw text is usually still there in the output — what's gone is the structure downstream RAG needs: the LaTeX source of each equation, citation keys that back-link to their bibliography entries, and reference records as typed blocks rather than paragraphs of prose.

How It Works

AILANG Parse treats .tex as a first-class input format alongside DOCX and XLSX. The parser is pure, deterministic AILANG — no OCR, no rendering step, no model calls.

The \input / \include resolver runs before the pure parser, reading referenced files depth-first with a cycle detector and a depth cap of 10. Unresolved references become LaTeX comments so the downstream parser stays effect-free. Equations, citations, and bibliography entries are mapped onto the existing Block ADT — no new variants needed.

# Parse an arXiv paper directly from .tex source
./bin/docparse data/test_files/arxiv/vaswani_attention/ms.tex

# Multi-file papers: \input is resolved automatically
#   [\input] introduction -> introduction.tex (5300 chars)
#   [\input] background -> background.tex (8376 chars)
#   ...
#   After \input expansion: 79593 chars.
#   Extracted 125 blocks.

Coverage on arxivbench

Every adapter is scored against structural truth extracted from the raw .tex source. Score = min(observed, truth) / truth, averaged across papers where truth > 0.
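
As a worked illustration of the scoring rule above, here is a minimal Python sketch. The function name `dimension_score` and the sample counts are hypothetical and are not taken from arxivbench; only the formula comes from the text.

```python
from statistics import mean


def dimension_score(counts):
    """Score one dimension from (observed, truth) pairs, one pair per paper.

    Each paper contributes min(observed, truth) / truth, so over-extraction
    is capped at 1.0. Papers with truth == 0 are left out of the average.
    """
    per_paper = [
        min(observed, truth) / truth
        for observed, truth in counts
        if truth > 0
    ]
    return mean(per_paper) if per_paper else None


# Hypothetical counts for three papers (illustrative only).
example = [
    (8, 10),   # under-extraction: 8 of 10 equations found -> 0.8
    (12, 10),  # over-extraction: capped at 10 of 10       -> 1.0
    (0, 0),    # no equations in the source: paper skipped
]
print(f"{dimension_score(example):.0%}")  # prints 90%
```

The cap means a parser cannot raise its score by emitting spurious blocks, and skipping papers with no ground truth keeps a dimension such as Theorems from being diluted by papers that contain none.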

What the scores measure: structural preservation, not raw text capture. A PDF-OCR parser scoring 0% on equations can still read the rendered equation as flattened text — what it can't produce is the LaTeX source of the equation as a separable, re-renderable block. Same for bibliography: OCR parsers dump the reference list as paragraphs; they don't emit each entry as a typed record with its citation key. Downstream RAG pipelines need the structure, not the prose.

Dimension	AILANG (tex)	Pandoc (tex)	Docling (pdf)	MarkItDown (pdf)	Unstructured (pdf)
Papers parsed	19/19	15/19	13/19	13/19	13/19
Sections	94%	77%	64%	1%	75%
Equations (display)	79%	63%	0%	0%	0%
Equations (inline)	76%	57%	0%	0%	0%
Tables	79%	32%	63%	56%	0%
Figures	100%	61%	70%	0%	0%
Citations	93%	66%	19%	23%	23%
Bibliography	100%	0%	0%	0%	0%
Lists	93%	51%	46%	23%	55%
Theorems	100%	47%	0%	0%	0%

Pandoc also reads .tex source but fails outright on 4 of 19 papers — custom \newcolumntype macros, \input harvmac, and a handful of package-specific primitives. AILANG degrades gracefully: the paper still parses, with unrecognized constructs preserved as raw text rather than killing the run.

Multi-File Papers

Most papers over ~15 pages ship as a thin main.tex or ms.tex that pulls in per-section files via \input{background}, \input{introduction}, and so on. Parsing just the wrapper captures almost nothing — the real content lives behind the directives.

Resolution is depth-first with a depth cap of 10 and cycle detection via a visited-set. Paths are tried in order: <basedir>/<arg>.tex first, then <basedir>/<arg>. Unresolved or cyclic references are replaced with LaTeX comments, so the downstream parser stays effect-free and never has to touch the filesystem.

# Vaswani "Attention Is All You Need" — 10 \input files
./bin/docparse data/test_files/arxiv/vaswani_attention/ms.tex
# Parsing 17611 chars of LaTeX source...
#     [\input] introduction -> introduction.tex (5300 chars)
#     [\input] background -> background.tex (8376 chars)
#     [\input] model_architecture -> model_architecture.tex (17109 chars)
#     [\input] why_self_attention -> why_self_attention.tex (7867 chars)
#     [\input] training -> training.tex (4444 chars)
#     [\input] results -> results.tex (11800 chars)
#     [\input] visualizations -> visualizations.tex (1543 chars)
#     [\input] parameter_attention -> parameter_attention.tex (3489 chars)
#     [\input] sqrt_d_trick -> sqrt_d_trick.tex (2226 chars)
# After \input expansion: 79593 chars.
# Extracted 125 blocks.
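
The resolution rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AILANG implementation: the names `resolve_file` and `expand`, the wording of the replacement comments, and the choice to track the visited-set per include chain are all hypothetical. The sketch also assumes each directive sits on its own line and does not skip directives that are already commented out.

```python
import re
from pathlib import Path

MAX_DEPTH = 10  # depth cap stated in the text

# Matches \input{arg} and \include{arg}; \includegraphics{...} does not match.
DIRECTIVE = re.compile(r"\\(input|include)\{([^}]+)\}")


def expand(text, basedir, depth, visited):
    """Expand directives in `text` depth-first; failures become LaTeX comments."""

    def replace(match):
        arg = match.group(2).strip()
        if depth >= MAX_DEPTH:
            return f"% [unresolved: depth cap] {arg}"
        # Candidate order from the text: <basedir>/<arg>.tex, then <basedir>/<arg>
        for candidate in (basedir / f"{arg}.tex", basedir / arg):
            if candidate.is_file():
                key = candidate.resolve()
                if key in visited:
                    return f"% [unresolved: cycle] {arg}"
                child = candidate.read_text(encoding="utf-8", errors="replace")
                return expand(child, basedir, depth + 1, visited | {key})
        return f"% [unresolved: not found] {arg}"

    return DIRECTIVE.sub(replace, text)


def resolve_file(main_path):
    """Read the wrapper file and return its fully expanded LaTeX source."""
    main = Path(main_path)
    text = main.read_text(encoding="utf-8", errors="replace")
    return expand(text, main.parent, 0, frozenset({main.resolve()}))
```

Because every file read happens inside `resolve_file`, whatever consumes the expanded string can remain a pure function of its input, which is the separation the text describes.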

Example Output

Equations are preserved as raw LaTeX inside TextBlock entries with style: "equation". This is what LLMs and downstream RAG systems actually want — don't render to MathML or Unicode; keep the source.

{
  "blocks": [
    {"kind": "HeadingBlock", "level": 1, "text": "Attention Is All You Need"},
    {"kind": "HeadingBlock", "level": 2, "text": "3.2 Attention"},
    {
      "kind": "TextBlock",
      "style": "equation",
      "text": "\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right) V"
    },
    {
      "kind": "TextBlock",
      "text": "The dot-product attention is identical to our algorithm, except for the scaling factor of [cite:bahdanau2014neural]."
    },
    {"kind": "TableBlock", "rows": [...], "caption": "Table 1: Maximum path lengths..."},
    {
      "kind": "SectionBlock",
      "heading": "References",
      "blocks": [
        {"kind": "TextBlock", "style": "bibitem", "key": "vaswani2017attention",
         "text": "Vaswani, A., Shazeer, N., ..."}
      ]
    }
  ]
}
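
To show how a downstream pipeline might consume output of this shape, the following Python sketch walks the block tree and collects equations, bibliography entries, and cited keys. It assumes the JSON layout shown in the example above, including nested "blocks" under a SectionBlock and one key per [cite:...] marker. The helper names `walk` and `index_document` are hypothetical, and the sample document is a trimmed copy of the example rather than real parser output.

```python
import re

# Inline citation marker as it appears in the example output, e.g. [cite:key]
CITE = re.compile(r"\[cite:([^\]]+)\]")


def walk(blocks):
    """Yield every block, descending into nested "blocks" lists."""
    for block in blocks:
        yield block
        yield from walk(block.get("blocks", []))


def index_document(doc):
    """Return (equations, bibliography, cited_keys) for one parsed document."""
    equations, bibliography, cited = [], {}, set()
    for block in walk(doc["blocks"]):
        if block.get("kind") != "TextBlock":
            continue
        text = block.get("text", "")
        style = block.get("style")
        if style == "equation":
            equations.append(text)          # raw LaTeX source
        elif style == "bibitem":
            bibliography[block["key"]] = text
        else:
            cited.update(CITE.findall(text))
    return equations, bibliography, cited


# Trimmed copy of the example output above, written as a Python dict.
sample = {
    "blocks": [
        {"kind": "HeadingBlock", "level": 2, "text": "3.2 Attention"},
        {"kind": "TextBlock", "style": "equation",
         "text": r"\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V"},
        {"kind": "TextBlock",
         "text": "... except for the scaling factor of [cite:bahdanau2014neural]."},
        {"kind": "SectionBlock", "heading": "References", "blocks": [
            {"kind": "TextBlock", "style": "bibitem", "key": "vaswani2017attention",
             "text": "Vaswani, A., Shazeer, N., ..."},
        ]},
    ]
}

equations, bibliography, cited = index_document(sample)
dangling = cited - bibliography.keys()  # cited keys with no bibliography entry
print(len(equations), sorted(bibliography), sorted(dangling))
```

On this trimmed sample the only cited key has no matching entry, so it lands in `dangling`; on a full document the same set difference would flag citations whose bibliography record is missing.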

Use Cases

Scientific RAG & Research Copilots

Retrieval-augmented QA over arXiv, bioRxiv, OpenReview. Equation fidelity and structured citations are table stakes — retrieval on OCR'd PDFs surfaces garbled math and drops bibliography entries entirely. Parse .tex source and your embeddings finally index what the paper actually says.

Literature Review Automation

Build citation graphs from 10,000 papers without hand-curation. Bibliography entries come out structured with preserved keys, so you can cross-reference \cite{vaswani2017attention} across the corpus and follow the graph programmatically.

Equation Search & Extraction

Equations are preserved as raw LaTeX inside TextBlock entries. Train math embeddings, build an "image-to-LaTeX" evaluator, or power formula-search UIs — downstream tools finally see the source the author wrote, not an OCR approximation.

Conference & Journal Pipelines

Submission intake for ACL, NeurIPS, ICML, physics venues. Authors submit .tex; reviewers need structured content. Parse once, convert to Markdown or HTML for the review portal, keep the original LaTeX as ground truth.

Academic Plagiarism Detection

Section-level and equation-level comparison across a paper corpus. Structured blocks make it possible to compare what two papers claim in Section 3, not just whether their PDFs happen to share n-grams after OCR mangling.

Preprint & Thesis Archives

University repositories, lab internal knowledge bases, grant-writing archives. Multi-file papers with \input{chapter-1} structure parse end-to-end — the long-tail use case PDF-first tools simply can't reach.

Try It

LaTeX is one of 15 supported input formats. The parser is pure AILANG — zero model calls, zero OCR, fully deterministic.

# Upload a .tex file
curl -X POST https://docparse.ailang.sunholo.com/api/v1/parse \
  -F "filepath=@paper.tex" \
  -F "outputFormat=markdown" \
  -F "apiKey=YOUR_API_KEY"

# Or via the CLI with a local arXiv paper
./bin/docparse paper.tex --convert paper.md


Frequently Asked Questions

Why parse the .tex source instead of the PDF?

The .tex file is the authored source. PDF is a rendering that loses equation structure, citation keys, and bibliography metadata. OCR-based parsers score 0% on equations and bibliography because those signals don't survive rendering. See the coverage table.

Does AILANG Parse handle multi-file LaTeX papers?

Yes. \input and \include directives are resolved recursively with a depth cap of 10 and cycle detection. Multi-file papers like Vaswani's Attention paper and BERT parse end-to-end. See Multi-File Papers.

What about custom macros and \newcommand?

Standard LaTeX sectioning, math environments, tables, figures, lists, theorems, and citations are supported. Deeply macro-driven papers (harvmac sectioning, custom \newcolumntype) degrade gracefully: the paper still parses, but structure that exists only through macro definitions may go unrecognized and is preserved as raw text, which lowers the coverage scores for those papers. Pandoc fails outright on these; AILANG keeps going.

Can I get the raw LaTeX equations back?

Yes. Equations are preserved as raw LaTeX inside TextBlock entries with style: "equation". This is what downstream LLMs and RAG systems actually want: the source as the author wrote it, not a rendering to MathML or Unicode.

Is arXiv source freely available?

Yes. arXiv provides bulk .tex source for ~89% of its 2M+ papers. The remaining ~11% are PDF-only submissions; those fall through to the PDF parsing path.

Does it handle .tar.gz arXiv source bundles?

Not yet — current intake accepts raw .tex files and their referenced siblings. Bundle handling is planned for v0.15.1. Extract the tarball first; the parser will follow \input from there.
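
Until bundle handling lands, the extraction step can be scripted. The Python sketch below is one possible approach, not part of AILANG Parse: `extract_bundle` and `find_main_tex` are hypothetical helper names, and picking the first .tex file that contains \documentclass is a heuristic assumption about where the entry point lives.

```python
import tarfile
from pathlib import Path


def extract_bundle(bundle_path, dest_dir):
    """Extract regular files from a .tar.gz bundle into dest_dir.

    Links, devices, and any member whose path would escape dest_dir are skipped.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(bundle_path, "r:gz") as archive:
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            if not member.isfile() or dest not in target.parents:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.extractfile(member) as src:
                target.write_bytes(src.read())
    return dest


def find_main_tex(root):
    """Guess the entry point: the first .tex file containing a documentclass line."""
    for tex in sorted(Path(root).rglob("*.tex")):
        if "\\documentclass" in tex.read_text(encoding="utf-8", errors="replace"):
            return tex
    return None
```

The path returned by `find_main_tex` can then be passed to ./bin/docparse as in the CLI examples above, and the sibling files extracted next to it are what the \input resolver will pick up.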
