
MinerU in Practice: Turning PDFs into RAG-Ready Markdown

Feeding PDFs to LLMs breaks formulas, tables, and multi-column layouts. I ran MinerU 2.5 on an academic PDF: formulas came out as LaTeX, tables as HTML, reading order intact, and it even runs on CPU.

The worst part of feeding PDFs to LLMs isn't token count; it's that the content breaks on the way in. Two-column layouts get flattened, tables turn into gibberish, and formulas lose all meaning. The chunks your RAG pipeline retrieves are unreadable.

I recently tried MinerU, an open-source document parsing engine from Shanghai AI Lab (opendatalab). It has one job: convert PDF / DOCX / PPTX / images into LLM-friendly Markdown or JSON. It beat my old pdfplumber + regex cleanup pipeline by a wide margin.

Why traditional PDF parsing falls short

Tools like pdfplumber, PyPDF2, and pdfminer extract text fast, but they see "characters at coordinates," not "this is a table" or "this is a formula." Common outcomes:

  • Two-column papers are read as a single stream, column 1 then column 2, destroying reading order
  • Tables become space-delimited strings with misaligned columns
  • Formulas like ∑_{i=1}^{n} collapse to "i=1n"
  • Headers/footers leak into the body text, repeating "Journal of XXX, Vol 42" on every page

Feed that to a RAG pipeline and you get garbage in, garbage out.

What MinerU does

MinerU 2.5 ships two backends:

| Backend  | Accuracy        | Hardware                     | Use case                        |
| -------- | --------------- | ---------------------------- | ------------------------------- |
| pipeline | OmniDocBench 86 | CPU-only works               | Batch jobs, local dev           |
| vlm      | OmniDocBench 95+ | GPU (Volta+ or Apple Silicon) | Complex layouts, high precision |

What it does:

  • Layout analysis: detects titles, paragraphs, tables, figures, formulas; strips headers/footers
  • Reading order reconstruction: multi-column is correctly ordered
  • Formulas → LaTeX, tables → HTML, images extracted with captions
  • OCR in 109 languages, so scanned PDFs work

Install

pip install -U "mineru[all]"

Python 3.10–3.13. 16 GB RAM minimum, 20 GB disk. GPU is optional; without one, it uses the pipeline backend.

Minimal example

# Defaults to vlm when a GPU is available
mineru -p paper.pdf -o output/

# Force CPU
mineru -p paper.pdf -o output/ -b pipeline

Output layout:

output/paper/
├── paper.md                 # main Markdown output
├── paper_content_list.json  # blocks in reading order
├── paper_layout.pdf         # PDF overlay with layout boxes (debug)
├── paper_origin.pdf         # copy of source
└── images/                  # extracted figures

Each block in content_list.json has a type (text / table / equation / image), a page_idx, and a bbox, which is handy when writing your own chunking logic.
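For instance, the type / page_idx / bbox fields make per-page grouping a one-liner. The sketch below uses synthetic blocks in place of a real content_list.json (field names are taken from the description above; the values are made up):

```python
import json
from collections import defaultdict

# Synthetic blocks mimicking the content_list.json schema described above;
# a real run would json.load the file from output/paper/ instead.
blocks = [
    {"type": "text", "page_idx": 0, "bbox": [50, 80, 540, 120]},
    {"type": "equation", "page_idx": 0, "bbox": [120, 200, 470, 260]},
    {"type": "table", "page_idx": 1, "bbox": [60, 90, 530, 400]},
]

# Group block types by page, preserving reading order within each page
by_page = defaultdict(list)
for block in blocks:
    by_page[block["page_idx"]].append(block["type"])

print(dict(by_page))  # {0: ['text', 'equation'], 1: ['table']}
```

The same loop works unchanged on the real file once you swap the literal list for `json.loads(...)`.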

What table and formula output looks like

Given a page with both, the Markdown comes out roughly like:

## 3. Method

The loss function is defined as:

$$
\mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) + \lambda \|\theta\|_2^2
$$

Results on benchmark:

<table>
  <tr><th>Model</th><th>Accuracy</th><th>Latency</th></tr>
  <tr><td>Baseline</td><td>0.812</td><td>45ms</td></tr>
  <tr><td>Ours</td><td>0.894</td><td>52ms</td></tr>
</table>

Tables use HTML rather than pipe syntax because Markdown can't express multi-row headers or merged cells. For LLM QA, the structural HTML actually helps extraction accuracy.
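A side benefit of HTML tables is that they stay machine-readable downstream. A minimal stdlib sketch (no real MinerU API involved, just `html.parser` on the sample table format above):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect <table> cell text into a list of rows, stdlib only."""
    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self._row = None     # row currently being filled
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

parser = TableRows()
parser.feed("""<table>
  <tr><th>Model</th><th>Accuracy</th></tr>
  <tr><td>Ours</td><td>0.894</td></tr>
</table>""")
print(parser.rows)  # [['Model', 'Accuracy'], ['Ours', '0.894']]
```

Try recovering the same structure from a pipe-delimited or space-delimited dump and you quickly see why the HTML output matters.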

Plugging into a RAG pipeline

Simplest wiring:

import json
from pathlib import Path
from langchain.text_splitter import MarkdownHeaderTextSplitter

content = json.loads(Path("output/paper/paper_content_list.json").read_text())

# Separate blocks by type
text_blocks = [b for b in content if b["type"] == "text"]
tables = [b for b in content if b["type"] == "table"]
equations = [b for b in content if b["type"] == "equation"]

# Split the body Markdown on headers; tables and equations are already
# routed out above, so a recursive splitter never shreds them
md = Path("output/paper/paper.md").read_text()
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(md)

The key rule: don't throw tables and formulas into a recursive text splitter with the body text, or the HTML tags will get sliced. MinerU already blocks them out; just route by type.
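One way to enforce that rule is a type-aware chunker: body text accumulates into running chunks, while table and equation blocks always become standalone chunks. A sketch, using a hypothetical `text` payload per block (MinerU's actual field names may differ):

```python
# Synthetic blocks for illustration; the "text" field is an assumption.
blocks = [
    {"type": "text", "text": "The loss function is defined as:"},
    {"type": "equation", "text": r"$$\mathcal{L} = -\sum_i y_i \log \hat{y}_i$$"},
    {"type": "text", "text": "Results on benchmark:"},
    {"type": "table", "text": "<table><tr><td>Ours</td></tr></table>"},
]

ATOMIC = {"table", "equation"}  # block types that must never be split

def chunk_blocks(blocks, max_chars=500):
    chunks, buf = [], []
    for b in blocks:
        if b["type"] in ATOMIC:
            if buf:  # flush any accumulated body text first
                chunks.append(" ".join(buf))
                buf = []
            chunks.append(b["text"])  # atomic: one block, one chunk
        else:
            buf.append(b["text"])
            if sum(len(t) for t in buf) > max_chars:
                chunks.append(" ".join(buf))
                buf = []
    if buf:
        chunks.append(" ".join(buf))
    return chunks

print(chunk_blocks(blocks))
```

The surrounding body text still gets merged up to a size budget, but an HTML table or a LaTeX formula always arrives at the retriever in one piece.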

Gotchas

  • The first VLM run downloads ~4 GB of models; behind a firewall, set HF_ENDPOINT=https://hf-mirror.com
  • Scanned PDFs need -l auto or an explicit language; the default is English, and non-English OCR will come out broken
  • Formula recognition doesn't handle handwriting: printed papers are fine, but don't expect much from scanned notebooks
  • A 50-page PDF took ~2 min with pipeline on my M2 MacBook and ~40 s with VLM on an RTX 4090, roughly a 3x gap

Compared to others

Other tools I tried:

  • pdfplumber: fast raw text, zero layout awareness
  • unstructured.io: similar architecture, lower table/formula precision
  • LlamaParse (LlamaIndex): comparable accuracy, but needs an API key and costs money
  • docling (IBM): also open source; MinerU edges it out on tables in my tests

If your workload is Chinese documents, academic papers, and local deployment, MinerU is the most reliable option I've used.

License

From 3.1.0 onward, MinerU uses its own Open Source License, a custom license based on Apache 2.0. Commercial use is generally fine; read the terms for edge cases.
