The worst part of feeding PDFs to LLMs isn’t token count; it’s that the content breaks on the way in. Two-column layouts get flattened, tables turn into gibberish, formulas lose all meaning. The chunks your RAG retrieves are unreadable.
I recently tried MinerU, an open-source document parsing engine from Shanghai AI Lab (opendatalab). One job: convert PDF / DOCX / PPTX / images into LLM-friendly Markdown or JSON. Way better than my old pdfplumber + regex cleanup pipeline.
Why traditional PDF parsing falls short
Tools like pdfplumber, PyPDF2, and pdfminer extract text fast, but they see “characters at coordinates,” not “this is a table” or “this is a formula.” Common outcomes:
- Two-column papers read column 1 then column 2 as one stream, destroying reading order
- Tables become space-delimited strings with misaligned columns
- Formulas like `\sum_{i=1}^{n}` collapse to `i=1n`
- Headers/footers leak into body text, repeating “Journal of XXX, Vol 42” on every page
Feed that to a RAG pipeline and you get garbage in, garbage out.
What MinerU does
MinerU 2.5 ships two backends:
| Backend | Accuracy (OmniDocBench) | Hardware | Use case |
|---|---|---|---|
| `pipeline` | 86 | CPU-only works | Batch jobs, local dev |
| `vlm` | 95+ | GPU (Volta+ or Apple Silicon) | Complex layouts, high precision |
What it does:
- Layout analysis: detects titles, paragraphs, tables, figures, formulas; strips headers/footers
- Reading order reconstruction: multi-column is correctly ordered
- Formulas → LaTeX, tables → HTML, images extracted with captions
- OCR in 109 languages, so scanned PDFs work
Install
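A typical install, as a sketch; the package name and the `core` extra match the project’s PyPI listing for the 2.x series, but check the README for your platform:

```shell
# Install MinerU with its core parsing dependencies
# (package name per the project's PyPI listing; extras may vary by version)
pip install -U "mineru[core]"
```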
Python 3.10–3.13. 16GB RAM minimum, 20GB disk. GPU is optional; without one, it uses the pipeline backend.
Minimal example
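A sketch of a basic CLI run. The `-p` / `-o` / `-b` flags are from MinerU 2.x’s `mineru` command; backend names vary by version, so confirm with `mineru --help`:

```shell
# Parse one PDF into Markdown + JSON with the default (pipeline) backend
mineru -p paper.pdf -o output/

# Explicitly select a backend (exact vlm backend name depends on your version)
mineru -p paper.pdf -o output/ -b pipeline
```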
Output layout:
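Roughly what lands on disk in my runs; file and directory names are version-dependent, so treat this as illustrative:

```
output/
└── paper/
    └── auto/
        ├── paper.md                 # the final Markdown
        ├── paper_content_list.json  # flat block list: type, page_idx, bbox
        ├── paper_middle.json        # richer intermediate structure
        └── images/                  # extracted figures
```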
Each block in `content_list.json` has `type` (`text` / `table` / `equation` / `image`), `page_idx`, and `bbox`, which is handy when writing your own chunking logic.
What table and formula output looks like
Given a page with both, the Markdown comes out roughly like:
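An illustrative snippet (the numbers are invented; the structure matches what MinerU emits: `$$…$$` LaTeX for display formulas, raw HTML for tables):

```markdown
The loss is computed as

$$
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

<table>
  <tr><th rowspan="2">Backend</th><th colspan="2">Score</th></tr>
  <tr><th>Edit dist.</th><th>Overall</th></tr>
  <tr><td>pipeline</td><td>0.21</td><td>86</td></tr>
</table>
```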
Tables use HTML, not pipe syntax, because Markdown can’t express multi-header or merged cells. For LLM QA, the structural HTML actually helps extraction accuracy.
Plugging into a RAG pipeline
Simplest wiring:
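A minimal routing sketch in plain Python, no framework. The block shapes are assumptions modeled on the `type` / `page_idx` fields MinerU’s `content_list.json` carries, and `chunk_text` stands in for whatever splitter you actually use:

```python
def chunk_text(text: str, size: int = 800) -> list[str]:
    # Stand-in for a real recursive splitter: fixed-size slices.
    return [text[i:i + size] for i in range(0, len(text), size)]

def blocks_to_chunks(content_list: list[dict]) -> list[dict]:
    """Route MinerU blocks: split plain text, keep tables/equations whole."""
    chunks = []
    for block in content_list:
        btype = block.get("type")
        if btype == "text":
            for piece in chunk_text(block.get("text", "")):
                chunks.append({"type": "text",
                               "page": block.get("page_idx"),
                               "content": piece})
        else:
            # table / equation / image blocks become one chunk each, so
            # HTML tags and LaTeX never get sliced mid-structure
            chunks.append({"type": btype,
                           "page": block.get("page_idx"),
                           "content": block})
    return chunks

if __name__ == "__main__":
    # Hypothetical content_list.json entries
    blocks = [
        {"type": "text", "text": "Background. " * 100, "page_idx": 0},
        {"type": "table",
         "table_body": "<table><tr><td>1</td></tr></table>", "page_idx": 1},
    ]
    for chunk in blocks_to_chunks(blocks):
        print(chunk["type"], chunk["page"])
```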
The key rule: don’t throw tables and formulas into a recursive text splitter with the body text, or the HTML tags will get sliced. MinerU already blocks them out; just route by type.
Gotchas
- First VLM run downloads ~4GB of models; behind a firewall, set `HF_ENDPOINT=https://hf-mirror.com`
- Scanned PDFs need `-l auto` or an explicit language; the default is English, so non-English OCR will be broken
- Formula recognition doesn’t work on handwriting; printed papers are fine, but don’t expect much from scanned notebooks
- 50-page PDF on my M2 MacBook: ~2 min on pipeline, ~40s on VLM with an RTX 4090, a roughly 3x gap
Compared to others
What I also tried:
- `pdfplumber`: fast raw text, zero layout awareness
- `unstructured.io`: similar architecture, lower table/formula precision
- `LlamaParse` (LlamaIndex): comparable accuracy, but needs an API key and costs money
- `docling` (IBM): also open source; MinerU edges it out on tables in my tests
If your workload is Chinese documents + academic papers + local deployment, MinerU is the most reliable option I’ve used.
License
From 3.1.0 onward, MinerU uses its own Open Source License, an Apache 2.0-based custom license. Commercial use is generally fine; read the terms for edge cases.
