pdfmux
Self-healing PDF extraction with per-page confidence scoring. Open-source LlamaParse alternative for RAG pipelines, MCP server for Claude Desktop, LangChain + LlamaIndex loaders. Ranked #2 on opendataloader-bench (0.900).
The only PDF extractor that audits its own output. Catches blank pages, scrambled columns, broken tables — re-extracts them with a stronger backend. So your LLM gets clean data, not silent garbage. Routes each page to the best of 5 rule-based backends + BYOK LLM fallback (Gemini / Claude / GPT-4o / Ollama). One CLI. One API. Zero config.
<p align="center"> <img src="demo.svg" alt="pdfmux terminal demo" width="700" /> </p>

PDF ──> pdfmux router ──> best extractor per page ──> audit ──> re-extract failures ──> Markdown / JSON / chunks
            │
            ├─ PyMuPDF (digital text, 0.01s/page)
            ├─ OpenDataLoader (complex layouts, 0.05s/page)
            ├─ RapidOCR (scanned pages, CPU-only)
            ├─ Docling (tables, 97.9% TEDS)
            ├─ Surya (heavy OCR fallback)
            ├─ Marker (academic papers, neural)
            ├─ Mistral OCR ($0.002/page, 96.6% tables)
            └─ YOUR LLM (Gemini / Gemma 4 / Claude / GPT-4o / Ollama / Mistral; BYOK via YAML)
Install
pip install pdfmux
That handles digital PDFs. For any real-world batch, install pdfmux[ocr] too: almost every directory of PDFs has at least one scan, and without OCR those pages return empty text.
pip install "pdfmux[ocr]" # ⭐ recommended: RapidOCR for scanned pages (~200MB, CPU)
Other backends, by document type:
pip install "pdfmux[tables]" # Docling — table-heavy docs (~500MB)
pip install "pdfmux[opendataloader]" # OpenDataLoader — complex layouts (Java 11+)
pip install "pdfmux[marker]" # Marker — neural extraction for academic papers
pip install "pdfmux[llm]" # Gemini fallback (default LLM)
pip install "pdfmux[llm-claude]" # Claude (Sonnet / Opus)
pip install "pdfmux[llm-openai]" # GPT-4o family
pip install "pdfmux[llm-ollama]" # Ollama (any local model)
pip install "pdfmux[llm-mistral]" # Mistral OCR API ($0.002/page)
pip install "pdfmux[llm-all]" # all LLM providers (incl. Gemma 4 via Gemini key)
pip install "pdfmux[watch]" # `pdfmux watch <dir>` auto-convert on change
pip install "pdfmux[all]" # everything
Requires Python 3.11+.
Quick Start
CLI
# zero config — just works
pdfmux convert invoice.pdf
# invoice.pdf -> invoice.md (2 pages, 95% confidence, via pymupdf4llm)
# RAG-ready chunks with token limits
pdfmux convert report.pdf --chunk --max-tokens 500
# cost-aware extraction with budget cap
pdfmux convert report.pdf --mode economy --budget 0.50
# schema-guided structured extraction (5 built-in presets)
pdfmux convert invoice.pdf --schema invoice
# BYOK any LLM for hardest pages
pdfmux convert scan.pdf --llm-provider claude
# use a built-in or saved profile (invoices, receipts, papers, contracts, bulk-rag)
pdfmux convert invoice.pdf --profile invoices
# predict cost before running anything
pdfmux estimate big-report.pdf --llm-provider gemini
# stream pages as NDJSON as they finish (great for long documents)
pdfmux stream report.pdf --quality high
# auto-convert any new PDFs that land in a folder
pdfmux watch ./inbox/ -o ./output/
# diff two extractions side-by-side
pdfmux diff old.pdf new.pdf
# batch a directory — writes manifest.json with per-doc confidence
pdfmux convert ./docs/ -o ./output/
# CI mode: fail the run if any document is below 0.20 confidence
pdfmux convert ./docs/ -o ./output/ --strict --min-confidence 0.20
# pre-flight a directory: which extras do you actually need for THIS batch?
pdfmux doctor --check ./docs/
# results are cached by file hash — re-runs are instant; bypass with --no-cache
pdfmux convert report.pdf --no-cache
pdfmux convert report.pdf --clear-cache
Python
For batch processing, use batch_extract() — not a subprocess.run(['pdfmux', ...]) loop. Same pipeline, no per-file process spawn, handles non-ASCII filenames:
from pathlib import Path

import pdfmux

# Batch extract: yields (path, result) tuples as each PDF completes.
pdfs = list(Path("./inbox").glob("*.pdf"))
for path, result in pdfmux.batch_extract(pdfs, quality="standard"):
    if isinstance(result, Exception):
        print(f"FAILED {path.name}: {result}")
        continue
    if result.confidence < 0.50:
        print(f"REVIEW {path.name} ({result.confidence:.2f})")
    else:
        print(f"OK {path.name} ({result.confidence:.2f})")

# Single-file helpers.
text = pdfmux.extract_text("report.pdf")  # markdown string
data = pdfmux.extract_json("report.pdf")  # locked schema dict
chunks = pdfmux.chunk("report.pdf", max_tokens=500)  # RAG-ready chunks

Don't wrap pdfmux with your own pypdf/pdfplumber fallback. pdfmux already routes per page through PyMuPDF → RapidOCR → vision LLM. PyMuPDF tolerates malformed PDFs that pypdf rejects ("Stream has ended unexpectedly"), so a downstream pypdf fallback turns recoverable PDFs into failures. Trust the router; check the confidence score on the result.
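Checking the confidence score can be wrapped in a small triage helper. The sketch below is not part of pdfmux: `Triage`, `triage`, and the `review_below` parameter are hypothetical names, and the only assumption about a result is what the loop above already relies on, namely that it is either an `Exception` or an object with a float `.confidence`.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterable


@dataclass
class Triage:
    """Buckets for one batch run (hypothetical helper, not a pdfmux type)."""
    ok: list[Path] = field(default_factory=list)
    review: list[Path] = field(default_factory=list)
    failed: list[tuple[Path, Exception]] = field(default_factory=list)


def triage(results: Iterable[tuple[Path, Any]], review_below: float = 0.50) -> Triage:
    """Sort (path, result) pairs into ok / review / failed.

    Each result is assumed to be either an Exception or an object
    exposing a float `.confidence` attribute.
    """
    buckets = Triage()
    for path, result in results:
        if isinstance(result, Exception):
            buckets.failed.append((path, result))
        elif result.confidence < review_below:
            buckets.review.append(path)
        else:
            buckets.ok.append(path)
    return buckets
```

With pdfmux installed, this would be called as `triage(pdfmux.batch_extract(pdfs, quality="standard"))`, and the `review` list can then go to a human or to a re-run at a higher quality setting.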
Architecture
                         ┌─────────────────────────────┐
                         │      Segment Detector       │
                         │ text / tables / images /    │
                         │ formulas / headers per page │
                         └─────────────┬───────────────┘
                                       │
                  ┌────────────────────────────────────────┐
                  │             Router Engine              │
                  │                                        │
                  │   economy  ── balanced  ──  premium    │
                  │ (minimize $)  (default)  (max quality) │
                  │       budget caps: --budget 0.50       │
                  └────────────────────┬───────────────────┘
                                       │
        ┌──────────┬──────────┬────────┴────────┬──────────┐
        │          │          │                 │          │
     PyMuPDF   OpenData   RapidOCR           Docling      LLM
     digital    Loader     scanned           tables     (BYOK)
    0.01s/pg    complex   CPU-only            97.9%  any provider
                layouts                       TEDS
        │          │          │                 │          │
        └──────────┴──────────┴────────┬────────┴──────────┘
                                       │
                  ┌────────────────────────────────────────┐
                  │            Quality Auditor             │
                  │                                        │
                  │ 4-signal dynamic confidence scoring    │
                  │ per-page: good / bad / empty           │
                  │ if bad -> re-extract with next backend│
                  └────────────────────┬───────────────────┘
                                       │
                  ┌────────────────────────────────────────┐
                  │            Output Pipeline             │
                  │                                        │
                  │ heading injection (font-size analysis) │
                  │ table extraction + normalization       │
                  │ text cleanup + merge                   │
                  │ confidence score (honest, not inflated)│
                  └────────────────────────────────────────┘
Key design decisions
- Router, not extractor. pdfmux does not compete with PyMuPDF or Docling. It picks the best one per page.
- Agentic multi-pass. Extract, audit confidence, re-extract failures with a stronger backend. Bad pages get retried automatically.
- Segment-level detection. Each page is classified by content type (text, tables, images, formulas, headers) before routing.
- 4-signal confidence. Dynamic quality scoring from character density, OCR noise ratio, table integrity, and heading structure. Not hardcoded thresholds.
- Document cache. Each PDF is opened once, not once per extractor. Shared across the full pipeline.
- Data flywheel. Local telemetry tracks which extractors win per document type. Routing improves with usage.
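The extract, audit, re-extract loop described in the list above can be sketched roughly as follows. This is an illustration only, not pdfmux's internal code: `extract_page`, `density_score`, the `Backend` and `Scorer` aliases, and the `good_enough=0.8` cutoff are all hypothetical, and the toy scorer uses a single signal where pdfmux describes four.

```python
from typing import Callable, Sequence

# Hypothetical types: a backend turns raw page bytes into text,
# a scorer maps extracted text to a confidence in [0, 1].
Backend = Callable[[bytes], str]
Scorer = Callable[[str], float]


def extract_page(
    page: bytes,
    backends: Sequence[tuple[str, Backend]],
    score: Scorer,
    good_enough: float = 0.8,
) -> tuple[str, str, float]:
    """Try backends in order (cheapest first) and keep the best-scoring output.

    Returns (backend_name, text, confidence). Stops as soon as one backend
    clears `good_enough`; otherwise falls through to the next, stronger one.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    best: tuple[str, str, float] | None = None
    for name, run in backends:
        text = run(page)
        confidence = score(text)
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
        if confidence >= good_enough:
            break
    assert best is not None  # guaranteed by the non-empty check above
    return best


def density_score(text: str) -> float:
    """Toy one-signal scorer: share of alphanumeric or space characters."""
    stripped = text.strip()
    if not stripped:
        return 0.0  # an empty page always fails the audit
    clean = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return clean / len(stripped)
```

Ordering the backends from cheapest to most expensive means a clean digital page never pays for OCR or an LLM call, while a blank or noisy result escalates automatically.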
Features
| Feature | What it does | Command |
|---|---|---|
| Zero-config extraction | Routes to best backend automatically | pdfmux convert file.pdf |
| RAG chunking | Section-aware chunks with token estimates | pdfmux convert file.pdf --chunk --max-tokens 500 |
| Cost modes | economy / balanced / premium with budget caps | pdfmux convert file.pdf --mode economy --budget 0.50 |
| Schema extraction | 5 built-in presets (invoice, receipt, contract, resume, paper) | pdfmux convert file.pdf --schema invoice |
| Profiles | Save and re-use config; built-ins for invoices/receipts/papers/contracts/bulk-rag | pdfmux convert file.pdf --profile invoices |
| BYOK LLM | Gemini, Gemma 4, Claude, GPT-4o, Ollama, Mistral, any OpenAI-compatible API | pdfmux convert file.pdf --llm-provider claude |
| Cost estimate | Predict spend before running | pdfmux estimate file.pdf --llm-provider gemini |
| Streaming output | NDJSON events page-by-page for long docs | pdfmux stream file.pdf |
| Smart cache | Hash-keyed result cache, 30-day TTL, 1 GB LRU | pdfmux convert file.pdf (auto), --no-cache to bypass |
| Watch mode | Auto-convert any PDF added to a folder | pdfmux watch ./inbox/ |
| Diff | Compare two extractions | pdfmux diff a.pdf b.pdf |
| Benchmark | Eval all installed extractors against ground truth | pdfmux benchmark |
| Doctor | Show installed backends, coverage gaps, recommendations | pdfmux doctor |
| MCP server | AI agents read PDFs via stdio or HTTP | pdfmux serve |
| Batch processing | Convert entire directories | pdfmux convert ./docs/ |
| Page-level streaming API | Bounded-memory page iteration for large files | for page in ext.extract("500pg.pdf") |
| Retry with backoff | Every LLM provider auto-retries with exponential backoff + Retry-After | (built-in) |
CLI Reference
pdfmux convert
pdfmux convert <file-or-dir> [options]
Options:
  -o, --output PATH          Output file or directory
  -f, --format FORMAT        markdown | json | csv | llm (default: markdown)
  -q, --quality QUALITY      fast | standard | high (default: standard)
  -s, --schema SCHEMA        JSON schema file or preset (invoice, receipt, contract, resume, paper)
  --chunk                    Output RAG-ready chunks
  --max-tokens N             Max tokens per chunk (default: 500)
  --mode MODE                economy | balanced | premium (default: balanced)
  --budget AMOUNT            Max spend per document in USD
  --llm-provider PROVIDER    LLM backend: gemini | claude | openai | ollama
  --confidence               Include confidence score in output
  --stdout                   Print to stdout instead of file
pdfmux serve
Start the MCP server for AI agent integration.
pdfmux serve # stdio mode (Claude Desktop, Cursor)
pdfmux serve --http 8080 # HTTP mode
pdfmux doctor
pdfmux doctor
# ┌──────────────────┬─────────────┬─────────┬──────────────────────────────
…