MarkCrawl by iD8 π·οΈπ
Turn any webpage or website into clean Markdown for LLM pipelines β in one command.
Latest: v0.11.1 (2026-05-12) β default aggregator URL filter. See What's New below.
pip install markcrawl
markcrawl --base https://docs.example.com --out ./output --show-progressMarkCrawl is a crawl-and-structure engine. It fetches one page or crawls an entire website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. No API keys needed.
Everything else β LLM extraction, Supabase upload, MCP server, LangChain tools β is optional and installed separately.
Want a hosted API instead of running locally? Join the waitlist β we're gauging interest.
LLM agents: Load docs/LLM_PROMPT.md as a system prompt to generate correct MarkCrawl commands automatically.
What's New
Install or upgrade with pip:
pip install --upgrade markcrawl
pip show markcrawl | grep Version # confirm the installed version
markcrawl --help | head -1 # confirm the binary on $PATH is the upgraded oneIf markcrawl --help is missing flags you expect (e.g. --screenshot, --seed-file, --smart-sample, --download-images), your local install is stale. Run pip install --upgrade markcrawl against the same Python that owns the markcrawl binary on your PATH β head -1 $(which markcrawl) shows the right interpreter. PyPI is always the source of truth; see CHANGELOG.md for the full release history.
v0.11 highlights (changelog):
- Aggregator URL filter (default, v0.11.1) β rejects mdBook
/print.htmland Hugo/_print/pages during crawl-time URL filtering. These bundle the entire docs tree on a single URL and otherwise dominate retrieval rankings on cosine similarity (markcrawl was returning them in 49% of rust-book and 39% of kubernetes-docs top-5 retrieval slots before the fix; competitors return 0%). Opt out viainclude_aggregator_pages=True/--include-aggregators. - Binary downloads (v0.11.0) β new
download_types=["pdf", "docx"]kwarg streams referenced files to<out_dir>/downloads/with size + content-type guards. Pre-fetchdownload_filtercallback receives URL + anchor text + parent-page context; reject candidates before any HTTP bytes transfer. - Local embedder is the default since v0.10.1 β
pip install markcrawlships the full ML stack (torch + transformers + sentence-transformers). Zero API key required for embedding. Override withMARKCRAWL_EMBEDDER=text-embedding-3-smallor theembedding_modelkwarg if you want OpenAI back. - Tenacity-backed HTTP retry β full-jitter exponential backoff (2 s β 30 s, 5 attempts) that honors the server's
Retry-Afterheader on 429s.
Where markcrawl stands on the public benchmark, honestly. The independent llm-crawler-benchmarks v1.4 leaderboard measures 7 web crawlers on how well their output supports RAG. Markcrawl ranks 1st on cost ($4,505/yr at 100,000-page scale) but 7th of 7 on answer quality (3.77/5) and retrieval accuracy (MRR 0.341 vs leaders at 0.76). We're actively working to close that gap on three fronts:
- v0.11.1 (just shipped) filters out
/print.htmland/_print/"whole-book-on-one-page" URLs that were stealing 39β49% of markcrawl's top-5 retrieval slots on documentation sites. Competitors already filter these. Expected MRR improvement: +0.02 to +0.04 on docs-heavy sites (formal measurement pending the next benchmark cycle).- Upcoming releases improve how markcrawl chooses which pages to crawl within its budget β markcrawl's deliberately-narrower crawl strategy (which keeps cost low and signal-to-noise high) is also the main cause of the retrieval gap.
- The benchmark itself is being improved β v1.4's test questions were sampled from one specific crawler's output, which structurally penalizes any crawler whose discovery strategy differs from that anchor. The benchmark is being updated so each site's test questions come from the site's own sitemap, independent of any crawler. We expect this fix alone to surface ~5β10% of markcrawl's current "misses" as actually correct answers at different URLs β work shown in our audit notes.
Goal for the next benchmark cycle: move from 7th to mid-pack on retrieval (+0.10 to +0.20 MRR) and answer quality, while keeping the cost-efficiency lead. Honest, measured progress β we publish the numbers either way.
Quickstart (2 minutes)
pip install markcrawl
markcrawl --base https://quotes.toscrape.com --out ./demo --max-pages 5 --show-progressYour ./demo folder now contains:
demo/
βββ index__a4f3b2c1d0.md β clean Markdown of the page
βββ page-2__b7e2d1f0a3.md
βββ ...
βββ pages.jsonl β structured index (one JSON line per page)Each line in pages.jsonl:
{
"url": "https://quotes.toscrape.com/",
"title": "Quotes to Scrape",
"crawled_at": "2026-04-04T12:30:00Z",
"citation": "Quotes to Scrape. quotes.toscrape.com. Available at: https://quotes.toscrape.com/ [Accessed April 04, 2026].",
"tool": "markcrawl",
"text": "# Quotes to Scrape\n\n> "The world as we have created it is a process of our thinking..." β Albert Einstein\n\nTags: change, deep-thoughts, thinking, world..."
}Schema β every page in pages.jsonl has these fields:
| Field | Type | Description |
|---|---|---|
url | string | Original URL fetched. |
title | string | Page title from <title> (or first H1 if missing). |
crawled_at | string (ISO 8601) | UTC timestamp of when the page was fetched. |
citation | string | Pre-formatted academic-style citation including access date. |
tool | string | Always "markcrawl". Helps when merging output from multiple crawlers. |
text | string | Clean Markdown content (nav/footer/scripts stripped). |
downloads | array (optional) | Present when download_types is set; one entry per saved binary: {url, path, size_bytes, content_type}. |
images | array (optional) | Present when --download-images is set; lists saved image paths. |
screenshot | string (optional) | Present when --screenshot is set; relative path to the PNG/JPEG capture. |
Common Recipes
Runnable examples for the most common patterns:
- Single-page scrapes β including JS-rendered pages (React, Vue, YouTube)
- Whole-site crawls β docs, blogs, subsections; resume interrupted runs
- URL filtering β
--exclude-path,--include-path,--dry-run, smart sampling - Extraction backends β BS4 (default), trafilatura, ensemble, ReaderLM-v2
- Binary downloads β images, PDFs (with pre-fetch filter callbacks), DOCX
- Screenshots β full-page or cropped, PNG or JPEG
- End-to-end use cases β competitive analysis, RAG chatbot, API-docs β code-gen
Full recipes with copy-paste commands and expected outputs: docs/RECIPES.md.
<details> <summary>How it compares to other crawlers β decision matrix</summary>Pick this tool whenβ¦
| If you need⦠| Use⦠| Why |
|---|---|---|
| Clean Markdown for LLM/RAG ingestion, run locally, no API keys | MarkCrawl | Default install bundles local embedder ($0 API spend); strips nav/scripts; produces JSONL with citations out of the box |
| A hosted scraping API (no infra to run) | FireCrawl | SaaS option; pay-per-call; outsources crawling entirely |
| AI-native crawling with built-in LLM extraction | Crawl4AI | Deeper LLM-extraction primitives; built-in Playwright |
| Massive distributed crawling (millions of pages, custom pipelines) | Scrapy | Battle-tested framework; rich plugin ecosystem; spider architecture |
| JavaScript-heavy automation without framework overhead | Playwright (direct) | Lower-level control over browser automation |
| Sites behind login/auth or aggressive bot protection | None of the above (build custom) | See When NOT to use MarkCrawl; same constraints apply to most public crawlers |
Feature comparison
| MarkCrawl | FireCrawl | Crawl4AI | Scrapy | |
|---|---|---|---|---|
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 |
| Install | pip install markcrawl | SaaS or self-host | pip + Playwright | pip + framework |
| Output | Markdown + JSONL | Markdown + JSON | Markdown | Custom pipelines |
| JS rendering | Optional (--render-js) | Built-in | Built-in | Plugin |
| LLM extraction | Optional add-on | Via API | Built-in | None |
| Local-only operation | β | β (SaaS) | β | β |
| Citations + timestamps in output | β | Partial | β | Manual |
| Best for | Single-site crawl β clean Markdown | Hosted scraping API | AI-native crawling | Large-scale distributed |
MarkCrawl's niche is focused-scope RAG ingestion β narrow crawls of docs/blogs/product sites that produce LLM-ready Markdown with minimal junk. For broader scope or bigger scale, the other tools above are stronger choices.
Benchmark results (6 tools, May 2026)
Speed: scrapy+md is fastest (5.0 pages/sec), markcrawl at 2.7. Playwright-based tools average 1.4-2.1 pages/sec.
Output cleanliness: markcrawl has the lowest nav pollution (53 words vs 500+ for others) β less junk in your embeddings.
RAG answer quality: markcrawl scores 3.77/5 on answer quality with the fewest chunks (27,193 total, 2.2x fewer than the most), keeping embedding costs low.
| Tool | Chunks/page | Answer Quality (/5) | Annual cost (100K pages, 1K queries/day) |
|---|---|---|---|
| markcrawl | 18.7 | 3.77 | $4,505 |
| scrapy+md | 31.7 | 3.68 | $5,464 |
| crawl4ai | 16.8 | 4.72 | $6,960 |
| colly+md | 40.6 | 4.36 | $7,213 |
| playwright | 39.0 | 4.48 | $7,320 |
| crawlee | 40.5 | 4.68 | $7,467 |
Full benchmark data: docs/BENCHMARKS.md | Methodology: llm-crawler-benchmarks
</details>Methodology caveat (numbers as of bench v1.4, 2026-05-11): the v1.4 leaderboard sourced test queries from a single high-coverage crawler's output. The bench is actively being updated in v1.5 to source queries from each site's own sitemap independent of any crawler (release notes). Numbers above are single-trial; multi-trial measurement is on the v1.5.1 roadmap. Treat individual rankings as point-in-time signal, not steady-state.
Installation
pip install markcrawl # Core crawler + chunker + local embedder
# (no API keys required for embedding)Optional add-ons (tasks beyond the crawl-and-embed core):
pip install markcrawl[js] # + JavaScript rendering (Playwright)
pip install markcrawl[extract] # + LLM extraction (OpenAI, Claude, Gemini, Grok)
pip install markcrawl[upload] # + Supabase upload integration
pip install markcrawl[mcp] # + MCP server for AI agents
pip install markcrawl[langchain] # + LangChain tool wrappers
pip install markcrawl[all] # EverythingFor Playwright, also run playwright install chromium after installing.
Lean install (skip the local-embedd
β¦