Webclaw

Web content extraction for AI agents. 10 tools: scrape, crawl, map, batch, extract, summarize, diff, brand, search, research. TLS fingerprinting bypasses anti-bot without a browser. 67% fewer tokens than raw HTML. `npx create-webclaw` auto-configures Claude, Cursor, Windsurf, Co…

Tags: other-tools-and-integrations, browser, ai, agent
By 0xMassi
1.6k · 176 · Updated today · Rust · AGPL-3.0

Installation

npx create-webclaw

Configuration

{
  "mcpServers": {
    "webclaw": {
      "command": "npx",
      "args": ["-y", "webclaw"]
    }
  }
}

How to use

  1. Run the installation command above (if needed)
  2. Open your Claude Code settings file (~/.claude/settings.json)
  3. Add the configuration to the mcpServers section
  4. Restart Claude Code to apply changes
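
Steps 2 and 3 amount to merging one entry into the `mcpServers` object of an existing JSON file. A minimal sketch of that merge in Python follows; the `add_mcp_server` helper and the `theme` key are hypothetical and only illustrate that unrelated settings should survive the edit.

```python
import json

def add_mcp_server(settings: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of `settings` with one entry added under "mcpServers"."""
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged

# Hypothetical existing settings; "theme" and "other" are placeholders.
existing = {"theme": "dark", "mcpServers": {"other": {"command": "other-server"}}}
updated = add_mcp_server(existing, "webclaw", "npx", ["-y", "webclaw"])
print(json.dumps(updated, indent=2))
```

To apply this to a real file, load it with `json.load`, call the helper, and write the result back with `json.dump`. Keeping a backup of the original file first is a sensible precaution.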

webclaw

Turn websites into clean markdown, JSON, and LLM-ready context.
CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.

Website: https://webclaw.io · Docs: https://webclaw.io/docs · Discord: https://discord.gg/KDfd48EpnW · X: https://x.com/webclaw_io

Most web scraping tools give your agent one of two bad outputs:

  • a blocked page, login wall, or empty app shell
  • raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate

webclaw.io is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.

webclaw turns a URL into clean content your tools can actually use.

webclaw https://example.com --format markdown
# Example Domain

This domain is for use in illustrative examples in documents.

You may use this domain in literature without prior coordination or asking for permission.

Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.


Install

Agent setup

The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:

npx create-webclaw

The installer detects supported clients and configures the MCP server for you.

Homebrew

brew tap 0xMassi/webclaw
brew install webclaw

Prebuilt binaries

Download macOS, Linux, and Windows binaries from GitHub Releases.

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Cargo

cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp

If building from source fails because native build tools are missing, install the platform prerequisites:

| OS | Command |
| --- | --- |
| Debian / Ubuntu | `sudo apt install -y pkg-config libssl-dev cmake clang git build-essential` |
| Fedora / RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |
| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |
| macOS | `xcode-select --install` |

Quick Start

Scrape one page

webclaw https://stripe.com --format markdown

Return LLM-optimized text

webclaw https://docs.anthropic.com --format llm

Keep only the main content

webclaw https://example.com/blog/post --only-main-content

Include or exclude selectors

webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"
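
When calling the CLI from a script, it can help to assemble the arguments as a list rather than a shell string. The sketch below uses only the flags shown in this Quick Start. The `build_scrape_command` helper is hypothetical, and combining `--format` with the selector flags in a single call is an assumption.

```python
import shlex

def build_scrape_command(url, fmt=None, only_main_content=False, include=None, exclude=None):
    """Build an argv list for the webclaw CLI using flags from the Quick Start."""
    argv = ["webclaw", url]
    if fmt:
        argv += ["--format", fmt]
    if only_main_content:
        argv.append("--only-main-content")
    if include:
        # Selectors are passed as one comma-separated string, as in the example above.
        argv += ["--include", ", ".join(include)]
    if exclude:
        argv += ["--exclude", ", ".join(exclude)]
    return argv

argv = build_scrape_command(
    "https://example.com",
    fmt="markdown",
    include=["article", "main", ".content"],
    exclude=["nav", "footer", ".sidebar", ".ad"],
)
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

Passing `argv` to `subprocess.run(argv, check=True, capture_output=True, text=True)` runs the command without going through a shell, so the selector strings need no extra quoting.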

Crawl a documentation site

webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
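
To make the two limits concrete, here is a small sketch of a bounded breadth-first crawl over an in-memory link graph. It is an illustration, not webclaw's crawler: the traversal order, the reading of `--depth` as link hops from the start URL, and the `crawl_order` helper are all assumptions.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_order(start, links, depth, max_pages):
    """Breadth-first walk over `links`, staying on the start URL's origin."""
    origin = urlsplit(start)[:2]  # (scheme, netloc)
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if level == depth:
            continue  # depth limit reached: do not follow links from this page
        for nxt in links.get(url, []):
            if nxt not in seen and urlsplit(nxt)[:2] == origin:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return visited

# Hypothetical site structure used only for this example.
links = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://other.example.org/x",
        "https://docs.example.com/b",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/a/deep"],
    "https://docs.example.com/a/deep": ["https://docs.example.com/a/deeper"],
}
print(crawl_order("https://docs.example.com/", links, depth=2, max_pages=50))
```

In this sketch the off-origin link is skipped and `/a/deeper` is never reached because it sits three hops from the start. Lowering `max_pages` cuts the walk short before the depth limit matters.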

Workflow examples

Extract brand assets

webclaw https://github.com --brand

Compare a page over time

webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
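
How webclaw computes its diff is not described here. As a rough mental model only, comparing two text snapshots can be sketched with Python's standard `difflib`. The pricing text below is made up for the example.

```python
import difflib

def snapshot_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two text snapshots."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))

# Hypothetical snapshots of the same page taken at two points in time.
old = "Pro plan\n$20 per month\nUnlimited pages"
new = "Pro plan\n$25 per month\nUnlimited pages"
for line in snapshot_diff(old, new):
    print(line)
```

Lines starting with `-` appear only in the old snapshot and lines starting with `+` only in the new one, which is usually enough to spot a price or wording change.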

MCP Server

webclaw ships with an MCP server for AI agents.

npx create-webclaw

Manual config:

{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
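
Some MCP clients may launch the command directly rather than through a shell, in which case `~` might not be expanded. If the server fails to start, writing an absolute path is worth trying. The sketch below generates the same config with the path expanded; `manual_mcp_config` is a hypothetical helper, and whether a given client needs this is an assumption.

```python
import json
import os

def manual_mcp_config(binary: str = "~/.webclaw/webclaw-mcp") -> dict:
    """Build the manual config with the binary path expanded to an absolute path."""
    return {"mcpServers": {"webclaw": {"command": os.path.expanduser(binary)}}}

print(json.dumps(manual_mcp_config(), indent=2))
```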

Then ask your agent things like:

Scrape these competitor pricing pages and summarize the differences.
Crawl this documentation site and prepare clean context for a RAG index.
Extract the brand colors, fonts, and logos from this company website.

Tools

| Tool | What it does | Local |
| --- | --- | --- |
| scrape | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| crawl | Follow same-origin links and extract discovered pages | Yes |
| map | Discover URLs without extracting every page | Yes |
| batch | Scrape multiple URLs in parallel | Yes |
| extract | Convert page content into structured data | Yes, with local or configured LLM |
| summarize | Summarize a page | Yes, with local or configured LLM |
| diff | Compare page content snapshots | Yes |
| brand | Extract colors, fonts, logos, and metadata | Yes |
| search | Search the web and scrape results | Hosted API |
| research | Multi-source research workflow | Hosted API |

SDKs

npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go

<details>
<summary>TypeScript</summary>

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);

</details>

<details>
<summary>Python</summary>

from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")

page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)

print(page.markdown)

</details>

<details>
<summary>cURL</summary>

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
</details>
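
If you would rather not add a dependency, the cURL request can be reproduced with Python's standard library. This sketch only builds the request object using the endpoint, headers, and body from the cURL example. The response format is not documented here, so the sketch stops before sending or parsing, and `build_scrape_request` is a hypothetical helper.

```python
import json
import os
import urllib.request

def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    body = {"url": url, "formats": ["markdown"], "only_main_content": True}
    return urllib.request.Request(
        "https://api.webclaw.io/v1/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Falls back to the placeholder key from the examples above if the variable is unset.
api_key = os.environ.get("WEBCLAW_API_KEY", "wc_your_key")
req = build_scrape_request("https://example.com", api_key)
print(req.get_method(), req.full_url)
```

Sending it is a matter of `urllib.request.urlopen(req)`, ideally with a timeout and error handling around it.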

Output Formats

| Format | Use it when you need |
| --- | --- |
| markdown | Clean page content with structure preserved |
| llm | Compact context for agents and RAG pipelines |
| text | Plain text with minimal formatting |
| json | Structured metadata, links, images, and extracted fields |
| html | Cleaned HTML for custom processing |

Local First, Hosted When Needed

The CLI and MCP server work locally without an account for the core extraction path.

Use the hosted API at webclaw.io when you need:

  • protected-site access without managing infrastructure
  • JavaScript rendering
  • async crawl and research jobs
  • web search
  • watches and production usage tracking
  • SDKs for application code

export WEBCLAW_API_KEY=wc_your_key

webclaw https://example.com --cloud
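
In scripts, it can be useful to fail early when the key is missing instead of discovering it at request time. The sketch below assumes, based on the `export` line above, that `--cloud` reads `WEBCLAW_API_KEY` from the environment. `cloud_command` is a hypothetical helper.

```python
import os

def cloud_command(url: str, env=None) -> list[str]:
    """Return argv for a hosted run, failing early if no API key is set."""
    env = os.environ if env is None else env
    if not env.get("WEBCLAW_API_KEY"):
        raise RuntimeError("WEBCLAW_API_KEY is not set")
    return ["webclaw", url, "--cloud"]

# An explicit mapping is used here so the example runs without a real key.
print(cloud_command("https://example.com", {"WEBCLAW_API_KEY": "wc_your_key"}))
```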

What You Can Build

| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |

Architecture

webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface

webclaw-core is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.
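
The value of keeping extraction free of network I/O is that it becomes a plain function from an HTML string to output, which is easy to test with fixtures. The Python sketch below illustrates that separation only. It is not webclaw's extraction logic (which is written in Rust), and the set of skipped tags is an arbitrary choice for the example.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping a few boilerplate elements."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Pure function: HTML string in, text out. No network access."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

html = "<nav>Home | Docs</nav><main><h1>Title</h1><p>Body text.</p></main><script>track()</script>"
print(extract_text(html))
```

Because `extract_text` never touches the network, the same function can be fed HTML from a fetcher, a file on disk, or a test fixture.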


Configuration

| Variable | Description |
| --- | --- |
| WEBCLAW_API_KEY | Hosted API key |
| OLLAMA_HOST | Ollama URL for local LLM features |
| OPENAI_API_KEY | OpenAI-compatible LLM provider key |
| OPENAI_BASE_URL | OpenAI-compatible base URL |
| ANTHROPIC_API_KEY | Anthropic-compatible LLM provider key |
| ANTHROPIC_BASE_URL | Anthropic-compatible base URL |
| WEBCLAW_PROXY | Single proxy URL |
| WEBCLAW_PROXY_FILE | Proxy pool file |
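
When debugging a setup, a quick check of which of these variables are present can save time. The sketch below only reports names, never values. It makes no claim about which provider webclaw prefers when several are set, since that is not stated here, and `configured_vars` is a hypothetical helper.

```python
import os

DOCUMENTED_VARS = [
    "WEBCLAW_API_KEY",
    "OLLAMA_HOST",
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
    "ANTHROPIC_API_KEY",
    "ANTHROPIC_BASE_URL",
    "WEBCLAW_PROXY",
    "WEBCLAW_PROXY_FILE",
]

def configured_vars(env=None) -> list[str]:
    """Return the documented variable names that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in DOCUMENTED_VARS if env.get(name)]

print(configured_vars())
```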

Contributing

The most useful contributions right now are practical and small:

  • add examples for real agent and RAG workflows
  • improve SDK snippets
  • report pages that extract poorly
  • add failing fixtures for messy HTML
  • improve docs for MCP clients and local setup
  • test the CLI on more Linux/macOS environments

If a page extracts badly, include:

URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:

Please remove secrets, cookies, private tokens, and customer data from logs before posting.
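
A first pass at scrubbing logs can be automated, though it does not replace reading the log yourself before posting. The sketch below masks bearer tokens, cookie headers, and strings that look like the `wc_` placeholder key used in this README. That real keys share the `wc_` prefix is an assumption, and the patterns are illustrative rather than exhaustive.

```python
import re

# (pattern, replacement) pairs applied in order.
PATTERNS = [
    (re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"), r"\1[REDACTED]"),
    (re.compile(r"(?i)(cookie:\s*).+"), r"\1[REDACTED]"),
    (re.compile(r"\bwc_[A-Za-z0-9_]+"), "[REDACTED]"),  # assumed key prefix
]

def redact(log: str) -> str:
    """Mask common secret-bearing fragments in a log excerpt."""
    for pattern, replacement in PATTERNS:
        log = pattern.sub(replacement, log)
    return log

# Made-up log excerpt for the example.
sample = "Authorization: Bearer wc_abc123\nCookie: session=xyz; id=1\nurl=https://example.com"
print(redact(sample))
```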


Infrastructure Partner

<table> <tr> <td align="center"> <a href="https://coldproxy.com/"> <img src="./assets/sponsors/coldproxy-banner.png" alt="ColdProxy" width="720" /> </a> </td> </tr> <tr> <td> <strong>ColdProxy</strong> supports webclaw as an Infrastructure Partner, providing residential IPv4, residential IPv6, and datacenter IPv6 proxy infrastructure across 195+ countries for public data collection, regional testing, monitoring, and web scraping workflows. Explore <a href="https://coldproxy.com/">ColdProxy</a>'s latest plans and available offers directly on the website. See the <a href="examples/proxy-backed-crawling/#using-coldproxy">proxy-backed crawling guide</a> for a hands-on walkthrough of wiring ColdProxy into webclaw. </td> </tr> </table>

Studio Partners

<table> <tr> <td width="340" align="center"> <a href="https://go.nodemaven.com/webclaw"> <img src="./assets/sponsors/nodemaven-banner.png" alt="NodeMaven" width="300" /> </a> </td> <td> <strong>NodeMaven</strong> is the most reliable proxy provider with the highest-quality IPs on the market. Best solution for automation, web scraping, SEO research, and social media management: 99.9% uptime, sticky sessions up to 7 days, IP filtering (all proxies under a 97% fraud score), no KYC, and cashback up to 10% on traffic. Use <code>WEBCLAW35</code> for 35% off Mobile and Residential proxies, or <code>WEBCLAW40</code> for 40% off ISP (Static) proxies at <a href="https://go.
