Cloudflare Workers AI
Cloudflare Workers AI for serverless GPU inference. Use it for LLMs, text/image generation, and embeddings, or when encountering AI_ERROR, rate-limit, or token-exceeded errors.
Tags: cloudflare, ai, llm, embedding
By secondsky
# Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
**Status**: Production Ready
**Last Updated**: 2025-11-21
**Dependencies**: cloudflare-worker-base (for Worker setup)
**Latest Versions**: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
---
## Table of Contents
1. [Quick Start (5 minutes)](#quick-start-5-minutes)
2. [Workers AI API Reference](#workers-ai-api-reference)
3. [Model Selection Guide](#model-selection-guide)
4. [Common Patterns](#common-patterns)
5. [AI Gateway Integration](#ai-gateway-integration)
6. [Rate Limits & Pricing](#rate-limits--pricing)
7. [Production Checklist](#production-checklist)
---
## Quick Start (5 minutes)
### 1. Add AI Binding
**wrangler.jsonc:**
```jsonc
{
  "ai": {
    "binding": "AI"
  }
}
```
### 2. Run Your First Model
```typescript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });
    return Response.json(response);
  },
};
```
### 3. Add Streaming (Recommended)
```typescript
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
```
**Why streaming?**
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
---
## Workers AI API Reference
### Core API: `env.AI.run()`
```typescript
const response = await env.AI.run(model, inputs, options?);
```
| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Model ID (e.g., `@cf/meta/llama-3.1-8b-instruct`) |
| `inputs` | object | Model-specific inputs (see model type below) |
| `options.gateway.id` | string | AI Gateway ID for caching/logging |
| `options.gateway.skipCache` | boolean | Skip AI Gateway cache |
**Returns**: `Promise<ModelOutput>` (non-streaming) or `ReadableStream` (streaming)
### Input Types by Model Category
| Category | Key Inputs | Output |
|----------|------------|--------|
| **Text Generation** | `messages[]`, `stream`, `max_tokens`, `temperature` | `{ response: string }` |
| **Embeddings** | `text: string \| string[]` | `{ data: number[][], shape: number[] }` |
| **Image Generation** | `prompt`, `num_steps`, `guidance` | Binary PNG |
| **Vision** | `messages[].content[].image_url` | `{ response: string }` |
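As a concrete illustration of the embeddings row, a minimal call and the shape of its result (dimensions assume `bge-base-en-v1.5`):
```typescript
// Minimal embeddings call -- output shape assumes @cf/baai/bge-base-en-v1.5 (768 dims)
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Cloudflare Workers AI runs models on serverless GPUs'],
});
// embeddings.shape -> [1, 768]; embeddings.data[0] is the vector for the first input
```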
**Full model details**: Load `references/models-catalog.md` for complete model list, parameters, and rate limits.
---
## Model Selection Guide
### Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|-------|----------|------------|------|
| `@cf/meta/llama-3.1-8b-instruct` | General purpose, fast | 300/min | 8B |
| `@cf/meta/llama-3.2-1b-instruct` | Ultra-fast, simple tasks | 300/min | 1B |
| `@cf/qwen/qwen1.5-14b-chat-awq` | High quality, complex reasoning | 150/min | 14B |
| `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | Coding, technical content | 300/min | 32B |
| `@hf/thebloke/mistral-7b-instruct-v0.1-awq` | Fast, efficient | 400/min | 7B |
### Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|-------|-----------|----------|------------|
| `@cf/baai/bge-base-en-v1.5` | 768 | General purpose RAG | 3000/min |
| `@cf/baai/bge-large-en-v1.5` | 1024 | High accuracy search | 1500/min |
| `@cf/baai/bge-small-en-v1.5` | 384 | Fast, low storage | 3000/min |
### Image Generation
| Model | Best For | Rate Limit | Speed |
|-------|----------|------------|-------|
| `@cf/black-forest-labs/flux-1-schnell` | High quality, photorealistic | 720/min | Fast |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | General purpose | 720/min | Medium |
| `@cf/lykon/dreamshaper-8-lcm` | Artistic, stylized | 720/min | Fast |
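A minimal sketch of serving a generated image straight from a Worker; it assumes the model returns raw PNG bytes (as the input-types table above indicates for image generation):
```typescript
// Generate an image and return the binary PNG directly to the client
const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'A lighthouse on a rocky coast at sunset',
  num_steps: 20, // fewer steps = faster, lower detail
});

return new Response(image, { headers: { 'content-type': 'image/png' } });
```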
### Vision Models
| Model | Best For | Rate Limit |
|-------|----------|------------|
| `@cf/meta/llama-3.2-11b-vision-instruct` | Image understanding | 720/min |
| `@cf/unum/uform-gen2-qwen-500m` | Fast image captioning | 720/min |
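A hedged sketch of an image-understanding call, following the `messages[].content[].image_url` shape noted in the input-types table; verify the exact content-part schema against the model's catalog page before relying on it:
```typescript
// Sketch only: the content-part layout below is an assumption based on the
// input-types table; check the model's schema in references/models-catalog.md.
const result = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image in one sentence.' },
        { type: 'image_url', image_url: { url: 'https://example.com/photo.jpg' } },
      ],
    },
  ],
});
// result.response contains the model's description
```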
---
## Common Patterns
### Pattern 1: Chat with Streaming
```typescript
// Hono route handler -- c.env exposes the AI binding from wrangler.jsonc
app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{ messages: Array<{ role: string; content: string }> }>();
  const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages, stream: true });
  return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
});
```
### Pattern 2: RAG (Retrieval Augmented Generation)
```typescript
// 1. Generate embedding for query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });

// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });

// 3. Build context
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// 4. Generate with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});

return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
```
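The query-time flow above assumes the documents were already indexed. A hedged sketch of the indexing side, assuming a `VECTORIZE` binding whose index dimension matches the embedding model (768 for `bge-base-en-v1.5`):
```typescript
// Index documents: embed each chunk, then upsert vector + metadata into Vectorize
const chunks = ['Cloudflare Workers run at the edge.', 'Workers AI provides GPU inference.'];

const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: chunks });

await env.VECTORIZE.upsert(
  chunks.map((text, i) => ({
    id: `doc-${i}`,      // stable IDs allow re-upserting updated chunks
    values: data[i],     // 768-dimensional vector for this chunk
    metadata: { text },  // read back at query time to build the prompt context
  }))
);
```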
**More patterns**: Load `references/best-practices.md` for structured output, image generation, multi-model consensus, and production patterns.
---
## AI Gateway Integration
Enable caching, logging, and cost tracking with AI Gateway:
```typescript
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' }, {
  gateway: { id: 'my-gateway', skipCache: false },
});
```
**Benefits**: Cost tracking, response caching (50-90% savings on repeated queries), request logging, rate limiting, analytics.
---
## Rate Limits & Pricing
**Information last verified**: 2025-01-14
Rate limits and pricing vary significantly by model. Always check the official documentation for the most current information:
- **Rate Limits**: https://developers.cloudflare.com/workers-ai/platform/limits/
- **Pricing**: https://developers.cloudflare.com/workers-ai/platform/pricing/
**Free Tier**: 10,000 neurons/day
**Paid Tier**: $0.011 per 1,000 neurons
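For example, at that rate a workload consuming 100,000 neurons beyond the daily free allocation would cost roughly $1.10 (100 × $0.011).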
**Per-model details**: See `references/models-catalog.md` for specific rate limits and pricing for each model.
---
## Production Checklist
**Essential before deploying:**
- [ ] Enable AI Gateway for cost tracking
- [ ] Implement streaming for text generation
- [ ] Add rate-limit retry with exponential backoff (see the sketch after this list)
- [ ] Validate input length (prevent token limit errors)
- [ ] Add input sanitization (prevent prompt injection)
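A minimal retry sketch for the backoff item above; the retryable-error check is an assumption and should be adjusted to the error shapes you actually observe:
```typescript
// Hedged sketch: generic retry wrapper with exponential backoff and jitter.
// The retryable check (matching "rate limit" / 429 / capacity in the message)
// is an assumption -- adapt it to the errors your models actually return.
async function runWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      const retryable = /rate limit|429|capacity/i.test(message);
      if (!retryable || attempt >= maxAttempts) throw err;
      // Back off: 500ms, 1s, 2s, ... plus a little jitter
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage:
// const result = await runWithRetry(() =>
//   env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' })
// );
```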
**Full checklist**: Load `references/best-practices.md` for the complete production checklist, error handling patterns, monitoring, and cost optimization.
---
## External SDK Integrations
Workers AI supports OpenAI SDK compatibility and Vercel AI SDK:
```typescript
// OpenAI SDK -- point the client at the Workers AI OpenAI-compatible endpoint
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Vercel AI SDK -- native integration via the workers-ai-provider package
import { createWorkersAI } from 'workers-ai-provider';

const workersai = createWorkersAI({ binding: env.AI });
```
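A hedged usage sketch for the OpenAI-compatible endpoint configured above; Workers AI model IDs are passed in place of OpenAI model names:
```typescript
const chat = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'What is Cloudflare?' }],
});

console.log(chat.choices[0].message.content);
```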
**Full integration guide**: Load `references/integrations.md` for OpenAI SDK, Vercel AI SDK, and REST API examples.
---
## Limits Summary
| Feature | Limit |
|---------|-------|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
---
## When to Load References
| Reference File | Load When... |
|----------------|--------------|
| `references/models-catalog.md` | Choosing a model, checking rate limits, comparing model capabilities |
| `references/best-practices.md` | Production deployment, error handling, cost optimization, security |
| `references/integrations.md` | Using OpenAI SDK, Vercel AI SDK, or REST API instead of native binding |
---
## References
- [Workers AI Docs](https://developers.cloudflare.com/workers-ai/)
- [Models Catalog](https://developers.cloudflare.com/workers-ai/models/)
- [AI Gateway](https://developers.cloudflare.com/ai-gateway/)
- [Pricing](https://developers.cloudflare.com/workers-ai/platform/pricing/)

## How to use
- Copy the skill content above
- Create a .claude/skills directory in your project
- Save as .claude/skills/claude-skills-cloudflare-workers-ai.md
- Use /claude-skills-cloudflare-workers-ai in Claude Code to invoke this skill