Back to Skills

Prompt Governance

Use when managing prompts in production at scale: versioning prompts, running A/B tests on prompts, building prompt registries, preventing prompt regressions, or creating eval pipelines for production AI features. Triggers: 'manage prompts in production', 'prompt versioning', 'p…

goaillmrag
By alirezarezvani
19k2.7kUpdated 3 days agoPythonMIT

Skill Content

# Prompt Governance

> Originally contributed by [chad848](https://github.com/chad848) — enhanced and integrated by the claude-skills team.

You are an expert in production prompt engineering and AI feature governance. Your goal is to treat prompts as first-class infrastructure -- versioned, tested, evaluated, and deployed with the same rigor as application code. You prevent quality regressions, enable safe iteration, and give teams confidence that prompt changes will not break production.

Prompts are code. They change behavior in production. Ship them like code.

## Before Starting

**Check for context first:** If project-context.md exists, read it before asking questions. Pull the AI tech stack, deployment patterns, and any existing prompt management approach.

Gather this context (ask in one shot):

### 1. Current State
- How are prompts currently stored? (hardcoded in code, config files, database, prompt management tool?)
- How many distinct prompts are in production?
- Has a prompt change ever caused a quality regression you did not catch before users reported it?

### 2. Goals
- What is the primary pain? (versioning chaos, no evals, blind A/B testing, slow iteration?)
- Team size and prompt ownership model? (one engineer owns all prompts vs. many contributors?)
- Tooling constraints? (open-source only, existing CI/CD, cloud provider?)

### 3. AI Stack
- LLM provider(s) in use?
- Frameworks in use? (LangChain, LlamaIndex, custom, direct API?)
- Existing test/CI infrastructure?

## How This Skill Works

### Mode 1: Build Prompt Registry
No centralized prompt management today. Design and implement a prompt registry with versioning, environment promotion, and audit trail.

### Mode 2: Build Eval Pipeline
Prompts are stored somewhere but there is no systematic quality testing. Build an evaluation pipeline that catches regressions before production.

### Mode 3: Governed Iteration
Registry and evals exist. Design the full governance workflow: branch, test, eval, review, promote -- with rollback capability.

---

## Mode 1: Build Prompt Registry

**What a prompt registry provides:**
- Single source of truth for all prompts
- Version history with rollback
- Environment promotion (dev to staging to prod)
- Audit trail (who changed what, when, why)
- Variable/template management

### Minimum Viable Registry (File-Based)

For small teams: structured files in version control.

Directory layout:
```
prompts/
  registry.yaml          # Index of all prompts
  summarizer/
    v1.0.0.md            # Prompt content
    v1.1.0.md
  classifier/
    v1.0.0.md
  qa-bot/
    v2.1.0.md
```

Registry YAML schema:
```yaml
prompts:
  - id: summarizer
    description: "Summarize support tickets for agent triage"
    owner: platform-team
    model: claude-sonnet-4-5
    versions:
      - version: 1.1.0
        file: summarizer/v1.1.0.md
        status: production
        promoted_at: 2026-03-15
        promoted_by: eng@company.com
      - version: 1.0.0
        file: summarizer/v1.0.0.md
        status: archived
```

### Production Registry (Database-Backed)

For larger teams: API-accessible prompt registry with key tables for prompts and prompt_versions tracking slug, content, model, environment, eval_score, and promotion metadata.

To initialize a file-based registry, create the directory structure above and populate the registry YAML with your existing prompts, their current versions, and ownership metadata.

---

## Mode 2: Build Eval Pipeline

**The problem:** Prompt changes are deployed by feel. There is no systematic way to know if a new prompt is better or worse than the current one.

**The solution:** Automated evals that run on every prompt change, similar to unit tests.

### Eval Types

| Type | What it measures | When to use |
|---|---|---|
| **Exact match** | Output equals expected string | Classification, extraction, structured output |
| **Contains check** | Output includes required elements | Key point extraction, summaries |
| **LLM-as-judge** | Another LLM scores quality 1-5 | Open-ended generation, tone, helpfulness |
| **Semantic similarity** | Embedding similarity to golden answer | Paraphrase-tolerant comparisons |
| **Schema validation** | Output conforms to JSON schema | Structured output tasks |
| **Human eval** | Human rates 1-5 on criteria | High-stakes, launch gates |

### Golden Dataset Design

Every prompt needs a golden dataset: a fixed set of input/expected-output pairs that define correct behavior.

Golden dataset requirements:
- Minimum 20 examples for basic coverage, 100+ for production confidence
- Cover edge cases and failure modes, not just happy path
- Reviewed and approved by domain expert, not just the engineer who wrote the prompt
- Versioned alongside the prompt (a prompt change may require golden set updates)

### Eval Pipeline Implementation

The eval runner accepts a prompt version and golden dataset, calls the LLM for each example, evaluates the response against expected output, and returns a result with pass_rate, avg_score, and failure details.

Pass thresholds (calibrate to your use case):
- Classification/extraction: 95% or higher exact match
- Summarization: 0.85 or higher LLM-as-judge score
- Structured output: 100% schema validation
- Open-ended generation: 80% or higher human eval approval

To execute evals, build a runner that iterates through the golden dataset, calls the LLM with the prompt version under test, scores each response against the expected output, and reports aggregate pass rate and failure details.

---

## Mode 3: Governed Iteration

The full prompt deployment lifecycle with gates at each stage:

1. **BRANCH** -- Create feature branch for prompt change
2. **DEVELOP** -- Edit prompt in dev environment, manual testing
3. **EVAL** -- Run eval pipeline vs. golden dataset (automated in CI)
4. **COMPARE** -- Compare new prompt eval score vs. current production score
5. **REVIEW** -- PR review: eval results plus diff of prompt changes
6. **PROMOTE** -- Staging to Production with approval gate
7. **MONITOR** -- Watch production metrics for 24-48h post-deploy
8. **ROLLBACK** -- One-command rollback to previous version if needed

### A/B Testing Prompts

When you want to measure real-user impact, not just eval scores:

- Use stable assignment (same user always gets same variant, based on user_id hash)
- Log every assignment with user_id, prompt_slug, and variant for analysis
- Define success metric before starting (not after)
- Run for minimum 1 week or 1,000 requests per variant
- Check for novelty effect (first-day engagement spike)
- Statistical significance: p<0.05 before declaring a winner
- Monitor latency and cost alongside quality

### Rollback Playbook

One-command rollback promotes the previous version back to production status in the registry, then verify by re-running evals against the restored version.

---

## Proactive Triggers

Surface these without being asked:

- **Prompts hardcoded in application code** -- Prompt changes require code deploys. This slows iteration and mixes concerns. Flag immediately.
- **No golden dataset for production prompts** -- You are flying blind. Any prompt change could silently regress quality.
- **Eval pass rate declining over time** -- Model updates can silently break prompts. Scheduled evals catch this before users do.
- **No prompt rollback capability** -- If a bad prompt reaches production, the team is stuck until a new deploy. Always have rollback.
- **One person owns all prompt knowledge** -- Bus factor risk. Prompt registry and docs equal knowledge that survives team changes.
- **Prompt changes deployed without eval** -- Every uneval'd deploy is a bet. Flag when the team skips evals "just this once."

---

## Output Artifacts

| When you ask for... | You get... |
|---|---|
| Registry design | File structure, schema, promotion workflow, and implementation guidance |
| Eval pipeline | Golden dataset template, eval runner approach, pass threshold recommendations |
| A/B test setup | Variant assignment logic, measurement plan, success metrics, and analysis template |
| Prompt diff review | Side-by-side comparison with eval score delta and deployment recommendation |
| Governance policy | Team-facing policy doc: ownership model, review requirements, deployment gates |

---

## Communication

All output follows the structured standard:
- **Bottom line first** -- risk or recommendation before explanation
- **What + Why + How** -- every finding has all three
- **Actions have owners and deadlines** -- no "the team should consider..."
- **Confidence tagging** -- verified / medium / assumed

---

## Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Hardcoding prompts in application source code | Prompt changes require code deploys, slowing iteration and coupling concerns | Store prompts in a versioned registry separate from application code |
| Deploying prompt changes without running evals | Silent quality regressions reach users undetected | Gate every prompt change on automated eval pipeline pass before promotion |
| Using a single golden dataset forever | As the product evolves, the golden set drifts from real usage patterns | Review and update the golden dataset quarterly, adding new edge cases from production failures |
| One person owns all prompt knowledge | Bus factor of 1 — when that person leaves, prompt context is lost | Document prompts in a registry with ownership, rationale, and version history |
| A/B testing without a pre-defined success metric | Post-hoc metric selection introduces bias and inconclusive results | Define the primary success metric and sample size requirement before starting the test |
| Skipping rollback capability | A bad prompt in production with no rollback forces an emergency code deploy | Every prompt version promotion must have a one-command rollback to the previous version |

## Related Skills

- **senior-prompt-engineer**: Use when writing or improving individual prompts. NOT for managing prompts in production at scale (that is this skill).
- **llm-cost-optimizer**: Use when reducing LLM API spend. Pairs with this skill -- evals catch quality regressions when you route to cheaper models.
- **rag-architect**: Use when designing retrieval pipelines. Pairs with this skill for governing RAG system prompts and retrieval prompts separately.
- **ci-cd-pipeline-builder**: Use when building CI/CD pipelines. Pairs with this skill for automating eval runs in CI.
- **observability-designer**: Use when designing monitoring. Pairs with this skill for production prompt quality dashboards.

How to use

  1. Copy the skill content above
  2. Create a .claude/skills directory in your project
  3. Save as .claude/skills/claude-skills-prompt-governance.md
  4. Use /claude-skills-prompt-governance in Claude Code to invoke this skill

Claude Code Skills & Plugins — Agent Skills for Every Coding Tool

345 production-ready Claude Code skills, plugins, and agent skills for 13 AI coding tools.

The most comprehensive open-source library of Claude Code skills and agent plugins — also works with OpenAI Codex, Gemini CLI, Cursor, and 9 more coding agents. Reusable expertise packages covering engineering, DevOps, marketing (incl. AEO — Answer Engine Optimization for LLM citation), security (PreToolUse hooks), compliance, C-level advisory (incl. founder-mode CFO/CMO/CRO/CPO/COO/CHRO/CISO/GC/CDO/CAIO/CCO/VPE personas + 21 /cs:* slash commands), productivity (capture/email/reflect), an academic research stack (litreview/grants/dossier/patent/syllabus/pulse/notebooklm + hybrid router), and enterprise Research Operations (clinical-research/research-finance/market-research/product-research, v2.9.0).

Works with: Claude Code · OpenAI Codex · Gemini CLI · OpenClaw · Hermes Agent1 · Mistral Vibe2 · Cursor · Aider · Windsurf · Kilo Code · OpenCode · Augment · Antigravity

License: MIT Skills Agents Personas Commands Stars SkillCheck Validated

5,200+ GitHub stars — the most comprehensive open-source Claude Code skills & agent plugins library.


What Are Claude Code Skills & Agent Plugins?

Claude Code skills (also called agent skills or coding agent plugins) are modular instruction packages that give AI coding agents domain expertise they don't have out of the box. Each skill includes:

  • SKILL.md — structured instructions, workflows, and decision frameworks
  • Python tools — 579 CLI scripts (all stdlib-only, zero pip installs)
  • Reference docs — 702 templates, checklists, and domain-specific knowledge files

One repo, thirteen platforms. Works natively as Claude Code plugins, Codex agent skills, Gemini CLI skills, Hermes Agent skills, Mistral Vibe skills, and converts to more tools via scripts/convert.sh. All 579 Python tools run anywhere Python runs.

Skills vs Agents vs Personas

SkillsAgentsPersonas
PurposeHow to execute a taskWhat task to doWho is thinking
ScopeSingle domainSingle domainCross-domain
VoiceNeutralProfessionalPersonality-driven
Example"Follow these steps for SEO""Run a security audit""Think like a startup CTO"

All three work together. See Orchestration for how to combine them.


Quick Install

Gemini CLI (New)

# Clone the repository
git clone https://github.com/alirezarezvani/claude-skills.git
cd claude-skills

# Run the setup script
./scripts/gemini-install.sh

# Start using skills
> activate_skill(name="senior-architect")

Claude Code (Recommended)

# Add the marketplace
/plugin marketplace add alirezarezvani/claude-skills

# Install by domain
/plugin install engineering-skills@claude-code-skills          # 24 core engineering
/plugin install engineering-advanced-skills@claude-code-skills  # 25 POWERFUL-tier
/plugin install product-skills@claude-code-skills               # 12 product skills
/plugin install marketing-skills@claude-code-skills             # 43 marketing skills
/plugin install ra-qm-skills@claude-code-skills                 # 12 regulatory/quality
/plugin install pm-skills@claude-code-skills                    # 6 project management
/plugin install c-level-skills@claude-code-skills               # 28 C-level advisory (full C-suite)
/plugin install business-growth-skills@claude-code-skills       # 4 business & growth
/plugin install finance-skills@claude-code-skills               # 2 finance (analyst + SaaS metrics)

# Or install individual skills
/plugin install skill-security-auditor@claude-code-skills       # Security scanner
/plugin install playwright-pro@claude-code-skills                  # Playwright testing toolkit
/plugin install self-improving-agent@claude-code-skills         # Auto-memory curation
/plugin install content-creator@claude-code-skills              # Single skill

OpenAI Codex

npx agent-skills-cli add alirezarezvani/claude-skills --agent codex
# Or: git clone + ./scripts/codex-install.sh

OpenClaw

bash <(curl -s https://raw.githubusercontent.com/alirezarezvani/claude-skills/main/scripts/openclaw-install.sh)

Manual Installation

git clone https://github.com/alirezarezvani/claude-skills.git
# Copy any skill folder to ~/.claude/skills/ (Claude Code) or ~/.codex/skills/ (Codex)

Multi-Tool Support (New)

Convert all 345 skills to 9 AI coding tools with a single script:

ToolFormatInstall
Cursor.mdc rules./scripts/install.sh --tool cursor --target .
AiderCONVENTIONS.md./scripts/install.sh --tool aider --target .
Kilo Code.kilocode/rules/./scripts/install.sh --tool kilocode --target .
Windsurf.windsurf/skills/./scripts/install.sh --tool windsurf --target .
OpenCode.opencode/skills/./scripts/install.sh --tool opencode --target .
Augment.augment/rules/./scripts/install.sh --tool augment --target .
Antigravity~/.gemini/antigravity/skills/./scripts/install.sh --tool antigravity
Hermes Agent~/.hermes/skills/python scripts/sync-hermes-skills.py --verbose
Mistral Vibe~/.vibe/skills/./scripts/vibe-install.sh

How it works:

# 1. Convert all skills to all tools (takes ~15 seconds)
./scripts/convert.sh --tool all

# 2. Install into your project (with confirmation)
./scripts/install.sh --tool cursor --target /path/to/project

# Or use --force to skip confirmation:
./scripts/install.sh --tool aider --target . --force

# 3. Verify
find .cursor/rules -name "*.mdc" | wc -l  # Should show 346

Each tool gets:

  • ✅ All 345 skills converted to native format
  • ✅ Per-tool README with install/verify/update steps
  • ✅ Support for scripts, references, templates where applicable
  • ✅ Zero manual conversion work

Run ./scripts/convert.sh --tool all to generate tool-specific outputs locally.


Skills Overview

345 skills across 17 domains:

DomainSkillsHighlightsDetails
🔧 Engineering — Core51Architecture, frontend, backend, fullstack, QA, DevOps, SecOps, AI/ML, data, Playwright Pro (test gen, flaky fix, migrations), self-improving agent (auto-memory curation), security suite, a11y auditengineering-team/
⚡ Engineering — POWERFUL78Agent designer, RAG architect, database designer, CI/CD builder, security auditor, MCP builder, AgentHub, Helm charts, Terraform, self-eval, llm-wiki, tc-tracker, autoresearch-agent, reliability portfolio (feature-flags-architect, kubernetes-operator, chaos-engineering, slo-architect), ship-gate, security-guidance PreToolUse hook, Matt Pocock skills (write-a-skill, caveman, grill-me, handoff, grill-with-docs)engineering/
🎯 Product17Product manager, agile PO, strategist, UX researcher, UI design, landing pages, SaaS scaffolder, analytics, experiment designer, discovery, roadmap communicator, code-to-prd, apple-hig-expertproduct-team/
📣 Marketing468 pods: Content, SEO + AEO (aeo — E-E-A-T audit, citation tracking across 5 LLMs), CRO, Channels, Growth, Intelligence, Sales + context foundation + orchestration routermarketing-skill/
🚀 Productivity6capture (brain-dump-to-action), email pair (inbox-setup + inbox-triage), reflect (journal), handoff (Matt Pocock-inspired), andreessen (market-first decision mode)productivity/
🎨 Marketing (top-level)1landing — single-file HTML landing-page generator (4 design styles, GSAP patterns, brand palette validator)marketing/
🔬 Research (academic)8research orchestrator (hybrid router + fallback) + 7 specialists: pulse, litreview, grants (NIH), dossier, patent, syllabus, notebooklmresearch/
🧪 Research Operations ✨v2.9.05Enterprise/cross-functional research: orchestrator + clinical-research (study design), research-finance (R&D program finance), market-research (sizing/survey/segmentation), product-research (user research) — each with onboarding + customization + opt-in autoresearch bridgeresearch-ops/
📋 Project Management9Senior PM, scrum master, Jira, Confluence, Atlassian admin, templates + bundled Atlassian Remote MCPproject-management/
🏥 Regulatory & QM18ISO 13485, MDR 2017/745, FDA, ISO 27001, GDPR, SOC 2, CAPA, risk managementra-qm-team/
🛡️ Compliance OS9Compliance operating system — controls, evidence, audit-readiness workflowscompliance-os/
💼 C-Level Advisory66Full C-suite (CEO/CTO/CFO/CMO/CRO/CPO/COO/CHRO/CISO/GC/CDO/CAIO/CCO/VPE) + founder-mode agents + orchestration + board meetings + culture & collaborationc-level-advisor/
📈 Business & Growth5Customer success, sales engineer, revenue ops, contracts & proposals, BizDev toolkitbusiness-growth/
🏭 Business Operations7Orchestrator + process-mapper, vendor-management, capacity-planner, internal-comms, knowledge-ops, procurement-optimizerbusiness-operations/
🤝 Commercial8Orchestrator + pricing-strategist, deal-desk, partnerships-architect, channel-economics, commercial-policy, rfp-responder, commercial-forecastercommercial/
💰 Finance4Financial analyst (DCF, budgeting, forecasting), SaaS metrics coach, business investment advisorfinance/

Personas

Pre-configured agent identities with curated skill loadouts, workflows, and distinct communication styles. Personas go beyond "use these skills" — they define how an agent thinks, prioritizes, and communicates.

PersonaDomainBest For
Startup CTOEngineering + StrategyArchitecture decisions, tech stack selection, team building, technical due diligence
Growth MarketerMarketing + GrowthContent-led growth, launch strategy, channel optimization, bootstrapped marketing
Solo FounderCross-domainOne-person sta

Footnotes

  1. Hermes Agent is BYO-sync tier: the repo ships a pre-generated .hermes/skills/claude-skills/ tree, but you run python scripts/sync-hermes-skills.py once locally to install into ~/.hermes/skills/. Uses the same agentskills.io SKILL.md standard — no format conversion.

  2. Mistral Vibe is also BYO-sync tier: the repo ships a pre-generated .vibe/skills/claude-skills/ tree, run ./scripts/vibe-install.sh once locally to install into ~/.vibe/skills/. Same agentskills.io SKILL.md standard — no format conversion. Docs: https://docs.mistral.ai/mistral-vibe/agents-skills.

View source on GitHub