Observability Designer

Name: Observability Designer
Author: alirezarezvani

Design production-ready observability strategies combining metrics, logs, and traces. Includes SLI/SLO design, golden-signals monitoring, alert optimization. Use when adding observability to a new service, refactoring alerting that is too noisy, or designing an SLO program befor…

19k · 2.6k · Updated 3 days ago · Python · MIT

Skill Content

# Observability Designer (POWERFUL)

**Category:** Engineering  
**Tier:** POWERFUL  
**Description:** Design comprehensive observability strategies for production systems including SLI/SLO frameworks, alerting optimization, and dashboard generation.

## Overview

Observability Designer creates production-ready dashboards, alert configurations, and monitoring strategies across the three pillars (metrics, logs, traces).

**When NOT to use → slo-architect.** For SLO/SLI design with error-budget math, multi-window burn-rate alerting thresholds, and SLO review gates, route to `slo-architect`, the authoritative skill for that work. This skill's `slo_designer.py` produces a quick scaffold only. This skill's lane is dashboards (`dashboard_generator.py`) and alert-noise reduction (`alert_optimizer.py`).

## Quick Start

```bash
# Dashboard spec (Grafana JSON + docs) for a service
python3 scripts/dashboard_generator.py --service-type api --name payments --criticality critical --role sre --format grafana -o dashboard.json --doc-output dashboard.md

# Analyze an existing alert config for noise, duplicates, and coverage gaps
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report alert_report.json
# ...then emit the optimized config once the report is reviewed:
python3 scripts/alert_optimizer.py --input alerts.json --output alerts_optimized.json

# Quick SLO scaffold (hand off to slo-architect for the real error-budget work)
python3 scripts/slo_designer.py --service-type api --criticality high --user-facing true --service-name payments -o slo_scaffold.json
```

**Verification loop:** after deploying optimized alerts, track the report's noise metrics for one on-call rotation — if the actionable-alert ratio didn't improve, re-run `--analyze-only` against the live config and iterate. Import the generated dashboard into Grafana and confirm every golden-signal panel renders with live data before closing the task.
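
The before/after comparison in the verification loop comes down to a single ratio. The sketch below is illustrative only: the list-of-dicts shape and the `actionable` field are hypothetical and are not the schema of `alert_report.json`.

```python
def actionable_ratio(pages):
    """Share of pages that led to a real response action.

    `pages` is a hypothetical list of dicts, one per page received during
    an on-call rotation, each carrying a boolean "actionable" field.
    """
    if not pages:
        return None  # no pages in the window, so there is no ratio to report
    return sum(1 for page in pages if page["actionable"]) / len(pages)


# Hypothetical rotations: one before and one after deploying optimized alerts.
before = [{"actionable": a} for a in (True, False, False, True, False)]
after = [{"actionable": a} for a in (True, True, False, True)]

improved = actionable_ratio(after) > actionable_ratio(before)
print(f"before={actionable_ratio(before):.2f} "
      f"after={actionable_ratio(after):.2f} improved={improved}")
```

If `improved` is false after a full rotation, that is the signal to re-run `--analyze-only` against the live config.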

## Core Competencies

### SLI/SLO/SLA Framework Design
- **Service Level Indicators (SLI):** Define measurable signals that indicate service health
- **Service Level Objectives (SLO):** Set reliability targets based on user experience
- **Service Level Agreements (SLA):** Establish customer-facing commitments with consequences
- **Error Budget Management:** Calculate and track error budget consumption
- **Burn Rate Alerting:** Multi-window burn rate alerts for proactive SLO protection
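
To make the error-budget and burn-rate terms above concrete, here is a minimal sketch of the arithmetic. The function names and the 14.4 threshold are illustrative assumptions (14.4 is a commonly cited fast-burn value for a 30-day window); `slo-architect` remains the authoritative source for real thresholds.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. 0.999 -> about 0.001."""
    return 1.0 - slo_target


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / error_budget(slo_target)


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold):
    """Multi-window check: page only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resolves soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# Illustrative numbers: a 99.9% SLO, 2% errors over the long window and
# 3% over the short window.
print(round(burn_rate(0.02, 0.999), 1))        # roughly 20x budget
print(should_page(0.02, 0.03, 0.999, 14.4))    # both windows burning
print(should_page(0.02, 0.0005, 0.999, 14.4))  # short window has recovered
```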

### Three Pillars of Observability

#### Metrics
- **Golden Signals:** Latency, traffic, errors, and saturation monitoring
- **RED Method:** Rate, Errors, and Duration for request-driven services
- **USE Method:** Utilization, Saturation, and Errors for resource monitoring
- **Business Metrics:** Revenue, user engagement, and feature adoption tracking
- **Infrastructure Metrics:** CPU, memory, disk, network, and custom resource metrics
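
As a small illustration of the RED method listed above, the sketch below derives rate, error ratio, and duration from a batch of request records. The `(status_code, duration_seconds)` tuple shape is a hypothetical input format, and counting only 5xx responses as errors is an assumption; some teams also count selected 4xx codes.

```python
from statistics import fmean


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, Duration for one observation window.

    `requests` is a list of (status_code, duration_seconds) tuples.
    """
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_s": 0.0}
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "mean_duration_s": fmean([duration for _, duration in requests]),
    }


sample = [(200, 0.1), (200, 0.2), (500, 0.3), (404, 0.4)]
print(red_summary(sample, window_seconds=2))
```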

#### Logs
- **Structured Logging:** JSON-based log formats with consistent fields
- **Log Aggregation:** Centralized log collection and indexing strategies
- **Log Levels:** Appropriate use of DEBUG, INFO, WARN, ERROR, FATAL levels
- **Correlation IDs:** Request tracing through distributed systems
- **Log Sampling:** Volume management for high-throughput systems
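
The structured-logging and correlation-ID points above can be sketched with the standard library alone. The field names (`timestamp`, `level`, `logger`, `message`, `correlation_id`) and the `payments` logger name are illustrative choices, not a schema this skill prescribes.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a consistent field set."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Attached by callers via `extra=`; None when not provided.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = build_logger("payments")
request_id = str(uuid.uuid4())  # in practice, reuse the ID from the inbound request
log.info("charge accepted", extra={"correlation_id": request_id})
```

Passing the same `correlation_id` on every log call made while handling a request lets an aggregator stitch those lines together across services.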

#### Traces
- **Distributed Tracing:** End-to-end request flow visualization
- **Span Design:** Meaningful span boundaries and metadata
- **Trace Sampling:** Intelligent sampling strategies for performance and cost
- **Service Maps:** Automatic dependency discovery through traces
- **Root Cause Analysis:** Trace-driven debugging workflows
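
One simple form of trace sampling is a deterministic, head-based decision keyed on the trace ID, so every service that sees the same trace makes the same keep-or-drop choice. This is a minimal sketch of that idea; the function name and the use of SHA-256 are assumptions, and real tracing SDKs ship their own samplers.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Deterministic head-based sampling: same trace_id, same answer.

    The trace ID is hashed into a number in [0, 1) and kept when that
    number falls below `sample_rate` (0.0 keeps nothing, 1.0 keeps all).
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(f"kept {kept} of 10000 traces at a 10% target rate")
```

Head-based sampling decides before the outcome of the request is known, so it cannot preferentially keep slow or failed traces; that is the gap tail-based sampling is meant to close.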

### Dashboard Design Principles

#### Information Architecture
- **Hierarchy:** Overview → Service → Component → Instance drill-down paths
- **80/20 Split:** 80% operational metrics, 20% exploratory metrics
- **Cognitive Load:** Keep to roughly 7±2 panels per dashboard screen
- **User Journey:** Role-based dashboard personas (SRE, Developer, Executive)

#### Visualization Best Practices
- **Chart Selection:** Time series for trends, heatmaps for distributions, gauges for status
- **Color Theory:** Red for critical, amber for warning, green for healthy states
- **Reference Lines:** SLO targets, capacity thresholds, and historical baselines
- **Time Ranges:** Default to meaningful windows (4h for incidents, 7d for trends)

#### Panel Design
- **Metric Queries:** Efficient Prometheus/InfluxDB queries with proper aggregation
- **Alerting Integration:** Visual alert state indicators on relevant panels
- **Interactive Elements:** Template variables, drill-down links, and annotation overlays
- **Performance:** Sub-second render times through query optimization

### Alert Design and Optimization

#### Alert Classification
- **Severity Levels:** 
  - **Critical:** Service down, SLO burn rate high
  - **Warning:** Approaching thresholds, non-user-facing issues
  - **Info:** Deployment notifications, capacity planning alerts
- **Actionability:** Every alert must have a clear response action
- **Alert Routing:** Escalation policies based on severity and team ownership

#### Alert Fatigue Prevention
- **Signal vs Noise:** High precision (few false positives) over high recall
- **Hysteresis:** Different thresholds for firing and resolving alerts
- **Suppression:** Dependent alert suppression during known outages
- **Grouping:** Related alerts grouped into single notifications
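
Hysteresis is the easiest of these to show in code. The sketch below uses a higher threshold to start firing and a lower one to resolve, so a value hovering near a single threshold does not flap. The class name and the 0.90/0.75 thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlert:
    fire_above: float     # value at or above which the alert starts firing
    resolve_below: float  # value at or below which a firing alert resolves
    firing: bool = False

    def observe(self, value):
        """Feed one sample and return the alert state after it."""
        if not self.firing and value >= self.fire_above:
            self.firing = True
        elif self.firing and value <= self.resolve_below:
            self.firing = False
        return self.firing


alert = HysteresisAlert(fire_above=0.90, resolve_below=0.75)
states = [alert.observe(v) for v in (0.80, 0.92, 0.88, 0.91, 0.74, 0.80)]
print(states)  # [False, True, True, True, False, False]
```

With a single 0.90 threshold the same series would fire, resolve, and fire again within three samples; with the gap it fires once and resolves once.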

#### Alert Rule Design
- **Threshold Selection:** Statistical methods for threshold determination
- **Window Functions:** Appropriate averaging windows and percentile calculations
- **Alert Lifecycle:** Clear firing conditions and automatic resolution criteria
- **Testing:** Alert rule validation against historical data
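
A minimal sketch of statistical threshold selection and validation against historical data follows. Mean plus `k` standard deviations is only one of several possible methods (percentile-based thresholds are another), and the function names and sample values are hypothetical.

```python
import statistics


def suggest_threshold(history, k=3.0):
    """Mean plus k population standard deviations of historical samples."""
    return statistics.fmean(history) + k * statistics.pstdev(history)


def backtest(history, threshold):
    """Count how many historical samples would have fired the alert."""
    return sum(1 for value in history if value > threshold)


# Hypothetical latency history in milliseconds: steady with one spike.
history = [100] * 9 + [200]

for k in (2.0, 3.0):
    threshold = suggest_threshold(history, k)
    print(f"k={k}: threshold={threshold:.0f} ms, "
          f"would have fired {backtest(history, threshold)} time(s)")
```

Backtesting a candidate threshold this way shows how often it would have fired before anyone is paged by it.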

### Runbook Generation and Incident Response

#### Runbook Structure
- **Alert Context:** What the alert means and why it fired
- **Impact Assessment:** User-facing vs internal impact evaluation
- **Investigation Steps:** Ordered troubleshooting procedures with time estimates
- **Resolution Actions:** Common fixes and escalation procedures
- **Post-Incident:** Follow-up tasks and prevention measures

#### Incident Detection Patterns
- **Anomaly Detection:** Statistical methods for detecting unusual patterns
- **Composite Alerts:** Multi-signal alerts for complex failure modes
- **Predictive Alerts:** Capacity and trend-based forward-looking alerts
- **Canary Monitoring:** Early detection through progressive deployment monitoring
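
A predictive, trend-based alert can be as simple as extrapolating a straight line. The sketch below fits a least-squares line to recent usage samples and estimates the time until capacity is reached. The function name, the hourly sampling, and the linear-growth assumption are all illustrative; real capacity curves are often not linear.

```python
from statistics import fmean


def hours_until_full(hours, used_fraction, capacity=1.0):
    """Extrapolate a linear trend; return None if usage is flat or shrinking.

    `hours` and `used_fraction` are equal-length sequences of sample times
    and the fraction of the resource in use at each time.
    """
    mean_x, mean_y = fmean(hours), fmean(used_fraction)
    sxx = sum((x - mean_x) ** 2 for x in hours)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, used_fraction))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    current = slope * hours[-1] + intercept  # fitted value at the latest sample
    return (capacity - current) / slope


remaining = hours_until_full([0, 1, 2, 3], [0.5, 0.6, 0.7, 0.8])
print(f"disk projected to fill in about {remaining:.1f} hours")
```

An alert rule built on this would fire when the projected time drops below the lead time needed to add capacity.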

### Golden Signals Framework

#### Latency Monitoring
- **Request Latency:** P50, P95, P99 response time tracking
- **Queue Latency:** Time spent waiting in processing queues
- **Network Latency:** Inter-service communication delays
- **Database Latency:** Query execution and connection pool metrics
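
For reference, P50/P95/P99 can be computed from raw samples with the standard library. This is a sketch for offline analysis; the function name is hypothetical, and production systems usually derive percentiles from histogram buckets in the metrics backend instead.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return P50, P95, and P99 from raw latency samples in milliseconds.

    Needs at least two samples. With n=100, `quantiles` returns 99 cut
    points, so cuts[i] is the (i + 1)th percentile.
    """
    cuts = quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


result = latency_percentiles(list(range(1, 101)))
print(result)
```

Percentiles should be computed over the combined samples (or merged histograms) of all instances; averaging per-instance P99 values does not yield the fleet-wide P99.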

#### Traffic Monitoring
- **Request Rate:** Requests per second with burst detection
- **Bandwidth Usage:** Network throughput and capacity utilization
- **User Sessions:** Active user tracking and session duration
- **Feature Usage:** API endpoint and feature adoption metrics

#### Error Monitoring
- **Error Rate:** 4xx and 5xx HTTP response code tracking
- **Error Budget:** SLO-based error rate targets and consumption
- **Error Distribution:** Error type classification and trending
- **Silent Failures:** Detection of processing failures without HTTP errors

#### Saturation Monitoring
- **Resource Utilization:** CPU, memory, disk, and network usage
- **Queue Depth:** Processing queue length and wait times
- **Connection Pools:** Database and service connection saturation
- **Rate Limiting:** API throttling and quota exhaustion tracking

### Distributed Tracing Strategies

#### Trace Architecture
- **Sampling Strategy:** Head-based, tail-based, and adaptive sampling
- **Trace Propagation:** Context propagation across service boundaries
- **Span Correlation:** Parent-child relationship modeling
- **Trace Storage:** Retention policies and storage optimization
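
Context propagation and parent-child span correlation are often carried in a `traceparent` HTTP header. The sketch below assumes the W3C Trace Context header layout (`version-traceid-parentid-flags`), which this document does not itself mandate; the function names are hypothetical, and a tracing SDK would normally handle this for you.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_traceparent(incoming=None):
    """Keep the caller's trace ID and mint a new span ID for this hop."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No usable context: start a new trace. Marking it sampled is an
        # assumption of this sketch, not a recommendation.
        return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outgoing_traceparent(incoming))
```

The trace ID stays constant across hops while each service contributes its own span ID, which is what lets a backend rebuild the parent-child tree.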

#### Service Instrumentation
- **Auto-Instrumentation:** Framework-based automatic trace generation
- **Manual Instrumentation:** Custom span creation for business logic
- **Baggage Handling:** Cross-cutting concern propagation
- **Performance Impact:** Instrumentation overhead measurement and optimization

### Log Aggregation Patterns

#### Collection Architecture
- **Agent Deployment:** Log shipping agent strategies (push vs pull)
- **Log Routing:** Topic-based routing and filtering
- **Parsing Strategies:** Structured vs unstructured log handling
- **Schema Evolution:** Log format versioning and migration

#### Storage and Indexing
- **Index Design:** Optimized field indexing for common query patterns
- **Retention Policies:** Time and volume-based log retention
- **Compression:** Log data compression and archival strategies
- **Search Performance:** Query optimization and result caching

### Cost Optimization for Observability

#### Data Management
- **Metric Retention:** Tiered retention based on metric importance
- **Log Sampling:** Intelligent sampling to reduce ingestion costs
- **Trace Sampling:** Cost-effective trace collection strategies
- **Data Archival:** Cold storage for historical observability data

#### Resource Optimization
- **Query Efficiency:** Optimized metric and log queries
- **Storage Costs:** Appropriate storage tiers for different data types
- **Ingestion Rate Limiting:** Controlled data ingestion to manage costs
- **Cardinality Management:** High-cardinality metric detection and mitigation
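
High-cardinality detection amounts to counting distinct label combinations per metric. The sketch below does that over an in-memory list; the `(metric_name, labels)` input shape, the metric names, and the limit are hypothetical, and a real check would query the metrics backend instead.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets (time series) seen for each metric name.

    `samples` is an iterable of (metric_name, labels_dict) pairs.
    """
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def high_cardinality(samples, limit):
    """Names of metrics whose series count exceeds `limit`, sorted."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > limit)


samples = [
    ("http_requests_total", {"route": "/pay", "status": "200"}),
    ("http_requests_total", {"route": "/pay", "status": "500"}),
    ("http_requests_total", {"route": "/pay", "status": "200"}),  # repeat
    ("login_total", {"user_id": "1"}),  # per-user label: unbounded growth
    ("login_total", {"user_id": "2"}),
    ("login_total", {"user_id": "3"}),
]
print(series_per_metric(samples))
print(high_cardinality(samples, limit=2))
```

Labels that carry user IDs, request IDs, or raw URLs are the usual culprits, since each new value creates a new series.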

## Scripts Overview

This skill includes three Python scripts:

### 1. SLO Designer (`slo_designer.py`)
Generates a quick SLI/SLO scaffold based on service characteristics; for authoritative error-budget math and burn-rate thresholds, hand off to `slo-architect`:
- **Input:** Service description JSON (type, criticality, dependencies)
- **Output:** SLI definitions, SLO targets, error budgets, burn rate alerts, SLA recommendations
- **Features:** Multi-window burn rate calculations, error budget policies, alert rule generation

### 2. Alert Optimizer (`alert_optimizer.py`)
Analyzes and optimizes existing alert configurations:
- **Input:** Alert configuration JSON with rules, thresholds, and routing
- **Output:** Optimization report and improved alert configuration
- **Features:** Noise detection, coverage gaps, duplicate identification, threshold optimization

### 3. Dashboard Generator (`dashboard_generator.py`)
Creates comprehensive dashboard specifications:
- **Input:** Service/system description JSON
- **Output:** Grafana-compatible dashboard JSON and documentation
- **Features:** Golden signals coverage, RED/USE methods, drill-down paths, role-based views

## Integration Patterns

### Monitoring Stack Integration
- **Prometheus:** Metric collection and alerting rule generation
- **Grafana:** Dashboard creation and visualization configuration
- **Elasticsearch/Kibana:** Log analysis and dashboard integration
- **Jaeger/Zipkin:** Distributed tracing configuration and analysis

### CI/CD Integration
- **Pipeline Monitoring:** Build, test, and deployment observability
- **Deployment Correlation:** Release impact tracking and rollback triggers
- **Feature Flag Monitoring:** A/B test and feature rollout observability
- **Performance Regression:** Automated performance monitoring in pipelines

### Incident Management Integration
- **PagerDuty/VictorOps:** Alert routing and escalation policies
- **Slack/Teams:** Notification and collaboration integration
- **JIRA/ServiceNow:** Incident tracking and resolution workflows
- **Post-Mortem:** Automated incident analysis and improvement tracking

## Advanced Patterns

### Multi-Cloud Observability
- **Cross-Cloud Metrics:** Unified metrics across AWS, GCP, Azure
- **Network Observability:** Inter-cloud connectivity monitoring
- **Cost Attribution:** Cloud resource cost tracking and optimization
- **Compliance Monitoring:** Security and compliance posture tracking

### Microservices Observability
- **Service Mesh Integration:** Istio/Linkerd observability configuration
- **API Gateway Monitoring:** Request routing and rate limiting observability
- **Container Orchestration:** Kubernetes cluster and workload monitoring
- **Service Discovery:** Dynamic service monitoring and health checks

### Machine Learning Observability
- **Model Performance:** Accuracy, drift, and bias monitoring
- **Feature Store Monitoring:** Feature quality and freshness tracking
- **Pipeline Observability:** ML pipeline execution and performance monitoring
- **A/B Test Analysis:** Statistical significance and business impact measurement

## Best Practices

### Organizational Alignment
- **SLO Setting:** Collaborative target setting between product and engineering
- **Alert Ownership:** Clear escalation paths and team responsibilities
- **Dashboard Governance:** Centralized dashboard management and standards
- **Training Programs:** Team education on observability tools and practices

### Technical Excellence
- **Infrastructure as Code:** Observability configuration version control
- **Testing Strategy:** Alert rule testing and dashboard validation
- **Performance Monitoring:** Observability system performance tracking
- **Security Considerations:** Access control and data privacy in observability

### Continuous Improvement
- **Metrics Review:** Regular SLI/SLO effectiveness assessment
- **Alert Tuning:** Ongoing alert threshold and routing optimization
- **Dashboard Evolution:** User feedback-driven dashboard improvements
- **Tool Evaluation:** Regular assessment of observability tool effectiveness

How to use

1. Copy the skill content above.
2. Create a `.claude/skills` directory in your project.
3. Save it as `.claude/skills/claude-skills-observability-designer.md`.
4. Use `/claude-skills-observability-designer` in Claude Code to invoke this skill.

README

Claude Code Skills & Plugins — Agent Skills for Every Coding Tool

345 production-ready Claude Code skills, plugins, and agent skills for 13 AI coding tools.

The most comprehensive open-source library of Claude Code skills and agent plugins — also works with OpenAI Codex, Gemini CLI, Cursor, and 9 more coding agents. Reusable expertise packages covering engineering, DevOps, marketing (incl. AEO — Answer Engine Optimization for LLM citation), security (PreToolUse hooks), compliance, C-level advisory (incl. founder-mode CFO/CMO/CRO/CPO/COO/CHRO/CISO/GC/CDO/CAIO/CCO/VPE personas + 21 /cs:* slash commands), productivity (capture/email/reflect), an academic research stack (litreview/grants/dossier/patent/syllabus/pulse/notebooklm + hybrid router), and enterprise Research Operations (clinical-research/research-finance/market-research/product-research, v2.9.0).

Works with: Claude Code · OpenAI Codex · Gemini CLI · OpenClaw · Hermes Agent¹ · Mistral Vibe² · Cursor · Aider · Windsurf · Kilo Code · OpenCode · Augment · Antigravity

5,200+ GitHub stars.

What Are Claude Code Skills & Agent Plugins?

Claude Code skills (also called agent skills or coding agent plugins) are modular instruction packages that give AI coding agents domain expertise they don't have out of the box. Each skill includes:

- **SKILL.md:** structured instructions, workflows, and decision frameworks
- **Python tools:** 579 CLI scripts (all stdlib-only, zero pip installs)
- **Reference docs:** 702 templates, checklists, and domain-specific knowledge files

One repo, thirteen platforms. Works natively as Claude Code plugins, Codex agent skills, Gemini CLI skills, Hermes Agent skills, Mistral Vibe skills, and converts to more tools via scripts/convert.sh. All 579 Python tools run anywhere Python runs.

Skills vs Agents vs Personas

| | Skills | Agents | Personas |
|---|---|---|---|
| Purpose | How to execute a task | What task to do | Who is thinking |
| Scope | Single domain | Single domain | Cross-domain |
| Voice | Neutral | Professional | Personality-driven |
| Example | "Follow these steps for SEO" | "Run a security audit" | "Think like a startup CTO" |

All three work together. See Orchestration for how to combine them.

Quick Install

Gemini CLI (New)

# Clone the repository
git clone https://github.com/alirezarezvani/claude-skills.git
cd claude-skills

# Run the setup script
./scripts/gemini-install.sh

# Start using skills
> activate_skill(name="senior-architect")

Claude Code (Recommended)

# Add the marketplace
/plugin marketplace add alirezarezvani/claude-skills

# Install by domain
/plugin install engineering-skills@claude-code-skills          # 24 core engineering
/plugin install engineering-advanced-skills@claude-code-skills  # 25 POWERFUL-tier
/plugin install product-skills@claude-code-skills               # 12 product skills
/plugin install marketing-skills@claude-code-skills             # 43 marketing skills
/plugin install ra-qm-skills@claude-code-skills                 # 12 regulatory/quality
/plugin install pm-skills@claude-code-skills                    # 6 project management
/plugin install c-level-skills@claude-code-skills               # 28 C-level advisory (full C-suite)
/plugin install business-growth-skills@claude-code-skills       # 4 business & growth
/plugin install finance-skills@claude-code-skills               # 2 finance (analyst + SaaS metrics)

# Or install individual skills
/plugin install skill-security-auditor@claude-code-skills       # Security scanner
/plugin install playwright-pro@claude-code-skills                  # Playwright testing toolkit
/plugin install self-improving-agent@claude-code-skills         # Auto-memory curation
/plugin install content-creator@claude-code-skills              # Single skill

OpenAI Codex

npx agent-skills-cli add alirezarezvani/claude-skills --agent codex
# Or: git clone + ./scripts/codex-install.sh

OpenClaw

bash <(curl -s https://raw.githubusercontent.com/alirezarezvani/claude-skills/main/scripts/openclaw-install.sh)

Manual Installation

git clone https://github.com/alirezarezvani/claude-skills.git
# Copy any skill folder to ~/.claude/skills/ (Claude Code) or ~/.codex/skills/ (Codex)

Multi-Tool Support (New)

Convert all 345 skills to 9 AI coding tools with a single script:

| Tool | Format | Install |
|---|---|---|
| Cursor | `.mdc` rules | `./scripts/install.sh --tool cursor --target .` |
| Aider | `CONVENTIONS.md` | `./scripts/install.sh --tool aider --target .` |
| Kilo Code | `.kilocode/rules/` | `./scripts/install.sh --tool kilocode --target .` |
| Windsurf | `.windsurf/skills/` | `./scripts/install.sh --tool windsurf --target .` |
| OpenCode | `.opencode/skills/` | `./scripts/install.sh --tool opencode --target .` |
| Augment | `.augment/rules/` | `./scripts/install.sh --tool augment --target .` |
| Antigravity | `~/.gemini/antigravity/skills/` | `./scripts/install.sh --tool antigravity` |
| Hermes Agent | `~/.hermes/skills/` | `python scripts/sync-hermes-skills.py --verbose` |
| Mistral Vibe | `~/.vibe/skills/` | `./scripts/vibe-install.sh` |

How it works:

# 1. Convert all skills to all tools (takes ~15 seconds)
./scripts/convert.sh --tool all

# 2. Install into your project (with confirmation)
./scripts/install.sh --tool cursor --target /path/to/project

# Or use --force to skip confirmation:
./scripts/install.sh --tool aider --target . --force

# 3. Verify
find .cursor/rules -name "*.mdc" | wc -l  # Should show 346

Each tool gets:

✅ All 345 skills converted to native format
✅ Per-tool README with install/verify/update steps
✅ Support for scripts, references, templates where applicable
✅ Zero manual conversion work

Run ./scripts/convert.sh --tool all to generate tool-specific outputs locally.

Skills Overview

345 skills across 17 domains:

| Domain | Skills | Highlights | Details |
|---|---|---|---|
| 🔧 Engineering - Core | 51 | Architecture, frontend, backend, fullstack, QA, DevOps, SecOps, AI/ML, data, Playwright Pro (test gen, flaky fix, migrations), self-improving agent (auto-memory curation), security suite, a11y audit | engineering-team/ |
| ⚡ Engineering - POWERFUL | 78 | Agent designer, RAG architect, database designer, CI/CD builder, security auditor, MCP builder, AgentHub, Helm charts, Terraform, self-eval, llm-wiki, tc-tracker, autoresearch-agent, reliability portfolio (feature-flags-architect, kubernetes-operator, chaos-engineering, slo-architect), ship-gate, security-guidance PreToolUse hook, Matt Pocock skills (write-a-skill, caveman, grill-me, handoff, grill-with-docs) | engineering/ |
| 🎯 Product | 17 | Product manager, agile PO, strategist, UX researcher, UI design, landing pages, SaaS scaffolder, analytics, experiment designer, discovery, roadmap communicator, code-to-prd, apple-hig-expert | product-team/ |
| 📣 Marketing | 46 | 8 pods: Content, SEO + AEO (`aeo`: E-E-A-T audit, citation tracking across 5 LLMs), CRO, Channels, Growth, Intelligence, Sales + context foundation + orchestration router | marketing-skill/ |
| 🚀 Productivity | 6 | `capture` (brain-dump-to-action), `email` pair (inbox-setup + inbox-triage), `reflect` (journal), `handoff` (Matt Pocock-inspired), `andreessen` (market-first decision mode) | productivity/ |
| 🎨 Marketing (top-level) | 1 | `landing`: single-file HTML landing-page generator (4 design styles, GSAP patterns, brand palette validator) | marketing/ |
| 🔬 Research (academic) | 8 | `research` orchestrator (hybrid router + fallback) + 7 specialists: `pulse`, `litreview`, `grants` (NIH), `dossier`, `patent`, `syllabus`, `notebooklm` | research/ |
| 🧪 Research Operations ✨v2.9.0 | 5 | Enterprise/cross-functional research: orchestrator + `clinical-research` (study design), `research-finance` (R&D program finance), `market-research` (sizing/survey/segmentation), `product-research` (user research); each with onboarding + customization + opt-in autoresearch bridge | research-ops/ |
| 📋 Project Management | 9 | Senior PM, scrum master, Jira, Confluence, Atlassian admin, templates + bundled Atlassian Remote MCP | project-management/ |
| 🏥 Regulatory & QM | 18 | ISO 13485, MDR 2017/745, FDA, ISO 27001, GDPR, SOC 2, CAPA, risk management | ra-qm-team/ |
| 🛡️ Compliance OS | 9 | Compliance operating system: controls, evidence, audit-readiness workflows | compliance-os/ |
| 💼 C-Level Advisory | 66 | Full C-suite (CEO/CTO/CFO/CMO/CRO/CPO/COO/CHRO/CISO/GC/CDO/CAIO/CCO/VPE) + founder-mode agents + orchestration + board meetings + culture & collaboration | c-level-advisor/ |
| 📈 Business & Growth | 5 | Customer success, sales engineer, revenue ops, contracts & proposals, BizDev toolkit | business-growth/ |
| 🏭 Business Operations | 7 | Orchestrator + process-mapper, vendor-management, capacity-planner, internal-comms, knowledge-ops, procurement-optimizer | business-operations/ |
| 🤝 Commercial | 8 | Orchestrator + pricing-strategist, deal-desk, partnerships-architect, channel-economics, commercial-policy, rfp-responder, commercial-forecaster | commercial/ |
| 💰 Finance | 4 | Financial analyst (DCF, budgeting, forecasting), SaaS metrics coach, business investment advisor | finance/ |

Personas

Pre-configured agent identities with curated skill loadouts, workflows, and distinct communication styles. Personas go beyond "use these skills" — they define how an agent thinks, prioritizes, and communicates.

| Persona | Domain | Best For |
|---|---|---|
| Startup CTO | Engineering + Strategy | Architecture decisions, tech stack selection, team building, technical due diligence |
| Growth Marketer | Marketing + Growth | Content-led growth, launch strategy, channel optimization, bootstrapped marketing |
| Solo Founder | Cross-domain | One-person sta |

…

Footnotes

¹ Hermes Agent is BYO-sync tier: the repo ships a pre-generated .hermes/skills/claude-skills/ tree, but you run python scripts/sync-hermes-skills.py once locally to install into ~/.hermes/skills/. It uses the same agentskills.io SKILL.md standard, so no format conversion is needed.
² Mistral Vibe is also BYO-sync tier: the repo ships a pre-generated .vibe/skills/claude-skills/ tree, and you run ./scripts/vibe-install.sh once locally to install into ~/.vibe/skills/. It uses the same agentskills.io SKILL.md standard, so no format conversion is needed. Docs: https://docs.mistral.ai/mistral-vibe/agents-skills.