Monitoring & Alerting

by @srepro (Verified Source)

4.7 · 1,150 installs · v1.0.0 · Updated May 25, 2026

About

Implement observability with Prometheus, Grafana, PagerDuty, and structured alerting for production systems.

Monitoring Operations

Comprehensive observability patterns covering the three pillars (metrics, logging, tracing), alerting strategies, dashboard design, and infrastructure monitoring for production systems.

Three Pillars Quick Reference

Use this table to decide which observability signal fits your need:

| Pillar | Best For | Tools | Data Type |
|--------|----------|-------|-----------|
| Metrics | Aggregated numeric measurements, trends, alerting on thresholds | Prometheus, Datadog, CloudWatch, StatsD | Time-series (numeric) |
| Logs | Discrete events, error details, audit trails, debugging context | Loki, ELK, CloudWatch Logs, Fluentd | Unstructured/structured text |
| Traces | Request flow across services, latency breakdown, dependency mapping | Jaeger, Tempo, Zipkin, Datadog APM | Span trees (structured) |

When to use which:

- "How many requests per second?" → Metrics (counter + rate)
- "Why did this specific request fail?" → Logs (error message + stack trace)
- "Where is the latency in this request?" → Traces (span waterfall)
- "Is the system healthy right now?" → Metrics (gauges + alerts)
- "What happened at 3:42 AM?" → Logs (timestamped event search)
- "Which downstream service caused the timeout?" → Traces (span analysis)

Correlation is key: Connect all three by embedding trace_id in log entries, recording exemplars in metrics, and linking trace spans to log queries.
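
As a minimal sketch of the first correlation technique (embedding trace_id in log entries), the snippet below uses only Python's standard `logging` module. The logger name `checkout`, the `JsonFormatter` class, and the `new_trace_id` helper are hypothetical; in a real service the ID would come from your tracing library rather than `uuid`.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries a trace_id field."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is attached per call through `extra`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def new_trace_id() -> str:
    # Stand-in for the ID a tracing library would supply (32 hex characters).
    return uuid.uuid4().hex


buffer = io.StringIO()
log = build_logger(buffer)
trace_id = new_trace_id()
log.error("payment provider timed out", extra={"trace_id": trace_id})
logged = json.loads(buffer.getvalue())
```

With the ID present in every log line, a log query filtered on `trace_id` returns exactly the events belonging to one traced request.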

Metrics Type Decision Tree

Use this tree to select the correct metric type:

What are you measuring?
│
├─ A count of events that only goes up?
│  └─ COUNTER
│     Examples: http_requests_total, errors_total, bytes_sent_total
│     Use rate() or increase() to get per-second or per-interval values
│     Never use a counter's raw value: it resets to zero when the process restarts
│
├─ A current value that goes up AND down?
│  └─ GAUGE
│     Examples: temperature_celsius, active_connections, queue_depth
│     Use for snapshots of current state
│     Can use avg_over_time(), max_over_time() for trends
│
├─ A distribution of values (latency, size)?
│  │
│  ├─ Need aggregatable quantiles across instances?
│  │  └─ HISTOGRAM
│  │     Examples: http_request_duration_seconds, response_size_bytes
│  │     Define buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
│  │     Use histogram_quantile() for percentiles (p50, p95, p99)
│  │     Aggregatable across instances (bucket counts with identical boundaries can be summed)
│  │
│  └─ Need pre-calculated quantiles on a single instance?
│     └─ SUMMARY
│        Examples: go_gc_duration_seconds
│        Pre-calculates quantiles client-side
│        NOT aggregatable across instances
│        Prefer histogram unless you have a specific reason
│
└─ None of the above?
   └─ INFO metric (labels only, value=1)
      Examples: build_info{version="1.2.3", commit="abc123"}
      Use for metadata exposed as metrics
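
To illustrate why the tree warns against raw counter values, here is a rough sketch of a reset-aware increase calculation. It mirrors the idea behind `rate()` and `increase()` but is not Prometheus's implementation (which also extrapolates to the window boundaries); the function names and sample values are made up for illustration.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from zero, so the whole current value is new.
            total += current
    return total


def counter_rate(samples: Sequence[float], window_seconds: float) -> float:
    """Average per-second increase over the window covered by the samples."""
    return counter_increase(samples) / window_seconds


# Hypothetical scrapes of http_requests_total taken 15 s apart.
# The process restarts between the third and fourth scrape.
scrapes = [100.0, 130.0, 160.0, 10.0, 40.0]

naive_delta = scrapes[-1] - scrapes[0]        # negative: misleading
true_increase = counter_increase(scrapes)     # reset-aware
per_second = counter_rate(scrapes, 60.0)
```

The naive difference between the last and first raw values goes negative after the restart, while the reset-aware version still reports the requests that were actually served.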

Rule of thumb: Start with counters and histograms. Add gauges for current state. Avoid summaries unless you have a compelling reason.
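
The preference for histograms comes from their aggregatability. The sketch below shows the general idea under simplifying assumptions: two instances expose cumulative bucket counts with identical boundaries, the counts are summed, and a quantile is estimated by linear interpolation inside the matching bucket. It approximates what `histogram_quantile()` does but is not the Prometheus implementation; the bucket bounds and counts are invented.

```python
from typing import List, Tuple

# (upper bound "le", cumulative count of observations <= that bound)
Bucket = Tuple[float, float]
INF = float("inf")


def merge(a: List[Bucket], b: List[Bucket]) -> List[Bucket]:
    """Sum cumulative counts of two histograms that share bucket bounds."""
    return [(le, count_a + count_b) for (le, count_a), (_, count_b) in zip(a, b)]


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile by interpolating within the matching bucket."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == INF:
                # Cannot interpolate into the +Inf bucket; report its lower edge.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical http_request_duration_seconds buckets from two instances.
instance_a = [(0.1, 50), (0.5, 90), (1.0, 100), (INF, 100)]
instance_b = [(0.1, 10), (0.5, 60), (1.0, 95), (INF, 100)]

combined = merge(instance_a, instance_b)
p50 = estimate_quantile(0.50, combined)
p95 = estimate_quantile(0.95, combined)
```

A summary cannot be combined this way, because each instance reports only its own finished quantiles and averaging those does not yield the fleet-wide quantile.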

Alerting Decision Tree

What type of alert do you need?
│
├─ Known threshold with a fixed boundary?
│  └─ THRESHOLD-BASED
│     Example: CPU > 90% for 5 minutes
│     Pros: Simple, predictable, easy to understand
│     Cons: Requires manual tuning, doesn't adapt to patterns
│     Best for: Resource limits, error rate spikes, queue depth
│
├─ Normal behavior varies by time/season?
│  └─ ANOMALY-BASED
│     Example: Traffic 3 standard deviations below normal for this hour
│     Pros: Adapts to patterns, catches novel failures
│     Cons: Noisy during transitions, requires training data
│     Best for: Traffic patterns, business metrics, gradual degradation
│
└─ Defined reliability targets?
   └─ SLO-BASED (PREFERRED)
      Example: Error budget burn rate > 14.4x for 1 hour
      Pros: Aligned with user impact, reduces noise, principled
      Cons: Requires SLI/SLO definition, more complex setup
      Best for: User-facing services, platform reliability
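
The burn-rate example in the tree can be made concrete with a little arithmetic. The sketch below assumes a 99.9% availability SLO measured over a 30-day (720-hour) period; neither figure comes from the text above, and the function names are illustrative. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target).

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float, period_hours: float) -> float:
    """Share of the whole period's error budget spent during the window."""
    return rate * window_hours / period_hours


def should_page(error_ratio: float, slo_target: float, threshold: float = 14.4) -> bool:
    """True when the burn rate over the evaluated window exceeds the threshold."""
    return burn_rate(error_ratio, slo_target) > threshold


SLO_TARGET = 0.999        # assumed 99.9% SLO
PERIOD_HOURS = 30 * 24    # assumed 30-day SLO period

# A sustained 14.4x burn for 1 hour spends about 2% of the 30-day budget.
spent_in_one_hour = budget_consumed(14.4, 1.0, PERIOD_HOURS)

# Hypothetical 1-hour error ratios: 2% of requests failing vs. 1%.
page_at_two_percent = should_page(0.02, SLO_TARGET)   # burn rate about 20x
page_at_one_percent = should_page(0.01, SLO_TARGET)   # burn rate about 10x
```

Under these assumptions the 14.4x threshold corresponds to a 1.44% error ratio, which is why the 2% case pages and the 1% case does not.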

Severity Levels

| Severity | Response | Examples | Routing |
|----------|----------|----------|---------|
| Critical (P1) | Page on-call immediately | Service down, data loss risk, security breach | PagerDuty high-urgency, phone call |
| Warning (P2) | Investigate within hours | Elevated error rate, disk 80% full, SLO burn rate elevated | PagerDuty low-urgency, Slack alert channel |
| Info (P3) | Review next business day | Deployment completed, certificat

Compatible Tools

Claude Code, Cursor
