
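Usage
About
Define service level objectives, create error budget policies, design incident response processes, develop capacity models, and generate monitoring configurations and automation scripts for production systems. Use it when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale.
SRE Engineer
Core Workflow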
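- Assess reliability - Review the architecture, SLOs, incident history, and toil levels
- Define SLOs - Identify meaningful SLIs and set appropriate targets
- Verify alignment - Confirm that SLO targets reflect user expectations before continuing
- Implement monitoring - Build golden-signal dashboards and alerts
- Automate toil - Identify repetitive tasks and build automation
- Test resilience - Design and run chaos experiments; verify that recovery meets RTO/RPO targets before marking an experiment complete; validate recovery behavior end to end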
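Reference Guide
Load detailed guidance based on context:
| Topic | Reference | Load when |
|------|------|----------|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error budgets | references/error-budget-policy.md | Managing budgets, burn rates, policies |
| Monitoring | references/monitoring-alerting.md | Golden signals, alert design, dashboards |
| Automation | references/automation-toil.md | Reducing toil, automation patterns |
| Incidents | references/incident-chaos.md | Incident response, chaos engineering |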
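Constraints
Must do
- Define quantitative SLOs (e.g., 99.9% availability)
- Calculate error budgets from SLO targets
- Monitor the golden signals (latency, traffic, errors, saturation)
- Write a blameless postmortem for every incident
- Measure toil and track progress in reducing it
- Automate repetitive operational tasks
- Use chaos engineering to test failure scenarios
- Balance reliability against feature delivery velocity
Must not
- Set SLOs without a user-impact justification
- Alert on symptoms without an actionable runbook
- Tolerate more than 50% toil without an automation plan
- Skip postmortems or assign blame
- Implement manual processes for recurring tasks
- Deploy without capacity planning
- Ignore error budget exhaustion
- Build systems that cannot degrade gracefully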
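The toil constraints above (measure toil, and do not tolerate more than 50% without an automation plan) can be checked with a small calculation. The following is a minimal sketch, assuming time is already logged as hours per category; the category names, the `toil_fraction` helper, and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
TOIL_LIMIT = 0.5  # the 50% ceiling named in the constraints


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours that falls into toil categories."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing logged, so nothing to classify
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


# Hypothetical week of logged work for one engineer.
week = {"manual_deploys": 12.0, "ticket_triage": 10.0, "project_work": 18.0}
fraction = toil_fraction(week, {"manual_deploys", "ticket_triage"})
needs_automation_plan = fraction > TOIL_LIMIT
print(f"toil: {fraction:.0%}, automation plan required: {needs_automation_plan}")
```

Tracking this number per sprint or per week gives the "progress in reducing toil" trend that the constraints ask for.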
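Output Templates
When implementing SRE practices, provide:
- SLO definitions with SLI measurements and targets
- Monitoring/alerting configuration (Prometheus, etc.)
- Automation scripts (Python, Go, Terraform)
- Runbooks with clear remediation steps
- A brief explanation of the reliability impact
Concrete Examples
SLO Definition and Error Budget Calculation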
# 99.9% availability SLO over a 30-day window
# Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
# Error budget (request-based): 0.001 * total_requests
# Example: 10M requests/month → 10,000 error budget requests
# If 5,000 errors are consumed in week 1 → 50% of the budget burned in ~23% of the window (7 of 30 days)
# → Trigger error budget policy: freeze non-critical releases
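The same arithmetic can be expressed as code. This is a minimal sketch, assuming a request-based SLO over a fixed window; the `ErrorBudget` class and its method names are hypothetical and not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    window_days: int = 30  # length of the SLO window

    def allowed_downtime_minutes(self) -> float:
        """Time-based budget: minutes of full outage the window tolerates."""
        return (1 - self.slo_target) * self.window_days * 24 * 60

    def allowed_failed_requests(self, total_requests: int) -> float:
        """Request-based budget: failed requests the window tolerates."""
        return (1 - self.slo_target) * total_requests

    def consumed_fraction(self, failed_requests: int, total_requests: int) -> float:
        """Share of the request-based budget already spent."""
        return failed_requests / self.allowed_failed_requests(total_requests)


budget = ErrorBudget(slo_target=0.999)
print(f"allowed downtime: {budget.allowed_downtime_minutes():.1f} min")
print(f"error budget: {budget.allowed_failed_requests(10_000_000):.0f} requests")
print(f"consumed: {budget.consumed_fraction(5_000, 10_000_000):.0%}")
```

Comparing `consumed_fraction` with the fraction of the window that has elapsed is one simple way to decide when an error budget policy, such as a release freeze, should apply.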
Prometheus SLO Alert Rules (Multi-Window Burn Rate)
groups:
  - name: slo_availability
    rules:
      # Fast burn: 2% budget in 1h (14.4x burn rate)
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.014400
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.014400
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate detected"
          runbook: "https://wiki.internal/runbooks/high-error-burn"
      # Slow burn: sustained 1x burn rate over 6h (error ratio above the 0.1% budget).
      # Burning 5% of the budget in 6h would instead be a 6x burn rate (> 0.006).
      - alert: SlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.001
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
          runbook: "https://wiki.internal/runbooks/slow-error-burn"
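Thresholds for burn-rate rules like these follow from two quantities: the share of the budget an alert should catch and the length of the alert window. The sketch below shows that derivation, assuming a 30-day SLO window and a 99.9% target; `burn_rate` and `error_ratio_threshold` are hypothetical helper names.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(rate: float, slo_target: float) -> float:
    """Error ratio to compare against in the alert expression for a given burn rate."""
    return rate * (1 - slo_target)


SLO_TARGET = 0.999

fast = burn_rate(budget_fraction=0.02, alert_window_hours=1)  # 2% of budget in 1h
slow = burn_rate(budget_fraction=0.05, alert_window_hours=6)  # 5% of budget in 6h

print(f"fast burn: {fast:.1f}x -> threshold {error_ratio_threshold(fast, SLO_TARGET):.4f}")
print(f"slow burn: {slow:.1f}x -> threshold {error_ratio_threshold(slow, SLO_TARGET):.4f}")
```

A burn rate of 1x corresponds to an error ratio equal to the budget itself (0.001 for a 99.9% target), which would spend exactly the whole budget over the full window.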
PromQL Golden Signal Queries
# Latency — 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Traffic — requests per second by service
sum(rate(http_requests_total[5m])) by (service)
# Errors — error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# Saturation — CPU throttling ratio
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
Toil Automation Script (Python)
#!/usr/bin/env python3
"""Auto-remediation: restart a deployment whose error rate exceeds a threshold."""
import json
import subprocess
import sys
import urllib.parse
import urllib.request

ERROR_THRESHOLD = 0.05  # 5% error rate triggers restart

def get_error_rate(service: str) -> float:
    """Query Prometheus for the service's 5-minute error ratio."""
    query = (
        f'sum(rate(http_requests_total{{status=~"5..",service="{service}"}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    )
    # URL-encode the query: raw PromQL contains spaces, braces, and quotes.
    url = "http://prometheus:9090/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    # An empty result means no matching series (e.g. no traffic); treat it as 0.
    return float(result[0]["value"][1]) if result else 0.0

def restart_pods(service: str) -> None:
    """Trigger a rolling restart of the service's deployment."""
    subprocess.run(["kubectl", "rollout", "restart", f"deployment/{service}"], check=True)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} <service>")
    svc = sys.argv[1]
    rate = get_error_rate(svc)
    if rate > ERROR_THRESHOLD:
        print(f"Error rate {rate:.2%} > {ERROR_THRESHOLD:.2%}, restarting {svc}")
        restart_pods(svc)
Knowledge Reference
SLO/SLI, error budgets, Prometheus, Grafana, PagerDuty, chaos engineering, Litmus Chaos, Chaos Monkey, Gremlin, Kubernetes, Terraform, incident management, postmortems, capacity planning
Compatible Tools
Claude Code, Cursor
Tags
Operations, Deployment


