
name: slo-implementation
description: "A framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets."
risk: unknown
source: community
date_added: "2026-02-27"
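SLO Implementation
A framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Do not use this skill when
- The task is unrelated to SLO implementation
- You need a different domain or tool outside this scope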
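Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate the results.
- Provide actionable steps and verification.
- If you need detailed examples, open resources/implementation-playbook.md.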
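Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability against the speed of innovation.
When to use this skill
- Defining service reliability targets
- Measuring user-perceived reliability
- Implementing error budgets
- Creating SLO-based alerts
- Tracking reliability goals
SLI/SLO/SLA Hierarchy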
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
Defining SLIs
Common SLI Types
1. Availability SLI
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
2. Latency SLI
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
3. Durability SLI
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
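The three queries above share one shape: good events divided by total events. A minimal Python sketch of that arithmetic follows, assuming the counts have already been read from the monitoring system. The function name `sli_ratio` and the sample counts are illustrative and do not come from the original.

```python
def sli_ratio(good_events: float, total_events: float) -> float:
    """Return the fraction of good events in a measurement window."""
    if good_events < 0 or total_events < 0:
        raise ValueError("event counts must be non-negative")
    if good_events > total_events:
        raise ValueError("good events cannot exceed total events")
    if total_events == 0:
        # Assumption: with no traffic there is nothing to fail, so report 1.0.
        return 1.0
    return good_events / total_events


# Hypothetical counts for a 28-day window.
availability = sli_ratio(good_events=999_500, total_events=1_000_000)
latency = sli_ratio(good_events=992_000, total_events=1_000_000)
```

How to treat a window with zero traffic is a design choice; returning 1.0 here is only one option, and some teams prefer to report no data instead.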
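Reference: see references/slo-definitions.md
Setting SLO Targets
Availability SLO Examples
| SLO %  | Downtime per month (30 days) | Downtime per year |
|--------|------------------------------|-------------------|
| 99%    | 7.2 hours                    | 3.65 days         |
| 99.9%  | 43.2 minutes                 | 8.76 hours        |
| 99.95% | 21.6 minutes                 | 4.38 hours        |
| 99.99% | 4.32 minutes                 | 52.56 minutes     |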
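The downtime figures in the table follow directly from the SLO percentage and the length of the period. The sketch below shows one way to reproduce them, assuming a 30-day month and a 365-day year; the helper name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime a given availability SLO permits over a period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    unavailable_fraction = (100 - slo_percent) / 100
    # Multiplying a timedelta by a float is supported and rounds to microseconds.
    return period * unavailable_fraction


MONTH = timedelta(days=30)   # assumption: a 30-day month
YEAR = timedelta(days=365)   # assumption: a 365-day year

for slo in (99, 99.9, 99.95, 99.99):
    print(slo, allowed_downtime(slo, MONTH), allowed_downtime(slo, YEAR))
```

Changing `MONTH` to `timedelta(days=28)` gives the allowance for a 28-day rolling window, which is slightly smaller than the monthly figures.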
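Choosing an Appropriate SLO
Considerations:
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
Example SLOs: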
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))
  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
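To make the relationship between a target and a measured SLI concrete, here is a small Python sketch that checks measured ratios against the two example SLOs. The `Slo` class and the measured values are assumptions made for illustration; in practice the ratios would come from the SLI queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target_percent: float  # e.g. 99.9 means 99.9% of events must be good
    window: str            # kept as a label only, e.g. "28d"

    def is_met(self, sli_ratio: float) -> bool:
        """Compare a measured SLI ratio (0..1) against the target."""
        return sli_ratio * 100 >= self.target_percent


slos = [
    Slo(name="api_availability", target_percent=99.9, window="28d"),
    Slo(name="api_latency_p95", target_percent=99, window="28d"),
]

# Hypothetical measured ratios for the current window.
measured = {"api_availability": 0.9995, "api_latency_p95": 0.985}

report = {slo.name: slo.is_met(measured[slo.name]) for slo in slos}
```

With these sample values the availability SLO is met and the latency SLO is not, since only 98.5% of requests finished within the threshold against a 99% target.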
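Error Budget Calculation
Error Budget Formula
Error Budget = 1 - SLO Target
Example:
- SLO: 99.9% availability
- Error budget: 0.1% = 43.2 minutes per month
- Current errors: 0.05% = 21.6 minutes per month
- Remaining budget: 50%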
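The worked example above can be reproduced in a few lines of Python. This is a sketch under the same assumption of a 30-day month; the function names are illustrative.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 1 - SLO target (both in the range 0..1)."""
    return 1 - slo_target


def remaining_budget_percent(slo_target: float, observed_error_rate: float) -> float:
    """Share of the error budget still unspent, as a percentage.

    A negative result means the budget is overspent.
    """
    budget = error_budget(slo_target)
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return (budget - observed_error_rate) / budget * 100


MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month

budget_minutes = error_budget(0.999) * MINUTES_PER_MONTH
remaining = remaining_budget_percent(slo_target=0.999, observed_error_rate=0.0005)
```

Because these are floating-point values, `budget_minutes` and `remaining` come out very close to 43.2 and 50 rather than exactly equal, so compare them with a tolerance.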
Error Budget Policy
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
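The policy lists four thresholds but does not say how to treat values in between. One possible reading, sketched below in Python, is that an action takes effect once the remaining budget falls to its threshold and stays in force until the next lower threshold is reached. That interpretation, and the names `POLICY` and `policy_action`, are assumptions rather than part of the original policy.

```python
# (threshold in percent, action), sorted by ascending threshold.
POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
    (100.0, "Normal development velocity"),
]


def policy_action(remaining_budget_percent: float) -> str:
    """Return the action for the lowest threshold the remaining budget has reached."""
    for threshold, action in POLICY:
        if remaining_budget_percent <= threshold:
            return action
    # Values above 100% should not occur; fall back to the normal action.
    return POLICY[-1][1]


print(policy_action(75))
print(policy_action(5))
```

Under this reading, 75% remaining still maps to normal velocity, while 5% maps to freezing non-critical changes.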
Reference: see references/error-budget.md
SLO Implementation
Prometheus Recording Rules
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))
      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999
      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99
      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)
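The burn rate rule divides the observed error rate by the error budget, so a value of 1 means the budget would be spent exactly at the end of the window. The Python sketch below mirrors that arithmetic and shows how much budget a sustained burn rate consumes. The function names and the sample error rate are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_rate / budget


def budget_consumed_percent(rate: float, hours: float, window_days: float) -> float:
    """Share of the window's error budget spent by a burn rate held for `hours`."""
    return rate * hours / (window_days * 24) * 100


# Hypothetical: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.0144, slo_target=0.999)
spent_30d = budget_consumed_percent(rate, hours=1, window_days=30)
spent_28d = budget_consumed_percent(rate, hours=1, window_days=28)
```

A burn rate of about 14.4 held for one hour spends roughly 2% of a 30-day budget, and slightly more (about 2.1%) of a 28-day budget, because the shorter window has fewer hours to spread the budget over.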
SLO Alerting Rules
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |

