
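About
Effective on-call handoff patterns that ensure continuity, context transfer, and reliable incident handling.
name: on-call-handoff-patterns
description: "Effective patterns for on-call handoffs that ensure continuity between shifts, context transfer, and reliable incident response."
risk: unknown
source: community
date_added: "2026-02-27"
On-Call Handoff Patterns
Effective handoff patterns that ensure continuity between shifts, context transfer, and reliable incident response.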
Do Not Use When
- The task is unrelated to on-call handoff patterns
- You need a different domain or tool outside this scope
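Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate the results.
- Provide actionable steps and a way to verify them.
- If detailed examples are needed, open resources/implementation-playbook.md.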
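When to Use
- Transitioning on-call responsibilities
- Writing shift handoff summaries
- Documenting ongoing investigations
- Establishing on-call rotation procedures
- Improving handoff quality
- Onboarding new on-call engineers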
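Core Concepts
1. Handoff Components
| Component | Purpose |
|-----------|---------|
| Active Incidents | Failures currently in progress |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configuration changes |
| Known Issues | Issues with a workaround in place |
| Upcoming Events | Planned maintenance, releases |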
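The five components can be treated as required sections of a handoff document, so that an empty section is stated explicitly rather than silently omitted. The following is a minimal sketch under that assumption; `HandoffDocument` and its field names are hypothetical and only mirror the table.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffDocument:
    """Hypothetical container with one list per handoff component."""

    active_incidents: list[str] = field(default_factory=list)
    ongoing_investigations: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)
    known_issues: list[str] = field(default_factory=list)
    upcoming_events: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render every component as a Markdown section, even when empty."""
        sections = [
            ("Active Incidents", self.active_incidents),
            ("Ongoing Investigations", self.ongoing_investigations),
            ("Recent Changes", self.recent_changes),
            ("Known Issues", self.known_issues),
            ("Upcoming Events", self.upcoming_events),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"## {title}")
            if items:
                lines.extend(f"- {item}" for item in items)
            else:
                # An explicit "none" tells the incoming engineer the section
                # was considered, not forgotten.
                lines.append("None at handoff time.")
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"


doc = HandoffDocument(
    ongoing_investigations=["Intermittent API timeouts (ENG-1234)"],
)
print(doc.render())
```

Rendering all five sections unconditionally is a design choice: it keeps the document shape stable from shift to shift, which makes it quicker to scan.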
2. Handoff Timing
Recommended: 30 min overlap between shifts
Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
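The recommended 30-minute overlap can be checked mechanically when a rota is drawn up. This is a minimal sketch, assuming shift boundaries are available as timezone-aware datetimes; `shift_overlap` and `has_recommended_overlap` are hypothetical helper names, not part of any scheduling tool.

```python
from datetime import datetime, timedelta, timezone

# Taken from the recommendation above: 30 minutes of overlap between shifts.
RECOMMENDED_OVERLAP = timedelta(minutes=30)


def shift_overlap(outgoing_end: datetime, incoming_start: datetime) -> timedelta:
    """Time during which both engineers are on shift (zero if there is a gap)."""
    return max(outgoing_end - incoming_start, timedelta(0))


def has_recommended_overlap(outgoing_end: datetime, incoming_start: datetime) -> bool:
    """True when the two shifts overlap by at least the recommended window."""
    return shift_overlap(outgoing_end, incoming_start) >= RECOMMENDED_OVERLAP


# Illustrative times only: the incoming shift starts 30 minutes before the
# outgoing shift ends.
outgoing_end = datetime(2024, 1, 22, 9, 30, tzinfo=timezone.utc)
incoming_start = datetime(2024, 1, 22, 9, 0, tzinfo=timezone.utc)
print(has_recommended_overlap(outgoing_end, incoming_start))
```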
Templates
Template 1: Shift Handoff Document
# On-Call Handoff: Platform Team
**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC
---
## 🔴 Active Incidents
### None currently active
No active incidents at handoff time.
---
## 🟡 Ongoing Investigations
### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out
**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed
**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)
---
### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)
**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%
**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
---
## 🟢 Resolved This Shift
### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231
---
## 📋 Recent Changes
### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
---
## ⚠️ Known Issues & Workarounds
### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)
### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)
---
## 📅 Upcoming Events
| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
---
## 📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
---
## 🔧 Quick Reference
- PagerDuty: [Platform Team Schedule](https://pagerduty/schedules/platform)
- Runbooks: [Platform Runbooks](https://docs/runbooks/platform)
- Dashboards: [On-Call Overview](https://grafana/d/oncall-overview)
Template 2: Incident Handoff Record
# Incident Handoff: INC-2024-0042
## Current Status: INVESTIGATING
**Incident Commander**: Transferring from @alice to @bob
**Transfer Time**: 2024-01-22 09:00 UTC
**Incident Duration**: 4h 23m (started 04:37 UTC)
---
## Situation Summary
**What's happening**: API error rate elevated to 15% (normal: <0.1%)
**Customer impact**: Users experiencing intermittent 500 errors on checkout
**Affected services**: payment-service, order-service
**Severity**: SEV-2
## Timeline (Key Events)
| Time (UTC) | Event |
|------------|-------|
| 04:37 | PagerDuty alert: API error rate > 5% |
| 04:45 | Confirmed: payment-service returning 500s |
| 05:10 | Identified: database connection timeouts |
| 05:30 | Attempted fix: increased connection pool (no improvement) |
| 06:00 | Escalated to DBA team |
| 07:15 | DBA identified: long-running query blocking connections |
| 08:00 | Killed blocking query, partial recovery |
| 08:30 | Error rate down to 5%, still elevated |
## Current Theory
Long-running analytics query (started ~04:30) caused connection pool exhaustion.
Killing the query helped but connection pool may not have fully recovered.
## What's Been Tried
| Action | Result | Time |
|--------|--------|------|
| Increased pool size 50→100 | No improvement | 05:30 |
| Killed blocking query | Partial recovery | 08:00 |
| Restarted payment-service pod 1 | Pod healthy, others still affected | 08:45 |
## Immediate Next Steps
1. [ ] Rolling restart of remaining payment-service pods
2. [ ] Verify connection pool metrics after restart
3. [ ] Confirm error rate returns to normal
## Open Questions
- Why did the analytics query run during peak hours? (Usually scheduled for 02:00)
- Is there a connection leak in payment-service v2.3.4?
## Communication Status
- **Status page**: Updated (degraded performance)
- **Customer support**: Notified, template response provided
- **Leadership**: VP Engineering aware, next update at 10:00 UTC
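The Incident Duration field in the template above is easy to miscalculate by hand in the middle of an incident. As a minimal sketch, it can be derived from the start and transfer times; `format_duration` is a hypothetical helper, and the timestamps are the ones from the example record.

```python
from datetime import datetime, timezone


def format_duration(started: datetime, at: datetime) -> str:
    """Format the elapsed time between two instants as '<hours>h <minutes>m'."""
    total_minutes = int((at - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Values from the example incident: started 04:37 UTC, transferred 09:00 UTC.
started = datetime(2024, 1, 22, 4, 37, tzinfo=timezone.utc)
transfer = datetime(2024, 1, 22, 9, 0, tzinfo=timezone.utc)
print(format_duration(started, transfer))  # prints "4h 23m"
```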
Best Practices
Handoff Quality Checklist
## Handoff Quality Checklist
### Before Handoff (Outgoing)
- [ ] All active incidents documented with current status
- [ ] Ongoing investigations have clear next steps
- [ ] Recent changes listed with rollback info
- [ ] Known issues have workarounds documented
- [ ] Upcoming events noted with contacts
- [ ] Alerting thresholds reviewed and appropriate
- [ ] Personal context shared (vacation, meetings)
### During Handoff
- [ ] Sync call completed (voice/video preferred)
- [ ] Questions answered and documented
- [ ] Access verified (VPN, dashboards, tools)
- [ ] Escalation paths confirmed
- [ ] PagerDuty/OpsGenie schedule transferred
### After Handoff (Incoming)
- [ ] Handoff document reviewed completely
- [ ] Dashboards bookmarked and accessible
- [ ] Test alert received successfully
- [ ] Team notified of on-call change
- [ ] Emergency contacts saved locally
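Because the checklist uses Markdown task-list syntax, the items still open can be extracted before a handoff is marked complete. The sketch below assumes the `- [ ]` / `- [x]` form used above; `unchecked_items` is a hypothetical helper, and numbered task items such as `1. [ ]` are deliberately not handled.

```python
import re

# Matches "- [ ] text" and "- [x] text" (case-insensitive x), with optional indent.
CHECKBOX = re.compile(r"^\s*-\s\[([ xX])\]\s+(.*)$")


def unchecked_items(markdown: str) -> list[str]:
    """Return the text of every task-list item that is still unchecked."""
    open_items: list[str] = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            open_items.append(match.group(2).strip())
    return open_items


sample = """### After Handoff (Incoming)
- [x] Handoff document reviewed completely
- [ ] Test alert received successfully
- [x] Team notified of on-call change
"""
for item in unchecked_items(sample):
    print("Still open:", item)
```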
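Handoff Anti-Patterns
| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| Verbal-only handoff | Information is lost and cannot be traced | Always write it down |
| Information overload | Key details get buried | Order by priority and highlight what matters |
| No overlap time | No chance to ask questions or clarify | Schedule at least 30 minutes of overlap |
| Stale runbooks | Incorrect procedures | Update runbooks after every shift |
| Skipped handoff | Context is lost entirely | Make the handoff mandatory |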
Automation Suggestions
# Example: Automated handoff reminder (PagerDuty webhook)
handoff_automation:
  triggers:
    - event: schedule_change
      advance_notice: 30m
  actions:
    - create_handoff_document:
        template: shift-handoff
        auto_populate:
          - active_incidents: from_pagerduty
          - recent_deploys: from_github
          - open_alerts: from_prometheus
    - notify_outgoing:
        channel: slack
        message: "Reminder: Your on-call shift ends in 30 min. Please complete handoff doc."
    - notify_incoming:
        channel: slack
        message: "Heads up: Your on-call shift starts in 30 min. Handoff doc: {link}"
    - schedule_sync:
        calendar: team_calendar
        duration: 15m
        participants: [outgoing, incoming]
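The configuration above reads as an illustrative schema rather than a specific tool's format, so the reminder logic is sketched here in plain Python. It is a minimal sketch that only computes when the two reminders should go out and what they say; `build_reminders` is a hypothetical function, and actually sending the messages (for example to Slack) is left out.

```python
from datetime import datetime, timedelta, timezone

# Mirrors `advance_notice: 30m` in the example configuration.
ADVANCE_NOTICE = timedelta(minutes=30)


def build_reminders(
    shift_change: datetime, outgoing: str, incoming: str, doc_link: str
) -> list[tuple[datetime, str, str]]:
    """Return (send_at, recipient, message) tuples for both engineers."""
    send_at = shift_change - ADVANCE_NOTICE
    minutes = int(ADVANCE_NOTICE.total_seconds() // 60)
    return [
        (
            send_at,
            outgoing,
            f"Reminder: Your on-call shift ends in {minutes} min. "
            "Please complete handoff doc.",
        ),
        (
            send_at,
            incoming,
            f"Heads up: Your on-call shift starts in {minutes} min. "
            f"Handoff doc: {doc_link}",
        ),
    ]


# The link below is a placeholder, not a real document URL.
change = datetime(2024, 1, 22, 9, 0, tzinfo=timezone.utc)
for send_at, recipient, message in build_reminders(
    change, "@alice", "@bob", "https://docs/handoff"
):
    print(send_at.isoformat(), recipient, message)
```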
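Metrics and Improvement
Key Metrics to Track
| Metric | Target | How Measured |
|--------|--------|--------------|
| Handoff completion rate | 100% | Whether a handoff document was created |
| Sync call completion rate | >90% | Calendar records |
| Post-handoff incidents | Downward trend | Incidents within 1 hour after handoff |
| Context-loss incidents | 0 | Escalations caused by missing context |
| Handoff satisfaction | >4/5 | Feedback score from the incoming engineer |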
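Three of these metrics can be computed directly from per-handoff records. The following is a minimal sketch, assuming each handoff is logged with the fields shown; `HandoffRecord` and `handoff_metrics` are hypothetical names, and the incident-based metrics are omitted because they need data from the incident tracker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoffRecord:
    """Hypothetical log entry for one shift handoff."""

    document_created: bool
    sync_call_held: bool
    satisfaction: Optional[int] = None  # 1-5 score from the incoming engineer


def handoff_metrics(records: list[HandoffRecord]) -> dict[str, Optional[float]]:
    """Compute completion rates and average satisfaction across handoffs."""
    if not records:
        raise ValueError("at least one handoff record is required")
    total = len(records)
    scores = [r.satisfaction for r in records if r.satisfaction is not None]
    return {
        "handoff_completion_rate": sum(r.document_created for r in records) / total,
        "sync_call_completion_rate": sum(r.sync_call_held for r in records) / total,
        # None when nobody has submitted feedback yet.
        "avg_satisfaction": sum(scores) / len(scores) if scores else None,
    }


records = [
    HandoffRecord(True, True, 5),
    HandoffRecord(True, True, 4),
    HandoffRecord(True, False),
    HandoffRecord(False, False, 3),
]
print(handoff_metrics(records))
```

Comparing the results against the targets in the table (100%, >90%, >4/5) is then a simple threshold check in the monthly review.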
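Continuous Improvement Process
- Monthly review: review handoff quality metrics
- Postmortems: include handoff failures in postmortem analysis
- Template iteration: update templates based on feedback
- Tooling improvements: automate repetitive handoff tasks
- Training updates: fold lessons learned into onboarding training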
Compatible Tools
Claude Code, Cursor
Tags
Frontend development