On-Call Handoff Patterns

Low risk

Author: @sickn33, verified source

4.5555 installs, v1.0.0, updated May 25, 2026

Usage

Run the following commands in Claude Code

Step 1: Add the Marketplace

/plugin marketplace add sickn33/antigravity-awesome-skills

Step 2: Install the plugin

/plugin install antigravity-awesome-skills@antigravity-awesome-skills

About

Effective on-call handoff patterns that ensure continuity, context transfer, and reliable incident handling.

name: on-call-handoff-patterns
description: "Effective patterns for on-call handoffs that ensure continuity between shifts, context transfer, and reliable incident response."
risk: unknown
source: community
date_added: "2026-02-27"

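On-Call Handoff Patterns

Effective handoff patterns that ensure continuity between shifts, context transfer, and reliable incident response.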

When Not to Use

The task is unrelated to on-call handoff patterns
You need a different domain or tool outside this scope

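Instructions

Clarify the goals, constraints, and required inputs.
Apply relevant best practices and validate the results.
Provide actionable steps and verification methods.
For detailed examples, open resources/implementation-playbook.md.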

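When to Use

Handing off on-call responsibilities
Writing shift handoff summaries
Documenting ongoing investigations
Establishing on-call rotation processes
Improving handoff quality
Onboarding new on-call engineers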

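Core Concepts

1. Handoff Components

| Component | Purpose |
|-----------|---------|
| Active incidents | Outages happening right now |
| Ongoing investigations | Issues currently being debugged |
| Recent changes | Deployments and configuration changes |
| Known issues | Problems that already have a workaround |
| Upcoming events | Maintenance and planned releases |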
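
As a rough illustration, the five components in the table can be modeled as one record per shift. This is a minimal sketch, not part of the skill itself: the class name `HandoffDocument`, its field names, and the `empty_sections` helper are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HandoffDocument:
    """One shift handoff, with one list per component (hypothetical model)."""

    outgoing: str
    incoming: str
    active_incidents: list[str] = field(default_factory=list)
    ongoing_investigations: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)
    known_issues: list[str] = field(default_factory=list)
    upcoming_events: list[str] = field(default_factory=list)

    def empty_sections(self) -> list[str]:
        """Return component names with no entries, so the author can confirm
        each one is intentionally empty rather than forgotten."""
        sections = {
            "active_incidents": self.active_incidents,
            "ongoing_investigations": self.ongoing_investigations,
            "recent_changes": self.recent_changes,
            "known_issues": self.known_issues,
            "upcoming_events": self.upcoming_events,
        }
        return [name for name, items in sections.items() if not items]


doc = HandoffDocument(
    outgoing="@alice",
    incoming="@bob",
    ongoing_investigations=["ENG-1234: intermittent API timeouts"],
    recent_changes=["api-gateway v3.2.1"],
)
print(doc.empty_sections())
```

Listing the empty sections explicitly mirrors the "None currently active" convention in the template: an empty component should be a stated fact, not an omission.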

2. Handoff Timing

Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
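
The outgoing engineer's two 15-minute steps can be turned into concrete clock times. The sketch below assumes the 30-minute overlap ends exactly at the shift change, which the timing outline does not state; `outgoing_schedule` and the step names are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(minutes=30)  # recommended overlap between shifts
STEP = timedelta(minutes=15)     # length of each outgoing step


def outgoing_schedule(shift_end: datetime) -> dict[str, datetime]:
    """Start times for the outgoing engineer's steps.

    Assumption: the overlap window ends at shift_end.
    """
    start = shift_end - OVERLAP
    return {
        "write_handoff_document": start,
        "sync_call": start + STEP,
    }


shift_end = datetime(2024, 1, 22, 9, 0, tzinfo=timezone.utc)
for step, when in outgoing_schedule(shift_end).items():
    print(f"{when:%H:%M} UTC  {step}")
```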

Templates

Template 1: Shift Handoff Document

# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC

---

## 🔴 Active Incidents

### None currently active
No active incidents at handoff time.

---

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)

---

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231

---

## 📋 Recent Changes

### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |

### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

---

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

---

## 📅 Upcoming Events

| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |

---

## 📞 Escalation Reminders

| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |

---

## 🔧 Quick Reference

- PagerDuty: [Platform Team Schedule](https://pagerduty/schedules/platform)
- Runbooks: [Platform Runbooks](https://docs/runbooks/platform)
- Dashboards: [On-Call Overview](https://grafana/d/oncall-overview)
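
If investigations are tracked as structured data, the "Ongoing Investigations" entries of this template could be generated rather than typed by hand. The following is a minimal sketch; the `Investigation` class and `render_investigation` function are hypothetical, and only a subset of the template's fields is rendered.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Investigation:
    """Fields taken from the template's investigation entries."""

    title: str
    ticket: str
    status: str
    started: str
    impact: str
    next_steps: list[str]


def render_investigation(index: int, inv: Investigation) -> str:
    """Render one investigation in the template's markdown layout."""
    lines = [
        f"### {index}. {inv.title} ({inv.ticket})",
        f"**Status**: {inv.status}",
        f"**Started**: {inv.started}",
        f"**Impact**: {inv.impact}",
        "",
        "**Next Steps**:",
    ]
    # Each next step becomes an unchecked task-list item.
    lines += [f"- [ ] {step}" for step in inv.next_steps]
    return "\n".join(lines)


example = Investigation(
    title="Intermittent API Timeouts",
    ticket="ENG-1234",
    status="Investigating",
    started="2024-01-20",
    impact="~0.1% of requests timing out",
    next_steps=[
        "Review new logs after tonight's backup",
        "Consider moving backup window if confirmed",
    ],
)
print(render_investigation(1, example))
```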

Template 2: Incident Handoff Note

# Incident Handoff: INC-2024-0042

## Current Status: INVESTIGATING

**Incident Commander**: Transferring from @alice to @bob
**Transfer Time**: 2024-01-22 09:00 UTC
**Incident Duration**: 4h 23m (started 04:37 UTC)

---

## Situation Summary

**What's happening**: API error rate elevated to 15% (normal: <0.1%)
**Customer impact**: Users experiencing intermittent 500 errors on checkout
**Affected services**: payment-service, order-service
**Severity**: SEV-2

## Timeline (Key Events)

| Time (UTC) | Event |
|------------|-------|
| 04:37 | PagerDuty alert: API error rate > 5% |
| 04:45 | Confirmed: payment-service returning 500s |
| 05:10 | Identified: database connection timeouts |
| 05:30 | Attempted fix: increased connection pool (no improvement) |
| 06:00 | Escalated to DBA team |
| 07:15 | DBA identified: long-running query blocking connections |
| 08:00 | Killed blocking query, partial recovery |
| 08:30 | Error rate down to 5%, still elevated |

## Current Theory

Long-running analytics query (started ~04:30) caused connection pool exhaustion.
Killing the query helped but connection pool may not have fully recovered.

## What's Been Tried

| Action | Result | Time |
|--------|--------|------|
| Increased pool size 50→100 | No improvement | 05:30 |
| Killed blocking query | Partial recovery | 08:00 |
| Restarted payment-service pod 1 | Pod healthy, others still affected | 08:45 |

## Immediate Next Steps

1. [ ] Rolling restart of remaining payment-service pods
2. [ ] Verify connection pool metrics after restart
3. [ ] Confirm error rate returns to normal

## Open Questions

- Why did the analytics query run during peak hours? (Usually scheduled for 02:00)
- Is there a connection leak in payment-service v2.3.4?

## Communication Status

- **Status page**: Updated (degraded performance)
- **Customer support**: Notified, template response provided
- **Leadership**: VP Engineering aware, next update at 10:00 UTC
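
The "Incident Duration" field in this template is easy to get wrong when filled in by hand during an incident. A small sketch of the arithmetic, using the template's own times (incident start 04:37 UTC, transfer 09:00 UTC); the function name `incident_duration` is hypothetical.

```python
from datetime import datetime, timezone


def incident_duration(started: datetime, now: datetime) -> str:
    """Format the elapsed time between two timestamps as 'Xh YYm'."""
    total_minutes = int((now - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


started = datetime(2024, 1, 22, 4, 37, tzinfo=timezone.utc)
transfer = datetime(2024, 1, 22, 9, 0, tzinfo=timezone.utc)
print(incident_duration(started, transfer))  # 4h 23m
```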

Best Practices

Handoff Quality Checklist

## Handoff Quality Checklist

### Before Handoff (Outgoing)
- [ ] All active incidents documented with current status
- [ ] Ongoing investigations have clear next steps
- [ ] Recent changes listed with rollback info
- [ ] Known issues have workarounds documented
- [ ] Upcoming events noted with contacts
- [ ] Alerting thresholds reviewed and appropriate
- [ ] Personal context shared (vacation, meetings)

### During Handoff
- [ ] Sync call completed (voice/video preferred)
- [ ] Questions answered and documented
- [ ] Access verified (VPN, dashboards, tools)
- [ ] Escalation paths confirmed
- [ ] PagerDuty/OpsGenie schedule transferred

### After Handoff (Incoming)
- [ ] Handoff document reviewed completely
- [ ] Dashboards bookmarked and accessible
- [ ] Test alert received successfully
- [ ] Team notified of on-call change
- [ ] Emergency contacts saved locally
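
A checklist like this can be checked mechanically before a handoff is considered complete. The sketch below assumes the checklist is stored as markdown task-list text in the format shown above; `unchecked_items` is a hypothetical helper.

```python
from __future__ import annotations

import re

# Matches markdown task-list items such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^- \[([ xX])\] (.+)$")


def unchecked_items(markdown: str) -> list[str]:
    """Return the text of every task-list item that is still unchecked."""
    items = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line.strip())
        if match and match.group(1) == " ":
            items.append(match.group(2))
    return items


checklist = """\
### After Handoff (Incoming)
- [x] Handoff document reviewed completely
- [x] Dashboards bookmarked and accessible
- [ ] Test alert received successfully
- [x] Team notified of on-call change
- [ ] Emergency contacts saved locally
"""
print(unchecked_items(checklist))
```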

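Handoff Anti-Patterns

| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| Verbal-only handoff | Information is lost and cannot be traced | Always write a document |
| Information overload | Key details get buried | Order by priority and highlight what matters |
| No overlap time | No chance to ask questions or clarify | Schedule at least 30 minutes of overlap |
| Outdated runbooks | Wrong operational steps | Update runbooks after every shift |
| Skipped handoff | Context is lost entirely | Make the handoff mandatory |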

Automation Suggestions

# Example: Automated handoff reminder (illustrative schema, triggered by a PagerDuty schedule webhook)
handoff_automation:
  triggers:
    - event: schedule_change
      advance_notice: 30m

  actions:
    - create_handoff_document:
        template: shift-handoff
        auto_populate:
          - active_incidents: from_pagerduty
          - recent_deploys: from_github
          - open_alerts: from_prometheus

    - notify_outgoing:
        channel: slack
        message: "Reminder: Your on-call shift ends in 30 min. Please complete handoff doc."

    - notify_incoming:
        channel: slack
        message: "Heads up: Your on-call shift starts in 30 min. Handoff doc: {link}"

    - schedule_sync:
        calendar: team_calendar
        duration: 15m
        participants: [outgoing, incoming]
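
The two notification actions in the configuration above can be sketched as a pure function that only builds the messages and their send time. Delivery to Slack is deliberately left out, since no client API is specified in the source. `build_reminders`, the dictionary keys, and the example link are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

ADVANCE_NOTICE = timedelta(minutes=30)  # mirrors advance_notice: 30m


def build_reminders(
    shift_change: datetime, outgoing: str, incoming: str, doc_link: str
) -> list[dict]:
    """Build the outgoing and incoming reminders; sending is out of scope."""
    send_at = shift_change - ADVANCE_NOTICE
    minutes = int(ADVANCE_NOTICE.total_seconds() // 60)
    return [
        {
            "to": outgoing,
            "send_at": send_at,
            "message": (
                f"Reminder: Your on-call shift ends in {minutes} min. "
                "Please complete handoff doc."
            ),
        },
        {
            "to": incoming,
            "send_at": send_at,
            "message": (
                f"Heads up: Your on-call shift starts in {minutes} min. "
                f"Handoff doc: {doc_link}"
            ),
        },
    ]


reminders = build_reminders(
    shift_change=datetime(2024, 1, 22, 9, 0, tzinfo=timezone.utc),
    outgoing="@alice",
    incoming="@bob",
    doc_link="https://docs.example/handoff",  # placeholder link
)
for reminder in reminders:
    print(reminder["send_at"].isoformat(), reminder["to"], reminder["message"])
```

Keeping message construction separate from delivery makes the timing and wording testable without any chat or paging service.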

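Metrics and Improvement

Key Metrics to Track

| Metric | Target | How It Is Measured |
|--------|--------|--------------------|
| Handoff completion rate | 100% | Whether a handoff document was created |
| Sync call completion rate | >90% | Calendar records |
| Post-handoff incidents | Downward trend | Incidents within 1 hour after handoff |
| Context-loss incidents | 0 | Escalations caused by missing context |
| Handoff satisfaction | >4/5 | Feedback rating from the incoming engineer |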
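
Three of these metrics can be computed directly from per-handoff records and compared against their targets. This is a minimal sketch under the assumption that each handoff is logged with the fields below; `HandoffRecord` and `summarize` are hypothetical names, and the sample data is invented purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HandoffRecord:
    doc_created: bool      # was a handoff document written?
    sync_call_held: bool   # did the sync call take place?
    satisfaction: int      # incoming engineer's rating, 1 to 5


def summarize(records: list[HandoffRecord]) -> dict[str, tuple[float, bool]]:
    """Return (value, meets_target) for each metric, using the table's targets."""
    if not records:
        raise ValueError("need at least one handoff record")
    n = len(records)
    completion = sum(r.doc_created for r in records) / n
    sync_rate = sum(r.sync_call_held for r in records) / n
    avg_satisfaction = sum(r.satisfaction for r in records) / n
    return {
        "handoff_completion": (completion, completion == 1.0),    # target 100%
        "sync_call_completion": (sync_rate, sync_rate > 0.90),    # target >90%
        "satisfaction": (avg_satisfaction, avg_satisfaction > 4),  # target >4/5
    }


# Illustrative sample data only.
sample = [
    HandoffRecord(True, True, 5),
    HandoffRecord(True, True, 4),
    HandoffRecord(True, False, 5),
    HandoffRecord(True, True, 4),
]
for name, (value, ok) in summarize(sample).items():
    print(f"{name}: {value:.2f} ({'meets target' if ok else 'below target'})")
```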

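Continuous Improvement Process

Monthly review: review handoff quality metrics
Postmortems: include handoff failures in postmortems
Template iteration: update templates based on feedback
Tooling improvements: automate repetitive handoff tasks
Training updates: fold lessons learned into onboarding training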

Compatible Tools

Claude Code, Cursor