Postmortem Writing

Low risk

Author: @sickn33 (verified source)

4.1145 installs, v1.0.0, updated May 25, 2026

Usage

Run the following commands in Claude Code.

Step 1: Add the marketplace

/plugin marketplace add sickn33/antigravity-awesome-skills

Step 2: Install the plugin

/plugin install antigravity-awesome-skills@antigravity-awesome-skills

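About

A comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incidents from recurring.

name: postmortem-writing
description: "Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent recurrence."
risk: unknown
source: community
date_added: "2026-02-27"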

Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent recurrence.

Do not use this skill when

- The task is unrelated to postmortem writing
- You need a different domain or tool outside this scope

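Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and a way to verify them.
- If detailed examples are required, open `resources/implementation-playbook.md`.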

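Use this skill when

- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building an organizational learning culture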

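Core Concepts

1. Blameless Culture

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |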

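2. Postmortem Triggers

- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention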
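
As a rough illustration, the trigger list above can be encoded as a simple check. This is a minimal sketch, not part of the skill itself: the `Incident` class, its field names, and the `postmortem_reasons` function are all hypothetical, and a real incident tracker would supply its own fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    severity: str                            # e.g. "SEV1", "SEV2", "SEV3"
    customer_facing_outage_minutes: int = 0
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False
    unusual_intervention: bool = False


def postmortem_reasons(incident: Incident) -> List[str]:
    """Return every trigger condition the incident meets (empty list if none)."""
    reasons = []
    if incident.severity in {"SEV1", "SEV2"}:
        reasons.append("SEV1 or SEV2 incident")
    if incident.customer_facing_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    if incident.unusual_intervention:
        reasons.append("Incident requiring unusual intervention")
    return reasons


# A SEV2 incident with a 47-minute customer-facing outage meets two triggers.
example = Incident(severity="SEV2", customer_facing_outage_minutes=47)
print(postmortem_reasons(example))
```

Returning the list of matched reasons, rather than a bare boolean, lets the postmortem document state why it was opened.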

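Quick Start

Postmortem Timeline

Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents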
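
To turn the day offsets above into concrete dates, a small helper along these lines may be useful. It is only a sketch under stated assumptions: the function name `postmortem_schedule` is hypothetical, and offsets are counted in calendar days from the incident date (the source does not say whether business days are intended).

```python
from datetime import date, timedelta
from typing import Dict, Tuple


def postmortem_schedule(incident_day: date) -> Dict[str, Tuple[date, date]]:
    """Map each early phase to (start, end) dates, treating the incident as day 0.

    Assumption: offsets are calendar days, not business days.
    """
    def day(offset: int) -> date:
        return incident_day + timedelta(days=offset)

    return {
        "Draft postmortem document": (day(1), day(2)),
        "Postmortem meeting": (day(3), day(5)),
        "Finalize document, create tickets": (day(5), day(7)),
    }


# Example using the incident date from the template in this document.
for phase, (start, end) in postmortem_schedule(date(2024, 1, 15)).items():
    print(f"{phase}: {start.isoformat()} to {end.isoformat()}")
```

The open-ended "Week 2+" and quarterly steps are left out because they have no fixed end date.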

Templates

Template 1: Standard Postmortem

# Postmortem: [Incident Title]

**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes

## Executive Summary

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion, triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications

## Timeline (All times UTC)

| Time | Event |
|------|-------|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |

## Root Cause Analysis

### What Happened

The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.

### Why It Happened

1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.

2. **Contributing Factors**:
   - Code review did not catch the connection handling change
   - No integration tests specifically for connection pool behavior
   - Staging environment has lower traffic, masking the issue
   - Database connection metrics alert threshold was too high (90%)

3. **5 Whys Analysis**:
   - Why did the service fail? → Database connections exhausted
   - Why were connections exhausted? → Each request opened a new connection
   - Why did each request open a new connection? → Code bypassed the connection pool
   - Why did the code bypass the connection pool? → Developer unfamiliar with codebase patterns
   - Why was developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    ↓
                            Connection Pool (broken)
                                    ↓
                            Direct connections (cause)
```

## Detection

### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)

### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- No canary deployment in place; one would have caught this

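Running the Test

During the Test

DO:

- Monitor technical health
- Document external factors

DO NOT:

- Stop early because of "good-looking" results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria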

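Analyzing Results

Analysis Discipline

When interpreting results:

- Do not generalize beyond the tested population
- Do not claim causality beyond the tested change
- Do not ignore guardrail metric failures
- Separate statistical significance from business judgment

Interpreting Outcomes

| Result | Action |
| -------------------- | -------------------------------------- |
| Significant positive | Consider full rollout |
| Significant negative | Reject variant, document learning |
| Inconclusive | Consider more traffic or a bolder change |
| Guardrail failure | Do not ship, even if the primary metric wins |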
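
The outcome table can be read as a decision function in which a guardrail failure takes precedence over everything else. The following is a minimal sketch; the function name `recommended_action` and its parameters are hypothetical, and the significance test itself is assumed to happen elsewhere.

```python
def recommended_action(significant: bool, lift: float, guardrails_ok: bool) -> str:
    """Map a test outcome to the action from the outcome table.

    significant   -- whether the primary metric result is statistically significant
    lift          -- observed change in the primary metric (positive means better)
    guardrails_ok -- False if any guardrail metric failed
    """
    # Guardrail failures are checked first: they block shipping even on a win.
    if not guardrails_ok:
        return "Do not ship, even if primary wins"
    if not significant:
        return "Consider more traffic or bolder change"
    if lift > 0:
        return "Consider rollout"
    return "Reject variant, document learning"


print(recommended_action(significant=True, lift=0.04, guardrails_ok=False))
```

Checking guardrails before significance mirrors the last row of the table: a winning primary metric never overrides a guardrail failure.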

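Documentation & Learning

Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Sample size vs. achieved
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeating failures.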

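Refusal Conditions (Safety)

Refuse to proceed if:

- The baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE (minimum detectable effect)
- The primary metric is undefined
- Multiple variables are changed without a proper design
- The hypothesis cannot be clearly stated

Explain why and recommend next steps.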
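
The refusal conditions lend themselves to a pre-launch check. The sketch below is one possible reading, not a prescribed implementation: both function names and all parameters are hypothetical, the MDE is assumed to be a relative lift on a conversion-style rate, and the sample size uses the standard normal-approximation formula for a two-sided two-proportion test (defaults of 5% significance and 80% power are assumptions). It expects a baseline rate strictly between 0 and 1 and a positive MDE.

```python
import math
from statistics import NormalDist
from typing import List, Optional


def required_sample_per_variant(baseline_rate: float, relative_mde: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def refusal_reasons(baseline_rate: Optional[float], relative_mde: float,
                    users_per_variant: int, primary_metric: Optional[str],
                    hypothesis: str, variables_changed: int = 1,
                    multivariate_design: bool = False) -> List[str]:
    """Return the refusal conditions that apply (empty list means proceed)."""
    reasons = []
    if baseline_rate is None:
        reasons.append("Baseline rate is unknown and cannot be estimated")
    elif users_per_variant < required_sample_per_variant(baseline_rate, relative_mde):
        reasons.append("Traffic is insufficient to detect the MDE")
    if not primary_metric:
        reasons.append("Primary metric is undefined")
    if variables_changed > 1 and not multivariate_design:
        reasons.append("Multiple variables are changed without proper design")
    if not hypothesis.strip():
        reasons.append("Hypothesis cannot be clearly stated")
    return reasons


# 500 users per variant is far too few for a 10% relative lift on a 10% baseline.
print(refusal_reasons(0.10, 0.10, 500, "conversion_rate", "Shorter form raises signups"))
```

The returned reasons double as the explanation owed to the requester; recommending next steps is left to the caller.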

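Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first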

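Final Reminder

A/B testing is not about proving ideas right. It is about learning the truth with confidence.

If you feel tempted to rush, simplify, or "just try it", that is your signal to slow down and re-check the design.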

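When to Use

Use this skill when the task clearly matches the workflow or actions described in the overview above.

Limitations

- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.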

Compatible tools

Claude Code, Cursor
