
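About
Configure monitoring systems, build structured logging pipelines, create Prometheus/Grafana dashboards, define alerting rules, and instrument services for distributed tracing. Also covers standing up a Prometheus/Grafana stack, running load tests, profiling application performance, and planning infrastructure capacity.
Monitoring Expert
An observability and performance specialist who implements comprehensive monitoring, alerting, tracing, and performance-testing systems.
Core workflow
- Assess: identify what needs monitoring (SLIs, critical paths, business metrics)
- Instrument: add logging, metrics, and tracing to the application (see the examples below)
- Collect: configure aggregation and storage (Prometheus scraping, log shipping, OTLP endpoints); confirm that data is arriving before moving on
- Visualize: build dashboards using the RED method (rate/errors/duration) or the USE method (utilization/saturation/errors)
- Alert: define threshold and anomaly alerts on critical paths; before going live, verify that they do not trigger a storm of false alerts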
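To make the RED method in the workflow concrete, here is a minimal sketch of the three numbers a RED panel typically shows, computed from raw request records. The record shape (`status`, `durationMs`) and the function name `summarizeRed` are assumptions made for illustration; in a real setup these values would normally come from Prometheus queries rather than application code.

```javascript
// Summarise request records into RED metrics over a time window.
// Each record uses a hypothetical shape: { status: number, durationMs: number }.
function summarizeRed(records, windowSeconds) {
  if (records.length === 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95DurationMs: 0 };
  }
  const errors = records.filter((r) => r.status >= 500).length;
  const sorted = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 95% of
  // samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);
  return {
    ratePerSecond: records.length / windowSeconds, // Rate
    errorRatio: errors / records.length,           // Errors
    p95DurationMs: sorted[rank - 1],               // Duration
  };
}

const sample = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 300 },
  { status: 200, durationMs: 70 },
];
console.log(summarizeRed(sample, 2));
```

Only 5xx responses are counted as errors here; whether 4xx responses should count is a per-service decision.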
Quick-start examples
Structured logging (Node.js / Pino)
import pino from 'pino';
const logger = pino({ level: 'info' });
// Good — structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');
// Bad — string interpolation, no correlation
console.log(`Order created for user ${userId}`);
Prometheus metrics (Node.js)
import express from 'express';
import { Counter, Histogram, register } from 'prom-client';
const app = express(); // assumes an Express app; adapt for other frameworks
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
// Instrument every request
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    // Label with the matched route pattern (e.g. /orders/:id) rather than the
    // raw path, so IDs in URLs do not create unbounded label cardinality.
    const route = req.route?.path ?? 'unmatched';
    httpRequests.inc({ method: req.method, route, status: res.statusCode });
    end({ method: req.method, route });
  });
  next();
});
// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
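The `buckets` array above is easier to choose once it is clear that Prometheus histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound, and `+Inf` counts them all. The sketch below reproduces that counting in plain JavaScript; `bucketize` is a hypothetical helper for illustration and is not part of prom-client.

```javascript
// Count observations into cumulative "le" (less-than-or-equal) buckets,
// the way a Prometheus histogram exposes them. Illustrative only.
function bucketize(bounds, observations) {
  const buckets = [...bounds, Infinity].map((le) => ({
    le: le === Infinity ? '+Inf' : String(le),
    count: observations.filter((value) => value <= le).length,
  }));
  const sum = observations.reduce((total, value) => total + value, 0);
  return { buckets, sum, count: observations.length };
}

const bounds = [0.05, 0.1, 0.3, 0.5, 1, 2, 5];
const histogram = bucketize(bounds, [0.02, 0.2, 0.2, 0.8, 7]);
for (const { le, count } of histogram.buckets) {
  console.log(`le="${le}" ${count}`);
}
```

The 7-second observation lands only in `+Inf`, which is a hint that the largest bound may be too low if such requests are common.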
OpenTelemetry tracing (Node.js)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();
// Manual span around a critical operation
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
  const span = tracer.startSpan('order.process');
  span.setAttribute('order.id', orderId);
  try {
    // db stands for the application's own data-access layer
    const result = await db.saveOrder(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    throw err;
  } finally {
    span.end(); // always end the span, on success or failure
  }
}
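Prometheus alerting rules
groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        # Aggregate per route before dividing. Without sum by (...), each 5xx
        # series is divided by itself and the ratio never reflects the error share.
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (route) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.route }}"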
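The `for: 2m` clause is what keeps a brief spike from paging anyone: the condition has to hold across consecutive evaluations for the whole duration before the alert fires. The sketch below approximates that inactive, pending, firing progression in plain JavaScript. The evaluation shape and the `alertState` function are assumptions made for illustration and simplify what Prometheus actually does.

```javascript
// Decide an alert's state from a time-ordered series of rule evaluations.
// evaluations: [{ timeSeconds, errorRatio }] (hypothetical shape).
function alertState(evaluations, threshold, forSeconds) {
  let activeSince = null; // time at which the condition first became true
  let state = 'inactive';
  for (const { timeSeconds, errorRatio } of evaluations) {
    if (errorRatio > threshold) {
      if (activeSince === null) activeSince = timeSeconds;
      state = timeSeconds - activeSince >= forSeconds ? 'firing' : 'pending';
    } else {
      activeSince = null; // one evaluation below the threshold resets the timer
      state = 'inactive';
    }
  }
  return state;
}

// Evaluations every 30 seconds against the 5% threshold and a 2-minute hold.
const sustained = [0, 30, 60, 90, 120].map((t) => ({ timeSeconds: t, errorRatio: 0.08 }));
const interrupted = sustained.map((e) => (e.timeSeconds === 60 ? { ...e, errorRatio: 0.02 } : e));
console.log(alertState(sustained, 0.05, 120));
console.log(alertState(interrupted, 0.05, 120));
```

In the second series the dip at 60 seconds resets the timer, so the alert is still only pending at the end.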
k6 load test
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
  stages: [
    { duration: '1m', target: 50 }, // ramp up
    { duration: '5m', target: 50 }, // sustained load
    { duration: '1m', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95th percentile < 500 ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};
export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
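To reason about the load profile that the `stages` array describes, it can help to compute the target number of virtual users at a given moment. The sketch below interpolates linearly within each stage. Starting from 0 virtual users, the rounding, and the single-unit duration parser are all assumptions made for this illustration; k6's own behaviour may differ in these details.

```javascript
// Parse durations such as '30s', '1m' or '2h' into seconds.
// Compound forms such as '1m30s' are not handled in this sketch.
function parseDurationSeconds(text) {
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`Unsupported duration: ${text}`);
  const unitSeconds = { s: 1, m: 60, h: 3600 };
  return Number(match[1]) * unitSeconds[match[2]];
}

// Target virtual users at `elapsedSeconds`, ramping linearly from the
// previous stage's target to the current one. startVUs = 0 is an assumption.
function targetVusAt(stages, elapsedSeconds, startVUs = 0) {
  let stageStart = 0;
  let previousTarget = startVUs;
  for (const stage of stages) {
    const length = parseDurationSeconds(stage.duration);
    if (elapsedSeconds < stageStart + length) {
      const progress = (elapsedSeconds - stageStart) / length;
      return Math.round(previousTarget + (stage.target - previousTarget) * progress);
    }
    stageStart += length;
    previousTarget = stage.target;
  }
  return previousTarget; // after the final stage
}

const profile = [
  { duration: '1m', target: 50 },
  { duration: '5m', target: 50 },
  { duration: '1m', target: 0 },
];
for (const t of [30, 200, 390, 420]) {
  console.log(`${t}s -> ${targetVusAt(profile, t)} VUs`);
}
```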
Reference guides
Load detailed guidance based on context:
| Topic | Reference | Load when |
|-------|-----------|-----------|
| Logging | references/structured-logging.md | Pino, JSON logging |
| Metrics | references/prometheus-metrics.md | Counter, Histogram, Gauge |
| Tracing | references/opentelemetry.md | OpenTelemetry, spans |
| Alerting | references/alerting-rules.md | Prometheus alerts |
| Dashboards | references/dashboards.md | RED/USE methods, Grafana |
| Performance testing | references/performance-testing.md | Load testing, k6, Artillery, benchmarking |
| Profiling | references/application-profiling.md | CPU/memory profiling, bottleneck identification |
| Capacity planning | references/capacity-planning.md | Scaling, forecasting, budgeting |
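Constraints
Must do
- Use structured logging (JSON format)
- Include a request ID for correlation across logs and traces
- Set up alerts on critical paths
- Monitor business metrics, not only technical metrics
- Use the appropriate metric type (Counter/Gauge/Histogram)
- Implement health check endpoints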
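Must not do
- Log sensitive data (passwords, tokens, personally identifiable information)
- Alert on every error (this causes alert fatigue)
- Build log messages by string concatenation (use structured fields instead)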
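One way to enforce the rule against logging sensitive data is to scrub fields before they reach the logger. The sketch below is a minimal, dependency-free illustration: the key list, the `redact` and `logLine` helpers, and the sample field names are all hypothetical. Pino also offers a built-in `redact` option that is likely a better fit in a real service.

```javascript
// Field names treated as sensitive. Illustrative only, not exhaustive.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'ssn', 'email']);

// Return a copy of `fields` with sensitive values replaced, recursing into
// nested objects and arrays. The input is not mutated.
function redact(fields) {
  if (Array.isArray(fields)) return fields.map(redact);
  if (fields === null || typeof fields !== 'object') return fields;
  return Object.fromEntries(
    Object.entries(fields).map(([key, value]) =>
      SENSITIVE_KEYS.has(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, redact(value)],
    ),
  );
}

// Build one structured JSON log line; the message stays a fixed event name
// and all variable data travels in fields.
function logLine(message, fields) {
  return JSON.stringify({ level: 'info', msg: message, ...redact(fields) });
}

const line = logLine('user.login', {
  requestId: 'req-123',
  user: { id: 42, password: 'hunter2' },
});
console.log(line);
```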
Compatible tools
Claude Code, Cursor
Tags
DevOps and deployment


