
关于
ClickHouse 数据库模式、查询优化、分析和数据工程最佳实践,面向高性能分析工作负载。
name: clickhouse-io description: ClickHouse 数据库模式、查询优化、分析和数据工程最佳实践,适用于高性能分析工作负载。 origin: ECC
ClickHouse 分析模式
ClickHouse 特定的高性能分析和数据工程模式。
何时激活
- 设计 ClickHouse 表模式(MergeTree 引擎选择)
- 编写分析查询(聚合、窗口函数、连接)
- 优化查询性能(分区裁剪、投影、物化视图)
- 摄入大量数据(批量插入、Kafka 集成)
- 从 PostgreSQL/MySQL 迁移到 ClickHouse 进行分析
- 实现实时仪表板或时间序列分析
概述
ClickHouse 是面向在线分析处理(OLAP)的列式数据库管理系统(DBMS)。它针对大型数据集上的快速分析查询进行了优化。
关键特性:
- 列式存储
- 数据压缩
- 并行查询执行
- 分布式查询
- 实时分析
表设计模式
MergeTree 引擎(最常用)
CREATE TABLE markets_analytics (
date Date,
market_id String,
market_name String,
volume UInt64,
trades UInt32,
unique_traders UInt32,
avg_trade_size Float64,
created_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, market_id)
SETTINGS index_granularity = 8192;
ReplacingMergeTree(去重)
-- For data that may have duplicates (e.g., from multiple sources)
CREATE TABLE user_events (
event_id String,
user_id String,
event_type String,
timestamp DateTime,
properties String
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, event_id, timestamp)
PRIMARY KEY (user_id, event_id);
AggregatingMergeTree(预聚合)
-- For maintaining aggregated metrics
CREATE TABLE market_stats_hourly (
hour DateTime,
market_id String,
total_volume AggregateFunction(sum, UInt64),
total_trades AggregateFunction(count, UInt32),
unique_users AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, market_id);
-- Query aggregated data
SELECT
hour,
market_id,
sumMerge(total_volume) AS volume,
countMerge(total_trades) AS trades,
uniqMerge(unique_users) AS users
FROM market_stats_hourly
WHERE hour >= toStartOfHour(now() - INTERVAL 24 HOUR)
GROUP BY hour, market_id
ORDER BY hour DESC;
查询优化模式
高效过滤
-- 好的做法:先使用索引列
SELECT *
FROM markets_analytics
WHERE date >= '2025-01-01'
AND market_id = 'market-123'
AND volume > 1000
ORDER BY date DESC
LIMIT 100;
-- 不好的做法:先过滤非索引列
SELECT *
FROM markets_analytics
WHERE volume > 1000
AND market_name LIKE '%election%'
AND date >= '2025-01-01';
聚合
-- 时间序列聚合
SELECT
toStartOfHour(created_at) AS hour,
count() AS events,
uniq(user_id) AS unique_users,
sum(volume) AS total_volume
FROM markets_analytics
WHERE date >= today() - 7
GROUP BY hour
ORDER BY hour;
窗口函数
-- 排名和累计
SELECT
market_id,
date,
volume,
row_number() OVER (PARTITION BY market_id ORDER BY date DESC) AS rn,
sum(volume) OVER (PARTITION BY market_id ORDER BY date) AS cumulative_volume
FROM markets_analytics
WHERE date >= '2025-01-01';
数据摄入模式
批量插入
-- 始终批量插入(每批 1000+ 行)
INSERT INTO markets_analytics
SELECT * FROM input('date Date, market_id String, volume UInt64')
FORMAT JSONEachRow;
物化视图
-- 自动聚合新数据
CREATE MATERIALIZED VIEW market_daily_mv
TO market_daily_stats
AS SELECT
toDate(created_at) AS date,
market_id,
sum(volume) AS daily_volume,
count() AS daily_trades,
uniq(user_id) AS daily_users
FROM markets_analytics
GROUP BY date, market_id;
最佳实践
- 选择正确的 ORDER BY — 将最常过滤的列放在前面
- 使用适当的分区 — 通常按月(toYYYYMM)用于时间序列数据
- 批量插入 — 避免单行插入,目标每批 1000+ 行
- 使用物化视图 — 用于预计算常见聚合
- **避免 SELECT *** — 只选择需要的列(列式存储的优势)
- 使用 PREWHERE — 对于高选择性过滤器比 WHERE 更快
- 监控合并 — 使用 system.merges 跟踪后台合并
兼容工具
Claude CodeCursor
标签
数据工程

