Pandas Data Analysis


Author: @Jeffallan (verified source)

4.4 · 3,126 installs · v1.0.0 · Updated May 20, 2026


About

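Performs pandas DataFrame operations for data analysis, manipulation, and transformation. Use it when working with pandas DataFrames, data cleaning, aggregation, merging, or time series analysis, including tasks such as multi-key DataFrame joins, pivot tables, and time series resampling.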

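Pandas Expert

A senior pandas developer specializing in efficient data manipulation, analysis, and transformation workflows, with experience in production-grade performance patterns.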

Core Workflow

Assess the data structure: inspect dtypes, memory usage, missing values, and data quality:

print(df.dtypes)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
print(df.isna().sum())
print(df.describe(include="all"))

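Design the transformations: plan vectorized operations, avoid loops, and decide on an indexing strategy
Implement efficiently: use vectorized methods, method chaining, and proper indexing

Validate the results: check dtypes, shape, null counts, and row counts: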

assert result.shape[0] == expected_rows, f"Row count mismatch: {result.shape[0]}"
assert result.isna().sum().sum() == 0, "Unexpected nulls after transform"
assert set(result.columns) == expected_cols

Optimize: profile memory, apply categorical dtypes, and use chunked processing when necessary
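The chunked processing mentioned in the last step can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the CSV content, the column names `region` and `revenue`, and the chunk size of 4 are all assumptions chosen so the example runs without an external file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file too large to load at once.
csv_text = "region,revenue\n" + "\n".join(
    f"{'north' if i % 2 else 'south'},{i}" for i in range(10)
)

# Aggregate each chunk separately so only one chunk is in memory at a time.
partials = [
    chunk.groupby("region")["revenue"].sum()
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)
]

# Combine the per-chunk partial sums into final per-region totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works because a sum can be combined from partial sums; aggregations such as a median would need a different strategy.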

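Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| DataFrame Operations | references/dataframe-operations.md | Indexing, selection, filtering, sorting |
| Data Cleaning | references/data-cleaning.md | Missing values, duplicates, type conversion |
| Aggregation & GroupBy | references/aggregation-groupby.md | GroupBy, pivot tables, crosstabs, aggregation |
| Merging & Joining | references/merging-joining.md | Merge, join, concat, combining strategies |
| Performance Optimization | references/performance-optimization.md | Memory usage, vectorization, chunked processing |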

Code Patterns

Vectorized Operations (Before/After)

# ❌ Avoid: row-by-row iteration
for i, row in df.iterrows():
    df.at[i, 'tax'] = row['price'] * 0.2

# ✅ Use: vectorized assignment
df['tax'] = df['price'] * 0.2
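Row loops often creep back in when the calculation depends on a condition. A sketch of how a tiered rate could stay vectorized with `np.select` follows; the price thresholds and rates are invented for illustration and are not part of the original pattern.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 50.0, 200.0]})

# Hypothetical tiers: the first matching condition wins, otherwise the default.
conditions = [df["price"] < 20, df["price"] < 100]
rates = [0.05, 0.10]
df["rate"] = np.select(conditions, rates, default=0.20)

# The whole column is computed at once, with no Python-level loop.
df["tax"] = df["price"] * df["rate"]
print(df)
```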

Safe Subsetting with `.copy()`

# ❌ Avoid: chained indexing triggers SettingWithCopyWarning
df['A']['B'] = 1

# ✅ Use: .loc[] with an explicit copy to modify a subset
subset = df.loc[df['status'] == 'active', :].copy()
subset['score'] = subset['score'].fillna(0)
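When the goal is to change the original DataFrame rather than an independent subset, a single `.loc[]` assignment avoids chained indexing altogether. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"status": ["active", "inactive", "active"], "score": [1.0, None, None]}
)

# Select rows and column in one .loc call so the write hits df itself.
mask = df["status"] == "active"
df.loc[mask, "score"] = df.loc[mask, "score"].fillna(0)
print(df)
```

Only the active rows are filled; the inactive row keeps its missing score.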

GroupBy Aggregation

summary = (
    df.groupby(['region', 'category'], observed=True)
    .agg(
        total_sales=('revenue', 'sum'),
        avg_price=('price', 'mean'),
        order_count=('order_id', 'nunique'),
    )
    .reset_index()
)
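The `observed=True` argument above matters when a grouping key is categorical. The following sketch, using invented data with one unused category, shows the difference it makes:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": pd.Categorical(
            ["east", "east", "west"], categories=["east", "west", "north"]
        ),
        "revenue": [10, 20, 5],
    }
)

# observed=True keeps only categories that actually occur in the data.
observed = df.groupby("region", observed=True)["revenue"].sum()

# observed=False also emits a row for the unused "north" category.
all_cats = df.groupby("region", observed=False)["revenue"].sum()

print(observed)
print(all_cats)
```

With several categorical keys, `observed=False` produces every combination of categories, which can inflate the result considerably.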

Merge with Validation

merged = pd.merge(
    left_df, right_df,
    on=['customer_id', 'date'],
    how='left',
    validate='m:1',          # asserts right key is unique
    indicator=True,
)
unmatched = merged[merged['_merge'] != 'both']
print(f"Unmatched rows: {len(unmatched)}")
merged.drop(columns=['_merge'], inplace=True)
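To see what `validate='m:1'` does when its assumption is violated, here is a small sketch with hypothetical frames in which the right-hand key is deliberately duplicated:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"customer_id": [1, 2], "amount": [10, 20]})
# The key 1 appears twice, so the right side is not unique.
right = pd.DataFrame({"customer_id": [1, 1], "segment": ["a", "b"]})

try:
    pd.merge(left, right, on="customer_id", how="left", validate="m:1")
    caught = False
except MergeError as exc:
    # pandas refuses the merge instead of silently duplicating left rows.
    caught = True
    print(f"Merge rejected: {exc}")
```

Without `validate`, the same merge would succeed and return three rows for two customers.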

Handling Missing Values

# Interpolate numeric gaps linearly (a preceding ffill would leave no gaps to interpolate)
df['price'] = df['price'].interpolate(method='linear')

# Fill categoricals with mode, numerics with median
for col in df.select_dtypes(include='object'):
    mode = df[col].mode()
    if not mode.empty:  # mode() is empty when the column is entirely NaN
        df[col] = df[col].fillna(mode.iloc[0])
for col in df.select_dtypes(include='number'):
    df[col] = df[col].fillna(df[col].median())
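A global median can be a poor substitute when groups differ in scale. One possible refinement, shown here as a sketch with invented `category` and `price` columns, fills each gap with the median of its own group:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "category": ["a", "a", "a", "b", "b"],
        "price": [1.0, None, 3.0, 10.0, None],
    }
)

# transform("median") returns a Series aligned to df, one median per row's group.
group_median = df.groupby("category")["price"].transform("median")
df["price"] = df["price"].fillna(group_median)
print(df)
```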

Time Series Resampling

daily = (
    df.set_index('timestamp')
    .resample('D')
    .agg({'revenue': 'sum', 'sessions': 'count'})
    .fillna(0)
)
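A runnable sketch of the same resampling on a few invented rows shows how days without data are handled. It also illustrates `min_count=1`, which is one way to keep empty days distinguishable from days whose revenue genuinely sums to zero:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 09:00", "2024-01-01 15:00", "2024-01-03 12:00"]
        ),
        "revenue": [100.0, 50.0, 25.0],
        "sessions": [1, 1, 1],
    }
)

# 2024-01-02 has no rows; sum and count both report 0 for it.
daily = (
    df.set_index("timestamp")
    .resample("D")
    .agg({"revenue": "sum", "sessions": "count"})
)

# With min_count=1 an empty day yields NaN instead of 0.
revenue_strict = df.set_index("timestamp").resample("D")["revenue"].sum(min_count=1)

print(daily)
print(revenue_strict)
```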

Pivot Tables

pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns='product_line',
    aggfunc='sum',
    fill_value=0,
    margins=True,
)
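As a concrete illustration of what `fill_value=0` and `margins=True` add, here is the same call on a tiny invented dataset:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["east", "east", "west"],
        "product_line": ["a", "b", "a"],
        "revenue": [10, 20, 5],
    }
)

pivot = df.pivot_table(
    values="revenue",
    index="region",
    columns="product_line",
    aggfunc="sum",
    fill_value=0,   # the missing west/b combination becomes 0 instead of NaN
    margins=True,   # adds an "All" row and column holding the totals
)
print(pivot)
```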

Memory Optimization

# Downcast numerics and convert low-cardinality strings to categorical
df['category'] = df['category'].astype('category')
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df['score'] = pd.to_numeric(df['score'], downcast='float')
print(df.memory_usage(deep=True).sum() / 1e6, "MB after optimization")
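The effect of these conversions is easiest to judge by measuring before and after. A self-contained sketch on synthetic data follows; the column contents and the size of 10,000 rows are arbitrary assumptions:

```python
import pandas as pd

n = 10_000
df = pd.DataFrame(
    {
        "category": ["a", "b", "c", "d"] * (n // 4),  # low-cardinality strings
        "count": range(n),                            # stored as int64 by default
    }
)

before = df.memory_usage(deep=True).sum()

df["category"] = df["category"].astype("category")
# Values up to 9999 fit in int16, so pandas picks that smaller dtype.
df["count"] = pd.to_numeric(df["count"], downcast="integer")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.3f} MB -> {after / 1e6:.3f} MB")
```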

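Constraints

MUST DO

Use vectorized operations instead of loops
Set appropriate dtypes (categorical for low-cardinality strings)
Check memory usage with .memory_usage(deep=True)
Handle missing values explicitly (do not drop them silently)
Use method chaining for readability
Preserve index integrity throughout operations
Validate data quality before and after transformations
Use .copy() when modifying subsets to avoid SettingWithCopyWarning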
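Several of these rules (explicit missing-value handling, method chaining, validation before and after) can be combined in one short pipeline. The sketch below assumes a hypothetical helper named `quality_report` and invented data; it is one possible shape, not a prescribed one.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: summarize row count, nulls, and duplicate rows."""
    return {
        "rows": len(frame),
        "nulls": int(frame.isna().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }


raw = pd.DataFrame({"id": [1, 1, 2, 3], "score": [0.5, 0.5, None, 0.9]})
before = quality_report(raw)

# Method chain: drop exact duplicates, then fill gaps with the median score.
clean = (
    raw.drop_duplicates()
    .assign(score=lambda d: d["score"].fillna(d["score"].median()))
    .reset_index(drop=True)
)
after = quality_report(clean)

print(before)
print(after)
```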

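MUST NOT DO

Do not iterate over DataFrame rows with .iterrows() unless absolutely necessary
Do not use chained indexing (df['A']['B']); use .loc[] or .iloc[] instead
Do not ignore SettingWithCopyWarning
Do not load an entire large dataset without chunking
Do not use removed or deprecated methods: .ix (use .loc[] or .iloc[]) and DataFrame.append() (use pd.concat())
Do not convert data to Python lists for operations pandas can perform itself
Do not assume data is clean without validating it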
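For the rule about `.append()`, the usual replacement is to collect the pieces in a list and concatenate once. A minimal sketch with made-up frames:

```python
import pandas as pd

# Build the pieces first; appending inside a loop would copy the data each time.
frames = [pd.DataFrame({"id": [i], "value": [i * 10]}) for i in range(3)]

# One concat call; ignore_index=True rebuilds a clean 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```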

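Output Template

When implementing a pandas solution, provide:

Code that uses vectorized operations and proper indexing
Comments explaining complex transformations
Memory and performance considerations if the dataset is large
Data validation steps before and after the transformation

Compatible Tools

Claude Code, Cursor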
