
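name: statsmodels
description: "Statsmodels is Python's premier statistical modeling library, providing estimation, inference, and diagnostic tools across a wide range of statistical methods."
license: BSD-3-Clause license
metadata:
  skill-author: K-Dense Inc.
risk: unknown
source: community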
Statsmodels: Statistical Modeling and Econometrics
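Overview
Statsmodels is Python's premier statistical modeling library, providing estimation, inference, and diagnostic tools across a wide range of statistical methods. Use this skill when an analysis calls for rigorous statistics, from simple linear regression to complex time series models and econometric analysis.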
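When to Use This Skill
Use this skill to:
- Fit regression models (OLS, WLS, GLS, quantile regression)
- Fit generalized linear models (logistic, Poisson, Gamma, and others)
- Analyze discrete outcomes (binary, multinomial, count, ordinal)
- Perform time series analysis (ARIMA, SARIMAX, VAR, forecasting)
- Run statistical tests and diagnostics
- Test model assumptions (heteroskedasticity, autocorrelation, normality)
- Detect outliers and influential observations
- Compare models (AIC/BIC, likelihood ratio tests)
- Estimate causal effects
- Produce publication-ready statistical tables and inference results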
Quick Start Guide
Linear Regression (OLS)
import statsmodels.api as sm
import numpy as np
import pandas as pd
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)
# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()
# View comprehensive results
print(results.summary())
# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"P-values:\n{results.pvalues}")
# Predictions with confidence intervals
# X_new must have the same columns as X, including the constant
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary) # includes mean, CI, prediction intervals
# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")
# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
Logistic Regression (Binary Outcomes)
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit
# Add constant
X = sm.add_constant(X_data)
# Fit logit model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\n", odds_ratios)
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Model evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")
# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
Time Series (ARIMA)
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
# Check stationarity
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
    # Cannot reject a unit root: treat the series as non-stationary and difference it
    y_diff = y_series.diff().dropna()
else:
    # Series looks stationary: inspect it as-is
    y_diff = y_series
# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()
# Fit ARIMA(p,d,q); d=1 differences inside the model, so pass the original series
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df) # includes mean and confidence intervals
# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
Generalized Linear Models (GLM)
import numpy as np
import statsmodels.api as sm
# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\n", rate_ratios)
# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
    # Use Negative Binomial instead
    from statsmodels.discrete.discrete_model import NegativeBinomial
    nb_model = NegativeBinomial(y_counts, X)
    nb_results = nb_model.fit()
    print(nb_results.summary())
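Core Statistical Modeling Capabilities
1. Linear Regression Models
A comprehensive suite of linear models for continuous outcomes, supporting a range of error structures such as heteroskedastic or autocorrelated errors.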
