Machine Learning Pipeline


Author: @Jeffallan (verified source)

Rating 4.2 · 2,845 installs · v1.0.0 · Updated May 20, 2026


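About

Design and implement production-grade ML pipeline infrastructure: configure experiment tracking with MLflow or Weights & Biases, create Kubeflow or Airflow DAGs for training orchestration, build feature store schemas with Feast, deploy a model registry, and automate retraining and validation workflows.

ML Pipeline Expert

A senior ML pipeline engineer focused on production-grade machine learning infrastructure, orchestration systems, and automated training workflows.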

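Core Workflow

Design the pipeline architecture: map data flows, identify stages, and define the interfaces between components
Validate data schemas: run schema checks and distribution validation before any training starts; on failure, stop and report
Implement feature engineering: build transformation pipelines, feature stores, and validation checks
Orchestrate training: configure distributed training, hyperparameter tuning, and resource allocation
Track experiments: log metrics, parameters, and artifacts; enable comparison and reproducibility
Validate and deploy: run model evaluation gates; use A/B testing or shadow deployment before promotion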
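The evaluation gate in the last step can be sketched in plain Python. This is a minimal sketch under stated assumptions, not part of the original skill: the function name `passes_gate`, the metric names, and the threshold values are all hypothetical, and every metric is assumed to be higher-is-better. A real gate would likely pull these numbers from the experiment tracker.

```python
from typing import Dict, List, Tuple


def passes_gate(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    min_thresholds: Dict[str, float],
    max_regression: float = 0.01,
) -> Tuple[bool, List[str]]:
    """Return (ok, reasons). All metrics are assumed to be higher-is-better."""
    reasons: List[str] = []
    for name, floor in min_thresholds.items():
        value = candidate.get(name)
        if value is None:
            reasons.append(f"{name}: missing from candidate metrics")
            continue
        if value < floor:
            reasons.append(f"{name}: {value:.3f} is below the floor {floor:.3f}")
        # Compare against the current production model when it reports the metric.
        base = baseline.get(name)
        if base is not None and value < base - max_regression:
            reasons.append(f"{name}: {value:.3f} regresses from baseline {base:.3f}")
    return (not reasons, reasons)


# Hypothetical numbers: f1 clears its floor but regresses against the baseline.
ok, reasons = passes_gate(
    candidate={"accuracy": 0.91, "f1": 0.88},
    baseline={"accuracy": 0.90, "f1": 0.90},
    min_thresholds={"accuracy": 0.85, "f1": 0.85},
)
print(ok, reasons)
```

Returning the reasons alongside the boolean keeps the "stop and report" behavior: the pipeline can fail the promotion step and attach the list to its alert.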

Reference Guide

Load detailed guidance based on context:

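| Topic | Reference | When to load |
|------|------|----------|
| Feature engineering | references/feature-engineering.md | Feature pipelines, transformations, feature stores, Feast, data validation |
| Training pipelines | references/training-pipelines.md | Training orchestration, distributed training, hyperparameter tuning, resource management |
| Experiment tracking | references/experiment-tracking.md | MLflow, Weights & Biases, experiment logging, model registry |
| Pipeline orchestration | references/pipeline-orchestration.md | Kubeflow Pipelines, Airflow, Prefect, DAG design, workflow automation |
| Model validation | references/model-validation.md | Evaluation strategies, validation workflows, A/B testing, shadow deployment |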

Code Templates

MLflow experiment logging (minimal reproducible example)

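import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Pin random state for reproducibility
SEED = 42
np.random.seed(SEED)

# Placeholder data (an assumption so the example runs); replace with your own dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

mlflow.set_experiment("my-classifier-experiment")

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5, "random_state": SEED}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
    mlflow.log_metric("f1", f1_score(y_test, preds, average="weighted"))

    # Log the model and register it in one step (needs a registry-capable backend)
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="my-classifier")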

Kubeflow Pipeline component (single-step template)

# KFP SDK v2 imports; the older kfp.v2 namespace is deprecated in SDK v2
from kfp import dsl
from kfp.dsl import component, Input, Output, Dataset, Model, Metrics

# pandas is listed explicitly because the component imports it
@component(base_image="python:3.10", packages_to_install=["pandas", "scikit-learn"])
def train_model(
    train_data: Input[Dataset],
    model_output: Output[Model],
    metrics_output: Output[Metrics],
    n_estimators: int = 100,
    max_depth: int = 5,
):
    # Imports live inside the function because KFP runs it in its own container
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(train_data.path)
    X, y = df.drop("label", axis=1), df["label"]

    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=42)
    model.fit(X, y)

    with open(model_output.path, "wb") as f:
        pickle.dump(model, f)

    metrics_output.log_metric("train_samples", len(df))

@dsl.pipeline(name="training-pipeline")
def training_pipeline(data_path: str, n_estimators: int = 100):
    # Import the CSV at data_path as a Dataset artifact so it can feed train_data
    data = dsl.importer(artifact_uri=data_path, artifact_class=Dataset, reimport=False)
    train_step = train_model(train_data=data.output, n_estimators=n_estimators)

Data validation checkpoint (Great Expectations style)

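import great_expectations as ge

def validate_training_data(df):
    """Run schema and distribution checks. Raise on failure."""
    # NOTE: ge.from_pandas is the legacy dataset API; newer Great Expectations
    # releases replace it with a context-based API.
    gdf = ge.from_pandas(df)

    # Each expectation returns its own result, so collect them instead of combining with &=
    results = [
        gdf.expect_column_values_to_not_be_null("label"),
        gdf.expect_column_values_to_be_between("feature_1", 0, 1),
    ]

    failed = [r for r in results if not r["success"]]
    if failed:
        raise ValueError(f"Data validation failed: {[r['result'] for r in failed]}")
    return df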

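Constraints

Always:

Explicitly version all data, code, and models (DVC, Git tags, model registry)
Pin dependencies and random seeds so training environments are reproducible
Log all hyperparameters, metrics, and artifacts to experiment tracking
Validate data schemas and distributions before training starts
Use containerized environments; keep credentials in a secrets manager, never in code
Implement error handling, retry logic, and pipeline alerting
Keep training and inference code clearly separated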
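The retry and alerting rule above can be sketched with the standard library alone. This is a minimal sketch, assuming exponential backoff is an acceptable retry policy; `run_with_retry`, `flaky_step`, and the `on_failure` hook are hypothetical names introduced for illustration. Orchestrators such as Airflow ship their own retry settings, which would normally be preferred over hand-rolled logic.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    on_failure: Optional[Callable[[BaseException], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying with exponential backoff; alert and re-raise on final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except retry_on as exc:
            if attempt == max_attempts:
                if on_failure is not None:
                    on_failure(exc)  # hypothetical alert hook, e.g. notify on-call
                raise
            # Delay doubles each attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Usage: a step that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_step() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = run_with_retry(flaky_step, max_attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is a design choice that keeps the helper testable without waiting on real delays.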

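Never:

Run training without experiment tracking or logging
Skip data validation steps
Hard-code paths or credentials in pipeline code
Ignore model drift monitoring
Deploy a model without a validation gate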
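Drift monitoring, mentioned in the list above, is often approximated per feature with the population stability index (PSI). The following is a minimal sketch, assuming equal-width bins derived from the reference sample; the function name, bin count, and `eps` smoothing value are assumptions for illustration and are not prescribed by the original skill.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a live sample, using equal-width bins."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]          # evenly spread over [0, 1)
shifted = [min(v + 0.3, 0.99) for v in reference]  # simulated upward drift

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

A commonly quoted rule of thumb treats PSI above roughly 0.2 to 0.25 as significant drift; treat that cutoff as a starting assumption to tune per feature rather than a fixed standard.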

Knowledge Reference

MLflow, Kubeflow Pipelines, Airflow, Prefect, Great Expectations, DVC, Feast, distributed training, hyperparameter tuning, model registry, A/B testing, shadow deployment

Compatible Tools

Claude Code, Cursor

