Cost-Aware LLM Pipeline

Low risk

Author: @affaan-m (verified source)

4.6290 installs, v1.0.0, updated May 25, 2026

Usage

Run the following commands in Claude Code:

Step 1: Add the marketplace

/plugin marketplace add affaan-m/ECC

Step 2: Install the plugin

/plugin install cost-aware-llm-pipeline@ecc

About

Cost optimization patterns for LLM API usage: routing models by task complexity, budget tracking, retry logic, and prompt caching.

name: cost-aware-llm-pipeline
description: Cost optimization patterns for LLM API usage: routing models by task complexity, budget tracking, retry logic, and prompt caching.
origin: ECC

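Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Model routing, budget tracking, retry logic, and prompt caching are combined into a composable pipeline.

When to Activate

Building applications that call LLM APIs (Claude, GPT, etc.)
Processing batches of items that vary in complexity
Operating within a fixed API spending budget
Optimizing cost without sacrificing quality on complex tasks

Core Concepts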

1. Route Models by Task Complexity

Automatically select a cheaper model for simple tasks and reserve the expensive model for complex ones.

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)

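2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each API call produces a new tracker instead of modifying the existing one, so state is never mutated.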

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
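`CostRecord.cost_usd` has to be computed from the token counts the API reports. A minimal sketch of that conversion, assuming per-million-token prices in a hard-coded table; `PRICES_PER_MTOK` and `estimate_cost` are hypothetical names, and the prices are assumptions to check against Anthropic's current pricing.

```python
# Assumed USD prices per million tokens as (input, output).
# Verify against current Anthropic pricing before relying on them.
PRICES_PER_MTOK: dict[str, tuple[float, float]] = {
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: USD cost of one call.

    cost = (input_tokens * input_price + output_tokens * output_price) / 1e6
    Raises KeyError for a model missing from the price table, so an
    unpriced model is never silently counted as free.
    """
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Worked example: 12,000 input + 800 output tokens.
# Sonnet: (12_000 * 3 + 800 * 15) / 1e6 = 0.048 USD
# Haiku:  (12_000 * 1 + 800 * 5)  / 1e6 = 0.016 USD
sonnet_cost = estimate_cost("claude-sonnet-4-6", 12_000, 800)
haiku_cost = estimate_cost("claude-haiku-4-5-20251001", 12_000, 800)
```

The result can be passed as `CostRecord(cost_usd=...)`. Cache writes and cache reads are billed at different rates from ordinary input tokens, so a pipeline that leans heavily on prompt caching would need to price those token counts separately.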

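3. Targeted Retry Logic

Retry only on transient errors (connection failures, rate limits, server errors). Fail fast on authentication and bad-request errors, since retrying cannot fix them.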

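import time
from collections.abc import Callable
from typing import TypeVar

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

T = TypeVar("T")

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func: Callable[[], T], *, max_retries: int = _MAX_RETRIES) -> T:
    """Retry only on transient errors, fail fast on others.

    max_retries is the total number of attempts, including the first.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    for attempt in range(max_retries):
        try:
            return func()
        # Only transient errors are caught; AuthenticationError,
        # BadRequestError, etc. propagate on the first attempt.
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last transient error
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    raise AssertionError("unreachable")  # the loop always returns or raises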
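Because the exception classes come from the `anthropic` SDK, the function above is awkward to exercise without making network calls. A sketch of the same retry policy with the retryable exceptions and the `sleep` function passed in, so it can be tested offline; `TransientError` and `AuthError` are hypothetical stand-ins for `RateLimitError` and `AuthenticationError`, and `base_delay` is an added parameter not in the original.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    *,
    retryable: tuple[type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry func() on `retryable` errors with exponential backoff.

    Any other exception propagates on the first attempt (fail fast).
    """
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last transient error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


class TransientError(Exception):
    """Stand-in for RateLimitError / APIConnectionError in this sketch."""


class AuthError(Exception):
    """Stand-in for AuthenticationError in this sketch."""


# Demo: fails twice with a transient error, then succeeds.
delays: list[float] = []
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = call_with_retry(flaky, retryable=(TransientError,), sleep=delays.append)
# result == "ok"; delays == [1.0, 2.0]
```

Injecting `sleep` keeps tests instant and lets them assert on the exact backoff schedule.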

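4. Prompt Caching

Cache long, stable prompt prefixes such as the system prompt. The text is still sent with every request, but on a cache hit it is billed at a fraction of the normal input-token price and processed faster. Only an exact prefix match is reused.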

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
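The example above places the cacheable text in the user turn, which works because `cache_control` can mark any content block. The Messages API also accepts the system prompt through its `system` parameter, which keeps instructions separate from user input. A minimal sketch that builds keyword arguments for `client.messages.create()` that way; `build_cached_request` is a hypothetical helper and the default `max_tokens` is an arbitrary assumption.

```python
from typing import Any


def build_cached_request(
    model: str,
    system_prompt: str,
    user_input: str,
    max_tokens: int = 1024,  # assumed default; the API requires a value
) -> dict[str, Any]:
    """Build kwargs for client.messages.create() with a cached system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # Stable prefix: marked as a cache breakpoint so later requests
        # with the identical system prompt can read it from the cache.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable part: changes on every request, so it is not cached.
        "messages": [{"role": "user", "content": user_input}],
    }
```

Pass the result as `client.messages.create(**build_cached_request(...))`. Whether caching is working can be checked on the response: `response.usage.cache_creation_input_tokens` counts tokens written to the cache and `response.usage.cache_read_input_tokens` counts tokens served from it.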

Composition

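Combine all four techniques in a single pipeline function. The snippet assumes `client` (an `anthropic.Anthropic` instance), `system_prompt`, `Config`, `Result`, `BudgetExceededError`, `build_cached_messages`, and `parse_result` are defined elsewhere; `estimate_cost` is a hypothetical helper that converts token counts into USD.

def process(
    text: str,
    estimated_items: int,
    config: Config,
    tracker: CostTracker,
) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget before spending more
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        max_tokens=1024,  # required by the Messages API; value is an example
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost from the usage the API reports (immutable)
    usage = response.usage
    record = CostRecord(
        model=model,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
        # estimate_cost is a hypothetical helper, not part of the SDK
        cost_usd=estimate_cost(model, usage.input_tokens, usage.output_tokens),
    )
    tracker = tracker.add(record)

    return parse_result(response), tracker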

Pricing Reference (2025-2026)

| Model | Input ($/million tokens) | Output ($/million tokens) | Relative cost |
|-------|--------------------------|---------------------------|---------------|
| Haiku 4.5 | $1.00 | $5.00 | 1x |
| Sonnet 4.6 | $3.00 | $15.00 | 3x |
| Opus 4.5 | $5.00 | $25.00 | 5x |

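Best Practices

Start with the cheapest model and route to more expensive ones only when a complexity threshold is met
Set an explicit budget limit before processing a batch; fail early rather than overspend
Log model-selection decisions so thresholds can be tuned against real data
Use prompt caching for system prompts longer than 1024 tokens to save both cost and latency
Never retry authentication errors; they will not resolve on their own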
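Checking the tracker between calls can still let one expensive call overshoot. To fail before a batch starts, the batch can be priced up front. A rough sketch, assuming every item is sent to the more expensive model, uses its full `max_tokens` for output, and averages about 4 characters per token (a heuristic, not a tokenizer); `projected_batch_cost` and `check_batch_budget` are hypothetical helpers.

```python
import math


class BudgetExceededError(RuntimeError):
    """Raised before any API call when the projected cost is too high."""


def projected_batch_cost(
    texts: list[str],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    max_output_tokens: int,
    chars_per_token: float = 4.0,  # rough heuristic, not a real tokenizer
) -> float:
    """Rough USD estimate assuming every item uses all max_output_tokens."""
    input_tokens = sum(math.ceil(len(t) / chars_per_token) for t in texts)
    output_tokens = max_output_tokens * len(texts)
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


def check_batch_budget(
    texts: list[str],
    budget_limit: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    max_output_tokens: int,
) -> float:
    """Return the projected cost, or raise before any money is spent."""
    cost = projected_batch_cost(
        texts, input_price_per_mtok, output_price_per_mtok, max_output_tokens
    )
    if cost > budget_limit:
        raise BudgetExceededError(
            f"projected ${cost:.2f} exceeds budget ${budget_limit:.2f}"
        )
    return cost


# 1,000 texts of 8,000 chars at assumed Sonnet prices ($3 / $15 per MTok),
# max_tokens=1024: 2,000,000 input tokens -> $6.00,
# 1,024,000 output tokens -> $15.36, total $21.36.
batch = ["x" * 8_000] * 1_000
projected = projected_batch_cost(batch, 3.00, 15.00, 1024)
```

Pricing the whole batch at the expensive model's rates keeps the estimate on the cautious side when routing will send some items to the cheaper model.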

Compatible Tools

Claude Code, Cursor
