Regex vs LLM Text Parsing

Low risk

Author: @affaan-m (verified source)

4.5408 installs · v1.0.0 · Updated May 25, 2026

Usage

Run the following commands in Claude Code.

Step 1: Add the marketplace

/plugin marketplace add affaan-m/ECC

Step 2: Install the plugin

/plugin install regex-vs-llm-structured-text@ecc

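About

A decision framework for choosing between regular expressions and an LLM when parsing structured text: start with regex, and bring in an LLM only for low-confidence edge cases.

name: regex-vs-llm-structured-text
description: A decision framework for choosing between regex and an LLM when parsing structured text. Start with regex and add an LLM only for low-confidence edge cases.
origin: ECC

Regex vs LLM for Structured Text Parsing

A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The core insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.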

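When to Activate

Parsing structured text with repeating patterns (questions, forms, tables)
Deciding between regex and an LLM for text extraction
Building hybrid pipelines that combine both approaches
Optimizing the cost/accuracy trade-off in text processing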

Decision Framework

Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
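The tree above can be read as a small function over two measured ratios. The following is a minimal sketch, not part of the original skill: `Strategy`, `choose_strategy`, `pattern_share`, and `regex_coverage` are hypothetical names, and both inputs are assumed to be fractions you have measured on a sample of your own data.

```python
from enum import Enum


class Strategy(Enum):
    REGEX_ONLY = "regex_only"
    REGEX_PLUS_LLM = "regex_plus_llm"
    LLM_ONLY = "llm_only"


def choose_strategy(pattern_share: float, regex_coverage: float) -> Strategy:
    """Map the decision tree onto two ratios, each in the range 0.0-1.0.

    pattern_share: fraction of the input that follows a repeating pattern.
    regex_coverage: fraction of items the regex extracts correctly.
    """
    # Free-form or highly variable text: go straight to the LLM.
    if pattern_share <= 0.90:
        return Strategy.LLM_ONLY
    # Regex already handles 95% or more: no LLM needed.
    if regex_coverage >= 0.95:
        return Strategy.REGEX_ONLY
    # Consistent format, but regex misses too many items: add the LLM for edge cases.
    return Strategy.REGEX_PLUS_LLM
```

The thresholds mirror the 90% and 95% figures in the tree; treat them as starting points to tune rather than fixed constants.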

Architecture Pattern

Source Text
    │
    ▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
    │
    ▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
    │
    ▼
[Confidence Scorer] ─── Flags low-confidence extractions
    │
    ├── High confidence (≥0.95) → Direct output
    │
    └── Low confidence (<0.95) → [LLM Validator] → Output
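The [Text Cleaner] stage in the diagram has no implementation in this skill, so here is one possible sketch. Everything in it is an assumption: the function name `clean_text`, the "Page N of M" footer format, the `[x]` / `[*]` inline markers, and the set of invisible characters are illustrative and should be replaced with whatever noise your own sources contain.

```python
import re

# Hypothetical noise patterns; adjust them to the artifacts in your own documents.
_PAGE_LINE = re.compile(r"^\s*Page\s+\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)
_MARKER = re.compile(r"\s*\[(?:x|\*)\]")          # inline marks such as "[x]" or "[*]"
_INVISIBLE = re.compile(r"[\u200b\ufeff\x0c]")    # zero-width space, BOM, form feed


def clean_text(raw: str) -> str:
    """Strip page-number lines, inline markers, and invisible characters."""
    # Remove invisible characters first so they cannot hide a page-number line.
    raw = _INVISIBLE.sub("", raw)
    cleaned_lines = []
    for line in raw.split("\n"):
        if _PAGE_LINE.match(line):
            continue  # Drop whole lines that are only a page footer
        cleaned_lines.append(_MARKER.sub("", line).rstrip())
    return "\n".join(cleaned_lines)
```

The same function can be applied to individual extracted fields or to a whole block of text, depending on where the noise shows up in your pipeline.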

Implementation

1. Regex Parser (handles most cases)

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0

def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
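    # Anchor items at line start (re.MULTILINE) and leave DOTALL off so "." stays
    # on one line; otherwise an item with a missing answer absorbs the next item.
    pattern = re.compile(
        r"^(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE,
    )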
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
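Items the regex fails to match never reach the confidence scorer, so it helps to measure how many candidate items were extracted at all. A minimal sketch follows, assuming every item starts on its own line as `<number>.`; the helper name `regex_coverage` and that line-start convention are assumptions, not part of the original skill.

```python
import re

# Assumption: every item begins on its own line as "<number>." (for example "12. ...").
_ITEM_START = re.compile(r"^\d+\.", re.MULTILINE)


def regex_coverage(content: str, parsed_count: int) -> float:
    """Return the fraction of candidate items that the parser actually extracted."""
    candidates = len(_ITEM_START.findall(content))
    if candidates == 0:
        return 0.0
    # Cap at 1.0 in case the parser found items the simple counter missed.
    return min(1.0, parsed_count / candidates)
```

Calling `regex_coverage(content, len(parse_structured_text(content)))` gives a rough number to compare against the 95% threshold in the decision framework.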

2. Confidence Scoring

Flag items that may need LLM review:

@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]

def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0

    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3

    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5

    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2

    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )

def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]

3. LLM Validator (edge cases only)

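import json
from dataclasses import replace

def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                f"Return corrected JSON with keys 'text', 'choices', 'answer' "
                f"if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    # Assumes the Anthropic Messages API response shape (text in the first content block).
    reply = response.content[0].text.strip()
    if reply == "CORRECT":
        return item
    # The JSON keys below are an assumed reply schema, matching the prompt above.
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return item  # Unparseable reply: keep the regex extraction
    if not isinstance(data, dict):
        return item
    return replace(
        item,
        text=data.get("text", item.text),
        choices=tuple(data.get("choices", item.choices)),
        answer=data.get("answer", item.answer),
    )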

4. Hybrid Pipeline

def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: Regex extraction (handles 95-98%)
    items = parse_structured_text(content)

    # Step 2: Confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)

    if not low_confidence or llm_client is None:
        return items

    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
        if item.id in low_conf_ids:
            result.append(validate_with_llm(item, content, llm_client))
        else:
            result.append(item)

    return result
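The pipeline above passes the whole document as `original_text` for every flagged item, which makes each LLM call more expensive than it needs to be. One possible refinement, sketched here under the assumption that each item starts on its own line as `<id>.`, is to send only the block belonging to that item. The helper name `extract_item_context` is hypothetical.

```python
import re


def extract_item_context(content: str, item_id: str) -> str:
    """Return only the block of text that belongs to one numbered item."""
    # Assumed convention: each item starts a line with "<id>."
    starts = list(re.finditer(r"^(\d+)\.", content, re.MULTILINE))
    for index, match in enumerate(starts):
        if match.group(1) == item_id:
            # The block ends where the next item begins, or at the end of the text.
            if index + 1 < len(starts):
                end = starts[index + 1].start()
            else:
                end = len(content)
            return content[match.start():end].strip()
    return content  # Fall back to the full text if the item is not found
```

With this helper, the call inside the loop could become `validate_with_llm(item, extract_item_context(content, item.id), llm_client)`.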

Cost Comparison

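| Approach | Cost per 1,000 items | Latency | Determinism |
|------|------|------|------|
| Regex only | ~$0 | <1ms/item | Fully deterministic |
| LLM only | ~$2-5 | 500ms-2s/item | Non-deterministic |
| Hybrid (regex + 5% LLM) | ~$0.10-0.25 | ~25-100ms average | Mostly deterministic |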
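The hybrid cost figure follows directly from the LLM-only figure: if only 5% of items reach the LLM and regex is treated as free, the cost scales by that fraction. A small worked sketch, with the hypothetical helper name `hybrid_cost_per_1000`:

```python
def hybrid_cost_per_1000(llm_cost_per_1000: float, llm_fraction: float) -> float:
    """Cost per 1,000 items when only llm_fraction of them reach the LLM.

    Regex cost is treated as approximately zero, as in the comparison table.
    """
    return llm_cost_per_1000 * llm_fraction


# Using the table's LLM-only range of $2-5 per 1,000 items and a 5% LLM share.
low = hybrid_cost_per_1000(2.0, 0.05)
high = hybrid_cost_per_1000(5.0, 0.05)
print(f"${low:.2f}-${high:.2f} per 1000 items")  # prints "$0.10-$0.25 per 1000 items"
```

The same function shows how quickly the savings erode if the regex stage flags more items: at a 20% LLM share the range becomes $0.40-$1.00.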

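Key Takeaways

Start with regex: it handles most structured text
Measure confidence: know where regex fails
Use an LLM only for edge cases: don't pay for the 95% of cases that are easy
Use the cheapest model: Haiku is sufficient for validation
Cache LLM results: the same edge cases often recur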
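The caching takeaway can be sketched as a thin wrapper around whatever function calls the LLM. This is an illustration under assumptions: `make_cached_validator` is a hypothetical name, the validator is assumed to take a text snippet and return a string, and the cache is an in-memory dict that does not persist between runs.

```python
import hashlib
from typing import Callable


def make_cached_validator(validate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a validator so identical snippets trigger only one LLM call."""
    cache: dict[str, str] = {}

    def cached(snippet: str) -> str:
        # Normalize whitespace so trivially different copies share a cache key.
        normalized = " ".join(snippet.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = validate(snippet)
        return cache[key]

    return cached
```

For long-running or multi-process pipelines, the dict could be swapped for a persistent store keyed by the same hash.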

Compatible Tools

Claude Code, Cursor
