LLM Evaluation Strategies

Low risk

Author: @sickn33 (verified source)

4.3577 installs · v1.0.0 · Updated May 25, 2026

Usage

Run the following commands in Claude Code.

Step 1: Add the marketplace

/plugin marketplace add sickn33/antigravity-awesome-skills

Step 2: Install the plugin

/plugin install llm-evaluation@antigravity-awesome-skills

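About

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing, to ensure the quality and reliability of AI systems.

name: llm-evaluation
description: "Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing."
risk: unknown
source: community
date_added: "2026-02-27"

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.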

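When not to use

The task is unrelated to LLM evaluation
You need a different domain or tool outside this scope

Instructions

Clarify the goals, constraints, and required inputs.
Apply the relevant best practices and validate the results.
Provide actionable steps and a way to verify them.
For detailed examples, open resources/implementation-playbook.md.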

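When to use

Measuring LLM application performance systematically
Comparing different models or prompts
Detecting performance regressions before deployment
Validating improvements from prompt changes
Building confidence in production systems
Establishing baselines and tracking progress over time
Debugging unexpected model behavior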
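
Detecting regressions against a baseline, as listed above, can be sketched as a simple score comparison. This is a minimal illustration, not part of the original skill: the function name `find_regressions`, the `tolerance` parameter, and the sample scores are hypothetical, and the sketch assumes that higher is better for every metric.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is worse than the baseline.

    Assumes higher is better for every metric (hypothetical convention).
    A drop smaller than or equal to `tolerance` is treated as noise.
    """
    regressions = {}
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            continue  # metric not measured for the candidate
        drop = base_score - new_score
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Illustrative scores only, not real results
baseline = {"accuracy": 0.90, "rougeL": 0.40}
candidate = {"accuracy": 0.85, "rougeL": 0.41}
print(find_regressions(baseline, candidate))
```

A check like this could run in CI before a prompt or model change ships. The tolerance would need tuning to the noise level of each metric.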

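Core evaluation types

1. Automated metrics

Computed scores that allow fast, repeatable, scalable evaluation.

Text generation:

BLEU: n-gram overlap (translation)
ROUGE: recall-oriented n-gram overlap (summarization)
METEOR: unigram matching that also credits stems and synonyms
BERTScore: embedding-based similarity
Perplexity: how confidently a language model predicts the text (lower is better)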
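
Of the metrics above, perplexity is the simplest to compute by hand: it is the exponential of the negative mean log-probability per token. The sketch below assumes you already have natural-log token probabilities from your model. The `perplexity` helper and the sample values are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# If every token is predicted with probability 0.25, perplexity is about 4:
# the model is as uncertain as a uniform choice among four options.
logprobs = [math.log(0.25)] * 4
print(perplexity(logprobs))
```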

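Classification:

Accuracy: percentage of correct predictions
Precision/Recall/F1: per-class performance
Confusion matrix: error patterns
AUC-ROC: ranking quality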
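
The classification metrics above can be computed directly from label lists. The following is a dependency-free sketch for a single positive class. The function names and sample labels are illustrative, and a library such as scikit-learn would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative labels only
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive="spam"))
```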

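Retrieval (RAG):

MRR: mean reciprocal rank
NDCG: normalized discounted cumulative gain
Precision@K: fraction of the top K results that are relevant
Recall@K: fraction of all relevant items found in the top K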
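
The retrieval metrics above can be sketched for a single query as follows. This version assumes binary relevance (a document is either relevant or not). The function names and document IDs are illustrative. MRR is the mean of the reciprocal rank over all queries.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """MRR over an iterable of (ranked, relevant) pairs."""
    queries = list(queries)
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Illustrative ranking: relevant documents sit at ranks 2 and 4
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, k=2))
print(recall_at_k(ranked, relevant, k=2))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, k=4))
```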

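2. Human evaluation

Manual assessment of quality aspects that are hard to automate.

Evaluation dimensions:

Accuracy: factual correctness
Coherence: logical flow
Relevance: answers the question
Fluency: natural language quality
Safety: no harmful content
Helpfulness: useful to the user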
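
Ratings collected along the dimensions above are usually aggregated per dimension before comparison. The sketch below assumes a 1 to 5 rating scale and a simple record layout. The scale, the `aggregate_ratings` helper, and the sample records are assumptions rather than part of the original skill.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(ratings):
    """Average each dimension's score across all annotators."""
    by_dimension = defaultdict(list)
    for record in ratings:
        for dimension, score in record["scores"].items():
            by_dimension[dimension].append(score)
    return {dim: mean(scores) for dim, scores in by_dimension.items()}


# Hypothetical ratings on an assumed 1-5 scale
ratings = [
    {"annotator": "a1", "scores": {"accuracy": 5, "fluency": 4}},
    {"annotator": "a2", "scores": {"accuracy": 3, "fluency": 4}},
]
print(aggregate_ratings(ratings))
```

A wide spread between annotators on one dimension (accuracy here) is often a sign that the rating guidelines for that dimension need tightening.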

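3. LLM as a judge

Use a stronger LLM to evaluate the outputs of a weaker model.

Approaches:

Pointwise scoring: score a single response
Pairwise comparison: compare two responses
Reference-based: compare against a gold standard
Reference-free: judge without a ground-truth answer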
Quick start

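# `llm_eval` is shown as an illustrative interface; adapt the names to the
# evaluation framework you use. `check_groundedness` and `your_model` are
# placeholders that you must supply.
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])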

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")

Automated metric implementations

BLEU score

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)

ROUGE score

from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

BERTScore

from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using pre-trained BERT."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }

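Custom metrics

def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    from transformers import pipeline

    # Loading the pipeline on every call is slow; cache it if calling repeatedly.
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    # Pass premise and hypothesis as a text pair so the tokenizer inserts the
    # separator tokens itself, rather than relying on a literal "[SEP]" string.
    result = nli([{"text": context, "text_pair": response}])[0]

    return result['score'] if result['label'].upper() == 'ENTAILMENT' else 0.0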

def calculate_toxicity(text):
    """Measure toxicity of generated text."""
    from transformers import pipeline

    toxicity = pipeline("text-classification", model="unitary/toxic-bert")
    result = toxicity(text)[0]

    return result['score'] if result['label'] == 'toxic' else 0.0

Compatible tools

Claude Code, Cursor
