Evaluation and Monitoring · WadeLy

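Why AI applications are hard to evaluate

Traditional software: passing unit tests = good to go. AI applications: there is no single correct output; several different reasonable answers to the same question can all be right.

But an AI application without evaluation is a black box. You need measurable metrics.

The two stages of evaluation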

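Development            Production
──────────             ──────────
Accuracy / F1          Response time
Human scoring          User feedback (thumbs up/down)
A/B comparison         Anomaly monitoring
Regression tests       Cost monitoring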

1. Offline evaluation (development)

Prepare a test set

test_cases = [
    {"input": "What is Python?", "expected": "...programming language..."},
    {"input": "How do I sort a list?", "expected": "...sort..."},
    # ... at least 50 cases in total
]

Accuracy / recall (structured tasks)

Classification and extraction tasks can use objective metrics:

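from sklearn.metrics import classification_report

# model is the classifier under test; for a classification task each c["expected"] is a class label
actual = [model.predict(c["input"]) for c in test_cases]
expected = [c["expected"] for c in test_cases]
# classification_report takes (y_true, y_pred) and reports per-class precision, recall and F1
print(classification_report(expected, actual))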
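To make the numbers in that report less opaque, here is a minimal stdlib-only sketch of how precision, recall and F1 are computed for one class. The function name `precision_recall_f1` and the spam/ham labels are made up for illustration; they are not part of scikit-learn.

```python
def precision_recall_f1(expected, actual, positive):
    """Compute precision, recall and F1 for a single class label."""
    pairs = list(zip(expected, actual))
    tp = sum(e == positive and a == positive for e, a in pairs)  # true positives
    fp = sum(e != positive and a == positive for e, a in pairs)  # false positives
    fn = sum(e == positive and a != positive for e, a in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for a tiny spam classifier
expected_labels = ["spam", "spam", "ham", "ham", "spam"]
actual_labels = ["spam", "ham", "ham", "spam", "spam"]

precision, recall, f1 = precision_recall_f1(expected_labels, actual_labels, "spam")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```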

LLM as Judge (open-ended generation)

Have another LLM act as the judge:

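import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question, answer, expected):
    prompt = f"""Rate the quality of this answer (1-5).
Question: {question}
Expected answer: {expected}
Actual answer: {answer}

Return only the number plus a one-sentence reason."""
    msg = client.messages.create(
        model="claude-opus-4-7",     # use a stronger model as the judge
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


# Each entry is the judge's raw text: a 1-5 score followed by a short reason
scores = [judge(c["input"], model_answer(c["input"]), c["expected"]) for c in test_cases]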

Using a stronger model to evaluate a weaker one is standard practice as of 2026.
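The judge returns free text, so the score has to be extracted before it can be averaged or tracked over time. A minimal sketch, assuming the reply contains the score as a standalone digit from 1 to 5; `parse_score` and the sample replies are hypothetical.

```python
import re
from statistics import mean

def parse_score(judge_text):
    """Return the first standalone digit 1-5 in the judge's reply, or None."""
    match = re.search(r"(?<!\d)([1-5])(?!\d)", judge_text)
    return int(match.group(1)) if match else None

# Hypothetical judge replies in the "number + one-sentence reason" shape
replies = ["4 Accurate but omits an example.", "Score: 2. Misses the point.", "n/a"]
parsed = [parse_score(r) for r in replies]
valid = [s for s in parsed if s is not None]

print(parsed)
print(f"mean score: {mean(valid):.2f} over {len(valid)} of {len(replies)} replies")
```

Replies that cannot be parsed are kept as None rather than counted as zero, so a formatting failure by the judge does not silently drag the average down.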

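Human spot checks

LLM judges are not a cure-all. Have a person review 20 sampled cases every week; don't let the model grade itself.

2. Online evaluation (production)

User feedback buttons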

<button onclick="rate('thumbs_up')">👍</button>
<button onclick="rate('thumbs_down')">👎</button>

Log each click and periodically compute thumbs_up / total.

Detailed logging

Record for every request:
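The thumbs_up / total ratio can be computed directly from the logged events. A minimal sketch, assuming each event is a dict with a `rating` field holding "thumbs_up", "thumbs_down", or None when the user never clicked; the function name is hypothetical.

```python
from collections import Counter

def thumbs_up_rate(events):
    """Share of thumbs_up among rated events; None if nothing was rated."""
    counts = Counter(
        e["rating"] for e in events
        if e.get("rating") in ("thumbs_up", "thumbs_down")
    )
    total = counts["thumbs_up"] + counts["thumbs_down"]
    return counts["thumbs_up"] / total if total else None

# Hypothetical events; the last one was never rated and is ignored
events = [
    {"rating": "thumbs_up"},
    {"rating": "thumbs_up"},
    {"rating": "thumbs_down"},
    {"rating": None},
]
rate = thumbs_up_rate(events)
print(f"thumbs-up rate: {rate:.1%}")
```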

log = {
    "timestamp": "2026-05-09T14:30:00",
    "user_id": "u123",
    "input": req.message,
    "output": response,
    "model": "claude-sonnet-4-6",
    "latency_ms": 850,
    "tokens_in": 200,
    "tokens_out": 350,
    "cost_usd": 0.012,
    "rating": None,                  # filled in later from user feedback
}

Store it in a logging system (ELK / Loki / your own database).

Monitoring metrics
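In practice the latency and cost fields are measured and derived rather than hard-coded. The sketch below builds one such record and serializes it as a single JSON line, a format most log pipelines accept. The per-token prices and the helper names `estimate_cost` and `build_log` are assumptions for illustration; use your provider's actual price list.

```python
import json
import time
from datetime import datetime, timezone

# Hypothetical prices in USD per million tokens; check your provider's price list
PRICE_IN_PER_MTOK = 3.0
PRICE_OUT_PER_MTOK = 15.0

def estimate_cost(tokens_in, tokens_out):
    """Cost in USD derived from token counts and the assumed prices above."""
    return (tokens_in * PRICE_IN_PER_MTOK + tokens_out * PRICE_OUT_PER_MTOK) / 1_000_000

def build_log(user_id, user_input, output, model, started, tokens_in, tokens_out):
    """Build one log record; `started` is a time.perf_counter() reading taken before the call."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "user_id": user_id,
        "input": user_input,
        "output": output,
        "model": model,
        "latency_ms": round((time.perf_counter() - started) * 1000),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": round(estimate_cost(tokens_in, tokens_out), 6),
        "rating": None,  # filled in later from user feedback
    }

started = time.perf_counter()
record = build_log("u123", "hi", "hello", "claude-sonnet-4-6", started, 200, 350)
line = json.dumps(record, ensure_ascii=False)  # one JSON object per log line
print(line)
```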

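Metric	Warning threshold
P95 response time	> 5s
Error rate	> 1%
Negative user feedback rate	> 10%
Monthly cost	> budget
Token usage growth	abnormal spike

Set up alerts in Grafana / Datadog that send an email / Slack message when a threshold is exceeded.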
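Before wiring up a dashboard, the first three thresholds in the table can be checked with a few lines over the raw logs. This is a sketch under assumptions: the boolean `error` field is not part of the log record shown earlier and is introduced here for illustration, and P95 uses the simple nearest-rank definition.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def check_thresholds(logs):
    """Return warning strings for latency, error rate and negative feedback rate."""
    warnings = []
    latency = p95([r["latency_ms"] for r in logs])
    if latency > 5000:
        warnings.append(f"P95 latency {latency} ms > 5000 ms")
    error_rate = sum(r["error"] for r in logs) / len(logs)
    if error_rate > 0.01:
        warnings.append(f"error rate {error_rate:.1%} > 1%")
    rated = [r for r in logs if r["rating"] is not None]
    if rated:
        negative = sum(r["rating"] == "thumbs_down" for r in rated) / len(rated)
        if negative > 0.10:
            warnings.append(f"negative feedback {negative:.1%} > 10%")
    return warnings

# Hypothetical sample: 18 healthy requests plus 2 slow ones, one of which failed
logs = [{"latency_ms": 800, "error": False, "rating": None} for _ in range(18)]
logs += [
    {"latency_ms": 6000, "error": True, "rating": "thumbs_down"},
    {"latency_ms": 6000, "error": False, "rating": "thumbs_up"},
]
alerts = check_thresholds(logs)
for alert in alerts:
    print("WARNING:", alert)
```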

3. Preventing degradation: regression tests

Every time you change the prompt or switch models, run the test set first:

def regression_test(model_or_prompt):
    pass_count = 0
    for c in test_cases:
        result = run(model_or_prompt, c["input"])
        if check(result, c["expected"]):
            pass_count += 1
    return pass_count / len(test_cases)


print("old version:", regression_test("v1"))     # e.g. 0.92
print("new version:", regression_test("v2"))     # e.g. 0.85, a regression!

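If it regressed, don't ship it; find the cause first.

4. A/B testing

Don't switch directly to a new prompt or model. Run both versions side by side and bucket by user: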
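The regression test relies on a `check` function and on a human deciding what counts as a regression. One possible sketch of both follows. It assumes `expected` holds keywords separated by "..." as in the test set earlier, and it introduces a hypothetical `gate` helper with a 1-point tolerance; neither is the only reasonable choice, and keyword matching is a crude stand-in for an LLM judge.

```python
def check(result, expected):
    """Pass if every keyword in `expected` appears in the result (case-insensitive)."""
    keywords = [k.strip() for k in expected.split("...") if k.strip()]
    return all(k.lower() in result.lower() for k in keywords)

def gate(old_rate, new_rate, tolerance=0.01):
    """Allow a release only if the pass rate dropped by no more than `tolerance`."""
    return new_rate >= old_rate - tolerance

print(check("Use list.sort() or sorted()", "...sort..."))
print(gate(0.92, 0.85))
```

A small tolerance avoids blocking a release over noise from one or two flaky cases; set it to 0 for a strict gate.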

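import hashlib

def get_model(user_id):
    # Built-in hash() is salted per process for strings, so a user could switch
    # buckets after a restart; a digest keeps the assignment stable.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < 10:    # 10% of users
        return "v2"
    return "v1"

# After running for a week, compare the metrics of the two groups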

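An A/B test on real traffic is the only credible evidence that "v2 is better than v1"; don't decide on gut feeling.

5. Cost monitoring

The OpenAI and Claude APIs both bill per token. Each month: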
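Comparing the two groups means asking whether the observed difference could be noise. One common approach for a rate such as thumbs-up share is a two-proportion z-test, sketched below with the standard library only. The counts are made up for illustration, and the function name is hypothetical.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: v1 got 7200 thumbs-up from 9000 users, v2 got 830 from 1000
z, p_value = two_proportion_z(7200, 9000, 830, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the gap is unlikely to be chance alone. With only 10% of traffic in v2, the smaller group dominates the standard error, so a longer run may be needed for small effects.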

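from collections import defaultdict

total_cost = sum(log["cost_usd"] for log in this_month_logs)
print(f"This month: ${total_cost:.2f}")

# Rank users by spend
by_user = defaultdict(float)
for log in this_month_logs:
    by_user[log["user_id"]] += log["cost_usd"]
top_users = sorted(by_user.items(), key=lambda x: -x[1])[:10]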

If a single user's spend looks abnormal, add rate limiting.

6. Error handling

LLM APIs are unreliable. Always add retries and a fallback:
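Rate limiting can take many forms. One simple option is a per-user sliding window, sketched here in memory. The class name and the limits are hypothetical, and a multi-process deployment would need shared storage such as Redis instead of a local dict.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per user within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
# Explicit timestamps make the behavior easy to follow
results = [limiter.allow("u123", now=t) for t in (0, 1, 2, 61)]
print(results)
```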

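import logging

from tenacity import retry, stop_after_attempt, wait_exponential

log = logging.getLogger(__name__)

# Up to 3 attempts with exponential backoff between 1 and 10 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def safe_call(req):
    try:
        # client must be an async client for await to work
        return await client.messages.create(...)
    except Exception as e:
        log.error("call failed: %s", e)
        raise  # re-raise so tenacity can retry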
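The snippet above covers the retry half. For the fallback half, here is a stdlib-only sketch of the same retry-then-fallback idea, written without tenacity so it runs anywhere. `call_with_fallback`, `FALLBACK_TEXT` and the simulated failure are all hypothetical names for illustration.

```python
import asyncio

# Hypothetical message shown to the user when every attempt fails
FALLBACK_TEXT = "Sorry, the service is busy. Please try again shortly."

async def call_with_fallback(call, attempts=3, base_delay=1.0):
    """Await call() up to `attempts` times with exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return FALLBACK_TEXT

async def always_fails():
    # Stand-in for an LLM API call during an outage
    raise ConnectionError("simulated outage")

result = asyncio.run(call_with_fallback(always_fails, base_delay=0.01))
print(result)
```

Returning a canned message is the simplest fallback; switching to a cheaper backup model or a cached answer follows the same shape.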

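Day-to-day operations of a complete AI application

Daily      Check the monitoring dashboard (error rate / latency / cost)
Weekly     Run regression tests + spot-check 50 user feedback entries
Monthly    Review total cost / summarize A/B experiments / decide whether to switch models
Quarterly  Regenerate the test set (user questions change over time)

Wrapping up

That is all 30 parts of the "Python AI Tutorial": from NumPy to LLM APIs to RAG to Agents to deployment to evaluation.

Congratulations on making it this far. You can now build an end-to-end AI application on your own.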