Model Evaluation

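Tip

There is no silver bullet for LLM evaluation, but the method you choose determines what you are able to see. This article breaks down the mainstream evaluation paradigms and explains the logic and limitations behind each.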

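Understanding the Four Main Approaches to LLM Evaluation

How do you evaluate a large language model (LLM) rigorously in practice? The question looks simple but runs deep. Whether you are selecting a model, interpreting results, or measuring progress on a fine-tuned or in-house model, the choice of evaluation method matters.

This section covers four common ways to evaluate LLMs: multiple choice, verifiers, leaderboards, and LLM judges. Understanding how each one works helps you read benchmarks, leaderboards, and the numbers reported in papers more critically.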

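Overview of Evaluation Methods

Mainstream LLM evaluation methods fall into two broad categories: benchmark-based evaluation (multiple choice and verifiers) and judgment-based evaluation (leaderboards and LLM judges). The four common methods are:

  • Multiple choice
  • Verifiers
  • Leaderboards
  • LLM judges

The figure below shows how the four methods relate to one another and which category each belongs to.

[Figure or embedded content omitted]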

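Method 1: Multiple-Choice Accuracy

Multiple-choice benchmarks are the most common benchmark-based method, and they mainly test a model's ability to recall knowledge. MMLU (Massive Multitask Language Understanding) is a typical example: it covers 57 subjects with roughly 16,000 multiple-choice questions, and the metric is accuracy.

The figure below shows the basic flow of a multiple-choice evaluation.

[Figure or embedded content omitted]

Figure 2: Example of an MMLU multiple-choice evaluation

Code example: loading the model for evaluation

The following code loads the Qwen3 0.6B model in preparation for a multiple-choice evaluation.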

from pathlib import Path
import torch
from reasoning_from_scratch.ch02 import get_device
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small, Qwen3Tokenizer,
    Qwen3Model, QWEN_CONFIG_06_B
)

device = get_device()

# Set matmul precision to "high" to
# enable Tensor Cores on compatible GPUs
torch.set_float32_matmul_precision("high")

# Uncomment the following line
# if you encounter device compatibility issues
# device = "cpu"

# Use the base model by default
WHICH_MODEL = "base"

if WHICH_MODEL == "base":
    download_qwen3_small(
        kind="base", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

elif WHICH_MODEL == "reasoning":
    download_qwen3_small(
        kind="reasoning", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    model_path = Path("qwen3") / "qwen3-0.6B-reasoning.pth"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

else:
    raise ValueError(f"Invalid choice: WHICH_MODEL={WHICH_MODEL}")

model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))
model.to(device)

# Optionally enable model compilation for potential performance gains
USE_COMPILE = False
if USE_COMPILE:
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model = torch.compile(model)

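Evaluating the loaded model on a single multiple-choice question (the scoring code is not shown here) produces output like this: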

Generated letter: C
Correct? False

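Tip

Multiple-choice evaluation is simple and intuitive, and it suits fast, large-scale comparisons. However, it measures only knowledge recall and says little about reasoning or real-world performance.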
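The scoring step described above can be sketched as follows. This is a minimal illustration, not the article's own evaluation code: the helper names `extract_choice` and `accuracy` are hypothetical, and the sketch assumes the model's answer appears as a standalone uppercase letter from A to D somewhere in its output.

```python
import re


def extract_choice(generated_text, valid_letters="ABCD"):
    """Return the first standalone answer letter in the text, or None.

    Assumption: the answer is an uppercase letter surrounded by word
    boundaries, e.g. "C", "B." or "Answer: D".
    """
    match = re.search(rf"\b([{valid_letters}])\b", generated_text)
    return match.group(1) if match else None


def accuracy(predictions, gold_answers):
    """Fraction of predictions that exactly match the gold letters."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers differ in length")
    if not gold_answers:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)


# Hypothetical raw model outputs and gold labels for three questions
outputs = ["C", "Answer: B", "I am not sure"]
gold = ["A", "B", "D"]

predictions = [extract_choice(text) for text in outputs]
score = accuracy(predictions, gold)
print(predictions, round(score, 3))
```

An output with no recognizable letter is counted as wrong rather than skipped, so a model cannot raise its accuracy by declining to answer.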


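Method 2: Automatic Grading with Verifiers

With a verifier, the model generates a free-form answer, and an external tool (such as a code interpreter or a calculator) then checks automatically whether the answer is correct. This suits domains where answers can be verified automatically, such as math and code.

The figure below illustrates the verifier grading flow.

[Figure or embedded content omitted]

Figure 3: Automatic grading with a verifier

Because problems can be generated automatically in large numbers, this method works well for evaluating reasoning. It applies only to automatically verifiable domains, however, and depends on the accuracy of the external tool.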
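To make the idea concrete, here is a minimal sketch of a verifier for math answers. It is an illustration under stated assumptions, not the article's implementation: the function names `extract_final_number` and `verify` are hypothetical, and the sketch assumes the final answer is the last number in the model's output. Exact comparison via `fractions.Fraction` treats "0.75" and "3/4" as equal.

```python
import re
from fractions import Fraction

# Matches integer fractions such as "3/4" first, then plain numbers
# such as "42" or "-0.75".
NUMBER_PATTERN = re.compile(r"-?\d+/\d+|-?\d+(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in the text as a Fraction, or None."""
    # Drop thousands separators such as the comma in "1,000"
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    matches = NUMBER_PATTERN.findall(cleaned)
    if not matches:
        return None
    try:
        return Fraction(matches[-1])
    except (ValueError, ZeroDivisionError):
        return None


def verify(model_output, reference_answer):
    """True if the model's final number equals the reference exactly."""
    predicted = extract_final_number(model_output)
    expected = extract_final_number(reference_answer)
    return predicted is not None and predicted == expected


print(verify("The answer is 0.75", "3/4"))
print(verify("I think it's 41", "42"))
```

The answer-extraction heuristic is the fragile part: a model that states the correct number and then keeps talking about other numbers would be graded as wrong, which is one way a verifier's own accuracy limits the evaluation.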

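Method 3: Leaderboards and Preference Voting

Leaderboard methods rank models by aggregating preference votes, cast by users or by LLMs, over model outputs. LM Arena is a typical example: users compare the outputs of two models and vote for the better one, and the votes are combined into a ranking.

The figure below illustrates the leaderboard voting flow.

[Figure or embedded content omitted]

Code example: an Elo rating implementation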

def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):
    # Initialize all models with the same base rating
    ratings = {
        model: initial_rating
        for pair in vote_pairs
        for model in pair
    }

    # Update ratings after each match
    for winner, loser in vote_pairs:

        # Expected score for the current winner
        expected_winner = 1.0 / (
            1.0 + 10 ** (
                (ratings[loser] - ratings[winner])
                / 400.0
            )
        )

        # k_factor determines sensitivity of updates
        ratings[winner] = (
            ratings[winner]
            + k_factor * (1 - expected_winner)
        )
        ratings[loser] = (
            ratings[loser]
            + k_factor * (0 - (1 - expected_winner))
        )

    return ratings

Example output:

GPT-5 : 1043.7
Claude-3 : 1015.2
Llama-4 : 1000.7
Llama-3 : 940.4

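Tip

Leaderboards reflect how well a model is received in real-world use, but they are strongly influenced by who is voting and what those voters prefer, and they are a poor measure of answer correctness.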
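A single Elo update can be worked through by hand to see how the expected score drives the size of the rating change. The sketch below uses the standard Elo expected-score formula with the same constants as the `elo_ratings` function (base 10, scale 400, `k_factor=32`); the helper names `expected_score` and `elo_update` are hypothetical and introduced only for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability, under the Elo model, that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(winner_rating, loser_rating, k_factor=32):
    """Return the (winner, loser) ratings after one match."""
    expected_winner = expected_score(winner_rating, loser_rating)
    # The winner gains exactly what the loser gives up
    delta = k_factor * (1 - expected_winner)
    return winner_rating + delta, loser_rating - delta


# Equal ratings: the expected score is 0.5, so 16 points change hands
print(elo_update(1000, 1000))

# A favorite rated 400 points higher wins: a small change
print(elo_update(1400, 1000))

# The underdog wins instead: a much larger change
print(elo_update(1000, 1400))
```

Because each update depends on the ratings at the time of the vote, sequential Elo ratings depend on the order in which votes are processed, so the same set of votes in a different order can produce different final numbers.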


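Method 4: LLM Judges (AI Graders)

The LLM-judge method uses a strong LLM (such as GPT-5) as a grader that scores model outputs automatically against a rubric, which makes it both scalable and consistent.

The figure below illustrates the LLM-judge flow.

[Figure or embedded content omitted]

Figure 5: The LLM-judge flow

Code example: automatic scoring with the Ollama API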

import json
import urllib.request

def query_model(
    prompt,
    model="gpt-oss:20b",
    # If you used
    # OLLAMA_HOST=127.0.0.1:11435 ollama serve
    # update the address below
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary:
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        # Settings required for deterministic responses:
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Convert the dictionary to JSON and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a POST request and add headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    response_data = ""

    # Send the request and capture the streaming response
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            # Parse each line into JSON
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data

def rubric_prompt(instruction, reference_answer, model_answer):
    rubric = (
        "You are a fair judge assistant. You will be "
        "given an instruction, a reference answer, and "
        "a candidate answer to evaluate, according to "
        "the following rubric:\n\n"
        "1: The response fails to address the "
        "instruction, providing irrelevant, incorrect, "
        "or excessively verbose content.\n"
        "2: The response partially addresses the "
        "instruction but contains major errors, "
        "omissions, or irrelevant details.\n"
        "3: The response addresses the instruction to "
        "some degree but is incomplete, partially "
        "correct, or unclear in places.\n"
        "4: The response mostly adheres to the "
        "instruction, with only minor errors, "
        "omissions, or lack of clarity.\n"
        "5: The response fully adheres to the "
        "instruction, providing a clear, accurate, and "
        "relevant answer in a concise and efficient "
        "manner.\n\n"
        "Now here is the instruction, the reference "
        "answer, and the response.\n"
    )

    prompt = (
        f"{rubric}\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference Answer:\n{reference_answer}\n\n"
        f"Answer:\n{model_answer}\n\n"
        f"Evaluation: "
    )
    return prompt

Example scoring result:

Score: 5

The candidate answer directly addresses the question, correctly applies the given premises, and concisely states that a penguin would be able to fly. It is accurate, relevant, and clear.

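Tip

LLM judges suit large-scale automated evaluation and are both flexible and consistent. The results depend on the judge model and the rubric, though, so some subjectivity remains.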
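To aggregate judge results over many examples, the numeric score has to be pulled out of the judge's free-text reply. The following is a minimal sketch that assumes the reply contains a line such as "Score: 5", as in the example result above; the function name `parse_score` is hypothetical, and a real judge may format its reply differently.

```python
import re


def parse_score(judge_response, min_score=1, max_score=5):
    """Extract an integer rubric score from the judge's reply.

    Returns None if no score is found or if it falls outside the
    rubric range, so that malformed replies can be counted separately.
    """
    match = re.search(
        r"score\s*[:=]?\s*(\d+)", judge_response, flags=re.IGNORECASE
    )
    if match is None:
        return None
    score = int(match.group(1))
    return score if min_score <= score <= max_score else None


# Hypothetical judge replies
replies = [
    "Score: 5\n\nThe candidate answer directly addresses the question.",
    "score = 3. The answer is incomplete.",
    "I would rate this answer highly.",
]
scores = [parse_score(reply) for reply in replies]
valid = [s for s in scores if s is not None]
print(scores, sum(valid) / len(valid))
```

Tracking how many replies fail to parse is worthwhile in its own right, since a high failure rate suggests the judge is not following the rubric prompt.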

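Comparison and Recommendations

Each evaluation method has its own strengths and weaknesses, so in practice you should combine several of them to get a rounded view of a model's capabilities. The table below summarizes the main characteristics of the four methods.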


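Method | Strengths | Weaknesses
------ | --------- | ----------
Multiple choice | Fast, standardized, reproducible | Tests only knowledge recall; does not reflect real-world capability
Verifiers | Automated; can assess reasoning; supports free-form generation | Limited to verifiable domains; depends on external tools
Leaderboards | Reflect real user preferences; cover style and safety | Affected by the voter population; hard to measure correctness
LLM judges | Scalable; highly consistent; supports many tasks | Depends on the judge model and rubric; involves subjectivity

Table 1: Comparison of mainstream LLM evaluation methods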

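Summary

This article surveyed the four main approaches to LLM evaluation, with code examples implemented from scratch. Each method has its own use cases and limitations, so in practice it is best to combine several of them and to tailor the evaluation data and process to your business goals. Doing so gives a more complete and objective picture of a model's real capabilities and of where it can improve.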