Evaluation quickstart

Evaluations are a quantitative way to measure the performance of LLM applications. This matters because LLMs don't always behave predictably: small changes in the prompt, model, or inputs can significantly affect results. Evaluations give you a structured way to identify failures, compare changes across different versions of your application, and build more reliable AI applications.

An evaluation consists of three components:

  1. A dataset with test inputs and expected outputs.
  2. A target function that defines what you're evaluating.
  3. Evaluators that score the outputs of your target function.

This quickstart walks you through running an evaluation with the LangSmith SDK and visualizing the results in LangSmith.

1. Install dependencies

pip install -U langsmith openai pydantic

2. Create an API key

To create an API key, head to the Settings page, then click Create API Key.

3. Set up your environment

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langchain-api-key>"

# This example uses OpenAI, but any LLM provider can be used
export OPENAI_API_KEY="<your-openai-api-key>"
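
If you'd rather configure these values from within Python (for example, in a notebook), here is a minimal sketch using os.environ; the placeholders are the same as above:

import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-langchain-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"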

4. Import dependencies

from langsmith import wrappers, Client
from pydantic import BaseModel, Field
from openai import OpenAI

client = Client()
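# wrap_openai instruments the OpenAI client so its calls are traced in LangSmith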
openai_client = wrappers.wrap_openai(OpenAI())

5. Create a dataset

# For other dataset creation methods, see:
# https://langsmith.langchain.ac.cn/evaluation/how_to_guides/manage_datasets_programmatically
# https://langsmith.langchain.ac.cn/evaluation/how_to_guides/manage_datasets_in_application


# Programmatically create a dataset in LangSmith
dataset = client.create_dataset(
    dataset_name="Sample dataset", description="A sample dataset in LangSmith."
)

# Create examples
examples = [
    {
        "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
        "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
    },
    {
        "inputs": {"question": "What is Earth's lowest point?"},
        "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
    },
]

# Add examples to the dataset
client.create_examples(dataset_id=dataset.id, examples=examples)
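
As a quick sanity check, you can read the examples back out of LangSmith. This sketch assumes the Client's list_examples method, which accepts a dataset_id filter:

# Confirm the examples were uploaded
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs, example.outputs)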

6. Define what you're evaluating

# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"response": response.choices[0].message.content.strip()}

7. Define an evaluator

# Define instructions for the LLM judge evaluator
instructions = """Evaluate Student Answer against Ground Truth for conceptual similarity and classify true or false:
- False: No conceptual match and similarity
- True: Most or full conceptual match and similarity
- Key criteria: Concept should match, not exact wording.
"""

# Define output schema for the LLM judge
class Grade(BaseModel):
    score: bool = Field(
        description="Boolean that indicates whether the response is accurate relative to the reference answer"
    )

# Define LLM judge that grades the accuracy of the response relative to reference output
def accuracy(outputs: dict, reference_outputs: dict) -> bool:
    response = openai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instructions},
            {
                "role": "user",
                "content": f"""Ground Truth answer: {reference_outputs["answer"]};
Student's Answer: {outputs["response"]}""",
            },
        ],
        response_format=Grade,
    )
    return response.choices[0].message.parsed.score
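
You can exercise the evaluator the same way, passing hand-built dicts that mirror what the evaluation harness will supply; with a conceptually matching answer the judge should return True:

# Manually invoke the LLM judge on a sample pair of outputs
print(accuracy(
    outputs={"response": "Tanzania."},
    reference_outputs={"answer": "Mount Kilimanjaro is located in Tanzania."},
))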

8. Run and view results

# After running the evaluation, a link will be provided to view the results in LangSmith
experiment_results = client.evaluate(
    target,
    data="Sample dataset",
    evaluators=[
        accuracy,
        # You can add multiple evaluators here
    ],
    experiment_prefix="first-eval-in-langsmith",
    max_concurrency=2,
)
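
If you also want to inspect the results locally, the returned results object can be converted to a DataFrame; this sketch assumes the to_pandas() helper and that pandas is installed:

# Convert the experiment results to a pandas DataFrame for local inspection
df = experiment_results.to_pandas()
print(df.head())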

Click the link printed by the evaluation run to open the LangSmith Experiments UI and explore the results of your evaluation.

Next steps

For conceptual explanations, see the Concepts guide. For answers to "How do I…?"-style questions, see the How-to guides. For end-to-end walkthroughs, see the Tutorials. For comprehensive descriptions of every class and function, see the API reference.

If you prefer video tutorials, check out the "Datasets, Evaluators, and Experiments" video from the Introduction to LangSmith course.

