Evaluation quick start
Evaluations are a quantitative way to measure the performance of LLM applications. This matters because LLMs do not always behave predictably: small changes in the prompt, model, or inputs can significantly affect results. Evaluations give you a structured way to identify failures, compare changes across different versions of your application, and build more reliable AI applications.
An evaluation has three components: a dataset of test inputs and (optionally) reference outputs, a target function that wraps the application logic you want to evaluate, and one or more evaluators that score the target function's outputs.
This quick start walks you through running an evaluation with the LangSmith SDK and visualizing the results in LangSmith.
1. Install dependencies
- Python
- TypeScript
pip install -U langsmith openai pydantic
yarn add langsmith openai zod
2. Create an API key
To create an API key, head to the Settings page, then click Create API Key.
3. Set up your environment
- Shell
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
# This example uses OpenAI, but you can use any LLM provider of your choice
export OPENAI_API_KEY="<your-openai-api-key>"
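If you are working in a notebook or prefer to configure things in code, you can set the same variables from Python instead. This is a minimal sketch using the standard os module; replace the placeholders with your own keys.
import os

# Equivalent to the shell exports above
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"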
4. Import dependencies
- Python
- TypeScript
from langsmith import wrappers, Client
from pydantic import BaseModel, Field
from openai import OpenAI
client = Client()
openai_client = wrappers.wrap_openai(OpenAI())
import { Client } from "langsmith";
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
import type { EvaluationResult } from "langsmith/evaluation";
import { evaluate } from "langsmith/evaluation";
const client = new Client();
const openai = new OpenAI();
5. Create a dataset
- Python
- TypeScript
# For other dataset creation methods, see:
# https://langsmith.langchain.ac.cn/evaluation/how_to_guides/manage_datasets_programmatically
# https://langsmith.langchain.ac.cn/evaluation/how_to_guides/manage_datasets_in_application
# Programmatically create a dataset in LangSmith
dataset = client.create_dataset(
dataset_name="Sample dataset", description="A sample dataset in LangSmith."
)
# Create examples
examples = [
{
"inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
"outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
},
{
"inputs": {"question": "What is Earth's lowest point?"},
"outputs": {"answer": "Earth's lowest point is The Dead Sea."},
},
]
# Add examples to the dataset
client.create_examples(dataset_id=dataset.id, examples=examples)
// For other dataset creation methods, see:
// https://langsmith.langchain.ac.cn/evaluation/how_to_guides/manage_datasets_programmatically
// https://langsmith.langchain.ac.cn/evaluation/how_to_guides/manage_datasets_in_application
// Create inputs and reference outputs
const examples: [string, string][] = [
[
"Which country is Mount Kilimanjaro located in?",
"Mount Kilimanjaro is located in Tanzania.",
],
[
"What is Earth's lowest point?",
"Earth's lowest point is The Dead Sea.",
],
];
const inputs = examples.map(([inputPrompt]) => ({
question: inputPrompt,
}));
const outputs = examples.map(([, outputAnswer]) => ({
answer: outputAnswer,
}));
// Programmatically create a dataset in LangSmith
const dataset = await client.createDataset("Sample dataset", {
description: "A sample dataset in LangSmith.",
});
// Add examples to the dataset
await client.createExamples({
inputs,
outputs,
datasetId: dataset.id,
});
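Optionally, you can verify that the examples landed in the dataset before moving on. A quick check in Python, using the SDK's list_examples method and the dataset created above:
# Optional: confirm the examples were uploaded
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs, example.outputs)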
6. Define what you're evaluating
- Python
- TypeScript
# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Answer the following question accurately"},
{"role": "user", "content": inputs["question"]},
],
)
return { "response": response.choices[0].message.content.strip() }
// Define the application logic you want to evaluate inside a target function
// The SDK will automatically send the inputs from the dataset to your target function
async function target(inputs: string): Promise<{ response: string }> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: "Answer the following question accurately" },
{ role: "user", content: inputs },
],
});
return { response: response.choices[0].message.content?.trim() || "" };
}
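Before running a full evaluation, it can help to call the target function once by hand and confirm it returns the expected shape. A quick check in Python:
# Optional: smoke-test the target function on a single dataset-style input
print(target({"question": "Which country is Mount Kilimanjaro located in?"}))
# Expected shape: {"response": "..."}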
7. Define an evaluator
- Python
- TypeScript
# Define instructions for the LLM judge evaluator
instructions = """Evaluate Student Answer against Ground Truth for conceptual similarity and classify true or false:
- False: No conceptual match and similarity
- True: Most or full conceptual match and similarity
- Key criteria: Concept should match, not exact wording.
"""
# Define output schema for the LLM judge
class Grade(BaseModel):
score: bool = Field(
description="Boolean that indicates whether the response is accurate relative to the reference answer"
)
# Define LLM judge that grades the accuracy of the response relative to reference output
def accuracy(outputs: dict, reference_outputs: dict) -> bool:
response = openai_client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": instructions},
{
"role": "user",
"content": f"""Ground Truth answer: {reference_outputs["answer"]};
Student's Answer: {outputs["response"]}"""
},
],
response_format=Grade,
)
return response.choices[0].message.parsed.score
// Define instructions for the LLM judge evaluator
const instructions = `Evaluate Student Answer against Ground Truth for conceptual similarity and classify true or false:
- False: No conceptual match and similarity
- True: Most or full conceptual match and similarity
- Key criteria: Concept should match, not exact wording.
`;
// Define context for the LLM judge evaluator
const context = `Ground Truth answer: {reference}; Student's Answer: {prediction}`;
// Define output schema for the LLM judge
const ResponseSchema = z.object({
score: z
.boolean()
.describe(
"Boolean that indicates whether the response is accurate relative to the reference answer"
),
});
// Define LLM judge that grades the accuracy of the response relative to reference output
async function accuracy({
outputs,
referenceOutputs,
}: {
outputs?: Record<string, string>;
referenceOutputs?: Record<string, string>;
}): Promise<EvaluationResult> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: instructions },
      { role: "user", content: context.replace("{prediction}", outputs?.response || "").replace("{reference}", referenceOutputs?.answer || "") }
],
response_format: zodResponseFormat(ResponseSchema, "response")
});
return {
key: "accuracy",
score: ResponseSchema.parse(JSON.parse(response.choices[0].message.content || "")).score,
};
}
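Evaluators do not have to call an LLM: any function with the signature shown above that returns a score works. As a hypothetical example, here is a simple Python heuristic that checks the response stays roughly as concise as the reference answer; you could pass it in the evaluators list in the next step alongside accuracy.
# A hypothetical heuristic evaluator (same signature as accuracy above):
# passes if the response is at most twice as long as the reference answer
def concision(outputs: dict, reference_outputs: dict) -> bool:
    return len(outputs["response"]) <= 2 * len(reference_outputs["answer"])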
8. Run and view results
- Python
- TypeScript
# After running the evaluation, a link will be provided to view the results in langsmith
experiment_results = client.evaluate(
target,
data="Sample dataset",
evaluators=[
accuracy,
# can add multiple evaluators here
],
experiment_prefix="first-eval-in-langsmith",
max_concurrency=2,
)
// After running the evaluation, a link will be provided to view the results in langsmith
await evaluate(
(exampleInput) => {
return target(exampleInput.question);
},
{
data: "Sample dataset",
evaluators: [
accuracy,
// can add multiple evaluators here
],
experimentPrefix: "first-eval-in-langsmith",
maxConcurrency: 2,
}
);
Click the link printed by the evaluation run to open the LangSmith Experiments UI and explore your evaluation results.
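You can also inspect the results locally from Python. The sketch below assumes pandas is installed and that your SDK version supports converting experiment results to a DataFrame via to_pandas.
# Optional: load the experiment results into a pandas DataFrame for local inspection
df = experiment_results.to_pandas()
print(df.head())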
Next steps
For conceptual explanations, see the Concepts guide. For answers to "How do I...?" questions, see the How-to guides. For end-to-end walkthroughs, see the Tutorials. For comprehensive descriptions of every class and function, see the API reference.
If you prefer video tutorials, check out the "Datasets, Evaluators, and Experiments" videos from the Introduction to LangSmith course.