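How to run an evaluation

Key concepts

Evaluations | Evaluators | Datasets

In this guide we'll go over how to evaluate an application using the evaluate() method in the LangSmith SDK.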

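Running large jobs

For larger evaluation jobs in Python, we recommend using aevaluate(), the asynchronous version of evaluate(). Because the two share the same interface, read this guide first and then the how-to guide on running an evaluation asynchronously.

In JS/TS, evaluate() is already asynchronous, so no separate method is needed.

When running large jobs, it is also important to configure the max_concurrency/maxConcurrency argument. It parallelizes the evaluation by effectively splitting the dataset across threads.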
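
As a rough illustration of what a concurrency cap does, the sketch below runs a stand-in target over a small list of examples with a thread pool. It uses only the Python standard library. `fake_target`, `run_concurrently`, and the example data are hypothetical and are not part of the LangSmith SDK, whose internals may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_target(inputs: dict) -> dict:
    # Hypothetical stand-in for an application call (no model is invoked).
    return {"length": len(inputs["text"])}


def run_concurrently(target, examples: list, max_concurrency: int) -> list:
    # At most `max_concurrency` examples are processed at the same time.
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(target, examples))


examples = [{"text": "hi"}, {"text": "hello"}, {"text": "hey there"}]
outputs = run_concurrently(fake_target, examples, max_concurrency=2)
print(outputs)
```

A higher cap finishes sooner when each call mostly waits on the network, but it also sends more simultaneous requests to the model provider, so rate limits are a reasonable thing to check before raising it.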

Define an application

First we need an application to evaluate. For this example, let's create a simple toxicity classifier.

Python
TypeScript

from langsmith import traceable, wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
def toxicity_classifier(inputs: dict) -> dict:
    instructions = (
      "Please review the user query below and determine if it contains any form of toxic behavior, "
      "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
      "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return {"class": result.choices[0].message.content}

import { OpenAI } from "openai";
import { wrapOpenAI } from "langsmith/wrappers";
import { traceable } from "langsmith/traceable";

// Optionally wrap the OpenAI client to trace all model calls.
const oaiClient = wrapOpenAI(new OpenAI());

// Optionally add the 'traceable' wrapper to trace the inputs/outputs of this function.
const toxicityClassifier = traceable(
  async (text: string) => {
    const result = await oaiClient.chat.completions.create({
      messages: [
        { 
          role: "system",
          content: "Please review the user query below and determine if it contains any form of toxic behavior, such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't.",
        },
        { role: "user", content: text },
      ],
      model: "gpt-4o-mini",
      temperature: 0,
    });
    
    return result.choices[0].message.content;
  },
  { name: "toxicityClassifier" }
);

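We've optionally enabled tracing to capture the inputs and outputs of each step in the pipeline. To learn how to annotate your code for tracing, see this guide.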

Create or select a dataset

We need a dataset to evaluate our application on. Our dataset will contain labeled examples of toxic and non-toxic text.

Python
TypeScript

Requires langsmith>=0.3.13

from langsmith import Client

ls_client = Client()

examples = [
  {
    "inputs": {"text": "Shut up, idiot"}, 
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "You're a wonderful person"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "This is the worst thing ever"}, 
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "I had a great day today"}, 
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "Nobody likes you"}, 
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "This is unacceptable. I want to speak to the manager."},
    "outputs": {"label": "Not toxic"},
  },
]

dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
ls_client.create_examples(
  dataset_id=dataset.id, 
  examples=examples,
)

import { Client } from "langsmith";

const langsmith = new Client();

// create a dataset
const labeledTexts = [
  ["Shut up, idiot", "Toxic"],
  ["You're a wonderful person", "Not toxic"],
  ["This is the worst thing ever", "Toxic"],
  ["I had a great day today", "Not toxic"],
  ["Nobody likes you", "Toxic"],
  ["This is unacceptable. I want to speak to the manager.", "Not toxic"],
];

const [inputs, outputs] = labeledTexts.reduce<
  [Array<{ input: string }>, Array<{ outputs: string }>]
>(
  ([inputs, outputs], item) => [
    [...inputs, { input: item[0] }],
    [...outputs, { outputs: item[1] }],
  ],
  [[], []]
);

const datasetName = "Toxic Queries";
const toxicDataset = await langsmith.createDataset(datasetName);
await langsmith.createExamples({ inputs, outputs, datasetId: toxicDataset.id });

For more information on dataset management, see here.
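
Before uploading, it can help to check that every example has the shape used in the Python snippet above: an `inputs` dict with a `text` key and an `outputs` dict with a `label` key. The helper below is a hypothetical pre-upload check written for this guide's data layout. It is not a LangSmith SDK function, and the allowed label set is an assumption taken from the examples.

```python
from collections import Counter

# Assumed label vocabulary, taken from the examples in this guide.
ALLOWED_LABELS = {"Toxic", "Not toxic"}


def validate_examples(examples: list) -> Counter:
    """Check each example's shape and return a count of examples per label."""
    counts = Counter()
    for i, example in enumerate(examples):
        text = example.get("inputs", {}).get("text")
        label = example.get("outputs", {}).get("label")
        if not isinstance(text, str) or not text:
            raise ValueError(f"example {i}: inputs['text'] must be a non-empty string")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"example {i}: unexpected label {label!r}")
        counts[label] += 1
    return counts


sample = [
    {"inputs": {"text": "Shut up, idiot"}, "outputs": {"label": "Toxic"}},
    {"inputs": {"text": "You're a wonderful person"}, "outputs": {"label": "Not toxic"}},
]
label_counts = validate_examples(sample)
print(label_counts)
```

The returned counts also give a quick view of class balance, which matters when a single accuracy number is used to summarize the experiment.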

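Define an evaluator

Tip

You can also check out LangChain's open source evaluation package openevals for common pre-built evaluators.

Evaluators are functions for scoring your application's outputs. They take in the example inputs, the actual outputs, and, when present, the reference outputs. Since we have labels for this task, our evaluator can directly check whether the actual outputs match the reference outputs.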

Python
TypeScript

Requires langsmith>=0.3.13

def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]

Requires langsmith>=0.2.9

import type { EvaluationResult } from "langsmith/evaluation";

function correct({
  outputs,
  referenceOutputs,
}: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): EvaluationResult {
  const score = outputs.output === referenceOutputs?.outputs;
  return { key: "correct", score };
}

For more information on how to define evaluators, see here.

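Run the evaluation

We'll use the evaluate() / aevaluate() methods to run the evaluation.

The key arguments are:

a target function that takes an input dictionary and returns an output dictionary. The example.inputs field of each Example is what gets passed to the target function. In this case our toxicity_classifier is already set up to take in example inputs, so we can use it directly.
data - the name or UUID of the LangSmith dataset to evaluate on, or an iterator of examples
evaluators - a list of evaluators to score the outputs of the function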
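
To make the roles of these three arguments concrete, the sketch below runs the same flow locally: each example's inputs are passed to a target, and each evaluator scores the result against the reference outputs. `run_local_evaluation` and `keyword_classifier` are hypothetical helpers that approximate the data flow only. They do not call LangSmith or a model, and they do not create an experiment.

```python
def keyword_classifier(inputs: dict) -> dict:
    # Hypothetical stand-in target: flags a text as toxic if it contains a listed word.
    toxic_words = {"idiot", "worst"}
    words = {w.strip(".,!?").lower() for w in inputs["text"].split()}
    return {"class": "Toxic" if words & toxic_words else "Not toxic"}


def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]


def run_local_evaluation(target, data: list, evaluators: list) -> list:
    rows = []
    for example in data:
        # The target receives only the example's inputs.
        outputs = target(example["inputs"])
        # Each evaluator sees the inputs, actual outputs, and reference outputs.
        scores = {
            ev.__name__: ev(example["inputs"], outputs, example["outputs"])
            for ev in evaluators
        }
        rows.append({"inputs": example["inputs"], "outputs": outputs, "scores": scores})
    return rows


data = [
    {"inputs": {"text": "Shut up, idiot"}, "outputs": {"label": "Toxic"}},
    {"inputs": {"text": "I had a great day today"}, "outputs": {"label": "Not toxic"}},
    {"inputs": {"text": "Nobody likes you"}, "outputs": {"label": "Toxic"}},
]
rows = run_local_evaluation(keyword_classifier, data, [correct])
accuracy = sum(row["scores"]["correct"] for row in rows) / len(rows)
print(accuracy)
```

The keyword stand-in misses "Nobody likes you", which is the kind of per-row failure that an experiment view makes easy to spot.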

Python
TypeScript

Requires langsmith>=0.3.13

# Can equivalently use the 'evaluate' function directly:
# from langsmith import evaluate; evaluate(...)
results = ls_client.evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4, # optional, add concurrency
)

import { evaluate } from "langsmith/evaluation";

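// Return an object so the evaluator can read the prediction from outputs.output.
await evaluate(async (inputs) => ({ output: await toxicityClassifier(inputs["input"]) }), {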
  data: datasetName,
  evaluators: [correct],
  experimentPrefix: "gpt-4o-mini, baseline",  // optional, experiment name prefix
  maxConcurrency: 4, // optional, add concurrency
});

For other ways to kick off evaluations, see here. For how to configure evaluation jobs, see here.

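Explore the results

Each invocation of evaluate() creates an experiment, which can be viewed in the LangSmith UI or queried via the SDK. Evaluation scores are stored as feedback against each actual output.

If you've annotated your code for tracing, you can open the trace of each row in a side panel view.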

Reference code

A consolidated version of the code snippets from the steps above:

Python
TypeScript

Requires langsmith>=0.3.13

from langsmith import Client, traceable, wrappers
from openai import OpenAI

# Step 1. Define an application
oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def toxicity_classifier(inputs: dict) -> str:
    system = (
      "Please review the user query below and determine if it contains any form of toxic behavior, "
      "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
      "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return result.choices[0].message.content

# Step 2. Create a dataset
ls_client = Client()

dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
examples = [
  {
    "inputs": {"text": "Shut up, idiot"}, 
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "You're a wonderful person"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "This is the worst thing ever"}, 
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "I had a great day today"}, 
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "Nobody likes you"}, 
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "This is unacceptable. I want to speak to the manager."},
    "outputs": {"label": "Not toxic"},
  },
]
ls_client.create_examples(
  dataset_id=dataset.id,
  examples=examples,
)

# Step 3. Define an evaluator
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["output"] == reference_outputs["label"]

# Step 4. Run the evaluation
# Client.evaluate() and evaluate() behave the same.
results = ls_client.evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, simple",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4,  # optional, add concurrency
)

import { OpenAI } from "openai";
import { Client } from "langsmith";
import { evaluate, EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";


const oaiClient = wrapOpenAI(new OpenAI());

const toxicityClassifier = traceable(
  async (text: string) => {
    const result = await oaiClient.chat.completions.create({
      messages: [
        {
          role: "system",
          content: "Please review the user query below and determine if it contains any form of toxic behavior, such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't.",
        },
        { role: "user", content: text },
      ],
      model: "gpt-4o-mini",
      temperature: 0,
    });

    return result.choices[0].message.content;
  },
  { name: "toxicityClassifier" }
);

const langsmith = new Client();

// create a dataset
const labeledTexts = [
  ["Shut up, idiot", "Toxic"],
  ["You're a wonderful person", "Not toxic"],
  ["This is the worst thing ever", "Toxic"],
  ["I had a great day today", "Not toxic"],
  ["Nobody likes you", "Toxic"],
  ["This is unacceptable. I want to speak to the manager.", "Not toxic"],
];

const [inputs, outputs] = labeledTexts.reduce<
  [Array<{ input: string }>, Array<{ outputs: string }>]
>(
  ([inputs, outputs], item) => [
    [...inputs, { input: item[0] }],
    [...outputs, { outputs: item[1] }],
  ],
  [[], []]
);

const datasetName = "Toxic Queries";
const toxicDataset = await langsmith.createDataset(datasetName);
await langsmith.createExamples({ inputs, outputs, datasetId: toxicDataset.id });

// Row-level evaluator
function correct({
  outputs,
  referenceOutputs,
}: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): EvaluationResult {
  const score = outputs.output === referenceOutputs?.outputs;
  return { key: "correct", score };
}

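// Return an object so the evaluator can read the prediction from outputs.output.
await evaluate(async (inputs) => ({ output: await toxicityClassifier(inputs["input"]) }), {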
  data: datasetName,
  evaluators: [correct],
  experimentPrefix: "gpt-4o-mini, simple",  // optional, experiment name prefix
  maxConcurrency: 4, // optional, add concurrency
});
