如何定义一个用于评估的目标函数

运行评估需要三个主要部分

测试输入和预期输出的数据集。
您正在评估的目标函数。
对您的目标函数的输出进行评分的评估器。

本指南将向您展示如何根据您正在评估的应用程序部分来定义目标函数。有关如何创建数据集和如何定义评估器的信息，请参阅此处；有关运行评估的端到端示例，请参阅此处。

目标函数签名

为了在代码中评估应用程序，我们需要一种运行应用程序的方法。使用 `evaluate()`（Python/TypeScript）时，我们将通过传入一个目标函数参数来完成此操作。这是一个函数，它接收数据集示例的输入，并将应用程序输出作为字典返回。在此函数中，我们可以随意调用我们的应用程序。我们也可以随意格式化输出。关键在于，我们定义的任何评估器函数都应与我们在目标函数中返回的输出格式兼容。

from langsmith import Client

# 'inputs' will come from your dataset.
def dummy_target(inputs: dict) -> dict:
    return {"foo": 1, "bar": "two"}

# 'inputs' will come from your dataset.
# 'outputs' will come from your target function.
def evaluator_one(inputs: dict, outputs: dict) -> bool:
    return outputs["foo"] == 2

def evaluator_two(inputs: dict, outputs: dict) -> bool:
    return len(outputs["bar"]) < 3

client = Client()
results = client.evaluate(
    dummy_target,  # <-- target function
    data="your-dataset-name",
    evaluators=[evaluator_one, evaluator_two], 
    ...
)

自动跟踪

`evaluate()` 将自动跟踪您的目标函数。这意味着如果您在目标函数中运行任何可跟踪的代码，这些代码也将作为目标跟踪的子运行被跟踪。

示例：单个 LLM 调用

当我们在提示词上进行迭代或比较模型时，评估单个 LLM 调用会很有用

Python
TypeScript
Python (LangChain)
TypeScript (LangChain)

设置环境变量 `OPENAI_API_KEY` 并安装依赖 `pip install -U openai langsmith`。

from langsmith import wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to automatically 
# trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with a 'messages' key.
  # You can update to match your dataset schema.
  messages = inputs["messages"]
  response = oai_client.chat.completions.create(
      messages=messages,
      model="gpt-4o-mini",
  )
  return {"answer": response.choices[0].message.content}

设置环境变量 `OPENAI_API_KEY` 并安装 `openai` 和 `langsmith`。

import OpenAI from 'openai';
import { wrapOpenAI } from "langsmith/wrappers";

const client = wrapOpenAI(new OpenAI());

// This is the function you will evaluate.
const target = async(inputs) => {
  // This assumes your dataset has inputs with a `messages` key
  const messages = inputs.messages;
  const response = await client.chat.completions.create({
      messages: messages,
      model: 'gpt-4o-mini',
  });
  return { answer: response.choices[0].message.content };
}

设置环境变量 `OPENAI_API_KEY` 并安装依赖 `pip install -U langchain[openai]`。

from langchain.chat_models import init_chat_model

llm = init_chat_model("openai:gpt-4o-mini")

def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with a `messages` key
  messages = inputs["messages"]
  response = llm.invoke(messages)
  return {"answer": response.content}

设置环境变量 `OPENAI_API_KEY` 并安装 `@langchain/openai`。

import { ChatOpenAI } from '@langchain/openai';

// This is the function you will evaluate.
const target = async(inputs) => {
  // This assumes your dataset has inputs with a `messages` key
  const messages = inputs.messages;
  const model = new ChatOpenAI({ model: "gpt-4o-mini" });
  const response = await model.invoke(messages);
  return {"answer": response.content};
}

示例：非 LLM 组件

有时，您可能希望评估应用程序中不涉及 LLM 的步骤。这包括但不限于

RAG 应用中的检索步骤
工具的执行

在此示例中，我们展示了如何测试一个简单的计算器工具。实际上，评估对于具有更复杂且难以进行单元测试行为的组件（如检索器或在线研究工具）非常有用。

Python
TypeScript

from langsmith import traceable

# Optionally decorate with '@traceable' to trace all invocations of this function.
@traceable
def calculator_tool(operation: str, number1: float, number2: float) -> str:
  if operation == "add":
      return str(number1 + number2)
  elif operation == "subtract":
      return str(number1 - number2)
  elif operation == "multiply":
      return str(number1 * number2)
  elif operation == "divide":
      return str(number1 / number2)
  else:
      raise ValueError(f"Unrecognized operation: {operation}.")

# This is the function you will evaluate.
def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with `operation`, `num1`, and `num2` keys.
  operation = inputs["operation"]
  number1 = inputs["num1"]
  number2 = inputs["num2"]
  result = calculator_tool(operation, number1, number2)
  return {"result": result}

import { traceable } from "langsmith/traceable";

// Optionally wrap in 'traceable' to trace all invocations of this function. 
const calculatorTool = traceable(async ({ operation, number1, number2 }) => {
// Functions must return strings
if (operation === "add") {
  return (number1 + number2).toString();
} else if (operation === "subtract") {
  return (number1 - number2).toString();
} else if (operation === "multiply") {
  return (number1 * number2).toString();
} else if (operation === "divide") {
  return (number1 / number2).toString();
} else {
  throw new Error("Invalid operation.");
}
});

// This is the function you will evaluate.
const target = async (inputs) => {
// This assumes your dataset has inputs with `operation`, `num1`, and `num2` keys
const result = await calculatorTool.invoke({
  operation: inputs.operation,
  number1: inputs.num1,
  number2: inputs.num2,
});
return { result };
}

示例：应用程序或代理

评估您的代理应用程序的完整输出可以捕获多个组件之间的交互，从而提供更真实的端到端性能视图。端到端评估还可能发现单独测试函数或单个 LLM 调用时可能遗漏的集成和错误处理问题。

Python
TypeScript

from my_agent import agent
      
# This is the function you will evaluate.
def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with a `messages` key
  messages = inputs["messages"]
  # Replace `invoke` with whatever you use to call your agent
  response = agent.invoke({"messages": messages})
  # This assumes your agent output is in the right format
  return response

import { agent } from 'my_agent';

// This is the function you will evaluate.
const target = async(inputs) => {
// This assumes your dataset has inputs with a `messages` key
const messages = inputs.messages;
// Replace `invoke` with whatever you use to call your agent
const response = await agent.invoke({ messages });
// This assumes your agent output is in the right format
return response;
}

LangGraph / LangChain 目标

如果您有一个 LangGraph/LangChain 代理，它接受数据集中定义的输入，并返回您希望在评估器中使用的输出格式，您可以直接将该对象作为目标传入

from my_agent import agent
from langsmith import Client

client = Client()
client.evaluate(agent, ...)

如何定义一个用于评估的目标函数

目标函数签名

示例：单个 LLM 调用

示例：非 LLM 组件

示例：应用程序或代理

此页面有帮助吗？

您可以留下详细反馈在 GitHub 上.

目标函数签名​

示例：单个 LLM 调用​

示例：非 LLM 组件​

示例：应用程序或代理​

此页面有帮助吗？

您可以留下详细反馈 在 GitHub 上.

目标函数签名

示例：单个 LLM 调用

示例：非 LLM 组件

示例：应用程序或代理

您可以留下详细反馈在 GitHub 上.