How to evaluate a langchain runnable

Key concepts

langchain Runnable objects (e.g. chat models, retrievers, chains, etc.) can be passed directly into evaluate() / aevaluate().

Setup

Let's define a simple chain to evaluate. First, install all the required packages:

pip install -U langsmith langchain[openai]

Now define a chain:

from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

instructions = (
    "Please review the user query below and determine if it contains any form "
    "of toxic behavior, such as insults, threats, or highly negative comments. "
    "Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't."
)

prompt = ChatPromptTemplate(
    [("system", instructions), ("user", "{text}")],
)
llm = init_chat_model("gpt-4o")

chain = prompt | llm | StrOutputParser()
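
As a quick sanity check (not part of the original guide), you can invoke the chain once before evaluating it. The input text and expected label below are illustrative only:

# Hypothetical one-off invocation: the chain takes a {"text": ...} dict and
# returns a plain string label.
print(chain.invoke({"text": "Thanks for the help, this was great!"}))
# Should print something like "Not toxic".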

Evaluation

To evaluate our chain, we can pass it directly to the evaluate() / aevaluate() methods. Note that the input variables of the chain must match the keys of the example inputs. In this case, the example inputs should have the form {"text": "..."}.
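
If your dataset keys did not line up with the chain's input variables, you could wrap the chain in a small target function instead of passing it directly. A minimal sketch, assuming a hypothetical dataset whose inputs use a "query" key rather than "text":

# Hypothetical wrapper target: remap a "query" input key onto the chain's
# "text" prompt variable, and return a dict so an evaluator can read
# outputs["output"].
async def target(inputs: dict) -> dict:
    label = await chain.ainvoke({"text": inputs["query"]})
    return {"output": label}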

Requires langsmith>=0.2.0

from langsmith import aevaluate, Client

client = Client()

# Clone a dataset of texts with toxicity labels.
# Each example input has a "text" key and each output has a "label" key.
dataset = client.clone_public_dataset(
    "https://smith.langchain.com/public/3d6831e6-1680-4c88-94df-618c8e01fc55/d"
)

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Since our chain outputs a string not a dict, this string
    # gets stored under the default "output" key in the outputs dict:
    actual = outputs["output"]
    expected = reference_outputs["label"]

    return actual == expected

results = await aevaluate(
    chain,
    data=dataset,
    evaluators=[correct],
    experiment_prefix="gpt-4o, baseline",
)

The Runnable will be traced appropriately for each output.
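
If you are not running inside an async context, a synchronous sketch with the same arguments (reusing the dataset and evaluator defined above) would use evaluate() instead:

from langsmith import evaluate

# Synchronous counterpart of the aevaluate() call above: same target, data,
# and evaluator.
results = evaluate(
    chain,
    data=dataset,
    evaluators=[correct],
    experiment_prefix="gpt-4o, baseline",
)
# Optionally inspect the results locally (requires pandas):
# results.to_pandas()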

