
How to evaluate a langgraph graph

Key concepts

langgraph is a library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. Evaluating langgraph graphs can be challenging because a single invocation can involve many LLM calls, and which LLM calls are made may depend on the outputs of preceding calls. In this guide we will focus on the mechanics of how to pass graphs and graph nodes to evaluate() / aevaluate(). For evaluation techniques and best practices when building agents, head to the langgraph docs.

End-to-end evaluations

The most common type of evaluation is an end-to-end one, where we want to evaluate the final graph output for each example input.

Define a graph

Let's construct a simple ReACT agent to start:

from typing import Annotated, Literal, TypedDict

from langchain.chat_models import init_chat_model
from langchain_core.tools import tool
from langgraph.graph import END, START, StateGraph
from langgraph.prebuilt import ToolNode
from langgraph.graph.message import add_messages

class State(TypedDict):
    # Messages have the type "list". The 'add_messages' function
    # in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[list, add_messages]

# Define the tools for the agent to use
@tool
def search(query: str) -> str:
    """Call to surf the web."""
    # This is a placeholder, but don't tell the LLM that...
    if "sf" in query.lower() or "san francisco" in query.lower():
        return "It's 60 degrees and foggy."
    return "It's 90 degrees and sunny."

tools = [search]
tool_node = ToolNode(tools)
model = init_chat_model("claude-3-5-sonnet-latest").bind_tools(tools)

# Define the function that determines whether to continue or not
def should_continue(state: State) -> Literal["tools", END]:
    messages = state['messages']
    last_message = messages[-1]
    # If the LLM makes a tool call, then we route to the "tools" node
    if last_message.tool_calls:
        return "tools"
    # Otherwise, we stop (reply to the user)
    return END

# Define the function that calls the model

def call_model(state: State):
    messages = state['messages']
    response = model.invoke(messages)
    # We return a list, because this will get added to the existing list
    return {"messages": [response]}

# Define a new graph
workflow = StateGraph(State)

# Define the two nodes we will cycle between
workflow.add_node("agent", call_model)
workflow.add_node("tools", tool_node)

# Set the entrypoint as 'agent'
# This means that this node is the first one called
workflow.add_edge(START, "agent")

# We now add a conditional edge
workflow.add_conditional_edges(
    # First, we define the start node. We use 'agent'.
    # This means these are the edges taken after the 'agent' node is called.
    "agent",
    # Next, we pass in the function that will determine which node is called next.
    should_continue,
)

# We now add a normal edge from 'tools' to 'agent'.
# This means that after 'tools' is called, 'agent' node is called next.
workflow.add_edge("tools", 'agent')

# Finally, we compile it!
# This compiles it into a LangChain Runnable,
# meaning you can use it as you would any other runnable.
app = workflow.compile()

Create a dataset

Let's create a simple dataset of questions and expected responses:

from langsmith import Client

questions = [
    "what's the weather in sf",
    "whats the weather in san fran",
    "whats the weather in tangier"
]
answers = [
    "It's 60 degrees and foggy.",
    "It's 60 degrees and foggy.",
    "It's 90 degrees and sunny.",
]

ls_client = Client()

dataset = ls_client.create_dataset("weather agent")
ls_client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in answers],
    dataset_id=dataset.id,
)

Create an evaluator

And a simple evaluator:

Requires langsmith>=0.2.0

judge_llm = init_chat_model("gpt-4o")

async def correct(outputs: dict, reference_outputs: dict) -> bool:
    instructions = (
        "Given an actual answer and an expected answer, determine whether"
        " the actual answer contains all of the information in the"
        " expected answer. Respond with 'CORRECT' if the actual answer"
        " does contain all of the expected information and 'INCORRECT'"
        " otherwise. Do not include anything else in your response."
    )
    # Our graph outputs a State dictionary, which in this case means
    # we'll have a 'messages' key and the final message should
    # be our actual answer.
    actual_answer = outputs["messages"][-1].content
    expected_answer = reference_outputs["answer"]
    user_msg = (
        f"ACTUAL ANSWER: {actual_answer}"
        f"\n\nEXPECTED ANSWER: {expected_answer}"
    )
    response = await judge_llm.ainvoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": user_msg}
        ]
    )
    return response.content.upper() == "CORRECT"

Run evaluations

Now we can run our evaluation and explore the results. We just need to wrap our graph function so that it accepts inputs in the format they're stored in our examples:

Evaluating with async nodes

If all of your graph nodes are defined as synchronous functions, then you can use either evaluate or aevaluate. If any of your nodes are defined as async functions, you'll need to use aevaluate.

Requires langsmith>=0.2.0

from langsmith import aevaluate

def example_to_state(inputs: dict) -> dict:
    return {"messages": [{"role": "user", "content": inputs['question']}]}

# We use LCEL declarative syntax here.
# Remember that langgraph graphs are also langchain runnables.
target = example_to_state | app

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)
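
Since all of the nodes above happen to be synchronous, the same experiment could also be run with the synchronous evaluate entrypoint. Below is a minimal sketch under that assumption; the exact_match evaluator and the "claude-3.5-sync" prefix are illustrative stand-ins rather than part of the original example (the async correct judge above would still require aevaluate):

from langsmith import evaluate

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # Hypothetical synchronous evaluator: compares the final message to the reference verbatim.
    return outputs["messages"][-1].content == reference_outputs["answer"]

experiment_results = evaluate(
    target,
    data="weather agent",
    evaluators=[exact_match],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-sync",  # optional
)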

Evaluating intermediate steps

Often it is valuable to evaluate not only the final output of an agent but also the intermediate steps it has taken. What's nice about langgraph is that the output of a graph is a state object, which usually already carries information about the intermediate steps taken. This means we can often evaluate whatever we're interested in just by looking at the messages in the state. For example, we can look at the messages to assert that the model invoked the 'search' tool as its first step.

Requires langsmith>=0.2.0

def right_tool(outputs: dict) -> bool:
    tool_calls = outputs["messages"][1].tool_calls
    return bool(tool_calls and tool_calls[0]["name"] == "search")

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)

If we need access to information about intermediate steps that isn't in the state, we can look at the Run object. This contains the full trace of all node inputs and outputs:

Custom evaluators

For more information on what arguments you can pass to custom evaluators, see this how-to guide.

from langsmith.schemas import Run, Example

def right_tool_from_run(run: Run, example: Example) -> dict:
    # Find the first 'agent' node run in the trace and inspect its tool calls.
    first_model_run = next(child for child in run.child_runs if child.name == "agent")
    tool_calls = first_model_run.outputs["messages"][-1].tool_calls
    right_tool = bool(tool_calls and tool_calls[0]["name"] == "search")
    return {"key": "right_tool", "value": right_tool}

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool_from_run],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)

Running and evaluating individual nodes

Sometimes you want to evaluate an individual node directly to save time and cost. langgraph makes it easy to do this. In this case, we can even continue using the evaluators we've been using.

node_target = example_to_state | app.nodes["agent"]

node_experiment_results = await aevaluate(
    node_target,
    data="weather agent",
    evaluators=[right_tool_from_run],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-model-node",  # optional
)

Reference code

Click to see a consolidated code snippet
from typing import Annotated, Literal, TypedDict

from langchain.chat_models import init_chat_model
from langchain_core.tools import tool
from langgraph.graph import END, START, StateGraph
from langgraph.prebuilt import ToolNode
from langgraph.graph.message import add_messages
from langsmith import Client, aevaluate

# Define a graph

class State(TypedDict):
    # Messages have the type "list". The 'add_messages' function
    # in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[list, add_messages]

# Define the tools for the agent to use

@tool
def search(query: str) -> str:
    """Call to surf the web."""
    # This is a placeholder, but don't tell the LLM that...
    if "sf" in query.lower() or "san francisco" in query.lower():
        return "It's 60 degrees and foggy."
    return "It's 90 degrees and sunny."

tools = [search]
tool_node = ToolNode(tools)
model = init_chat_model("claude-3-5-sonnet-latest").bind_tools(tools)

# Define the function that determines whether to continue or not

def should_continue(state: State) -> Literal["tools", END]:
    messages = state['messages']
    last_message = messages[-1]
    # If the LLM makes a tool call, then we route to the "tools" node
    if last_message.tool_calls:
        return "tools"
    # Otherwise, we stop (reply to the user)
    return END

# Define the function that calls the model

def call_model(state: State):
    messages = state['messages']
    response = model.invoke(messages)
    # We return a list, because this will get added to the existing list
    return {"messages": [response]}

# Define a new graph
workflow = StateGraph(State)

# Define the two nodes we will cycle between
workflow.add_node("agent", call_model)
workflow.add_node("tools", tool_node)

# Set the entrypoint as 'agent'
# This means that this node is the first one called
workflow.add_edge(START, "agent")

# We now add a conditional edge
workflow.add_conditional_edges(
    # First, we define the start node. We use 'agent'.
    # This means these are the edges taken after the 'agent' node is called.
    "agent",
    # Next, we pass in the function that will determine which node is called next.
    should_continue,
)

# We now add a normal edge from 'tools' to 'agent'.
# This means that after 'tools' is called, 'agent' node is called next.
workflow.add_edge("tools", 'agent')

# Finally, we compile it!
# This compiles it into a LangChain Runnable,
# meaning you can use it as you would any other runnable.
app = workflow.compile()

questions = [
    "what's the weather in sf",
    "whats the weather in san fran",
    "whats the weather in tangier"
]
answers = [
    "It's 60 degrees and foggy.",
    "It's 60 degrees and foggy.",
    "It's 90 degrees and sunny.",
]

# Create a dataset
ls_client = Client()

dataset = ls_client.create_dataset("weather agent")
ls_client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in answers],
    dataset_id=dataset.id,
)

# Define evaluators

judge_llm = init_chat_model("gpt-4o")

async def correct(outputs: dict, reference_outputs: dict) -> bool:
    instructions = (
        "Given an actual answer and an expected answer, determine whether"
        " the actual answer contains all of the information in the"
        " expected answer. Respond with 'CORRECT' if the actual answer"
        " does contain all of the expected information and 'INCORRECT'"
        " otherwise. Do not include anything else in your response."
    )
    # Our graph outputs a State dictionary, which in this case means
    # we'll have a 'messages' key and the final message should
    # be our actual answer.
    actual_answer = outputs["messages"][-1].content
    expected_answer = reference_outputs["answer"]
    user_msg = (
        f"ACTUAL ANSWER: {actual_answer}"
        f"\n\nEXPECTED ANSWER: {expected_answer}"
    )
    response = await judge_llm.ainvoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": user_msg}
        ]
    )
    return response.content.upper() == "CORRECT"


def right_tool(outputs: dict) -> bool:
    tool_calls = outputs["messages"][1].tool_calls
    return bool(tool_calls and tool_calls[0]["name"] == "search")

# Run evaluation

def example_to_state(inputs: dict) -> dict:
    return {"messages": [{"role": "user", "content": inputs['question']}]}

target = example_to_state | app

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)
