evaluate#
langsmith.evaluation._runner.evaluate(
    target: TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T],
    /,
    data: DATA_T | None = None,
    evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    experiment_prefix: str | None = None,
    description: str | None = None,
    max_concurrency: int | None = 0,
    num_repetitions: int = 1,
    client: langsmith.Client | None = None,
    blocking: bool = True,
    experiment: EXPERIMENT_T | None = None,
    upload_results: bool = True,
    **kwargs: Any,
)
Evaluate a target system on a given dataset.
Parameters:
target (TARGET_T | Runnable | EXPERIMENT_T | Tuple[EXPERIMENT_T, EXPERIMENT_T]) – The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs (both experiment-based forms are sketched near the end of the Examples section below).
data (DATA_T) – The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples.
evaluators (Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None) – A list of evaluators to run on each example. The evaluator signatures depend on the target type. Defaults to None.
summary_evaluators (Sequence[SUMMARY_EVALUATOR_T] | None) – A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. Defaults to None.
metadata (dict | None) – Metadata to attach to the experiment. Defaults to None.
experiment_prefix (str | None) – A prefix to provide for your experiment name. Defaults to None.
description (str | None) – A free-form text description for the experiment.
max_concurrency (int | None) – The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, no concurrency. Defaults to 0.
client (langsmith.Client | None) – The LangSmith client to use. Defaults to None.
blocking (bool) – Whether to block until the evaluation is complete. Defaults to True.
num_repetitions (int) – The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. Defaults to 1.
experiment (schemas.TracerSession | None) – An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments.
load_nested (bool) – Whether to load all child runs for the experiment. Defaults to only loading the top-level root runs. Should only be specified when target is an existing experiment or a two-tuple of experiments.
randomize_order (bool) – Whether to randomize the order of the outputs for each evaluation. Defaults to False. Should only be specified when target is a two-tuple of existing experiments.
upload_results (bool)
kwargs (Any)
Returns:
ExperimentResults if the target is a function, Runnable, or existing experiment. ComparativeExperimentResults if the target is a two-tuple of existing experiments.
Return type:
ExperimentResults | ComparativeExperimentResults
Examples
Prepare the dataset:
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
...     "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage:
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
...     # This can be any function or just an API call to your app.
...     return {"output": "Yes"}
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Evaluating the accuracy of a simple prediction model.",
...     metadata={
...         "my-prompt-version": "abcd-1234",
...     },
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples:
>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
...     predict,
...     data=examples,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Just testing a subset synchronously.",
... )
View the evaluation results for experiment:...
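Re-evaluating an existing experiment. The target description above also allows an existing experiment as the target. The snippet below is a hedged sketch rather than a verbatim SDK example: it assumes the experiment name captured above (or its ID) is accepted as EXPERIMENT_T, that the given row-level evaluators are applied to that experiment's runs without re-running the target, and that data can be omitted because it is inferred from the experiment.

>>> results = evaluate(
...     experiment_name,  # assumed: an existing experiment name or ID as the target
...     evaluators=[accuracy],  # applied to the already-recorded runs (assumption)
...     description="Adding another evaluator to an existing experiment.",
... )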
Streaming each prediction to more easily and eagerly debug:
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     description="I don't even have to block!",
...     blocking=False,
... )
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):
...     pass
Using the evaluate API with an off-the-shelf LangChain evaluator:
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
...     return {
...         "prediction": run.outputs["output"],
...         "reference": example.outputs["answer"],
...         "input": str(example.inputs),
...     }
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[
...         accuracy,
...         LangChainStringEvaluator("embedding_distance"),
...         LangChainStringEvaluator(
...             "labeled_criteria",
...             config={
...                 "criteria": {
...                     "usefulness": "The prediction is useful if it is correct"
...                     " and/or asks a useful followup question."
...                 },
...                 "llm": ChatOpenAI(model="gpt-4o"),
...             },
...             prepare_data=prepare_criteria_data,
...         ),
...     ],
...     description="Evaluating with off-the-shelf LangChain evaluators.",
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
Evaluating a LangChain object:
>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
...     return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
...     return nested_predict.invoke(inputs)
>>> results = evaluate(
...     lc_predict.invoke,
...     data=dataset_name,
...     evaluators=[accuracy],
...     description="This time we're evaluating a LangChain object.",
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
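Comparing two existing experiments. The tuple form of target, which returns ComparativeExperimentResults, is not shown above. The sketch below is an assumption-laden illustration rather than a verbatim SDK example: the experiment names are hypothetical, and the comparative-evaluator signature and return format (paired runs plus their example, returning a feedback key and per-run scores) are assumptions about COMPARATIVE_EVALUATOR_T. Per the parameters above, summary_evaluators is omitted and randomize_order is only valid in this form.

>>> def preference(runs: Sequence[Run], example: Example):
...     # Hypothetical comparative evaluator: always prefers the first run's output.
...     # The exact COMPARATIVE_EVALUATOR_T signature/return format is assumed here.
...     return {
...         "key": "preference",
...         "scores": {runs[0].id: 1, runs[1].id: 0},
...     }
>>> comparative_results = evaluate(
...     ("My Experiment A", "My Experiment B"),  # hypothetical existing experiment names/IDs
...     evaluators=[preference],
...     randomize_order=True,  # randomize output order per evaluation; tuple-target only
... )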
Changed in version 0.2.0: The default value of 'max_concurrency' was updated from None (no limit on concurrency) to 0 (no concurrency at all).
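For example, to restore concurrent execution after this default change, or to run each dataset item several times via num_repetitions, the parameters can be set explicitly. This is a hedged sketch reusing predict, accuracy, precision, and dataset_name from the examples above; the specific values chosen are illustrative only.

>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     max_concurrency=4,  # up to 4 concurrent evaluations; None = no limit, 0 (default) = none
...     num_repetitions=3,  # each dataset item is run and evaluated 3 times
...     description="Concurrent run with repetitions.",
... )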