How to define a custom evaluator
Key concepts
Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics. These functions can be passed directly into evaluate() / aevaluate().
Basic example
- Python
from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)
- TypeScript
import type { EvaluationResult } from "langsmith/evaluation";

const correct = async ({ outputs, referenceOutputs }: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  const score = outputs?.answer === referenceOutputs?.answer;
  return { key: "correct", score };
};
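Because an evaluator is an ordinary function, it can be sanity-checked on hand-written dictionaries before it is passed to evaluate(). The following is a minimal Python sketch that reuses the correct evaluator from the basic example; the sample dictionaries are made up for illustration and do not come from a real dataset.

```python
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]


# Hypothetical sample data, standing in for one dataset example and one app output.
sample_reference = {"answer": "Paris"}
matching_output = {"answer": "Paris"}
non_matching_output = {"answer": "hmm i'm not sure"}

# Call the evaluator directly, exactly as a plain function.
match_result = correct(matching_output, sample_reference)
mismatch_result = correct(non_matching_output, sample_reference)
print(match_result, mismatch_result)
```

Checking evaluators this way catches key-name mistakes (for example, a missing "answer" key) without running a full experiment.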
Evaluator args
Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:
Python and JS/TS
run: langsmith.schemas.Run
: The full Run object generated by the application on the given example.
example: langsmith.schemas.Example
: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
Currently Python only
inputs: dict
: A dictionary of the inputs corresponding to a single example in a dataset.
outputs: dict
: A dictionary of the outputs generated by the application on the given inputs.
reference_outputs: dict
: A dictionary of the reference outputs associated with the example, if available.
For most use cases you'll only need inputs, outputs, and reference_outputs. The run and example arguments are useful only if you need extra trace or example metadata beyond the application's actual inputs and outputs.
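As a minimal sketch of when run and example are worth reaching for, the evaluator below reads timing information from the run and a metadata field from the example. It assumes the Run object exposes start_time and end_time as datetimes and that the Example exposes a metadata dictionary. The "max_latency_s" metadata key, the 5-second default, and the SimpleNamespace stand-ins are hypothetical and exist only so the function can be exercised locally without a LangSmith dataset.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def within_latency_budget(run, example) -> dict:
    """Check whether the traced call finished within the example's latency budget."""
    # Assumption: the run exposes start_time / end_time as datetime objects.
    elapsed = (run.end_time - run.start_time).total_seconds()
    # Hypothetical metadata key; fall back to 5 seconds when it is absent.
    budget = (example.metadata or {}).get("max_latency_s", 5.0)
    return {"name": "within_latency_budget", "score": elapsed <= budget}


# Local stand-ins for the objects that would be passed in (illustration only).
start = datetime(2024, 1, 1, 12, 0, 0)
fake_run = SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=2))
fake_example = SimpleNamespace(metadata={"max_latency_s": 1.0})

result = within_latency_budget(fake_run, fake_example)
print(result)
```

The argument names run and example are what matter here; the type annotations are omitted so the sketch runs without importing langsmith.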
Evaluator output
Custom evaluators are expected to return one of the following types:
Python and JS/TS
dict
: Dicts of the form {"score" | "value": ..., "name": ...} allow you to customize the metric type ("score" for numerical and "value" for categorical) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.
Currently Python only
int | float | bool
: Interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
str
: Interpreted as a categorical metric. The function name is used as the name of the metric.
list[dict]
: Returns multiple metrics using a single function.
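The return types above can be illustrated with a few small Python evaluators. This is a sketch under stated assumptions, not a reference implementation: the function and metric names are hypothetical, and each entry of the list[dict] result is assumed to follow the same {"score" | "value": ..., "name": ...} form described for single dicts.

```python
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    """bool return: a numerical metric named after the function ("exact_match")."""
    return outputs["answer"] == reference_outputs["answer"]


def answer_kind(outputs: dict) -> str:
    """str return: a categorical metric named after the function ("answer_kind")."""
    return "empty" if not outputs["answer"].strip() else "non_empty"


def word_count(outputs: dict) -> dict:
    """dict return: log an integer as a categorical metric via "value"."""
    return {"name": "word_count", "value": len(outputs["answer"].split())}


def match_and_length(outputs: dict, reference_outputs: dict) -> list[dict]:
    """list[dict] return: several metrics from a single function."""
    answer = outputs["answer"]
    return [
        {"name": "exact_match", "score": answer == reference_outputs["answer"]},
        {"name": "answer_chars", "score": len(answer)},
    ]


# Hypothetical sample data for a quick local check.
sample_outputs = {"answer": "hmm i'm not sure"}
sample_reference = {"answer": "Paris"}
print(exact_match(sample_outputs, sample_reference))
print(answer_kind(sample_outputs))
print(word_count(sample_outputs))
print(match_and_length(sample_outputs, sample_reference))
```

Using "value" in word_count keeps the integer from being averaged like a score, which matches the categorical use case mentioned above.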
Additional examples
- Python
from langsmith import evaluate, wrappers
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""

    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct, concision, valid_reasoning]
)
Related
- Evaluate aggregate experiment results: Define summary evaluators, which compute metrics for an entire experiment.
- Run an evaluation comparing two experiments: Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.