Skip to main content

How to run an evaluation locally (beta, Python only)

Beta

This feature is still in beta.

Sometimes it is helpful to run an evaluation locally without uploading any results to LangSmith. For example, if you're quickly iterating on a prompt and want to smoke test it on a few examples, or if you're validating that your target and evaluator functions are defined correctly, you may not want to record these evaluations.

You can do this by using the LangSmith Python SDK and passing upload_results=False to evaluate() / aevaluate().

This will run you application and evaluators exactly as it always does and return the same output, but nothing will be recorded to LangSmith. This includes not just the experiment results but also the application and evaluator traces.

Example

Let's take a look at an example:

Requires langsmith>=0.2.0. Example also uses pandas.

from langsmith import evaluate, Client

# 1. Create and/or select your dataset
client = Client()
dataset = client.clone_public_dataset(
"https://smith.langchain.com/public/a63525f9-bdf2-4512-83e3-077dc9417f96/d"
)

# 2. Define an evaluator
def is_concise(outputs: dict, reference_outputs: dict) -> bool:
return len(outputs["answer"]) < (3 * len(reference_outputs["answer"]))

# 3. Define the interface to your app
def chatbot(inputs: dict) -> dict:
return {"answer": inputs["question"] + " is a good question. I don't know the answer."}

# 4. Run an evaluation
experiment = evaluate(
chatbot,
data=dataset,
evaluators=[is_concise],
experiment_prefix="my-first-experiment",
# 'upload_results' is the relevant arg.
upload_results=False
)

# 5. Analyze results locally
results = list(experiment)

# Check if 'is_concise' returned False.
failed = [r for r in results if not r["evaluation_results"]["results"][0].score]

# Explore the failed inputs and outputs.
for r in failed:
print(r["example"].inputs)
print(r["run"].outputs)

# Explore the results as a Pandas DataFrame.
# Must have 'pandas' installed.
df = experiment.to_pandas()
df[["inputs.question", "outputs.answer", "reference.answer", "feedback.is_concise"]]
{'question': 'What is the largest mammal?'}
{'answer': "What is the largest mammal? is a good question. I don't know the answer."}

{'question': 'What do mammals and birds have in common?'}
{'answer': "What do mammals and birds have in common? is a good question. I don't know the answer."}
inputs.questionoutputs.answerreference.answerfeedback.is_concise
0What is the largest mammal?What is the largest mammal? is a good question. I don't know the answer.The blue whaleFalse
1What do mammals and birds have in common?What do mammals and birds have in common? is a good question. I don't know the answer.They are both warm-bloodedFalse

Was this page helpful?


You can leave detailed feedback on GitHub.