evaluation#

Evaluation chains for grading LLM and Chain outputs.

This module contains off-the-shelf evaluation chains for grading the output of LangChain primitives such as language models and chains.

Loading an evaluator

To load an evaluator, you can use the load_evaluators or load_evaluator functions with the names of the evaluators to load.

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("qa")
evaluator.evaluate_strings(
    prediction="We sold more than 40,000 units last week",
    input="How many units did we sell last week?",
    reference="We sold 32,378 units",
)

The evaluator must be one of EvaluatorType.

Datasets

To load one of the LangChain HuggingFace datasets, you can use the load_dataset function with the name of the dataset to load.

from langchain.evaluation import load_dataset
ds = load_dataset("llm-math")

Some common use cases for evaluation include:

Low-level API

These evaluators implement one of the following interfaces:

  • StringEvaluator: Evaluate a prediction string against a reference label and/or input context.

  • PairwiseStringEvaluator: Evaluate two prediction strings against each other. Useful for scoring preferences, measuring similarity between two chain or llm agents, or comparing outputs on similar inputs.

  • AgentTrajectoryEvaluator Evaluate the full sequence of actions taken by an agent.

These interfaces enable easier composability and usage within a higher level evaluation framework.

Classes

evaluation.agents.trajectory_eval_chain.TrajectoryEval

A named tuple containing the score and reasoning for a trajectory.

evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain

A chain for evaluating ReAct style agents.

evaluation.agents.trajectory_eval_chain.TrajectoryOutputParser

Trajectory output parser.

evaluation.comparison.eval_chain.LabeledPairwiseStringEvalChain

A chain for comparing two outputs, such as the outputs

evaluation.comparison.eval_chain.PairwiseStringEvalChain

A chain for comparing two outputs, such as the outputs

evaluation.comparison.eval_chain.PairwiseStringResultOutputParser

A parser for the output of the PairwiseStringEvalChain.

evaluation.criteria.eval_chain.Criteria(value)

A Criteria to evaluate.

evaluation.criteria.eval_chain.CriteriaEvalChain

LLM Chain for evaluating runs against criteria.

evaluation.criteria.eval_chain.CriteriaResultOutputParser

A parser for the output of the CriteriaEvalChain.

evaluation.criteria.eval_chain.LabeledCriteriaEvalChain

Criteria evaluation chain that requires references.

evaluation.embedding_distance.base.EmbeddingDistance(value)

Embedding Distance Metric.

evaluation.embedding_distance.base.EmbeddingDistanceEvalChain

Use embedding distances to score semantic difference between a prediction and reference.

evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain

Use embedding distances to score semantic difference between two predictions.

evaluation.exact_match.base.ExactMatchStringEvaluator(*)

Compute an exact match between the prediction and the reference.

evaluation.parsing.base.JsonEqualityEvaluator([...])

Evaluate whether the prediction is equal to the reference after

evaluation.parsing.base.JsonValidityEvaluator(...)

Evaluate whether the prediction is valid JSON.

evaluation.parsing.json_distance.JsonEditDistanceEvaluator([...])

An evaluator that calculates the edit distance between JSON strings.

evaluation.parsing.json_schema.JsonSchemaEvaluator(...)

An evaluator that validates a JSON prediction against a JSON schema reference.

evaluation.qa.eval_chain.ContextQAEvalChain

LLM Chain for evaluating QA w/o GT based on context

evaluation.qa.eval_chain.CotQAEvalChain

LLM Chain for evaluating QA using chain of thought reasoning.

evaluation.qa.eval_chain.QAEvalChain

LLM Chain for evaluating question answering.

evaluation.qa.generate_chain.QAGenerateChain

LLM Chain for generating examples for question answering.

evaluation.regex_match.base.RegexMatchStringEvaluator(*)

Compute a regex match between the prediction and the reference.

evaluation.schema.AgentTrajectoryEvaluator()

Interface for evaluating agent trajectories.

evaluation.schema.EvaluatorType(value[,ย ...])

The types of the evaluators.

evaluation.schema.LLMEvalChain

A base class for evaluators that use an LLM.

evaluation.schema.PairwiseStringEvaluator()

Compare the output of two models (or two outputs of the same model).

evaluation.schema.StringEvaluator()

Grade, tag, or otherwise evaluate predictions relative to their inputs and/or reference labels.

evaluation.scoring.eval_chain.LabeledScoreStringEvalChain

A chain for scoring the output of a model on a scale of 1-10.

evaluation.scoring.eval_chain.ScoreStringEvalChain

A chain for scoring on a scale of 1-10 the output of a model.

evaluation.scoring.eval_chain.ScoreStringResultOutputParser

A parser for the output of the ScoreStringEvalChain.

evaluation.string_distance.base.PairwiseStringDistanceEvalChain

Compute string edit distances between two predictions.

evaluation.string_distance.base.StringDistance(value)

Distance metric to use.

evaluation.string_distance.base.StringDistanceEvalChain

Compute string distances between the prediction and the reference.

Functions

evaluation.comparison.eval_chain.resolve_pairwise_criteria(...)

Resolve the criteria for the pairwise evaluator.

evaluation.criteria.eval_chain.resolve_criteria(...)

Resolve the criteria to evaluate.

evaluation.loading.load_dataset(uri)

Load a dataset from the LangChainDatasets on HuggingFace.

evaluation.loading.load_evaluator(evaluator,ย *)

Load the requested evaluation chain specified by a string.

evaluation.loading.load_evaluators(evaluators,ย *)

Load evaluators specified by a list of evaluator types.

evaluation.scoring.eval_chain.resolve_criteria(...)

Resolve the criteria for the pairwise evaluator.