211 changes: 142 additions & 69 deletions src/content/docs/docs/evaluation.mdx
@@ -5,12 +5,13 @@
supportedLanguages:
- js
- go
- python
- dart
---

import Lang from '../../../components/Lang.astro';
import ThemeImage from '../../../components/ThemeImage.astro';

<Lang lang="js">


Evaluation is a form of testing that helps you validate your LLM's responses and
ensure they meet your quality bar.
@@ -46,6 +47,10 @@
Genkit supports two types of evaluation:

This section explains how to perform inference-based evaluation using Genkit.

<Lang lang="js">



## Quick start

### Setup
@@ -698,43 +703,7 @@
genkit flow:run synthesizeQuestions '{"filePath": "my_input.pdf"}' --output synt

<Lang lang="go">

Evaluation is a form of testing that helps you validate your LLM's responses and
ensure they meet your quality bar.

Genkit supports third-party evaluation tools through plugins, paired
with powerful observability features that provide insight into the runtime state
of your LLM-powered applications. Genkit tooling helps you automatically extract
data including inputs, outputs, and information from intermediate steps to
evaluate the end-to-end quality of LLM responses as well as understand the
performance of your system's building blocks.

### Types of evaluation

Genkit supports two types of evaluation:

- **Inference-based evaluation**: This type of evaluation runs against a
collection of pre-determined inputs, assessing the corresponding outputs for
quality.

This is the most common evaluation type, suitable for most use cases. This
approach tests a system's actual output for each evaluation run.

You can perform the quality assessment manually by visually inspecting the
results. Alternatively, you can automate the assessment by using an
evaluation metric.

- **Raw evaluation**: This type of evaluation directly assesses the quality of
inputs without any inference. This approach is typically used with automated,
metric-based evaluation. All required fields for evaluation (e.g., `input`,
`context`, `output`, and `reference`) must be present in the input dataset. This
is useful when you have data coming from an external source (e.g., collected
from your production traces) and you want an objective measurement of the
quality of the collected data.

For more information, see the [Advanced use](#advanced-use) section of this
page.
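As an illustration, a single record in a raw-evaluation dataset could look like the following. This is a sketch only: the field names come from the list above, but the exact schema depends on the evaluator and metric you use.

```json
[
  {
    "input": "Who is man's best friend?",
    "context": ["Dog is man's best friend"],
    "output": "Dogs are often called man's best friend.",
    "reference": "The dog"
  }
]
```

Because every field the metric needs is already present, the evaluator can score each record directly, without running inference.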

This section explains how to perform inference-based evaluation using Genkit.

## Quick start

Expand Down Expand Up @@ -1189,39 +1158,7 @@ UI, located at `localhost:4000/evaluate`.

<Lang lang="python">

Evaluation is a form of testing that helps you validate your LLM's responses and
ensure they meet your quality bar.

Genkit supports third-party evaluation tools through plugins, paired
with powerful observability features that provide insight into the runtime state
of your LLM-powered applications. Genkit tooling helps you automatically extract
data including inputs, outputs, and information from intermediate steps to
evaluate the end-to-end quality of LLM responses as well as understand the
performance of your system's building blocks.

### Types of evaluation

Genkit supports two types of evaluation:

- **Inference-based evaluation**: This type of evaluation runs against a
collection of pre-determined inputs, assessing the corresponding outputs for
quality.

This is the most common evaluation type, suitable for most use cases. This approach tests a system's actual output for each evaluation run.

You can perform the quality assessment manually by visually inspecting the results. Alternatively, you can automate the assessment by using an evaluation metric.

- **Raw evaluation**: This type of evaluation directly assesses the quality of
inputs without any inference. This approach is typically used with automated,
metric-based evaluation. All required fields for evaluation (e.g., `input`,
`context`, `output`, and `reference`) must be present in the input dataset. This
is useful when you have data coming from an external source (e.g., collected
from your production traces) and you want an objective measurement of the
quality of the collected data.

For more information, see the [Advanced use](#advanced-use) section of this page.

This section explains how to perform inference-based evaluation using Genkit.

## Quick start

@@ -1814,3 +1751,139 @@
genkit flow:run synthesize_questions '{"file_path": "my_input.pdf"}' --output sy
- [Models](/docs/models)

</Lang>

<Lang lang="dart">

Genkit for Dart supports the full evaluation framework. You can run evaluations with or without automated metrics; to use automated metrics in Dart, implement them as custom evaluators.

## Quick start

### Setup

1. Use an existing Genkit app or create a new one by following our [Get started](/docs/get-started) guide.
2. Add the following code to define a simple RAG application to evaluate. For this guide, we use a dummy retriever that always returns the same documents.

```dart
import 'package:genkit/genkit.dart';
import 'package:genkit_google_genai/genkit_google_genai.dart';

final ai = Genkit(plugins: [googleAI()]);

// Dummy retriever that always returns the same facts
final dummyRetriever = ai.defineRetriever(
  name: 'dummyRetriever',
  fn: (query, context) async {
    final facts = [
      "Dog is man's best friend",
      'Dogs have evolved and were domesticated from wolves',
    ];
    return RetrieverResponse(
      documents: facts.map((t) => Document.fromText(t)).toList(),
    );
  },
);

// A simple question-answering flow
final qaFlow = ai.defineFlow(
  name: 'qaFlow',
  inputSchema: .string(),
  outputSchema: .string(),
  fn: (query, context) async {
    final factDocs = await ai.retrieve(
      retriever: dummyRetriever,
      query: query,
    );

    final response = await ai.generate(
      model: googleAI.gemini('gemini-2.5-flash'),
      prompt: 'Answer this question with the given context: $query',
      docs: factDocs.documents,
    );
    return response.text ?? '';
  },
);
```

3. Start your Genkit application.

```bash
genkit start -- dart run bin/evals.dart
```

### Create a dataset

Create a dataset to define the examples we want to use for evaluating our flow.

1. Go to the Dev UI at `http://localhost:4000` and click the **Datasets** button to open the Datasets page.
2. Click on the **Create Dataset** button to open the create dataset dialog.
a. Provide a `datasetId` for your new dataset. This guide uses `myFactsQaDataset`.
b. Select the `Flow` dataset type.
c. Leave the validation target field empty and click **Save**.
3. Your new dataset page appears, showing an empty dataset. Add examples to it by following these steps:
a. Click the **Add example** button to open the example editor panel.
b. Only the `input` field is required. Enter `"Who is man's best friend?"` in the `input` field, and click **Save**.
c. Repeat steps (a) and (b) to add more examples:
- `"Can I give milk to my cats?"`
- `"From which animals did dogs evolve?"`

### Run evaluation and view results

To start evaluating the flow, click the **Run new evaluation** button on your dataset page.

1. Select the `Flow` radio button to evaluate a flow.
2. Select `qaFlow` as the target flow to evaluate.
3. Select `myFactsQaDataset` as the target dataset to use for evaluation.
4. (Optional) If you have defined custom evaluators, you can select them here. Otherwise, you can run the evaluation without metrics to inspect the outputs manually.
5. Click **Run evaluation** to start the evaluation. Once it completes, click the link to go to the _Evaluation details_ page to view the results.

## Core concepts

### Terminology

- **Evaluation**: A process that assesses system performance.
- **Bulk inference**: Running inference on multiple inputs simultaneously.
- **Metric**: A criterion on which an inference is scored. In Dart, metrics are implemented as custom evaluators.
- **Dataset**: A collection of examples to use for inference-based evaluation.

## Custom evaluators

You can extend Genkit to support custom evaluation by defining your own evaluator functions. An evaluator can use an LLM as a judge, perform programmatic (heuristic) checks, or call external APIs to assess the quality of a response.

You define a custom evaluator using the `ai.defineEvaluator` method.

Here's an example of a custom evaluator:

```dart
import 'package:genkit/genkit.dart';

ai.defineEvaluator(
  name: 'custom',
  description: 'Custom evaluator',
  fn: (input, context) async {
    return [
      ...input.dataset.map(
        (d) => EvalFnResponse(
          testCaseId: d.testCaseId!,
          evaluation: EvalFnResponseEvaluation.score(
            Score(
              score: ScoreScore.bool(true),
              status: EvalStatusEnum.PASS,
              details: {'reasoning': 'something, something, something....'},
            ),
          ),
        ),
      ),
    ];
  },
);
```

## Advanced use

### Evaluation using the CLI

The Genkit CLI provides three main evaluation commands: `eval:flow`, `eval:extractData`, and `eval:run`.

Refer to the Node.js or Go sections for more details on using these commands, as the CLI usage is consistent across languages.
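As a sketch, the two most common commands look like the following. The flow name and file names are illustrative, and flags may vary by version; see the Node.js section for authoritative usage.

```shell
# Inference-based evaluation: run the flow on each input in the
# JSON file, then score the outputs with the configured evaluators
genkit eval:flow qaFlow --input testInputs.json

# Raw evaluation: score a dataset that already contains the
# required fields (input, output, context, reference)
genkit eval:run dataset.json
```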

</Lang>