211 changes: 142 additions & 69 deletions src/content/docs/docs/evaluation.mdx
@@ -5,12 +5,13 @@
supportedLanguages:
- js
- go
- python
- dart
---

import Lang from '../../../components/Lang.astro';
import ThemeImage from '../../../components/ThemeImage.astro';

<Lang lang="js">


Evaluation is a form of testing that helps you validate your LLM's responses and
ensure they meet your quality bar.
@@ -46,6 +47,10 @@
Genkit supports two types of evaluation:

This section explains how to perform inference-based evaluation using Genkit.

<Lang lang="js">



## Quick start

### Setup
@@ -698,43 +703,7 @@
genkit flow:run synthesizeQuestions '{"filePath": "my_input.pdf"}' --output synt

<Lang lang="go">

Evaluation is a form of testing that helps you validate your LLM's responses and
ensure they meet your quality bar.

Genkit supports third-party evaluation tools through plugins, paired
with powerful observability features that provide insight into the runtime state
of your LLM-powered applications. Genkit tooling helps you automatically extract
data including inputs, outputs, and information from intermediate steps to
evaluate the end-to-end quality of LLM responses as well as understand the
performance of your system's building blocks.

### Types of evaluation

Genkit supports two types of evaluation:

- **Inference-based evaluation**: This type of evaluation runs against a
collection of pre-determined inputs, assessing the corresponding outputs for
quality.

This is the most common evaluation type, suitable for most use cases. This
approach tests a system's actual output for each evaluation run.

You can perform the quality assessment manually by visually inspecting the
results. Alternatively, you can automate the assessment by using an
evaluation metric.

- **Raw evaluation**: This type of evaluation directly assesses the quality of
inputs without any inference. This approach is typically used with automated,
metric-based evaluation. All required fields for evaluation (e.g., `input`,
`context`, `output`, and `reference`) must be present in the input dataset. This
is useful when you have data coming from an external source (e.g., collected
from your production traces) and you want an objective measurement of the
quality of the collected data.

For more information, see the [Advanced use](#advanced-use) section of this
page.
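As an illustration, a single record in a raw-evaluation dataset could look like the following. This is a sketch only: the field names come from the list above, but the exact schema depends on the evaluator and metric you use.

```json
[
  {
    "input": "Who is man's best friend?",
    "context": ["Dog is man's best friend"],
    "output": "Dogs are often called man's best friend.",
    "reference": "The dog"
  }
]
```

Because every field the metric needs is already present, the evaluator can score each record directly, without running inference.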

This section explains how to perform inference-based evaluation using Genkit.

## Quick start

Expand Down Expand Up @@ -1189,39 +1158,7 @@ UI, located at `localhost:4000/evaluate`.

<Lang lang="python">

Evaluation is a form of testing that helps you validate your LLM's responses and
ensure they meet your quality bar.

Genkit supports third-party evaluation tools through plugins, paired
with powerful observability features that provide insight into the runtime state
of your LLM-powered applications. Genkit tooling helps you automatically extract
data including inputs, outputs, and information from intermediate steps to
evaluate the end-to-end quality of LLM responses as well as understand the
performance of your system's building blocks.

### Types of evaluation

Genkit supports two types of evaluation:

- **Inference-based evaluation**: This type of evaluation runs against a
collection of pre-determined inputs, assessing the corresponding outputs for
quality.

This is the most common evaluation type, suitable for most use cases. This approach tests a system's actual output for each evaluation run.

You can perform the quality assessment manually by visually inspecting the results. Alternatively, you can automate the assessment by using an evaluation metric.

- **Raw evaluation**: This type of evaluation directly assesses the quality of
inputs without any inference. This approach is typically used with automated,
metric-based evaluation. All required fields for evaluation (e.g., `input`,
`context`, `output`, and `reference`) must be present in the input dataset. This
is useful when you have data coming from an external source (e.g., collected
from your production traces) and you want an objective measurement of the
quality of the collected data.

For more information, see the [Advanced use](#advanced-use) section of this page.

This section explains how to perform inference-based evaluation using Genkit.

## Quick start

@@ -1814,3 +1751,139 @@
genkit flow:run synthesize_questions '{"file_path": "my_input.pdf"}' --output sy
- [Models](/docs/models)

</Lang>

<Lang lang="dart">

Genkit for Dart supports the full evaluation framework. You can run evaluations with or without automated metrics; to use automated metrics in Dart, implement them as custom evaluators.

## Quick start

### Setup

1. Use an existing Genkit app or create a new one by following our [Get started](/docs/get-started) guide.
2. Add the following code to define a simple RAG application to evaluate. For this guide, we use a dummy retriever that always returns the same documents.

```dart
import 'package:genkit/genkit.dart';
import 'package:genkit_google_genai/genkit_google_genai.dart';

final ai = Genkit(plugins: [googleAI()]);

// Dummy retriever that always returns the same facts
final dummyRetriever = ai.defineRetriever(
  name: 'dummyRetriever',
  fn: (query, context) async {
    final facts = [
      "Dog is man's best friend",
      'Dogs have evolved and were domesticated from wolves',
    ];
    return RetrieverResponse(
      documents: facts.map((t) => Document.fromText(t)).toList(),
    );
  },
);

// A simple question-answering flow
final qaFlow = ai.defineFlow(
  name: 'qaFlow',
  inputSchema: .string(),
  outputSchema: .string(),
  fn: (query, context) async {
    final factDocs = await ai.retrieve(
      retriever: dummyRetriever,
      query: query,
    );

    final response = await ai.generate(
      model: googleAI.gemini('gemini-2.5-flash'),
      prompt: 'Answer this question with the given context: $query',
      docs: factDocs.documents,
    );
    return response.text ?? '';
  },
);
```

3. Start your Genkit application.

```bash
genkit start -- dart run bin/evals.dart
```

### Create a dataset

Create a dataset to define the examples we want to use for evaluating our flow.

1. Go to the Dev UI at `http://localhost:4000` and click the **Datasets** button to open the Datasets page.
2. Click on the **Create Dataset** button to open the create dataset dialog.
a. Provide a `datasetId` for your new dataset. This guide uses `myFactsQaDataset`.
b. Select the `Flow` dataset type.
c. Leave the validation target field empty and click **Save**.
3. Your new dataset page appears, showing an empty dataset. Add examples to it by following these steps:
a. Click the **Add example** button to open the example editor panel.
b. Only the `input` field is required. Enter `"Who is man's best friend?"` in the `input` field, and click **Save**.
c. Repeat steps (a) and (b) to add more examples:
- `"Can I give milk to my cats?"`
- `"From which animals did dogs evolve?"`

### Run evaluation and view results

To start evaluating the flow, click the **Run new evaluation** button on your dataset page.

1. Select the `Flow` radio button to evaluate a flow.
2. Select `qaFlow` as the target flow to evaluate.
3. Select `myFactsQaDataset` as the target dataset to use for evaluation.
4. (Optional) If you have defined custom evaluators, you can select them here. Otherwise, you can run the evaluation without metrics to inspect the outputs manually.
5. Click **Run evaluation** to start the evaluation. Once it completes, click the link to go to the _Evaluation details_ page to view the results.

## Core concepts

### Terminology

- **Evaluation**: A process that assesses system performance.
- **Bulk inference**: Running inference on multiple inputs simultaneously.
- **Metric**: A criterion on which an inference is scored. In Dart, metrics are implemented as custom evaluators.
- **Dataset**: A collection of examples to use for inference-based evaluation.

## Custom evaluators

You can extend Genkit to support custom evaluation by defining your own evaluator functions. An evaluator can use an LLM as a judge, perform programmatic (heuristic) checks, or call external APIs to assess the quality of a response.

You define a custom evaluator using the `ai.defineEvaluator` method.

Here's an example of a custom evaluator:

```dart
import 'package:genkit/genkit.dart';

ai.defineEvaluator(
  name: 'custom',
  description: 'Custom evaluator',
  fn: (input, context) async {
    return [
      ...input.dataset.map(
        (d) => EvalFnResponse(
          testCaseId: d.testCaseId!,
          evaluation: EvalFnResponseEvaluation.score(
            Score(
              score: ScoreScore.bool(true),
              status: EvalStatusEnum.PASS,
              details: {'reasoning': 'something, something, something....'},
            ),
          ),
        ),
      ),
    ];
  },
);
```

## Advanced use

### Evaluation using the CLI

The Genkit CLI provides three main evaluation commands: `eval:flow`, `eval:extractData`, and `eval:run`.

Refer to the Node.js or Go sections for more details on using these commands, as the CLI usage is consistent across languages.
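As a sketch, the two most common commands look like the following. The flow name and file names are illustrative, and flags may vary by version; see the Node.js section for authoritative usage.

```shell
# Inference-based evaluation: run the flow on each input in the
# JSON file, then score the outputs with the configured evaluators
genkit eval:flow qaFlow --input testInputs.json

# Raw evaluation: score a dataset that already contains the
# required fields (input, output, context, reference)
genkit eval:run dataset.json
```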

</Lang>