From 43527f37c34efbe9c176ed885b0f4ea493e41a2f Mon Sep 17 00:00:00 2001 From: fcogidi <41602287+fcogidi@users.noreply.github.com> Date: Tue, 24 Mar 2026 11:42:13 -0400 Subject: [PATCH 1/2] Add guide for extending AML agent evaluation and datasets --- .../data/building-aml-eval-datasets.md | 562 ++++++++++++++++++ 1 file changed, 562 insertions(+) create mode 100644 implementations/aml_investigation/data/building-aml-eval-datasets.md diff --git a/implementations/aml_investigation/data/building-aml-eval-datasets.md b/implementations/aml_investigation/data/building-aml-eval-datasets.md new file mode 100644 index 00000000..40fbe372 --- /dev/null +++ b/implementations/aml_investigation/data/building-aml-eval-datasets.md @@ -0,0 +1,562 @@ +# Extending AML Agent Evaluation and Datasets + +This guide explains how the AML investigation workflow is wired today and how to extend it in a way that stays aligned with the code in: + +- [`aieng-eval-agents/aieng/agent_evals/aml_investigation`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation) +- [`implementations/aml_investigation`](../) + +It is meant to help you do two related things: + +1. build or extend AML evaluation datasets; +2. add new evaluation dimensions and improve the AML agent itself. + +--- + +## Core Idea + +The AML agent is not a plain QA agent. Each evaluation item is a structured investigation case: + +- the agent receives a `CaseFile`; +- the evaluator holds a `GroundTruth`; +- the agent returns an `AnalystOutput`. + +That means evaluation is also structured. The current system checks: + +- whether the agent reached the right laundering verdict; +- whether it picked the right typology; +- whether it flagged the right transaction IDs; +- whether its narrative is evidence-grounded; +- whether its SQL trace stayed read-only, respected the case window, and avoided redundant querying. + +The key design principle is: **treat AML evaluation as a combination of outcome quality, reasoning quality, and investigation-process quality**. + +--- + +## Where the Extension Points Live + +| Area | File(s) | What to change | +| --- | --- | --- | +| Case schema | [`cases.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py) | Add new input, ground-truth, or output fields | +| Case generation | [`cases.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py), [`cli.py`](cli.py) | Build new datasets or new case categories | +| Agent task wrapper | [`task.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/task.py) | Parse/validate new output schema | +| Agent behavior | [`agent.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py) | Prompt, tools, output schema | +| Item-level deterministic metrics | [`item.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/graders/item.py) | Add field-by-field checks | +| Trace-level metrics | [`trace.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/graders/trace.py) | Add tool-use or SQL-process checks | +| Run-level metrics | [`run.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/graders/run.py) | Add aggregate metrics and slices | +| LLM-judge rubric | [`narrative_pattern_quality.md`](../rubrics/narrative_pattern_quality.md) | Change reasoning-quality criteria | +| Evaluation wiring | [`evaluate.py`](../evaluate.py) | Register new evaluators in the experiment | + +--- + +## Current Dataset Structure + +The canonical schema lives in [`cases.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py). + +### Agent input: `CaseFile` + +| Field | Description | +| --- | --- | +| `case_id` | Unique identifier for the case | +| `seed_transaction_id` | The transaction that triggered the investigation | +| `seed_timestamp` | End of the case window | +| `window_start` | Start of the case window | +| `trigger_label` | Upstream alert label or heuristic hint | + +### Ground truth: `GroundTruth` + +| Field | Description | +| --- | --- | +| `is_laundering` | Whether the case is truly laundering | +| `pattern_type` | One of the supported typologies plus `NONE` | +| `pattern_description` | Human description of the true pattern | +| `attempt_transaction_ids` | Comma-separated causal-chain transaction IDs | + +### Agent output: `AnalystOutput` + +| Field | Description | +| --- | --- | +| `summary_narrative` | Evidence summary and reasoning | +| `is_laundering` | Agent verdict | +| `pattern_type` | Agent typology classification | +| `pattern_description` | Agent explanation of the typology | +| `flagged_transaction_ids` | Transactions the agent believes form the laundering pattern | + +The dataset file is JSONL, one `CaseRecord` per line: + +```json +{ + "input": { + "case_id": "case-001", + "seed_transaction_id": "txn-123", + "seed_timestamp": "2022-09-12T09:05:00", + "window_start": "2022-09-01T00:00:00", + "trigger_label": "RANDOM_REVIEW" + }, + "expected_output": { + "is_laundering": true, + "pattern_type": "STACK", + "pattern_description": "Funds move through sequential intermediary accounts before the seed transaction.", + "attempt_transaction_ids": "txn-100,txn-101,txn-123" + }, + "output": null +} +``` + +--- + +## What the AML Evaluation Already Measures + +Today’s evaluation in [`evaluate.py`](../evaluate.py) combines five distinct perspectives: + +| Level | Evaluator | What it measures | +| --- | --- | --- | +| Item | deterministic grader | Verdict correctness, typology correctness, consistency for benign cases, flagged-ID precision/coverage | +| Item | LLM judge | Narrative quality and pattern-description quality via rubric | +| Trace | deterministic grader | Presence of SQL, read-only compliance, time-window discipline, redundant queries | +| Trace | groundedness judge | Whether the final narrative is supported by tool outputs | +| Run | aggregate grader | Precision/recall/F1 for `is_laundering`, macro F1 and confusion matrix for `pattern_type` | + +This baseline separates: + +- **what answer the agent gave**; +- **why it said that**; +- **how it investigated**. + +--- + +## How to Build New Datasets + +### Option A: Use the built-in case generator + +The fastest path is the data CLI in [`cli.py`](cli.py): + +```bash +uv run --env-file .env python -m implementations.aml_investigation.data.cli create-db \ + --illicit-ratio HI \ + --transactions-size Small + +uv run --env-file .env python -m implementations.aml_investigation.data.cli create-cases \ + --illicit-ratio HI \ + --transactions-size Small \ + --num-laundering-cases 20 \ + --num-false-positive-cases 10 \ + --num-false-negative-cases 10 \ + --num-normal-cases 10 \ + --lookback-days 10 \ + --output-dir implementations/aml_investigation/data +``` + +This route uses: + +- [`normalize_transactions_data()`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/utils.py) +- [`parse_patterns_file()`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py) +- [`build_cases()`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py) + +### Option B: Bring your own transaction dataset + +If you already have transaction data: + +1. normalize it to the schema expected by [`normalize_transactions_data()`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/utils.py); +2. load it into a database compatible with [`schema.ddl`](schema.ddl); +3. generate `CaseRecord` rows directly, or create a compatible patterns file and call `build_cases()`. + +This is the right path when your real-world typologies, alert labels, or entity behavior differ from the Kaggle/IBM sample dataset. + +### Option C: Use an LLM with few-shot ground-truth patterns + +Another practical path is to use an LLM to draft new cases from a small set of validated laundering patterns. + +This works especially well when you already have: + +- a handful of trusted `CaseRecord` examples; +- known typologies with correct `attempt_transaction_ids`; +- seed examples of good false positives and false negatives. + +The few-shot examples teach the model: + +- how `CaseFile`, `GroundTruth`, and `CaseRecord` are structured; +- how to keep `attempt_transaction_ids` limited to the causal chain; +- how to vary `trigger_label` to create easy, weak-signal, or misleading cases; +- how to write short but concrete `pattern_description` values. + +Good uses of LLM assistance: + +- drafting additional case files from known laundering chains; +- generating hard-negative variants by weakening or misleading the `trigger_label`; + +Prompt pattern: + +```text +You are helping build an AML evaluation dataset. + +Here are 3 validated examples of AML CaseRecord objects: + + +Now generate 10 new CaseRecord objects that follow the same schema. +Requirements: +- Keep input and expected_output consistent. +- Only include causal-chain transactions in attempt_transaction_ids. +- Include a mix of laundering, false positive, false negative, and normal cases. +- Vary trigger_label difficulty. +- Return JSONL-ready objects only. +``` + +Important: treat the LLM as a drafting assistant, not the source of truth. Every generated case should still be checked against the underlying transactions or patterns file before it becomes part of the benchmark. + +### Option D: Curate cases by hand + +Manual curation is best for high-signal benchmark sets. + +Good candidates for hand-built AML cases: + +- typologies your current agent often confuses; +- benign cases that look suspicious at first glance; +- cases with incomplete but realistic evidence; +- cases where `trigger_label` is misleading or noisy; +- cases with long chains where over-flagging is common. + +### Optional: add metadata for slice analysis + +The Langfuse uploader in [`langfuse.py`](../../../aieng-eval-agents/aieng/agent_evals/langfuse.py) accepts optional per-record `metadata`. + +That is useful for labeling slices such as: + +- `difficulty: easy|medium|hard` +- `signal_regime: strong_trigger|weak_trigger|misleading_trigger` +- `window_regime: narrow|standard|noisy` +- `source: synthetic|manual|production_like` + +Example: + +```json +{ + "input": { "...": "..." }, + "expected_output": { "...": "..." }, + "metadata": { + "difficulty": "hard", + "signal_regime": "misleading_trigger", + "window_regime": "noisy" + } +} +``` + +Once those tags exist, it becomes much easier to add run-level reporting by slice instead of relying only on one global F1 score. + +--- + +## Tips for Writing Better AML Cases + +### 1. Treat `trigger_label` as a difficulty control + +The generator already uses this idea: + +- laundering cases often get strong labels; +- false negatives get low-signal review labels; +- false positives get alarming labels for benign activity. + +If you want to test robustness, add: + +- more low-signal labels; +- labels from your real alert taxonomy instead of typology names. + +### 2. Keep `attempt_transaction_ids` minimal + +The current item grader compares set overlap between: + +- `expected_output.attempt_transaction_ids` +- `output.flagged_transaction_ids` + +Only include the true causal chain. If you dump every ambient transaction in the window into ground truth, both `id_precision_like` and `id_coverage` become noisy. + +### 3. Vary the window width intentionally + +`window_start` is part of the task, not just metadata. Small changes in window width can make the same case: + +- easy, by exposing the full chain; +- hard, by burying the chain in benign traffic; +- unfair, by hiding crucial earlier hops. + +Create separate slices for narrow, medium, and noisy windows. + +### 4. Cover every typology you care about + +Run-level pattern scoring is macro F1, so missing typologies matter. If you add a new pattern label, you should also update: + +- [`LaunderingPattern`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py); +- the agent instructions in [`agent.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py); +- any deterministic graders that assume the existing label set; +- your dataset distribution. + +--- + +## How to Add New Evaluation Dimensions + +There are three main ways to extend evaluation. + +### 1. Add a new deterministic item-level grader + +Use this when you can compute the score directly from `input`, `output`, and `expected_output`. + +Good AML-specific examples: + +- whether the seed transaction is always included when laundering is predicted; +- whether the agent defaults to `NONE` when evidence is weak; +- whether the predicted chain preserves temporal order; +- whether the agent over-relies on `trigger_label`; +- whether the narrative mentions the same typology as the structured field. + +Minimal pattern: + +```python +from aieng.agent_evals.evaluation import Evaluation + + +def seed_transaction_flagged_grader(*, input, output, expected_output, metadata=None, **kwargs): + del expected_output, metadata, kwargs + + predicted_is_laundering = output.get("is_laundering") + predicted_ids = { + token.strip() + for token in str(output.get("flagged_transaction_ids", "")).split(",") + if token.strip() + } + seed_id = input.get("seed_transaction_id") + + applicable = predicted_is_laundering is True + passed = (seed_id in predicted_ids) if applicable else True + + return [ + Evaluation( + name="seed_transaction_flagged", + value=1.0 if passed else 0.0, + metadata={"applicable": applicable, "seed_transaction_id": seed_id}, + ) + ] +``` + +Then register it in [`evaluate.py`](../evaluate.py) by adding it to `evaluators=[...]`. + +### 2. Add a new LLM-judge rubric + +Use this when the metric depends on judgment rather than exact matching. + +Good candidates: + +- quality of benign-explanation analysis; +- clarity of escalation rationale; +- whether the narrative distinguishes evidence from speculation; +- whether the agent explains why alternative typologies were rejected. + +Create a new rubric markdown file under [`implementations/aml_investigation/rubrics`](../rubrics), then wire it in with: + +```python +custom_evaluator = create_llm_as_judge_evaluator( + name="benign_hypothesis_quality", + rubric_markdown="implementations/aml_investigation/rubrics/benign_hypothesis_quality.md", +) +``` + +### 3. Add a new trace-level evaluator + +Use trace evaluators when you care about investigation process, not just the final answer. + +Good candidates: + +- number of unique accounts investigated before concluding; +- whether the agent started with schema/aggregate queries before raw detail pulls; +- whether it queried both sides of the seed transaction; +- whether it repeatedly pulled large raw result sets instead of narrowing with aggregates. + +You can either: + +- extend [`trace.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/graders/trace.py), or +- create a separate trace grader file and add it to `trace_evaluators=[...]` in [`evaluate.py`](../evaluate.py). + +--- + +## Recommended New Metrics for This Agent + +If you want the next most valuable additions, these are the best candidates: + +| Metric | Why it is useful | Best level | +| --- | --- | --- | +| `seed_transaction_flagged` | Catches cases where the returned chain does not anchor to the case trigger | Item | +| `typology_narrative_consistency` | Detects disagreement between structured fields and prose | Item | +| `benign_hypothesis_quality` | Forces the agent to justify why activity is benign or suspicious | LLM judge | +| `account_coverage` | Measures whether the agent inspected both inbound and outbound sides of critical accounts | Trace | +| `aggregate_first_discipline` | Rewards starting with summaries instead of dumping raw transactions | Trace | +| `trigger_label_dependence` | Measures whether strong labels are inflating performance | Run / slice | +| `performance_by_pattern_type` | Makes blind spots obvious without reading the full confusion matrix | Run | +| `performance_by_window_width` | Shows when the agent breaks under noise or limited context | Run | + +The highest-leverage improvement is usually **slice-based reporting**. Even if overall F1 looks good, the agent may still fail badly on: + +- low-signal cases; +- benign-but-anomalous cases; +- long-hop stack/layering cases; +- rare typologies. + +--- + +## How to Add a New Output Field + +If you want to evaluate something the current output schema does not expose, add it explicitly. + +Examples: + +- `confidence` +- `benign_explanation` +- `missing_information` +- `investigated_accounts` +- `evidence_transaction_ids` + +When you add a new output field, update these pieces together: + +1. [`AnalystOutput`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/data/cases.py) +2. the output instructions in [`agent.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py) +3. any dataset rows or ground-truth fields needed to score it +4. [`task.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/task.py), if parsing assumptions change +5. your grader(s) in [`item.py`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/graders/item.py) or new evaluator files + +As a rule: **do not try to infer important evaluation fields back out of free text if you can ask the agent to emit them structurally instead**. + +--- + +## How to Extend the AML Agent Itself + +The main agent factory is [`create_aml_investigation_agent()`](../../../aieng-eval-agents/aieng/agent_evals/aml_investigation/agent.py). Right now it uses: + +- a read-only schema tool; +- a read-only SQL execution tool; +- a strict `AnalystOutput` response schema. + +### 1. Add higher-level domain tools + +The current agent can do everything with SQL, but not always efficiently. Good additions would be: + +- `get_seed_transaction_context(seed_transaction_id, window_start, seed_timestamp)` +- `summarize_account_activity(account_id, window_start, seed_timestamp)` +- `find_counterparty_clusters(account_id, window_start, seed_timestamp)` +- `trace_fund_path(seed_transaction_id, max_hops)` +- `compute_account_features(account_id, window_start, seed_timestamp)` + +These tools reduce prompt load, standardize analysis, and make trace evaluation easier because the intended workflow becomes more explicit. + +### 2. Add graph-oriented tooling + +AML patterns are graph problems. A dedicated tool that returns: + +- inbound degree; +- outbound degree; +- repeating counterparties; +- connected components; +- short path summaries; +- cycle candidates; + +would likely improve typology classification more than prompt tuning alone. + +### 3. Add a code interpreter + +This is optional, but it can be useful for: + +- multi-step feature calculations; +- clustering or graph analysis over query results; +- validating candidate chains before writing the final structured output. + +If you add it, pair it with trace evaluation so you can measure whether it actually improves difficult cases instead of just increasing tool churn. + +An relatively easy way to add a code interpreter is to use the `aieng-agents` package: + +```bash +pip install aieng-agents +``` + +Then create a new agent factory that extends the existing one with code-interpreter tools: + +```python +from aieng.agents.tools import CodeInterpreter + +interpreter = CodeInterpreter() +# add interpreter to the agent's toolset +``` + +### 4. Add account/entity enrichment + +The current database has `accounts` and an `account_transactions` view in [`schema.ddl`](schema.ddl), but the agent could benefit from richer context such as: + +- jurisdiction or bank risk tiers; +- business/entity categories; +- known customer profile expectations; +- payment-channel risk metadata. + +If you add enrichment, also add evaluation slices to test whether the agent uses it correctly instead of just becoming more verbose. + +### 5. Tighten the prompt around investigation discipline + +The existing prompt already says “start with aggregates” and “default benign unless evidence contradicts this.” You can strengthen it further by requiring: + +- an explicit benign hypothesis section; +- a hard query budget; +- a requirement to explain why the seed transaction belongs in the final chain; +- a requirement to cite the specific observations behind the verdict. + +### 6. Make the output more audit-friendly + +Consider extending the output schema with: + +- `decision_rationale` +- `key_evidence` +- `alternative_explanations_considered` +- `recommended_follow_up` +- `confidence` + +Those fields make both human review and automated scoring easier. + +--- + +## A Good Pattern for Co-Evolving Agent and Evaluation + +The good workflow is: + +1. add a new structured output field or tool; +2. add a deterministic or rubric-based metric that specifically checks whether the new capability is used well; +3. add dataset slices where that capability matters; +4. compare overall metrics and slice metrics before keeping the change. + +Examples: + +- If you add a graph tool, add trace metrics for whether it was used and whether pattern accuracy improves on multi-hop cases. +- If you add a `confidence` field, add calibration slices that compare confidence to correctness. +- If you add a benign-explanation field, add a rubric that rewards concrete benign reasoning and penalizes unsupported suspicion. + +--- + +## Suggested First Extensions + +If you want a practical roadmap, start here: + +1. Add a `seed_transaction_flagged` deterministic metric. +2. Add a second rubric focused on benign-hypothesis quality. +3. Report metrics by `trigger_label` regime and `pattern_type`. +4. Add one higher-level account-summary tool so the agent relies less on repeated raw SQL. +5. Extend the dataset with more benign-but-suspicious cases to pressure test false-positive behavior. + +That combination usually gives the best signal quickly: it improves both agent behavior and the quality of the benchmark you use to judge it. + +--- + +## Recommended Validation Loop + +After changing the dataset, agent, or graders, run the full loop: + +```bash +uv run --env-file .env implementations/aml_investigation/cli.py \ + --input-path implementations/aml_investigation/data/aml_cases.jsonl \ + --output-path implementations/aml_investigation/data/aml_cases_with_output.jsonl + +uv run --env-file .env implementations/aml_investigation/evaluate.py \ + --dataset-path implementations/aml_investigation/data/aml_cases.jsonl \ + --dataset-name AML-investigation +``` + +Use the first command when you want to inspect raw outputs case by case. Use the second when you want the full Langfuse-backed evaluation stack, including trace evaluators. From d6387ebe6e027f427781b73fb84b46c443aae178 Mon Sep 17 00:00:00 2001 From: fcogidi <41602287+fcogidi@users.noreply.github.com> Date: Tue, 24 Mar 2026 11:53:02 -0400 Subject: [PATCH 2/2] Fix typo --- .../aml_investigation/data/building-aml-eval-datasets.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/implementations/aml_investigation/data/building-aml-eval-datasets.md b/implementations/aml_investigation/data/building-aml-eval-datasets.md index 40fbe372..f31cc1b7 100644 --- a/implementations/aml_investigation/data/building-aml-eval-datasets.md +++ b/implementations/aml_investigation/data/building-aml-eval-datasets.md @@ -465,7 +465,7 @@ This is optional, but it can be useful for: If you add it, pair it with trace evaluation so you can measure whether it actually improves difficult cases instead of just increasing tool churn. -An relatively easy way to add a code interpreter is to use the `aieng-agents` package: +A relatively easy way to add a code interpreter is to use the `aieng-agents` package: ```bash pip install aieng-agents