- Trace cost comes from the agent execution trace only. Mock or non-metered tools will show $0 even when EvalView used a separate judge or local model during evaluation.
- This check also used 6 EvalView judge calls (2489 tokens).
- ✗ implement · 20.0/100 · ⚡ 360030ms · 🧠 Unknown
- Model: Unknown · Baseline: 2026-04-06 13:26 · Baseline model: Not recorded in snapshot
Score Breakdown
Tools: 0.0% × 30%
Output: 0.0/100 × 50%
Sequence: Correct × 20%
= 20.0/100
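Read as a weighted sum (an assumption based on the weights shown, with a correct sequence counting as 100), this works out to 0.0 × 0.30 + 0.0 × 0.50 + 100 × 0.20 = 20.0; the same weighting reproduces the 32.5 further down: 33.3 × 0.30 + 5.0 × 0.50 + 100 × 0.20 ≈ 32.5.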
-
-
The response does not implement group_by_key(), does not run the __main__ block, and simply repeats the timeout error. It fails to address the user’s request entirely.
-
-
-
- Query: Implement the group_by_key() function in stub.py. The docstring and type signature are already there — the body is just pass. Implement it, then run the __main__ block to verify it prints the correctly grouped output.
-
-
-
-
-
Why it failed
-
Score 20.0 below minimum 70.0
Output quality: 0.0/100
Hallucination detected (70% confidence)
Tool accuracy: 0.0%
-
-
sequenceDiagram
    participant User
    participant Agent
    participant T0 as error
    User->>Agent: Implement the group_by_key f...
    Agent-xT0: error
    T0-->Agent: OpenCode timed out after 360...
    Agent-->>User: OpenCode timed out after 360...
-
-
-
-
Conversation Turns
-
-
-
Turn 1 · error · ⚡ 360030.9ms · 💰 $0
-
Implement the group_by_key() function in stub.py. The docstring and type signature are already there — the body is just pass. Implement it, then run the __main__ block to verify it prints the correctly grouped output.
-
-
OpenCode timed out after 360s
-
-
-
-
-
- 🔮 Hallucination detected · 70% · [Warning] - Tool 'error' failed/returned error, but agent did not acknowledge the failure · openai/gpt-5.4-mini
- 🛡 Safe
-
-
-
-
Unsupported claims: [Warning] - Tool 'error' failed/returned error, but agent did not acknowledge the failure
(Confidence 70% below threshold 98% - not blocking)
-
-
-
-
-
-
-
- ✗ bug-fix · 32.5/100 · ⚡ 324936ms · 🧠 Unknown
- Model: Unknown · Baseline: 2026-04-06 13:26 · Baseline model: Not recorded in snapshot
Score Breakdown
Tools: 33.3% × 30%
Output: 5.0/100 × 50%
Sequence: Correct × 20%
= 32.5/100
-
-
The response is relevant and acknowledges the bug, but it does not actually fix the code or run the file to confirm the output. It only states an intention to inspect the file, so it fails to satisfy the requested task and provides no concrete correction.
-
-
-
- Query: There is a bug in buggy.py. The find_max function uses range(1, len(numbers) - 1) which means it never checks the last element. Fix the bug so the function correctly returns the maximum value including the last element. After fixing, run the file to confirm it prints 9.
-
-
-
-
-
Why it failed
-
Score 32.5 below minimum 70.0
Output quality: 5.0/100
Tool accuracy: 33.3%
-
-
sequenceDiagram
    participant User
    participant Agent
    participant T0 as read_file
    User->>Agent: There is a bug in buggy.py.
    Agent->>T0: filePath=/Users/hidaibar-mor...
    T0-->Agent: path/Users/hidaibar-mor/Down...
    Agent-->>User: I'll fix the bug in the find_...
-
-
-
-
Conversation Turns
-
-
-
Turn 1 · read_file · ⚡ 324936.4ms · 💰 $0
-
There is a bug in buggy.py. The find_max function uses range(1, len(numbers) - 1) which means it never checks the last element. Fix the bug so the function correctly returns the maximum value including the last element. After fixing, run the file to confirm it prints 9.
-
-
I'll fix the bug in the find_max function in buggy.py. Let me first examine the file to understand the current implementation.
-
-
-
-
-
- 🔮 No hallucination · No verifiable factual claims found in output. · openai/gpt-5.4-mini
- 🛡 Safe
-
-
-
-
-
-
-
-
-
-
-
- ✗ refactor · 20.0/100 · ⚡ 158438ms · 🧠 Unknown
- Model: Unknown · Baseline: 2026-04-06 13:26 · Baseline model: Not recorded in snapshot
Score Breakdown
Tools: 0.0% × 30%
Output: 0.0/100 × 50%
Sequence: Correct × 20%
= 20.0/100
-
-
The response does not answer the query, provide a refactoring, or confirm execution. It only contains a context-size error message and no code, so it fails completeness and relevance entirely.
-
-
-
- Query: Refactor the function p() in messy.py. It has poor naming, deeply nested conditionals, and uses type() instead of isinstance(). Rewrite it to be clean and readable while preserving the exact same behaviour: filter out None values, strip whitespace from non-empty strings, and pass through all other types unchanged. Then run the file to confirm it still prints {'name': 'Alice', 'age': 30}.
-
-
-
-
-
Why it failed
-
Score 20.0 below minimum 70.0
Output quality: 0.0/100
Tool accuracy: 0.0%
-
-
◎ Direct response — no tools invoked
-
-
-
- 🔮 No hallucination · No verifiable factual claims found in output. · openai/gpt-5.4-mini
- 🛡 Safe
-
-
-
Tool sequence changed: 'glob' at step 1 was replaced by 'error'
-
Suggested fix: The agent is calling the right tools but in a different order. If the new order is valid, run `evalview snapshot --variant <name>` to accept it as an alternative.
-
-
-
-
Baseline Output
Output matches expected behavior:

- `"fruit"` → `[apple, banana]`
- `"veggie"` → `[carrot]`

The implementation iterates over each item, skips any missing the key, and appends to the appropriate grou...
-
Current Output
OpenCode timed out after 360s
--- golden
+++ actual
@@ -1,6 +1 @@
-Output matches expected behavior:
-
-
-
-- `"fruit"` → `[apple, banana]`
-
-- `"veggie"` → `[carrot]`
-
-
-
-The implementation iterates over each item, skips any missing the key, and appends to the appropriate group in the result dict.
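For reference, a minimal sketch of the behavior the baseline describes. The actual stub.py signature and docstring are not shown in this report, so the signature, parameter names, and sample data below are assumptions:

```python
from typing import Any

def group_by_key(items: list[dict[str, Any]], key: str) -> dict[Any, list[dict[str, Any]]]:
    """Group items by the value of `key`, skipping items that lack it (assumed signature)."""
    grouped: dict[Any, list[dict[str, Any]]] = {}
    for item in items:
        if key not in item:  # skip any item missing the key, per the baseline description
            continue
        grouped.setdefault(item[key], []).append(item)
    return grouped

if __name__ == "__main__":
    # Hypothetical data chosen to mirror the baseline's example grouping
    data = [
        {"type": "fruit", "name": "apple"},
        {"type": "fruit", "name": "banana"},
        {"type": "veggie", "name": "carrot"},
        {"name": "mystery"},  # no "type" key, so it is skipped
    ]
    print(group_by_key(data, "type"))
```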
Tools 'edit_file', 'bash' were expected but not called
-
Suggested fix: Check if your agent's prompt or logic still triggers 'edit_file'. If the tool was intentionally removed, run `evalview snapshot` to update the baseline.
-
-
-
-
Baseline Output
Fixed. The bug was `range(1, len(numbers) - 1)` which excluded the last index. Changed to `range(1, len(numbers))` so all elements are checked. Output confirms `9`.
-
Current Output
I'll fix the bug in the find_max function in buggy.py. Let me first examine the file to understand the current implementation.
--- golden
+++ actual
@@ -1 +1 @@
-Fixed. The bug was `range(1, len(numbers) - 1)` which excluded the last index. Changed to `range(1, len(numbers))` so all elements are checked. Output confirms `9`.
+I'll fix the bug in the find_max function in buggy.py. Let me first examine the file to understand the current implementation.
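A minimal sketch of the fix the baseline describes; the rest of buggy.py is not shown in this report, so everything other than the range() change and the expected output of 9 is assumed:

```python
def find_max(numbers):
    # Assumed surrounding implementation; the baseline only records the range() change.
    maximum = numbers[0]
    for i in range(1, len(numbers)):  # was range(1, len(numbers) - 1), which skipped the last element
        if numbers[i] > maximum:
            maximum = numbers[i]
    return maximum

if __name__ == "__main__":
    # Hypothetical data where the maximum is the last element, so the old bug would have missed it
    print(find_max([3, 1, 4, 1, 5, 9]))  # expected to print 9, per the test query
```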
Tools 'read_file', 'todowrite', 'edit_file', 'edit_file', 'todowrite', 'bash', 'todowrite' were expected but not called
-
Suggested fix: Check if your agent's prompt or logic still triggers 'read_file'. If the tool was intentionally removed, run `evalview snapshot` to update the baseline.
-
-
-
-
Baseline Output
Output confirmed: `{'name': 'Alice', 'age': 30}`

Here's a summary of what changed in `messy.py`:

| Before | After | Reason |
|--------|-------|--------|
| `p(d)` | `process_dict(data)` | Descriptive...
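A sketch consistent with the query's spec and the rename shown above; the original body of messy.py and the recorded solution are not included in this report, so this reconstruction is an assumption:

```python
def process_dict(data: dict) -> dict:
    """Drop None values, strip whitespace from strings, pass other values through unchanged."""
    result = {}
    for key, value in data.items():
        if value is None:
            continue  # filter out None values
        if isinstance(value, str):  # isinstance() rather than type(), as the query asks
            result[key] = value.strip()
        else:
            result[key] = value
    return result

if __name__ == "__main__":
    print(process_dict({"name": "  Alice ", "age": 30, "nickname": None}))
    # expected: {'name': 'Alice', 'age': 30}
```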
Built with EvalView | Agent testing & regression detection
\ No newline at end of file
diff --git a/.evalview/latest-redesign-v2.html b/.evalview/latest-redesign-v2.html
deleted file mode 100644
index 6e51fb0..0000000
--- a/.evalview/latest-redesign-v2.html
+++ /dev/null
@@ -1,681 +0,0 @@
-EvalView Check Report · 2026-03-17 15:46 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests) · Avg Score: 79.7 out of 100 · Avg Latency: 1360ms per test
-Total Cost: $0.017 · 2,835 tokens (verified) · in 2,170 / out 665
-Agent Model: openai/gpt-4o-mini, openai/claude-3.5-sonnet · 5 tests in this run
-EvalView Judge (gpt-4o-mini): $0.006 · 1,320 tokens across 5 judge calls · in 1,100 / out 220 · separate from agent trace cost
-Execution Cost per Query:
-Test | Model | Trace Cost | Tokens | Latency | Score
-refund-flow | openai/gpt-4o-mini | $0.0034 | 570 tokens | 1200ms | 92.5
-order-lookup | openai/gpt-4o-mini | $0.0021 | 415 tokens | 800ms | 88.0
-billing-dispute | openai/gpt-4o-mini | $0.0058 | 890 tokens | 2400ms | 45.0
-password-reset | openai/gpt-4o-mini | $0.0012 | 260 tokens | 600ms | 95.0
-vip-escalation | openai/claude-3.5-sonnet | $0.0045 | 700 tokens | 1800ms | 78.0
-Total | — | $0.017 | avg $0.003400 per query
-Trace cost comes from the agent execution trace only. Mock or non-metered tools will show $0 even when EvalView used a separate judge or local model during evaluation.
-This check also used 5 EvalView judge calls (1320 tokens).
-Built with EvalView | Agent testing & regression detection
\ No newline at end of file
diff --git a/.evalview/latest-v3.html b/.evalview/latest-v3.html
deleted file mode 100644
index 4de07bf..0000000
--- a/.evalview/latest-v3.html
+++ /dev/null
@@ -1,681 +0,0 @@
-EvalView Check Report · 2026-03-17 16:05 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests) · Avg Score: 79.7 out of 100 · Total Cost: $0.017 (2,835 tokens: in 2,170 / out 665) · Avg Latency: 1360ms per test
-Agent Model: openai/gpt-4o-mini, openai/claude-3.5-sonnet · EvalView Judge (gpt-4o-mini): $0.006, 1,320 tokens across 5 judge calls (in 1,100 / out 220)
-Execution Cost per Query: refund-flow $0.0034 (92.5) · order-lookup $0.0021 (88.0) · billing-dispute $0.0058 (45.0) · password-reset $0.0012 (95.0) · vip-escalation $0.0045 (78.0) · Total $0.017, avg $0.003400 per query
\ No newline at end of file
diff --git a/.evalview/latest-v4.html b/.evalview/latest-v4.html
deleted file mode 100644
index 7084865..0000000
--- a/.evalview/latest-v4.html
+++ /dev/null
@@ -1,708 +0,0 @@
-EvalView Check Report · 2026-03-17 16:52 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests passing) · Avg Score: 79.7/100 · Total Cost: $0.017 (2,835 tokens: in 2,170 / out 665) · Avg Latency: 1360ms per test
-Agent Model: openai/gpt-4o-mini, openai/claude-3.5-sonnet · EvalView Judge (gpt-4o-mini): $0.006, 1,320 tokens across 5 judge calls (in 1,100 / out 220)
-Execution Cost per Query: refund-flow $0.0034 (92.5) · order-lookup $0.0021 (88.0) · billing-dispute $0.0058 (45.0) · password-reset $0.0012 (95.0) · vip-escalation $0.0045 (78.0) · Total $0.017, avg $0.003400 per query
-vip-escalation detail: Model openai/claude-3.5-sonnet · in 520 / out 180 tokens · $0.0045 · Query: VIP leaving · ◎ Direct response — no tools invoked · Response: Escalated.
-✨ No diffs yet — run evalview check to compare against a baseline · ⏱ No step timing data
\ No newline at end of file
diff --git a/.evalview/latest-v5.html b/.evalview/latest-v5.html
deleted file mode 100644
index 2586400..0000000
--- a/.evalview/latest-v5.html
+++ /dev/null
@@ -1,689 +0,0 @@
-EvalView Check Report · 2026-03-18 08:00 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests passing) · Avg Score: 79.7 out of 100 · Total Cost: $0.017 (2,835 tokens: in 2,170 / out 665) · Avg Latency: 1360ms per test
-Agent Model: openai/gpt-4o-mini, openai/claude-3.5-sonnet · EvalView Judge (gpt-4o-mini): $0.006, 1,320 tokens across 5 judge calls (in 1,100 / out 220)
-Execution Cost per Query: refund-flow $0.0034 (92.5) · order-lookup $0.0021 (88.0) · billing-dispute $0.0058 (45.0) · password-reset $0.0012 (95.0) · vip-escalation $0.0045 (78.0) · Total $0.017, avg $0.003400 per query
-vip-escalation detail: Model openai/claude-3.5-sonnet · in 520 / out 180 tokens · $0.0045 · Query: VIP leaving · ◎ Direct response — no tools invoked · Response: Escalated.
-✨ No diffs yet — run evalview check to compare against a baseline · ⏱ No step timing data
\ No newline at end of file
diff --git a/.evalview/latest-v6.html b/.evalview/latest-v6.html
deleted file mode 100644
index aee40eb..0000000
--- a/.evalview/latest-v6.html
+++ /dev/null
@@ -1,616 +0,0 @@
-EvalView Check Report · 2026-03-18 09:25 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests) · Avg Score: 79.7 out of 100 · Total Cost: $0.017 (2,835 tokens: in 2,170 / out 665) · Avg Latency: 1360ms per test
-Agent Model: openai/gpt-4o-mini, openai/claude-3.5-sonnet · EvalView Judge (gpt-4o-mini): $0.006, 1,320 tokens across 5 judge calls (in 1,100 / out 220)
-Execution Cost per Query: refund-flow $0.0034 (92.5) · order-lookup $0.0021 (88.0) · billing-dispute $0.0058 (45.0) · password-reset $0.0012 (95.0) · vip-escalation $0.0045 (78.0) · Total $0.017, avg $0.003400 per query
-vip-escalation detail: Model openai/claude-3.5-sonnet · in 520 / out 180 tokens · $0.0045 · Query: VIP · ◎ Direct response — no tools invoked · Response: Escalated.
-✨ No diffs yet — run evalview check to compare against a baseline · ⏱ No step timing data
\ No newline at end of file
diff --git a/.evalview/latest-v6b.html b/.evalview/latest-v6b.html
deleted file mode 100644
index e6e050e..0000000
--- a/.evalview/latest-v6b.html
+++ /dev/null
@@ -1,616 +0,0 @@
-EvalView Check Report · 2026-03-18 09:34 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests) · Avg Score: 79.7 out of 100 · Total Cost: $0.017 (2,835 tokens: in 2,170 / out 665) · Avg Latency: 1360ms per test
-Agent Model: openai/gpt-4o-mini, openai/claude-3.5-sonnet · EvalView Judge (gpt-4o-mini): $0.006, 1,320 tokens across 5 judge calls (in 1,100 / out 220)
-Execution Cost per Query: refund-flow $0.0034 (92.5) · order-lookup $0.0021 (88.0) · billing-dispute $0.0058 (45.0) · password-reset $0.0012 (95.0) · vip-escalation $0.0045 (78.0) · Total $0.017, avg $0.003400 per query
-vip-escalation detail: Model openai/claude-3.5-sonnet · in 520 / out 180 tokens · $0.0045 · Query: VIP · ◎ Direct response — no tools invoked · Response: Escalated.
-✨ No diffs yet — run evalview check to compare against a baseline · ⏱ No step timing data
\ No newline at end of file
diff --git a/.evalview/latest-v6c.html b/.evalview/latest-v6c.html
deleted file mode 100644
index 5caff22..0000000
--- a/.evalview/latest-v6c.html
+++ /dev/null
@@ -1,603 +0,0 @@
-EvalView Check Report · 2026-03-18 09:37 · ✗ 1 Failed · 5 Tests
-Pass Rate: 80.0% (4 of 5 tests) · Avg Score: 79.7 out of 100 · Total Cost: $0.017 (2,835 tokens: in 2,170 / out 665) · Avg Latency: 1360ms per test
-Agent Model: anthropic/claude-sonnet-4-6 · 5 tests in this run
-Execution Cost per Query (all anthropic/claude-sonnet-4-6): refund-flow $0.0034 (92.5) · order-lookup $0.0021 (88.0) · billing-dispute $0.0058 (45.0) · password-reset $0.0012 (95.0) · vip-escalation $0.0045 (78.0) · Total $0.017, avg $0.003400 per query
-Trace cost comes from the agent execution trace only. Mock or non-metered tools will show $0 even when EvalView used a separate judge or local model during evaluation.
-vip-escalation detail: Model anthropic/claude-sonnet-4-6 · in 520 / out 180 tokens · $0.0045 · Query: VIP · ◎ Direct response — no tools invoked · Response: Escalated.
-✨ No diffs yet — run evalview check to compare against a baseline · ⏱ No step timing data
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 866773b..c2ceaa2 100644
--- a/.gitignore
+++ b/.gitignore
@@ -47,6 +47,10 @@ env/
.evalview/config.yaml
.evalview/golden/
.evalview/history.jsonl
+# Generated artifacts that should not be tracked
+.evalview/*.html
+.evalview/badge.json
+.evalview/healing/
tests/test-cases/*.yaml
!tests/test-cases/example.yaml