Add evals for mapbox-search-patterns by mattpodwysocki · Pull Request #41 · mapbox/mapbox-agent-skills

mattpodwysocki · 2026-03-06T23:52:33Z

Summary

Adds 3 evals for mapbox-search-patterns
Benchmark: +25pp overall (92% with skill vs 67% without)

Eval Results

Eval	With Skill	Without Skill	Delta
1. Brand vs category tool selection	100%	75%	+25pp
2. Generic category 'near me' search	100%	75%	+25pp
3. Spatial constraints: 'in downtown'	75%	50%	+25pp
Overall	92%	67%	+25pp
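
The overall row is consistent with an unweighted mean of the three per-eval pass rates, rounded to the nearest whole percent. The PR does not state how the benchmark aggregates, so the averaging method in this quick check is an assumption:

```python
# Per-eval pass rates (percent), copied from the table above.
with_skill = [100, 100, 75]
without_skill = [75, 75, 50]


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation method)."""
    return sum(values) / len(values)


overall_with = round(mean(with_skill))        # 275 / 3 = 91.67
overall_without = round(mean(without_skill))  # 200 / 3 = 66.67
delta_pp = overall_with - overall_without     # difference in percentage points

print(overall_with, overall_without, delta_pp)
```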

Key Discriminating Patterns

Brand error (eval 1): Base model correctly picks search_and_geocode_tool for brands and category_search_tool for categories, but doesn't know that passing a brand name to category_search_tool returns an error (not just poor results).
Proximity semantics (eval 2): Base model correctly recommends switching to category_search_tool, but doesn't explain that proximity is a soft bias with no hard exclusion. It even suggests bbox as an alternative for "near me", conflating the two.
"in" vs "near" distinction (eval 3): Base model recommends a bbox + proximity combination but misses the linguistic distinction between "in downtown" (bbox enforces the boundary) and "near downtown" (proximity only). It also doesn't warn that a too-tight bbox can return zero results.

Note: the base model has solid general Mapbox search knowledge; the skill's value is in the specific failure modes and parameter semantics.
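
The patterns above can be sketched as a small decision helper. This is a minimal illustration, not the skill's implementation: `plan_search` and its return shape are hypothetical, the brand/category decision is passed in rather than detected, and the keyword matching on "in" and "near" is deliberately crude. Only the tool names and the `bbox` / `proximity` parameter names come from the PR.

```python
import re


def plan_search(query, is_brand):
    """Hypothetical helper: pick a tool and spatial parameters for a query.

    is_brand is supplied by the caller; this sketch does not try to
    detect brand names on its own.
    """
    # Brands go to text search, generic categories go to category search.
    tool = "search_and_geocode_tool" if is_brand else "category_search_tool"

    text = query.lower()
    if re.search(r"\bin\b", text):
        # "in downtown": a bbox enforces the boundary (hard filter).
        spatial = ["bbox"]
    elif re.search(r"\bnear\b", text):
        # "near me" / "near downtown": proximity only (soft bias).
        spatial = ["proximity"]
    else:
        spatial = []
    return {"tool": tool, "spatial": spatial}


print(plan_search("coffee shops in downtown", is_brand=False))
print(plan_search("Starbucks near me", is_brand=True))
```

Because a bbox is a hard filter, a caller following this plan would still want to handle the empty-result case, for example by widening the box.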

🤖 Generated with Claude Code

Benchmark results (+25pp overall):
- Eval 1 (brand vs category tool selection): +25pp (100% vs 75%)
  - Base model picks the right tools but doesn't know that passing a brand name to category_search_tool returns an error
- Eval 2 (generic category 'near me' search): +25pp (100% vs 75%)
  - Base model switches to category_search_tool but doesn't explain proximity as soft-bias (conflates it with bbox)
- Eval 3 (spatial constraints for 'in downtown'): +25pp (75% vs 50%)
  - Base model misses the 'in' vs 'near' linguistic distinction and the too-tight bbox → zero results warning

Note: the base model has solid Mapbox search knowledge; +25pp reflects consistent gaps in nuanced failure modes and parameter semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ianshward

Looks good. I left one comment about something potentially not being an error.

ianshward · 2026-03-17T18:06:26Z

skills/mapbox-search-patterns/evals/evals.json

+      "expectations": [
+        "Uses search_and_geocode_tool for 'Find Starbucks' — Starbucks is a specific brand name, not a category",
+        "Uses category_search_tool for 'Find Coffee Shops' — coffee shops is a generic category/plural type",
+        "Explains that using category_search_tool for a brand name like Starbucks would return an error (brands are not valid category values)",


It may not return an error, but may end up returning no results.
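
If the behavior is uncertain, a caller can treat an error and an empty result the same way. A minimal sketch, assuming `category_search` and `text_search` are hypothetical callables standing in for category_search_tool and search_and_geocode_tool (their real signatures are not shown in this PR):

```python
def search_category_or_fallback(term, category_search, text_search):
    """Try category search first; fall back to text search when it
    either raises or returns nothing (e.g. a brand name was passed)."""
    try:
        results = category_search(term)
    except Exception:
        # Treat an error the same as "no results".
        results = []
    if results:
        return results
    return text_search(term)
```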

mattpodwysocki requested a review from a team as a code owner March 6, 2026 23:52

ianshward approved these changes Mar 17, 2026

mattpodwysocki wants to merge 1 commit into main from add-search-patterns-evals
