Add evals for mapbox-search-patterns#41

Open
mattpodwysocki wants to merge 1 commit into main from add-search-patterns-evals

Conversation

@mattpodwysocki
Contributor

Summary

  • Adds 3 evals for mapbox-search-patterns
  • Benchmark: +25pp overall (92% with skill vs 67% without)

Eval Results

| Eval | With Skill | Without Skill | Delta |
|---|---|---|---|
| 1. Brand vs category tool selection | 100% | 75% | +25pp |
| 2. Generic category 'near me' search | 100% | 75% | +25pp |
| 3. Spatial constraints: 'in downtown' | 75% | 50% | +25pp |
| Overall | 92% | 67% | +25pp |
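
The overall row is consistent with an unweighted mean of the three per-eval pass rates, rounded to whole percentages. That averaging method is an assumption (the PR does not state how the overall figure was computed); the per-eval numbers are copied from the results above. A quick check:

```python
# Per-eval pass rates (percent), copied from the eval results.
with_skill = [100, 100, 75]
without_skill = [75, 75, 50]


def mean(values):
    """Unweighted mean; assumes every eval counts equally."""
    return sum(values) / len(values)


overall_with = mean(with_skill)        # about 91.67, reported as 92%
overall_without = mean(without_skill)  # about 66.67, reported as 67%
delta_pp = overall_with - overall_without  # difference in percentage points

print(round(overall_with), round(overall_without), round(delta_pp))
```

Under that assumption the rounded values match the reported 92%, 67%, and +25pp.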

Key Discriminating Patterns

  • Brand error (eval 1): Base model correctly picks search_and_geocode_tool for brands and category_search_tool for categories, but doesn't know that passing a brand name to category_search_tool returns an error (not just poor results).
  • Proximity semantics (eval 2): Base model recommends switching to category_search_tool correctly, but doesn't explain that proximity is a soft bias without hard exclusion — it even suggests bbox as an alternative for "near me", conflating the two.
  • "in" vs "near" distinction (eval 3): Base model recommends bbox + proximity combination but misses the linguistic distinction between "in downtown" (bbox enforces boundary) vs "near downtown" (proximity only). Also doesn't warn that too-tight bbox can return zero results.
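
To make the proximity and bbox semantics described above concrete, here is a small local simulation. It does not call Mapbox; the `Place` type, the `search` helper, the coordinates, and the squared-degree distance are all hypothetical stand-ins chosen for illustration. It only models the behavior the evals describe: proximity reorders results without dropping any, while bbox excludes anything outside the box.

```python
from dataclasses import dataclass


@dataclass
class Place:
    name: str
    lon: float
    lat: float


def in_bbox(place, bbox):
    """Hard filter: True only if the place lies inside the box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= place.lon <= max_lon and min_lat <= place.lat <= max_lat


def rank_by_proximity(places, center):
    """Soft bias: reorder by distance to center, never drop a result.

    Squared degree distance is a crude stand-in for real distance.
    """
    lon0, lat0 = center
    return sorted(places, key=lambda p: (p.lon - lon0) ** 2 + (p.lat - lat0) ** 2)


def search(places, phrase, center, bbox):
    """'in' applies the bbox boundary first; 'near' uses proximity only."""
    candidates = places
    if phrase == "in":
        candidates = [p for p in places if in_bbox(p, bbox)]
    return rank_by_proximity(candidates, center)


# Hypothetical data: one cafe inside the downtown box, one outside it.
places = [Place("Cafe B", -122.45, 37.76), Place("Cafe A", -122.40, 37.78)]
center = (-122.405, 37.785)
downtown = (-122.42, 37.77, -122.39, 37.80)
too_tight = (-122.4051, 37.7849, -122.4049, 37.7851)

near = [p.name for p in search(places, "near", center, downtown)]
inside = [p.name for p in search(places, "in", center, downtown)]
empty = [p.name for p in search(places, "in", center, too_tight)]
print(near, inside, empty)
```

In this sketch "near downtown" keeps both cafes and only changes their order, "in downtown" drops the cafe outside the box, and the too-tight box returns nothing at all.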

Note: the base model has solid general Mapbox search knowledge; the skill's value is in the specific failure modes and parameter semantics.

🤖 Generated with Claude Code

Benchmark results (+25pp overall):
- Eval 1 (brand vs category tool selection): +25pp (100% vs 75%)
  - Base model picks the right tools but doesn't know that passing a
    brand name to category_search_tool returns an error
- Eval 2 (generic category 'near me' search): +25pp (100% vs 75%)
  - Base model switches to category_search_tool but doesn't explain
    proximity as soft-bias (conflates it with bbox)
- Eval 3 (spatial constraints for 'in downtown'): +25pp (75% vs 50%)
  - Base model misses the 'in' vs 'near' linguistic distinction and
    the too-tight bbox → zero results warning

Note: the base model has solid Mapbox search knowledge; +25pp reflects
consistent gaps in nuanced failure modes and parameter semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mattpodwysocki mattpodwysocki requested a review from a team as a code owner March 6, 2026 23:52
Member

@ianshward ianshward left a comment

Looks good. I left one comment about something potentially not being an error.

"expectations": [
"Uses search_and_geocode_tool for 'Find Starbucks' — Starbucks is a specific brand name, not a category",
"Uses category_search_tool for 'Find Coffee Shops' — coffee shops is a generic category/plural type",
"Explains that using category_search_tool for a brand name like Starbucks would return an error (brands are not valid category values)",
Member

It may not return an error, but may end up returning no results.
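
Since it is unclear whether a brand name passed to category_search_tool produces an error or just an empty result, a caller could treat both outcomes the same way. A minimal sketch, assuming two hypothetical callables that wrap category_search_tool and search_and_geocode_tool and return lists (the real tool interfaces may differ):

```python
def find_places(query, category_search, search_and_geocode):
    """Try the category search; fall back if it errors or returns nothing.

    Both callables are hypothetical wrappers around the tools named in
    this PR, not the actual tool signatures.
    """
    try:
        results = category_search(query)
    except Exception:
        # Broad catch for illustration only; real code should catch
        # the specific error type the tool raises.
        results = []
    if results:
        return results
    return search_and_geocode(query)


# Stub tools that simulate the two outcomes under discussion.
def category_raises(query):
    raise ValueError("not a valid category: " + query)


def category_empty(query):
    return []


def geocode_stub(query):
    return [query + " (from search_and_geocode)"]


print(find_places("Starbucks", category_raises, geocode_stub))
print(find_places("Starbucks", category_empty, geocode_stub))
```

With this handling, the expectation text would not need to assert which of the two outcomes actually occurs.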
