Open
Benchmark results (+25pp overall):
- Eval 1 (brand vs category tool selection): +25pp (100% vs 75%)
- Base model picks the right tools but doesn't know that passing a
brand name to category_search_tool returns an error
- Eval 2 (generic category 'near me' search): +25pp (100% vs 75%)
- Base model switches to category_search_tool but doesn't explain
proximity as soft-bias (conflates it with bbox)
- Eval 3 (spatial constraints for 'in downtown'): +25pp (75% vs 50%)
- Base model misses the 'in' vs 'near' linguistic distinction and
the too-tight bbox → zero results warning
Note: the base model has solid Mapbox search knowledge; +25pp reflects
consistent gaps in nuanced failure modes and parameter semantics.
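The parameter semantics named in Evals 2 and 3 can be illustrated with a toy model. This is a minimal sketch, not the Mapbox implementation: `Place`, `rank_by_proximity`, `filter_by_bbox`, and the sample coordinates are all hypothetical, and distance is plain Euclidean distance in degrees. It only shows the difference between a soft bias (reorders, drops nothing) and a hard bounding box (can drop everything).

```python
import math
from dataclasses import dataclass


@dataclass
class Place:
    name: str
    lon: float
    lat: float


def distance(place, point):
    # Euclidean distance in degrees; adequate for a toy ranking only.
    return math.hypot(place.lon - point[0], place.lat - point[1])


def rank_by_proximity(places, point):
    """Soft bias: closer places come first, but nothing is excluded."""
    return sorted(places, key=lambda p: distance(p, point))


def filter_by_bbox(places, bbox):
    """Hard constraint: keep only places inside (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        p for p in places
        if min_lon <= p.lon <= max_lon and min_lat <= p.lat <= max_lat
    ]


# Hypothetical sample data.
PLACES = [
    Place("Cafe A", -122.420, 37.780),
    Place("Cafe B", -122.400, 37.790),
    Place("Cafe C", -122.300, 37.900),
]
USER = (-122.419, 37.779)

ranked = rank_by_proximity(PLACES, USER)
inside = filter_by_bbox(PLACES, (-122.43, 37.77, -122.39, 37.80))
too_tight = filter_by_bbox(PLACES, (-122.4195, 37.7785, -122.4185, 37.7795))
```

In this sketch `ranked` still contains all three places, `inside` drops the distant one, and `too_tight` is empty, which is the zero-results case Eval 3 expects a warning about.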
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ianshward
approved these changes
Mar 17, 2026
Member
ianshward
left a comment
Looks good. I left one comment about something potentially not being an error.
"expectations": [
  "Uses search_and_geocode_tool for 'Find Starbucks' — Starbucks is a specific brand name, not a category",
  "Uses category_search_tool for 'Find Coffee Shops' — coffee shops is a generic category/plural type",
  "Explains that using category_search_tool for a brand name like Starbucks would return an error (brands are not valid category values)",
Member
It may not return an error, but may end up returning no results.
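Since the outcome could be either an error or an empty result set, a caller may want to treat both the same way. The following is a minimal sketch under that assumption; `CategoryToolError`, `search_with_fallback`, the stand-in search functions, and the category list are hypothetical and do not reflect the real tools' behavior.

```python
class CategoryToolError(Exception):
    """Hypothetical error type for the stand-in category tool."""


KNOWN_CATEGORIES = {"coffee", "restaurant", "hotel"}  # illustrative only


def strict_category_search(value):
    # Stand-in that rejects unknown category values with an error.
    if value.lower() not in KNOWN_CATEGORIES:
        raise CategoryToolError(f"unknown category: {value}")
    return [f"{value} result"]


def lenient_category_search(value):
    # Stand-in that silently returns nothing for unknown category values.
    if value.lower() not in KNOWN_CATEGORIES:
        return []
    return [f"{value} result"]


def text_search(query):
    # Stand-in for a free-text search such as a brand lookup.
    return [f"{query} (text match)"]


def search_with_fallback(query, category_search, fallback_search):
    """Try the category search; on error OR empty results, use the fallback."""
    try:
        results = category_search(query)
    except CategoryToolError:
        results = []
    if results:
        return "category", results
    return "text", fallback_search(query)
```

With either stand-in, `search_with_fallback("Starbucks", ...)` ends up on the text path, so the expectation does not need to depend on which failure mode the tool actually exhibits.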
Summary
mapbox-search-patterns Eval Results
Key Discriminating Patterns
- Eval 1 (brand vs category tool selection): the base model uses search_and_geocode_tool for brands and category_search_tool for categories, but doesn't know that passing a brand name to category_search_tool returns an error (not just poor results).
- Eval 2 (generic category "near me" search): the base model uses category_search_tool correctly, but doesn't explain that proximity is a soft bias without hard exclusion; it even suggests bbox as an alternative for "near me", conflating the two.
- Eval 3 (spatial constraints for "in downtown"): the base model uses the bbox + proximity combination but misses the linguistic distinction between "in downtown" (bbox enforces boundary) and "near downtown" (proximity only). It also doesn't warn that a too-tight bbox can return zero results.

Note: the base model has solid general Mapbox search knowledge; the skill's value is in the specific failure modes and parameter semantics.
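The routing these patterns describe can be sketched as a small planner. This is an illustration under stated assumptions, not the skill's actual logic: `plan_search`, the `KNOWN_BRANDS` set, and the keyword matching on "in" and "near" are hypothetical simplifications. Only the tool names and the bbox/proximity parameter names come from the PR.

```python
import re

KNOWN_BRANDS = {"starbucks"}  # hypothetical; a real brand list would be far larger


def plan_search(query: str) -> dict:
    """Pick a tool and spatial parameters from the wording of a query."""
    text = query.lower()

    # The subject is whatever precedes the spatial preposition, minus "find".
    subject = re.split(r"\b(?:near|in)\b", text)[0]
    subject = re.sub(r"^\s*find\s+", "", subject).strip()

    # Brand names go to the text search tool, everything else to categories.
    if subject in KNOWN_BRANDS:
        tool = "search_and_geocode_tool"
    else:
        tool = "category_search_tool"

    # "in X" implies a hard boundary; "near X" implies a soft bias only.
    if re.search(r"\bin\b", text):
        spatial = ["bbox"]
    elif re.search(r"\bnear\b", text):
        spatial = ["proximity"]
    else:
        spatial = []

    return {"tool": tool, "subject": subject, "spatial": spatial}
```

For example, `plan_search("Find coffee shops near me")` selects category_search_tool with proximity only, while `plan_search("restaurants in downtown")` adds a bbox, which is where the too-tight-bbox warning becomes relevant.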
🤖 Generated with Claude Code