feat: add real estate scorer benchmark task #115

Open
TomerWeissman wants to merge 1 commit into SakanaAI:main from TomerWeissman:feat/real-estate-scorer-task
Conversation

@TomerWeissman
Summary

  • Adds tasks/real_estate_scorer/ — a new benchmark that evolves a Python scoring function for Buenos Aires apartment investment quality
  • Fitness metric: Spearman rank correlation between evolved scores and ground-truth price_per_m2 rankings on a held-out 10-listing test set
  • Baseline (linear weighted sum) achieves 0.9030 Spearman correlation; evolution should discover better feature interactions
  • Includes: dataset generator (data.py, seed=42), ShinkaEvolve evaluator (evaluate.py), evolvable initial program (initial.py), standalone baseline (baseline.py), evolution runner + YAML config for 10-generation runs
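
The baseline and fitness metric described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function names, feature keys, and the pure-Python Spearman implementation are assumptions (the real `baseline.py`/`evaluate.py` presumably use `scipy.stats.spearmanr`).

```python
def baseline_score(listing, weights):
    # Hypothetical baseline: a linear weighted sum over listing features.
    # Feature keys are made up for illustration.
    return sum(weights[k] * listing[k] for k in weights)

def _ranks(xs):
    # Average ranks (1-based) with tie handling, as Spearman requires.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the ranks.
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Fitness would then be `spearman(evolved_scores, ground_truth_price_per_m2)` over the 10-listing test set; a perfectly monotone scorer yields 1.0.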

Test plan

  • python -m tasks.real_estate_scorer.data generates deterministic train/test JSON
  • python -m tasks.real_estate_scorer.baseline prints Spearman 0.9030
  • python evaluate.py --program_path initial.py produces combined_score: 0.903 and correct: true
  • ruff check tasks/ passes clean
  • python run_evo.py --config_path shinka.yaml runs 10-generation evolution (requires LLM API keys)

🤖 Generated with Claude Code

New example task that evolves a scoring function for Buenos Aires apartment
investment quality, using Spearman rank correlation as the fitness metric.
Includes dataset generator, evaluator, baseline (0.903 Spearman), and
ShinkaEvolve config for 10-generation runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RobertTLange
Collaborator

Thanks for putting this together. As it stands, this looks more like a standalone example/benchmark task than something that demonstrates a novel intrinsic of ShinkaEvolve itself.

At the moment, we are not looking to merge example use-cases that do not clearly exemplify new ShinkaEvolve capabilities or behaviors. In practice, that means examples should highlight something framework-specific, such as a new execution mode, evaluator pattern, language/backend capability, or a core optimization/evolution behavior.

Are you planning to extend this PR to showcase any of those kinds of ShinkaEvolve-specific intrinsics? If yes, please call that out explicitly in the PR description and align the example and results around that. If not, I do not think this is a fit for merge in its current form.

Please also refer to the contribution guidelines, especially the note that we should not add random benchmark tasks or examples just to justify a PR, and that representative tasks should highlight the capability being changed.
