Refactor GenderExtractor to use context.ontology instead of hardcoded URIs#834
Conversation
📝 WalkthroughWalkthroughThe changes refactor three independent components to improve robustness and maintainability: GenderExtractor transitions from hardcoded URIs to ontology-driven lookups with division-by-zero safeguards, WikiInfo enhances CSV parsing with defensive validation and logging, and ExtractionManager switches to Option-based namespace resolution with better error handling. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 86-91: The new guard `secondCount > 0` in the extraction condition
changes semantics and prevents single‑gender pages from being extracted; either
remove that guard and instead explicitly handle a zero `secondCount` (e.g.,
treat `secondCount == 0` as an automatic pass when `maxCount >
GenderExtractorConfig.minCount`), or keep the guard but add a clear comment
explaining the intentional behavior change; update the condition around
`maxGender`, `maxCount`, `secondCount`, `GenderExtractorConfig.minCount` and
`GenderExtractorConfig.minDifference` so that zero `secondCount` is handled
deterministically (avoid silent skipping) and document the chosen rationale
adjacent to the `secondCount` check.
🧹 Nitpick comments (3)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (1)
31-31: Nit: remove emoji from code comment.The
✅emoji in a source file comment (Line 31) is unusual for this codebase and may cause encoding issues in some editors/tools. A plain comment would be more consistent.core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (2)
85-94: Silently defaulting pages to 0 on parse failure may mask data issues.When the page count is malformed, defaulting to
0means this wiki will appear to have zero pages, which could affect downstream logic (e.g., Language construction usespagesfrom WikiInfo). ANonereturn (skip the line entirely) might be safer, similar to the invalid language code handling.
77-83: Extract the hardcoded field count as a named constant with documentation.The minimum field count of 15 lacks justification — the code only accesses fields at indices 2 and 4 (minimum requirement of 5). The wikistats CSV schema is not formally documented and may change per the wmflabs service. Extract
15to a top-level constant with a comment explaining which specific CSV format and field layout it applies to.
| if ( | ||
| maxGender != "" && | ||
| maxCount > GenderExtractorConfig.minCount && | ||
| secondCount > 0 && | ||
| maxCount / secondCount > GenderExtractorConfig.minDifference | ||
| ) |
There was a problem hiding this comment.
Behavior change: single-gender pages will no longer be extracted.
The added secondCount > 0 guard (Line 89) changes semantics. Previously, when only one gender's pronouns were found (secondCount == 0.0), the division maxCount / 0.0 yielded Infinity, which exceeded minDifference, so the gender was reported. Now those cases are silently skipped.
This could suppress valid extractions for pages where pronouns of only one gender appear. If intentional, please add a comment explaining the rationale. If not, consider an alternative like:
if (
maxGender != "" &&
maxCount > GenderExtractorConfig.minCount &&
- secondCount > 0 &&
- maxCount / secondCount > GenderExtractorConfig.minDifference
+ (secondCount == 0 || maxCount / secondCount > GenderExtractorConfig.minDifference)
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ( | |
| maxGender != "" && | |
| maxCount > GenderExtractorConfig.minCount && | |
| secondCount > 0 && | |
| maxCount / secondCount > GenderExtractorConfig.minDifference | |
| ) | |
| if ( | |
| maxGender != "" && | |
| maxCount > GenderExtractorConfig.minCount && | |
| (secondCount == 0 || maxCount / secondCount > GenderExtractorConfig.minDifference) | |
| ) |
🤖 Prompt for AI Agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`
around lines 86 - 91, The new guard `secondCount > 0` in the extraction
condition changes semantics and prevents single‑gender pages from being
extracted; either remove that guard and instead explicitly handle a zero
`secondCount` (e.g., treat `secondCount == 0` as an automatic pass when
`maxCount > GenderExtractorConfig.minCount`), or keep the guard but add a clear
comment explaining the intentional behavior change; update the condition around
`maxGender`, `maxCount`, `secondCount`, `GenderExtractorConfig.minCount` and
`GenderExtractorConfig.minDifference` so that zero `secondCount` is handled
deterministically (avoid silent skipping) and document the chosen rationale
adjacent to the `secondCount` check.



Refactors GenderExtractor to use context.ontology for property and class lookups
instead of hardcoded URI strings.
This removes magic string constants and aligns the extractor with the
centralized ontology management system.
Fixes #825
Summary by CodeRabbit
Bug Fixes
Improvements