Refactor GenderExtractor to use context.ontology instead of hardcoded URIs #834

Open
prashantkumarniraj wants to merge 3 commits into dbpedia:master from prashantkumarniraj:refactor-gender-extractor-ontology

@prashantkumarniraj prashantkumarniraj commented Feb 11, 2026

Refactors GenderExtractor to use context.ontology for property and class lookups
instead of hardcoded URI strings.

This removes magic string constants and aligns the extractor with the
centralized ontology management system.
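
A minimal sketch of the idea, not the extractor's actual code: the `Ontology`, `OntologyProperty`, and `OntologyClass` types below are simplified, hypothetical stand-ins for the framework's classes, and the in-memory maps exist only for illustration. The lookup keys (`foaf:gender`, `rdf:type`, `Person`) are the ones named in this PR.

```scala
// Hypothetical stand-ins for the framework's ontology types; the real
// DBpedia classes carry more structure than these case classes.
final case class OntologyProperty(name: String, uri: String)
final case class OntologyClass(name: String, uri: String)
final case class Ontology(
  properties: Map[String, OntologyProperty],
  classes: Map[String, OntologyClass]
)

object OntologyLookupSketch {
  // A tiny in-memory ontology used only for this illustration.
  val ontology: Ontology = Ontology(
    properties = Map(
      "foaf:gender" -> OntologyProperty("foaf:gender", "http://xmlns.com/foaf/0.1/gender"),
      "rdf:type" -> OntologyProperty("rdf:type", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
    ),
    classes = Map(
      "Person" -> OntologyClass("Person", "http://dbpedia.org/ontology/Person")
    )
  )

  // Before: the URI is repeated as a string literal inside the extractor.
  val hardcodedGenderUri: String = "http://xmlns.com/foaf/0.1/gender"

  // After: resolve each term by name once. Map.apply throws
  // NoSuchElementException for a missing key, so a misspelled name
  // fails when the extractor is constructed rather than mid-extraction.
  val genderProperty: OntologyProperty = ontology.properties("foaf:gender")
  val typeProperty: OntologyProperty = ontology.properties("rdf:type")
  val personClass: OntologyClass = ontology.classes("Person")
}
```

With this shape, the URI strings live in one place (the ontology) and the extractor only refers to term names.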

Fixes #825

Summary by CodeRabbit

  • Bug Fixes

    • Fixed potential division by zero error in gender extraction logic.
    • Improved data parsing robustness with enhanced validation and error handling for malformed input.
    • Enhanced error resilience in mapping configuration resolution with fallback handling when configurations are unavailable.
  • Improvements

    • Increased logging and observability for data ingestion issues and configuration problems.

coderabbitai bot commented Feb 11, 2026

Walkthrough

The changes refactor three independent components to improve robustness and maintainability: GenderExtractor transitions from hardcoded URIs to ontology-driven lookups with division-by-zero safeguards, WikiInfo enhances CSV parsing with defensive validation and logging, and ExtractionManager switches to Option-based namespace resolution with better error handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Ontology-driven Gender Extraction**<br>`core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala` | Replaces hardcoded RDF/OWL URIs (foaf:gender, rdf:type, Person class) with dynamic lookups via context.ontology properties and classes. Adds division-by-zero guard in secondCount comparison and ensures Seq.empty return type consistency. |
| **Defensive CSV Parsing**<br>`core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala` | Rewrites fromLines and fromLine methods to validate field counts, safely parse numeric fields with NumberFormatException handling, validate language codes, skip malformed lines with logging, and return None for invalid entries instead of throwing exceptions. |
| **Namespace Resolution Safety**<br>`server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala` | Replaces pattern matching with Option-based approach for mapping namespace determination. Adds guard clause when namespace not found, refactors file-loading logic to use resolved namespace, and logs warnings for missing local files instead of relying on implicit fallback behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Out of Scope Changes check | ⚠️ Warning | The PR includes out-of-scope changes to WikiInfo.scala and ExtractionManager.scala that are unrelated to the GenderExtractor refactoring objective in issue #825. | Remove changes to WikiInfo.scala and ExtractionManager.scala, or create separate PRs for those refactoring improvements to keep this PR focused on GenderExtractor. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: refactoring GenderExtractor to use context.ontology instead of hardcoded URIs, which matches the primary objective of the PR. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all three requirements from issue #825: replaces hardcoded genderProperty with context.ontology.properties('foaf:gender'), replaces hardcoded typeProperty with context.ontology.properties('rdf:type'), and replaces hardcoded personUri with context.ontology.classes('Person'). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 86-91: The new guard `secondCount > 0` in the extraction condition
changes semantics and prevents single‑gender pages from being extracted; either
remove that guard and instead explicitly handle a zero `secondCount` (e.g.,
treat `secondCount == 0` as an automatic pass when `maxCount >
GenderExtractorConfig.minCount`), or keep the guard but add a clear comment
explaining the intentional behavior change; update the condition around
`maxGender`, `maxCount`, `secondCount`, `GenderExtractorConfig.minCount` and
`GenderExtractorConfig.minDifference` so that zero `secondCount` is handled
deterministically (avoid silent skipping) and document the chosen rationale
adjacent to the `secondCount` check.
🧹 Nitpick comments (3)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (1)

31-31: Nit: remove emoji from code comment.

The emoji in a source file comment (Line 31) is unusual for this codebase and may cause encoding issues in some editors/tools. A plain comment would be more consistent.

core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (2)

85-94: Silently defaulting pages to 0 on parse failure may mask data issues.

When the page count is malformed, defaulting to 0 means this wiki will appear to have zero pages, which could affect downstream logic (e.g., Language construction uses pages from WikiInfo). A None return (skip the line entirely) might be safer, similar to the invalid language code handling.


77-83: Extract the hardcoded field count as a named constant with documentation.

The minimum field count of 15 lacks justification — the code only accesses fields at indices 2 and 4 (minimum requirement of 5). The wikistats CSV schema is not formally documented and may change per the wmflabs service. Extract 15 to a top-level constant with a comment explaining which specific CSV format and field layout it applies to.
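
The two WikiInfo nitpicks above (skip a line rather than default `pages` to 0, and name the field-count requirement) could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `WikiRow` and `WikiRowParser` are hypothetical names, which of indices 2 and 4 holds the language code versus the page count is assumed, and the naive comma split ignores quoted fields.

```scala
import scala.util.Try

// Hypothetical, simplified record; the real WikiInfo has more fields.
final case class WikiRow(languageCode: String, pages: Int)

object WikiRowParser {
  // Assumed layout: the review only says indices 2 and 4 are read,
  // not which one holds which value.
  val LanguageCodeIndex: Int = 2
  val PagesIndex: Int = 4

  // Smallest column count that still lets us read both fields above.
  val MinFieldCount: Int = PagesIndex + 1

  // Returns None for any malformed line instead of substituting defaults,
  // so a bad page count cannot masquerade as "zero pages".
  def fromLine(line: String): Option[WikiRow] = {
    // Limit -1 keeps trailing empty fields; quoted commas are not handled.
    val fields = line.split(",", -1).map(_.trim)
    if (fields.length < MinFieldCount) None
    else {
      val code = fields(LanguageCodeIndex)
      if (code.isEmpty) None
      else Try(fields(PagesIndex).toInt).toOption.map(pages => WikiRow(code, pages))
    }
  }
}
```

A caller can then use `lines.flatMap(WikiRowParser.fromLine)` to drop malformed rows, logging at the call site if the skipped lines should be visible.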

Comment on lines +86 to +91
if (
maxGender != "" &&
maxCount > GenderExtractorConfig.minCount &&
secondCount > 0 &&
maxCount / secondCount > GenderExtractorConfig.minDifference
)

⚠️ Potential issue | 🟠 Major

Behavior change: single-gender pages will no longer be extracted.

The added secondCount > 0 guard (Line 89) changes semantics. Previously, when only one gender's pronouns were found (secondCount == 0.0), the division maxCount / 0.0 yielded Infinity, which exceeded minDifference, so the gender was reported. Now those cases are silently skipped.

This could suppress valid extractions for pages where pronouns of only one gender appear. If intentional, please add a comment explaining the rationale. If not, consider an alternative like:

      if (
        maxGender != "" &&
        maxCount > GenderExtractorConfig.minCount &&
-       secondCount > 0 &&
-       maxCount / secondCount > GenderExtractorConfig.minDifference
+       (secondCount == 0 || maxCount / secondCount > GenderExtractorConfig.minDifference)
      )
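
To make the difference between the three conditions concrete, here is a small self-contained sketch. It assumes the counts are `Double` values, as the `0.0` and `Infinity` in the comment above imply; the threshold values and the `GenderGuardSketch` object are placeholders, not the values in `GenderExtractorConfig`.

```scala
object GenderGuardSketch {
  // Placeholder thresholds; the real ones live in GenderExtractorConfig.
  val minCount: Double = 3.0
  val minDifference: Double = 2.0

  // Original condition: Double division by zero yields Infinity rather
  // than throwing, so a zero secondCount passes the ratio check.
  def original(maxGender: String, maxCount: Double, secondCount: Double): Boolean =
    maxGender != "" &&
      maxCount > minCount &&
      maxCount / secondCount > minDifference

  // Condition as changed in this PR: a zero secondCount now fails.
  def withGuard(maxGender: String, maxCount: Double, secondCount: Double): Boolean =
    maxGender != "" &&
      maxCount > minCount &&
      secondCount > 0 &&
      maxCount / secondCount > minDifference

  // Suggested condition: a zero secondCount is an explicit pass.
  def suggested(maxGender: String, maxCount: Double, secondCount: Double): Boolean =
    maxGender != "" &&
      maxCount > minCount &&
      (secondCount == 0 || maxCount / secondCount > minDifference)
}
```

For a page with ten pronouns of one gender and none of the other, `original("male", 10.0, 0.0)` and `suggested("male", 10.0, 0.0)` both hold, while `withGuard("male", 10.0, 0.0)` does not; the three agree whenever `secondCount` is positive.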
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    - if (
    - maxGender != "" &&
    - maxCount > GenderExtractorConfig.minCount &&
    - secondCount > 0 &&
    - maxCount / secondCount > GenderExtractorConfig.minDifference
    - )
    + if (
    + maxGender != "" &&
    + maxCount > GenderExtractorConfig.minCount &&
    + (secondCount == 0 || maxCount / secondCount > GenderExtractorConfig.minDifference)
    + )