Refactor GenderExtractor to use context.ontology instead of hardcoded URIs by prashantkumarniraj · Pull Request #834 · dbpedia/extraction-framework

prashantkumarniraj · 2026-02-11T10:36:14Z

Refactors GenderExtractor to use context.ontology for property and class lookups
instead of hardcoded URI strings.

This removes magic string constants and aligns the extractor with the
centralized ontology management system.

Fixes #825
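
For illustration only, the change described above can be sketched as follows. This is not the actual diff: `SketchOntology`, `SketchProperty`, `SketchClass`, and `GenderLookups` are hypothetical stand-ins for the framework's ontology types, and only the three lookup keys (`foaf:gender`, `rdf:type`, `Person`) are taken from the PR checks.

```scala
object OntologyLookupSketch {
  // Hypothetical stand-ins for the framework's ontology types (assumptions, not the real API).
  final case class SketchProperty(name: String, uri: String)
  final case class SketchClass(name: String, uri: String)
  final case class SketchOntology(
    properties: Map[String, SketchProperty],
    classes: Map[String, SketchClass]
  )

  // Before: the extractor carried its own URI string constants.
  val hardcodedGenderUri = "http://xmlns.com/foaf/0.1/gender"

  // After: names are resolved through the shared ontology.
  // Map.apply throws NoSuchElementException when a key is missing, so a
  // misconfigured ontology fails when the lookups are constructed.
  final class GenderLookups(ontology: SketchOntology) {
    val genderProperty: SketchProperty = ontology.properties("foaf:gender")
    val typeProperty: SketchProperty = ontology.properties("rdf:type")
    val personClass: SketchClass = ontology.classes("Person")
  }

  // Sample data for the sketch; a real ontology would be loaded by the framework.
  val sampleOntology: SketchOntology = SketchOntology(
    properties = Map(
      "foaf:gender" -> SketchProperty("foaf:gender", "http://xmlns.com/foaf/0.1/gender"),
      "rdf:type" -> SketchProperty("rdf:type", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
    ),
    classes = Map("Person" -> SketchClass("Person", "http://dbpedia.org/ontology/Person"))
  )
}
```

The design point is that the URI strings live in one place (the ontology) and the extractor refers to them by name.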

Summary by CodeRabbit

Bug Fixes
- Fixed potential division by zero error in gender extraction logic.
- Improved data parsing robustness with enhanced validation and error handling for malformed input.
- Enhanced error resilience in mapping configuration resolution with fallback handling when configurations are unavailable.

Improvements
- Increased logging and observability for data ingestion issues and configuration problems.

coderabbitai · 2026-02-11T10:36:45Z

Walkthrough

The changes refactor three independent components to improve robustness and maintainability: GenderExtractor transitions from hardcoded URIs to ontology-driven lookups with division-by-zero safeguards, WikiInfo enhances CSV parsing with defensive validation and logging, and ExtractionManager switches to Option-based namespace resolution with better error handling.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Ontology-driven Gender Extraction `core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala` | Replaces hardcoded RDF/OWL URIs (foaf:gender, rdf:type, Person class) with dynamic lookups via context.ontology properties and classes. Adds division-by-zero guard in secondCount comparison and ensures Seq.empty return type consistency. |
| Defensive CSV Parsing `core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala` | Rewrites fromLines and fromLine methods to validate field counts, safely parse numeric fields with NumberFormatException handling, validate language codes, skip malformed lines with logging, and return None for invalid entries instead of throwing exceptions. |
| Namespace Resolution Safety `server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala` | Replaces pattern matching with Option-based approach for mapping namespace determination. Adds guard clause when namespace not found, refactors file-loading logic to use resolved namespace, and logs warnings for missing local files instead of relying on implicit fallback behavior. |
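
The Option-based namespace resolution summarized for ExtractionManager can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in the PR: the `mappingNamespaces` table, its values, and the `resolveNamespace` and `mappingPageTitles` helpers are all hypothetical names.

```scala
object NamespaceResolutionSketch {
  // Hypothetical language-code -> mapping-namespace table (assumed values for illustration).
  val mappingNamespaces: Map[String, String] = Map(
    "en" -> "Mapping_en",
    "de" -> "Mapping_de"
  )

  // Returns None, after emitting a warning, when the language has no mapping
  // namespace, so callers can skip that language instead of failing.
  def resolveNamespace(langCode: String, warn: String => Unit): Option[String] = {
    val resolved = mappingNamespaces.get(langCode)
    if (resolved.isEmpty) warn(s"No mapping namespace for language '$langCode'; skipping")
    resolved
  }

  // Builds page titles only for languages whose namespace resolved.
  def mappingPageTitles(langCodes: Seq[String], warn: String => Unit): Seq[String] =
    langCodes.flatMap(code => resolveNamespace(code, warn).toList.map(ns => s"$ns:Person"))
}
```

Passing the warning sink as a function keeps the sketch free of any particular logging library.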

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes check | ⚠️ Warning | The PR includes out-of-scope changes to WikiInfo.scala and ExtractionManager.scala that are unrelated to the GenderExtractor refactoring objective in issue `#825`. | Remove changes to WikiInfo.scala and ExtractionManager.scala, or create separate PRs for those refactoring improvements to keep this PR focused on GenderExtractor. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: refactoring GenderExtractor to use context.ontology instead of hardcoded URIs, which matches the primary objective of the PR. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all three requirements from issue `#825`: replaces hardcoded genderProperty with context.ontology.properties('foaf:gender'), replaces hardcoded typeProperty with context.ontology.properties('rdf:type'), and replaces hardcoded personUri with context.ontology.classes('Person'). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

sonarqubecloud · 2026-02-11T10:37:23Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 86-91: The new guard `secondCount > 0` in the extraction condition
changes semantics and prevents single‑gender pages from being extracted; either
remove that guard and instead explicitly handle a zero `secondCount` (e.g.,
treat `secondCount == 0` as an automatic pass when `maxCount >
GenderExtractorConfig.minCount`), or keep the guard but add a clear comment
explaining the intentional behavior change; update the condition around
`maxGender`, `maxCount`, `secondCount`, `GenderExtractorConfig.minCount` and
`GenderExtractorConfig.minDifference` so that zero `secondCount` is handled
deterministically (avoid silent skipping) and document the chosen rationale
adjacent to the `secondCount` check.

🧹 Nitpick comments (3)

core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (1)

31-31: Nit: remove emoji from code comment.

The ✅ emoji in a source file comment (Line 31) is unusual for this codebase and may cause encoding issues in some editors/tools. A plain comment would be more consistent.

core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (2)

85-94: Silently defaulting pages to 0 on parse failure may mask data issues.

When the page count is malformed, defaulting to 0 means this wiki will appear to have zero pages, which could affect downstream logic (e.g., Language construction uses pages from WikiInfo). A None return (skip the line entirely) might be safer, similar to the invalid language code handling.

77-83: Extract the hardcoded field count as a named constant with documentation.

The minimum field count of 15 lacks justification — the code only accesses fields at indices 2 and 4 (minimum requirement of 5). The wikistats CSV schema is not formally documented and may change per the wmflabs service. Extract 15 to a top-level constant with a comment explaining which specific CSV format and field layout it applies to.
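
A minimal sketch of what the two WikiInfo nitpicks point toward: a named field-count constant derived from the indices actually read, and `None` for any malformed line, including an unparsable page count. The review only says that indices 2 and 4 are accessed; treating index 2 as the language code and index 4 as the page count is an assumption, and `WikiEntry`, `parseLine`, and `parseLines` are hypothetical names.

```scala
import scala.util.Try

object WikiLineSketch {
  // Assumed layout: index 2 = language code, index 4 = page count.
  // The review confirms only that these two indices are read.
  val LanguageCodeIndex = 2
  val PageCountIndex = 4
  // Smallest field count that makes both indices valid (5 for the indices above).
  val MinFieldCount: Int = math.max(LanguageCodeIndex, PageCountIndex) + 1

  final case class WikiEntry(languageCode: String, pages: Int)

  // Returns None for short lines, blank language codes, or non-numeric page counts,
  // rather than throwing or defaulting the page count to 0.
  def parseLine(line: String): Option[WikiEntry] = {
    val fields = line.split(",", -1) // limit -1 keeps trailing empty fields
    if (fields.length < MinFieldCount) None
    else
      for {
        code <- Some(fields(LanguageCodeIndex).trim).filter(_.nonEmpty)
        pages <- Try(fields(PageCountIndex).trim.toInt).toOption
      } yield WikiEntry(code, pages)
  }

  // Malformed lines are dropped; a real implementation would also log them.
  def parseLines(lines: Iterator[String]): List[WikiEntry] =
    lines.flatMap(line => parseLine(line).toList).toList
}
```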

coderabbitai · 2026-02-11T10:41:17Z

core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

+      if (
+        maxGender != "" &&
+        maxCount > GenderExtractorConfig.minCount &&
+        secondCount > 0 &&
+        maxCount / secondCount > GenderExtractorConfig.minDifference
+      )


⚠️ Potential issue | 🟠 Major

Behavior change: single-gender pages will no longer be extracted.

The added secondCount > 0 guard (Line 89) changes semantics. Previously, when only one gender's pronouns were found (secondCount == 0.0), the division maxCount / 0.0 yielded Infinity, which exceeded minDifference, so the gender was reported. Now those cases are silently skipped.

This could suppress valid extractions for pages where pronouns of only one gender appear. If intentional, please add a comment explaining the rationale. If not, consider an alternative like:

      if (
        maxGender != "" &&
        maxCount > GenderExtractorConfig.minCount &&
-       secondCount > 0 &&
-       maxCount / secondCount > GenderExtractorConfig.minDifference
+       (secondCount == 0 || maxCount / secondCount > GenderExtractorConfig.minDifference)
      )
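
To make the behavior change concrete, here is a small sketch comparing the three variants of the condition. The threshold values are hypothetical stand-ins for `GenderExtractorConfig`, the `maxGender != ""` check is omitted for brevity, and the counts are assumed to be `Double`, as the comment's `secondCount == 0.0` implies.

```scala
object GenderGuardSketch {
  // Hypothetical thresholds standing in for GenderExtractorConfig (assumed values).
  val minCount = 2.0
  val minDifference = 2.0

  // Original condition: with Double operands, a positive value divided by 0.0 is
  // Infinity rather than an exception, so single-gender pages pass the ratio test.
  def original(maxCount: Double, secondCount: Double): Boolean =
    maxCount > minCount && maxCount / secondCount > minDifference

  // Condition as changed in the PR: a zero secondCount now fails the check.
  def withGuard(maxCount: Double, secondCount: Double): Boolean =
    maxCount > minCount && secondCount > 0 && maxCount / secondCount > minDifference

  // Suggested alternative: a zero secondCount is an explicit pass.
  def withExplicitZero(maxCount: Double, secondCount: Double): Boolean =
    maxCount > minCount && (secondCount == 0 || maxCount / secondCount > minDifference)
}
```

For a page with five pronouns of one gender and none of the other, `original(5.0, 0.0)` and `withExplicitZero(5.0, 0.0)` both hold, while `withGuard(5.0, 0.0)` does not.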



prashantkumarniraj added 3 commits February 5, 2026 16:10

Skip languages without mapping namespaces instead of failing

63ce932

Skip malformed wikipedias.csv lines instead of failing

df6510e

Refactor GenderExtractor to use context.ontology instead of hardcoded URIs

f068c50

prashantkumarniraj wants to merge 3 commits into dbpedia:master from prashantkumarniraj:refactor-gender-extractor-ontology