refactor: Replace hardcoded URIs with ontology lookups and remove whitespaces in GenderExtractor #824

Open
vaibhav45sktech wants to merge 9 commits into dbpedia:master from vaibhav45sktech:fix-gender-extractor

Conversation

@vaibhav45sktech
Contributor

@vaibhav45sktech vaibhav45sktech commented Jan 24, 2026

Replaces hardcoded URI strings with context.ontology lookups and improves code quality.

Changes:

  • Use context.ontology.properties() and context.ontology.classes() instead of raw URIs
  • Fix pronoun regex: word boundaries + case-insensitive + proper escaping
  • Pre-instantiate langStringDatatype at class level
  • Handle division-by-zero in gender ratio calculation
  • Clean up whitespace and formatting
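
As an illustration of the regex change listed above, here is a minimal sketch of whole-word, case-insensitive pronoun counting. The `PronounCountSketch` object, its `pronouns` table, and both helper names are hypothetical stand-ins, not code from this PR; the real table comes from `GenderExtractorConfig`.

```scala
import scala.util.matching.Regex

object PronounCountSketch {
  // Hypothetical pronoun -> gender table (assumption, not the project's config).
  val pronouns: Map[String, String] =
    Map("he" -> "male", "his" -> "male", "she" -> "female", "her" -> "female")

  /** Case-insensitive, whole-word pattern; Regex.quote escapes any metacharacters. */
  def pronounRegex(pronoun: String): Regex =
    new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")

  /** Sums whole-word pronoun matches per gender. */
  def countByGender(text: String): Map[String, Int] =
    pronouns.toSeq
      .map { case (pronoun, gender) => gender -> pronounRegex(pronoun).findAllIn(text).size }
      .groupBy(_._1)
      .map { case (gender, hits) => gender -> hits.map(_._2).sum }
}
```

With the word boundaries in place, "he" should no longer match inside words such as "the" or "theory", and "her" should not match "hers".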

Resolves issue #825

Summary by CodeRabbit

  • Improvements

    • More accurate gender extraction using language-aware pronoun counting and stricter person-entity validation
    • Emits a single language-tagged string result when confidence thresholds are met; suppresses output otherwise
    • Applies minimum-count and dominance-difference thresholds to reduce false positives
  • Refactor

    • Consolidated processing to compute pronoun counts and decide output before emission
    • Uses ontology-driven properties and class checks for consistent validation

@coderabbitai
coderabbitai bot commented Jan 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Verifies the subject is a dbo:Person via context.ontology, counts language-specific pronouns in the page text, determines a dominant gender using minCount and minDifference thresholds, and emits a single foaf:gender quad with rdf:langString when thresholds are satisfied.
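
The threshold step in the walkthrough can be sketched as follows. This is a rough illustration under stated assumptions: `minCount` is treated as a floor on the top count and `minDifference` as a floor on the ratio between the top and second counts (the PR description mentions a gender ratio, but the exact comparison is not shown here). `DominantGenderSketch` and `dominantGender` are hypothetical names.

```scala
object DominantGenderSketch {
  /** Returns the dominant gender, or None when the evidence is too weak.
    * Assumed semantics: top count must exceed minCount, and top/second must exceed minDifference.
    */
  def dominantGender(counts: Map[String, Int], minCount: Int, minDifference: Double): Option[String] = {
    // Rank genders by descending count.
    val ranked = counts.toSeq.sortBy { case (_, n) => -n }
    ranked.headOption.flatMap { case (gender, top) =>
      val second = ranked.drop(1).headOption.map(_._2).getOrElse(0)
      // With no competing pronouns the ratio is undefined, so the division is skipped entirely.
      val dominant = second == 0 || top.toDouble / second > minDifference
      if (top > minCount && dominant) Some(gender) else None
    }
  }
}
```

Returning an `Option` keeps the "no output" branch explicit: the caller emits a quad only for `Some(gender)`.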

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gender Extractor Logic: core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala | Replaced heuristic with pronoun-based extraction; added ontology-based rdf:type (Person) check; load language-aware pronoun map; read full page text and count pronouns with case-insensitive word-boundary matching; aggregate per-gender counts, compute dominant gender and threshold checks (minCount, minDifference); use context.ontology for URIs; emit single foaf:gender Quad with rdf:langString only when thresholds met; removed prior multi-pass/early-return flow. |

Sequence Diagram(s)

sequenceDiagram
    participant WikiPage
    participant GenderExtractor
    participant OntologyContext as Context/Ontology
    participant QuadSink as QuadWriter

    WikiPage->>GenderExtractor: provide Page (text, meta)
    GenderExtractor->>OntologyContext: query rdf:type for subject
    OntologyContext-->>GenderExtractor: rdf:type(s) (e.g., dbo:Person)
    alt is Person
        GenderExtractor->>GenderExtractor: load pronoun map for language
        GenderExtractor->>GenderExtractor: count pronouns (word-boundary, case-insensitive)
        GenderExtractor->>GenderExtractor: aggregate counts, compute max/second, apply minCount/minDifference
        alt thresholds met
            GenderExtractor->>QuadSink: emit foaf:gender Quad (rdf:langString)
        else thresholds not met
            GenderExtractor-->>QuadSink: no output
        end
    else not Person
        GenderExtractor-->>QuadSink: no output
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the primary refactoring focus: replacing hardcoded URIs with ontology lookups and removing whitespace in GenderExtractor, which aligns with the main changes described. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (2)

26-45: Inconsistent indentation throughout the class body.

Continuation lines for val declarations (e.g., Lines 28, 31, 34, 36, 38, 40) sit at the same indent level as their private val keywords, making them visually ambiguous—they look like standalone statements rather than value expressions. Additionally, Lines 43–45 inside extract are indented at the class level (2 spaces) instead of the method-body level (4 spaces), while Lines 47+ correctly use 4-space indentation.

🔧 Suggested indentation fix (representative excerpt)
   private val language: String =
-  context.language.wikiCode
+    context.language.wikiCode

   private val pronounMap: Map[String, String] =
-  GenderExtractorConfig.pronounsMap(language)
+    GenderExtractorConfig.pronounsMap(language)

   private val genderProperty =
-  context.ontology.properties("foaf:gender")
+    context.ontology.properties("foaf:gender")
   private val typeProperty =
-  context.ontology.properties("rdf:type")
+    context.ontology.properties("rdf:type")
   private val personClass =
-  context.ontology.classes("Person")
+    context.ontology.classes("Person")
   private val langStringDatatype =
-  new Datatype("rdf:langString")
+    new Datatype("rdf:langString")

   override def extract(node: PageNode, subjectUri: String): Seq[Quad] = {
-  /** First pass: extract mappings to detect rdf:type */
-  val mappingGraph: Seq[Quad] =
-  super.extract(node, subjectUri)
+    /** First pass: extract mappings to detect rdf:type */
+    val mappingGraph: Seq[Quad] =
+      super.extract(node, subjectUri)

60-73: Pre-compile pronoun regexes at the class level.

new Regex(...) is created for every pronoun on every extract() call. Since pronounMap is immutable and fixed at construction time, compile the regexes once to avoid repeated work per page.

♻️ Suggested refactor

Add a class-level val after pronounMap:

+  /** Pre-compiled pronoun regexes: (regex, gender) */
+  private val pronounRegexes: Seq[(Regex, String)] =
+    pronounMap.map { case (pronoun, gender) =>
+      (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender)
+    }.toSeq

Then simplify the counting loop inside extract:

     var genderCounts: Map[String, Int] =
       Map.empty.withDefaultValue(0)

-    for ((pronoun, gender) <- pronounMap) {
-      val regex =
-        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-      val count =
-        regex.findAllIn(wikiText).size
-
+    for ((regex, gender) <- pronounRegexes) {
+      val count = regex.findAllIn(wikiText).size
       genderCounts =
         genderCounts.updated(gender, genderCounts(gender) + count)
     }


@vaibhav45sktech
Contributor Author

Greetings @jimkont, kindly review my PR whenever you are available.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 33-34: The code calls GenderExtractorConfig.pronounsMap(language)
directly which throws NoSuchElementException for unsupported languages; update
GenderExtractor to validate the language before accessing the map by either
adding a require(GenderExtractorConfig.supportedLanguages.contains(language),
"unsupported language: "+language) at the start of the GenderExtractor
constructor/object or use a safe lookup like
GenderExtractorConfig.pronounsMap.getOrElse(language, Map.empty) when assigning
pronounMap so unsupported languages are handled gracefully; reference symbols:
pronounMap, GenderExtractorConfig.pronounsMap,
GenderExtractorConfig.supportedLanguages, GenderExtractor.
- Around line 36-43: Replace unsafe Map.apply lookups for genderProperty,
typeProperty and personClass with the safe Ontology getters: use
context.ontology.getOntologyProperty("foaf:gender"),
getOntologyProperty("rdf:type") and context.ontology.getOntologyClass("Person")
and handle the Option by calling .getOrElse(throw new
IllegalStateException(...)) with clear messages; update the private vals
genderProperty, typeProperty and personClass to use these safe getters so
missing ontology entries produce a controlled IllegalStateException instead of
NoSuchElementException.
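
The two lookup styles suggested above can be sketched with plain Scala maps. Both tables below are hypothetical stand-ins for `GenderExtractorConfig.pronounsMap` and the ontology's property table; the sketch does not call the DBpedia `Ontology` API, and `SafeLookupSketch`, `pronounsFor`, and `requireProperty` are illustrative names only.

```scala
object SafeLookupSketch {
  // Stand-in for GenderExtractorConfig.pronounsMap: language code -> (pronoun -> gender).
  val pronounsMap: Map[String, Map[String, String]] =
    Map("en" -> Map("he" -> "male", "she" -> "female"))

  // Stand-in for an ontology property table: prefixed name -> URI.
  val properties: Map[String, String] =
    Map("foaf:gender" -> "http://xmlns.com/foaf/0.1/gender")

  /** Unsupported languages yield an empty pronoun table instead of NoSuchElementException. */
  def pronounsFor(language: String): Map[String, String] =
    pronounsMap.getOrElse(language, Map.empty)

  /** Missing entries fail with a descriptive IllegalStateException. */
  def requireProperty(name: String): String =
    properties.getOrElse(name,
      throw new IllegalStateException("Ontology property not found: " + name))
}
```

The first helper degrades gracefully, so an unconfigured language simply produces no output; the second fails fast, which may suit entries the extractor cannot work without.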
🧹 Nitpick comments (4)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (4)

20-27: Residual blank lines contradict the "clean up whitespace" objective.

Lines 22–25 contain four consecutive blank lines between the constructor's closing paren and extends MappingExtractor(context). These appear to be leftover formatting artifacts.

♻️ Suggested cleanup
     def redirects : Redirects
   } 
-     
-   
-  
-
- ) 
-extends MappingExtractor(context) 
+) extends MappingExtractor(context)
 {

45-48: Inconsistent indentation inside extract body.

Lines 46–48 (/** First pass … */, val mappingGraph, super.extract(…)) sit at 2-space indent — the same level as class members — while the rest of the method body (lines 50+) is indented at 4+ spaces. This applies to the class-level vals (lines 29–44) as well, where the assigned values appear at the class-body margin rather than indented from the =.

Please apply a consistent indentation style throughout the file.


64-76: Pre-compile pronoun regexes at the class level instead of inside extract.

new Regex(…) is called for every pronoun on every page. Since pronounMap is fixed at construction time, these regexes can be compiled once and reused, avoiding redundant compilation across potentially millions of pages.

♻️ Suggested refactor

Add a class-level field:

private val pronounRegexes: Seq[(Regex, String)] =
  pronounMap.toSeq.map { case (pronoun, gender) =>
    (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender)
  }

Then simplify the loop in extract:

-    var genderCounts: Map[String, Int] =
-      Map.empty.withDefaultValue(0)
-
-    for ((pronoun, gender) <- pronounMap) {
-      val regex =
-        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-      val count =
-        regex.findAllIn(wikiText).size
-
-      genderCounts =
-        genderCounts.updated(gender, genderCounts(gender) + count)
-    }
+    val genderCounts: Map[String, Int] =
+      pronounRegexes.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
+        case (acc, (regex, gender)) =>
+          val count = regex.findAllIn(wikiText).size
+          acc.updated(gender, acc(gender) + count)
+      }

78-78: genderCounts.isEmpty is only true when pronounMap is empty — consider clarifying intent.

Because the loop on lines 67–76 always inserts an entry (even when count is 0), genderCounts will only be empty when pronounMap itself is empty. The guard works as a safety net for unconfigured languages but reads as though it's checking "no pronouns found in the text." A brief inline comment clarifying the intent would help future readers.
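
The behaviour described in this comment can be reproduced in isolation. The sketch below mirrors the shape of the counting loop under review, with a hypothetical `hits` function standing in for the regex match count; `GenderCountsEmptinessSketch` and `countAll` are illustrative names.

```scala
object GenderCountsEmptinessSketch {
  /** Mirrors the counting loop; `hits` stands in for the per-pronoun regex match count. */
  def countAll(pronounMap: Map[String, String], hits: String => Int): Map[String, Int] = {
    var genderCounts: Map[String, Int] = Map.empty[String, Int].withDefaultValue(0)
    for ((pronoun, gender) <- pronounMap) {
      // updated() inserts the key even when the count is 0.
      genderCounts = genderCounts.updated(gender, genderCounts(gender) + hits(pronoun))
    }
    genderCounts
  }
}
```

Reading a missing key through `withDefaultValue` does not add it to the map, but `updated` does, so a page with zero matches still yields one entry per configured gender and only an empty `pronounMap` leaves the result empty.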

@vaibhav45sktech
Contributor Author

Greetings @jimkont, @Integer-Ctrl, could you please review my PR?
