refactor: Replace hardcoded URIs with ontology lookups and remove whitespaces in GenderExtractor#824
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughVerifies the subject is a dbo:Person via Changes
Sequence Diagram(s)sequenceDiagram
participant WikiPage
participant GenderExtractor
participant OntologyContext as Context/Ontology
participant QuadSink as QuadWriter
WikiPage->>GenderExtractor: provide Page (text, meta)
GenderExtractor->>OntologyContext: query rdf:type for subject
OntologyContext-->>GenderExtractor: rdf:type(s) (e.g., dbo:Person)
alt is Person
GenderExtractor->>GenderExtractor: load pronoun map for language
GenderExtractor->>GenderExtractor: count pronouns (word-boundary, case-insensitive)
GenderExtractor->>GenderExtractor: aggregate counts, compute max/second, apply minCount/minDifference
alt thresholds met
GenderExtractor->>QuadSink: emit foaf:gender Quad (rdf:langString)
else thresholds not met
GenderExtractor-->>QuadSink: no output
end
else not Person
GenderExtractor-->>QuadSink: no output
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Greetings @jimkont ,Kindly review my pr whenever available |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 33-34: The code calls GenderExtractorConfig.pronounsMap(language)
directly which throws NoSuchElementException for unsupported languages; update
GenderExtractor to validate the language before accessing the map by either
adding a require(GenderExtractorConfig.supportedLanguages.contains(language),
"unsupported language: "+language) at the start of the GenderExtractor
constructor/object or use a safe lookup like
GenderExtractorConfig.pronounsMap.getOrElse(language, Map.empty) when assigning
pronounMap so unsupported languages are handled gracefully; reference symbols:
pronounMap, GenderExtractorConfig.pronounsMap,
GenderExtractorConfig.supportedLanguages, GenderExtractor.
- Around line 36-43: Replace unsafe Map.apply lookups for genderProperty,
typeProperty and personClass with the safe Ontology getters: use
context.ontology.getOntologyProperty("foaf:gender"),
getOntologyProperty("rdf:type") and context.ontology.getOntologyClass("Person")
and handle the Option by calling .getOrElse(throw new
IllegalStateException(...)) with clear messages; update the private vals
genderProperty, typeProperty and personClass to use these safe getters so
missing ontology entries produce a controlled IllegalStateException instead of
NoSuchElementException.
🧹 Nitpick comments (4)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (4)
20-27: Residual blank lines contradict the "clean up whitespace" objective.Lines 22–25 contain four consecutive blank lines between the constructor's closing paren and
extends MappingExtractor(context). These appear to be leftover formatting artifacts.♻️ Suggested cleanup
def redirects : Redirects } - - - - ) -extends MappingExtractor(context) +) extends MappingExtractor(context) {
45-48: Inconsistent indentation insideextractbody.Lines 46–48 (
/** First pass … */,val mappingGraph,super.extract(…)) sit at 2-space indent — the same level as class members — while the rest of the method body (lines 50+) is indented at 4+ spaces. This applies to the class-level vals (lines 29–44) as well, where the assigned values appear at the class-body margin rather than indented from the=.Please apply a consistent indentation style throughout the file.
64-76: Pre-compile pronoun regexes at the class level instead of insideextract.
new Regex(…)is called for every pronoun on every page. SincepronounMapis fixed at construction time, these regexes can be compiled once and reused, avoiding redundant compilation across potentially millions of pages.♻️ Suggested refactor
Add a class-level field:
private val pronounRegexes: Seq[(Regex, String)] = pronounMap.toSeq.map { case (pronoun, gender) => (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender) }Then simplify the loop in
extract:- var genderCounts: Map[String, Int] = - Map.empty.withDefaultValue(0) - - for ((pronoun, gender) <- pronounMap) { - val regex = - new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b") - - val count = - regex.findAllIn(wikiText).size - - genderCounts = - genderCounts.updated(gender, genderCounts(gender) + count) - } + val genderCounts: Map[String, Int] = + pronounRegexes.foldLeft(Map.empty[String, Int].withDefaultValue(0)) { + case (acc, (regex, gender)) => + val count = regex.findAllIn(wikiText).size + acc.updated(gender, acc(gender) + count) + }
78-78:genderCounts.isEmptyis only true whenpronounMapis empty — consider clarifying intent.Because the loop on lines 67–76 always inserts an entry (even when
countis 0),genderCountswill only be empty whenpronounMapitself is empty. The guard works as a safety net for unconfigured languages but reads as though it's checking "no pronouns found in the text." A brief inline comment clarifying the intent would help future readers.
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala
Show resolved
Hide resolved
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala
Show resolved
Hide resolved
|
|
Greetings @jimkont , @Integer-Ctrl Could you please review my pr . |
|
|
|



Replaces hardcoded URI strings with
context.ontologylookups and improves code quality.Changes:
context.ontology.properties()andcontext.ontology.classes()instead of raw URIslangStringDatatypeat class levelResolves issue #825
Summary by CodeRabbit
Improvements
Refactor