Gsoc25 hindi chapter by advenk · Pull Request #776 · dbpedia/extraction-framework

advenk · 2025-06-11T13:12:03Z

Changes required for Hindi Chapter.

Summary by CodeRabbit

New Features
- Added support for recognizing the typographic apostrophe character (’, U+2019) as a minute indicator in English and Hindi duration parsing.

- Minor fix to some of the Hindi mappings (XML)
- Added the new extractors to the properties file

…di_chapter

sonarqubecloud · 2025-06-11T13:13:25Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.4% Duplication on New Code

See analysis details on SonarQube Cloud

jimkont · 2025-06-11T15:18:12Z

mappings/Mapping_hi.xml

@@ -1,37 +1,39 @@
-<?xml version="1.0" encoding="UTF-8"?><mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"><page><title>Mapping hi:ज्ञानसन्दूक क्रिकेट टीम</title><ns>230</ns><id>11829</id><revision><id>52037</id><timestamp>2017-05-01T07:02:11Z</timestamp><text>{{TemplateMapping 
+<?xml version='1.0' encoding='UTF-8'?><mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">


Hi, quick question: where are these mapping pages sourced from? I do not see them all in https://mappings.dbpedia.org/index.php/Special:AllPages/Mapping_hi:

advenk · 2025-06-11

@jimkont Hi, these were added by @deba-iitbh as part of GSoC 2024 but never merged, so they are not present on the mappings wiki.

jimkont · 2025-06-12

The source of truth is the mappings wiki. When an extraction starts, the download-mappings task runs to fetch the latest version of the mappings, so the changes here will be overwritten. Ideally, the mappings defined here should be moved to the wiki, after which you can commit the updated output of the download-mappings task.
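One way to see which local mapping pages would need to be moved to the wiki is to list the page titles in the export file and compare them with Special:AllPages. The following is a minimal sketch, assuming the file follows the MediaWiki export-0.8 schema visible in the diff; the function name `mapping_titles` and the inline sample (including its second page title) are illustrative and not part of the repository.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <mediawiki> root element in mappings/Mapping_hi.xml.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.8/"}


def mapping_titles(xml_text: str) -> list[str]:
    """Return the <title> of every <page> in a MediaWiki XML export."""
    root = ET.fromstring(xml_text)
    return [
        page.findtext("mw:title", default="", namespaces=NS)
        for page in root.findall("mw:page", NS)
    ]


# Tiny inline sample (hypothetical); in practice, read mappings/Mapping_hi.xml.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Mapping hi:ज्ञानसन्दूक क्रिकेट टीम</title><ns>230</ns></page>
  <page><title>Mapping hi:Example</title><ns>230</ns></page>
</mediawiki>"""

for title in mapping_titles(SAMPLE):
    print(title)
```

Any title printed here that does not appear on the wiki would be lost the next time the download-mappings task overwrites the local file.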

sonarqubecloud · 2026-03-09T21:15:21Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
100.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai · 2026-03-09T21:15:23Z

📝 Walkthrough

A new mapping entry is added to the duration parser configuration, enabling recognition of a typographic apostrophe variant (\’) as a minute unit indicator in the English and Hindi language mappings used for token normalization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Duration Parser Configuration `core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala` | Added mapping for minute unit symbol (\’) to both English and Hindi duration unit maps for improved token recognition during parsing. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

[GSoC 2024] DBpedia Hindi chapter #762: Modifies the same DurationParserConfig mappings including Hindi duration entries.

Suggested reviewers

jimkont

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'Gsoc25 hindi chapter' is vague and generic, using non-descriptive terms that don't convey the specific nature of the changes. While it references Hindi and GSOC25, it fails to communicate what actual changes are being made (e.g., adding duration parser support, configuring extractors, etc.). | Provide a more specific title that describes the actual changes, such as 'Add Hindi language support for duration parser' or 'Configure Hindi extractors and language mappings'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala`:
- Line 388: The alias entry "\\’" in DurationParserConfig.scala uses a backslash
plus U+2019 which gets stripped by the cleanup in DurationParser.scala
(replaceAll("""[^\'\"a-zA-Z]""", "")), so lookups never match; fix by replacing
the problematic key with the ASCII apostrophe variant used on the next line (use
"'" as the map key) or alternatively update the cleanup regex in
DurationParser.scala (the replaceAll call around lines 129-137) to allow U+2019
so the original key survives normalization — modify either
DurationParserConfig.scala (change "\\’" -> "minute" to use the ASCII
apostrophe) or DurationParser.scala (expand the character class to include
\u2019) so map lookups succeed.
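The failure described in this comment can be reproduced outside the framework. The sketch below is written in Python and assumes the cleanup pattern quoted above behaves the same under Python's `re` module as under Java's regex engine for these characters; `UNITS` and `lookup` are hypothetical stand-ins, not the real configuration or parser code.

```python
import re

# Cleanup pattern as quoted in the review comment (DurationParser.scala).
CLEANUP = re.compile(r"[^'\"a-zA-Z]")

# Hypothetical stand-in for a few entries of the Hindi unit map.
UNITS = {
    "\\\u2019": "minute",  # backslash + U+2019, the key added in this PR
    "'": "minute",         # ASCII apostrophe
    "m": "minute",
}


def lookup(raw_unit):
    """Strip disallowed characters, then look the unit up (None if absent)."""
    return UNITS.get(CLEANUP.sub("", raw_unit))


for raw in ["'", "m", "\u2019", "\\\u2019"]:
    cleaned = CLEANUP.sub("", raw)
    print(f"{raw!r:>10} -> cleaned {cleaned!r:>5} -> {lookup(raw)}")
```

Under these assumptions, both inputs containing U+2019 are reduced to an empty string before the lookup, so the new key is never reached.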

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6839efc6-b7b5-494b-91dd-33ad4be7c6fe

📥 Commits

Reviewing files that changed from the base of the PR and between e3dfe8b and 9903ce4.

📒 Files selected for processing (1)

core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala

coderabbitai · 2026-03-09T21:20:19Z

core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala

            "s" -> "second",
            "\"" -> "second",
            "मिनट" -> "minute",
+            "\\’" -> "minute",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect regex construction:"
sed -n '19,28p' core/src/main/scala/org/dbpedia/extraction/dataparser/DurationParser.scala
echo
echo "Inspect unit cleanup + map lookup:"
sed -n '129,137p' core/src/main/scala/org/dbpedia/extraction/dataparser/DurationParser.scala
echo
echo "Inspect upstream string normalization:"
sed -n '46,53p' core/src/main/scala/org/dbpedia/extraction/dataparser/StringParser.scala

Repository: dbpedia/extraction-framework

Length of output: 1712

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Line 388 context with surrounding lines and hex dump ==="
sed -n '385,391p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | cat -A
echo
echo "=== Hex dump of the key at line 388 ==="
sed -n '388p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | od -c
echo
echo "=== Check if this is part of a map initialization ==="
sed -n '380,395p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala
echo
echo "=== Verify the regex character class behavior ==="
echo 'Testing character class [^\'"'"'"a-zA-Z]'
python3 << 'EOF'
import re

# Test the regex pattern used in cleanup
pattern = r"[^'\"a-zA-Z]"
test_cases = [
    ("'", "ASCII apostrophe"),
    ("\u2019", "U+2019 right single quotation mark"),
    ("\\", "backslash"),
    ("a", "letter a"),
    ('"', "ASCII quote"),
]
for char, desc in test_cases:
    removed = re.sub(pattern, "", char)
    print(f"{desc:40} | Input: {repr(char):10} | After cleanup: {repr(removed):10} | Removed: {removed == ''}")
EOF

Repository: dbpedia/extraction-framework

Length of output: 587

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full context around line 388 with line numbers ==="
sed -n '380,395p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | nl -v 380
echo
echo "=== Hex/bytes of line 388 using Python ==="
python3 << 'EOF'
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()

line_388 = lines[387]  # 0-indexed
print(f"Line 388 content: {repr(line_388)}")
print(f"Bytes (hex): {line_388.encode('utf-8').hex()}")

# Extract the key part
import re
match = re.search(r'"([^"]*)"', line_388)
if match:
    key = match.group(1)
    print(f"Key: {repr(key)}")
    print(f"Key bytes (hex): {key.encode('utf-8').hex()}")
    for i, char in enumerate(key):
        print(f" [{i}] {repr(char)} (U+{ord(char):04X})")
EOF
echo
echo "=== Check surrounding lines for similar patterns ==="
python3 << 'EOF'
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()

for i in range(380, 400):
    line = lines[i] if i < len(lines) else ""
    if line.strip():
        match_obj = __import__('re').search(r'"([^"]*)"', line)
        if match_obj:
            key = match_obj.group(1)
            print(f"Line {i+1:3d}: key={repr(key):20s} value_start={lines[i].split('->')[1][:20].strip() if '->' in lines[i] else 'N/A'}")
EOF

Repository: dbpedia/extraction-framework

Length of output: 186

🏁 Script executed:

#!/bin/bash
python3 << 'EOFPYTHON'
import re

with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()

print("=== Lines 380-395 with key analysis ===")
for i in range(379, min(395, len(lines))):
    line = lines[i]
    line_num = i + 1
    print(f"Line {line_num}: {line.rstrip()}")
    # Extract key from quoted string
    match = re.search(r'"([^"]*)"', line)
    if match:
        key = match.group(1)
        key_hex = key.encode('utf-8').hex()
        print(f" Key: {repr(key)} | Hex: {key_hex}")
        for j, char in enumerate(key):
            print(f" [{j}] = {repr(char):8s} (U+{ord(char):04X})")

print("\n=== Analyzing the cleanup regex behavior ===")
pattern = r"[^'\"a-zA-Z]"

# Test cases from the actual keys found
test_keys = [
    "\\M-bM-^@M-^Y",  # How it appeared in cat -A
    "\\'",  # As shown in review snippet
    "'",  # ASCII apostrophe
]

# Actually read what's in the file
print("\nActual keys from file and cleanup result:")
for i in [387, 388, 389]:  # Lines around the area in question
    if i < len(lines):
        line = lines[i]
        match = re.search(r'"([^"]*)"', line)
        if match:
            key = match.group(1)
            cleaned = re.sub(pattern, "", key)
            print(f"Line {i+1}: Original key={repr(key):20s} | After cleanup={repr(cleaned):15s} | Matches itself? {key == cleaned}")
EOFPYTHON

Repository: dbpedia/extraction-framework

Length of output: 3075

🏁 Script executed:

#!/bin/bash
python3 << 'EOFPYTHON'
import re

with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()

print("=== Context around line 388 with comments ===")
for i in range(375, min(400, len(lines))):
    line = lines[i].rstrip()
    if line.strip():
        print(f"Line {i+1:3d}: {line}")

print("\n=== Summary of the minute keys ===")
print("Line 387: 'मिनट' (Hindi 'minute') -> 'minute'")
print("Line 388: '\\\\\\'' (backslash + backslash + U+2019) -> 'minute'")
print("Line 389: \"'\" (ASCII apostrophe U+0027) -> 'minute' [copied from english]")
print("Line 390: '′' (prime symbol U+2032) -> 'minute'")
print("Line 391: 'm' (ASCII m) -> 'minute'")

print("\n=== Testing what cleanup does to each ===")
pattern = r"[^'\"a-zA-Z]"
test_cases = [
    ("मिनट", "Hindi minute"),
    ("\\\\\u2019", "Line 388 key"),  # two backslashes + U+2019, as in the file
    ("'", "ASCII apostrophe"),
    ("′", "Prime symbol"),
    ("m", "ASCII m"),
]
for key, desc in test_cases:
    cleaned = re.sub(pattern, "", key)
    matches = "✓" if key == cleaned else "✗"
    print(f"{matches} {desc:25s}: {repr(key):20s} -> {repr(cleaned):20s}")
EOFPYTHON

Repository: dbpedia/extraction-framework

Length of output: 1879

🏁 Script executed:

#!/bin/bash
echo "=== Searching for tests related to Hindi duration parsing ==="
# Group the -name tests so that -type f applies to both patterns
find . -type f \( -name "*.scala" -o -name "*.java" \) | head -20
echo
echo "=== Search for DurationParser tests ==="
find . -type f \( -name "*Test*.scala" -o -name "*test*.scala" \) | xargs grep -l "DurationParser" 2>/dev/null | head -5
echo
echo "=== Search for 'hi' language or Hindi in tests ==="
find . -type f \( -name "*Test*.scala" -o -name "*test*.scala" \) | xargs grep -l "\"hi\"" 2>/dev/null | head -5
echo
echo "=== Check if there are any test inputs with the problematic characters ==="
# Closing parenthesis added; the original pattern left the group unterminated
find . -type f -name "*.scala" | xargs grep -E "(मिनट|U\+2019|\\\\\\\\')" 2>/dev/null | head -10

Repository: dbpedia/extraction-framework

Length of output: 1699

🏁 Script executed:

#!/bin/bash
echo "=== DurationParser test file ==="
cat core/src/test/scala/org/dbpedia/extraction/dataparser/DurationParserTest.scala
echo
echo "=== Check if test covers Hindi language ==="
grep -n "hi\|मिनट\|hindi\|Hindi" core/src/test/scala/org/dbpedia/extraction/dataparser/DurationParserTest.scala

Repository: dbpedia/extraction-framework

Length of output: 7512

This alias won't resolve during parsing.

The key shown as "\\’" is stored in the source file as two backslash characters followed by U+2019 (right single quotation mark), which a Scala string literal evaluates to one backslash plus U+2019. The cleanup step at DurationParser.scala:129-137 applies replaceAll("""[^\'\"a-zA-Z]""", ""), which removes every character other than the ASCII apostrophe, the ASCII double quote, and ASCII letters. Both the backslash and U+2019 are stripped, leaving an empty string, so no map lookup can match this key. The ASCII apostrophe at line 389 works because it is in the allowed set, but this entry can never produce "minute".

Use the ASCII apostrophe from line 389 or adjust the cleanup regex to preserve the intended character before lookup.
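The second option, widening the character class, might look like the sketch below. It is written in Python for quick checking rather than in Scala, and the names `CLEANUP_WIDE`, `UNITS`, and `lookup` are hypothetical. It also assumes the map key is changed to the bare U+2019 character, because a backslash in the input or in the key would still be removed by the cleanup.

```python
import re

# Widened cleanup: additionally keeps U+2019 (right single quotation mark).
CLEANUP_WIDE = re.compile("[^'\"a-zA-Z\u2019]")

# Hypothetical unit map keyed by the bare typographic apostrophe.
UNITS = {"\u2019": "minute", "'": "minute", "\"": "second"}


def lookup(raw_unit):
    """Normalize a raw unit token and resolve it, or return None."""
    return UNITS.get(CLEANUP_WIDE.sub("", raw_unit))


print(lookup("\u2019"))    # typographic apostrophe survives the cleanup
print(lookup("\\\u2019"))  # backslash is still stripped, leaving U+2019
print(lookup("\u2032"))    # prime (U+2032) is still removed by this pattern
```

Replacing the key with the ASCII apostrophe instead needs no regex change, but then the new entry would duplicate the existing one at line 389.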


deba-iitbh and others added 9 commits August 16, 2024 12:01

[GSoC 2024 Hindi Chapter] Added Hindi mapping

b56ffc7

[GSoC 2024 Hindi Chapter] Added Hindi XML Mapping

f9dfbb5

[gsoc 2024 Hindi Chapter] Updated Extraction Prop

b2855b3

- Minor Fix to some of the Hindi Mappings(xml) - Updated the new extractors to the properties file

Fix: ja properties

babc055

Fix: ja properties

3023dbd

[Add] DateInterval Extractor Updated for Hindi

bce5115

[Add] DateTimeParser and DurationParser configured

b5dd55d

Merge remote-tracking branch 'deba-ext/Hindi-chapter' into gsoc25_hin…

8bc174b

…di_chapter

[GSoC25] Hindi Chapter: Fixed configmaps

82f75a7

jimkont reviewed Jun 11, 2025


advenk mentioned this pull request Aug 29, 2025

[GSoC25_H] Main Pull Request for all changes related to the Hindi Chapter Changes for NEF dbpedia/neural-extraction-framework#20

Merged

Merge branch 'master' into gsoc25_hindi_chapter

9903ce4

coderabbitai bot reviewed Mar 9, 2026

advenk wants to merge 10 commits into dbpedia:master from advenk:gsoc25_hindi_chapter


3 participants
