Gsoc25 hindi chapter #776

Open
advenk wants to merge 10 commits into dbpedia:master from advenk:gsoc25_hindi_chapter

Conversation

@advenk advenk commented Jun 11, 2025

Changes required for Hindi Chapter.

Summary by CodeRabbit

  • New Features
    • Added support for recognizing the typographic apostrophe (’, U+2019) as a minute indicator in English and Hindi duration parsing.

@@ -1,37 +1,39 @@
<?xml version="1.0" encoding="UTF-8"?><mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"><page><title>Mapping hi:ज्ञानसन्दूक क्रिकेट टीम</title><ns>230</ns><id>11829</id><revision><id>52037</id><timestamp>2017-05-01T07:02:11Z</timestamp><text>{{TemplateMapping
<?xml version='1.0' encoding='UTF-8'?><mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
Member

Hi, quick question: where are these mapping pages sourced from? I do not see them all in https://mappings.dbpedia.org/index.php/Special:AllPages/Mapping_hi:

Author

@advenk advenk Jun 11, 2025

@jimkont Hi, these mappings were added by @deba-iitbh as part of GSoC 2024 but were never merged, so they are not present on the mappings wiki.

Member

The source of truth is the mappings wiki. When the extraction starts, the download-mappings task is executed to fetch the latest version of the mappings, so the changes here will be overwritten once that happens. Ideally, the mappings defined here should be moved to the wiki, and once that is done you should commit the updated output of the download-mappings task.

sonarqubecloud bot commented Mar 9, 2026

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

A new mapping entry is added to the duration parser configuration file, enabling recognition of the typographic apostrophe (’, U+2019) as a minute unit indicator in the English and Hindi unit maps used for token normalization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Duration Parser Configuration: core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | Added a mapping for the typographic apostrophe (’, U+2019) as a minute unit symbol to both the English and Hindi duration unit maps for improved token recognition during parsing. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • jimkont

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'Gsoc25 hindi chapter' is vague and generic, using non-descriptive terms that don't convey the specific nature of the changes. While it references Hindi and GSOC25, it fails to communicate what actual changes are being made (e.g., adding duration parser support, configuring extractors, etc.). | Provide a more specific title that describes the actual changes, such as 'Add Hindi language support for duration parser' or 'Configure Hindi extractors and language mappings'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala`:
- Line 388: The alias entry "\\’" in DurationParserConfig.scala uses a backslash
plus U+2019 which gets stripped by the cleanup in DurationParser.scala
(replaceAll("""[^\'\"a-zA-Z]""", "")), so lookups never match; fix by replacing
the problematic key with the ASCII apostrophe variant used on the next line (use
"'" as the map key) or alternatively update the cleanup regex in
DurationParser.scala (the replaceAll call around lines 129-137) to allow U+2019
so the original key survives normalization — modify either
DurationParserConfig.scala (change "\\’" -> "minute" to use the ASCII
apostrophe) or DurationParser.scala (expand the character class to include
\u2019) so map lookups succeed.
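
A minimal Python sketch of the failure described above. It assumes the parser cleans the unit token with the character class quoted in the comment and then performs an exact-match lookup in the unit map. `UNIT_MAP`, `clean_unit`, and `lookup` are hypothetical names, and the map holds only a few entries taken from the snippet under review. The real code is Scala, so this models the behavior rather than reproducing it.

```python
import re

# Character class quoted in the review: keep only ', " and ASCII letters.
CLEANUP = re.compile(r"""[^\'\"a-zA-Z]""")

# Hypothetical excerpt of the unit map. The Scala literal "\\’" denotes a
# backslash followed by U+2019, written here as "\\\u2019".
UNIT_MAP = {
    "s": "second",
    '"': "second",
    "\\\u2019": "minute",
    "'": "minute",
    "m": "minute",
}

def clean_unit(token: str) -> str:
    """Strip every character outside the allowed set."""
    return CLEANUP.sub("", token)

def lookup(token: str):
    """Return the canonical unit for a raw token, or None if unknown."""
    return UNIT_MAP.get(clean_unit(token))

if __name__ == "__main__":
    for raw in ["'", "\u2019", "\\\u2019", "m"]:
        print(repr(raw), "->", repr(clean_unit(raw)), "->", lookup(raw))
```

In this model the ASCII apostrophe and `m` resolve to `minute`, while both the bare U+2019 and the backslash-prefixed form are reduced to an empty string and resolve to nothing.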

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6839efc6-b7b5-494b-91dd-33ad4be7c6fe

📥 Commits

Reviewing files that changed from the base of the PR and between e3dfe8b and 9903ce4.

📒 Files selected for processing (1)
  • core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala

"s" -> "second",
"\"" -> "second",
"मिनट" -> "minute",
"\\’" -> "minute",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect regex construction:"
sed -n '19,28p' core/src/main/scala/org/dbpedia/extraction/dataparser/DurationParser.scala

echo
echo "Inspect unit cleanup + map lookup:"
sed -n '129,137p' core/src/main/scala/org/dbpedia/extraction/dataparser/DurationParser.scala

echo
echo "Inspect upstream string normalization:"
sed -n '46,53p' core/src/main/scala/org/dbpedia/extraction/dataparser/StringParser.scala

Repository: dbpedia/extraction-framework

Length of output: 1712


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Line 388 context with surrounding lines and hex dump ==="
sed -n '385,391p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | cat -A

echo
echo "=== Hex dump of the key at line 388 ==="
sed -n '388p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | od -c

echo
echo "=== Check if this is part of a map initialization ==="
sed -n '380,395p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala

echo
echo "=== Verify the regex character class behavior ==="
echo 'Testing character class [^\'"'"'"a-zA-Z]'
python3 << 'EOF'
import re
# Test the regex pattern used in cleanup
pattern = r"[^'\"a-zA-Z]"
test_cases = [
    ("'", "ASCII apostrophe"),
    ("\u2019", "U+2019 right single quotation mark"),
    ("\\", "backslash"),
    ("a", "letter a"),
    ('"', "ASCII quote"),
]
for char, desc in test_cases:
    removed = re.sub(pattern, "", char)
    print(f"{desc:40} | Input: {repr(char):10} | After cleanup: {repr(removed):10} | Removed: {removed == ''}")
EOF

Repository: dbpedia/extraction-framework

Length of output: 587


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full context around line 388 with line numbers ==="
sed -n '380,395p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | nl -v 380

echo
echo "=== Hex/bytes of line 388 using Python ==="
python3 << 'EOF'
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    line_388 = lines[387]  # 0-indexed
    print(f"Line 388 content: {repr(line_388)}")
    print(f"Bytes (hex): {line_388.encode('utf-8').hex()}")
    
    # Extract the key part
    import re
    match = re.search(r'"([^"]*)"', line_388)
    if match:
        key = match.group(1)
        print(f"Key: {repr(key)}")
        print(f"Key bytes (hex): {key.encode('utf-8').hex()}")
        for i, char in enumerate(key):
            print(f"  [{i}] {repr(char)} (U+{ord(char):04X})")
EOF

echo
echo "=== Check surrounding lines for similar patterns ==="
python3 << 'EOF'
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for i in range(380, 400):
        line = lines[i] if i < len(lines) else ""
        if line.strip():
            match_obj = __import__('re').search(r'"([^"]*)"', line)
            if match_obj:
                key = match_obj.group(1)
                print(f"Line {i+1:3d}: key={repr(key):20s} value_start={lines[i].split('->')[1][:20].strip() if '->' in lines[i] else 'N/A'}")
EOF

Repository: dbpedia/extraction-framework

Length of output: 186


🏁 Script executed:

#!/bin/bash
python3 << 'EOFPYTHON'
import re

with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()

print("=== Lines 380-395 with key analysis ===")
for i in range(379, min(395, len(lines))):
    line = lines[i]
    line_num = i + 1
    print(f"Line {line_num}: {line.rstrip()}")
    
    # Extract key from quoted string
    match = re.search(r'"([^"]*)"', line)
    if match:
        key = match.group(1)
        key_hex = key.encode('utf-8').hex()
        print(f"  Key: {repr(key)} | Hex: {key_hex}")
        for j, char in enumerate(key):
            print(f"    [{j}] = {repr(char):8s} (U+{ord(char):04X})")

print("\n=== Analyzing the cleanup regex behavior ===")
pattern = r"[^'\"a-zA-Z]"

# Test cases from the actual keys found
test_keys = [
    "\\M-bM-^@M-^Y",  # How it appeared in cat -A
    "\\'",            # As shown in review snippet  
    "'",              # ASCII apostrophe
]

# Actually read what's in the file
print("\nActual keys from file and cleanup result:")
for i in [387, 388, 389]:  # Lines around the area in question
    if i < len(lines):
        line = lines[i]
        match = re.search(r'"([^"]*)"', line)
        if match:
            key = match.group(1)
            cleaned = re.sub(pattern, "", key)
            print(f"Line {i+1}: Original key={repr(key):20s} | After cleanup={repr(cleaned):15s} | Matches itself? {key == cleaned}")
EOFPYTHON

Repository: dbpedia/extraction-framework

Length of output: 3075


🏁 Script executed:

#!/bin/bash
python3 << 'EOFPYTHON'
import re

with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()

print("=== Context around line 388 with comments ===")
for i in range(375, min(400, len(lines))):
    line = lines[i].rstrip()
    if line.strip():
        print(f"Line {i+1:3d}: {line}")

print("\n=== Summary of the minute keys ===")
print("Line 387: 'मिनट' (Hindi 'minute') -> 'minute'")
print("Line 388: '\\\\\\'' (backslash + backslash + U+2019) -> 'minute'")
print("Line 389: \"'\" (ASCII apostrophe U+0027) -> 'minute' [copied from english]")
print("Line 390: '′' (prime symbol U+2032) -> 'minute'")
print("Line 391: 'm' (ASCII m) -> 'minute'")

print("\n=== Testing what cleanup does to each ===")
pattern = r"[^'\"a-zA-Z]"
test_cases = [
    ("मिनट", "Hindi minute"),
    ("\\\u2019", "Line 388 key"),
    ("'", "ASCII apostrophe"),
    ("′", "Prime symbol"),
    ("m", "ASCII m"),
]

for key, desc in test_cases:
    cleaned = re.sub(pattern, "", key)
    matches = "✓" if key == cleaned else "✗"
    print(f"{matches} {desc:25s}: {repr(key):20s} -> {repr(cleaned):20s}")
EOFPYTHON

Repository: dbpedia/extraction-framework

Length of output: 1879


🏁 Script executed:

#!/bin/bash

echo "=== Searching for tests related to Hindi duration parsing ==="
find . -type f \( -name "*.scala" -o -name "*.java" \) | head -20

echo
echo "=== Search for DurationParser tests ==="
find . -type f \( -name "*Test*.scala" -o -name "*test*.scala" \) | xargs grep -l "DurationParser" 2>/dev/null | head -5

echo
echo "=== Search for 'hi' language or Hindi in tests ==="
find . -type f \( -name "*Test*.scala" -o -name "*test*.scala" \) | xargs grep -l "\"hi\"" 2>/dev/null | head -5

echo
echo "=== Check if there are any test inputs with the problematic characters ==="
find . -type f -name "*.scala" | xargs grep -E "(मिनट|U\+2019|\\\\\\\\')" 2>/dev/null | head -10

Repository: dbpedia/extraction-framework

Length of output: 1699


🏁 Script executed:

#!/bin/bash

echo "=== DurationParser test file ==="
cat core/src/test/scala/org/dbpedia/extraction/dataparser/DurationParserTest.scala

echo
echo "=== Check if test covers Hindi language ==="
grep -n "hi\|मिनट\|hindi\|Hindi" core/src/test/scala/org/dbpedia/extraction/dataparser/DurationParserTest.scala

Repository: dbpedia/extraction-framework

Length of output: 7512


This alias won't resolve during parsing.

The key, written "\\’" in the source file, consists of two backslash characters followed by U+2019 (right single quotation mark); as a Scala string literal it evaluates to one backslash plus U+2019. The cleanup step at DurationParser.scala:129-137 applies replaceAll("""[^\'\"a-zA-Z]""", ""), which removes everything except the ASCII apostrophe, the ASCII double quote, and ASCII letters. That strips both the backslash and U+2019, leaving an empty string, so no map lookup can succeed. The ASCII apostrophe at line 389 works because it is in the allowed set, but this entry can never produce "minute".

Use the ASCII apostrophe from line 389 or adjust the cleanup regex to preserve the intended character before lookup.
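
The two options can be compared in a small Python model. The real code is Scala, and `STRICT`, `RELAXED`, `lookup`, and the map excerpts are hypothetical names for illustration. The sketch assumes an exact-match lookup on the cleaned token, and for the second option it keys the map on the bare U+2019 character without a backslash, which is an assumption of this sketch rather than something the review states.

```python
import re

# Pattern quoted in the review: keeps only ', " and ASCII letters.
STRICT = re.compile(r"[^'\"a-zA-Z]")
# Assumed variant for option 2: additionally keeps U+2019.
RELAXED = re.compile(r"[^'\"\u2019a-zA-Z]")

# Hypothetical map excerpts; the real maps hold many more units.
ASCII_KEYS = {"'": "minute", "m": "minute"}
WITH_U2019 = {"'": "minute", "\u2019": "minute", "m": "minute"}

def lookup(token, pattern, unit_map):
    """Clean the token with the pattern, then do an exact-match lookup."""
    return unit_map.get(pattern.sub("", token))

if __name__ == "__main__":
    # Option 1: ASCII key only; typographic input is still stripped.
    print(lookup("'", STRICT, ASCII_KEYS))        # minute
    print(lookup("\u2019", STRICT, ASCII_KEYS))   # None
    # Option 2: widened class plus a bare U+2019 key.
    print(lookup("\u2019", RELAXED, WITH_U2019))  # minute
```

Under this model, option 1 removes the dead entry but does not make input containing ’ recognizable, whereas option 2 does. Whether that difference matters depends on how tokens actually reach the lookup, which this sketch does not verify.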

3 participants