- Minor fix to some of the Hindi mappings (XML)
- Updated the properties file with the new extractors
@@ -1,37 +1,39 @@
<?xml version="1.0" encoding="UTF-8"?><mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"><page><title>Mapping hi:ज्ञानसन्दूक क्रिकेट टीम</title><ns>230</ns><id>11829</id><revision><id>52037</id><timestamp>2017-05-01T07:02:11Z</timestamp><text>{{TemplateMapping
<?xml version='1.0' encoding='UTF-8'?><mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
Hi, quick question: where are these mapping pages sourced from? I do not see them all in https://mappings.dbpedia.org/index.php/Special:AllPages/Mapping_hi:
@jimkont Hi, these were added by @deba-iitbh as part of GSoC 2024 but were not merged, so they are not present in the mappings.
The source of truth is the mappings wiki. When the extraction starts, the download-mappings task runs to fetch the latest version of the mappings, so the changes here will be overwritten at that point. Ideally, the mappings defined here should be moved to the wiki, and once that is done you commit the updated output of the download-mappings task.
📝 Walkthrough
A new mapping entry is added to the duration parser configuration files, enabling recognition of a typographic apostrophe variant (\') as a minute unit indicator in the English and Hindi language mappings for token normalization.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala`:
- Line 388: The alias entry "\\’" in DurationParserConfig.scala uses a backslash
plus U+2019 which gets stripped by the cleanup in DurationParser.scala
(replaceAll("""[^\'\"a-zA-Z]""", "")), so lookups never match; fix by replacing
the problematic key with the ASCII apostrophe variant used on the next line (use
"'" as the map key) or alternatively update the cleanup regex in
DurationParser.scala (the replaceAll call around lines 129-137) to allow U+2019
so the original key survives normalization — modify either
DurationParserConfig.scala (change "\\’" -> "minute" to use the ASCII
apostrophe) or DurationParser.scala (expand the character class to include
\u2019) so map lookups succeed.
📒 Files selected for processing (1)
core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala
"s" -> "second",
"\"" -> "second",
"मिनट" -> "minute",
"\\’" -> "minute",
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Inspect regex construction:"
sed -n '19,28p' core/src/main/scala/org/dbpedia/extraction/dataparser/DurationParser.scala
echo
echo "Inspect unit cleanup + map lookup:"
sed -n '129,137p' core/src/main/scala/org/dbpedia/extraction/dataparser/DurationParser.scala
echo
echo "Inspect upstream string normalization:"
sed -n '46,53p' core/src/main/scala/org/dbpedia/extraction/dataparser/StringParser.scala
Repository: dbpedia/extraction-framework
Length of output: 1712
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Line 388 context with surrounding lines and hex dump ==="
sed -n '385,391p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | cat -A
echo
echo "=== Hex dump of the key at line 388 ==="
sed -n '388p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | od -c
echo
echo "=== Check if this is part of a map initialization ==="
sed -n '380,395p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala
echo
echo "=== Verify the regex character class behavior ==="
echo 'Testing character class [^\'"'"'"a-zA-Z]'
python3 << 'EOF'
import re
# Test the regex pattern used in cleanup
pattern = r"[^'\"a-zA-Z]"
test_cases = [
    ("'", "ASCII apostrophe"),
    ("\u2019", "U+2019 right single quotation mark"),
    ("\\", "backslash"),
    ("a", "letter a"),
    ('"', "ASCII quote"),
]
for char, desc in test_cases:
    removed = re.sub(pattern, "", char)
    print(f"{desc:40} | Input: {repr(char):10} | After cleanup: {repr(removed):10} | Removed: {removed == ''}")
EOF
Repository: dbpedia/extraction-framework
Length of output: 587
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full context around line 388 with line numbers ==="
sed -n '380,395p' core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala | nl -v 380
echo
echo "=== Hex/bytes of line 388 using Python ==="
python3 << 'EOF'
import re
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()
line_388 = lines[387]  # 0-indexed
print(f"Line 388 content: {repr(line_388)}")
print(f"Bytes (hex): {line_388.encode('utf-8').hex()}")
# Extract the key part
match = re.search(r'"([^"]*)"', line_388)
if match:
    key = match.group(1)
    print(f"Key: {repr(key)}")
    print(f"Key bytes (hex): {key.encode('utf-8').hex()}")
    for i, char in enumerate(key):
        print(f"  [{i}] {repr(char)} (U+{ord(char):04X})")
EOF
echo
echo "=== Check surrounding lines for similar patterns ==="
python3 << 'EOF'
import re
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()
for i in range(380, 400):
    line = lines[i] if i < len(lines) else ""
    if line.strip():
        match_obj = re.search(r'"([^"]*)"', line)
        if match_obj:
            key = match_obj.group(1)
            # First 20 characters of whatever follows the arrow, if any
            value_start = line.split('->')[1][:20].strip() if '->' in line else 'N/A'
            print(f"Line {i+1:3d}: key={repr(key):20s} value_start={value_start}")
EOF
Repository: dbpedia/extraction-framework
Length of output: 186
🏁 Script executed:
#!/bin/bash
python3 << 'EOFPYTHON'
import re
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()
print("=== Lines 380-395 with key analysis ===")
for i in range(379, min(395, len(lines))):
    line = lines[i]
    line_num = i + 1
    print(f"Line {line_num}: {line.rstrip()}")
    # Extract key from quoted string
    match = re.search(r'"([^"]*)"', line)
    if match:
        key = match.group(1)
        key_hex = key.encode('utf-8').hex()
        print(f"  Key: {repr(key)} | Hex: {key_hex}")
        for j, char in enumerate(key):
            print(f"    [{j}] = {repr(char):8s} (U+{ord(char):04X})")
print("\n=== Analyzing the cleanup regex behavior ===")
pattern = r"[^'\"a-zA-Z]"
# Test cases from the actual keys found
test_keys = [
    "\\M-bM-^@M-^Y",  # How it appeared in cat -A
    "\\'",  # As shown in review snippet
    "'",  # ASCII apostrophe
]
# Actually read what's in the file
print("\nActual keys from file and cleanup result:")
for i in [387, 388, 389]:  # Lines around the area in question
    if i < len(lines):
        line = lines[i]
        match = re.search(r'"([^"]*)"', line)
        if match:
            key = match.group(1)
            cleaned = re.sub(pattern, "", key)
            print(f"Line {i+1}: Original key={repr(key):20s} | After cleanup={repr(cleaned):15s} | Matches itself? {key == cleaned}")
EOFPYTHON
Repository: dbpedia/extraction-framework
Length of output: 3075
🏁 Script executed:
#!/bin/bash
python3 << 'EOFPYTHON'
import re
with open('core/src/main/scala/org/dbpedia/extraction/config/dataparser/DurationParserConfig.scala', 'r', encoding='utf-8') as f:
    lines = f.readlines()
print("=== Context around line 388 with comments ===")
for i in range(375, min(400, len(lines))):
    line = lines[i].rstrip()
    if line.strip():
        print(f"Line {i+1:3d}: {line}")
print("\n=== Summary of the minute keys ===")
print("Line 387: 'मिनट' (Hindi 'minute') -> 'minute'")
print("Line 388: '\\\\\\'' (backslash + backslash + U+2019) -> 'minute'")
print("Line 389: \"'\" (ASCII apostrophe U+0027) -> 'minute' [copied from english]")
print("Line 390: '′' (prime symbol U+2032) -> 'minute'")
print("Line 391: 'm' (ASCII m) -> 'minute'")
print("\n=== Testing what cleanup does to each ===")
pattern = r"[^'\"a-zA-Z]"
test_cases = [
    ("मिनट", "Hindi minute"),
    ("\\\\\u2019", "Line 388 key"),  # two backslashes + U+2019, as described above
    ("'", "ASCII apostrophe"),
    ("′", "Prime symbol"),
    ("m", "ASCII m"),
]
for key, desc in test_cases:
    cleaned = re.sub(pattern, "", key)
    matches = "✓" if key == cleaned else "✗"
    print(f"{matches} {desc:25s}: {repr(key):20s} -> {repr(cleaned):20s}")
EOFPYTHON
Repository: dbpedia/extraction-framework
Length of output: 1879
🏁 Script executed:
#!/bin/bash
echo "=== Searching for tests related to Hindi duration parsing ==="
find . -type f -name "*.scala" -o -name "*.java" | head -20
echo
echo "=== Search for DurationParser tests ==="
find . -type f \( -name "*Test*.scala" -o -name "*test*.scala" \) | xargs grep -l "DurationParser" 2>/dev/null | head -5
echo
echo "=== Search for 'hi' language or Hindi in tests ==="
find . -type f \( -name "*Test*.scala" -o -name "*test*.scala" \) | xargs grep -l "\"hi\"" 2>/dev/null | head -5
echo
echo "=== Check if there are any test inputs with the problematic characters ==="
find . -type f -name "*.scala" | xargs grep -E "(मिनट|U\+2019|\\\\\\\\'" 2>/dev/null | head -10
Repository: dbpedia/extraction-framework
Length of output: 1699
🏁 Script executed:
#!/bin/bash
echo "=== DurationParser test file ==="
cat core/src/test/scala/org/dbpedia/extraction/dataparser/DurationParserTest.scala
echo
echo "=== Check if test covers Hindi language ==="
grep -n "hi\|मिनट\|hindi\|Hindi" core/src/test/scala/org/dbpedia/extraction/dataparser/DurationParserTest.scala
Repository: dbpedia/extraction-framework
Length of output: 7512
This alias won't resolve during parsing.
The key shown as "\\’" is written in the file as two backslash characters followed by U+2019 (right single quotation mark); as a Scala string literal, that is one backslash plus U+2019 at runtime. The cleanup step at DurationParser.scala:129-137 applies replaceAll("""[^\'\"a-zA-Z]""", ""), which removes every character other than the ASCII apostrophe, the ASCII double quote, and ASCII letters. A token made of a backslash or U+2019 is therefore reduced to an empty string before the lookup, so it can never match this key. The ASCII apostrophe entry at line 389 works because that character is in the allowed set, but this entry cannot produce "minute".
Use the ASCII apostrophe from line 389 or adjust the cleanup regex to preserve the intended character before lookup.
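To make the failure and the two suggested fixes concrete, here is a minimal Python sketch of the cleanup-then-lookup behaviour described above. It is not the DurationParser code: `lookup_unit`, the two dictionaries, and `CLEANUP_WITH_U2019` are hypothetical names, and the sketch assumes the lookup is a plain map get on the cleaned token. Only the character class is taken from the review.

```python
import re

# Character class quoted in the review: drop everything except the
# ASCII apostrophe, the ASCII double quote, and ASCII letters.
CLEANUP = re.compile(r"[^'\"a-zA-Z]")

# Hypothetical widened class for the second fix option: also keep U+2019.
CLEANUP_WITH_U2019 = re.compile(r"[^'\"a-zA-Z\u2019]")


def lookup_unit(token, units, cleanup=CLEANUP):
    """Clean a raw unit token, then look it up; returns None on a miss."""
    return units.get(cleanup.sub("", token))


# Simplified stand-in for the submitted map: backslash + U+2019, and ASCII apostrophe.
units_as_submitted = {"\\\u2019": "minute", "'": "minute"}

# Assumed variant where the key is the bare U+2019 character.
units_bare_key = {"\u2019": "minute", "'": "minute"}

# The ASCII apostrophe survives cleanup, so this lookup succeeds.
print(lookup_unit("'", units_as_submitted))

# U+2019 is stripped to "", so the submitted entry is never reached.
print(lookup_unit("\u2019", units_as_submitted))

# Widening the class alone still misses: the key carries a backslash
# that a cleaned token can never contain.
print(lookup_unit("\u2019", units_as_submitted, CLEANUP_WITH_U2019))

# Widened class plus a bare U+2019 key: the lookup succeeds.
print(lookup_unit("\u2019", units_bare_key, CLEANUP_WITH_U2019))
```

Under these assumptions, the regex change only helps if the key is also reduced to the bare U+2019 character, since the backslash is still removed from every token. Whether that holds for the real parser should be checked against DurationParser.scala.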
Changes required for Hindi Chapter.