Skip to content

NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots#904

Open
lewismc wants to merge 3 commits intoapache:masterfrom
lewismc:NUTCH-3161
Open

NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots#904
lewismc wants to merge 3 commits intoapache:masterfrom
lewismc:NUTCH-3161

Conversation

@lewismc
Copy link
Copy Markdown
Member

@lewismc lewismc commented Feb 26, 2026

PR to address NUTCH-3161. This patch addresses following Security Hotspota

High

(false positives): Exclude plugin resource directories from analysis in sonar-project.properties. No Java code lives in conf, data, or sample under src/plugin/**, so these paths are excluded from scanning., and

Medium

  • ParseOutputFormat (src/java/.../ParseOutputFormat.java): Replace regex " *, *" for db.parsemeta.to.crawldb with comma-split + trim in getParseMetaToCrawlDBKeys(). Add TestParseOutputFormat with tests for empty, single, comma-separated, trim, and empty-segment handling.
  • HtmlParser (parse-html plugin): Remove metaPattern, charsetPattern, and charsetPatternHTML5. Use linear string parsing in extractCharsetFromMeta() / extractCharsetValue() for HTML4 and HTML5 meta charset detection. Add testExtractCharsetFromMeta in existing TestHtmlParser.
  • JSParseFilter (parse-js plugin): Remove STRING_PATTERN and URI_PATTERN. Use extractQuotedStrings() and looksLikeUri() (linear scan / simple checks). Add testExtractQuotedStrings and testLooksLikeUri in existing TestJSParseFilter.
  • UrlValidator (urlfilter-validator plugin): Remove URL_PATTERN and AUTHORITY_PATTERN. Use java.net.URI for URL structure and parseAuthority() for host/port. Remove unused isBlankOrNull. Add testParseAuthority in existingTestUrlValidator.

Thanks for any review.

@lewismc lewismc self-assigned this Feb 26, 2026
@lewismc
Copy link
Copy Markdown
Member Author

lewismc commented Feb 26, 2026

Confirmed this PR addresses the security hotspots and passes tests.

@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
62.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Copy link
Copy Markdown
Contributor

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lewismc, thanks!

So far, this is only a partial review. See the inline comments. Without having tried it, I expect some regressions, e.g. when extracting the charset from <meta charset = "UTF-8" />. Replacing regular expressions by hand-written scanner code is difficult and makes code maintenance much more difficult. Maybe use a scanner generator instead, e.g. Ragel? Alternatively, we could use RE2/J or dk.brics.automaton (see urlfilter-automaton) which are faster (and more safe) because they do not backtrack.

In addition, the URL validator needs a more substantial fix, see NUTCH-2986.

* Package-private for unit testing.
*/
static String extractCharsetFromMeta(String str) {
String lower = str.toLowerCase();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add Locale.ROOT.

@@ -102,20 +171,7 @@ private static String sniffCharacterEncoding(byte[] content) {
// {U+0041, U+0082, U+00B7}.
String str = new String(content, 0, length, StandardCharsets.US_ASCII);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content bytes are converted into a String assuming ASCII encoding. If a "hand-crafted" scanner is used, could directly operate on the bytes avoiding the conversion to a string. Alternatively, could use a CharSequence wrapping the bytes, cf. ByteArrayCharSequence. But this can be an improvement for later.

if (tagEnd < 0) {
break;
}
String tagContent = str.substring(metaStart, tagEnd);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that indexes in the lower-case string and the original one are the same might be dangerous, for example, in German there is the pair "ß" <> "SS" (would held for uppercasing the string). See Unicode TR #21 Case Mappings: "Case mappings may produce strings of different length than the original."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, if we strictly ensure that the method is used on String only including ASCII characters and using always the root locale, this should be safe.

break;
}
String tagContent = str.substring(metaStart, tagEnd);
String tagLower = tagContent.toLowerCase();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same: must use Locale.ROOT.

"<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
Pattern.CASE_INSENSITIVE);
private static final String META_TAG_START = "<meta";
private static final String CHARSET_EQ = "charset=";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex allows white space between "charset" and "=".

Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);

final String[] parseMDtoCrawlDB = conf.get("db.parsemeta.to.crawldb", "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a string from the configuration, it's controlled by the user and should be short. It's parsed once, so this is for sure not critical.

Apart from that should simply rely on getTrimmedStrings.

return null;
}
int start = idx + CHARSET_EQ.length();
while (start < s.length() && (s.charAt(start) == ' ' || s.charAt(start) == '\t')) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\\s matches more characters than blank (U+0020) and the tab character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants