NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots by lewismc · Pull Request #904 · apache/nutch

lewismc · 2026-02-26T02:51:03Z

PR to address NUTCH-3161. This patch addresses the following Security Hotspots:

High

(false positives): Exclude plugin resource directories from analysis in sonar-project.properties. No Java code lives in conf, data, or sample under src/plugin/**, so these paths are excluded from scanning.

Medium

ParseOutputFormat (src/java/.../ParseOutputFormat.java): Replace regex " *, *" for db.parsemeta.to.crawldb with comma-split + trim in getParseMetaToCrawlDBKeys(). Add TestParseOutputFormat with tests for empty, single, comma-separated, trim, and empty-segment handling.
HtmlParser (parse-html plugin): Remove metaPattern, charsetPattern, and charsetPatternHTML5. Use linear string parsing in extractCharsetFromMeta() / extractCharsetValue() for HTML4 and HTML5 meta charset detection. Add testExtractCharsetFromMeta in existing TestHtmlParser.
JSParseFilter (parse-js plugin): Remove STRING_PATTERN and URI_PATTERN. Use extractQuotedStrings() and looksLikeUri() (linear scan / simple checks). Add testExtractQuotedStrings and testLooksLikeUri in existing TestJSParseFilter.
UrlValidator (urlfilter-validator plugin): Remove URL_PATTERN and AUTHORITY_PATTERN. Use java.net.URI for URL structure and parseAuthority() for host/port. Remove unused isBlankOrNull. Add testParseAuthority in existing TestUrlValidator.
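
For readers unfamiliar with the java.net.URI approach named in the UrlValidator item, here is a minimal sketch of pulling host and port out of a URL without a regex. The class and method names (UriAuthoritySketch, hostAndPort) are hypothetical and are not taken from the patch; the actual parseAuthority() may behave differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; not the code from this PR.
final class UriAuthoritySketch {

  /**
   * Returns "host:port" (port is -1 when the URL has none),
   * or null if the URL cannot be parsed or has no host.
   */
  static String hostAndPort(String url) {
    try {
      URI uri = new URI(url);
      if (uri.getHost() == null) {
        return null;
      }
      return uri.getHost() + ":" + uri.getPort();
    } catch (URISyntaxException e) {
      // java.net.URI rejects illegal characters such as spaces
      return null;
    }
  }
}
```

Note that java.net.URI applies its own syntax rules, so it may accept or reject inputs differently from the removed patterns.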

Thanks for any review.

lewismc · 2026-02-26T03:43:27Z

Confirmed this PR addresses the security hotspots and passes tests.

sonarqubecloud · 2026-02-26T04:09:54Z

Quality Gate failed

Failed conditions
62.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

sebastian-nagel

Hi @lewismc, thanks!

So far, this is only a partial review. See the inline comments. Without having tried it, I expect some regressions, e.g. when extracting the charset from <meta charset = "UTF-8" />. Replacing regular expressions with hand-written scanner code is difficult and makes code maintenance much harder. Maybe use a scanner generator instead, e.g. Ragel? Alternatively, we could use RE2/J or dk.brics.automaton (see urlfilter-automaton), which are faster (and safer) because they do not backtrack.

In addition, the URL validator needs a more substantial fix, see NUTCH-2986.

sebastian-nagel · 2026-04-15T21:39:21Z

+   * Package-private for unit testing.
+   */
+  static String extractCharsetFromMeta(String str) {
+    String lower = str.toLowerCase();


Should add Locale.ROOT.
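
A small sketch of why the locale matters here: String.toLowerCase() without an argument uses the JVM's default locale, and under a Turkish locale an ASCII "I" lowercases to the dotless "ı" (U+0131). The class name LocaleRootSketch is hypothetical and only serves this illustration.

```java
import java.util.Locale;

// Hypothetical illustration only; not the code from this PR.
final class LocaleRootSketch {

  /** Locale-independent lowercasing, as suggested in the review. */
  static String lowerRoot(String s) {
    return s.toLowerCase(Locale.ROOT);
  }

  /** What toLowerCase() would do on a JVM whose default locale is Turkish. */
  static String lowerTurkish(String s) {
    return s.toLowerCase(Locale.forLanguageTag("tr"));
  }
}
```

With this sketch, lowerRoot("HTTP-EQUIV") gives "http-equiv", while lowerTurkish("HTTP-EQUIV") gives "http-equıv", so a later indexOf("http-equiv") would miss the attribute.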

sebastian-nagel · 2026-04-15T21:42:10Z

@@ -102,20 +171,7 @@ private static String sniffCharacterEncoding(byte[] content) {
    // {U+0041, U+0082, U+00B7}.
    String str = new String(content, 0, length, StandardCharsets.US_ASCII);


The content bytes are converted into a String assuming ASCII encoding. If a "hand-crafted" scanner is used, it could operate directly on the bytes and avoid the conversion to a string. Alternatively, it could use a CharSequence wrapping the bytes, cf. ByteArrayCharSequence. But this can be an improvement for later.
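
To make the CharSequence idea concrete, a minimal sketch of a read-only ASCII view over a byte array follows. The class AsciiBytesSketch is hypothetical and is not the ByteArrayCharSequence mentioned above; it assumes that non-ASCII bytes should be reported as U+FFFD, which is also what decoding with US_ASCII produces.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not an existing Nutch class.
final class AsciiBytesSketch implements CharSequence {
  private final byte[] bytes;
  private final int offset;
  private final int length;

  AsciiBytesSketch(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public int length() {
    return length;
  }

  @Override
  public char charAt(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
    int b = bytes[offset + index] & 0xFF;
    // Bytes outside the ASCII range are mapped to the replacement character
    return b < 0x80 ? (char) b : '\uFFFD';
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    if (start < 0 || end > length || start > end) {
      throw new IndexOutOfBoundsException("range: " + start + ".." + end);
    }
    // Shares the underlying array, no copy is made
    return new AsciiBytesSketch(bytes, offset + start, end - start);
  }

  @Override
  public String toString() {
    return new String(bytes, offset, length, StandardCharsets.US_ASCII);
  }
}
```

A scanner written against CharSequence could then run on either a String or this view without copying the sniffed content.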

sebastian-nagel · 2026-04-15T21:51:07Z

+      if (tagEnd < 0) {
+        break;
+      }
+      String tagContent = str.substring(metaStart, tagEnd);


Assuming that indexes in the lower-case string and the original one are the same might be dangerous. For example, in German there is the pair "ß" <> "SS" (this would hold when uppercasing the string). See Unicode TR #21 Case Mappings: "Case mappings may produce strings of different length than the original."

Of course, if we strictly ensure that the method is used on String only including ASCII characters and using always the root locale, this should be safe.
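
The length change is easy to reproduce in Java: "ß".toUpperCase(Locale.ROOT) is "SS", and lowercasing "İ" (U+0130) under Locale.ROOT yields two chars ("i" followed by U+0307). One way to keep indexes aligned is to fold only A-Z, sketched below. The helper name asciiLower is hypothetical and not part of the patch.

```java
import java.util.Locale;

// Hypothetical illustration only; not the code from this PR.
final class AsciiLowerSketch {

  /**
   * Lowercases only the ASCII letters A-Z and leaves every other char
   * untouched, so the result always has the same length as the input
   * and index i in the result corresponds to index i in the original.
   */
  static String asciiLower(String s) {
    char[] out = s.toCharArray();
    for (int i = 0; i < out.length; i++) {
      char c = out[i];
      if (c >= 'A' && c <= 'Z') {
        out[i] = (char) (c + ('a' - 'A'));
      }
    }
    return new String(out);
  }

  public static void main(String[] args) {
    // Full case mapping can change the length of a string
    System.out.println("\u00DF".toUpperCase(Locale.ROOT).length());
    // ASCII-only folding cannot
    System.out.println(asciiLower("\u0130<META>").length());
  }
}
```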

sebastian-nagel · 2026-04-15T21:53:10Z

+        break;
+      }
+      String tagContent = str.substring(metaStart, tagEnd);
+      String tagLower = tagContent.toLowerCase();


Same: must use Locale.ROOT.

sebastian-nagel · 2026-04-15T21:54:45Z

-      "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
-      Pattern.CASE_INSENSITIVE);
+  private static final String META_TAG_START = "<meta";
+  private static final String CHARSET_EQ = "charset=";


The regex allows white space between "charset" and "=".
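
A sketch of how a linear scan could keep that tolerance: look for "charset", skip optional whitespace, require "=", then skip whitespace again. The names below (CharsetEqSketch, afterCharsetEq) are hypothetical, and the input is assumed to be already lowercased.

```java
// Hypothetical illustration only; not the code from this PR.
final class CharsetEqSketch {
  private static final String CHARSET = "charset";

  /**
   * Returns the index of the first char of the charset value, i.e. just past
   * "charset", optional whitespace, "=", optional whitespace; or -1 if no
   * such sequence exists at or after 'from'. Expects lowercased input.
   */
  static int afterCharsetEq(String lower, int from) {
    int idx = lower.indexOf(CHARSET, from);
    while (idx >= 0) {
      int i = skipWs(lower, idx + CHARSET.length());
      if (i < lower.length() && lower.charAt(i) == '=') {
        return skipWs(lower, i + 1);
      }
      idx = lower.indexOf(CHARSET, idx + 1);
    }
    return -1;
  }

  private static int skipWs(String s, int i) {
    while (i < s.length() && isWs(s.charAt(i))) {
      i++;
    }
    return i;
  }

  private static boolean isWs(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\u000B' || c == '\f' || c == '\r';
  }
}
```

For the input `<meta charset = "utf-8" />` this returns the index of the opening quote, whereas a plain indexOf("charset=") would find nothing.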

sebastian-nagel · 2026-04-15T22:13:56Z

    Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
    Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);

-    final String[] parseMDtoCrawlDB = conf.get("db.parsemeta.to.crawldb", "")


It's a string from the configuration, it's controlled by the user and should be short. It's parsed once, so this is certainly not critical.

Apart from that, we should simply rely on getTrimmedStrings.
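
For comparison, the comma-split plus trim described in the PR can be sketched in plain Java as below. This is not the patch's getParseMetaToCrawlDBKeys() and not a substitute for getTrimmedStrings; the class name is hypothetical, and dropping empty segments is an assumption, since the PR text does not say how they are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from this PR.
final class TrimmedSplitSketch {

  /**
   * Splits on commas and trims each segment without using a regex.
   * ASSUMPTION: empty segments are skipped.
   */
  static String[] splitTrimmed(String value) {
    List<String> keys = new ArrayList<>();
    int start = 0;
    while (start <= value.length()) {
      int comma = value.indexOf(',', start);
      int end = comma < 0 ? value.length() : comma;
      String key = value.substring(start, end).trim();
      if (!key.isEmpty()) {
        keys.add(key);
      }
      start = end + 1;
    }
    return keys.toArray(new String[0]);
  }
}
```

Relying on the configuration API instead, as suggested above, avoids maintaining this helper and its tests at all.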

sebastian-nagel · 2026-04-15T22:14:40Z

+      return null;
+    }
+    int start = idx + CHARSET_EQ.length();
+    while (start < s.length() && (s.charAt(start) == ' ' || s.charAt(start) == '\t')) {


\\s matches more characters than blank (U+0020) and the tab character.
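
By default (without UNICODE_CHARACTER_CLASS), java.util.regex defines \s as [ \t\n\x0B\f\r]. A hand-written check that mirrors this set could look like the sketch below; the names isRegexWhitespace and skipWhitespace are hypothetical and not taken from the patch.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code from this PR.
final class RegexWhitespaceSketch {

  /** Mirrors the default \s class: space, tab, LF, vertical tab, FF, CR. */
  static boolean isRegexWhitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n'
        || c == '\u000B' || c == '\f' || c == '\r';
  }

  /** Returns the index of the first non-whitespace char at or after start. */
  static int skipWhitespace(CharSequence s, int start) {
    int i = start;
    while (i < s.length() && isRegexWhitespace(s.charAt(i))) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    // Cross-check the helper against the regex engine for all ASCII chars
    for (char c = 0; c < 128; c++) {
      boolean viaRegex = Pattern.matches("\\s", String.valueOf(c));
      if (viaRegex != isRegexWhitespace(c)) {
        throw new IllegalStateException("mismatch at code point " + (int) c);
      }
    }
    System.out.println("helper agrees with \\s for ASCII");
  }
}
```

A cross-check like the one in main could also live in a unit test, so the scanner and the removed pattern stay in agreement.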

NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots

08665a9

lewismc self-assigned this Feb 26, 2026

NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots

becde79

NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots

3ad7711

sebastian-nagel requested changes Apr 15, 2026


lewismc wants to merge 3 commits into apache:master from lewismc:NUTCH-3161
