NUTCH-3161 Address Sonarcloud High and Medium Security Hotspots#904
lewismc wants to merge 3 commits into apache:master
Confirmed this PR addresses the security hotspots and passes tests.
Hi @lewismc, thanks!
So far, this is only a partial review. See the inline comments. Without having tried it, I expect some regressions, e.g. when extracting the charset from <meta charset = "UTF-8" />. Replacing regular expressions with hand-written scanner code is difficult and makes the code much harder to maintain. Maybe use a scanner generator instead, e.g. Ragel? Alternatively, we could use RE2/J or dk.brics.automaton (see urlfilter-automaton), which are faster (and safer) because they do not backtrack.
In addition, the URL validator needs a more substantial fix, see NUTCH-2986.
| * Package-private for unit testing. | ||
| */ | ||
| static String extractCharsetFromMeta(String str) { | ||
| String lower = str.toLowerCase(); |
Should add Locale.ROOT.
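A minimal sketch of why the locale argument matters, assuming the JVM default locale could be Turkish on some crawler host. Under a Turkish locale, `toLowerCase` maps "I" to the dotless "ı" (U+0131), so a lowercased attribute name such as "HTTP-EQUIV" no longer equals "http-equiv". The class and method names are hypothetical and not part of the patch.

```java
import java.util.Locale;

class LocaleRootDemo {
  // Lowercases with an explicit locale so the result does not depend on the JVM default.
  static String lower(String s, Locale locale) {
    return s.toLowerCase(locale);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");
    String attr = "HTTP-EQUIV";
    // Locale-independent lowercasing keeps the ASCII "i".
    System.out.println(lower(attr, Locale.ROOT).equals("http-equiv"));
    // Turkish lowercasing turns "I" into dotless "ı", so the comparison fails.
    System.out.println(lower(attr, turkish).equals("http-equiv"));
  }
}
```

Calling `toLowerCase()` without an argument behaves like the second call whenever the default locale is Turkish, which is why passing `Locale.ROOT` explicitly is the safer choice.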
| @@ -102,20 +171,7 @@ private static String sniffCharacterEncoding(byte[] content) { | |||
| // {U+0041, U+0082, U+00B7}. | |||
| String str = new String(content, 0, length, StandardCharsets.US_ASCII); | |||
The content bytes are converted into a String assuming ASCII encoding. If a "hand-crafted" scanner is used, it could operate directly on the bytes and avoid the conversion to a string. Alternatively, it could use a CharSequence wrapping the bytes, cf. ByteArrayCharSequence. But this can be an improvement for later.
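The idea of wrapping the bytes could look roughly like the sketch below. `AsciiByteSequence` is a hypothetical class written for illustration, not the ByteArrayCharSequence mentioned above. One assumption to note: it exposes every byte as the char with the same value (ISO-8859-1 style), whereas decoding with US_ASCII replaces bytes above 0x7F with U+FFFD.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical read-only CharSequence view over a byte range; no copy is made. */
final class AsciiByteSequence implements CharSequence {
  private final byte[] data;
  private final int offset;
  private final int length;

  AsciiByteSequence(byte[] data, int offset, int length) {
    if (offset < 0 || length < 0 || length > data.length - offset) {
      throw new IndexOutOfBoundsException("invalid range");
    }
    this.data = data;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public int length() {
    return length;
  }

  @Override
  public char charAt(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    // Mask to 0..255 so bytes >= 0x80 are not sign-extended.
    return (char) (data[offset + index] & 0xFF);
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    if (start < 0 || end > length || start > end) {
      throw new IndexOutOfBoundsException("invalid subsequence");
    }
    return new AsciiByteSequence(data, offset + start, end - start);
  }

  @Override
  public String toString() {
    // ISO-8859-1 maps each byte to the char with the same value, matching charAt().
    return new String(data, offset, length, StandardCharsets.ISO_8859_1);
  }
}
```

Because `java.util.regex.Pattern.matcher` accepts any `CharSequence`, such a view could be passed to a regex or to a hand-written scanner without first building a String.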
| if (tagEnd < 0) { | ||
| break; | ||
| } | ||
| String tagContent = str.substring(metaStart, tagEnd); |
Assuming that indexes in the lower-case string and the original one are the same might be dangerous. For example, German has the pair "ß" <> "SS" (this would apply when uppercasing the string). See Unicode TR #21 Case Mappings: "Case mappings may produce strings of different length than the original."
Of course, if we strictly ensure that the method is only used on Strings containing only ASCII characters and always with the root locale, this should be safe.
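The length change can be reproduced with the "ß" example, and the ASCII-only precondition can be made explicit with a guard. This is a sketch only; `CaseLengthDemo` and its methods are hypothetical names, not code from the patch.

```java
import java.util.Locale;

class CaseLengthDemo {
  // True if every char is in the 7-bit ASCII range.
  static boolean isAscii(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) > 0x7F) {
        return false;
      }
    }
    return true;
  }

  // Lowercases only when index positions are guaranteed to stay aligned with the input.
  static String lowerAsciiOnly(String s) {
    if (!isAscii(s)) {
      throw new IllegalArgumentException("non-ASCII input: indexes may shift after case mapping");
    }
    return s.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String word = "stra\u00DFe"; // "straße", 6 chars
    String upper = word.toUpperCase(Locale.ROOT); // "STRASSE", 7 chars
    System.out.println(word.length() + " -> " + upper.length());
    System.out.println(lowerAsciiOnly("<META CHARSET=UTF-8>"));
  }
}
```

With the guard in place, an index found in the lowercased string can be applied to the original string, because ASCII case mapping under `Locale.ROOT` is one char to one char.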
| break; | ||
| } | ||
| String tagContent = str.substring(metaStart, tagEnd); | ||
| String tagLower = tagContent.toLowerCase(); |
Same: must use Locale.ROOT.
| "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>", | ||
| Pattern.CASE_INSENSITIVE); | ||
| private static final String META_TAG_START = "<meta"; | ||
| private static final String CHARSET_EQ = "charset="; |
The regex allows white space between "charset" and "=".
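To illustrate the difference, the sketch below runs the removed HTML5 pattern (copied from the diff above) and a plain search for the literal "charset=" against the same input. The class and method names are hypothetical, and the literal search is a simplification of the patch, not its actual code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetWhitespaceDemo {
  // The HTML5 pattern as shown in the removed lines of the diff.
  static final Pattern HTML5 = Pattern.compile(
      "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
      Pattern.CASE_INSENSITIVE);

  // Returns the charset captured by the regex, or null if there is no match.
  static String viaRegex(String html) {
    Matcher m = HTML5.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  // Simplified stand-in for a scanner that looks for the literal "charset=".
  static boolean literalFound(String html) {
    return html.toLowerCase(Locale.ROOT).contains("charset=");
  }

  public static void main(String[] args) {
    String spaced = "<meta charset = \"UTF-8\" />";
    System.out.println(viaRegex(spaced));     // UTF-8
    System.out.println(literalFound(spaced)); // false
  }
}
```

A scanner that wants to stay compatible with the regex would need to skip whitespace both before and after the "=" sign.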
| Path data = new Path(new Path(out, ParseData.DIR_NAME), name); | ||
| Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name); | ||
| final String[] parseMDtoCrawlDB = conf.get("db.parsemeta.to.crawldb", "") |
It's a string from the configuration, it's controlled by the user and should be short. It's parsed once, so this is certainly not critical.
Apart from that, the code should simply rely on getTrimmedStrings.
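For reference, the comma-split plus trim approach described in the PR can be sketched in plain Java as follows. This is not the patch's `getParseMetaToCrawlDBKeys()` and not Hadoop's implementation of getTrimmedStrings; `splitAndTrim` is a hypothetical helper, and dropping empty segments is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class CommaListDemo {
  // Splits on commas, trims each segment, and skips segments that are empty after trimming.
  static String[] splitAndTrim(String value) {
    List<String> keys = new ArrayList<>();
    if (value == null) {
      return new String[0];
    }
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        keys.add(trimmed);
      }
    }
    return keys.toArray(new String[0]);
  }

  public static void main(String[] args) {
    for (String key : splitAndTrim(" lang , author,,title ")) {
      System.out.println("[" + key + "]");
    }
  }
}
```

Splitting on a single literal comma avoids the nested quantifiers of " *, *", and relying on an existing configuration helper would remove the need for custom parsing altogether.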
| return null; | ||
| } | ||
| int start = idx + CHARSET_EQ.length(); | ||
| while (start < s.length() && (s.charAt(start) == ' ' || s.charAt(start) == '\t')) { |
\\s matches more characters than blank (U+0020) and the tab character.
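As a sketch of the gap: without the UNICODE_CHARACTER_CLASS flag, Java's `\s` matches space, tab, line feed, vertical tab, form feed, and carriage return. The hypothetical helpers below contrast the two-character check from the patch with a check that covers the same set as `\s`.

```java
class WhitespaceDemo {
  // Mirrors the check in the patch: only space and tab.
  static boolean isBlankOrTab(char c) {
    return c == ' ' || c == '\t';
  }

  // Covers the same characters as Java's default \s: [ \t\n\x0B\f\r].
  static boolean isRegexWhitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == 0x0B || c == '\f' || c == '\r';
  }

  // Advances past whitespace starting at index start; returns the first non-whitespace index.
  static int skipWhitespace(String s, int start) {
    while (start < s.length() && isRegexWhitespace(s.charAt(start))) {
      start++;
    }
    return start;
  }

  public static void main(String[] args) {
    String value = "charset=\n\r utf-8";
    int afterEq = value.indexOf('=') + 1;
    System.out.println(value.substring(skipWhitespace(value, afterEq)));
  }
}
```

A meta tag that is wrapped across lines after the "=" sign would be handled by the regex and by `skipWhitespace`, but not by a loop that only skips space and tab.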


PR to address NUTCH-3161. This patch addresses the following Security Hotspots:
High
(false positives): Exclude plugin resource directories from analysis in sonar-project.properties. No Java code lives in conf, data, or sample under src/plugin/**, so these paths are excluded from scanning.
Medium
(src/java/.../ParseOutputFormat.java): Replace regex " *, *" for db.parsemeta.to.crawldb with comma-split + trim in getParseMetaToCrawlDBKeys(). Add TestParseOutputFormat with tests for empty, single, comma-separated, trim, and empty-segment handling.
metaPattern, charsetPattern, and charsetPatternHTML5. Use linear string parsing in extractCharsetFromMeta() / extractCharsetValue() for HTML4 and HTML5 meta charset detection. Add testExtractCharsetFromMeta in existing TestHtmlParser.
STRING_PATTERN and URI_PATTERN. Use extractQuotedStrings() and looksLikeUri() (linear scan / simple checks). Add testExtractQuotedStrings and testLooksLikeUri in existing TestJSParseFilter.
URL_PATTERN and AUTHORITY_PATTERN. Use java.net.URI for URL structure and parseAuthority() for host/port. Remove unused isBlankOrNull. Add testParseAuthority in existing TestUrlValidator.
Thanks for any review.