-
Notifications
You must be signed in to change notification settings - Fork 235
add word boundaries to regex patterns to reduce false positives #1125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,81 +15,81 @@ rule: | |
| - al-khaser_x86.exe_ | ||
| features: | ||
| - or: | ||
| - string: /ollydbg(\.exe)?/i | ||
| - string: /ProcessHacker(\.exe)?/i | ||
| - string: /tcpview(\.exe)?/i | ||
| - string: /autoruns(\.exe)?/i | ||
| - string: /autorunsc(\.exe)?/i | ||
| - string: /filemon(\.exe)?/i | ||
| - string: /procmon(\.exe)?/i | ||
| - string: /regmon(\.exe)?/i | ||
| - string: /procexp(\.exe)?/i | ||
| - string: /\bollydbg(\.exe)?\b/i | ||
| - string: /\bProcessHacker(\.exe)?\b/i | ||
| - string: /\btcpview(\.exe)?\b/i | ||
| - string: /\bautoruns(\.exe)?\b/i | ||
| - string: /\bautorunsc(\.exe)?\b/i | ||
| - string: /\bfilemon(\.exe)?\b/i | ||
| - string: /\bprocmon(\.exe)?\b/i | ||
| - string: /\bregmon(\.exe)?\b/i | ||
| - string: /\bprocexp(\.exe)?\b/i | ||
| - string: /(?<!\w)ida[gqtuw]?(\.exe)?$/i | ||
| - string: /ida[gqtuw]?64(\.exe)?$/i | ||
| - string: /ImmunityDebugger(\.exe)?/i | ||
| - string: /Wireshark(\.exe)?/i | ||
| - string: /dumpcap(\.exe)?/i | ||
| - string: /HookExplorer(\.exe)?/i | ||
| - string: /ImportREC(\.exe)?/i | ||
| - string: /PETools(\.exe)?/i | ||
| - string: /LordPE(\.exe)?/i | ||
| - string: /SysInspector(\.exe)?/i | ||
| - string: /proc_analyzer(\.exe)?/i | ||
| - string: /sysAnalyzer(\.exe)?/i | ||
| - string: /sniff_hit(\.exe)?/i | ||
| - string: /windbg(\.exe)?/i | ||
| - string: /joeboxcontrol(\.exe)?/i | ||
| - string: /joeboxserver(\.exe)?/i | ||
| - string: /ResourceHacker(\.exe)?/i | ||
| - string: /x32dbg(\.exe)?/i | ||
| - string: /x64dbg(\.exe)?/i | ||
| - string: /Fiddler(\.exe)?/i | ||
| - string: /httpdebugger(\.exe)?/i | ||
| - string: /fakenet(\.exe)?/i | ||
| - string: /netmon(\.exe)?/i | ||
| - string: /WPE PRO(\.exe)?/i | ||
| - string: /decompile(\.exe)?/i | ||
| - string: /scylla/i | ||
| - string: /megadumper/i | ||
| - string: /apdagent(\.exe)?/i | ||
| - string: /apimonitor(\.exe)?/i | ||
| - string: /azurearcsystray(\.exe)?/i | ||
| - string: /binaryninja(\.exe)?/i | ||
| - string: /burpsuite(\.exe)?/i | ||
| - string: /charles\.exe/i | ||
| - string: /cutter(\.exe)?/i | ||
| - string: /dbgx\.shell(\.exe)?/i | ||
| - string: /df5serv(\.exe)?/i | ||
| - string: /frida(\.exe)?/i | ||
| - string: /httpanalyzerv7(\.exe)?/i | ||
| - string: /httpdebuggerui(\.exe)?/i | ||
| - string: /netcat(\.exe)?/i | ||
| - string: /pin\.exe/i | ||
| - string: /prl_tools(\.exe)?/i | ||
| - string: /qemu-ga(\.exe)?/i | ||
| - string: /rammap(\.exe)?/i | ||
| - string: /rammap64(\.exe)?/i | ||
| - string: /rdpclip(\.exe)?/i | ||
| - string: /tasklist/i | ||
| - string: /cred-store(\.exe)?/i | ||
| - string: /decoder\.exe/i | ||
| - string: /dnspy(\.exe)?/i | ||
| - string: /drrun(\.exe)?/i | ||
| - string: /dumpit(\.exe)?/i | ||
| - string: /frida-inject(\.exe)?/i | ||
| - string: /frida-server(\.exe)?/i | ||
| - string: /gdb\.exe/i | ||
| - string: /httpdebuggersvc(\.exe)?/i | ||
| - string: /ilspy(\.exe)?/i | ||
| - string: /inetsim(\.exe)?/i | ||
| - string: /ksdumper(\.exe)?/i | ||
| - string: /ksdumperclient(\.exe)?/i | ||
| - string: /mitmdump(\.exe)?/i | ||
| - string: /pestudio(\.exe)?/i | ||
| - string: /private-cloud-proxy(\.exe)?/i | ||
| - string: /process\.exe/i | ||
| - string: /r2\.exe/i | ||
| - string: /rekall(\.exe)?/i | ||
| - string: /tcpdump(\.exe)?/i | ||
| - string: /windasm(\.exe)?/i | ||
| - string: /x32dbgn(\.exe)?/i | ||
| - string: /\bImmunityDebugger(\.exe)?\b/i | ||
| - string: /\bWireshark(\.exe)?\b/i | ||
| - string: /\bdumpcap(\.exe)?\b/i | ||
| - string: /\bHookExplorer(\.exe)?\b/i | ||
| - string: /\bImportREC(\.exe)?\b/i | ||
| - string: /\bPETools(\.exe)?\b/i | ||
| - string: /\bLordPE(\.exe)?\b/i | ||
| - string: /\bSysInspector(\.exe)?\b/i | ||
| - string: /\bproc_analyzer(\.exe)?\b/i | ||
| - string: /\bsysAnalyzer(\.exe)?\b/i | ||
| - string: /\bsniff_hit(\.exe)?\b/i | ||
| - string: /\bwindbg(\.exe)?\b/i | ||
| - string: /\bjoeboxcontrol(\.exe)?\b/i | ||
| - string: /\bjoeboxserver(\.exe)?\b/i | ||
| - string: /\bResourceHacker(\.exe)?\b/i | ||
| - string: /\bx32dbg(\.exe)?\b/i | ||
| - string: /\bx64dbg(\.exe)?\b/i | ||
| - string: /\bFiddler(\.exe)?\b/i | ||
| - string: /\bhttpdebugger(\.exe)?\b/i | ||
| - string: /\bfakenet(\.exe)?\b/i | ||
| - string: /\bnetmon(\.exe)?\b/i | ||
| - string: /\bWPE PRO(\.exe)?\b/i | ||
| - string: /\bdecompile(\.exe)?\b/i | ||
| - string: /\bscylla\b/i | ||
| - string: /\bmegadumper\b/i | ||
| - string: /\bapdagent(\.exe)?\b/i | ||
| - string: /\bapimonitor(\.exe)?\b/i | ||
| - string: /\bazurearcsystray(\.exe)?\b/i | ||
| - string: /\bbinaryninja(\.exe)?\b/i | ||
| - string: /\bburpsuite(\.exe)?\b/i | ||
| - string: /\bcharles\.exe\b/i | ||
| - string: /\bcutter(\.exe)?\b/i | ||
| - string: /\bdbgx\.shell(\.exe)?\b/i | ||
| - string: /\bdf5serv(\.exe)?\b/i | ||
| - string: /\bfrida(\.exe)?\b/i | ||
| - string: /\bhttpanalyzerv7(\.exe)?\b/i | ||
| - string: /\bhttpdebuggerui(\.exe)?\b/i | ||
| - string: /\bnetcat(\.exe)?\b/i | ||
| - string: /\bpin\.exe\b/i | ||
| - string: /\bprl_tools(\.exe)?\b/i | ||
| - string: /\bqemu-ga(\.exe)?\b/i | ||
| - string: /\brammap(\.exe)?\b/i | ||
| - string: /\brammap64(\.exe)?\b/i | ||
| - string: /\brdpclip(\.exe)?\b/i | ||
| - string: /\btasklist\b/i | ||
| - string: /\bcred-store(\.exe)?\b/i | ||
| - string: /\bdecoder\.exe\b/i | ||
| - string: /\bdnspy(\.exe)?\b/i | ||
| - string: /\bdrrun(\.exe)?\b/i | ||
| - string: /\bdumpit(\.exe)?\b/i | ||
| - string: /\bfrida-inject(\.exe)?\b/i | ||
| - string: /\bfrida-server(\.exe)?\b/i | ||
| - string: /\bgdb\.exe\b/i | ||
| - string: /\bhttpdebuggersvc(\.exe)?\b/i | ||
| - string: /\bilspy(\.exe)?\b/i | ||
| - string: /\binetsim(\.exe)?\b/i | ||
| - string: /\bksdumper(\.exe)?\b/i | ||
| - string: /\bksdumperclient(\.exe)?\b/i | ||
| - string: /\bmitmdump(\.exe)?\b/i | ||
| - string: /\bpestudio(\.exe)?\b/i | ||
| - string: /\bprivate-cloud-proxy(\.exe)?\b/i | ||
| - string: /\bprocess\.exe\b/i | ||
| - string: /\br2\.exe\b/i | ||
| - string: /\brekall(\.exe)?\b/i | ||
| - string: /\btcpdump(\.exe)?\b/i | ||
| - string: /\bwindasm(\.exe)?\b/i | ||
| - string: /\bx32dbgn(\.exe)?\b/i | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. with this change, the patterns are much harder to read as a human, which is unfortunate. this isn't a comment on the quality of the PR, just about what we're considering. it seems many of these regexes are regexes versus substrings so that we can match the optional extension in a single pass. we could replace these with something like: - substring: foo
- substring: foo.exethis would be much easier to read (the \b would happen behind the scenes), though presumably runtime would be longer, because there are twice as many regexes to run. with a little bit of benchmarking we could show this is or is not acceptable. thoughts @mr-tz @mike-hunhoff @Shaktisinhchavda ?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, the regexes are hard to audit. I propose refactoring them into simple substring pairs for the tool names and extensions. This prioritizes clarity while maintaining the same logic. I’m happy to update the PR this way if you agree.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, this is what i proposed above. please suggest a plan that includes benchmarking capa before/after to show that the runtime doesn't change substantially when introducing additional substring features.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in the scenario that runtime is worse, we can imagine an optimization to capa that translates an
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. agreed that data should drive the decision. Here is my plan for benchmarking the performance impact: --Test Set: Run capa against the complete capa-testfiles repository. I’ll share the results here once complete. I also agree that if we do see a performance hit, an engine-level optimization to merge OR-ed substrings would be an excellent long-term solution.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've completed the benchmarking using a 3-pass average against capa-testfiles. Results confirmed identical hits but showed a ~55% increase in average runtime (from 1.04s up to 1.62s per file). While the delta is ~0.6s, I believe the significant gain in readability and auditability justifies the overhead. We could potentially recover this performance later via the engine-level optimization mentioned. Is this trade-off acceptable to the team? @williballenthin |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,7 +14,7 @@ rule: | |
| features: | ||
| - or: | ||
| - api: EnumDeviceDrivers | ||
| - string: /driverquery(\.exe)?/i | ||
| - string: /\bdriverquery(\.exe)?\b/i | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how did you decide when to have a trailing \b and when not to?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The \b was intended to prevent partial matches like driverqueryhost. I agree it’s messy, so I propose switching to dual substring entries instead (e.g., driverquery and driverquery.exe). It provides the same protection with much better readability. What do you think? @williballenthin
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do you have one \b for some strings and two \b in other strings?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The inconsistency was an oversight during the update. Switching to the proposed dual substring approach will fix this entirely and ensure consistent whole-word matching across all rules. |
||
| - and: | ||
| - or: | ||
| - match: query or enumerate registry key | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @mr-tz!
-- IDA Regexes (Lines 27-28): Good catch. I missed these because they had slightly different structures (specifically the lookbehind and end-of-string anchor). I've updated them to "/\bida[gqtuw]?(.exe)?\b/i" and "/\bida[gqtuw]?64(.exe)?\b/i" for consistency with the rest of the file.
-- Substring Features: I primarily focused on existing regex string patterns in this pass. However, searching the ruleset, I see several substring features (like ifconfig, TouchSocket, and CHttpModule) that would definitely benefit from word boundaries to reduce false positives. I'll start converting these to regex string features with \b boundaries as suggested.
I've pushed an update with the IDA fixes. Should I include the substring conversions in this PR, or would you prefer a separate one focused on that?