matching optimization strategies #821
Replies: 4 comments 8 replies
-
Pre-select rules

I think in certain use cases a fair amount of rules can be pre-filtered. As a simple example, for ELF files we can easily ignore all rules that require the Windows OS at their root level. Pruning the rules first could get more complicated, so let's prove that checking fewer rules really provides performance benefits (Willi's last paragraph above is really key here).
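A minimal sketch of what root-level pruning could look like. The `Rule`, `Os`, and `root_features` names below are illustrative stand-ins, not capa's real classes, and the sketch assumes OS requirements sit directly under the rule's root:

```python
# Sketch: drop rules whose root-level OS requirement can never match the sample.
# `Rule` and `Os` are simplified stand-ins for capa's rule/feature model.
from dataclasses import dataclass


@dataclass
class Os:
    value: str  # e.g. "windows", "linux"


@dataclass
class Rule:
    name: str
    root_features: list  # features found directly under the rule's root node


def prune_rules(rules: list[Rule], sample_os: str) -> list[Rule]:
    """Drop rules that require an OS other than the sample's (e.g. windows-only rules for an ELF)."""
    kept = []
    for rule in rules:
        required_os = [f.value for f in rule.root_features if isinstance(f, Os)]
        if required_os and sample_os not in required_os:
            continue  # this rule can never match the sample; skip it entirely
        kept.append(rule)
    return kept


# e.g. for an ELF sample:
# candidate_rules = prune_rules(all_rules, sample_os="linux")
```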
-
Function selection

From experience (or imagined experience), really large functions (those with a lot of features) slow down feature extraction and rule matching. Maybe we can select functions to analyze in a smarter way. via @mr-tz
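One rough way this could work, assuming we can get a cheap size proxy (such as a basic block count) before doing full feature extraction; the threshold and `FunctionHandle` shape below are illustrative, not capa's real API:

```python
# Sketch: analyze small functions first and defer (or skip) very large ones.
from dataclasses import dataclass

MAX_BASIC_BLOCKS = 5000  # arbitrary cutoff; should be chosen by profiling real samples


@dataclass
class FunctionHandle:
    address: int
    basic_block_count: int


def select_functions(functions):
    """Yield small functions first; defer very large ones to the end (or skip them)."""
    deferred = []
    for f in functions:
        if f.basic_block_count > MAX_BASIC_BLOCKS:
            deferred.append(f)  # huge functions dominate extraction/matching time
        else:
            yield f
    # analyze the deferred giants last, e.g. only if a time budget remains
    yield from deferred
```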
-
Query optimizer

Do a pre-pass over the rule instances and re-order the logic nodes such that "cheaper" features are checked first. I assume that it's cheaper to check a hash-backed feature, such as mnemonic or number, than a scanning feature, such as regex or substring. Therefore, re-order the logic nodes in each rule such that cheaper operations are done first. (This assumes each node has an equal chance of matching/failing, which is not true, but not something we expect rule authors to exploit.) Nodes like

Proposed optimization rules (brainstormed, not profiled):
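To make the idea concrete, here is a sketch of such a pre-pass over a simplified node model. `Statement`, `Leaf`, and the cost table are illustrative guesses (not capa's classes and not profiled numbers):

```python
# Sketch: sort each statement's children so cheaper subtrees are evaluated first.
from dataclasses import dataclass, field

FEATURE_COST = {
    "mnemonic": 1,    # hash-backed lookups: cheap
    "number": 1,
    "api": 2,
    "substring": 10,  # scanning features: expensive
    "regex": 20,
}


@dataclass
class Leaf:
    feature_type: str
    value: object = None


@dataclass
class Statement:
    kind: str                              # "and", "or", ...
    children: list = field(default_factory=list)


def node_cost(node) -> int:
    """Estimate how expensive it is to evaluate this subtree."""
    if isinstance(node, Statement):
        return sum(node_cost(c) for c in node.children)
    return FEATURE_COST.get(node.feature_type, 5)  # default for unknown feature types


def optimize(node):
    """Recursively reorder children so cheaper operations are checked first."""
    if isinstance(node, Statement):
        for child in node.children:
            optimize(child)
        node.children.sort(key=node_cost)
    return node


# e.g. `or: [regex(...), mnemonic(xor)]` becomes `or: [mnemonic(xor), regex(...)]`
rule = Statement("or", [Leaf("regex", "/crypt/i"), Leaf("mnemonic", "xor")])
optimize(rule)
assert [c.feature_type for c in rule.children] == ["mnemonic", "regex"]
```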
-
Bottom-up matching

This is a recursive application of Pre-select rules. We currently match from "top down"; that is, from the root node down to the leaf nodes. However, most features of most rules will not match, so I think we spend a lot of time looking for things that don't exist. We should instead work with what we have: as we encounter each feature, figure out which logic trees are satisfied, working from leaf to root. When a root is satisfied, a rule has matched.

Prepare an indexed data structure built from all logic nodes across all rules. Index the leaf nodes by feature. Provide an accessor from child to parent. As each feature is encountered, use the index to find the nodes that are satisfied and mark them as such. Bring their parents into consideration for matching (add them to the nodes-by-feature index). Once a root node is marked as satisfied, the rule has matched. Careful: handle rule dependencies, too.

It's not exactly clear to me when bottom-up matching will outperform top-down matching. Intuitively, when there are fewer logic nodes than features, I think we'd want top-down matching; but when there are fewer features than logic nodes, I think we'd want bottom-up matching. We need to measure this. This is one of the more involved strategies, as it requires a lot of new code, so let's make sure it works. It might be possible to use both matching strategies and pick whichever is appropriate based on the feature set size (e.g. use one for the bb scope and the other for the file scope).
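A sketch of the index and leaf-to-root propagation over a simplified node model. Class and field names are illustrative, not capa's actual classes, and rule dependencies (`match:` of another rule) are not handled here:

```python
# Sketch: index leaves by feature, then propagate satisfaction from leaves toward roots.
from collections import defaultdict


class Node:
    def __init__(self, kind, children=(), feature=None):
        self.kind = kind                  # "and", "or", or "leaf"
        self.children = list(children)
        self.feature = feature            # only set on leaves
        self.parent = None                # child -> parent accessor
        self.satisfied = False
        self.satisfied_children = 0
        for child in self.children:
            child.parent = self


def build_index(roots):
    """Index every leaf node by its feature so observed features map straight to nodes."""
    index = defaultdict(list)
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node.kind == "leaf":
            index[node.feature].append(node)
        stack.extend(node.children)
    return index


def propagate(node, matched_roots):
    """Mark `node` satisfied and walk toward the root, satisfying parents where possible."""
    while node is not None and not node.satisfied:
        node.satisfied = True
        parent = node.parent
        if parent is None:
            matched_roots.add(node)       # a satisfied root means its rule matched
            return
        parent.satisfied_children += 1
        if parent.kind == "or" or parent.satisfied_children == len(parent.children):
            node = parent                 # parent is now satisfied; keep climbing
        else:
            return


def match(roots, observed_features):
    """Return the set of root nodes (rules) satisfied by the observed features."""
    index = build_index(roots)
    matched = set()
    for feature in set(observed_features):
        for leaf in index.get(feature, ()):
            propagate(leaf, matched)
    return matched


# e.g.: rule = and(mnemonic(xor), or(api(CryptEncrypt), substring("AES")))
rule = Node("and", [Node("leaf", feature="mnemonic(xor)"),
                    Node("or", [Node("leaf", feature="api(CryptEncrypt)"),
                                Node("leaf", feature='substring("AES")')])])
print(match([rule], ["mnemonic(xor)", 'substring("AES")']))  # the root is satisfied
```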
-
Issue #602 describes how capa can feel slow because it does so much work to match the ~700 rules we distribute. We want capa to run faster for at least two reasons:
This thread will capture strategies that we investigate to improve the performance of capa. There are some obvious categories, such as:
We should be careful to guide our research with profiling and experimentation; it's easy to philosophize about algorithms and get detached from reality. In order to accept a strategy, we should be able to show how much faster it is for a reasonable workload. We should also consider the complexity of implementation and maintenance.
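A minimal measurement harness along these lines, assuming the `capa` CLI is on PATH and `sample.exe_` is a local test binary (both are assumptions); wall-clock timing like this should back any claimed speedup, with cProfile on the Python entry point for a per-function breakdown:

```python
# Sketch: time repeated capa runs against a sample and report median/min wall-clock time.
import statistics
import subprocess
import time


def time_capa(sample: str, runs: int = 3) -> list[float]:
    """Run capa against a sample several times and record wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["capa", sample], check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    times = time_capa("sample.exe_")
    print(f"median: {statistics.median(times):.2f}s  min: {min(times):.2f}s")
```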