matching optimization strategies #821
Replies: 4 comments 8 replies
-
Pre-select rules

I think in certain use cases a fair amount of rules can be pre-filtered. As a simple example, for ELF files we can easily ignore all rules that require the Windows OS at their root level. Pruning the rules first could get more complicated, so let's prove that checking fewer rules really provides performance benefits (Willi's last paragraph above is really key here).
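A minimal sketch of what root-level pruning could look like. The `Rule`, `Os`, and `root_features` names below are illustrative stand-ins, not capa's real classes, and the sketch assumes OS requirements sit directly under the rule's root:

```python
# Sketch: drop rules whose root-level OS requirement can never match the sample.
# `Rule` and `Os` are simplified stand-ins for capa's rule/feature model.
from dataclasses import dataclass


@dataclass
class Os:
    value: str  # e.g. "windows", "linux"


@dataclass
class Rule:
    name: str
    root_features: list  # features found directly under the rule's root node


def prune_rules(rules: list[Rule], sample_os: str) -> list[Rule]:
    """Drop rules that require an OS other than the sample's (e.g. windows-only rules for an ELF)."""
    kept = []
    for rule in rules:
        required_os = [f.value for f in rule.root_features if isinstance(f, Os)]
        if required_os and sample_os not in required_os:
            continue  # this rule can never match the sample; skip it entirely
        kept.append(rule)
    return kept


# e.g. for an ELF sample:
# candidate_rules = prune_rules(all_rules, sample_os="linux")
```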
-
Function selection

From experience (or imagined experience), really large functions (those with a lot of features) slow down feature extraction and rule matching. Maybe we can select functions to analyze in a smarter way. via @mr-tz
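One rough way this could work, assuming we can get a cheap size proxy (such as a basic block count) before doing full feature extraction; the threshold and `FunctionHandle` shape below are illustrative, not capa's real API:

```python
# Sketch: analyze small functions first and defer (or skip) very large ones.
from dataclasses import dataclass

MAX_BASIC_BLOCKS = 5000  # arbitrary cutoff; should be chosen by profiling real samples


@dataclass
class FunctionHandle:
    address: int
    basic_block_count: int


def select_functions(functions):
    """Yield small functions first; defer very large ones to the end (or skip them)."""
    deferred = []
    for f in functions:
        if f.basic_block_count > MAX_BASIC_BLOCKS:
            deferred.append(f)  # huge functions dominate extraction/matching time
        else:
            yield f
    # analyze the deferred giants last, e.g. only if a time budget remains
    yield from deferred
```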
-
Query optimizer

Do a pre-pass over the rule instances and re-order the logic nodes such that "cheaper" features are checked first. I assume that it's cheaper to check a hash-backed feature, such as mnemonic or number, than a scanning feature, such as regex or substring. Therefore, re-order the logic nodes in each rule such that cheaper operations are done first. (This assumes each node has an equal chance of matching/failing, which is not true, but not something we expect rule authors to exploit.) Nodes like

Proposed optimization rules (brainstormed, not profiled):
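To make the idea concrete, here is a sketch of such a pre-pass over a simplified node model. `Statement`, `Leaf`, and the cost table are illustrative guesses (not capa's classes and not profiled numbers):

```python
# Sketch: sort each statement's children so cheaper subtrees are evaluated first.
from dataclasses import dataclass, field

FEATURE_COST = {
    "mnemonic": 1,    # hash-backed lookups: cheap
    "number": 1,
    "api": 2,
    "substring": 10,  # scanning features: expensive
    "regex": 20,
}


@dataclass
class Leaf:
    feature_type: str
    value: object = None


@dataclass
class Statement:
    kind: str                              # "and", "or", ...
    children: list = field(default_factory=list)


def node_cost(node) -> int:
    """Estimate how expensive it is to evaluate this subtree."""
    if isinstance(node, Statement):
        return sum(node_cost(c) for c in node.children)
    return FEATURE_COST.get(node.feature_type, 5)  # default for unknown feature types


def optimize(node):
    """Recursively reorder children so cheaper operations are checked first."""
    if isinstance(node, Statement):
        for child in node.children:
            optimize(child)
        node.children.sort(key=node_cost)
    return node


# e.g. `or: [regex(...), mnemonic(xor)]` becomes `or: [mnemonic(xor), regex(...)]`
rule = Statement("or", [Leaf("regex", "/crypt/i"), Leaf("mnemonic", "xor")])
optimize(rule)
assert [c.feature_type for c in rule.children] == ["mnemonic", "regex"]
```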
-
Bottom-up matching

This is a recursive application of Pre-select rules. We currently match from "top down"; that is, from the root node down to the leaf nodes. However, most features of most rules will not match, so I think we spend a lot of time looking for things that don't exist. We should instead work with what we have: as we encounter each feature, figure out which logic trees are satisfied, working from leaf to root. When a root is satisfied, a rule has matched.

Prepare an indexed data structure built from all logic nodes across all rules. Index the leaf nodes by feature. Provide an accessor from child to parent. As each feature is encountered, use the index to find the nodes that are satisfied and mark them as such. Bring their parents into consideration for matching (add them to the nodes-by-feature index). Once a root node is marked as satisfied, the rule has matched. Careful: handle rule dependencies, too.

It's not exactly clear to me when bottom-up matching will outperform top-down matching. Intuitively, when there are fewer logic nodes than features, I think we'd want top-down matching; but when there are fewer features than logic nodes, I think we'd want bottom-up matching. We need to measure this. This is one of the more involved strategies, as it requires a lot of new code, so let's make sure it works. It might be possible to use both matching strategies and pick whichever is appropriate based on the feature set size (e.g. use one for the bb scope and the other for the file scope).
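A sketch of the index and leaf-to-root propagation over a simplified node model. Class and field names are illustrative, not capa's actual classes, and rule dependencies (`match:` of another rule) are not handled here:

```python
# Sketch: index leaves by feature, then propagate satisfaction from leaves toward roots.
from collections import defaultdict


class Node:
    def __init__(self, kind, children=(), feature=None):
        self.kind = kind                  # "and", "or", or "leaf"
        self.children = list(children)
        self.feature = feature            # only set on leaves
        self.parent = None                # child -> parent accessor
        self.satisfied = False
        self.satisfied_children = 0
        for child in self.children:
            child.parent = self


def build_index(roots):
    """Index every leaf node by its feature so observed features map straight to nodes."""
    index = defaultdict(list)
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node.kind == "leaf":
            index[node.feature].append(node)
        stack.extend(node.children)
    return index


def propagate(node, matched_roots):
    """Mark `node` satisfied and walk toward the root, satisfying parents where possible."""
    while node is not None and not node.satisfied:
        node.satisfied = True
        parent = node.parent
        if parent is None:
            matched_roots.add(node)       # a satisfied root means its rule matched
            return
        parent.satisfied_children += 1
        if parent.kind == "or" or parent.satisfied_children == len(parent.children):
            node = parent                 # parent is now satisfied; keep climbing
        else:
            return


def match(roots, observed_features):
    """Return the set of root nodes (rules) satisfied by the observed features."""
    index = build_index(roots)
    matched = set()
    for feature in set(observed_features):
        for leaf in index.get(feature, ()):
            propagate(leaf, matched)
    return matched


# e.g.: rule = and(mnemonic(xor), or(api(CryptEncrypt), substring("AES")))
rule = Node("and", [Node("leaf", feature="mnemonic(xor)"),
                    Node("or", [Node("leaf", feature="api(CryptEncrypt)"),
                                Node("leaf", feature='substring("AES")')])])
print(match([rule], ["mnemonic(xor)", 'substring("AES")']))  # the root is satisfied
```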
-
Issue #602 describes how capa can feel slow because it does so much work to match the ~700 rules we distribute. We want capa to run faster for at least two reasons:
This thread will capture strategies that we investigate to improve the performance of capa. There are some obvious categories, such as:
We should be careful to guide our research with profiling and experimentation; it's easy to philosophize about algorithms and get detached from reality. In order to accept a strategy, we should be able to show how much faster it is for a reasonable workload. We should also consider the complexity of implementation and maintenance.
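A minimal measurement harness along these lines, assuming the `capa` CLI is on PATH and `sample.exe_` is a local test binary (both are assumptions); wall-clock timing like this should back any claimed speedup, with cProfile on the Python entry point for a per-function breakdown:

```python
# Sketch: time repeated capa runs against a sample and report median/min wall-clock time.
import statistics
import subprocess
import time


def time_capa(sample: str, runs: int = 3) -> list[float]:
    """Run capa against a sample several times and record wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["capa", sample], check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    times = time_capa("sample.exe_")
    print(f"median: {statistics.median(times):.2f}s  min: {min(times):.2f}s")
```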