Possible performance issue when performing look-ups for non-existent entries #21
Thanks for taking a look at the internals. My first question: Have You Measured It? If the profiler confirms your intuition, then we're onto something. But micro-optimization usually only makes sense once you have empirical evidence that the critical path needs tuning. For example, I could equally see an issue that we're stressing the allocator by computing the hashes into a vector in the first place. But without data confirming that hypothesis, I wouldn't spend any cycles proactively fixing low-impact code paths.
Yes, I have. The use case is large objects (e.g., strings of 32+ bytes), which take longer to hash. When an object is not in the Bloom filter, the expectation (mean) is that the miss will be detected by roughly the k/2-th hash/lookup. In the current implementation, however, all k hashes are computed before any bit is even examined, so it performs about 2x the expected required amount of hashing, and hashing is presumably a much larger fraction of the total operation time than testing the state of a bit in an array.
Given the design, both solutions are exposed to that particular inefficiency, but yes, that leads to another problem with regard to how the hasher operates. When dealing with either a small number of hashes or data small enough to be hashed inline (e.g., 2 or 8 bytes), the current strategy makes sense, as the loop can be trivially unrolled. Unfortunately, the Bloom filter would not necessarily know about the rules or policies the hasher has incorporated. A possible approach would be a hasher that takes `bits_` and the object from the Bloom filter and returns true/false/void depending on the operation, letting the hasher make the decisions. That would largely resolve the allocation issue for digests, since the digest buffer could simply be a member local to the hasher instance, but now you've got the problem that the lookup method is no longer thread-safe.
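The hasher-owned design described above could look roughly like the following sketch. All names here (`probing_hasher`, `add`, `contains`, the seeded `std::hash` digest derivation) are illustrative assumptions, not part of libbf's API; the point is that once the hasher drives the probing loop, it can exit early and keep any scratch state as a member.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical interface: the hasher receives the bit vector and the object
// and performs the probing itself, so it can stop at the first unset bit.
class probing_hasher {
public:
  explicit probing_hasher(size_t k) : k_(k) {}

  // Sets the k bits for obj.
  void add(std::vector<bool>& bits, std::string const& obj) const {
    for (size_t i = 0; i < k_; ++i)
      bits[digest(obj, i) % bits.size()] = true;
  }

  // Probes one bit at a time; a miss typically computes fewer than k digests.
  bool contains(std::vector<bool> const& bits, std::string const& obj) const {
    for (size_t i = 0; i < k_; ++i)
      if (!bits[digest(obj, i) % bits.size()])
        return false; // early exit: remaining digests are never computed
    return true;
  }

private:
  // Derives the i-th digest by varying the input per probe index
  // (a stand-in for a properly seeded hash family).
  static size_t digest(std::string const& obj, size_t i) {
    return std::hash<std::string>{}(obj + static_cast<char>(i));
  }

  size_t k_;
};
```

As noted, making the digest buffer (or any other scratch space) a member of the hasher trades the per-call allocation for lost thread safety on lookups.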
Great, thanks for sharing the results! An easy fix would involve double hashing: we compute exactly two hash functions and use a linear combination to produce more values as needed. Your current code uses the default hasher, which doesn't have that optimization. Another solution would involve lazy computation of hash values. Once I finally get some time to work more on this library, I'll try to add this to the N3980 branch, which will contain the new hashing API.
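The double-hashing scheme mentioned here (due to Kirsch and Mitzenmacher) derives the i-th probe position as a linear combination of two base digests, so the object is hashed only twice regardless of k. A minimal sketch, assuming two precomputed 64-bit digests `h1` and `h2` (the helper name is not libbf's API):

```cpp
#include <cstdint>

// Kirsch-Mitzenmacher double hashing: g_i(x) = h1(x) + i * h2(x) (mod m).
// Only two real hash computations are needed; the remaining probe
// positions are derived arithmetically.
inline uint64_t probe_position(uint64_t h1, uint64_t h2, uint64_t i,
                               uint64_t num_bits) {
  return (h1 + i * h2) % num_bits;
}
```

A lookup with k probes would compute `h1` and `h2` once and then evaluate `probe_position(h1, h2, i, m)` for i = 0..k-1, which also combines naturally with lazy, early-exit probing.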
Exactly. In particular, the number of hash functions is a runtime parameter, hence the vector representation of hash digests.
In the method `basic_bloom_filter::lookup`, the digests are computed first and only then quantized and looked up in the filter.
https://github.com/mavam/libbf/blob/master/src/bf/bloom_filter/basic.cc#L60
For efficiency, implementations typically have the lookup exit early on the first digest that encounters a miss, since computing a hash may cost more than the bit lookup itself.
This will probably only matter when queries/lookups for non-existent entries are expected to outnumber those for present ones.
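The proposed change can be sketched as follows: instead of materializing all k digests up front and then probing, compute each digest lazily and bail out at the first unset bit. This is an illustration under assumed names, not libbf's actual code; the seeded `std::hash` call stands in for the library's hasher.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Lazy, early-exit lookup: a miss is reported after the first probe that
// finds an unset bit, so the remaining k - (i + 1) digests are never computed.
bool lookup_early_exit(std::vector<bool> const& bits, std::string const& o,
                       size_t k) {
  for (size_t i = 0; i < k; ++i) {
    // i-th digest, derived by varying the input per probe index.
    size_t d = std::hash<std::string>{}(o + static_cast<char>(i));
    if (!bits[d % bits.size()])
      return false; // early exit on the first miss
  }
  return true;
}
```

For a negative query against a half-full filter, each probe hits an unset bit with roughly probability 1/2, so on average only a couple of digests need to be computed rather than all k.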