Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Results of simhash.find_all() #44

Open
bikashg opened this issue May 23, 2018 · 7 comments
Open

Results of simhash.find_all() #44

bikashg opened this issue May 23, 2018 · 7 comments

Comments

@bikashg
Copy link

bikashg commented May 23, 2018

I have the following code where 'a' and 'b' have the same value. However, they are not considered as simhash pairs by the simhash.find_all() method.

a = 8550830854347186281
b = 8550830854347186281

print ("Inputs differ in "+str(simhash.num_differing_bits(a, b))+" bits.")

all_simhash_pairs = simhash.find_all([a,b], 2, 1)
print ("Simhash Pair counts = "+str(len(all_simhash_pairs)))

Shouldn't they be included in the results? Or, maybe I didn't understand simhashing properly.

@dlecocq
Copy link
Contributor

dlecocq commented May 23, 2018

They should, but the current implementation in simhash-cpp is the trouble. I believe @neilmb recognized this fact a while back.

I seem to recall the use of std::unordered_map as being a deliberate choice (https://github.com/seomoz/simhash-cpp/blob/master/src/simhash.cpp#L57), but at the moment I don't see why it couldn't be a std::vector. Certainly in the body of the code a std::vector is used.

@b4hand
Copy link
Contributor

b4hand commented May 23, 2018

Oddly, this line seems to indicate that the input is a vector. However, at least as far as the repo history is concerned the argument has always been an unordered_map.

@b4hand
Copy link
Contributor

b4hand commented May 23, 2018

It also seems odd to me that the input hashes are passed as a non-const reference, especially since the code doesn't seem to modify that argument directly, but rather only copies from it.

@MatFluor
Copy link

Any news on that?

I came across the same behavior - two identical hashes are not seen as duplicates by find_all

@dlecocq
Copy link
Contributor

dlecocq commented May 21, 2019

It seems to me that the easiest fix is to change it from unordered_set to vector in simhash-cpp's find_all. I have the bandwidth for review, but not the PR itself, and it doesn't look like I have merge powers on either repo, either :-/

@nslallu
Copy link

nslallu commented Jun 15, 2020

Any update on this issue? Would be great if it's fixed.

@sushilr007
Copy link

@dlecocq I changed unordered_set to vector but getting an error while installaing

simhash.cpp

Simhash::matches_t Simhash::find_all(
    std::vector<Simhash::hash_t>& hashes,
    size_t number_of_blocks,
    size_t different_bits)
{

Getting error:

simhash/simhash-cpp/src/simhash.cpp:59:26: error: ‘Simhash::matches_t Simhash::find_all(std::vector<long unsigned int>&, size_t, size_t)’ should have been declared inside ‘Simhash’
   59 |     size_t different_bits)
      |                          ^

can you show me how you made it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants