Explanation in README.md should be more descriptive #47
@sany2k8 - You will likely want to strip out any syntax that is for transport or formatting before trying to compute hashes. Additionally, it should be fairly obvious from the example: `a` and `b` are the hashes of your 'documents' in this case. If you have JSON documents, you will want to pass in the object property that represents the textual content you want to compare.
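For example, something along these lines - the `body` field, the whitespace tokenization, and the `hash()`-based token hashing here are all just placeholders for whatever suits your documents:

```python
import ctypes
import json

import simhash

# A hypothetical JSON document - 'body' stands in for whichever property
# actually holds the text you care about comparing.
raw = '{"id": 1, "body": "I\'ve been working on the railroad, all the live long day"}'
text = json.loads(raw)['body']

# Hash each token as an unsigned 64-bit int and combine them into one simhash.
tokens = text.lower().split()
a = simhash.compute([ctypes.c_ulong(hash(token)).value for token in tokens])
```

(As discussed further down in the thread, you'll likely want shingles rather than bare tokens, but this is the basic shape.)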
Unfortunately, it's a bit more complicated than that. Getting good results with simhash is highly dependent on the input you give to `compute`. It's an old post, but our original post on the subject will help to provide a lot of context for what I'll talk about here (https://moz.com/devblog/near-duplicate-detection/).

The input to `compute` is a list of unsigned 64-bit integer hashes representing the features of your document, so the first question is how to turn a document into those features. Tokenizing the text is a great place to start. This probably involves removing cruft, perhaps punctuation, maybe doing some canonicalization like downcasing where appropriate, and perhaps unicode normalization. This is far from a complete tokenization method, but it's good enough for us to use to talk about:

```python
def tokenize(doc):
    import re
    return re.split(r'\s+', re.sub(r'[^\w\s]', '', doc.lower()))
```

When we tokenize our string ("I've been working on the railroad, all the live long day"), we get:

```python
['ive', 'been', 'working', 'on', 'the', 'railroad', 'all', 'the', 'live', 'long', 'day']
```

We could just take the hashes of each of these words and then use that as the input. In some ways, it is representative of our document - if we were to change the words, it would change the vector of hashes. But simhash doesn't care about the order of the input hashes, so it doesn't capture the order of the words:

```python
similar = "I've been working, all the live long day, on the railroad!"
different = "And now for something completely different"
document = "I've been working on the railroad, all the live long day"
def makeSimhash(doc):
    import ctypes
    # They need to be unsigned 64-bit ints
    return simhash.compute([ctypes.c_ulong(hash(token)).value for token in tokenize(doc)])
>>> makeSimhash(similar)
10533350132719349034L
>>> makeSimhash(document)
10533350132719349034L
>>> makeSimhash(different)
9341659116513922L
# These first two documents would be considered identical with this hashing scheme
>>> simhash.num_differing_bits(makeSimhash(similar), makeSimhash(document))
0
>>> simhash.num_differing_bits(makeSimhash(different), makeSimhash(document))
28
```

So, while we do get different outputs for very different inputs, there's still the problem of word order. The easiest way to solve this is to use overlapping windows of tokens: take a few consecutive tokens at a time (four, in the example below) and get a hash of each window.
These are often referred to as shingles because of the way they span multiple tokens and overlap. To this end, there's a `shingle` helper included in the package. For our example, that gives us:

```python
>>> list(simhash.shingle(tokenize(document), 4))
[['ive', 'been', 'working', 'on'], ['been', 'working', 'on', 'the'], ['working', 'on', 'the', 'railroad'], ['on', 'the', 'railroad', 'all'], ['the', 'railroad', 'all', 'the'], ['railroad', 'all', 'the', 'live'], ['all', 'the', 'live', 'long'], ['the', 'live', 'long', 'day']]
```

Let's adjust our `makeSimhash` function to hash these shingles instead of individual tokens.
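Something along these lines works - joining each window back into a single string before hashing is just one reasonable way to get one hash per shingle:

```python
def makeSimhash(doc):
    import ctypes
    # Hash each overlapping window of four tokens rather than each token on its
    # own, so that word order now contributes to the fingerprint.
    windows = [' '.join(window) for window in simhash.shingle(tokenize(doc), 4)]
    # They still need to be unsigned 64-bit ints
    return simhash.compute([ctypes.c_ulong(hash(window)).value for window in windows])
```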
And now we see that these are indeed different (fewer differing bits means more similar):

```python
>>> simhash.num_differing_bits(makeSimhash(similar), makeSimhash(document))
26
>>> simhash.num_differing_bits(makeSimhash(different), makeSimhash(document))
34
```

The threshold for similarity is also something that you need to figure out, and it too is contextually dependent. Ideally, you'd take a corpus of documents of the kind you'll be comparing, categorize the matching pairs within it, and then do some analysis of which window sizes work best for detection. You also need to consider the precision and recall (https://en.wikipedia.org/wiki/Precision_and_recall) of your window choice. To determine the matching pairs of a corpus, you would probably want to use the `find_all` helper (there's a quick sketch below). Hopefully this provides y'all with some idea of how to at least get going with `simhash`. Let me know if you've got any other questions!
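A quick sketch of that, continuing with the examples above (the corpus, the block count, and the bit threshold are all illustrative - check the README for the parameters your installed version expects):

```python
# Simhashes for a whole corpus, built with the shingled makeSimhash above.
corpus = [document, similar, different]
hashes = [makeSimhash(doc) for doc in corpus]

# find_all(hashes, number_of_blocks, number_of_differing_bits) returns the pairs
# of input simhashes that differ in at most the given number of bits.
matches = simhash.find_all(hashes, 6, 3)
```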
Apologies @dlecocq - I was under the (obviously incorrect) assumption that tokenization was happening under the hood automagically (as it were) when calling `compute`.

Great write-up though - thanks for the contribution!
@wmelton - I believe there indeed was a time when that was included. As I recall it, we removed it in part for simplicity and in part because we didn't want to take too many liberties / make too many assumptions about the inputs. This way it seems less like magic, but admittedly it makes it less clear. I think the right balance is probably one in which a helper utility or two just works in a basic way on text.
Yep - that makes perfect sense.
This project is awesome, but when I try to use it for my purpose - detecting near-duplicate documents, e.g. JSON - I can't find enough information on how to do that. The README only shows how to compute the number of differing bits between two hashes `a` and `b`, or how to find matches among a set of hashes, but before that, how can I make the hashes of my documents in the first place? Can anyone update the README.md, or post a full step-by-step example/tutorial on implementing simhash with Python?