Adding Python 3 compatibility #27
Making sure that the `unsigned_hash` function is only called when `PYTHONHASHSEED` is set to 0 on Python >= 3.2
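A guard of this kind might look like the sketch below. It assumes a hypothetical `unsigned_hash` that masks the built-in `hash` to 64 bits; the project's actual implementation may differ. The helper name `hash_randomization_disabled` and the optional `environ` parameter are illustrative only and not part of simhash-py.

```python
import os
import sys


def hash_randomization_disabled(environ=None):
    """Return True if PYTHONHASHSEED is set to "0" in the given environment."""
    environ = os.environ if environ is None else environ
    return environ.get("PYTHONHASHSEED") == "0"


def unsigned_hash(obj, environ=None):
    """Hypothetical stand-in: built-in hash() masked to an unsigned 64-bit value.

    The environ argument exists only so the check can be exercised with a
    plain dict; by default the real process environment is inspected.
    """
    if sys.version_info >= (3, 2) and not hash_randomization_disabled(environ):
        raise RuntimeError(
            "hash() is randomized; set PYTHONHASHSEED=0 before starting Python"
        )
    return hash(obj) & 0xFFFFFFFFFFFFFFFF
```

Note that `PYTHONHASHSEED` has to be set before the interpreter starts (for example `PYTHONHASHSEED=0 python script.py`); changing `os.environ` inside a running process does not alter the seed already in use.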
@@ -7,10 +7,10 @@ def shingle(tokens, window=4):
     if window <= 0:
         raise ValueError('Window size must be positive')
     its = []
-    for number in xrange(window):
+    for number in range(window):
I think it would be nice to use either `future` or `six` to get equivalent behavior on Python 2 and 3 regarding `xrange` vs `range`.
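For illustration, the `six` variant of that suggestion could be sketched as follows. `six.moves.range` resolves to `xrange` on Python 2 and to the built-in `range` on Python 3. The alias `compat_range`, the helper `window_offsets`, and the `ImportError` fallback are assumptions made for this sketch, not code from the PR.

```python
try:
    # Lazy range on both Python 2 (xrange) and Python 3 (range).
    from six.moves import range as compat_range
except ImportError:
    # Assumption for this sketch only: six is not installed and we are on
    # Python 3, where the built-in range is already lazy.
    compat_range = range


def window_offsets(window=4):
    """Hypothetical helper: the offsets the shingle loop iterates over."""
    return [number for number in compat_range(window)]
```

With this import in place the loop body reads the same on both interpreter lines, and no large list is built on Python 2.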
@b4hand Thanks for the feedback. I added an import of range/xrange from six as you requested.
We still need to figure out what to do regarding the test failures before we can merge. I suspect that will get discussed further in #28.
I added a commit to increase the mach_threshold from 3 to 5 bits on Python 3.4-3.5 to pass the failing test. Would this be an acceptable solution? From my limited understanding, the implementation does not appear to be broken on those Python versions; that particular example simply produces 5 differing bits instead of 3 with the updated hash function. #28 can probably be addressed in a separate PR?
My recollection of that test is that it proves that a small change in the source text contributes to a small change in the hash. Doubling the bit difference may indicate an issue. I want to take a bit more time to look at the underlying values and make sure this is reasonable before merging it as is.
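The property described here can be expressed as a Hamming distance between two fingerprints. The sketch below is illustrative only: `hamming_distance` and `within_threshold` are hypothetical names, and the example values are made up rather than taken from the failing test.

```python
def hamming_distance(a, b):
    """Count the bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def within_threshold(a, b, max_bits):
    """Return True if the two fingerprints differ in at most max_bits bits."""
    return hamming_distance(a, b) <= max_bits
```

Under this reading, the question in the thread is whether two near-duplicate texts should still count as a match when their fingerprints differ in 5 bits rather than 3.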
@b4hand I was wondering if you had time to look into this Pull Request so it could be merged? I am aware that you had some concerns about the validity of the results with PY3; please let me know if I can do anything to address those (more tests etc). From a user perspective, having a PY3 compatible version of simhash-py would be much appreciated. For instance, @kfrancoi recently did some work on porting to Python 3 in the Sagacify#2 fork. So I imagine it would be better to have an official PY3 port (with, if necessary, some warning about the domain of validity) that takes your feedback into account and consolidates the work of different contributors, rather than multiple implementations in forks, each evolving independently. As a side note, as far as I understand, most of the calculations are done in your C++ library, so at least on Linux, where both PY2 and PY3 would use gcc to compile this extension, the version of Python shouldn't, I imagine, matter much beyond a few adjustments needed by this wrapper. On Windows, if the Microsoft Visual Studio compiler is used, with a version that depends on the Python version, that may be more problematic (cf. issue #29). Thanks again!
It appears we don't actually use
BTW, sorry for the delay on getting this merged, but I've been pulled in a bunch of different directions over the last few months.
Thanks a lot, @b4hand!
This PR adds Python 3 compatibility to simhash-py:
- The `shingle` generator runs on PY3 but never exits (infinite loop). Fixed that by replacing `map(next, its)` with a list comprehension.
- Hash randomization affects the built-in `hash` function, which makes the output of the `unsigned_hash` function non-deterministic, so the `TestFunctional.test_added_text` test may randomly fail. Added some checks to ensure that this randomization is disabled with the `PYTHONHASHSEED` environment variable (which is probably safer). Even so, the `TestFunctional.test_added_text` test still fails on Python 3.4-3.5, which may have something to do with the fact that the value returned by the built-in `hash` function changed in those versions (issue "Choice of the hashing function for shingles" #28).
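The infinite loop comes from `map` being lazy on Python 3: `map(next, its)` returns an iterator without calling `next`, so `StopIteration` is never raised inside the generator and the loop keeps yielding. A list comprehension calls `next` eagerly. The sketch below is a reconstruction under assumptions, not the code merged in this PR: materializing `tokens` into a list and the explicit `try`/`except` are additions for this example. The `except` clause is needed on Python 3.7+, where a `StopIteration` escaping a generator body is turned into a `RuntimeError` (PEP 479).

```python
def shingle(tokens, window=4):
    """Yield successive overlapping windows of `window` tokens."""
    if window <= 0:
        raise ValueError('Window size must be positive')
    # Assumption: copy the tokens so every iterator below starts from the
    # beginning, even if the caller passed a one-shot iterator.
    tokens = list(tokens)
    its = []
    for number in range(window):
        it = iter(tokens)
        # Advance the iterator by `number` positions; the default value
        # avoids StopIteration when there are fewer tokens than offsets.
        for _ in range(number):
            next(it, None)
        its.append(it)
    while True:
        try:
            # Eager evaluation: next() is called right here, so exhaustion
            # is detected immediately (unlike the lazy map(next, its)).
            yield [next(it) for it in its]
        except StopIteration:
            # Any exhausted iterator means no further full window exists.
            return
```

For example, `list(shingle('abcde', 3))` gives the three windows `['a', 'b', 'c']`, `['b', 'c', 'd']` and `['c', 'd', 'e']`, and an input shorter than the window yields nothing.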