Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Python 3 compatibility #27

Merged
merged 7 commits into from
Jan 17, 2017
Merged

Adding Python 3 compatibility #27

merged 7 commits into from
Jan 17, 2017

Conversation

rth
Copy link
Contributor

@rth rth commented Oct 17, 2016

This PR adds Python 3 compatibility to simhash-py,

  • initial porting to PY3 done by @rolando
  • updating Travis CI to test PY 3.2 - 3.5
  • The updated version of shingle generator runs on PY3, but never exists (infinite loop). Fixed that by replacing map(next, its) with list comprehension.
  • Starting from Python 3.2 a random seed is used with the Python hash function, which makes the output of unsigned_hash function non deterministic, and TestFunctional.test_added_text test may randomly fail. Adding some checks to ensure that this randomization is disabled with the PYTHONHASHSEED environment variable (which is probably safer). Despite it, the TestFunctional.test_added_text test still fails on Python 3.4-3.5, which may have something to do with the fact that the value returned by the buildin hash function changed in those versions (issue Choice of the hashing function for shingles #28 ).

@@ -7,10 +7,10 @@ def shingle(tokens, window=4):
if window <= 0:
raise ValueError('Window size must be positive')
its = []
for number in xrange(window):
for number in range(window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to use either future or six to get equivalent behavior on Python 2 and 3 regarding xrange vs range.

@rth
Copy link
Contributor Author

rth commented Oct 17, 2016

@b4hand Thanks for the feedback. I added an import of range/xrange from six as you requested.

@b4hand
Copy link
Contributor

b4hand commented Oct 17, 2016

We still need to figure out what to do regarding the test failures before we can merge. I suspect that will get discussed further in #28.

@rth
Copy link
Contributor Author

rth commented Oct 17, 2016

I added a commit to increase the mach_threshold from 3 to 5 bits on Python 3.4-3.5 to pass the failing test. Would this be an acceptable solution? I mean, from my limited understanding of this, it does not look like the implementation is not working for those Python versions, but just that particular example produces 5 different bits instead of 3 with the updated hash function. #28 can probably be addressed in a separate PR ?

@b4hand
Copy link
Contributor

b4hand commented Oct 17, 2016

My recollection of that test is that it proves that a small change in the source text contributes to a small change in the hash. Doubling the bit difference may indicate an issue. I want to take a bit more time to look at the underlying values and make sure this is reasonable before merging it as is.

@rth rth mentioned this pull request Oct 22, 2016
@rth
Copy link
Contributor Author

rth commented Jan 15, 2017

@b4hand I was wondering if you had time to look into this Pull Request so it could be merged? I am aware that you had some concerns about the validity of the results with PY3; please let me know if I can do anything to address those (more tests etc).

From a user perspective, having a PY3 compatible version for simhash-py would be much appreciated. For instance, @kfrancoi recently did some work on porting to Python 3 in Sagacify#2 fork. So I imagine it would better to have an official PY3 port (with if necessary some warning about the domain of validity) that would take into account your feedback and summarize work from different contributors, rather then multiple implementations in forks, each evolving independently..

As a side note, as far as I understood most of the calculations are done in your C++ library, so at least on Linux where both PY2 & PY3 would use gcc to compile this extension, the version of Python, I imagine, shouldn't matter so much beyond a few adjustments needed by this wrapper. On Windows, if the Microsoft Visual Studio compiler is used, with a version that depends on the Python version, that may be more problematic (cf. issue #29).

Thanks again!

@b4hand
Copy link
Contributor

b4hand commented Jan 17, 2017

It appears we don't actually use simhash.unsigned_hash, so this really only impacts the test. Given that, I'm willing to merge this as is with the expectation that simhash.unsigned_hash will probably be removed at some point in the future since I don't think it's a good idea to rely on the built-in Python hash implementation unless we're going to use one of the "standard" hashing functions that can be language agnostic.

@b4hand
Copy link
Contributor

b4hand commented Jan 17, 2017

BTW, sorry for the delay on getting this merged, but I've been pulled in a bunch of different directions over the last few months.

@b4hand b4hand merged commit 8923486 into seomoz:master Jan 17, 2017
@rth
Copy link
Contributor Author

rth commented Jan 18, 2017

Thanks a lot, @b4hand !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants