Consider using the edit distance for string comparison #1
Comments
I have just been reading up on edit distance, especially the Levenshtein and Hamming distance algorithms, and they do seem like a very good idea for large parts of HTML documents. I ran some tests where I altered images only very slightly, e.g. drew a line on them or cropped them, then encoded the original and the altered file to base64 and compared them via vimdiff. The same goes for base64 strings from textual files: in one test, adding only a single "." added a total of 9 characters to the base64 code, and a typo-riddled file compared to a correctly spelled one was entirely unrecognizable. I would therefore suggest we restrict our result quantification to the non-base64 parts of the HTML and factor out graphical elements etc. Thoughts?
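A minimal sketch of the effect described above: a single-character edit in plain text yields an edit distance of 1, but comparing the base64 encodings of the same texts inflates the distance, because every byte after the insertion point shifts into different 6-bit groups. The Levenshtein implementation and the sample strings are illustrative, not part of any existing checker code.

```python
import base64

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

original = "The quick brown fox jumps over the lazy dog"
altered = original[:10] + "." + original[10:]  # one inserted character

plain_dist = levenshtein(original, altered)
b64_dist = levenshtein(
    base64.b64encode(original.encode()).decode(),
    base64.b64encode(altered.encode()).decode(),
)
print(plain_dist, b64_dist)  # the base64 distance is much larger than 1
```

Inserting the character mid-string (rather than appending it) shows the worst case: all subsequent 3-byte groups re-align, so the base64 tails diverge almost completely.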
Comparing base64-encoded texts will potentially show large differences in the output even for minor differences in the input, because the algorithm works position-based on segments of 6 bits that are encoded via a replacement table. A shift in length therefore has great effects, even though the patterns repeat themselves. This might still produce more similarity than hashing, since cryptographic hash algorithms are designed to show the avalanche effect, where slight changes in the input always result in significant changes in the output. According to your observations, the edit distance might indeed be useful for comparison, so we need a way of identifying base64 passages in order to exclude them properly. In HTML this might be done by finding the div and span class identifiers; a PDF might need to be decoded first. Anyway, we now have some methods that provide different levels of comparison:
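One way to sketch the "exclude base64 passages" idea, assuming the payloads appear as data URIs or long runs of base64 alphabet characters; the regular expressions and the function name are assumptions, not a complete parser or part of the checker:

```python
import re

# Data URIs as they appear in src/href attributes,
# e.g. src="data:image/png;base64,iVBOR..." (assumed form).
DATA_URI = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=\s]+")
# Any other suspiciously long run of base64 alphabet characters.
LONG_B64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}={0,2}")

def strip_base64(html: str) -> str:
    """Replace probable base64 passages with a fixed placeholder so that
    differing embedded images do not dominate the distance measure."""
    html = DATA_URI.sub("data:<base64 omitted>", html)
    return LONG_B64_RUN.sub("<base64 omitted>", html)

html = '<p>Hello</p><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg==">'
print(strip_base64(html))
```

Replacing each payload with the same placeholder (rather than deleting it) keeps the surrounding markup aligned, so two documents that differ only in an embedded image still compare as near-identical.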
I think the order of this list also reflects ascending degrees of implementation difficulty. The ERC checker should perhaps be able to decide which method is appropriate for each case and then facilitate the work for the user as much as possible.
I like the levels of comparison. For each we can have different suitable input; for example, if bitwise comparison is used, then images (i.e. the base64 content in the HTML) should be ignored. These levels can also run one after the other with increasing complexity.
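The "levels run one after the other" idea could look roughly like this: a cheap boolean checksum comparison first, and a quantified comparison only when the checksums already disagree. Here `difflib.SequenceMatcher` stands in for the quantifying step as an edit-distance-like similarity in [0, 1]; the function name and return shape are hypothetical.

```python
import difflib
import hashlib

def compare(a: bytes, b: bytes) -> dict:
    # Level 1: checksum comparison -- boolean, but very cheap.
    if hashlib.sha256(a).digest() == hashlib.sha256(b).digest():
        return {"identical": True, "similarity": 1.0}
    # Level 2: quantify how different the inputs are (SequenceMatcher's
    # ratio is a similarity score between 0.0 and 1.0).
    sim = difflib.SequenceMatcher(None, a, b).ratio()
    return {"identical": False, "similarity": sim}

print(compare(b"abc", b"abc"))
print(compare(b"abcdefgh", b"abcdefgi"))
```

Further, more expensive levels (e.g. a structure-aware HTML comparison after stripping base64) could be appended to the same cascade.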
For measuring the extent of difference between strings, we could use an edit distance algorithm, e.g. the Levenshtein distance.
Doing so might enable us to better compare HTML or PDF files, or segments of text within them that are close but not exact matches. It might also be a means of quantifying the comparison result, in contrast to checksums, which are compared in a boolean way.