
Consider to use the Edit distance for string comparison #1

Open
ghost opened this issue Apr 7, 2017 · 3 comments

Comments

@ghost

ghost commented Apr 7, 2017

For measuring the extent of difference in strings, we could use an algorithm for the Edit distance, e.g. the Levenshtein distance.

Doing so might enable us to better compare HTML or PDF files, or segments of text within them, that are close but not exact matches. It might also be a means of quantifying the comparison result, in contrast to checksums, which can only be compared in a boolean way.
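As a quick illustration of the suggestion, here is a minimal sketch of the Levenshtein distance in Python, using the standard dynamic-programming recurrence (just a sketch, not code from this project):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep only the previous row of the DP table to save memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3, which gives exactly the kind of graded "extent of difference" a checksum cannot provide.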

@Timmimim
Contributor

I have just been reading up on edit distance, especially on the Levenshtein distance and Hamming distance algorithms, and they do seem like a very good idea for large parts of the HTML documents.
However, I would only suggest running these algorithms on non-base64-encoded parts of the document, or at the very least not on images.

I was doing some tests where I altered images only very slightly, e.g. drew a line on them or cropped them, then encoded both the original and the altered file to base64 and compared them via vimdiff.
The base64 strings have next to nothing in common, even after only very small alterations.

The same goes for base64 strings from textual files. In a test, adding only a single . added a total of 9 characters to the base64 code, and a typo-riddled file compared to a correctly spelled one was entirely unrecognizable.
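This divergence is easy to reproduce. The sketch below (illustrative only, with made-up sample strings) encodes two inputs that differ by a single inserted character and measures how far the encodings still agree:

```python
import base64
from itertools import takewhile

original = b"compare me carefully"
altered  = b"compare me. carefully"  # a single '.' inserted mid-string

enc_a = base64.b64encode(original).decode()
enc_b = base64.b64encode(altered).decode()

# Base64 packs the input into 6-bit groups, so an insertion shifts
# every group after it: the encodings agree up to the insertion
# point and then diverge almost entirely.
shared = sum(1 for _ in takewhile(lambda p: p[0] == p[1],
                                  zip(enc_a, enc_b)))
```

Here `shared` covers only the encoded prefix before the insertion; everything after it differs even though the raw strings are nearly identical.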

I would suggest we restrict our result quantification to non-base64 parts of the HTML, then, and factor out graphical elements etc.

Thoughts?

@ghost
Author

ghost commented Apr 11, 2017

Comparing base64-encoded texts will potentially show large differences in the output even for minor differences in the input, because the algorithm works position-based on segments of 6 bits that are encoded via a replacement table. So a shift in length has great effects, even though the patterns repeat themselves. This might still produce more similarity than hashing, because cryptographic hashing algorithms are designed to show the avalanche effect, where slight changes to the input always result in significant changes to the output.
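The avalanche effect mentioned above can be observed directly with any cryptographic hash; a small sketch using SHA-256 from Python's standard library (example inputs are made up):

```python
import hashlib

a = hashlib.sha256(b"identical except one char").hexdigest()
b = hashlib.sha256(b"identical except one chat").hexdigest()

# Count hex positions where the two digests happen to agree. For a
# cryptographic hash this stays near chance level (1/16 per position),
# no matter how similar the inputs are -- unlike base64, where the
# prefix before a change survives intact.
matches = sum(x == y for x, y in zip(a, b))
```

So an edit distance over hash digests is meaningless, while over base64 it is merely unreliable.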

According to your observations the edit distance might indeed be useful for comparison, so we need a way of identifying base64 passages in order to exclude them properly. In HTML this might be done by finding the div and span class identifiers. A PDF might need to be decoded first.
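If the base64 payloads are embedded as data: URIs (a common pattern in self-contained HTML, though this issue does not confirm the exact markup), a simple way to exclude them might look like this hypothetical sketch:

```python
import re

# Matches inline base64 data URIs, e.g. <img src="data:image/png;base64,...">.
DATA_URI = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")

def strip_base64(html: str) -> str:
    """Replace embedded base64 payloads with a placeholder so that
    string comparison only sees the textual parts of the document."""
    return DATA_URI.sub("data:<stripped>", html)
```

Stripping rather than deleting keeps the surrounding markup aligned, so the edit distance is computed over comparable documents.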

Anyway, we now have some methods that provide different levels of comparison:

  • bitwise (hashes, checksums)
  • stringwise (edit distance)
  • imagewise (perceptual hashes)
  • intellectual comparison (manually by user/expert)

I think the order of this list also reflects ascending degrees of implementation difficulty. The ERC checker should perhaps be able to decide which method is appropriate for each case and then facilitate the work for the user as much as possible.
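The escalation through those levels could be sketched roughly as follows. This is a hypothetical structure, not the checker's actual code; difflib's SequenceMatcher stands in for a proper edit-distance routine, and the image-wise and intellectual levels are collapsed into a "manual" fallback:

```python
import difflib
import hashlib

def compare(a: bytes, b: bytes) -> dict:
    """Escalate from cheap to more expensive comparison levels."""
    # Level 1: bitwise -- checksums answer "identical or not" only.
    if hashlib.sha256(a).digest() == hashlib.sha256(b).digest():
        return {"identical": True, "method": "checksum"}
    # Level 2: stringwise -- quantify the difference for text inputs.
    try:
        ta, tb = a.decode(), b.decode()
    except UnicodeDecodeError:
        # Binary data: defer to image-wise / intellectual comparison.
        return {"identical": False, "method": "manual"}
    similarity = difflib.SequenceMatcher(None, ta, tb).ratio()
    return {"identical": False, "method": "string",
            "similarity": similarity}
```

The point is the control flow, not the specific metrics: each level only runs when the cheaper one could not settle the question.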

@nuest
Member

nuest commented Apr 11, 2017

I like the levels of comparison. Each can take different suitable input; for example, if bitwise comparison is used, then images (= the base64 content in HTML) should be ignored.

These levels can also run one after the other with increasing complexity.
That said, we should start with the simplest one, and I don't expect us to handle more than bitwise and stringwise for now...

nuest pushed a commit to nuest/erc-checker that referenced this issue Aug 3, 2017
nuest pushed a commit to nuest/erc-checker that referenced this issue Mar 11, 2020