
Consider to use the Edit distance for string comparison #1

Open
ghost opened this issue Apr 7, 2017 · 3 comments

Comments

@ghost

ghost commented Apr 7, 2017

For measuring the extent of difference in strings, we could use an algorithm for the Edit distance, e.g. the Levenshtein distance.

Doing so might enable us to better compare HTML or PDF files, or segments of text within them, that are close but not exact matches. It might also be a means of quantifying the comparison result, in contrast to checksums, which can only be compared in a boolean way.
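As a quick illustration of the suggestion, here is a minimal sketch of the Levenshtein distance in Python, using the standard dynamic-programming recurrence (just a sketch, not code from this project):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep only the previous row of the DP table to save memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3, which gives exactly the kind of graded "extent of difference" a checksum cannot provide.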

@Timmimim
Contributor

I have just been reading up on edit distance, especially on the Levenshtein distance and Hamming distance algorithms, and they do seem like a very good idea for large parts of the HTML documents.
However, I would only suggest running these algorithms on non-base64-encoded parts of the document, or at the very least not on images.

I was doing some tests where I altered images only very slightly, e.g. drew a line on them or cropped them, then encoded both the original and the altered file to base64 and compared them via vimdiff.
The base64 strings have next to nothing in common, even after only very small alterations.

The same goes for base64 strings from textual files. In a test, adding only a single . added a total of 9 characters to the base64 code, and a typo-riddled file compared to a correctly spelled one was entirely unrecognizable.
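This divergence is easy to reproduce. The sketch below (illustrative only, with made-up sample strings) encodes two inputs that differ by a single inserted character and measures how far the encodings still agree:

```python
import base64
from itertools import takewhile

original = b"compare me carefully"
altered  = b"compare me. carefully"  # a single '.' inserted mid-string

enc_a = base64.b64encode(original).decode()
enc_b = base64.b64encode(altered).decode()

# Base64 packs the input into 6-bit groups, so an insertion shifts
# every group after it: the encodings agree up to the insertion
# point and then diverge almost entirely.
shared = sum(1 for _ in takewhile(lambda p: p[0] == p[1],
                                  zip(enc_a, enc_b)))
```

Here `shared` covers only the encoded prefix before the insertion; everything after it differs even though the raw strings are nearly identical.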

I would suggest we restrict our result quantification to non-base64 parts of the HTML, then, and factor out graphical elements etc.

Thoughts?

@ghost
Author

ghost commented Apr 11, 2017

Comparing base64-encoded texts will potentially show large differences in the output even for minor differences in the input, because the algorithm works position-based on segments of 6 bits that are encoded via a replacement table. So a shift in length has great effects, even though the patterns repeat themselves. This might still produce more similarity than hashing, because cryptographic hashing algorithms are designed to show the avalanche effect, where slight changes to the input always result in significant changes to the output.
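The avalanche effect mentioned above can be observed directly with any cryptographic hash; a small sketch using SHA-256 from Python's standard library (example inputs are made up):

```python
import hashlib

a = hashlib.sha256(b"identical except one char").hexdigest()
b = hashlib.sha256(b"identical except one chat").hexdigest()

# Count hex positions where the two digests happen to agree. For a
# cryptographic hash this stays near chance level (1/16 per position),
# no matter how similar the inputs are -- unlike base64, where the
# prefix before a change survives intact.
matches = sum(x == y for x, y in zip(a, b))
```

So an edit distance over hash digests is meaningless, while over base64 it is merely unreliable.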

According to your observations the edit distance might indeed be useful for comparison, so we need a way of identifying base64 passages in order to exclude them properly. In HTML this might be done by finding the div and span class identifiers. A PDF might need to be decoded first.
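If the base64 payloads are embedded as data: URIs (a common pattern in self-contained HTML, though this issue does not confirm the exact markup), a simple way to exclude them might look like this hypothetical sketch:

```python
import re

# Matches inline base64 data URIs, e.g. <img src="data:image/png;base64,...">.
DATA_URI = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")

def strip_base64(html: str) -> str:
    """Replace embedded base64 payloads with a placeholder so that
    string comparison only sees the textual parts of the document."""
    return DATA_URI.sub("data:<stripped>", html)
```

Stripping rather than deleting keeps the surrounding markup aligned, so the edit distance is computed over comparable documents.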

Anyway, we now have some methods that provide different levels of comparison:

  • bitwise (hashes, checksums)
  • stringwise (edit distance)
  • imagewise (perceptual hashes)
  • intellectual comparison (manually by user/expert)

I think the order of this list also reflects ascending degrees of implementation difficulty. The ERC checker should perhaps be able to decide which method is appropriate for each case and then facilitate the work for the user as much as possible.
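The escalation through those levels could be sketched roughly as follows. This is a hypothetical structure, not the checker's actual code; difflib's SequenceMatcher stands in for a proper edit-distance routine, and the image-wise and intellectual levels are collapsed into a "manual" fallback:

```python
import difflib
import hashlib

def compare(a: bytes, b: bytes) -> dict:
    """Escalate from cheap to more expensive comparison levels."""
    # Level 1: bitwise -- checksums answer "identical or not" only.
    if hashlib.sha256(a).digest() == hashlib.sha256(b).digest():
        return {"identical": True, "method": "checksum"}
    # Level 2: stringwise -- quantify the difference for text inputs.
    try:
        ta, tb = a.decode(), b.decode()
    except UnicodeDecodeError:
        # Binary data: defer to image-wise / intellectual comparison.
        return {"identical": False, "method": "manual"}
    similarity = difflib.SequenceMatcher(None, ta, tb).ratio()
    return {"identical": False, "method": "string",
            "similarity": similarity}
```

The point is the control flow, not the specific metrics: each level only runs when the cheaper one could not settle the question.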

@nuest
Member

nuest commented Apr 11, 2017

I like the levels of comparison. Each can take different suitable input; for example, if bitwise comparison is used, then images (= the base64 content in HTML) should be ignored.

These levels can also run one after the other with increasing complexity.
That said, we should start with the simplest one, and I don't expect us to handle more than bitwise and stringwise for now...

nuest pushed a commit to nuest/erc-checker that referenced this issue Aug 3, 2017
nuest pushed a commit to nuest/erc-checker that referenced this issue Mar 11, 2020