Add text comparison and visualisation for results #8

Timmimim · 2017-11-06T10:57:05Z

The Checker should be able to compare text chunks in papers to find differences in their content beyond just images.
See also Issue #1.

The Checker currently only handles HTML files. A textual check should focus on the HTML's <body>.
The body includes minor style elements and CSS, the actual text of the paper, and images. Since the Checker can already find and extract images, it may as well assume that the remainder of the body is mostly text.
So the Checker already discriminates successfully between the two major parts of the body.

I propose running text comparisons with string edit distance algorithms such as Levenshtein or similar (Issue #1). There are some fast and stable implementations available on npm.
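
To make the idea concrete, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming formulation with two rolling rows. The function name `levenshtein` is only illustrative; for the Checker itself, one of the existing npm implementations would probably be the better choice.

```javascript
// Levenshtein distance: the minimum number of single-character
// insertions, deletions and substitutions needed to turn a into b.
// Time is proportional to a.length * b.length; memory to b.length.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1);

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i; // turning i chars into an empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    [prev, curr] = [curr, prev]; // reuse the two rows
  }
  return prev[b.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
```

The raw distance depends on text length, so a normalised value such as `1 - distance / Math.max(a.length, b.length)` may be easier to interpret in the results.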

The runtime of these algorithms grows massively (!) for larger text chunks: the standard Levenshtein algorithm takes time proportional to the product of the two text lengths. Therefore I would suggest dividing the text into chunks, e.g. paragraph-wise. This, however, may prove problematic if one of the papers has, for example, an extra paragraph somewhere in the middle. That would 'blow up' the results, since every pair of paragraphs following the extra (or missing) paragraph would most likely differ by a lot.
Text comparison must keep such eventualities in mind!
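
One possible way to handle an extra or missing paragraph is to align the paragraphs first and only then compare them pairwise. The sketch below aligns two paragraph lists with a longest-common-subsequence (LCS) table, so an inserted paragraph is reported once instead of shifting every later pair. It assumes paragraphs are separated by blank lines and treats only exactly identical paragraphs as matches; `splitParagraphs` and `alignParagraphs` are made-up names.

```javascript
// Split plain text into paragraphs on blank lines (an assumption about the input).
function splitParagraphs(text) {
  return text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Align two paragraph lists via LCS. Returns a list of operations:
//   { type: "equal", a, b } | { type: "removed", a } | { type: "added", b }
// where a and b are indices into the first and second list.
function alignParagraphs(a, b) {
  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start to recover the alignment.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: "equal", a: i, b: j });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: "removed", a: i });
      i++;
    } else {
      ops.push({ type: "added", b: j });
      j++;
    }
  }
  while (i < a.length) ops.push({ type: "removed", a: i++ });
  while (j < b.length) ops.push({ type: "added", b: j++ });
  return ops;
}

const original = ["Intro", "Methods", "Results"];
const reproduced = ["Intro", "Extra", "Methods", "Results"];
console.log(alignParagraphs(original, reproduced).map((op) => op.type));
// [ 'equal', 'added', 'equal', 'equal' ]
```

A paragraph that was merely edited shows up here as a `removed` entry next to an `added` entry; such neighbouring pairs would be the natural candidates for a finer-grained comparison.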

The <head>s usually contain base64-encoded JavaScript and plain CSS. I'm not sure whether checking these makes sense, but they could be useful as additional information.

Furthermore, string edit distance algorithms quantify differences, but do nothing for highlighting.
So there should be a visualisation for the quantified differences. This could be done by plotting them, but such plots need to represent the differences and their meaning comprehensibly, which may be tricky.
A better way may be to visualise differences in the UI as part of the diff-HTML view. Difference highlighting in the style of git or vimdiff would be very helpful to make differences visible and easy to comprehend.
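
As a rough illustration of such highlighting, the sketch below turns a list of diff parts into HTML, wrapping additions in `<ins>` and removals in `<del>`. It assumes each part is an object with a `value` string and optional `added` / `removed` flags; the function names and CSS class names are made up for this example.

```javascript
// Escape text so that it can be placed safely inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render diff parts as HTML: additions in <ins>, removals in <del>,
// unchanged text as-is.
function renderDiff(parts) {
  return parts
    .map((part) => {
      const text = escapeHtml(part.value);
      if (part.added) return `<ins class="diff-added">${text}</ins>`;
      if (part.removed) return `<del class="diff-removed">${text}</del>`;
      return text;
    })
    .join("");
}

const parts = [
  { value: "The mean is " },
  { value: "4.2", removed: true },
  { value: "4.7", added: true },
  { value: " units." },
];
console.log(renderDiff(parts));
// The mean is <del class="diff-removed">4.2</del><ins class="diff-added">4.7</ins> units.
```

Styling the two classes with a red and a green background would give a view similar to what git or vimdiff users are used to.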

Finally, quantified differences and probably the position of these differences should be added to the Check-Result JSON.
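
A sketch of what that could look like: the function below walks a list of diff parts (objects with a `value` string and optional `added` / `removed` flags) and records each difference with its character offset. All field names (`textDiff`, `oldOffset`, `newOffset`, ...) are hypothetical, since the actual Check-Result schema is not spelled out here.

```javascript
// Summarise diff parts into a JSON-serialisable object.
// Offsets are character positions: oldOffset refers to the original
// text, newOffset to the reproduced text.
function summarizeDiff(parts) {
  let oldPos = 0;
  let newPos = 0;
  const differences = [];

  for (const part of parts) {
    const length = part.value.length;
    if (part.added) {
      differences.push({ type: "added", newOffset: newPos, length, text: part.value });
      newPos += length; // added text exists only in the new version
    } else if (part.removed) {
      differences.push({ type: "removed", oldOffset: oldPos, length, text: part.value });
      oldPos += length; // removed text exists only in the old version
    } else {
      oldPos += length; // unchanged text advances both positions
      newPos += length;
    }
  }

  return {
    textDiff: {
      differencesFound: differences.length > 0,
      count: differences.length,
      differences,
    },
  };
}

const result = summarizeDiff([
  { value: "The mean is " },
  { value: "4.2", removed: true },
  { value: "4.7", added: true },
  { value: " units." },
]);
console.log(JSON.stringify(result, null, 2));
```

In this example both differences are reported at offset 12, the position right after "The mean is ".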

nuest · 2017-11-17T10:07:51Z

I suggest starting with the diff module to compare the words (JsDiff.diffWords) in the body of the HTML files. It should be possible to integrate this into the existing HTML diff, see https://www.npmjs.com/package/diff#examples
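
To illustrate what a word-level comparison reports, here is a small self-contained sketch. It is not the diff module's implementation, just a plain LCS over whitespace-separated tokens; `tokenize` and `wordDiff` are made-up names, and the `value` / `added` / `removed` fields are chosen to resemble the change objects described in the package README.

```javascript
// Split text into words and whitespace runs, keeping the whitespace
// so that the original text can be reassembled from the tokens.
function tokenize(text) {
  return text.split(/(\s+)/).filter((t) => t !== "");
}

// Word-level diff via longest common subsequence.
// Returns parts of the form { value, added, removed }.
function wordDiff(oldText, newText) {
  const a = tokenize(oldText);
  const b = tokenize(newText);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const parts = [];
  // Append a token, merging it into the previous part if the kind matches.
  const push = (value, added, removed) => {
    const last = parts[parts.length - 1];
    if (last && last.added === added && last.removed === removed) {
      last.value += value;
    } else {
      parts.push({ value, added, removed });
    }
  };

  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      push(a[i], false, false);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      push(a[i], false, true);
      i++;
    } else {
      push(b[j], true, false);
      j++;
    }
  }
  while (i < a.length) push(a[i++], false, true);
  while (j < b.length) push(b[j++], true, false);
  return parts;
}

console.log(wordDiff("The mean is 4.2 units.", "The mean is 4.7 units."));
// Four parts: "The mean is " (unchanged), "4.2" (removed),
// "4.7" (added), " units." (unchanged)
```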

nuest · 2018-03-12T09:04:27Z

Also, add an example of an Rmd file that produces different text based on some random component, including a case where only a single number in a sentence changes.

Timmimim · 2018-08-03T09:57:06Z

#20 includes a first text diff visualization.

nuest closed this as completed Oct 16, 2018
