Add text comparison and visualisation for results #8

Closed
Timmimim opened this issue Nov 6, 2017 · 3 comments
Comments

@Timmimim
Contributor

Timmimim commented Nov 6, 2017

The Checker should be able to compare text chunks in papers, to find differences in their content beyond just images.
See also Issue #1.

The Checker is currently only programmed to handle HTML files. A textual Check should focus on the HTML's <body>.
The body includes minor style elements and CSS, the actual text of the paper, and images. Since the Checker is already able to find and extract images, it can reasonably treat the remainder of the body as text.
So the Checker already discriminates successfully between the two major parts of the body.

I would propose running text comparisons using string edit distance algorithms such as Levenshtein or similar (see Issue #1). There are some fast and stable implementations available on npm.
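
To make the idea concrete, here is a minimal, dependency-free sketch of Levenshtein distance. It is not taken from the Checker or from any particular npm package; the function names `levenshtein` and `similarity` are made up for illustration, and a maintained npm implementation would probably be the better choice in practice.

```javascript
// Levenshtein distance using two rolling rows.
// Time is proportional to a.length * b.length, memory to b.length.
// Strings are compared per UTF-16 code unit, which is fine for a sketch.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1);

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute (or keep, if equal)
      );
    }
    [prev, curr] = [curr, prev]; // reuse the two rows
  }
  return prev[b.length];
}

// Hypothetical helper: scale the distance to a similarity in [0, 1].
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}
```

A normalised score like `similarity` may be easier to compare across paragraphs of different lengths than the raw distance.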

The runtime of these algorithms inflates massively (!) for larger text chunks: the standard dynamic-programming formulation of Levenshtein takes time proportional to the product of the two string lengths. Therefore I would suggest dividing the text into chunks, e.g. paragraph by paragraph. This may prove problematic, however, if one of the papers has e.g. an extra paragraph somewhere in the middle. That would 'blow up' the results, since every pair of paragraphs following the extra (or missing) paragraph would be misaligned and would most likely differ by a lot.
Text comparison must keep such cases in mind!
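
One possible way around the extra-paragraph problem, sketched here as an assumption rather than a decided design, is to align the two paragraph lists first and only then compare the pairs that line up. The sketch below uses a longest-common-subsequence (LCS) alignment on exact paragraph equality; `alignParagraphs` and the shape of its result are hypothetical.

```javascript
// Align two arrays of paragraphs via LCS, so that an inserted or removed
// paragraph does not shift every later comparison by one position.
function alignParagraphs(left, right) {
  const m = left.length;
  const n = right.length;

  // lcs[i][j] = length of the LCS of left[i..] and right[j..]
  const lcs = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = m - 1; i >= 0; i--) {
    for (let j = n - 1; j >= 0; j--) {
      lcs[i][j] = left[i] === right[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start and emit one operation per paragraph.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < m && j < n) {
    if (left[i] === right[j]) {
      ops.push({ type: 'equal', left: i, right: j });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: 'removed', left: i });
      i++;
    } else {
      ops.push({ type: 'added', right: j });
      j++;
    }
  }
  while (i < m) ops.push({ type: 'removed', left: i++ });
  while (j < n) ops.push({ type: 'added', right: j++ });
  return ops;
}
```

With this alignment, a paragraph that was merely edited shows up as a `removed` entry next to an `added` entry; such neighbouring pairs are the natural candidates for the more expensive edit-distance comparison.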

The <head>s usually contain base64-encoded JavaScript and plain CSS. I'm not sure whether checking these makes sense, but they could be useful as additional information.

Furthermore, string edit distance algorithms quantify differences, but do nothing to show where they are.
So there should be a visualisation for the quantified differences. This could be done by plotting them; however, such plots need to represent the differences and their meaning comprehensibly, which may be tricky.
A better way may be to visualise differences in the UI as part of the diff-HTML view. Difference highlighting in the style of git or vimdiff would be very helpful to make differences visible and easy to comprehend.
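
As a rough illustration of such highlighting, the sketch below turns a list of diff parts into HTML with `<ins>` and `<del>` elements. The part shape (`value`, `added`, `removed`) is an assumption modelled on what word-level diff tools commonly return; `renderDiff`, `escapeHtml`, and the CSS class names are hypothetical and not part of the Checker.

```javascript
// Escape text so that paper content cannot inject markup into the diff view.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render diff parts as HTML: additions in <ins>, removals in <del>,
// unchanged text as is.
function renderDiff(parts) {
  return parts
    .map((part) => {
      const text = escapeHtml(part.value);
      if (part.added) return `<ins class="diff-added">${text}</ins>`;
      if (part.removed) return `<del class="diff-removed">${text}</del>`;
      return text;
    })
    .join('');
}

// Example input: a single number changed in a sentence.
const exampleParts = [
  { value: 'The mean is ' },
  { value: '4.2', removed: true },
  { value: '4.7', added: true },
  { value: '.' },
];
console.log(renderDiff(exampleParts));
```

Styling `ins.diff-added` green and `del.diff-removed` red would then give a view similar to what git-style diffs show.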

Finally, quantified differences and probably the position of these differences should be added to the Check-Result JSON.
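
What this could look like is sketched below. The field names (`differencesFound`, `numChanges`, `changes`, `offset`) are purely hypothetical and do not describe the actual Check-Result JSON; the input is again assumed to be a list of parts with `value`, `added`, and `removed`.

```javascript
// Summarise diff parts into a plain object that could be merged into a
// result JSON. Offsets are character positions in the original (left) text.
function summariseDiff(parts) {
  const changes = [];
  let offset = 0;
  for (const part of parts) {
    if (part.added) {
      // Added text does not exist in the original, so the offset stays put.
      changes.push({ type: 'added', offset, value: part.value });
    } else {
      if (part.removed) {
        changes.push({ type: 'removed', offset, value: part.value });
      }
      // Unchanged and removed text both occupy space in the original.
      offset += part.value.length;
    }
  }
  return {
    differencesFound: changes.length > 0,
    numChanges: changes.length,
    changes,
  };
}

const summary = summariseDiff([
  { value: 'The mean is ' },
  { value: '4.2', removed: true },
  { value: '4.7', added: true },
]);
console.log(JSON.stringify(summary, null, 2));
```

Keeping the positions next to the counts means the UI could jump to each difference without recomputing the diff.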

@nuest
Member

nuest commented Nov 17, 2017

I suggest starting with the diff module to compare the words (JsDiff.diffWords) in the body of the HTML files. It should be possible to integrate this into the existing HTML diff; see https://www.npmjs.com/package/diff#examples

@nuest
Member

nuest commented Mar 12, 2018

Also, add an example of an Rmd file that produces different texts based on some random component, including a case where only a single number in a sentence changes.

@Timmimim
Contributor Author

Timmimim commented Aug 3, 2018

#20 includes a first text diff visualization.

@nuest nuest closed this as completed Oct 16, 2018