The Checker should be able to compare text chunks in papers, to find differences in their content beyond just images.
Compare Issue #1.
The Checker is currently only programmed to handle HTML files. A textual Check should focus on the HTML's `<body>`.
The body includes minor style elements and CSS, the actual text of the paper, and images. Since the Checker is already able to find and extract images, it may just as well assume that the remainder of the paper is mostly text.
So the Checker already discriminates successfully between the two major parts of the `body`.
I would propose running text comparisons using String Edit Distance algorithms such as Levenshtein or similar (Issue #1). There are some fast and stable implementations available in `npm`.
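To make the proposal concrete, here is a minimal, dependency-free sketch of the Levenshtein distance. It is only an illustration of the technique, not one of the npm implementations mentioned above; the function names `levenshtein` and `normalisedDistance` are made up for this example.

```javascript
// Levenshtein distance using two rolling rows:
// O(a.length * b.length) time, O(b.length) memory.
// Note: strings are compared per UTF-16 code unit.
function levenshtein(a, b) {
  // prev[j] = distance between the empty prefix of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: scale the distance to [0, 1] so that scores of
// differently sized chunks can be compared with each other.
function normalisedDistance(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}
```

A normalised score is probably easier to threshold than the raw distance, since a distance of 5 means something very different for a sentence than for a whole section.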
The runtime of these algorithms grows quadratically with text length (Levenshtein needs on the order of n × m steps for two strings of lengths n and m), so it inflates massively (!) for larger text chunks. Therefore I would suggest dividing the text into chunks, e.g. paragraph by paragraph. This may prove problematic, however, if one of the papers has, for example, an extra paragraph somewhere in the middle. A naive position-by-position pairing would 'blow up' the results, since every pair of paragraphs following the extra (or missing) paragraph would most likely differ by a lot.
Text comparison must keep such eventualities in mind!
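One possible way to guard against the extra-paragraph problem is to align the paragraphs first and only then compare matched pairs. The sketch below uses a longest-common-subsequence alignment over exact paragraph matches; this is an assumption on my side, not something the Checker does today, and `alignParagraphs` is a hypothetical name.

```javascript
// Align two paragraph lists via a longest common subsequence (LCS), so that
// one inserted or missing paragraph does not shift every later comparison.
function alignParagraphs(left, right) {
  const n = left.length;
  const m = right.length;
  // lcs[i][j] = LCS length of left[i..] and right[j..]
  const lcs = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i][j] = left[i] === right[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table from the start and emit one operation per paragraph.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (left[i] === right[j]) {
      ops.push({ type: "equal", left: i++, right: j++ });
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: "removed", left: i++ });
    } else {
      ops.push({ type: "added", right: j++ });
    }
  }
  while (i < n) ops.push({ type: "removed", left: i++ });
  while (j < m) ops.push({ type: "added", right: j++ });
  return ops;
}

const ops = alignParagraphs(
  ["Intro", "Methods", "Results"],
  ["Intro", "Extra", "Methods", "Results"]
);
console.log(ops.map((op) => op.type).join(","));
// equal,added,equal,equal
```

Paragraphs that were edited rather than inserted show up as a `removed` entry next to an `added` entry; those neighbours are the natural candidates for a character-level edit distance.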
The `<head>`s usually contain base64-encoded JavaScript and plain CSS. I'm not sure whether checking these makes sense, but they could be useful as additional information.
Furthermore, String Edit Distance algorithms quantify differences, but do nothing for highlighting.
So there should be a visualisation of the quantified differences. This could be done by plotting them, but such plots need to represent the differences and their meaning comprehensively, which may be tricky.
A better way may be to visualise differences in the UI as part of the diff-HTML view. Difference highlighting in the style of existing tools (git, vimdiff, ...) would be very helpful to make differences visible and easy to comprehend.
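As a rough idea of what such highlighting could look like, the following sketch computes a word-level diff and wraps changes in `<del>` and `<ins>` tags. It is a self-contained stand-in, not the existing diff-HTML view; the function names and the choice of tags are assumptions.

```javascript
// Escape text so it can be embedded safely in the diff-HTML view.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Word-level diff based on a longest common subsequence of tokens.
// Removed tokens are wrapped in <del>, added tokens in <ins>.
function highlightWordDiff(oldText, newText) {
  // Split into words and whitespace runs so the original spacing is kept.
  const a = oldText.split(/(\s+)/).filter(Boolean);
  const b = newText.split(/(\s+)/).filter(Boolean);
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  let html = "";
  let i = 0;
  let j = 0;
  while (i < a.length || j < b.length) {
    if (i < a.length && j < b.length && a[i] === b[j]) {
      html += escapeHtml(a[i]);
      i++;
      j++;
    } else if (j >= b.length || (i < a.length && lcs[i + 1][j] >= lcs[i][j + 1])) {
      html += `<del>${escapeHtml(a[i])}</del>`;
      i++;
    } else {
      html += `<ins>${escapeHtml(b[j])}</ins>`;
      j++;
    }
  }
  return html;
}

console.log(highlightWordDiff("The mean is 4.2 units", "The mean is 4.7 units"));
// The mean is <del>4.2</del><ins>4.7</ins> units
```

With a little CSS (red background for `del`, green for `ins`) this gives a view similar to what git or vimdiff show.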
Finally, quantified differences and probably the position of these differences should be added to the Check-Result JSON.
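A possible shape for that part of the result is sketched below. The actual Check-Result JSON schema is not specified in this issue, so every field name here (`textDifferences`, `paragraphsChanged`, `offset`, ...) is hypothetical, and the edit distance function is passed in rather than fixed.

```javascript
// Offset of the first character at which two strings differ, or -1 if identical.
function firstDifference(a, b) {
  const limit = Math.min(a.length, b.length);
  for (let k = 0; k < limit; k++) {
    if (a[k] !== b[k]) return k;
  }
  return a.length === b.length ? -1 : limit;
}

// `pairs`: array of [original, reproduced] paragraph strings, already aligned.
// `distance`: any edit distance function (a, b) => number, e.g. Levenshtein.
// Returns a plain object that can be merged into the check result.
function buildTextDifferences(pairs, distance) {
  const differences = [];
  pairs.forEach(([original, reproduced], paragraph) => {
    const offset = firstDifference(original, reproduced);
    if (offset === -1) return; // identical paragraphs are not reported
    differences.push({
      paragraph,                                // index of the paragraph pair
      offset,                                   // position of the first change
      distance: distance(original, reproduced), // quantified difference
    });
  });
  return {
    textDifferences: {
      paragraphsCompared: pairs.length,
      paragraphsChanged: differences.length,
      totalDistance: differences.reduce((sum, d) => sum + d.distance, 0),
      differences,
    },
  };
}
```

Storing the paragraph index and character offset alongside the distance would let the UI jump straight to each change instead of recomputing the diff.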
I suggest starting with the diff module to compare the words (JsDiff.diffWords) in the body of the HTML files. It should be possible to integrate this into the already existing HTML diff, see https://www.npmjs.com/package/diff#examples
Also, add an example of an Rmd file that produces different texts based on some random component, including a case where only a single number in a sentence changes.