I just came across dupeguru and am running a comparison of many video files using content comparison (I am not interested in somewhat-alike files, just exact copies that may have been uploaded from the camera a few times).
Still, the comparison is very slow, and I wonder why that is.
If the file sizes are the same, I would expect dupeguru (dg) to use hashes to determine whether the files are the same.
But does it really run a complete hash over the whole file?
I would expect it to use progressive hashes, e.g.:
1. Compare a hash of the first 1 KB.
2. If that is the same, do the same with the next KB (or possibly even skip ahead by some largish amount to another known offset).
3. And so on, until a mismatch is found (see the sketch below).
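Roughly something like this (just a rough Python sketch of the idea; the function name, md5, and the 1 KB segment size are made up for illustration, not anything dg actually does):

```python
import hashlib

SEGMENT = 1024  # 1 KiB segments, per the idea above (illustrative choice)

def files_probably_equal(path_a, path_b, segment_size=SEGMENT):
    """Compare two same-sized files segment by segment, stopping at the
    first mismatching segment hash instead of hashing the whole files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(segment_size)
            block_b = fb.read(segment_size)
            # Comparing per-segment hashes; for two local files you could
            # compare the raw bytes directly, but the hashes are what a
            # cache (see below) could store and reuse.
            if hashlib.md5(block_a).hexdigest() != hashlib.md5(block_b).hexdigest():
                return False  # mismatch found: files differ
            if not block_a:   # both files exhausted without a mismatch
                return True
```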
Does dg keep a database of already-checked files on disk, so that it does not have to re-calculate the hashes each time?
In that case, it would only need to store the hashes calculated the last time around, and of course how far the checks went.
If a new file was compared against it, dg could then reuse those hashes and continue checking later segments of the old file only if needed (and then of course record those new hashes).
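A very rough sketch of such a cache (again purely illustrative; the JSON file, the key format, and the helper names are my own assumptions, not dg's actual storage):

```python
import hashlib
import json
import os

CACHE_FILE = "hash_cache.json"   # hypothetical on-disk cache location
SEGMENT = 1024                   # 1 KiB segments, as above

def load_cache():
    """Load the cached segment hashes, or start with an empty cache."""
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

def segment_hashes(path, count, cache):
    """Return md5 hashes of the first `count` segments of `path`,
    reusing cached hashes and reading/recording only the missing ones."""
    st = os.stat(path)
    key = f"{path}:{st.st_size}:{st.st_mtime_ns}"  # invalidated when the file changes
    entry = cache.setdefault(key, [])
    if len(entry) < count:
        with open(path, "rb") as f:
            f.seek(len(entry) * SEGMENT)
            for _ in range(count - len(entry)):
                block = f.read(SEGMENT)
                if not block:          # reached end of file
                    break
                entry.append(hashlib.md5(block).hexdigest())
    return entry[:count]
```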
Say we had two files of the exact same size (say several MB).
The cached hashes could be recorded like this:
file1: size=646567625 full-hash=none partial-hashes=md5:5 [0x1234 0x2345 0x3456 0x4567 0x5678 ]
If a new file was found to have the same size and its hash 1 (first KB) came out to 0x1234, dg would continue checking hashes until a mismatch was found in one of the first 5 hashes. If it was still the same, it would need to calculate the hashes for BOTH (or more) matching files in segments 6, 7, 8 (and record these) until either a mismatch was found or the new file was found to be the same and marked/reported as such.
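Using the hypothetical `segment_hashes` helper from the cache sketch above, that extend-as-needed comparison could look like this (again only a sketch of the idea, with made-up names):

```python
def same_content(new_path, old_path, size, cache, batch=5):
    """Compare a new file with an already-cached candidate of the same size,
    extending BOTH files' cached segment hashes in batches until a mismatch
    appears or every segment has matched."""
    total_segments = (size + SEGMENT - 1) // SEGMENT
    checked = 0
    while checked < total_segments:
        upto = min(checked + batch, total_segments)
        new_hashes = segment_hashes(new_path, upto, cache)
        old_hashes = segment_hashes(old_path, upto, cache)
        if new_hashes[checked:upto] != old_hashes[checked:upto]:
            return False   # mismatch: report the files as different
        checked = upto
    return True            # all segments matched: report as exact duplicates
```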
So after this, the same file's entry might look like this:
file1: size=646567625 full-hash=none partial-hashes=md5:8 [0x1234 0x2345 0x3456 0x4567 0x5678 0x6789 0x7890 0x8901 ]
Of course, there should then be a mode to re-calculate remembered/cached hashes when needed (a job that could run when convenient, say overnight), including one that then calculates complete hashes:
file1: size=646567625 full-hash=md5:0xabcdef12 partial-hashes=md5:8 [0x1234 0x2345 0x3456 0x4567 0x5678 0x6789 0x7890 0x8901 ]
That might make comparisons much faster, because dg would only have to hash partial files in most cases.
Also, would it not be a good idea to indicate in the progress bar how many files are left to be checked?