I just came across dupeguru and am running a comparison of many video files using content comparison (I am not interested in somewhat-alike files, just exact copies that may have been uploaded from the camera a few times).
Still, the comparison is very slow, and I wonder why that is.
If the file sizes are the same, I would expect dupeguru (dg) to use hashes to determine whether the files are the same.
But does it really run a complete hash over the whole file?
I would expect it to use progressive hashes, e.g.:
1. Compare a hash of the first 1 KB.
2. If that is the same, do the same with the next KB (or possibly even skip ahead by some largish amount to another known offset).
3. And so on, until a mismatch is found (see the sketch below).
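Roughly something like this (just a rough Python sketch of the idea; the function name, md5, and the 1 KB segment size are made up for illustration, not anything dg actually does):

```python
import hashlib

SEGMENT = 1024  # 1 KiB segments, per the idea above (illustrative choice)

def files_probably_equal(path_a, path_b, segment_size=SEGMENT):
    """Compare two same-sized files segment by segment, stopping at the
    first mismatching segment hash instead of hashing the whole files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(segment_size)
            block_b = fb.read(segment_size)
            # Comparing per-segment hashes; for two local files you could
            # compare the raw bytes directly, but the hashes are what a
            # cache (see below) could store and reuse.
            if hashlib.md5(block_a).hexdigest() != hashlib.md5(block_b).hexdigest():
                return False  # mismatch found: files differ
            if not block_a:   # both files exhausted without a mismatch
                return True
```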
Does dg keep a database of already-checked files on disk, so that it does not have to re-calculate the hashes each time?
In that case, it would only need to store the hashes calculated the last time around, and of course how far the checks went.
If a new file was compared against it, dg could then reuse those hashes and continue checking later segments of the old file only if needed (and then of course record those new hashes).
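A very rough sketch of such a cache (again purely illustrative; the JSON file, the key format, and the helper names are my own assumptions, not dg's actual storage):

```python
import hashlib
import json
import os

CACHE_FILE = "hash_cache.json"   # hypothetical on-disk cache location
SEGMENT = 1024                   # 1 KiB segments, as above

def load_cache():
    """Load the cached segment hashes, or start with an empty cache."""
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

def segment_hashes(path, count, cache):
    """Return md5 hashes of the first `count` segments of `path`,
    reusing cached hashes and reading/recording only the missing ones."""
    st = os.stat(path)
    key = f"{path}:{st.st_size}:{st.st_mtime_ns}"  # invalidated when the file changes
    entry = cache.setdefault(key, [])
    if len(entry) < count:
        with open(path, "rb") as f:
            f.seek(len(entry) * SEGMENT)
            for _ in range(count - len(entry)):
                block = f.read(SEGMENT)
                if not block:          # reached end of file
                    break
                entry.append(hashlib.md5(block).hexdigest())
    return entry[:count]
```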
Say we had two files of the exact same size (say several MB).
The cached hashes could be recorded like this:
file1: size=646567625 full-hash=none partial-hashes=md5:5 [0x1234 0x2345 0x3456 0x4567 0x5678 ]
If a new file was found to have the same size and its hash 1 (first KB) came out to 0x1234, dg would continue checking hashes until a mismatch was found in one of the first 5 hashes. If it was still the same, it would need to calculate the hashes for BOTH (or more) matching files in segments 6, 7, 8 (and record these) until either a mismatch was found or the new file was found to be the same and marked/reported as such.
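Using the hypothetical `segment_hashes` helper from the cache sketch above, that extend-as-needed comparison could look like this (again only a sketch of the idea, with made-up names):

```python
def same_content(new_path, old_path, size, cache, batch=5):
    """Compare a new file with an already-cached candidate of the same size,
    extending BOTH files' cached segment hashes in batches until a mismatch
    appears or every segment has matched."""
    total_segments = (size + SEGMENT - 1) // SEGMENT
    checked = 0
    while checked < total_segments:
        upto = min(checked + batch, total_segments)
        new_hashes = segment_hashes(new_path, upto, cache)
        old_hashes = segment_hashes(old_path, upto, cache)
        if new_hashes[checked:upto] != old_hashes[checked:upto]:
            return False   # mismatch: report the files as different
        checked = upto
    return True            # all segments matched: report as exact duplicates
```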
So after this, the same file's entry might look like this:
file1: size=646567625 full-hash=none partial-hashes=md5:8 [0x1234 0x2345 0x3456 0x4567 0x5678 0x6789 0x7890 0x8901 ]
Of course, there should then be a mode to re-calculate remembered/cached hashes when needed (a job that could run when convenient, say overnight), including one that then calculates complete hashes:
file1: size=646567625 full-hash=md5:0xabcdef12 partial-hashes=md5:8 [0x1234 0x2345 0x3456 0x4567 0x5678 0x6789 0x7890 0x8901 ]
That might make comparisons much faster, because dg would only have to hash partial files in most cases.
Also, would it not be a good idea to indicate in the progress bar how many files are left to be checked?