False Positive on .eml files #1279
No false positives after the cache file was removed. Still getting those `[WinError 6] The handle is invalid` errors, and clearing the cache still doesn't reduce the size of that .db file.
I just noticed that the NAS started a backup of the drive I'm working on at 10pm last night, and it's only 58% done (a lot has changed since the last backup), so perhaps that's what's causing the invalid handle errors?
Alright, so I ran the same scan again to see whether I got the same invalid handle errors in the debug log, or perhaps different ones (the backup is still running). No invalid handle errors in the log any more, and no false positives either, even when I put the cache file back in and ran it yet again. So it looks like it was a combination of the cache and whatever was causing those file handle errors.
My best guess for what happened here (without going off to learn Python and reading the code, a guess is all it can be) is that something caused a whole bunch of files to be temporarily inaccessible at the same time as DupeGuru was trying to check them, making them look empty, and that is the information which was cached. Because the sizes matched, the files looked identical. If only one of them had been having issues, or one had only become inaccessible part way through DG analysing it, they wouldn't have matched. When DG was run without the cache, or with the cache but at a time when the files were no longer inaccessible, there were no false matches.

So it's obviously a rare set of circumstances, and I probably haven't deleted any files as duplicates which weren't actually duplicates, but I can't be sure of that. I'd argue that rather than silently dumping file handle errors into the debug log (if that's even turned on), as it currently does, DG should be:

a) notifying the user;
b) not adding those items to the cache;
c) excluding them from the scan.

I've been trying to figure out what could have caused the invalid handle errors, which I've since also seen in BeyondCompare (but only after I'd stopped trying to use DG), and after a bunch of messing about, the most likely culprit appears to be emptying the Recycle Bin.
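To illustrate points a) to c): the failure mode described above, hashing a file that raises an error and then caching the bogus empty result, could be guarded against roughly like this. This is a hypothetical sketch, not DupeGuru's actual code; the function name, the dict-based cache, and the errors list are all invented for illustration:

```python
import hashlib

def hash_file(path, cache, errors):
    """Hash a file's contents; on any I/O error, record the failure
    instead of caching a bogus (empty) digest."""
    if path in cache:
        return cache[path]
    h = hashlib.md5()
    try:
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files don't need to fit in memory.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
    except OSError as exc:  # [WinError 6] surfaces as an OSError on Windows
        # (b) don't cache the result, (c) signal the caller to exclude the
        # file, (a) collect the error so it can be shown to the user.
        errors.append((path, str(exc)))
        return None
    digest = h.hexdigest()
    cache[path] = digest
    return digest
```

In this sketch, any file whose digest comes back `None` would be dropped from the comparison set, and the `errors` list would be surfaced in the UI rather than only written to the debug log.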
How big are the files that were skipped? You might have enabled the option to skip smaller files; check your settings.
Was this response meant for someone else? I did not have a problem with files being skipped. |
I may be wrong, but it looks like the cause of the invalid handle errors was the RAID protesting at having been run at or near capacity for more than a week. I'm being gentler with it now (restore from the NAS is limited by network speed, so that's not stressing it), letting it rest between DupeGuru scans, and trying to keep the scans smaller in future.

I still think it's a problem that file access errors are only written to the debug log, and that DG seems to keep files it couldn't access properly in the list of files to be compared. It would also be super useful to be able to pause it, manually or, better yet, on a schedule. I'm assuming drive access is system-controlled and so would be difficult to throttle, but then I could, for example, set a long-running scan going and have it pause disk operations for 5 minutes every hour so it's not continually hammering the drive.
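The scheduled pause suggested above could, in principle, be wrapped around the scan loop without touching the rest of the app. A minimal sketch, with entirely hypothetical names (DupeGuru has no such option, and the clock/sleep parameters exist only so the behaviour can be tested):

```python
import time

def throttled(items, work_seconds=3300, rest_seconds=300,
              clock=time.monotonic, sleep=time.sleep):
    """Yield items from an iterable, sleeping for rest_seconds after
    every work_seconds of wall-clock activity, so the disk gets
    periodic breathers during a long scan."""
    window_start = clock()
    for item in items:
        if clock() - window_start >= work_seconds:
            sleep(rest_seconds)        # let the drive rest
            window_start = clock()     # start a fresh work window
        yield item
```

With the defaults this would scan for 55 minutes, then rest for 5, matching the "pause for 5 minutes every hour" idea; the scanner would simply iterate `throttled(files_to_scan)` instead of `files_to_scan`.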
I've just done a content scan which identified a whole load of .eml files as duplicates, but they're not even close: different file modification dates, completely different headers, and even the email bodies aren't identical. They're notifications of activity on a forum, saying that user X started a new thread and giving a link to it. The users are different, and so are the links.
I thought maybe something else I'd done since starting the app might have confused it, so I told it to clear the cache ready to start again, and when going into the folder to remove the log/profile files as well, I noticed that the hash_cache.db was still 2GB.
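One note on the 2GB file: the .db extension suggests hash_cache.db is a SQLite database, and SQLite files do not shrink when rows are deleted; the freed pages stay in the file until a `VACUUM` rebuilds it. So a "cleared" cache keeping its full size is not, by itself, evidence the clear failed. A sketch of how the file could be compacted to check this (assuming it really is SQLite, which I haven't verified):

```python
import sqlite3

def compact(db_path):
    """Rebuild a SQLite database file in place, reclaiming the space
    left behind by deleted rows."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("VACUUM")  # rewrites the file without free pages
    finally:
        con.close()
```

If the file shrinks dramatically after this, the cache-clear was working all along and only the file size was misleading.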
Also, all these supposed duplicates were in a folder I had scanned previously and got "no duplicates" from; the difference this time was that I'd added another folder to the list.
The files themselves are unimportant, but this behavior isn't: I have to assume that while I noticed THIS problem, there have been others which I didn't notice. So I'm going to have to restore the backups from when I started this round of deduplication and start over.
I've had so many weird little things this last week, most of them ones I couldn't reproduce or wasn't 100% sure I'd seen after they'd happened, that I'm also going to uninstall the app, delete its folders and registry entries, and start over. Probably building from source.
To Reproduce
I'm currently running a scan on the subfolder where the false duplicates were found, with the possibly-busted cache still in place. That subfolder is quite large on its own, so I'm expecting it to take a couple of hours. If that finds nothing, I guess I'll try again with the selection I had previously, but that was a VERY big scan, so it would probably have to run overnight.
If it finds the false duplicates in the subfolder, then I'll close the app, manually delete the cache file and see what happens then.
Expected behavior
Not identify things as duplicates when they're not?!
Screenshots
I would include screenshots of two of the email files side by side in Notepad++, but they're not my emails (and it's not the sort of forum you'd want your mother or boss knowing you're on).
Desktop (please complete the following information):