
False Positive on .eml files #1279

Open
pantropia opened this issue Feb 1, 2025 · 8 comments

Labels
bug Bug reports.

Comments

@pantropia
I've just done a content scan which identified a whole load of .eml files as duplicates, but they're not even close: the file modification dates differ, the headers are completely different, and even the email bodies aren't identical. They're notifications of activity on a forum, saying that user x started a new thread and giving a link to it; the users are different, and so are the links.
I thought maybe something else I'd done since starting the app might have confused it, so I told it to clear the cache ready to start again, and when going into the folder to remove the log/profile files as well, I noticed that the hash_cache.db was still 2GB.
Also, all these supposed duplicates were in a folder I had scanned previously, and got "no duplicates" from, the difference this time was that I'd added another folder to the list.
The files themselves are unimportant but this behavior isn't - I have to assume that while I noticed THIS problem, there have been others which I didn't notice. So I'm going to have to restore the backups from when I started this round of deduplication and start over.
I've had so many weird little things this last week, most of them ones I couldn't reproduce or which I wasn't 100% sure I'd seen after they'd happened, that I'm also going to uninstall the app, delete its folders and registry entries and start over. Probably from code.

To Reproduce
I'm currently running a scan on the subfolder where the false duplicates were found with the possibly-busted cache still in place. That is quite large on its own, so I'm expecting it to take a couple of hours. If that finds nothing then I guess I'll try again with the selection I had previously, but that was a VERY big scan so would probably have to run overnight.
If it finds the false duplicates in the subfolder, then I'll close the app, manually delete the cache file and see what happens then.

Expected behavior
Not identify things as duplicates when they're not?!

Screenshots
I would include screenshots of two of the email files side-by-side in notepad++ but they're not my emails (and it's not the sort of forum you'd want your mother or boss knowing you're on)

Desktop (please complete the following information):

  • OS: Win 11
  • Version: 4.3.1
pantropia added the bug label on Feb 1, 2025
@pantropia
Author

pantropia commented Feb 1, 2025

OK - ran it again with the large cache file still in place and the false positives have come up again.
The debug log is showing a lot of errors like this, at least some of which reference those false positives:
2025-02-01 12:58:40,922 - WARNING - An error '[WinError 6] The handle is invalid: 'D:\For Deduping\D\geeknas\public\userremoved@gmail.com\2015\6\23\1-384388.eml'' was raised while decoding 'WindowsPath('D:/For Deduping/D/geeknas/public/userremoved@gmail.com/2015/6/23/1-384388.eml')'
Compared the two largest ones (which weren't related to that Interesting site) with Beyond Compare, and this is the start of it, so as you can see they're very different.
[Image: side-by-side comparison of the two files in Beyond Compare]
I've confirmed that the file in question does exist, so I'm assuming that when dupeguru hits one of those invalid-handle errors it ends up, for whatever reason, comparing size only rather than actually doing the content comparison.
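To illustrate the suspected failure mode, here is a hypothetical sketch (not dupeguru's actual code): if a read error during hashing gets cached as an empty digest instead of being treated as an error, every unreadable file ends up with the same "fingerprint", and any two of them with equal sizes then match each other.

```python
import hashlib

# Hypothetical sketch (not dupeguru's actual code) of the failure mode:
# a read error cached as an empty digest makes every unreadable file
# carry the same value, so equal-size files then "match".
def fingerprint(read_bytes, size, cache, key):
    """read_bytes: a callable returning the file's contents; may raise OSError."""
    if key not in cache:
        try:
            cache[key] = hashlib.md5(read_bytes()).hexdigest()
        except OSError:
            cache[key] = ""  # the bug: failure stored as a normal-looking value
    return (size, cache[key])

def invalid_handle():
    raise OSError(6, "The handle is invalid")

cache = {}
a = fingerprint(invalid_handle, 4096, cache, "1-384388.eml")
b = fingerprint(invalid_handle, 4096, cache, "other.eml")
print(a == b)  # True: two completely different, unreadable files "match"
```

The `fingerprint` helper and its signature are invented for this illustration; the point is only that an error-as-empty-value cache entry produces exactly the size-only matching behaviour described above.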
I'm trying again now with the cache file removed, will report back how that goes.

@pantropia
Author

No false positives after the cache file was removed. I'm still getting those '[WinError 6] The handle is invalid' errors, and clearing the cache still doesn't reduce the size of that .db file.

@pantropia
Author

I just noticed that the NAS started doing a backup of the drive I'm working on at 10pm last night and it's only 58% done - a lot has changed since the last backup - so perhaps that's what's causing the invalid handle errors?

@pantropia
Author

Alright, so, I ran the same scan again to see if I got the same invalid handle errors in the debug log or perhaps different ones (as the backup is still running). No invalid handle errors in the log any more. And also no false positives even when I put the cache file back in and ran it yet again. So it looks like it's a combination of the cache and whatever was causing those file handle errors.
I searched here for 'invalid file handle' and nothing came up. Anyone got any ideas what the hell might have been going on at that point? Any other logs I should be looking at to try to figure it out?
I'm still going to have to restore from backup and start over, because I have no idea whether there were false positives previously which were removed without me noticing.

@pantropia
Author

My best guess for what happened here - and without going off to learn Python and trying to understand the code, a guess is all it can be - is that something caused a whole bunch of files to be temporarily inaccessible at the same time as DupeGuru was trying to check them, making them look empty, and that this is the information which was cached. Because the sizes matched, they looked identical. If only one of them had been having issues, or one had only become inaccessible part way through DG analysing it, they wouldn't have matched. When DG was run without the cache, or with the cache but at a time when the files were no longer inaccessible, there were no false matches.

So it's obviously a rare set of circumstances, and I probably haven't ended up deleting any files as duplicates which weren't duplicates, but I can't be sure of that. I'd argue that rather than silently dumping file handle errors into the debug log (if that's even turned on) as it currently does, it should (a) notify the user, (b) not add those items to the cache, and (c) exclude them from the scan.
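The three-point behaviour suggested above could be sketched like this (a hypothetical helper, not dupeguru's actual API): a read failure is collected so the user can be told, nothing is written to the cache, and the file is left out of the comparison set.

```python
import hashlib
from pathlib import Path

# Hypothetical sketch of the safer behaviour suggested above
# (not dupeguru's actual API).
def hash_for_scan(path: Path, cache: dict, errors: list):
    try:
        digest = hashlib.md5(path.read_bytes()).hexdigest()
    except OSError as exc:
        errors.append((path, exc))  # (a) collect it to notify the user
        return None                 # (b) nothing cached, (c) excluded from scan
    cache[path] = digest
    return digest
```

A caller would drop the `None` results from the matching step and show the accumulated `errors` list to the user at the end of the scan, so an inaccessible file can never masquerade as an empty one.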

I've been trying to figure out what could have caused the invalid handle errors, which I've since also seen in Beyond Compare (but only after I'd stopped trying to use DG), and after a bunch of messing about, the most likely culprit appears to be emptying the recycle bin.

@glubsy
Contributor

glubsy commented Feb 2, 2025

How big are the files that were skipped? You might have enabled the option to skip smaller files, check your settings.

@pantropia
Author

How big are the files that were skipped? You might have enabled the option to skip smaller files, check your settings.

Was this response meant for someone else? I did not have a problem with files being skipped.

@pantropia
Author

pantropia commented Feb 3, 2025

I may be wrong, but it looks like the cause of the invalid handle errors was the RAID protesting at having been run at or near capacity for more than a week. I'm being gentler with it now (restore from the NAS is obviously limited by network speed, so that's not stressing it), and I'll be letting it rest between DupeGuru scans and trying to keep them smaller in future.

I still think it's a problem that file access errors are only written to the debug log, and that DG seems to be keeping files it couldn't access properly in the list of files to be compared.

It would also be super useful to be able to pause it - I'm assuming drive access is system-controlled and so would be difficult to throttle - either manually or, better yet, on a schedule, so that I could, for example, set a long-running scan going and have it pause disk operations for 5 minutes every hour so it's not continually hammering the drive.
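The scheduled-pause idea could be sketched roughly like this (hypothetical, in no way dupeguru's actual scan loop): between files, check whether the current work window has elapsed and, if so, sleep before carrying on.

```python
import time

# Hypothetical sketch of the scheduled-pause idea: work for work_secs,
# then rest for rest_secs, so a long scan gives the drive regular breaks.
def scan_with_breaks(files, process, work_secs=3300, rest_secs=300):
    window_start = time.monotonic()
    for f in files:
        if time.monotonic() - window_start >= work_secs:
            time.sleep(rest_secs)            # let the drive rest
            window_start = time.monotonic()  # start a new work window
        process(f)
```

The `work_secs`/`rest_secs` values (55 minutes on, 5 minutes off) are just the example from the comment above; checking between files rather than mid-file keeps each individual hash operation uninterrupted.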
