Some duplicate images not in resultlist #27

Open
blannoy opened this issue Oct 15, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@blannoy

blannoy commented Oct 15, 2023

Hello,

The scan seems to miss some photos that are exactly the same (image, filename, dimensions). I don't know why they don't show up in the similarity map. Sometimes running it a few times eventually finds them. I suppose it's because the model compares only the image content and doesn't use the metadata?

This got me thinking that a simple scanning feature could just compare photo metadata instead of content.
So this is somewhere between a bug and a feature request.
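
To make the metadata-only idea concrete, here is a minimal sketch that groups photos by filename and dimensions without running any model. The `Photo` record and its field names are assumptions for illustration, not the project's actual schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Photo(NamedTuple):
    # Hypothetical record; these field names are assumptions.
    id: str
    filename: str
    width: int
    height: int


def group_by_metadata(photos):
    """Return lists of photo IDs that share filename and dimensions."""
    groups = defaultdict(list)
    for photo in photos:
        key = (photo.filename.lower(), photo.width, photo.height)
        groups[key].append(photo.id)
    # Only groups with at least two members are duplicate candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


photos = [
    Photo("a", "IMG_0001.jpg", 4032, 3024),
    Photo("b", "IMG_0001.jpg", 4032, 3024),
    Photo("c", "IMG_0002.jpg", 4032, 3024),
]
candidates = group_by_metadata(photos)
```

Matching metadata does not prove the pixels are identical, so such groups would still be candidates for review rather than confirmed duplicates.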

The deletion feature is also something that could be used separately, e.g. load a list of IDs and run that through the plugin.

PS: it's a really cool project

mtalcott added the bug and enhancement labels, then removed the bug label, on Oct 15, 2023
@mtalcott
Owner

Yes, currently photo metadata such as filename and dimensions is not taken into account when calculating similarity scores. So this is somewhat expected behavior. I agree that the same filename (perhaps excluding common suffixes for dupes like copy, (1), etc.) and same dimensions should boost the similarity score. Count that as a desired enhancement!
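
A rough sketch of how such a boost might look, assuming scores in the range 0 to 1. The suffix pattern, the `boost` amount, and the dictionary keys are all hypothetical choices for illustration, not anything the project implements today.

```python
import re
from pathlib import PurePath

# Assumed duplicate markers: " copy", "-copy 2", " (1)". Extend as needed.
_DUPE_SUFFIX = re.compile(r"(?:[ _-]copy(?:[ _-]?\d+)?|\s*\(\d+\))$", re.IGNORECASE)


def normalized_stem(filename: str) -> str:
    """Strip the extension and any trailing duplicate markers, then lowercase."""
    stem = PurePath(filename).stem
    previous = None
    while previous != stem:  # repeat to handle stacked markers like "copy (1)"
        previous = stem
        stem = _DUPE_SUFFIX.sub("", stem)
    return stem.lower()


def boosted_score(score: float, a: dict, b: dict, boost: float = 0.01) -> float:
    """Raise the score slightly when filename and dimensions both match."""
    same_name = normalized_stem(a["filename"]) == normalized_stem(b["filename"])
    same_dims = (a["width"], a["height"]) == (b["width"], b["height"])
    if same_name and same_dims:
        return min(1.0, score + boost)
    return score
```

With a boost of 0.01, a pair scoring 0.9858 with matching metadata would cross a 0.99 threshold, while pairs without matching metadata keep their original score.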

The model that calculates image similarity is static though, so I would not expect the output to vary for the same image. Perhaps it was comparing against different photos, resulting in slightly different similarity scores. Do you have any example images you'd be willing to provide?

@blannoy
Author

blannoy commented Oct 16, 2023

Hi,

I did some digging...

I extracted both 250px images from Mongo and the filesystem. When I run them through the model online (MediaPipe MobileNet large), I get a score of 98.58%, so that's below the standard 99% limit. So I lowered the limit in the deduper and now I get the images (score of 98.79%; they are maybe a bit too noisy). The problem is that I get other images as well, alongside the ones that were missing the first time. Some of them are real duplicates, but others are not (though they are very similar).
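
For reference, a minimal sketch of the threshold check described above. It assumes the score is a cosine similarity between two embedding vectors, which this thread does not state explicitly; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_duplicate(score: float, threshold: float = 0.99) -> bool:
    """A pair is reported only when its score reaches the threshold."""
    return score >= threshold


missed = is_duplicate(0.9858)                 # below the 99% limit
found = is_duplicate(0.9858, threshold=0.98)  # reported once the limit is lowered
```

Lowering the threshold admits every pair above the new limit, which is why near-duplicates start appearing alongside the true ones.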

If there were a way to filter the results to "only the ones with the same size/dimensions/filename", that would be better.
So I guess a front-end result filter would do the trick, but my React skills are not up to the task ;)
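
The filter logic itself is small. Here is a sketch in Python rather than React, assuming results arrive as `(id_a, id_b, score)` tuples plus a metadata lookup; both shapes are hypothetical and not the plugin's real data model.

```python
def filter_pairs(pairs, metadata, same_dimensions=True, same_filename=False):
    """Keep only scored pairs whose photos match on the selected metadata."""
    kept = []
    for id_a, id_b, score in pairs:
        a, b = metadata[id_a], metadata[id_b]
        if same_dimensions and (a["width"], a["height"]) != (b["width"], b["height"]):
            continue
        if same_filename and a["filename"].lower() != b["filename"].lower():
            continue
        kept.append((id_a, id_b, score))
    return kept


# Illustrative data only; the second score is made up for the example.
metadata = {
    "a": {"filename": "IMG_0001.jpg", "width": 4032, "height": 3024},
    "b": {"filename": "IMG_0001.jpg", "width": 4032, "height": 3024},
    "c": {"filename": "IMG_0417.jpg", "width": 3024, "height": 4032},
}
pairs = [("a", "b", 0.9879), ("a", "c", 0.9891)]
filtered = filter_pairs(pairs, metadata)
```

Applied after lowering the limit, a filter like this would keep the exact-dimension pair and drop the similar-but-different one, regardless of which scored higher.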

@mtalcott
Owner

Yes, I agree the filtering would be a great addition. That's tracked as a desired enhancement over on #7.

It's also a bummer that true duplicates get a 98.58% match with the MobileNet-V3 (large) model. Thanks for confirming that; I've found the same with some dupes in my own photos. I explored lowering the default limit, but found that it started including many more non-duplicates as well. This undesirable behavior is likely due to using a model optimized for mobile use. Previously I was using the clip-ViT-B-32 model with sentence-transformers, which had better accuracy but at the expense of MUCH longer (3x?) runtime and out-of-memory errors due to MUCH higher memory usage. Perhaps there is a better model compatible with MediaPipe that can be used...
