Bug: Duplicate JPEGs are not being removed #1
Relevant Research Findings
Conclusions: Given that the S2 collection has its duplicates filtered out before the download ever begins, I'm confused how duplicate JPEGs for the S2 collection are even being generated. I think it will take some testing to figure out when this happens. It's also possible that each person has a different definition of what a duplicate image is: the CoastSat implementation classifies a duplicate as an image from the same satellite collection whose timestamp is less than 24 hours from another image's. As I write this, I realize the issue might not be the filtering technique but the fact that the collections come in two different tiers. There may be timestamps that are identical across both tiers for the same satellite, which would produce duplicate imagery even though it should have been filtered out. While the S2 collection does not have two tiers, the other satellites do, so I'm going to do some testing and see if duplicates are arising because of this.
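To test the tier-overlap hypothesis, something like the following could work. This is only a sketch using the Earth Engine Python API; the point and date range are placeholders rather than values from an actual download.

```python
# Sketch: check whether the same acquisition timestamps appear in both Landsat 8
# Collection 2 tiers over an example location/date range, which would explain
# duplicate downloads. The point and date range below are placeholders.
import ee

ee.Initialize()

point = ee.Geometry.Point(-117.45, 33.25)  # example location, not from the issue
start, end = "2023-01-01", "2023-06-01"

def acquisition_times(collection_id):
    """Return the set of system:time_start values (ms since epoch) for a collection."""
    collection = (
        ee.ImageCollection(collection_id)
        .filterBounds(point)
        .filterDate(start, end)
    )
    return set(collection.aggregate_array("system:time_start").getInfo())

tier1 = acquisition_times("LANDSAT/LC08/C02/T1_TOA")
tier2 = acquisition_times("LANDSAT/LC08/C02/T2_TOA")

overlap = tier1 & tier2
print(f"Timestamps present in both tiers: {len(overlap)}")
```

If the overlap is non-empty for a given site, that would point to the tiers rather than the filtering step as the source of duplicates.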
These two jpegs
We'd want to keep both images. Having images on consecutive days is super valuable! Duplicates are only when images have identical times.
Ah, good to know, thanks for helping me double-check that. When I ran the script below on 700 S2 images I downloaded, I didn't find any duplicate imagery. Sometimes the images are only a few minutes apart, but other than that I'm not finding duplicates.

```python
import os
from collections import Counter

# List the downloaded RGB JPEGs and count how often each filename appears.
file_list = os.listdir('/home/sha23/development/coastseg/CoastSeg/data/ID_kyg1_datetime10-02-23__03_11_52/jpg_files/preprocessed/RGB')
counter = Counter(file_list)

# Any filename that appears more than once is a duplicate.
duplicates = {file: count for file, count in counter.items() if count > 1}

# Print the duplicates
for duplicate, count in duplicates.items():
    print(f"Filename: {duplicate} - Count: {count}")
```
I ran this script across all the data I've downloaded and I didn't find any duplicates.

@dbuscombe-usgs have you found duplicate imagery in any of the downloads you've performed?
I heard back from Catherine on the duplicate images issue and here is what she said:
So it seems there aren't identical images being generated, just multiple images that are sometimes seconds apart.
During the meeting today we addressed the confusion about "duplicate" images, or more accurately, images captured within a few minutes of each other or less. We determined that it would be easiest to handle this with a post-processing script that removes images that are less than a few minutes apart (a rough sketch is included below). It was suggested that this would be a script located in the scripts directory.
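A minimal sketch of what such a post-processing script might look like, assuming the capture timestamp can be parsed from the start of each JPEG filename (e.g. a `YYYY-MM-DD-HH-MM-SS` prefix like `2023-10-02-03-11-52_RGB_S2.jpg`); the filename format, directory path, and threshold here are assumptions, not confirmed CoastSeg behavior.

```python
# Sketch of a post-processing pass that flags JPEGs captured within a few minutes
# of the previously kept image. Assumes filenames begin with a
# "YYYY-MM-DD-HH-MM-SS" timestamp; adjust the parsing if the real scheme differs.
import os
from datetime import datetime, timedelta

RGB_DIR = "/path/to/jpg_files/preprocessed/RGB"  # placeholder path
THRESHOLD = timedelta(minutes=5)                 # user-specified time period

def capture_time(filename):
    # First 19 characters assumed to hold the timestamp.
    return datetime.strptime(filename[:19], "%Y-%m-%d-%H-%M-%S")

jpgs = sorted((f for f in os.listdir(RGB_DIR) if f.endswith(".jpg")), key=capture_time)
kept, near_duplicates = [], []
last_kept_time = None

for name in jpgs:
    t = capture_time(name)
    if last_kept_time is not None and t - last_kept_time < THRESHOLD:
        near_duplicates.append(name)  # within the window of the last kept image
    else:
        kept.append(name)
        last_kept_time = t

print(f"Kept {len(kept)} images, flagged {len(near_duplicates)} near-duplicates")
# To actually delete them:
# for name in near_duplicates:
#     os.remove(os.path.join(RGB_DIR, name))
```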
I think the point of the script would be for the user to specify what time period they'd like, and it should go in SDS-tools. It wouldn't filter out images, but shorelines.
And no, we don't want to remove any imagery. The SDS-tools script will remove duplicate shorelines: it will look at all the shorelines within X minutes (hours, days, whatever) of another one and remove them.
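A rough sketch of that shoreline-level filter, assuming the extracted shorelines are available as parallel `dates` and `shorelines` lists in a CoastSat-style output dictionary; the threshold and dictionary layout are assumptions, and any other per-image keys would need the same index filtering.

```python
# Sketch: drop shorelines whose timestamp falls within X minutes of an
# already-kept shoreline. Assumes a dict with parallel "dates" (datetime
# objects) and "shorelines" (coordinate arrays) lists.
from datetime import timedelta

def remove_near_duplicate_shorelines(output, threshold=timedelta(minutes=5)):
    # Visit shorelines in chronological order.
    order = sorted(range(len(output["dates"])), key=lambda i: output["dates"][i])
    kept, last_time = [], None
    for i in order:
        t = output["dates"][i]
        if last_time is None or t - last_time >= threshold:
            kept.append(i)
            last_time = t
    return {
        "dates": [output["dates"][i] for i in kept],
        "shorelines": [output["shorelines"][i] for i in kept],
    }
```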
Description:
Users have found that duplicate images are being downloaded and used to extract shorelines. Downloading duplicate images has always been an issue with CoastSat's download workflow, but the question is whether the duplicate-removal step is applied only to the TIFFs, leaving duplicate JPEGs behind. Users are also wondering if it's possible to modify the download workflow so that duplicates are detected before they are downloaded, so that duplicate images do not further slow down the downloads. Issues with duplicates are most prevalent with S2 imagery. This has led to significant delays in download times, impacting user experience and overall workflow efficiency.
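For example, if the download workflow had access to the candidate images' acquisition times before fetching any files, a pre-download prune could look something like this hypothetical helper (the `candidates` structure and function name are illustrative, not part of CoastSat's API):

```python
# Sketch: prune near-duplicate acquisitions from a candidate image list before
# downloading. `candidates` is assumed to be a list of (image_id, time_ms)
# tuples gathered from collection metadata.
def prune_duplicates(candidates, max_gap_ms=5 * 60 * 1000):
    kept, last_time = [], None
    for image_id, time_ms in sorted(candidates, key=lambda item: item[1]):
        if last_time is not None and time_ms - last_time < max_gap_ms:
            continue  # acquired within the window of the previous kept image
        kept.append(image_id)
        last_time = time_ms
    return kept
```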
Concerns:
Tasks:
Acceptance Criteria: