Bug: Duplicate JPEGs are not being removed #1
Relevant Research Findings
Conclusions: Given that the S2 collection has its duplicates filtered out before the download ever begins, I'm confused how duplicate JPEGs for the S2 collection are even being generated. I think it will take some testing to figure out when this happens. It's also possible that each person has a different definition of what a duplicate image is: the CoastSat implementation classifies a duplicate as an image from the same satellite collection whose timestamp is less than 24 hours from another image's. As I write this, I realize the issue might not be the filtering technique but the fact that the collections come in two different tiers. There may be timestamps that are identical across both tiers for the same satellite, which would produce duplicate imagery even though it should have been filtered out. While the S2 collection does not have two tiers, the other satellites do, so I'm going to do some testing and see if duplicates are arising because of this.
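To test the tier-overlap hypothesis, something like the following could work. This is only a sketch using the Earth Engine Python API; the point and date range are placeholders rather than values from an actual download.

```python
# Sketch: check whether the same acquisition timestamps appear in both Landsat 8
# Collection 2 tiers over an example location/date range, which would explain
# duplicate downloads. The point and date range below are placeholders.
import ee

ee.Initialize()

point = ee.Geometry.Point(-117.45, 33.25)  # example location, not from the issue
start, end = "2023-01-01", "2023-06-01"

def acquisition_times(collection_id):
    """Return the set of system:time_start values (ms since epoch) for a collection."""
    collection = (
        ee.ImageCollection(collection_id)
        .filterBounds(point)
        .filterDate(start, end)
    )
    return set(collection.aggregate_array("system:time_start").getInfo())

tier1 = acquisition_times("LANDSAT/LC08/C02/T1_TOA")
tier2 = acquisition_times("LANDSAT/LC08/C02/T2_TOA")

overlap = tier1 & tier2
print(f"Timestamps present in both tiers: {len(overlap)}")
```

If the overlap is non-empty for a given site, that would point to the tiers rather than the filtering step as the source of duplicates.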
These two jpegs
We'd want to keep both images. Having images on consecutive days is super valuable! Duplicates are only when images have identical times.
Ah, good to know, thanks for helping me double-check that. When I ran the script below on 700 S2 images I downloaded, I didn't find any duplicate imagery. Sometimes the images are only a few minutes apart, but other than that I'm not finding duplicates.

```python
import os
from collections import Counter

# List the downloaded RGB JPEGs and count how often each filename appears.
file_list = os.listdir('/home/sha23/development/coastseg/CoastSeg/data/ID_kyg1_datetime10-02-23__03_11_52/jpg_files/preprocessed/RGB')
counter = Counter(file_list)

# Any filename that appears more than once is a duplicate.
duplicates = {file: count for file, count in counter.items() if count > 1}

# Print the duplicates
for duplicate, count in duplicates.items():
    print(f"Filename: {duplicate} - Count: {count}")
```
I ran this script across all the data I've downloaded and I didn't find any duplicates.

@dbuscombe-usgs have you found duplicate imagery in any of the downloads you've performed?
I heard back from Catherine on the duplicate images issue and here is what she said:
So it seems there aren't identical images being generated, just multiple images that are sometimes seconds apart.
During the meeting today we addressed the confusion about "duplicate" images, or more accurately, images captured within a few minutes of each other or less. We determined that it would be easiest to handle this with a post-processing script that removes images that are less than a few minutes apart (a rough sketch is included below). It was suggested that this would be a script located in the scripts directory.
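A minimal sketch of what such a post-processing script might look like, assuming the capture timestamp can be parsed from the start of each JPEG filename (e.g. a `YYYY-MM-DD-HH-MM-SS` prefix like `2023-10-02-03-11-52_RGB_S2.jpg`); the filename format, directory path, and threshold here are assumptions, not confirmed CoastSeg behavior.

```python
# Sketch of a post-processing pass that flags JPEGs captured within a few minutes
# of the previously kept image. Assumes filenames begin with a
# "YYYY-MM-DD-HH-MM-SS" timestamp; adjust the parsing if the real scheme differs.
import os
from datetime import datetime, timedelta

RGB_DIR = "/path/to/jpg_files/preprocessed/RGB"  # placeholder path
THRESHOLD = timedelta(minutes=5)                 # user-specified time period

def capture_time(filename):
    # First 19 characters assumed to hold the timestamp.
    return datetime.strptime(filename[:19], "%Y-%m-%d-%H-%M-%S")

jpgs = sorted((f for f in os.listdir(RGB_DIR) if f.endswith(".jpg")), key=capture_time)
kept, near_duplicates = [], []
last_kept_time = None

for name in jpgs:
    t = capture_time(name)
    if last_kept_time is not None and t - last_kept_time < THRESHOLD:
        near_duplicates.append(name)  # within the window of the last kept image
    else:
        kept.append(name)
        last_kept_time = t

print(f"Kept {len(kept)} images, flagged {len(near_duplicates)} near-duplicates")
# To actually delete them:
# for name in near_duplicates:
#     os.remove(os.path.join(RGB_DIR, name))
```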
I think the point of the script would be for the user to specify what time period they'd like, and it should go in SDS-tools. It wouldn't filter out images, but shorelines.
And no, we don't want to remove any imagery. The SDS-tools script will remove duplicate shorelines: it will look at all the shorelines within X minutes (hours, days, whatever) of another one and remove them.
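A rough sketch of that shoreline-level filter, assuming the extracted shorelines are available as parallel `dates` and `shorelines` lists in a CoastSat-style output dictionary; the threshold and dictionary layout are assumptions, and any other per-image keys would need the same index filtering.

```python
# Sketch: drop shorelines whose timestamp falls within X minutes of an
# already-kept shoreline. Assumes a dict with parallel "dates" (datetime
# objects) and "shorelines" (coordinate arrays) lists.
from datetime import timedelta

def remove_near_duplicate_shorelines(output, threshold=timedelta(minutes=5)):
    # Visit shorelines in chronological order.
    order = sorted(range(len(output["dates"])), key=lambda i: output["dates"][i])
    kept, last_time = [], None
    for i in order:
        t = output["dates"][i]
        if last_time is None or t - last_time >= threshold:
            kept.append(i)
            last_time = t
    return {
        "dates": [output["dates"][i] for i in kept],
        "shorelines": [output["shorelines"][i] for i in kept],
    }
```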
Description:
Users have found that duplicate images are being downloaded and used to extract shorelines. Downloading duplicate images has always been an issue with CoastSat's download workflow, but the question is whether the duplicate-removal step is applied only to the TIFFs, leaving duplicate JPEGs behind. Users are also wondering if it's possible to modify the download workflow so that duplicates are detected before they are downloaded, so that duplicate images do not further slow down the downloads. Issues with duplicates are most prevalent with S2 imagery. This has led to significant delays in download times, impacting user experience and overall workflow efficiency.
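For example, if the download workflow had access to the candidate images' acquisition times before fetching any files, a pre-download prune could look something like this hypothetical helper (the `candidates` structure and function name are illustrative, not part of CoastSat's API):

```python
# Sketch: prune near-duplicate acquisitions from a candidate image list before
# downloading. `candidates` is assumed to be a list of (image_id, time_ms)
# tuples gathered from collection metadata.
def prune_duplicates(candidates, max_gap_ms=5 * 60 * 1000):
    kept, last_time = [], None
    for image_id, time_ms in sorted(candidates, key=lambda item: item[1]):
        if last_time is not None and time_ms - last_time < max_gap_ms:
            continue  # acquired within the window of the previous kept image
        kept.append(image_id)
        last_time = time_ms
    return kept
```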
Concerns:
Tasks:
Acceptance Criteria: