Uploading Image Files to IDR and BBBC #106
The process is the same as outlined here |
and for our internal (imaging platform) notes: I made some notes here about storage policy related to this question |
thanks @shntnu - these instructions are only slightly different from the ones @hkhawar sent me on Slack. Using both sets of instructions and running the following command on one example plate, I receive the error reproduced below. Is there something obvious that I'm doing wrong, or is there a quick fix? If not, I will keep digging.

Command

(cell-health) ubuntu@ip-10-0-9-22:~/efs/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/software/imaging-backup-scripts$ parallel \
> --results restore \
> -a list_of_plates.txt \
> ./glacier_restore.sh \
> --project_name ${PROJECT_NAME} \
> --batch_id ${BATCH_ID} \
> --plate_id {1} \
> --get_images

Error

Get images ...
Download: s3://imaging-platform-cold/imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/2015_07_01_Cell_Health_Vazquez_Cancer_Broad_CRISPR_PILOT_B1_SQ00014610_images_illum_analysis.tar.gz
An error occurred (NoSuchKey) when calling the RestoreObject operation: The specified key does not exist.
An error occurred (404) when calling the HeadObject operation: Not Found
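A quick way to check whether the archive key exists at all before requesting a restore is a head-object call – a minimal Python sketch, assuming boto3 with the same credentials and reusing the bucket/key from the error output above:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "imaging-platform-cold"
key = ("imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/"
       "2015_07_01_Cell_Health_Vazquez_Cancer_Broad_CRISPR_PILOT_B1_SQ00014610_images_illum_analysis.tar.gz")

try:
    s3.head_object(Bucket=bucket, Key=key)  # same check that precedes RestoreObject
    print("archive exists; a Glacier restore should succeed")
except ClientError as err:
    if err.response["Error"]["Code"] == "404":
        print("no such key - this plate may never have been archived")
    else:
        raise
|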
Ah – looks like Cell Health images were never archived, so I think you are all set! The images are at:

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images

For our notes: this dataset cost ~$72/mo to store and so it went down in the priority list. So glad we have a new process in place now that doesn't rely on running this archival step! |
155520 = 384 wells * 9 sites * 5 channels * 3 cell lines * 3 replicates, so this looks good |
It seems like the load_data_csv folder on S3 shows that the load_data_with_illum.csv file was created, but there is no illum folder containing illum files on S3.
…On Tue, Feb 25, 2020 at 4:38 PM Greg Way wrote:
Thanks @shntnu! This process is new to me so thanks for bearing with me :)
In chatting with @hkhawar about the file structure, is that number slightly concerning? i.e. do we have the illumination corrected images? The load_csv file seems to indicate illumination correction was performed.
|
When you inspect the path of illum files, you will find that they are nested inside the analysis folder – this was our previous standard (we later changed to storing illum functions *with* images). I bet you will find them at that location:

read_csv("../workspace/load_data_csv/CRISPR_PILOT_B1/SQ00014610/load_data_with_illum.csv", col_types = cols(.default = col_character())) %>%
  slice(1) %>%
  select(matches("^PathName_Illum")) %>%
  pivot_longer(cols = everything()) %>%
  knitr::kable()

name value
PathName_IllumAGP
/home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/
PathName_IllumDNA
/home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/
PathName_IllumER
/home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/
PathName_IllumMito
/home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/
PathName_IllumRNA
/home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/
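A quick way to double-check that the functions are actually sitting there – a minimal boto3 sketch, assuming the /home/ubuntu/bucket mount in the table above corresponds to s3://imaging-platform:

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="imaging-platform",
    # prefix taken from the PathName_Illum* values above, minus the local mount point
    Prefix="projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])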
|
👋 shantanu
|
@gwaygenomics note that only illumination correction functions are stored, not the corrected images themselves. In https://idr.openmicroscopy.org/webclient/?show=screen-1751, we decided to additionally store the illumination corrected images, so you would need to generate those separately if you decide to do that here |
@gwaygenomics Some history of why we have illum corrected images in IDR: [collapsed email logs]
|
And some more email logs from the time we submitted https://idr.openmicroscopy.org/webclient/?show=screen-1751: [collapsed email logs]
|
Great! Thanks for providing this context @shntnu - I'd like to include both raw and illumination corrected images. I see the illumination correction functions (.mat files), but I will need help applying them.
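For a rough idea of what "applying" an illum function involves (CellProfiler's CorrectIlluminationApply module does this properly) – a minimal numpy sketch, assuming the .mat file stores the function as a single array and that the correction mode is per-pixel division; the file names here are hypothetical:

import numpy as np
import scipy.io
import tifffile

illum_mat = scipy.io.loadmat("IllumDNA.mat")  # hypothetical illum function file
# the variable name inside the .mat varies, so grab the first non-metadata array
illum = np.asarray(
    next(v for k, v in illum_mat.items() if not k.startswith("__"))
).squeeze()

raw = tifffile.imread("r01c01f01p01-ch2sk1fk1fl1.tiff").astype(np.float64)
corrected = raw / illum  # "Divide" mode; rescale/clip as appropriate for the output dtype
tifffile.imwrite("corrected.tiff", corrected.astype(np.float32))
|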
Greg if you want, I can help you out in getting the illumination corrected
images
|
@hkhawar - yes please! I will find a time on your calendar for a quick meeting |
Sure
|
Hamdah reprocessed some illum corrected files that were corrupted and stored them in folders like this. I am now going to copy these to their corresponding original locations, e.g. here, using this command:

origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

# copy all files (the ones missing in the temppath will fail)
parallel \
  --header ".*\n" \
  -C "," \
  -a corrupted_image.csv \
  aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}

This step revealed that some files were missing in the temppath:

parallel \
  --header ".*\n" \
  -C "," \
  -a corrupted_image.csv \
  "if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
|
thank you Shantanu ❤️ (and Hamdah too for the upfront processing) |
Steps to perform once the missing files listed at the end of #106 (comment) are recreated:

1. Check that nothing is missing in the temppath anymore:

temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

parallel \
  --header ".*\n" \
  -C "," \
  -a corrupted_image.csv \
  "if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"

2. Copy the files to their original locations:

origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

# copy all files (the ones missing in the temppath will fail)
parallel \
  --header ".*\n" \
  -C "," \
  -a corrupted_image.csv \
  aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}

3. Verify the integrity of the copied images:

parallel \
  --header ".*\n" \
  -C "," \
  -a corrupted_image.csv \
  identify illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2} | grep "Can not read TIFF"

aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images | grep tiff > /tmp/image_files.txt

# get file sizes and counts
cat /tmp/image_files.txt | tr -s " " | cut -d" " -f3 | sort -n | uniq -c

Once you've confirmed everything works, you can have IDR run step 3 at their end.
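As an extra cross-check before handing off to IDR, each plate should contain 384 wells * 9 sites * 5 channels = 17,280 images (per the arithmetic earlier in this thread) – a minimal sketch that tallies the /tmp/image_files.txt listing per plate, assuming the .../images/<plate>/Images/<file>.tiff layout shown above:

from collections import Counter

EXPECTED_PER_PLATE = 384 * 9 * 5  # 17,280

counts = Counter()
with open("/tmp/image_files.txt") as fh:
    for line in fh:
        path = line.split()[-1]      # aws s3 ls columns: date time size path
        plate = path.split("/")[-3]  # .../images/<plate>/Images/<file>.tiff
        counts[plate] += 1

for plate, n in sorted(counts.items()):
    flag = "" if n == EXPECTED_PER_PLATE else " <-- unexpected count"
    print(f"{plate}: {n}{flag}")
|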
For my understanding, is this the complete order of operations?
@hkhawar can you help with step 1 above? Thanks again Shantanu and Hamdah! |
@gwaygenomics Do I need to process only the following nine files? |
I am also concerned that some of the files that IDR has not listed as corrupted are actually corrupted. E.g. this one.

I downloaded it like this:

aws s3 cp s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff .

But I'm not able to open the file. My suspicion is that all the files with infrequent file sizes are actually corrupted. Welcome to the rabbit hole! :)

Get the file listing:

aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images | grep tiff > /tmp/image_files.txt

Now download the files whose file sizes are infrequent:

library(tidyverse)

sizes <-
  read_delim("/tmp/image_files.txt",
             col_names = c("date", "time", "size", "path"),
             trim_ws = TRUE,
             delim = " ") %>%
  mutate(download = sprintf("aws s3 cp s3://imaging-platform/%s %s", path, path)) %>%
  mutate(dirpath = dirname(path))

dirpaths <-
  sizes %>%
  distinct(dirpath)

dirpaths$dirpath %>%
  walk(function(dirpath) dir.create(dirpath, showWarnings = FALSE, recursive = TRUE))

frac_sizes <-
  sizes %>%
  group_by(size) %>%
  tally() %>%
  arrange(desc(size)) %>%
  mutate(frac = n / sum(n))

frac_sizes %>%
  head() %>%
  knitr::kable()

frac_sizes %>%
  filter(frac < 0.001) %>%
  select(size) %>%
  inner_join(sizes) %>%
  magrittr::extract2("download") %>%
  walk(function(download) system(download))

I ran that and then did a random sampling of the downloaded images by trying to open them. @gwaygenomics I gotta run, but hopefully you can take it from here and figure out the next steps. If not, ping me on this and I'll have a look once back from vacation.
|
@gwaygenomics I just saw #106 (comment). Yes, that's the right order of operations. But @hkhawar, unfortunately, you will also need to reprocess the files listed at the end of #106 (comment), because my random sampling revealed that those are also corrupted. I have no clue why so many files are getting corrupted, but hopefully you will figure that out. @hkhawar Thanks very much for helping out! |
@hkhawar one more thing – could you please briefly describe the setup you are using to reprocess these images? Are you mounting the S3 bucket and running the pipeline on your own computer, by any chance? If so, I think that could be the issue, because S3 mounts suck with heavy I/O. |
@shntnu I ran this experiment on AWS. I am not sure why we have gotten a lot of corrupted images. My guess is that something happened while running DCP: instead of unfinished jobs ending up in the dead message queue, they somehow created image files with 0 bytes. |
@hkhawar Thanks for clarifying. Very strange! And note that the issue is that some output files are actually pretty large e.g. 8Mb but are still corrupted. Worth checking in with Beth on this via Slack. |
@gwaygenomics Could you please do the same thing that you did before |
@shntnu Sure I will check with Beth on this tomorrow |
Here's an example: r06c11f02p01-ch2sk1fk1fl1.tiff.zip. It doesn't open using Preview, but it does open in Fiji – with the bottom pixels missing.
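To flag files like this without opening each one by hand – a minimal Pillow sketch; it forces a full pixel decode, which is what catches truncated TIFFs that a lazy open would miss (the local directory here mirrors the S3 layout above and is an assumption):

from pathlib import Path
from PIL import Image

for path in Path("illumcorrected_CRISPR_PILOT_B1/images").rglob("*.tiff"):
    try:
        with Image.open(path) as im:
            im.load()  # force pixel decode; truncated files raise OSError here
    except OSError as err:
        print(f"corrupted: {path} ({err})")
|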
I've posted this internally https://broadinstitute.slack.com/archives/G3QFDHXC4/p1595538827014000 |
@gwaygenomics if you can sort out the other channels for the corrupted files for me as you did last time, then I will reprocess them today. |
Sure - what folder do you want them in? Also, do you think reprocessing them the same way as before is a good idea? (are you going to do anything different?) |
I am doing it locally. Just make a tmp2 folder on S3 and dump the new set of images for each plate there? Later we can delete these tmp folders from S3. |
For our notes, could you pen down why they need to be in a new folder (vs. creating a loaddata file pointing to the original locations)? It will be useful to know when we need to reprocess small batches. |
I was avoiding using load_data.csv and wanted to download the images locally and use CellProfiler locally to reprocess the files. This is how I typically do it for a small set of images. |
Occasionally, CellProfiler just stochastically seems to do this – any operation, even write or sync, will sometimes just go ker-flop, and when we're working on 10K/100K/1M/10M images, the likelihood it will happen >=1 times becomes significant. Since each plate has ~21K images, based on the list above, the likelihood is in the 1-to-low-thousands.

If there's a problem with the source image, obviously that's one thing; if the problem is truly stochastic (aka when you run the same image again the output file comes out fine), there isn't a ton to do (though if these were done <60 days ago, it's worth checking the logs for the known bad sites, since that's easy while the logs are still in CloudWatch). If we think the file is being written correctly but not synced correctly, we could always institute a 30 or 60 second pause after the CellProfiler pipeline is done before syncing.

It's worth noting we can very easily handle the ones where files are small (obviously corrupted) using the MIN_FILE_SIZE option I added to DCP, by just resubmitting the whole batch with CHECK_IF_DONE set to TRUE and MIN_FILE_SIZE set small – anything with the right number of files > a certain size will just get skipped, and it will re-process just the ones where 1+ file is tiny.

If either the uncorrupted OR corrupted files have a stereotyped size, which Shantanu your methodology seems to imply, you could imagine other similar checks we could add; essentially either

if filesize in accepted_file_sizes:
    goodfile_count += 1
if goodfile_count >= N:
    reprocess = False

or

if filesize not in known_bad_file_sizes:
    goodfile_count += 1
if goodfile_count >= N:
    reprocess = False
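As a runnable sketch of the first variant (the function name and call are illustrative, not DCP's actual API):

def should_reprocess(file_sizes, accepted_file_sizes, n_expected):
    """Skip a site only if it already has n_expected files of an accepted size."""
    goodfile_count = sum(1 for size in file_sizes if size in accepted_file_sizes)
    return goodfile_count < n_expected

# a site with 5 channels where one output came out tiny -> reprocess it
print(should_reprocess([9331200] * 4 + [132], {9331200}, 5))  # True (sizes illustrative)
|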
the corrupted files are ready to go! located at |
@bethac07 Logs are not available now. My guess is that the problem happened during syncing of the files. On reprocessing those images again, they just worked fine. |
Thanks for clarifying @bethac07 🥇. @hkhawar, details are below, but tl;dr: we could have gone with a fixed file size, because these are uncompressed TIFFs and so I think they should all be the same file size. But there's one aberration (below), so instead let's go with MIN_FILE_SIZE.

Details

I dug into this a bit for our future reference with this kind of issue. From the file-size table (elided), it looks like nearly all files share one size; all other sizes have only 1-2 occurrences (except 93487181 – see below).
|
@gwaygenomics I have reprocessed the illum corrected images and they are available in the same folder: s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two/

Note: I haven't synced them to s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/ – I guess you can do it.

@shntnu I have no idea why we get images of this 93487181 size. Did you try opening an image of this size in Fiji? |
Progress
|
Download integrity confirmed! This is the output of the R code in #106 (comment):
🎉 All that remains is to send IDR the S3 links |
One potentially interesting observation is that all of the corrupted files that we needed to fix ended up having the smaller file size listed above. |
next hurdle incoming!

Summary

IDR has all non-illumination corrected images, but they are missing 1,925 illumination corrected images.

Specifics

The folks at IDR are working towards verifying the submission. A couple of points that either @hkhawar or @shntnu might know the answer to right away:

1. Images with f09 in their name are missing from the illumination corrected set (there are 1,920 of these).
2. There are 5 additional images missing in the illumination corrected set, all from plate SQ00014610.

Issue 1 - Missing f09

Here are example images:

r16c24f09p01-ch2sk1fk1fl1.tiff
r16c24f09p01-ch3sk1fk1fl1.tiff
r16c24f09p01-ch4sk1fk1fl1.tiff
r16c24f09p01-ch5sk1fk1fl1.tiff

Issue 2 - Five more

r16c24f01p01-ch1sk1fk1fl1.tiff
r16c24f01p01-ch2sk1fk1fl1.tiff
r16c24f01p01-ch3sk1fk1fl1.tiff
r16c24f01p01-ch4sk1fk1fl1.tiff
r16c24f01p01-ch5sk1fk1fl1.tiff
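One way to enumerate exactly which illum corrected files are missing – a sketch that diffs an expected filename set against the /tmp/image_files.txt listing, assuming the pattern from the examples above and a 16-row x 24-column, 9-field, 5-channel plate:

with open("/tmp/image_files.txt") as fh:
    actual = {
        line.split()[-1].rsplit("/", 1)[-1]
        for line in fh
        if "SQ00014610" in line
    }

expected = {
    f"r{r:02d}c{c:02d}f{f:02d}p01-ch{ch}sk1fk1fl1.tiff"
    for r in range(1, 17)   # 16 rows
    for c in range(1, 25)   # 24 columns
    for f in range(1, 10)   # 9 fields
    for ch in range(1, 6)   # 5 channels
}

for name in sorted(expected - actual):
    print("missing:", name)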
|
@gwaygenomics it's a huge pain. Again, I think it is related to the same problem of files not transferring to S3 properly, which produced corrupted and missing image files. If they provide us a list of the missing illum images, then I will have to redo it again.
|
Argh! Is there something that I can do to ease the pain? Transfer files into a new folder again? It seems like this is an AWS transfer issue? |
Yup, that would be a great help. Let me know once they are done and I will work on it.
turns out that we actually have 17,285 illum corrected files missing:

1,920 "f09" files missing per plate
9 plates
5 "f01" files missing only in plate SQ00014610
1,920 * 9 + 5 = 17,285

I have confirmed that all of these files are now in a separate folder. The folder is /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_three. Note that the subfolder is tmp_version_three. @hkhawar all set for the next (and hopefully final!) iteration of the illum correction pipeline. Thanks again |
@gwaygenomics Thanks a lot. I will start working on it tomorrow.
|
I have confirmed that all 17,285 files have been corrected, and that they are all the same size:
I will work on uploading them directly to IDR now (they gave us an FTP) |
@gwaygenomics great |
images are now public: https://idr.openmicroscopy.org/webclient/?show=screen-2701 |
Uploading the images into the public domain is a very important part of the research process. I will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.
We will use this issue to outline the required steps. First, I will need to restore the image files from AWS Glacier storage. @shntnu - can you link the most recent resources?
edit
data now available: https://idr.openmicroscopy.org/webclient/?show=screen-2701