Uploading Image Files to IDR and BBBC #106

Closed
gwaybio opened this issue Feb 25, 2020 · 54 comments

Comments

@gwaybio
Member

gwaybio commented Feb 25, 2020

Uploading the images into the public domain is a very important part of the research process. I will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps. First, I will need to restore the image files from AWS Glacier storage. @shntnu - can you link the most recent resources?


Edit: data now available: https://idr.openmicroscopy.org/webclient/?show=screen-2701

@shntnu
Collaborator

shntnu commented Feb 25, 2020

The process is the same as outlined here
broadinstitute/lincs-cell-painting#2 (comment)
except that you stop at step 4 ("Delete the files from EBS"; the compression related comments are not relevant)

@shntnu
Collaborator

shntnu commented Feb 25, 2020

and for our internal (imaging platform) notes: I made some notes here about storage policy related to this question

@gwaybio
Member Author

gwaybio commented Feb 25, 2020

Thanks @shntnu - these instructions are only slightly different from the ones @hkhawar sent me on Slack. Using both sets of instructions, I ran the following command on one example plate and received the error reproduced below.

Is there something obvious that I'm doing wrong, or, is there a quick fix? If not, I will keep digging.

Command

# run from ~/efs/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/software/imaging-backup-scripts
parallel \
  --results restore \
  -a list_of_plates.txt \
  ./glacier_restore.sh \
  --project_name ${PROJECT_NAME} \
  --batch_id ${BATCH_ID} \
  --plate_id {1} \
  --get_images

Error

Get images ...
Download:s3://imaging-platform-cold/imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/2015_07_01_Cell_Health_Vazquez_Cancer_Broad_CRISPR_PILOT_B1_SQ00014610_images_illum_analysis.tar.gz

An error occurred (NoSuchKey) when calling the RestoreObject operation: The specified key does not exist.

An error occurred (404) when calling the HeadObject operation: Not Found

@shntnu
Collaborator

shntnu commented Feb 25, 2020

Ah – looks like Cell Health images were never archived, so I think you are all set!

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images

For our notes: this dataset cost ~$72/mo to store and so it went down in the priority list. So glad we have a new process in place now that doesn't rely on running this archival step!

@shntnu
Collaborator

shntnu commented Feb 25, 2020

~$ aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images|grep tiff  > ~/Desktop/CRISPR_PILOT_B1.txt
~$ wc -l ~/Desktop/CRISPR_PILOT_B1.txt
  155520 /Users/shsingh/Desktop/CRISPR_PILOT_B1.txt

155520 = 384 wells * 9 sites * 5 channels * 3 cell lines * 3 replicates, so this looks good
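The count check above can be reproduced in a few lines of Python. This is only a sketch: the function name `expected_image_count` is illustrative, and the factor values are the ones quoted in this comment.

```python
from math import prod


def expected_image_count(wells, sites, channels, cell_lines, replicates):
    """Number of TIFF files expected if every well, site and channel was imaged."""
    return prod([wells, sites, channels, cell_lines, replicates])


expected = expected_image_count(
    wells=384, sites=9, channels=5, cell_lines=3, replicates=3
)
observed = 155520  # line count reported by `wc -l` in this comment

print(expected, expected == observed)
```

A mismatch between `expected` and `observed` would point to missing or extra files in the listing.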

@gwaybio
Member Author

gwaybio commented Feb 25, 2020

Thanks @shntnu ! This process is new to me so thanks for bearing with me :)

In chatting with @hkhawar about the file structure, is that number slightly concerning? i.e. do we have the illumination corrected images? The load_csv file seems to indicate illumination correction was performed.

@hkhawar

hkhawar commented Feb 25, 2020 via email

@shntnu
Collaborator

shntnu commented Feb 25, 2020

When you inspect the path of illum files, you will find that they are nested inside the analysis folder – this was our previous standard (we later changed to storing illum functions with images)

I bet you will find them at that location

read_csv("../workspace/load_data_csv/CRISPR_PILOT_B1/SQ00014610/load_data_with_illum.csv",
         col_types = cols(.default = col_character())) %>%
  slice(1) %>%
  select(matches("^PathName_Illum")) %>%
  pivot_longer(cols = everything()) %>%
  knitr::kable()

| name | value |
|---|---|
| PathName_IllumAGP | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumDNA | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumER | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumMito | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumRNA | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
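For anyone working in Python rather than R, the same lookup can be sketched with the standard library alone. The helper name `illum_paths` and the two-column example CSV are made up for illustration; the only thing taken from the thread is that the columns of interest start with `PathName_Illum`.

```python
import csv
import io


def illum_paths(csv_text):
    """Return {column: value} for the PathName_Illum* columns of the first data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)
    return {
        name: value
        for name, value in first_row.items()
        if name.startswith("PathName_Illum")
    }


# Tiny made-up stand-in for load_data_with_illum.csv (not real data).
example = (
    "Metadata_Well,PathName_IllumDNA,PathName_IllumER\n"
    "A01,/example/illum/,/example/illum/\n"
)

for name, value in illum_paths(example).items():
    print(name, value)
```

To run it on a real file, read the file contents into a string (or adapt the function to take an open file handle) and pass that in place of `example`.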

@hkhawar

hkhawar commented Feb 25, 2020 via email

@shntnu
Collaborator

shntnu commented Feb 25, 2020

> do we have the illumination corrected images?

@gwaygenomics note that only illumination correction functions are stored, not the corrected images themselves.

In https://idr.openmicroscopy.org/webclient/?show=screen-1751, we decided to additionally store the illumination corrected images, so you would need to generate those separately if you decide to do that here

@shntnu
Collaborator

shntnu commented Feb 25, 2020

@gwaygenomics Some history of why we have illum corrected images in
https://idr.openmicroscopy.org/webclient/?show=screen-1751:

Forwarded Conversation
Subject: Depositing images to Image Data Resource
------------------------

From: Anne Carpenter <[email protected]>
Date: Wed, Dec 7, 2016 at 12:28 PM
To: Shantanu Singh <[email protected]>, Mohammad Hossein Rohban
<[email protected]>, Juan C Caicedo <[email protected]>


Mohammad will soon be receiving a hard drive to transfer images to IDR. 
I want to discuss whether to upload illumination corrected images in addition
to or instead of the raw images. And, outlines of segmented objects.
I ask this because I clicked on the Wawer screen link
http://idr-demo.openmicroscopy.org/webclient/?show=screen-1251
and clicked a few wells and some stains are very dramatically dimmer across
the top. For deep learning purposes and/or for simple viewing it would be much
nicer to have the corrected ones. 
I guess it comes down to whether IDR's tools make it easy to ignore the
corrected and/or non-corrected ones if you want, and whether IDR is not
enthusiastic about doubling the size of the data for something simple like
correction/not. 
Secondly, I think we should routinely submit the nucleus outlines and cell
outlines. These should be very small (binary) images so I don't think IDR
would object to the size. The question for them is whether it's better to
submit two binary images (one cell outlines and one nucleus outlines). The 2nd
Q is whether they have any tools built in that can overlay these onto the
intensity images.

Unless someone has profound insights/discussion on this I would suggest the
next step is for a volunteer to contact IDR and ask about all this. 
Anne

----------
From: Shantanu Singh <[email protected]>
Date: Wed, Dec 7, 2016 at 1:17 PM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>, Juan C Caicedo
<[email protected]>


On Wed, Dec 7, 2016 at 12:27 PM, Anne Carpenter <[email protected]>
wrote:
> Unless someone has profound insights/discussion on this I would suggest the
> next step is for a volunteer to contact IDR and ask about all this.

I agree

Mohammad – given that you are our link to IDR at the moment, would you
mind asking them this question? i.e.

- We would like to upload both, raw images, and images corrected for
illumination inhomogeneity

- We would also like to upload  nucleus outlines and cell outlines for
each image

- Let us know how to go about this, and if you have any concerns about
these extra image data
Forwarded Conversation
Subject: Re: Hard-drive for Target Accelerator ORF data to go in IDR
------------------------

From: Mohammad Hossein Rohban <[email protected]>
Date: Thu, Dec 8, 2016 at 10:39 AM
To: Eleanor Williams <[email protected]>
Cc: Anne Carpenter <[email protected]>, Shantanu Singh
<[email protected]>, Juan Caicedo <[email protected]>


Dear Eleanor,

Thanks for sending us the hard drive. We were also wondering if we could also
include illumination corrected images and images containing nucleus and cell
outlines in addition to the original images (size will still be less than 1
TB). Please let us know how to go about this and if you have any concerns
about these extra image data.

Best,
Mohammad 


----------
From: Anne Carpenter <[email protected]>
Date: Thu, Dec 8, 2016 at 10:42 AM
To: Mohammad Hossein Rohban <[email protected]>
Cc: Eleanor Williams <[email protected]>, Shantanu Singh
<[email protected]>, Juan Caicedo <[email protected]>


In addition to just knowing whether it's acceptable to receive these extra
images, we wondered whether the tools in IDR would allow for choosing which of
10 channels to display (5 original channels, 5 illumination corrected
channels, 2 outline 'channels') and whether the outlines could be overlaid on
intensity images.
Anne

----------
From: Eleanor Williams <[email protected]>
Date: Fri, Dec 9, 2016 at 4:19 AM
To: Anne Carpenter <[email protected]>, Mohammad Hossein Rohban
<[email protected]>
Cc: Shantanu Singh <[email protected]>, Juan Caicedo
<[email protected]>, [email protected]
<[email protected]>


Hi Anne and Mohammad
Please do add the illumination corrected images and images
containing nucleus and cell outlines.  The IDR is set up so that it is
possible to choose which of the channels to display.  I'm less sure about the
overlaying of the outlines but we can certainly look into coming up with a
solution to this. 
Best regards
Eleanor


----------
From: Mohammad Hossein Rohban <[email protected]>
Date: Mon, Dec 12, 2016 at 5:48 PM
To: Eleanor Williams <[email protected]>
Cc: Anne Carpenter <[email protected]>, Shantanu Singh
<[email protected]>, Juan Caicedo <[email protected]>,
[email protected] <[email protected]>


Hi Eleanor, 
We are about to submit our images. The raw images are generally in the range
of 0-2000 in an image file format that spans 0-4095. Does IDR have viewing
options that allow the viewer to contrast-stretch the images if desired so
they do not appear too dim? Or do you recommend we adjust the images on our
side prior to submission (e.g. find the 99.9th percentile of maximum pixel
value in all images and contrast-stretch all images to that value)?
Thanks,
Mohammad

----------
From: Eleanor Williams <[email protected]>
Date: Tue, Dec 13, 2016 at 5:24 AM
To: Mohammad Hossein Rohban <[email protected]>
Cc: Anne Carpenter <[email protected]>, Shantanu Singh
<[email protected]>, Juan Caicedo <[email protected]>,
[email protected] <[email protected]>


Hi Mohammad
Yes we can apply rendering settings in IDR.  We normally do this
using a config file. The rendering settings are applied at the whole dataset
level but it should be possible to apply at more specific levels too.  Here is
an example of a config file for a screen

https://github.com/IDR/idr-metadata/blob/master/idr0001-graml-sysgro/screenA/idr0001-screenA-renderdef.yml.
If you could include a table of the settings you'd like to apply in each
channel, we will create the config files.
Best regards
Eleanor


----------
From: Anne Carpenter <[email protected]>
Date: Tue, Dec 13, 2016 at 7:57 AM
To: Eleanor Williams <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>, Shantanu Singh
<[email protected]>, Juan Caicedo <[email protected]>,
[email protected] <[email protected]>


That is excellent. In that case we will send the raw images and the
illumination corrected images, and allow the config file to adjust contrast if
needed.
Anne

@shntnu
Collaborator

shntnu commented Feb 25, 2020

And some more email logs from the time we submitted https://idr.openmicroscopy.org/webclient/?show=screen-1751

Forwarded Conversation
Subject: Re: [idr-submission] Hard-drive for Target Accelerator ORF data to go
in IDR
------------------------

From: Anne Carpenter <[email protected]>
Date: Mon, Mar 6, 2017 at 9:22 AM
To: Mohammad Hossein Rohban <[email protected]>
Cc: Shantanu Singh <[email protected]>


Great! SS or MB would be best suited to answer about mapping and scaling of
channels. It's an important Q and will ease our future work if we get the
right answer.


On Mon, Mar 6, 2017 at 9:20 AM, Mohammad Hossein Rohban
<[email protected]> wrote:
Hi Anne
A DOI has been assigned for the TA ORF dataset at IDR, so we can now include
it in eLife submission. They also asked (see forwarded email) for the
rendering setting. Do we want to change assigned colors of the channels?
—Mohammad

Begin forwarded message:
From: Eleanor Williams <[email protected]>
Subject: Re: [idr-submission] Hard-drive for Target Accelerator ORF data to go
in IDR
Date: March 3, 2017 at 6:14:30 PM EST
To: Mohammad Hossein Rohban <[email protected]>,
"[email protected]" <[email protected]>
Reply-To: <[email protected]>,
<[email protected]>

I forgot that I wanted to ask you about rendering settings.  At the moment
there is a green channel, a red channel and 3 blue channels, in both the raw
image and illumination corrected images (screenshots attached).   Would you
like to change the color of some of the channels and is there a particular max
and min value you'd like applied across all plates (or all raw and all
illumination corrected plates)?
Best regards
Eleanor


On 03/03/2017 22:52, Eleanor Williams wrote:
Hi Mohammad
The data DOI for your dataset will be
http://dx.doi.org/10.17867/10000105 and this can be put in your publication.
The sentence should be along the lines of 'Image files are available in the
Image Data Resource under DOI http://dx.doi.org/10.17867/10000105'. I have also
attached the depositor agreement for the University of Dundee, which one of
the authors should sign and then ideally scan and email back to us.
We have now been able to test load a few plates and they look fine so we'll go
ahead and get them all into private version of IDR ready for the next data
release.   I am looking at the annotations now and will let you know if I have
any questions.
Best regards
Eleanor


On 01/03/2017 15:48, Eleanor Williams wrote:
Great, thanks.  I'll ask for the DOI to be generated and email it to you when
we get it.
Best regards
Eleanor

On 01/03/2017 15:39, Mohammad Hossein Rohban wrote:
Thanks! Indeed we have changed the title to “Systematic morphological
profiling of human gene and allele function via Cell Painting”. Everything
else is precise in the attached excel file. Known ORCIDs of the authors are:
Mohammad H. Rohban: 0000-0001-6589-850X
Anne E. Carpenter: 0000-0003-1555-8261
Best,
Mohammad
On Mar 1, 2017, at 7:20 AM, Eleanor Williams <[email protected]> wrote:
Hi Mohammad
The data is still in the process of being added to the IDR as we've
had a few infrastructure changes in the last month which has prevented data
loading.  But we can create a place holder and get a data DOI for you to put
in the paper in the next couple of days. To get the data DOI minted I need to
submit basic details including the dataset creators (paper authors) and
license information.  Could you check over the attached spreadsheet to make
sure the details are correct?   
If you know the ORCID IDs (https://orcid.org/) of any of the authors it would
be useful to have them. 
If you want to add any subject keywords they can be added on the 'subject'
line.  
The default license is CC-BY 4.0
(https://creativecommons.org/licenses/by/4.0/). 
After minting the data DOI, the University of Dundee will create a depositor
agreement which I can email you for signing.  It can then be scanned and
returned.  
We anticipate that the raw and corrected images with annotations would be live
in IDR by 15th March.  The segmented images might take a little longer.  I
hope this will fit with your publication time frame.
Best regards
Eleanor



On 28/02/2017 22:14, Mohammad Hossein Rohban wrote:
Hi Eleanor,
Hope all is well!
We have just submitted a revision of our paper related to
this dataset and the editor asked us to provide details of the image dataset
we are using (actually an access URL). I was wondering if the data has been
processed at your end and if there is any access URL available for it?
Thanks,
Mohammad
On Feb 7, 2017, at 9:10 AM, Eleanor Williams <[email protected]> wrote:
The hard drive has arrived safely and I'll try and look at it as soon as I
can.  
Best regards
Eleanor

On 02/02/2017 21:44, Jeanelle Ackerman wrote:
Hi, I just dropped this hard-drive in the mail today so it should be back to
you shortly. 
Please let me know when it's been received, thanks. 
Best, 
Jeanelle
On Tue, Dec 6, 2016 at 5:23 AM, Eleanor Williams <[email protected]>
wrote:
Dear Anne, Mohammad and Jeanelle

A 1Tb hard-drive is on its way to you (addressed to Anne) by Fedex.  If
you could put the image files and any metadata files you have on there
and send it back to me that would be great.  I have included a return
address label in the package.

Best regards

Eleanor



----------
From: Shantanu Singh <[email protected]>
Date: Tue, Mar 7, 2017 at 7:59 PM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>



On Mon, Mar 6, 2017 at 9:22 AM, Anne Carpenter <[email protected]>
wrote:
Great! SS or MB would be best suited to answer about mapping and scaling of
channels. It's an important Q and will ease our future work if we get the
right answer.
(1)
I don't have a quick answer about mapping – we had discussed this before in
the group and there was no satisfactory conclusion. If we need an immediate
(<2 weeks) answer then I'd suggest going with their defaults.
(I'll message on Slack to see if anyone recollects how that fancy image for
the Science article was made; I think we thought about color mapping for that)
(2) For scaling, we have a better sense:
for illumination corrected images – rescale 0-65535 to 0-255 when displaying.
This is ok because the images are already illum corrected so we know that
plate-to-plate variations are already adjusted for. Mohammad – worth checking
whether this indeed works out ok by testing for a few images.
for raw images – go with their defaults because we don't really have an easy
way of setting scales 

----------
From: Shantanu Singh <[email protected]>
Date: Tue, Mar 7, 2017 at 8:02 PM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


On Tue, Mar 7, 2017 at 7:59 PM, Shantanu Singh
<[email protected]> wrote:
> (I'll message on Slack to see if anyone recollects how that fancy image for
> the Science article was made; I think we thought about color mapping for
> that)

https://broadinstitute.slack.com/archives/ip-general/p1488934898000419


----------
From: Shantanu Singh <[email protected]>
Date: Thu, Mar 23, 2017 at 7:10 AM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


On Tue, Mar 7, 2017 at 8:02 PM, Shantanu Singh
<[email protected]> wrote:
>> (I'll message on Slack to see if anyone recollects how that fancy image for
>> the Science article was made; I think we thought about color mapping for
>> that)
>
> https://broadinstitute.slack.com/archives/ip-general/p1488934898000419

PS – I didn't end up asking w Mark/David; please go ahead and ask them


----------
From: Shantanu Singh <[email protected]>
Date: Thu, Mar 23, 2017 at 7:14 AM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


Also, the page says 12 plates whereas we have only 6

https://twitter.com/jrswedlow/status/844626088512933888

May be because they are counting the illumination corrected ones too?
In any case, please let them know

----------
From: Shantanu Singh <[email protected]>
Date: Thu, Mar 23, 2017 at 7:19 AM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


Also worth checking to make sure they have got the rest of the metadata
correct:
https://github.com/IDR/idr-metadata/tree/master/idr0033-rohban-pathways

e.g. they are using the old channel names (PhGolgi, etc.)

On Thu, Mar 23, 2017 at 7:14 AM, Shantanu Singh

----------
From: Anne Carpenter <[email protected]>
Date: Fri, Mar 24, 2017 at 8:28 AM
To: Shantanu Singh <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


Ach! Amazing you noticed this. Mohammad will you be able to track this down
and help them fix what's needed?

Anne E. Carpenter, Ph.D.
Director, Imaging Platform
Broad Institute of Harvard and MIT
415 Main Street, Cambridge MA 02142
phone: (617)
[email protected]://www.broadinstitute.org/~anne


----------
From: Mohammad Hossein Rohban <[email protected]>
Date: Tue, Mar 28, 2017 at 10:28 AM
To: Anne Carpenter <[email protected]>
Cc: Shantanu Singh <[email protected]>


The metadata is good, but I am going to email them about the number of
plates. 
—Mohammad

----------
From: Anne Carpenter <[email protected]>
Date: Mon, Apr 3, 2017 at 10:33 AM
To: Mohammad Hossein Rohban <[email protected]>, Shantanu Singh
<[email protected]>


Shantanu, you can skip reading below, the question is: both the
illum-corrected and raw images will be available for *download* from IDR (once
they set up download functionality) so our decision right now is which one do
we want available for browsing. I vote for illum corrected personally.
Anne


On Mon, Apr 3, 2017 at 8:40 AM, Mohammad Hossein Rohban
<[email protected]> wrote:
Hi Anne
Eleanor asked me about some visualization settings of TA ORF at IDR and
whether we want the illumination corrected plates just for download (2 last
emails). I was not sure about them. Do you have any preferences about them?
—Mohammad 

Begin forwarded message:
From: Eleanor Williams <[email protected]>
Subject: Re: [idr-submission] Hard-drive for Target Accelerator ORF data to go
in IDR
Date: April 3, 2017 at 6:55:25 AM EDT
To: Mohammad Hossein Rohban <[email protected]>
Cc: "Eleanor Williams (Staff)" <[email protected]>,
"[email protected]" <[email protected]>

Hi Mohammad
Just checking about whether you want just the 6 raw image plates in
IDR for now?
Best regards
Eleanor

On 29/03/2017 17:32, Eleanor Williams wrote:
Hi Mohammad
Ok, I will update the annotations with the transcripts listed in
idr0033-screenA-library_new.txt and add to this the Quality Control columns.  
Would you rather that in the IDR website we just list the 6 raw image plates
and then perhaps the illumination corrected versions can be just available for
download (when we get the download facility ready)?  I can easily delete the 6
illumination corrected plates from the IDR website.
Have you had any thoughts about the rendering settings (a max and min
intensity we could use for all raw images) and the channel colours?
Best regards
Eleanor

On 28/03/2017 17:12, Mohammad Hossein Rohban wrote:
Hi Eleanor
Sorry for replying to this with delay. In case of the discrepancy, we would
like to use idr0033-screenA-library_new.txt which has been obtained directly
from GPP. And I agree the idea of adding the Quality control column to the
library file makes it clearer. Thanks for doing this. I also wanted to mention
that the number of Plate Count was set to 12 in the IDR website, while we only
have 6 (and the other 6 were just illumination corrected versions of the same
plates). Would you please set it to 6 instead?
Thanks,
Mohammad
On Mar 15, 2017, at 5:29 PM, Eleanor Williams <[email protected]> wrote:
Hi Mohammad
Due to technical problems our release hasn't happened yet today,
but will hopefully happen very soon. I'll let you and Anne know when it has. 
In this release I haven't included the Transcript identifiers because time was
running short and the identifiers you listed in
idr0033-screenA-library_new.txt are not the same as those in Supplemental
Table 1 and I wanted to double check this. e.g. RICTOR_WT/ccsbBroad304_13449
has a target transcript of  BC029608.1 in the table and XM_006714463.3 in your
file.  Happy to use the ones in your file, or not include this column if
it's a complicated issue.
I was also not expecting  transcript identifiers when there is the comment of
"ORF did not map to any transcripts" but I see now that in the study file
there is a note that  "The ORF sequence is compared against the target
transcript; ORFs with matching percentage of less than 99% in either
nucleotide or protein sequences are filtered out".  Maybe we should add a
Quality Control column to the library file and enter 'Fail' for these rows and
put a  "Quality Control Comment" of "ORF matched transcript with percentage
less than 99% in either nucleotide or protein sequences"?  Example of what I
mean is attached. If you'd rather just leave it as it is then that is fine but
I thought this might make it clearer that these images were not used in the
analyses.
Best regards
Eleanor




On 14/03/2017 12:12, Mohammad Hossein Rohban wrote:
Hi Eleanor 
That’s great! Thanks for handling this. Here are the answers:
1. Yes.
2. Unfortunately they do not directly map to them. One has to use the
actual ORF sequence to map them.
3. I think that would be great. Unfortunately, BRDN samples are shown only
in the internal GPP portal (I think because of the privacy reasons).
4. I included the Transcript as a column and you can find the new library
file attached.
5. We prefer to use 'small asymmetric cells'.
6. Note that both of the two sql files should be imported in the same
database. It appears that you are importing them in separate databases.
7. For each treatment, there are 5 different replicates. We obtain
correlations between all pairs of profiles corresponding to these
replicates. This gives us 10 correlation values. Median replicate correlation
is then defined as the median of these 10 numbers.
—Mohammad

On Mar 10, 2017, at 10:07 AM, Eleanor Williams <[email protected]>
wrote:
Hi Mohammad
I've been through all the annotation files you sent and slightly
reformatted them but no major changes except that I added in the
_illum_corrected plates to the library file and added the phenotypes to the
processed file.  I've attached all the files to this email for you to check
over.  
If there are no major issues then we will get these annotations added to the
images on Monday or Tuesday next week. If necessary we can update them again
after that. 
I had a few minor questions for you:
1. Does EMPTY mean untreated cells?
2. Do you know if the clone identifiers you have listed map directly to
identifiers in the human ORFeome resource (http://horfdb.dfci.harvard.edu/)?
The reason I ask is that we have another high content screen using clones from
the human ORFeome resource and I wondered if we could link between any of
them.
3. Would it be useful for us to link out to the ORF sequence alignments in
the gpp portal?  I found that the 'ccsb' type clone IDs linked to this. But
not the 'BRDN*' type ones - is information about them held anywhere?
4. I also wondered if it would be useful for us to add the Target Transcripts
from Suppl Table 1 to the library file.  If you have this information
accessible would you be able to add it?
5. I noticed a slight difference in terms used for one of the phenotypes -
'small cells (condensed)' in paper vs 'small asymmetric cells' in the file I
was sent.  Which would you rather use?
6. I tried importing the two sql files of feature data into mysql databases.
TargetAccelerator.sql worked fine but I got the following error with the other
one
mysql -u root -p idr0033_Per_Object_View < Per_Object_View.sql
ERROR 1146 (42S02) at line 1: Table
'idr0033_per_object_view.sigma2_pilot_2013_10_11_analysis_per_nuclei' doesn't
exist
7. Could you give me a short description of what the 'Median Replicate
Correlation' is?

I think that's all.  For now I have not mapped any of your phenotypes to
ontology terms as we don't have good ways of expressing enriched or
de-enriched phenotypes but I'd like to work on this in future.
If you could let me know if the files I have attached are OK as soon as
possible that would be great. I can add any other extra bits of information
later. Also we need the depositor agreement returned.
Have a good weekend. 
Best regards
Eleanor

On 07/03/2017 18:36, Mohammad Hossein Rohban wrote:
Hi Eleanor,
Thanks! We will soon let you know about the rendering setting.
—Mohammad

----------
From: Shantanu Singh <[email protected]>
Date: Mon, Apr 3, 2017 at 1:28 PM
To: Anne Carpenter <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


On Mon, Apr 3, 2017 at 10:33 AM, Anne Carpenter <[email protected]>
wrote:
> I vote for illum corrected personally.

I agree


----------
From: Anne Carpenter <[email protected]>
Date: Mon, Apr 3, 2017 at 1:42 PM
To: Shantanu Singh <[email protected]>
Cc: Mohammad Hossein Rohban <[email protected]>


Right now it appears both copies are visible at IDR and it's clear which are
corrected and which not - Mohammad, why not just keep it as is?


----------
From: Mohammad Hossein Rohban <[email protected]>
Date: Mon, Apr 3, 2017 at 1:43 PM
To: Anne Carpenter <[email protected]>
Cc: Shantanu Singh <[email protected]>


Apparently if we keep it as is, the number of plates would be automatically
shown as 12. 

----------
From: Anne Carpenter <[email protected]>
Date: Mon, Apr 3, 2017 at 1:44 PM
To: Mohammad Hossein Rohban <[email protected]>
Cc: Shantanu Singh <[email protected]>


If that is how that number arises and it can't be changed except by deleting
data, I think that is ok to leave as is. Now we are aware that # plates = #
plates of data uploaded rather than # of plates tested in the experiment, I
can live with it as is.


@gwaybio
Member Author

gwaybio commented Feb 26, 2020

Great! Thanks for providing this context @shntnu - I'd like to include both raw and illumination corrected images.

I see the illumination correction functions (.mat files), but I will need help applying them.

@hkhawar

hkhawar commented Feb 26, 2020 via email

@gwaybio
Member Author

gwaybio commented Feb 26, 2020

@hkhawar - yes please! I will find a time on your calendar for a quick meeting

@hkhawar

hkhawar commented Feb 26, 2020 via email

@shntnu
Collaborator

shntnu commented Jul 23, 2020

Hamdah reprocessed some illum corrected files that were corrupted and stored them in folders like this

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp/SQ00014610/illum_corrected/

I am now going to copy these to their corresponding original locations e.g. here

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014610/Images/

using this command

origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp


# copy all files (the ones missing in the temppath will fail)
parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}

corrupted_image.csv is available here
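
Before running a bulk copy like the one above, it can help to print the commands first and eyeball them. Here is a minimal Python sketch of such a dry run. It assumes, as the `{1}` and `{2}` placeholders suggest, that the first CSV column is the plate and the second is the file name; the function name `build_copy_commands` is hypothetical and nothing here actually calls AWS.

```python
import csv
import io

# Paths copied from the commands above.
ORIG = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images"
TEMP = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp"


def build_copy_commands(csv_text, temppath=TEMP, origpath=ORIG):
    """Return one `aws s3 cp` command string per CSV row (header row skipped).

    Assumption: column 1 is the plate name, column 2 is the image file name.
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header line
    commands = []
    for row in rows:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        plate, filename = row[0].strip(), row[1].strip()
        src = f"{temppath}/{plate}/illum_corrected/{filename}"
        dst = f"{origpath}/{plate}/Images/{filename}"
        commands.append(f"aws s3 cp {src} {dst}")
    return commands


if __name__ == "__main__":
    # Made-up header names; only the column order matters here.
    example = "plate,file\nSQ00014610,r02c13f02p01-ch1sk1fk1fl1.tiff\n"
    for command in build_copy_commands(example):
        print(command)
```

Printing the commands first makes it easy to spot a swapped column or a wrong folder before anything is overwritten.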

This step revealed that some files were missing in the tmp folder:

parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    "if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
Temp path - SQ00014613/r07c21f05p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014613/r06c04f05p01-ch5sk1fk1fl1.tiff missing
Temp path - SQ00014613/r10c08f05p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r08c19f04p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r02c08f08p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r02c13f02p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r16c19f02p01-ch3sk1fk1fl1.tiff missing
Temp path - SQ00014610/r07c07f03p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014614/r09c07f01p01-ch5sk1fk1fl1.tiff missing

@gwaybio
Member Author

gwaybio commented Jul 23, 2020

thank you Shantanu ❤️ (and Hamdah too for the upfront processing)

@shntnu
Collaborator

shntnu commented Jul 23, 2020

Steps to perform once the missing files listed at the end of #106 (comment) are recreated

  1. Make sure all the files are present
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    "if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
  2. Copy files to the original location; make sure there are no errors
origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

# copy all files (the ones missing in the temppath will fail)
parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}
  3. Download files
parallel \
    mkdir -p illumcorrected_CRISPR_PILOT_B1/images/{1} ::: SQ00014610 SQ00014611 SQ00014612 SQ00014613 SQ00014614 SQ00014615 SQ00014616 SQ00014617 SQ00014618 

origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    aws s3 cp ${origpath}/{1}/Images/{2} illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2}
  4. Install ImageMagick (brew install imagemagick) to do a quick test of fidelity after downloading
parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    identify illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2} 2>&1 | grep "Can not read TIFF"
  5. Check file sizes. Files that are unusually small may be corrupted
aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images |grep tiff > /tmp/image_files.txt

# get file sizes and counts
cat /tmp/image_files.txt |tr -s " "|cut -d" " -f3|sort -n|uniq -c

Once you've confirmed everything works, you can have IDR run step 3 at their end.
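
The size check in the last step can also be sketched in Python, in case the counts need to be processed further. This is only a rough equivalent of the `tr | cut | sort | uniq -c` pipeline: it assumes each listing line looks like `date time size key`, and the function names and the `max_frac` cutoff are hypothetical choices, not something the pipeline above defines.

```python
from collections import Counter


def tally_sizes(listing_lines):
    """Count how often each file size occurs in `aws s3 ls --recursive` output."""
    counts = Counter()
    for line in listing_lines:
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) == 4 and parts[2].isdigit():
            counts[int(parts[2])] += 1
    return counts


def rare_sizes(counts, max_frac=0.001):
    """Return the sizes whose share of all files is below max_frac (assumed cutoff)."""
    total = sum(counts.values())
    return sorted(size for size, n in counts.items() if n / total < max_frac)


if __name__ == "__main__":
    with open("/tmp/image_files.txt") as handle:
        counts = tally_sizes(handle)
    for size, n in sorted(counts.items()):
        print(n, size)
    print("rare sizes:", rare_sizes(counts))
```

Files with a rare size are only candidates for corruption; they still need to be opened or otherwise verified.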

@gwaybio
Member Author

gwaybio commented Jul 23, 2020

> Steps to perform once the missing files listed at the end of #106 (comment) are recreated

For my understanding, is this the complete order of operations?

  1. we first need to reprocess these 9 files in Uploading Image Files to IDR and BBBC #106 (comment)
  2. Make sure they are in the right folders
  3. Then I perform the 5 steps in Uploading Image Files to IDR and BBBC #106 (comment)
  4. Then I confirm the download integrity
  5. Then I give step 3 to IDR

@hkhawar can you help with step 1 above?

Thanks again Shantanu and Hamdah!

@hkhawar

hkhawar commented Jul 23, 2020

@gwaygenomics Do I need to process only the following nine files?
Temp path - SQ00014613/r07c21f05p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014613/r06c04f05p01-ch5sk1fk1fl1.tiff missing
Temp path - SQ00014613/r10c08f05p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r08c19f04p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r02c08f08p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r02c13f02p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r16c19f02p01-ch3sk1fk1fl1.tiff missing
Temp path - SQ00014610/r07c07f03p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014614/r09c07f01p01-ch5sk1fk1fl1.tiff missing

@shntnu
Collaborator

shntnu commented Jul 23, 2020

I am also concerned some of the files that IDR has not listed as corrupted are actually corrupted. E.g. this one

2020-03-08 10:41:41     743346 projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff

I downloaded it like this

aws s3 cp s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff .

identify did not report issues

identify ./r02c08f03p01-ch5sk1fk1fl1.tiff
./r02c08f03p01-ch5sk1fk1fl1.tiff TIFF 2160x2160 2160x2160+0+0 16-bit Grayscale Gray 743346B 0.000u 0:00.000

But I'm not able to open the file using Preview ("It may be damaged or use a file format that Preview doesn’t recognize.")

My suspicion is that all the files with infrequent file sizes are actually corrupted files.
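
A quick arithmetic check is consistent with that suspicion. The sketch below assumes the TIFFs are stored uncompressed, so a complete file has to hold at least width x height x bytes-per-pixel of pixel data; the dimensions and bit depth are taken from the `identify` output above, and the function names are hypothetical.

```python
def pixel_payload_bytes(width, height, bits_per_sample, samples_per_pixel=1):
    """Bytes of raw pixel data in an uncompressed image."""
    return width * height * samples_per_pixel * bits_per_sample // 8


def too_small_for_image(file_size, width, height, bits_per_sample):
    """True if the file cannot contain the full uncompressed pixel data."""
    return file_size < pixel_payload_bytes(width, height, bits_per_sample)


if __name__ == "__main__":
    payload = pixel_payload_bytes(2160, 2160, 16)
    print(payload)                                       # 9331200
    print(too_small_for_image(743346, 2160, 2160, 16))   # True
```

So a 2160x2160 16-bit grayscale file of 743346 bytes cannot hold all of its pixels if it is uncompressed, even though the header still reports the full dimensions.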

Welcome to the rabbit hole! :)

Get the file listing

aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images |grep tiff > /tmp/image_files.txt

Now download the files whose file sizes are infrequent:

library(tidyverse)

sizes <- 
  read_delim("/tmp/image_files.txt", 
             col_names = c("date", "time", "size", "path"), 
             trim_ws = TRUE, 
             delim = " ") %>%
  mutate(download = sprintf("aws s3 cp s3://imaging-platform/%s %s", path, path)) %>%
  mutate(dirpath = dirname(path))

dirpaths <- 
  sizes %>% 
  distinct(dirpath)

dirpaths$dirpath %>% 
  walk(function(dirpath) dir.create(dirpath, showWarnings = FALSE, recursive = TRUE))

frac_sizes <- 
  sizes %>% 
  group_by(size) %>% 
  tally() %>% 
  arrange(desc(size)) %>% 
  mutate(frac = n / sum(n))

frac_sizes %>%
  head() %>%
  knitr::kable()

frac_sizes %>% 
  filter(frac < 0.001) %>%
  select(size) %>%
  inner_join(sizes) %>%
  magrittr::extract2("download") %>%
  walk(function(download) system(download))

I ran that and then did a random sampling of images by trying to open using Preview and found that all in that random sample were corrupted. This is the full list of all files downloaded (below).

@gwaygenomics I gotta run but hopefully, you can take it from here and figure out the next steps. If not, ping me on this and I'll have a look once back from vacation

projects/
└── 2015_07_01_Cell_Health_Vazquez_Cancer_Broad
    └── illumcorrected_CRISPR_PILOT_B1
        └── images
            ├── SQ00014610
            │   └── Images
            │       ├── r01c18f01p01-ch4sk1fk1fl1.tiff
            │       ├── r01c19f08p01-ch5sk1fk1fl1.tiff
            │       ├── r02c07f06p01-ch2sk1fk1fl1.tiff
            │       ├── r02c13f02p01-ch1sk1fk1fl1.tiff
            │       ├── r04c01f01p01-ch5sk1fk1fl1.tiff
            │       ├── r07c07f03p01-ch2sk1fk1fl1.tiff
            │       ├── r10c12f05p01-ch2sk1fk1fl1.tiff
            │       ├── r13c03f08p01-ch1sk1fk1fl1.tiff
            │       ├── r13c09f01p01-ch2sk1fk1fl1.tiff
            │       ├── r16c19f02p01-ch3sk1fk1fl1.tiff
            │       └── r16c20f07p01-ch4sk1fk1fl1.tiff
            ├── SQ00014611
            │   └── Images
            │       ├── r02c18f03p01-ch1sk1fk1fl1.tiff
            │       ├── r06c11f02p01-ch2sk1fk1fl1.tiff
            │       └── r14c08f07p01-ch5sk1fk1fl1.tiff
            ├── SQ00014612
            │   └── Images
            │       ├── r03c08f01p01-ch4sk1fk1fl1.tiff
            │       ├── r06c06f08p01-ch5sk1fk1fl1.tiff
            │       ├── r10c15f07p01-ch1sk1fk1fl1.tiff
            │       ├── r11c05f02p01-ch5sk1fk1fl1.tiff
            │       └── r13c08f06p01-ch4sk1fk1fl1.tiff
            ├── SQ00014613
            │   └── Images
            │       ├── r02c08f08p01-ch1sk1fk1fl1.tiff
            │       ├── r03c15f04p01-ch4sk1fk1fl1.tiff
            │       ├── r07c05f02p01-ch1sk1fk1fl1.tiff
            │       ├── r07c21f05p01-ch2sk1fk1fl1.tiff
            │       ├── r08c19f04p01-ch1sk1fk1fl1.tiff
            │       ├── r10c08f05p01-ch1sk1fk1fl1.tiff
            │       └── r11c18f08p01-ch2sk1fk1fl1.tiff
            ├── SQ00014614
            │   └── Images
            │       ├── r03c04f01p01-ch4sk1fk1fl1.tiff
            │       ├── r03c07f05p01-ch5sk1fk1fl1.tiff
            │       ├── r05c09f08p01-ch1sk1fk1fl1.tiff
            │       ├── r09c07f01p01-ch5sk1fk1fl1.tiff
            │       └── r15c02f03p01-ch4sk1fk1fl1.tiff
            ├── SQ00014615
            │   └── Images
            │       ├── r02c08f03p01-ch5sk1fk1fl1.tiff
            │       ├── r02c14f04p01-ch1sk1fk1fl1.tiff
            │       ├── r08c07f01p01-ch3sk1fk1fl1.tiff
            │       ├── r08c07f07p01-ch5sk1fk1fl1.tiff
            │       ├── r08c14f07p01-ch1sk1fk1fl1.tiff
            │       ├── r09c07f03p01-ch1sk1fk1fl1.tiff
            │       ├── r10c09f08p01-ch5sk1fk1fl1.tiff
            │       ├── r10c18f03p01-ch2sk1fk1fl1.tiff
            │       ├── r13c21f07p01-ch2sk1fk1fl1.tiff
            │       ├── r15c15f08p01-ch4sk1fk1fl1.tiff
            │       └── r16c21f05p01-ch1sk1fk1fl1.tiff
            ├── SQ00014616
            │   └── Images
            │       ├── r01c17f07p01-ch5sk1fk1fl1.tiff
            │       ├── r02c21f01p01-ch1sk1fk1fl1.tiff
            │       ├── r03c19f02p01-ch5sk1fk1fl1.tiff
            │       ├── r07c04f03p01-ch1sk1fk1fl1.tiff
            │       └── r14c17f03p01-ch2sk1fk1fl1.tiff
            ├── SQ00014617
            │   └── Images
            │       ├── r02c23f05p01-ch1sk1fk1fl1.tiff
            │       ├── r03c06f02p01-ch4sk1fk1fl1.tiff
            │       ├── r06c01f08p01-ch4sk1fk1fl1.tiff
            │       ├── r06c16f02p01-ch2sk1fk1fl1.tiff
            │       ├── r08c16f07p01-ch4sk1fk1fl1.tiff
            │       ├── r11c14f07p01-ch2sk1fk1fl1.tiff
            │       ├── r12c04f02p01-ch3sk1fk1fl1.tiff
            │       ├── r12c08f04p01-ch4sk1fk1fl1.tiff
            │       ├── r12c10f04p01-ch5sk1fk1fl1.tiff
            │       ├── r13c09f07p01-ch4sk1fk1fl1.tiff
            │       └── r15c14f04p01-ch2sk1fk1fl1.tiff
            └── SQ00014618
                └── Images
                    ├── r01c14f08p01-ch1sk1fk1fl1.tiff
                    ├── r03c09f07p01-ch1sk1fk1fl1.tiff
                    ├── r03c09f07p01-ch5sk1fk1fl1.tiff
                    ├── r03c12f06p01-ch4sk1fk1fl1.tiff
                    ├── r05c10f08p01-ch5sk1fk1fl1.tiff
                    ├── r06c01f07p01-ch1sk1fk1fl1.tiff
                    ├── r07c09f07p01-ch1sk1fk1fl1.tiff
                    ├── r13c05f04p01-ch5sk1fk1fl1.tiff
                    ├── r14c10f02p01-ch3sk1fk1fl1.tiff
                    └── r16c23f01p01-ch2sk1fk1fl1.tiff

@shntnu
Collaborator

shntnu commented Jul 23, 2020

@gwaygenomics I just saw #106 (comment)

Yes, that's the right order of operations.

But @hkhawar, unfortunately, you will also need to reprocess those files listed at the end of #106 (comment) because my random sampling revealed that those are also corrupted. I have no clue why so many files are getting corrupted but hopefully you will figure that out.

@hkhawar Thanks very much for helping out!

@shntnu
Collaborator

shntnu commented Jul 23, 2020

@hkhawar one more thing – could you please briefly describe the setup you are using to reprocess these images? Are you mounting the S3 bucket on your computer and running it on your computer by any chance? If so, I think that could be the issue because S3 mounts suck with heavy I/O.

@hkhawar

hkhawar commented Jul 23, 2020

@shntnu I ran this experiment on AWS. I am not sure why we have gotten a lot of corrupted images. My guess is that something happened while running DCP and, instead of ending up in the dead message queue as unfinished jobs, they somehow created an image file with 0 bytes

@shntnu
Collaborator

shntnu commented Jul 23, 2020

@hkhawar Thanks for clarifying. Very strange! And note that the issue is that some output files are actually pretty large, e.g. 8 MB, but are still corrupted. Worth checking in with Beth on this via Slack.

@hkhawar

hkhawar commented Jul 23, 2020

@gwaygenomics Could you please do the same thing that you did before, sorting the other channels for these images?

@hkhawar

hkhawar commented Jul 23, 2020

@shntnu Sure I will check with Beth on this tomorrow

@shntnu
Collaborator

shntnu commented Jul 23, 2020

Here's an example: r06c11f02p01-ch2sk1fk1fl1.tiff.zip located at projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014611/Images/r06c11f02p01-ch2sk1fk1fl1.tiff

It doesn't open using Preview:

image

It does open in Fiji, but the bottom pixels are missing

image

@shntnu
Collaborator

shntnu commented Jul 23, 2020

@hkhawar

hkhawar commented Jul 24, 2020

@gwaygenomics could you sort the other channels for the corrupted files for me as you did last time? Then I will reprocess them today.

@gwaybio
Member Author

gwaybio commented Jul 24, 2020

> @gwaygenomics could you sort the other channels for the corrupted files for me as you did last time? Then I will reprocess them today.

Sure - what folder do you want them in? Also, do you think reprocessing them the same way as before is a good idea? (are you going to do anything different?)

@hkhawar

hkhawar commented Jul 24, 2020

I am doing it locally. Just make a tmp2 folder on S3 and dump a new set of images for each plate? Later we can delete these tmp folders from S3

@shntnu
Collaborator

shntnu commented Jul 24, 2020

> I am doing it locally. Just make a tmp2 folder on S3 and dump a new set of images for each plate? Later we can delete these tmp folders from S3

For our notes, could you pen down why they need to be in a new folder (vs creating a loaddata file pointing to the original locations?) Will be useful to know when we need to reprocess small batches

@hkhawar

hkhawar commented Jul 24, 2020

I wanted to avoid using load_data.csv; instead I download the images and run CellProfiler locally to reprocess the files. This is what I typically do for small sets of images.

@bethac07

bethac07 commented Jul 24, 2020

Occasionally, CellProfiler just stochastically seems to do this: any operation, even write or sync, will sometimes stochastically just go ker-flop, and when we're working on 10K/100K/1M/10M images, the likelihood it will happen >=1 times becomes significant. Since each plate has ~21K images, based on the list above, the likelihood is on the order of 1 in the low thousands.

If there's a problem with the source image, obviously that's one thing; if the problem is truly stochastic (aka when you run the same image again the output file comes out fine), there isn't a ton to do (though if these were done <60 days ago it's worth checking the logs for the known bad sites since that's easy while the logs are still in CloudWatch). If we think the file is being written correctly, but not synced correctly, we could always institute a 30 or 60 second pause after the CellProfiler pipeline is done before syncing.

It's worth noting we can very easily handle the ones where files are small (obviously corrupted) using the MIN_FILE_SIZE option I added to DCP by just resubmitting the whole batch with CHECK_IF_DONE set to TRUE and MIN_FILE_SIZE set small- anything with the right number of files > a certain size will just get skipped, and it will re-process just the ones where 1+ file is tiny. If either the uncorrupted OR corrupted files have a stereotyped size, which Shantanu your methodology seems to imply, you could imagine other similar checks we could add; essentially either

if filesize in accepted_file_sizes:
    goodfile_count +=1
if goodfile_count >= N:
    reprocess = False

or

if filesize not in known_bad_file_sizes:
    goodfile_count +=1
if goodfile_count >= N:
    reprocess = False
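
For illustration only, the two checks above could be written as a small runnable function. This is not DCP's actual code: `count_good_files`, `should_reprocess`, and their parameters are hypothetical names, and the sizes in the usage example are the ones reported elsewhere in this thread.

```python
def count_good_files(file_sizes, accepted_sizes=None, known_bad_sizes=None):
    """Count files that pass whichever size checks are supplied."""
    good = 0
    for size in file_sizes:
        if accepted_sizes is not None and size not in accepted_sizes:
            continue  # allow-list check failed
        if known_bad_sizes is not None and size in known_bad_sizes:
            continue  # deny-list check failed
        good += 1
    return good


def should_reprocess(file_sizes, expected_count, accepted_sizes=None, known_bad_sizes=None):
    """Reprocess a job unless at least expected_count files look good."""
    good = count_good_files(file_sizes, accepted_sizes, known_bad_sizes)
    return good < expected_count


if __name__ == "__main__":
    accepted = {9348786, 9348718}
    site_ok = [9348786] * 5
    site_bad = [9348786] * 4 + [743346]
    print(should_reprocess(site_ok, 5, accepted_sizes=accepted))   # False
    print(should_reprocess(site_bad, 5, accepted_sizes=accepted))  # True
```

The allow-list variant is stricter: any size that has not been seen before triggers reprocessing, whereas the deny-list variant only catches sizes already known to be bad.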

@gwaybio
Member Author

gwaybio commented Jul 24, 2020

@hkhawar

the corrupted files are ready to go! located at /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two thanks!

@hkhawar

hkhawar commented Jul 24, 2020

@bethac07 Logs are not available now. My guess is that the problem happened during syncing of files. Reprocessing those images worked fine
@gwaygenomics thanks I am going to work on it

@shntnu
Collaborator

shntnu commented Jul 24, 2020

> It's worth noting we can very easily handle the ones where files are small (obviously corrupted) using the MIN_FILE_SIZE option I added to DCP by just resubmitting the whole batch with CHECK_IF_DONE set to TRUE and MIN_FILE_SIZE set small- anything with the right number of files > a certain size will just get skipped, and it will re-process just the ones where 1+ file is tiny. If either the uncorrupted OR corrupted files have a stereotyped size, which Shantanu your methodology seems to imply, you could imagine other similar checks we could add; essentially either

Thanks for clarifying @bethac07 🥇.

@hkhawar details are below but tl;dr: we could have gone with fixed file size because these are uncompressed TIFFS so I think they should all be the same file size. But there's one aberration (below). So instead let's go with CHECK_IF_DONE=TRUE and MIN_FILE_SIZE = 9348718.

Details

I dug into this a bit for our future reference with this kind of issue.

frac_sizes %>% head() %>% knitr::kable()

From this table, looks like 9348786 is the value to go with. But I don't know what's happening with 9348718 – why are there 1240 instances of that? No clue. Also, files with size 9348718 open fine with Preview.

size n frac
9348786 136926 0.9905378
9348718 1240 0.0089703
9210546 1 0.0000072
8795826 1 0.0000072
8683506 1 0.0000072
8631666 1 0.0000072

All other sizes have only 1-2 occurrences (except 8 which occurs 8 times).

frac_sizes %>% filter(size < 9348718) %>% count(n) %>% knitr::kable()
n nn
1 56
2 2
8 1

9348718 is certainly special because if any one channel of a site has that size, then all five channels of that site have that size

sizes %>% filter(size == 9348718)  %>% mutate(site = basename(path), plate = str_match(dirpath, "SQ[0-9]{8}"))  %>% separate(site, c("site", "channel"), sep = "-") %>% group_by(site, plate) %>% tally() %>% ungroup() %>% arrange(site) %>% count(n)
n nn
5 248
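
As a sanity check, the numbers in these tables can be tied together with a little arithmetic. The sketch below assumes the tables cover every file in the listing (sizes of 9348786, 9348718, or smaller); the variable names are made up for illustration.

```python
# Counts copied from the first table above.
size_counts = {9348786: 136926, 9348718: 1240}

# Second table: occurrences -> number of distinct sizes with that many
# occurrences, for sizes below 9348718.
odd_size_table = {1: 56, 2: 2, 8: 1}

odd_files = sum(n * nn for n, nn in odd_size_table.items())
total_files = sum(size_counts.values()) + odd_files

# Third table: 248 sites with 5 channels each account for every 9348718-byte file.
assert 248 * 5 == size_counts[9348718]

print(f"odd-sized files: {odd_files} of {total_files}")
print(f"roughly 1 in {round(total_files / odd_files)}")
```

This gives 68 odd-sized files out of 138234, or roughly 1 in 2033, and the total also reproduces the 0.9905378 fraction shown for the most common size.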

@hkhawar

hkhawar commented Jul 24, 2020

@gwaygenomics I have reprocessed illum corrected images and they are available in the same folder

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two/

Note: I haven't synced them to s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/

I guess you can do it

@shntnu I have no idea why we get images of size 9348718. Did you try opening an image of this size in Fiji?

@gwaybio
Member Author

gwaybio commented Jul 28, 2020

Progress

  • we first need to reprocess these 9 files in Uploading Image Files to IDR and BBBC #106 (comment)
  • Make sure they are in the right folders
  • Then I perform the 5 steps in Uploading Image Files to IDR and BBBC #106 (comment)
    • Make sure all the files are present
    • Copy files to the original location; make sure there are no errors
    • Download files
    • brew install imagemagick to do a quick test of fidelity after downloading
    • Check file sizes. Files that are unusually small may be corrupted
  • Then I confirm the download integrity
  • Then I give step 3 to IDR

@gwaybio
Member Author

gwaybio commented Jul 28, 2020

Download integrity confirmed! This is the output of the R code in #106 (comment):

size n frac
9348786 136926 0.9905306
9348718 1309 0.0094694

🎉

All that remains is to send IDR the S3 links

@gwaybio
Member Author

gwaybio commented Jul 28, 2020

One potentially interesting observation is that all of the corrupted files that we needed to fix ended up having the smaller file size listed above.

@gwaybio
Member Author

gwaybio commented Aug 25, 2020

next hurdle incoming!

Summary

IDR has all non-illumination corrected images, but they are missing 1,925 illumination corrected images.

Specifics

The folks at IDR are working towards verifying the submission. A couple of points that either @hkhawar or @shntnu might know the answer to right away.

  1. Images with f09 in their name are missing from the illumination corrected set (there are 1920 of these).
  2. There are 5 additional images missing in the illumination corrected set all from plate SQ00014610

Issue 1 - Missing f09

Here are example images:

r16c24f09p01-ch2sk1fk1fl1.tiff
r16c24f09p01-ch3sk1fk1fl1.tiff
r16c24f09p01-ch4sk1fk1fl1.tiff
r16c24f09p01-ch5sk1fk1fl1.tiff

Issue 2 - Five more

r16c24f01p01-ch1sk1fk1fl1.tiff
r16c24f01p01-ch2sk1fk1fl1.tiff
r16c24f01p01-ch3sk1fk1fl1.tiff
r16c24f01p01-ch4sk1fk1fl1.tiff
r16c24f01p01-ch5sk1fk1fl1.tiff
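
To find gaps like these systematically, the file names can be parsed and compared against the expected fields. The sketch below assumes the usual reading of this naming pattern (r = row, c = column, f = field, p = plane, ch = channel) and that fields run from f01 to f09; both are assumptions, and the function names are hypothetical.

```python
import re

# Assumed pattern: r<row>c<col>f<field>p<plane>-ch<channel>sk..fk..fl...tiff
NAME_RE = re.compile(r"r(\d{2})c(\d{2})f(\d{2})p(\d{2})-ch(\d+)sk\d+fk\d+fl\d+\.tiff$")


def parse_name(name):
    """Split an image file name into its numeric parts, or return None."""
    match = NAME_RE.search(name)
    if match is None:
        return None
    row, col, field, plane, channel = (int(group) for group in match.groups())
    return {"row": row, "col": col, "field": field, "plane": plane, "channel": channel}


def missing_fields(names, expected_fields=range(1, 10)):
    """Return the expected field numbers that never appear in names."""
    parsed = (parse_name(name) for name in names)
    seen = {p["field"] for p in parsed if p is not None}
    return sorted(set(expected_fields) - seen)


if __name__ == "__main__":
    print(parse_name("r16c24f09p01-ch2sk1fk1fl1.tiff"))
    print(missing_fields(["r01c01f01p01-ch1sk1fk1fl1.tiff"], expected_fields=[1, 2]))
```

Running `missing_fields` over the file names of one plate would show at a glance whether an entire field, such as f09, is absent.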

@hkhawar

hkhawar commented Aug 25, 2020 via email

@gwaybio
Member Author

gwaybio commented Aug 25, 2020

Argh! Is there something that I can do to ease the pain? Transfer files into a new folder again? It seems like this is an AWS transfer issue?

@hkhawar

hkhawar commented Aug 25, 2020 via email

@gwaybio
Member Author

gwaybio commented Sep 2, 2020

turns out that we actually have 17,285 illum corrected files missing.

1,920 "f09" files missing per plate
9 plates
5 "f01" files missing only in plate SQ00014610
1,920 * 9 + 5 = 17,285
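
The same count can be cross-checked against the plate layout. This sketch assumes full 384-well plates (rows r01 to r16, columns c01 to c24), 9 fields per well (f01 to f09), and 5 channels (ch1 to ch5); those layout numbers are inferred from the file names in this thread, not stated anywhere.

```python
wells = 16 * 24        # assumed 384-well plate
channels = 5           # ch1..ch5
fields = 9             # f01..f09
plates = 9             # SQ00014610..SQ00014618

f09_per_plate = wells * channels            # one missing field per well
missing = f09_per_plate * plates + 5        # plus the 5 "f01" files

# Files already present after the Jul 28 integrity check.
present = 136926 + 1309

expected_total = plates * wells * fields * channels

print(f09_per_plate)     # 1920
print(missing)           # 17285
print(present + missing == expected_total)  # True
```

Under these assumptions the files already present plus the 17,285 missing ones add up to exactly 155,520, the size of a complete illumination corrected set.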

> > Transfer files into a new folder again?
>
> Yup that would be a great help. Let me know once they are done.

I have confirmed that all of these files are now in a separate folder. The folder is /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_three.

Note that the subfolder is tmp_version_three.

@hkhawar all set for the next (and hopefully final!) iteration of the illum correction pipeline. Thanks again

@hkhawar

hkhawar commented Sep 2, 2020 via email

@gwaybio
Member Author

gwaybio commented Sep 14, 2020

I have confirmed that all 17,285 files have been corrected, and that they are all the same size:

size n frac
9348718 17285 1

I will work on uploading them directly to IDR now (they gave us an FTP)

@hkhawar

hkhawar commented Sep 14, 2020

@gwaygenomics great

@gwaybio
Member Author

gwaybio commented Dec 10, 2020

images are now public: https://idr.openmicroscopy.org/webclient/?show=screen-2701
