
Processing data using DeepProfiler #2

Open · shntnu opened this issue Feb 19, 2020 · 42 comments

@shntnu (Collaborator) commented Feb 19, 2020

Juan asked this:

I need to process the LINCS dataset to proceed with the plan we discussed for LUAD. I'm going to need access to the images, which is a lot of data! Here is my plan to make it efficient, and I would like to get your feedback and recommendations:

  • Get all the plates back from Glacier for a few days. If I remember correctly, the entire set is 21TB or so.
  • Use EC2 instances to compress the images using DeepProfiler. I did this in the past and we can get down to 800GB or so.
  • Save the compressed dataset in S3 to work with it during the next couple of months, then send it to Glacier.  In fact, with that size we can probably keep it in the DGX and the GPU-cluster too.

Does this make sense? Do you have any recommendations for me before moving forward?

By the way, the compression should take roughly 4 hours per plate (pessimistic estimate), and can be run in parallel, with multiple plates per machine (one per CPU). So using 30 cheap instances (with 4 cores each) in spot mode should do the trick in one day, including the operator's time :)
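
For a rough sanity check of that schedule, a back-of-envelope sketch in Python (the 136-plate count and the ~$0.07/hour spot price are taken from later comments in this thread, so treat them as assumptions):

# Back-of-envelope throughput/cost estimate for the plan above.
plates = 136                    # assumed plate count (reported later in this thread)
hours_per_plate = 4             # pessimistic estimate from the comment above
instances = 30
cores_per_instance = 4
spot_price_per_hour = 0.07      # assumed m5-class spot price (also from later in this thread)

concurrent_plates = instances * cores_per_instance              # 120 plates in flight at once
waves = -(-plates // concurrent_plates)                         # ceiling division -> 2 waves
wall_clock_hours = waves * hours_per_plate                      # ~8 hours of compute
spot_cost = instances * wall_clock_hours * spot_price_per_hour  # ~$17 of spot capacity
print(wall_clock_hours, round(spot_cost, 2))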

@shntnu (Collaborator, Author) commented Feb 19, 2020

Let's try a small set first, maybe just one plate, to make sure this process works:

  1. Restore the tarball from Glacier, following instructions here.
    And while you are at it, it would be great if you could help with the TODO in that file :D
  2. Download the files to an EBS volume, instructions are above
  3. Upload the uncompressed files back to S3, at imaging-platform instead of imaging-platform-cold. E.g. SQ00015096 would go to s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/SQ00015096, instructions are above. The reason is that going forward, we will no longer be tarballing and archiving, but instead setting lifecycle rules that automatically move any object older than 60 days to the Glacier storage class. So we might as well put these files back at imaging-platform.
  4. Delete the files from EBS (it is expensive to keep large EBS volumes sitting around)
  5. Now run your compressions in EC2 by directly accessing images from S3 (if DeepProfiler can't do that yet, I'd recommend adding that functionality)
  6. Save your images back to imaging-platform e.g. s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/SQ00015096/Images
  7. Add an automatically generated README in s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/SQ00015096/ explaining how this was generated
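
For step 7, a minimal sketch of what generating and uploading such a README might look like, assuming boto3 is available; the README wording and helper names are illustrative, not a required format:

# Hypothetical sketch: write a short README describing how the compressed plate was
# generated and upload it next to the compressed images.
import datetime
import boto3

plate = "SQ00015096"
prefix = ("projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/"
          f"2016_04_01_a549_48hr_batch1_compressed/images/{plate}/")

readme = (
    f"Compressed images for plate {plate}\n"
    f"Generated on {datetime.date.today()} with DeepProfiler compression\n"
    "(illumination correction, rescaling, 8-bit PNG conversion).\n"
    "See the processing script in this repository for the exact parameters.\n"
)

boto3.client("s3").put_object(
    Bucket="imaging-platform", Key=prefix + "README.txt", Body=readme.encode()
)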

@shntnu (Collaborator, Author) commented Feb 19, 2020

@jccaicedo

@jccaicedo commented

Sounds great! Thanks for these details and suggestions. Will report the results of compressing one plate soon.

@shntnu (Collaborator, Author) commented Feb 19, 2020

@jccaicedo I changed some paths, please note:

s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/SQ00015096/

and not

s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/SQ00015096_compressed/

@shntnu (this comment has been minimized)

@jccaicedo commented

@shntnu here is an update from following the plan above, with a couple of questions:

Steps 1 to 4 were run by @MarziehHaghighi. She used EFS and then deleted the unarchived files from Glacier. So no EBS volumes were needed in the process and no extra storage is being used after retrieving plates. The original images are back in their original locations.

Step 5: I ran DeepProfiler on an EC2 instance (m5d.xlarge, 4 CPUs and 16 GB of RAM). I have a script that configures the environment and runs everything given the plate name. Question: should I put this script in this repository?

Step 6: This is easy and will be part of the script mentioned in the previous step. I noted the new path that you suggest for this.

Step 7: Question: what exactly should this file contain? Is the script that I mentioned above enough?

The process is currently running so I will update this thread again with the relevant statistics of the resulting compression.

@jccaicedo commented

Plate compression is done. Here are the stats:

  • Copying data from S3 to local machine for compression:
    15 minutes, 162 GB, 17,280 individual TIFF files.
  • Running the compression methods:
    8.5 hours computing and applying illumination correction, rescaling and PNG compression.
  • Uploading compressed data to S3:
    1 minute, 6.8 GB, 17,280 individual PNG files.
  • Compression rates:
    Up to 95% space savings.
  • Computing cost:
    Approx. $0.70 per plate, using an m5n.xlarge spot instance ($0.07/hour x 10 hours).

Note that this instance can potentially process 4 plates at a time because it has 4 cores. However, I/O may slow down overall performance. I think it's safe to assume that we can compress two plates for approximately $1.

The data has been uploaded here: s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/SQ00015096/
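
A minimal sketch of the one-plate-per-core idea, assuming a hypothetical compress_plate() helper that wraps the per-plate DeepProfiler run (the extra plate names are purely illustrative):

# One worker process per core, one plate per worker.
from multiprocessing import Pool

def compress_plate(plate):
    # placeholder for: copy plate from S3, run DeepProfiler compression, upload the PNGs
    print(f"compressing {plate}")

if __name__ == "__main__":
    plates = ["SQ00015096", "SQ00015097", "SQ00015098", "SQ00015099"]  # illustrative names
    with Pool(processes=4) as pool:   # 4 cores on an m5n.xlarge
        pool.map(compress_plate, plates)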

@shntnu (Collaborator, Author) commented Apr 22, 2020

Steps 1 to 4 were run by @MarziehHaghighi. She used EFS and then deleted the unarchived files from Glacier. So no EBS volumes were needed in the process and no extra storage is being used after retrieving plates. The original images are back in their original locations.

Nice! When doing the whole dataset, please keep these numbers in mind:

  • The image dataset is ~22TB.
  • EFS is $0.30/GB-month, and you pay for what you use (unlike EBS, where you pay for the capacity you provision)
  • 22,528 GB per month x $0.30 = $6,758.40/month, or ~$225/day
  • S3 Standard is $0.023/GB-month

So you'd want to batch it appropriately so that storage costs don't pile up during the period where the unarchived images have been downloaded and are being uploaded back to S3.
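
The same storage arithmetic, spelled out with the prices quoted above:

# Monthly/daily storage cost of keeping the full uncompressed dataset around.
dataset_gb = 22_528       # ~22 TB
efs_rate = 0.30           # $/GB-month
s3_rate = 0.023           # $/GB-month, S3 Standard

efs_monthly = dataset_gb * efs_rate     # $6,758.40 per month
efs_daily = efs_monthly / 30            # ~$225 per day
s3_monthly = dataset_gb * s3_rate       # ~$518 per month
print(efs_monthly, round(efs_daily), round(s3_monthly))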

Step 5: I ran DeepProfiler on an EC2 instance (m5d.xlarge, 4 CPUs and 16 GB of RAM). I have a script that configures the environment and runs everything given the plate name. Question: should I put this script in this repository?

Yes please make a PR

Step 7: Question: what exactly should this file contain? Is the script that I mentioned above enough?

A link to the URL of the file on GitHub would be great

@shntnu (Collaborator, Author) commented Apr 23, 2020

  • Copying data from S3 to local machine for compression:
    15 minutes, 162 GB, 17,280 individual TIFF files.

Sounds good. I assume it is not easy to modify so that you can read directly from S3 via HTTP? skimage.io.imread can do it. But if it's not trivial, feel free to stick with this approach.
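
In case it helps, a minimal sketch of reading a single image straight from S3 without staging it on disk, using a presigned URL and skimage.io.imread (the file name at the end of the key is illustrative):

# Hypothetical sketch: read one TIFF directly from a private S3 bucket over HTTPS.
import boto3
import skimage.io

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={
        "Bucket": "imaging-platform",
        "Key": "projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/"
               "2016_04_01_a549_48hr_batch1/images/SQ00015096/example.tiff",  # illustrative file name
    },
    ExpiresIn=3600,
)
image = skimage.io.imread(url)   # skimage.io.imread accepts URLs
print(image.shape, image.dtype)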

  • Running the compression methods:
    8.5 hours computing and applying illumination correction, rescaling and PNG compression.

Nice

  • Uploading compressed data to S3:
    1 minute, 6.8 GB, 17,280 individual PNG files.

8-bit PNG?

  • Compression rates:
    Up to 95% space savings.

Wow!

  • Computing cost:
    Approx. $0.70 per plate, using an m5n.xlarge spot instance ($0.07/hour x 10 hours).

That's great! Do keep in mind that the unarchived, uncompressed TIFFs sitting on EFS (between Step 2 and Step 3) may drive up the cost depending on how you do it, so as long as you have a reasonable plan for that, we are all set.

@shntnu (Collaborator, Author) commented May 7, 2020

@shntnu will check whether some of these images are still available on /cmap/imaging/

@MarziehHaghighi commented

Here is the plan after discussing with Shantanu: unless we find the data on /cmap/imaging, @MarziehHaghighi will unarchive the images and place them on S3, downloading the restored data to EFS in 10-plate chunks to avoid the costs of keeping huge amounts of data on EFS. I will start tomorrow and should be done in two weeks.

@shntnu (Collaborator, Author) commented May 10, 2020

@MarziehHaghighi I found several plates, possibly all. I've started uploading from /cmap/imaging/. So no need to unarchive etc. Please ping me on this thread in a few days to get status.

@MarziehHaghighi commented

@shntnu Pinging you here, with two questions regarding the data you are transferring:

  1. Do we have CP profiles on /cmap/imaging/ as well, or just images? I think we will need the single-cell profiles for at least the cell center information. @jccaicedo, can you confirm that? I also personally need cell outlines to create cell masks.
  2. Where are the uploaded images located on the DGX?

@shntnu (Collaborator, Author) commented May 18, 2020

  • Do we have CP profiles on /cmap/imaging/ as well, or just images? I think we will need the single-cell profiles for at least the cell center information. @jccaicedo, can you confirm that? I also personally need cell outlines to create cell masks.

Only images, but @gwaygenomics knows where to find the SQLite files (Level 2) because he just finished creating Level 3 from that.

  • Where are the uploaded images located on the DGX?

I thought the plan was to transfer them to S3, and had started that, but it looks like the process stopped and only 7 plates were transferred (120,060 TIFF files).

But I now realize you wanted them transferred to DGX-1.

The good news is that I've verified that all the images you need are indeed available on /cmap/imaging.

After learning more about the DeepProfiler configuration today, I think it will be relatively easy to set it up on the Broad cluster and prepare the images right there, avoiding the need to transfer the 22TB out of /cmap/imaging. IIUC @jccaicedo is able to reduce it to ~1TB with 8-bit PNG and 2x downsampling.

Have a look at this https://broad.service-now.com/sp?id=kb_article&sys_id=63354be4137f0b00449fb86f3244b049 to get started with UGER.

Other links I found on Confluence

If you'd rather not figure that out, you can transfer the images to DGX-1 by connecting to a Broad login node and then scp-ing from /cmap/imaging to DGX-1.

@gwaybio (Member) commented May 18, 2020

Only images, but @gwaygenomics knows where to find the SQLite files (Level 2) because he just finished creating Level 3 from that.

👉 https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profiling_pipeline.py#L16-L20

@MarziehHaghighi commented

@shntnu I assume I can't use the compressed images for creating new CP profiles (with mito features) and need to run CellProfiler on the original images. Also, for generating single-cell images, my initial experiments show that results based on uncompressed images are more promising.

@shntnu (Collaborator, Author) commented May 18, 2020

@shntnu I assume I can't use the compressed images for creating new CP profiles (with mito features) and need to run CellProfiler on the original images. Also, for generating single-cell images, my initial experiments show that results based on uncompressed images are more promising.

Ah, so you need the images for two different tasks:

  1. Extract new sets of features using CellProfiler https://github.com/broadinstitute/2016_08_01_RadialMitochondriaDistribution_donna/issues/3#issuecomment-533186206 https://github.com/broadinstitute/2016_08_01_RadialMitochondriaDistribution_donna/pull/4 using this pipeline.
  2. Extract deep learning features using DeepProfiler, which is what we want to do in this repo.

Is that correct?

If so, then I agree you need uncompressed images for 1., and it sounds like you need them for 2. as well. You have instructions for doing so here. Does that address what you need to do next?

@jccaicedo commented

After learning more about the DeepProfiler configuration today, I think it will be relatively easy to set it up on the Broad cluster and prepare the images right there, avoiding the need to transfer the 22TB out of /cmap/imaging. IIUC @jccaicedo is able to reduce it to ~1TB with 8-bit PNG and 2x downsampling.

Yes, this is correct. We can reduce the size to less than 1TB with the compression implemented in DeepProfiler.

If you'd rather not figure that out, you can transfer the images to DGX-1 by connecting to a Broad login node and then scp-ing from /cmap/imaging to DGX-1.

Remember that the DGX-1 is invisible to the rest of the Broad network. We can't copy data from/to UGER, unless there is a sort of tunnel that goes through the web.

@MarziehHaghighi commented

@shntnu Considering Juan's comment about transferring to the DGX from /cmap/imaging, do you have any objection to transferring all the plates from /cmap/imaging to S3 and then from S3 to the DGX?

@shntnu (Collaborator, Author) commented May 26, 2020

Remember that the DGX-1 is invisible to the rest of the Broad network. We can't copy data from/to UGER, unless there is a sort of tunnel that goes through the web.

Pretty sure we can still do this https://github.com/broadinstitute/2016_08_01_RadialMitochondriaDistribution_donna/blob/issues/3/process-repurposing-data/1.download-image-files.sh#L4-L6 because we did that before, i.e. log in to a Broad login node, then rsync data from /cmap/imaging to DGX-1. You'd need SSH keys for the DGX-1 set up on the Broad login node, of course. Does that make sense?

Considering Juan's comment about transferring to the DGX from /cmap/imaging, do you have any objection to transferring all the plates from /cmap/imaging to S3 and then from S3 to the DGX?

Pretty sure that what I have suggested above should work. Can you test that out?

@MarziehHaghighi commented May 26, 2020

Pretty sure that what I have suggested above should work. Can you test that out?

Yes, we have done that before :D It seems that my memory restarts every night! Sorry, Shantanu.
@jccaicedo I started the transfer and will update you as soon as it is finished.
Data address on the DGX:
/dgx1nas1/cellpainting-datasets/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1

@MarziehHaghighi commented Jun 5, 2020

@shntnu and @jccaicedo 136 of 140 plates are transferred now and the NAS is full. @jccaicedo, should we apply compression to these images now, then delete a couple of plates and move the rest?

@shntnu (Collaborator, Author) commented Jun 5, 2020 via email

@MarziehHaghighi commented

@shntnu where can I find the names of the 136 plates? There are 140 plates on /cmap/imaging and also in an old metadata file I have.

@shntnu (Collaborator, Author) commented Jun 5, 2020 via email

@MarziehHaghighi commented Jun 5, 2020

@shntnu (Collaborator, Author) commented Jun 5, 2020 via email

@MarziehHaghighi commented

Great! I removed the plates mentioned by Shantanu in the previous comment and transferred the remaining 4 plates. The NAS is full now :)
@jccaicedo, the data is ready!

@jccaicedo commented

The first batch of 65 plates has been compressed and is available at the following path on the DGX server:
/dgx1nas1/cellpainting-datasets/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/DP-project/outputs/compressed/

The total space used so far is 450 GB.

@jccaicedo commented

The second batch of plates has been successfully compressed and added to the same directory above. This results in a total of 136 plates available in compressed format. The total space used for storing the images is 891 GB, which makes the dataset easy to handle on a single server for machine learning research. I will make a backup in S3 as well.

Thanks @MarziehHaghighi and @shntnu for your help with this process.

@shntnu (Collaborator, Author) commented Dec 2, 2020

@jccaicedo Super! Can you point to the parameters used for compression?

@jccaicedo commented Dec 4, 2020

The compression procedure for this dataset included the following steps:

  1. Compute one illumination correction function for each channel-plate. The illumination correction function is computed at 25% of the width/height of the original images.
  2. Apply the illumination correction function to images before any of the following compression steps
  3. Resize the images to half the width and height, effectively reducing the total number of pixels to 25% of the original (4X compression). Images in this dataset were 2160x2160, resulting in 1080x1080. This makes sense for this specific dataset because it did not have binning, resulting in double the resolution w.r.t other datasets acquired at the same magnification.
  4. Stretch the histogram of intensities by removing pixels that are too dark or too bright (below 0.05 and above 99.95 percentiles). This expands the bin ranges when changing pixel depth and prevents having dark images as a result of saturated pixels.
  5. Change the pixel depth from 16 bits to 8 bits. This results in 2X compression.
  6. Save the resulting image in PNG lossless format. This results in approximately 3X compression.

The total compression factor is (4X)(2X)(3X) = 24X, resulting in 21TB being compressed into 891GB (about a 96% compression rate). Do we lose any relevant information? Our experiments in other, smaller Cell Painting datasets have shown zero loss of precision in downstream analysis applications (MoA or pathway connections), indicating no harm to the biologically meaningful data. We are just re-encoding redundant imaging information in a more efficient way.

DeepProfiler has built-in functionality to compress high-throughput images using these strategies, in parallel.
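
To make steps 3–6 concrete, a rough single-image illustration using scikit-image; this is not DeepProfiler's implementation, and the illumination correction of steps 1–2 is omitted:

# Rough illustration of steps 3-6 on a single 16-bit image (not DeepProfiler's code).
import numpy as np
import skimage.io
import skimage.transform

img = skimage.io.imread("example_16bit.tiff")        # e.g. 2160x2160, uint16

half = skimage.transform.resize(                     # step 3: 2x downsampling
    img, (img.shape[0] // 2, img.shape[1] // 2),
    anti_aliasing=True, preserve_range=True)

lo, hi = np.percentile(half, (0.05, 99.95))          # step 4: clip intensity tails
clipped = np.clip(half, lo, hi)

scaled = (clipped - lo) / (hi - lo)                  # step 5: map 16-bit range to 8-bit
img8 = (scaled * 255).astype(np.uint8)

skimage.io.imsave("example_8bit.png", img8)          # step 6: lossless PNG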

@gwaybio (Member) commented Dec 4, 2020

This is great 💯 @jccaicedo - I do not doubt that compression results in no biological information loss according to those metrics, but do you know of a citation to support this? (I am curious)

@shntnu (Collaborator, Author) commented Dec 4, 2020

@jccaicedo Thanks for documenting this!

IIUC, you rescale intensities after step 3, correct?

https://github.com/cytomining/DeepProfiler/blob/8be5994805df92479d030c19f186350aa31a0173/deepprofiler/dataset/compression.py#L90

@jccaicedo commented

@gwaygenomics I don't know of references that report results like this. However, we have data from DeepProfiler experiments in other datasets that I can share if you are interested. We hope to make those results public soon as well.

@shntnu Line 90 corresponds to step 4 (rescale intensities from 16 bits to 8 bits). We rescale images to half the resolution in step 3, according to line 82 in the code: https://github.com/cytomining/DeepProfiler/blob/master/deepprofiler/dataset/compression.py#L82

We set the scaling_factor=1.0 for most projects. For this dataset we used scaling_factor=0.5, which reduces the image to half the resolution: https://github.com/cytomining/DeepProfiler/blob/master/deepprofiler/dataset/compression.py#L62

@shntnu (Collaborator, Author) commented Dec 5, 2020

Line 90 corresponds to step 4 (rescale intensities from 16 bits to 8 bits).

Got it. I think I meant to say that before mapping 16-bit to 8-bit, you clip the intensities to lie between 5th and 99.5th percentiles.

@jccaicedo commented Dec 6, 2020

You're totally right! I forgot to document that step in the list. Just edited the procedure to include this.
Note that it's not the 5th percentile, that'd be too much. It's just the 0.05th percentile, as well as the 99.95th percentile, for a total of 0.1% of the illumination values clipped (that is 1,000 pixels adjusted per megapixel).

@shntnu (Collaborator, Author) commented Mar 9, 2021

Marzieh or Juan had started unarchiving #2 (comment)

There are 12 plates still on S3 from that aborted exercise (because the images were available on /cmap/imaging).

I am cleaning up. I will delete those images from s3://imaging-platform/ (they still live on s3://imaging-platform-cold)

aws s3 rm \
  --recursive \
  --quiet \
  s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/

@shntnu (Collaborator, Author) commented Mar 9, 2021

@gwaygenomics I'll post here instead of #54 to avoid cluttering that thread.

The 136 plates that lived on s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/ have been archived but they are available on /cmap/imaging

See #2 (comment) for a thread that explains how to fetch those images into, say, DGX-1.

But we needn't do that. We can instead ask BITS for help to transfer them from /cmap/imaging (on the Broad servers) to wherever IDR wants them.

Note that 2016_04_01_a549_48hr_batch1 is ~22TB (same as LCKP)

@gwaybio (Member) commented Mar 15, 2021

Thanks @shntnu

Note that 2016_04_01_a549_48hr_batch1 is ~22TB (same as LCKP)

Awesome, I updated #54 (comment) - do you know if, roughly, there is the same number of TIFFs? There are ~2.7 million in batch 2.

We can instead ask BITS for help to transfer them from /cmap/imaging

I love this idea. Who is the point person?

@shntnu (Collaborator, Author) commented Mar 15, 2021

Awesome, I updated #54 (comment) - do you know if, roughly, there is the same number of TIFFs? There are ~2.7 million in batch 2.

There are 136 plates in Pilot 1, and 135 plates in Pilot 2, so there should be nearly (135/136) as many TIFFs. Oh, but wait: we didn't have brightfield in Pilot 1 but do in Pilot 2, so there should be 6/5 × 135/136 ≈ 1.19, i.e. about 19% more TIFFs in Pilot 2.
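
A rough cross-check, using the per-plate TIFF count from the compression stats earlier in this thread:

tiffs_per_plate_5ch = 17_280                        # per-plate count reported above (5 channels)
pilot1_tiffs = 136 * tiffs_per_plate_5ch            # 2,350,080 TIFFs in Pilot 1
pilot2_tiffs = 135 * tiffs_per_plate_5ch * 6 // 5   # 6 channels with brightfield: 2,799,360
print(pilot1_tiffs, pilot2_tiffs)

The Pilot 2 estimate is in the same ballpark as the ~2.7 million TIFFs reported above for batch 2.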

I love this idea. Who is the point person?

Once you know the destination, you can just email help@broad and someone from scientific computing should be able to help.

@shntnu (Collaborator, Author) commented Feb 24, 2023

The compression procedure for this dataset included the following steps:

Snippet from https://www.biorxiv.org/content/10.1101/2022.08.12.503783v2.full.pdf

3.1 Image preprocessing

The original Cell Painting images in all the datasets used in this work are encoded and stored in 16-bit TIFF format. To facilitate image loading from disk to memory during training of deep learning models, we used image compression. This is only required for training, which requires repeated randomized loading of images for minibatch-based optimization.

The compression procedure is as follows:

  • Compute one illumination correction function for each channel-plate [35]. The illumination correction function is computed at 25% of the width/height of the original images.
  • Apply the illumination correction function to images before any of the following compression steps.
  • Stretch the histogram of intensities of each image by removing pixels that are too dark or too bright (below 0.05 and above 99.95 percentiles). This expands the bin ranges when changing pixel depth and prevents having dark images as a result of saturated pixels.
  • Change the pixel depth from 16 bits to 8 bits. This results in 2X compression.
  • Save the resulting image in PNG lossless format. This results in approximately 3X compression.
  • The resulting compression factor is approximately (2X)(3X) = 6X.

This preprocessing pipeline is implemented in DeepProfiler and can be run with a metadata file that lists the images in the dataset that require compression, together with plate, well and site (image or field of view) information.
