Processing data using DeepProfiler #2
Let's try a small set first, maybe just 1 plate, to make sure this process works.
|
Sounds great! Thanks for these details and suggestions. Will report the results of compressing one plate soon. |
@jccaicedo I changed some paths, please note: s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/SQ00015096/ and not s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/SQ00015096_compressed/ |
@shntnu here is an update from following the plan above, with a couple of questions:
Steps 1 to 4 were run by @MarziehHaghighi. She used EFS and then deleted the files unarchived from Glacier, so no EBS volumes were needed in the process and no extra storage is being used after retrieving the plates. The original images are back in their original locations.
Step 5: I ran DeepProfiler on an EC2 instance (m5d.xlarge, 4 CPUs and 16 GB of RAM). I have a script that configures the environment and runs everything given the plate name. Question: should I put this script in this repository?
Step 6: This is easy and will be part of the script mentioned in the previous step. I noted the new path that you suggested for this.
Step 7: Question: what exactly should this file contain? Is the script I mentioned above enough?
The process is currently running, so I will update this thread again with the relevant statistics of the resulting compression. |
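For reference, a minimal sketch of what such a per-plate driver script could look like. All paths, the project layout, and the DeepProfiler invocation below are assumptions for illustration; the actual CLI subcommand and flags may differ between DeepProfiler versions, so check its README before use.

```bash
#!/bin/bash
# Hypothetical per-plate driver. Usage: ./compress_plate.sh SQ00015096
# The project layout (inputs/images, outputs/compressed) and the DeepProfiler
# command are assumptions; verify against the DeepProfiler documentation.
set -euo pipefail

PLATE="$1"
S3_PROJECT="s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad"
ROOT="/scratch/${PLATE}/project"

# 1. Fetch the original 16-bit TIFFs for this plate
mkdir -p "${ROOT}/inputs/images/${PLATE}"
aws s3 cp --recursive --quiet \
  "${S3_PROJECT}/2016_04_01_a549_48hr_batch1/images/${PLATE}/" \
  "${ROOT}/inputs/images/${PLATE}/"

# 2. Compress with DeepProfiler (exact subcommand/flags may differ by version)
python3 -m deepprofiler --root="${ROOT}" --config=compression.json prepare

# 3. Upload the compressed plate to the path noted earlier in this thread
aws s3 cp --recursive --quiet \
  "${ROOT}/outputs/compressed/${PLATE}/" \
  "${S3_PROJECT}/2016_04_01_a549_48hr_batch1_compressed/images/${PLATE}/"
```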
Plate compression is done. Here are the stats:
Note that this instance can potentially process 4 plates at a time because it has 4 cores. However, I/O may slow down overall performance. I think it's safe to assume that we can compress two plates for approximately $1. The data has been uploaded here: |
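If it helps, one simple way to keep all four cores busy, assuming a per-plate driver like the hypothetical compress_plate.sh sketched above and a plain-text plates.txt with one plate barcode per line:

```bash
# Run up to 4 plates concurrently, one per core; I/O contention may still be the bottleneck.
xargs -P 4 -n 1 ./compress_plate.sh < plates.txt
```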
Nice! When doing the whole dataset, please keep these numbers in mind:
So you'd want to batch it appropriately so that storage costs don't pile up during the period where the unarchived images have been downloaded and are being uploaded back to S3.
Yes please make a PR
A link to the URL of the file on GitHub would be great |
Sounds good. I assume it is not easy to modify so that you can read directly from s3 via http?
Nice
8-bit PNG?
Wow!
That's great! Do keep in mind that the unarchived, uncompressed TIFFs sitting on EFS (between Steps 2 and 3) may drive up the cost depending on how you do it, so as long as you have a reasonable plan for that, we are all set. |
@shntnu will check whether some of these images are still available on |
Here is the plan after discussing with Shantanu: unless we find the data on /cmap/imaging, for unarchiving and placing the images on S3, @MarziehHaghighi will download the restored data to EFS in 10-plate chunks, to avoid the cost of keeping a huge amount of data on EFS. I will start tomorrow and will be done in two weeks. |
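A rough sketch of that 10-plate chunking, assuming the objects have already been restored from Glacier; the plate list, EFS mount point, and cleanup step are placeholders:

```bash
# Download restored plates to EFS ten at a time, process them, then free the space.
BATCH="s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images"
EFS="/mnt/efs/images"

split -l 10 all_plates.txt chunk_          # 10 plate barcodes per chunk file
for chunk in chunk_*; do
  while read -r plate; do
    aws s3 cp --recursive --quiet "${BATCH}/${plate}/" "${EFS}/${plate}/"
  done < "${chunk}"
  # ... compress and re-upload this chunk here ...
  rm -rf "${EFS:?}"/*                      # free EFS before the next chunk
done
```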
@MarziehHaghighi I found several plates, possibly all. I've started uploading from |
@shntnu Pinging you here, with two questions regarding the data you are transferring:
|
Only images, but @gwaygenomics knows where to find the SQLite files (Level 2) because he just finished creating Level 3 from that.
I thought the plan was to transfer them to S3, and had started that, but it looks like the process stopped and only 7 plates were transferred (120,060 TIFF files). But I now realize you wanted them transferred to DGX-1. The good news is that I've verified that all the images you need are indeed available on
After learning more about the DeepProfiler configuration today, I think it will be relatively easy to set it up on the Broad cluster and
Have a look at https://broad.service-now.com/sp?id=kb_article&sys_id=63354be4137f0b00449fb86f3244b049 to get started with UGER. Other links I found on Confluence:
If you'd rather not figure that out, you can transfer the images to DGX-1 by connecting to a Broad login node and then
|
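A transfer of that kind would typically look something like the sketch below; the hostnames and paths are placeholders, not the actual locations used here:

```bash
# From a Broad login node, push a plate batch's images to the DGX-1 (illustrative host/paths).
rsync -avP /path/on/cluster/2016_04_01_a549_48hr_batch1/images/ \
  username@dgx1:/raid/data/2016_04_01_a549_48hr_batch1/images/
```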
@shntnu I assume I can't use compressed images for creating new CP profiles (with mito features) and need to run CellProfiler on the original images. Also, for generating single-cell images, my initial experiments show that results based on uncompressed images are more promising. |
Ah so you need the images for two different tasks
Is that correct? If so, then I agree you need uncompressed images for 1., and it sounds like you need them for 2. as well. You have instructions for doing so here. Does that address what you need to do next? |
Yes, this is correct. We can reduce the size to less than 1TB with the compression implemented in DeepProfiler.
Remember that the DGX-1 is invisible to the rest of the Broad network. We can't copy data from/to UGER, unless there is a sort of tunnel that goes through the web. |
@shntnu Considering Juan's comment about transferring to DGX from /cmap/imaging, do you have any objection to transferring all the plates from /cmap/imaging to S3 and then from S3 to DGX? |
Pretty sure we can still do this https://github.com/broadinstitute/2016_08_01_RadialMitochondriaDistribution_donna/blob/issues/3/process-repurposing-data/1.download-image-files.sh#L4-L6
Pretty sure that what I have suggested above should work. Can you test that out? |
Yes we have done that before :D It seems that my memory restarts every night! Sorry Shantanu. |
@shntnu and @jccaicedo 136 of 140 plates are transferred now and the NAS is full. @jccaicedo Should we apply compression to these images now, then delete a couple of plates and move the rest? |
The dataset has only 136 plates; not sure what the extra ones are
|
@shntnu where can I find the names of the 136 plates? There are 140 plates on /cmap/imaging and also in an old metadata file I have. |
https://github.com/broadinstitute/lincs-cell-painting/blob/master/metadata/platemaps/2016_04_01_a549_48hr_batch1/barcode_platemap.csv
This one also has 140 plates.
|
These are missing
https://github.com/broadinstitute/lincs-cell-painting/blob/460f69052934f27a56e1e3de3b7bab62dea190cd/metadata/clue_manifest/clue_data_library.Rmd#L77-L82
|
Great! I removed the plates mentioned by Shantanu in the previous comment and transferred the remaining 4 plates. The NAS is full now :) |
The first batch of 65 plates has been compressed and is available in the following path on the DGX server: The total space used so far is 450 GB. |
The second batch of plates has been successfully compressed and added to the same directory above. This brings the total to 136 plates available in compressed format. The total amount of space used to store the images is 891 GB, which makes the dataset easier to handle on a single server for machine learning research. I will make a backup on S3 as well. Thanks @MarziehHaghighi and @shntnu for your help with this process. |
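For the S3 backup, something along these lines should work; the DGX source directory is a placeholder, and the destination follows the compressed-images path convention noted earlier in this thread:

```bash
# Sync the compressed plates from the DGX to S3; only new/changed files are uploaded.
aws s3 sync /raid/data/compressed/images/ \
  s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/
```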
@jccaicedo Super! Can you point to the parameters used for compression? |
The compression procedure for this dataset included the following steps:
The total compression factor is
DeepProfiler has built-in functionality to compress high-throughput images using these strategies, in parallel. |
This is great 💯 @jccaicedo - I do not doubt that compression results in negligible biological information loss according to those metrics, but do you know of a citation to support this? (I am curious) |
@jccaicedo Thanks for documenting this! IIUC, you rescale intensities after step 3, correct? |
@gwaygenomics I don't know of references that report results like this. However, we have data from DeepProfiler experiments on other datasets that I can share if you are interested. We hope to make those results public soon as well.
@shntnu Line 90 corresponds to step 4 (rescale intensities from 16 bits to 8 bits). We rescale images to half the resolution in step 3, according to line 82 in the code: https://github.com/cytomining/DeepProfiler/blob/master/deepprofiler/dataset/compression.py#L82 We set the |
Got it. I think I meant to say that before mapping 16-bit to 8-bit, you clip the intensities to lie between 5th and 99.5th percentiles. |
You're totally right! I forgot to document that step in the list. Just edited the procedure to include this. |
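To make the procedure concrete, here is an illustrative approximation of those steps using ImageMagick. This is not DeepProfiler's implementation (which lives in deepprofiler/dataset/compression.py); the percentile values follow the discussion above and the file layout is hypothetical.

```bash
# Halve the resolution, clip the darkest 5% / brightest 0.5% of pixels (≈ 5th and
# 99.5th intensity percentiles), stretch the remaining range, and store as 8-bit PNG.
mkdir -p compressed
for tif in original/*.tiff; do
  convert "$tif" -resize 50% -contrast-stretch 5x0.5% -depth 8 \
    "compressed/$(basename "${tif%.tiff}").png"
done
```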
Marzieh or Juan had started unarchiving #2 (comment). There are 12 plates still on S3 from that aborted exercise (because the images were available on
I am cleaning up. I will delete those images with:
```bash
aws s3 rm \
  --recursive \
  --quiet \
  s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
```
|
@gwaygenomics I'll post here instead of #54 to avoid cluttering that thread.
The 136 plates that lived on
See #2 (comment) for a thread that explains how to fetch those images into, say, DGX-1. But we needn't do that. We can instead ask BITS for help to transfer them from
Note that |
Thanks @shntnu
Awesome, I updated #54 (comment) - do you know if there are roughly the same number of tiffs? There are ~2.7 million in batch 2.
I love this idea. Who is the point person? |
There are 136 plates in Pilot 1, and 135 plates in Pilot 2, so there should be nearly (135/136) as many tiffs. Oh, but wait: we didn't have brightfield in Pilot 1 but do in Pilot 2, so there should be 6/5 × 135/136 ≈ 1.19, i.e. about 19% more tiffs in Pilot 2.
Once you know the destination, you can just email help@broad and someone from scientific computing should be able to help. |
Snippet from https://www.biorxiv.org/content/10.1101/2022.08.12.503783v2.full.pdf:
3.1 Image preprocessing
The original Cell Painting images in all the datasets used in this work are encoded and stored in 16-bit TIFF format. To facilitate image loading from disk to memory during training of deep learning models, we used image compression. This is only required for training, which requires repeated randomized loading of images for minibatch-based optimization. The compression procedure is as follows:
This preprocessing pipeline is implemented in DeepProfiler and can be run with a metadata file that lists the images in the dataset that require compression, together with plate, well and site (image or field of view) information. |
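For a concrete (but hypothetical) picture of what that looks like, here is a sketch of such an index file plus the compression run. The column names, file naming, project layout, and CLI flags below are assumptions from memory rather than the project's documented format; check the DeepProfiler docs before copying.

```bash
# Hypothetical index file: one row per imaging site, one column per channel path.
mkdir -p project/inputs/metadata
cat > project/inputs/metadata/index.csv <<'EOF'
Metadata_Plate,Metadata_Well,Metadata_Site,DNA,RNA,ER,AGP,Mito
SQ00015096,A01,1,SQ00015096/r01c01f01-ch1.tiff,SQ00015096/r01c01f01-ch3.tiff,SQ00015096/r01c01f01-ch2.tiff,SQ00015096/r01c01f01-ch4.tiff,SQ00015096/r01c01f01-ch5.tiff
EOF

# Run the compression/preparation step (subcommand and flag syntax may differ by version).
python3 -m deepprofiler --root=project --config=compression.json prepare
```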
Juan asked this:
I need to process the LINCS dataset to proceed with the plan we discussed for LUAD. I'm going to need access to the images, which is a lot of data! Here is my plan to make it efficient, and I would like to get your feedback and recommendations:
Does this make sense? Do you have any recommendations for me before moving forward?
By the way, the compression should take roughly 4 hours per plate (pessimistic estimate), and can be run in parallel, with multiple plates per machine (one per CPU). So using 30 cheap instances (with 4 cores each) in spot mode should do the trick in one day, including the operator's time :)
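As a quick back-of-the-envelope check on that estimate, using only the numbers given in the comment above:

```bash
# 30 spot instances x 4 cores = 120 plates in flight; 136 plates => 2 waves of ~4 h each.
plates=136; slots=$((30 * 4)); hours_per_plate=4
echo "~$(( (plates + slots - 1) / slots * hours_per_plate )) hours of compute"   # ~8 hours
```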