Comparing Cytominer and Pycytominer Profiles #28

gwaybio · 2020-05-10T17:30:00Z

In this issue, I will discuss results of step 3 outlined in #22 (comment)

Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results.

Comparing Pycytominer and Cytominer Processing

We previously processed all of the Drug Repurposing Hub Cell Painting data using cytominer, an R-based tool for image-based profiling. In this repo, we reprocess the data with pycytominer. As the name suggests, pycytominer is a Python-based tool for image-based profiling.

We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.

We output several metrics comparing the two approaches.

Metrics

In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.

Mean, median, and sum of element-wise differences
Per feature mean, median, and sum of element-wise differences
Feature selection procedure differences per feature (level 4b only)
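
For readers following along, here is a minimal sketch of how the summary metrics listed above could be computed for one plate. It is not the notebook's code (that lives in #29); the function name `difference_metrics` and the toy numbers are made up for illustration, and it assumes the two feature matrices are already aligned row by row and column by column.

```python
import numpy as np


def difference_metrics(pycytominer_features, cytominer_features):
    """Summarize element-wise absolute differences between two aligned matrices.

    Both inputs are (wells x features) arrays whose rows and columns are
    assumed to be in the same order already.
    """
    a = np.asarray(pycytominer_features, dtype=float)
    b = np.asarray(cytominer_features, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")

    abs_diff = np.abs(a - b)

    return {
        # Overall summaries across every matrix entry
        "mean": float(abs_diff.mean()),
        "median": float(np.median(abs_diff)),
        "sum": float(abs_diff.sum()),
        # One value per feature (column)
        "per_feature_mean": abs_diff.mean(axis=0),
        "per_feature_median": np.median(abs_diff, axis=0),
        "per_feature_sum": abs_diff.sum(axis=0),
    }


# Toy plate (made-up values): 3 wells x 2 features
py_plate = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cy_plate = [[1.0, 2.5], [3.0, 4.0], [4.0, 6.0]]
metrics = difference_metrics(py_plate, cy_plate)
print(metrics["mean"], metrics["median"], metrics["sum"])
```

Calling this once per plate gives one point per plate for the summary plots, and the per-feature arrays give one point per feature within a plate.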

In addition, we confirm alignment of the following metadata columns:

Well
Broad Sample Name
Plate

Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
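
A rough sketch of the metadata alignment check, assuming both profile tables are loaded as pandas DataFrames. The column names in `ALIGN_COLS` are hypothetical stand-ins for Plate, Well, and Broad Sample Name, and the sample values are made up; substitute whatever the real profiles use.

```python
import pandas as pd

# Hypothetical column names standing in for Plate, Well, and Broad Sample Name
ALIGN_COLS = ["Metadata_Plate", "Metadata_Well", "Metadata_broad_sample"]


def metadata_aligned(pycytominer_df, cytominer_df, columns=ALIGN_COLS):
    """Return True if both tables hold identical values, row by row,
    in the given metadata columns. Other columns are ignored."""
    left = pycytominer_df[columns].reset_index(drop=True)
    right = cytominer_df[columns].reset_index(drop=True)
    return left.equals(right)


# Made-up example rows; the MOA column differs on purpose
py_df = pd.DataFrame(
    {
        "Metadata_Plate": ["SQ00014812", "SQ00014812"],
        "Metadata_Well": ["A01", "A02"],
        "Metadata_broad_sample": ["DMSO", "compound_1"],
        "Metadata_moa": ["", "updated annotation"],
    }
)
cy_df = py_df.assign(Metadata_moa=["", "old annotation"])

print(metadata_aligned(py_df, cy_df))             # same order: aligned
print(metadata_aligned(py_df, cy_df.iloc[::-1]))  # reversed rows: not aligned
```

The check is deliberately order-sensitive, since the element-wise differences only make sense when rows line up. Sorting both tables by these columns first would turn it into an order-independent check.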

Data Levels

Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.

Data	Level	Comparison
Images	Level 1	NA
SQLite File (single cell profiles)	Level 2	NA
Aggregated Profiles with Well Information (metadata)	Level 3	Yes
Normalized Aggregated Profiles with Metadata	Level 4a	Yes
Normalized and Feature Selected Aggregated Profiles with Metadata	Level 4b	Yes
Perturbation Profiles created Summarizing Replicates	Level 5	No

gwaybio · 2020-05-10T17:47:49Z

Important Note: The plates and features are visualized in the same order in all figures.

EDIT: These comparisons were originally made incorrectly. I was comparing cytominer DMSO normalization to pycytominer whole plate normalization. See #28 (comment) for the correct comparisons. In the edit I also added a collapsible tag to these out-of-date figures.

Summary

Click to show figures

Figure Legend: The top plot above represents the complete summary within each plate. The bottom plot above zooms in to absolute difference < 10. Each point represents the mean/median/sum of absolute value differences comparing cytominer and pycytominer. I derived the metrics using every matrix entry per plate. There is one point per plate.

Interpretation

It is odd that a handful of plates sit around mean = 2.5; median = 2 in level 3 data. Level 3 data are derived only by aggregating single cell profiles. In level 4a and level 4b data, the mean and median per plate difference sits around 0.5. In level 4a data, the extreme values in mean and sum are driven largely by a few very large outlier features (we don't see this in the medians). These features are removed in the level 4b data. Some of the extreme level 4a outliers may be explained by the handful of plates that are oddly different in level 3 data.

Within Level Summaries

Interpretation

On the surface, it looks like there are insurmountable differences between the two tools. However, upon closer inspection, the differences are not as large as they seem. There are indeed extremely large differences in a subset of the plates and a subset of the features. This view highlights the strange differences observed in a small subset of level 3 plates, and these plates continue to be different in level 4a data. However, these differences appear to dissolve in level 4b, indicating that they are likely explained by a relatively small subset of specific features. Likewise, the handful of extremely large level 4a differences also appear largely resolved in the level 4b data. This observation further emphasizes the need to perform an independent round of feature selection when merging level 4a plates. It also raises the question of whether we need to develop different normalization strategies per feature, since the inherent differences in feature distributions lead to differential responses to a single normalization strategy.

Level 3 - Aggregated Profiles

Click to show figures

Figure Legend: Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the bottom plot shows a per plate view.

Level 4a - Aggregated and Normalized Profiles

Click to show figures

Figure Legend: Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the bottom plot shows a per plate view.

Level 4b - Aggregated, Normalized, and Feature Selected Profiles

Click to show figures

Figure Legend: Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the middle plot shows a per plate view. The bottom plot represents the distribution of feature differences in total. Most differences are very near zero.

Figure Legend: Feature selection summary. This figure shows which features were selected, by plate, by each of the two tools.

gwaybio · 2020-05-10T18:04:41Z

Recommendation

We knew at the outset that aligning the profiles derived from each tool was going to be an arduous task. We knew that some cytominer-derived profiles were normalized differently (#3 (comment)), which also implies that different plates could have been processed at different times with different methods. We do not know the extent to which different plates have been processed differently with cytominer. We do know that all plates have been uniformly processed with the same pycytominer pipeline.

Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can exist under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.

gwaybio · 2020-05-10T18:11:13Z

The code to generate these results is provided in #29

cc'ing @niranjchandrasekaran @shntnu and @AnneCarpenter - any feedback/suggestions/comments are welcome!

shntnu · 2020-05-10T21:28:04Z

This represents an incredible amount of work on your part! 💯

It is odd that a handful of plates sit around mean = 2.5; median = 2 in level 3 data.

These very likely correspond to the 10 plates listed here #3 (comment).

When creating the notebook, do leave those plates out.

In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.

Can you test if it is related to m.a.d.

R https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mad.html
Python https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html

shntnu · 2020-05-10T21:40:50Z

We knew that some cytominer-derived profiles were normalized differently (#3 (comment))

Ah – they were aggregated differently, not normalized differently

shntnu · 2020-05-11T00:02:34Z

Recommendation

Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can exist under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.

I'm on board with this plan! 💯

gwaybio · 2020-05-11T18:32:02Z

These very likely correspond to the 10 plates listed here #3 (comment).
When creating the notebook, do leave those plates out.

Confirmed that those are the 10 plates. Some features (not all) still do have a bit of a wobble.

Ah – they were aggregated differently, not normalized differently

Ah! I got it - makes sense.

In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.

Can you test if it is related to m.a.d.

Digging into this now. One thing I did quickly check is that both python and R implementations use the same scaling factor (constant = 1.4826).
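
For context on where 1.4826 comes from, here is a small standard-library sketch. The constant is 1 / Φ⁻¹(3/4), which makes the MAD a consistent estimate of the standard deviation for normally distributed data. The helper `scaled_mad` is illustrative only and is not taken from either tool.

```python
from statistics import NormalDist, median

# 1 / Phi^{-1}(3/4): rescales the MAD so that it estimates the standard
# deviation when the data are normally distributed.
MAD_CONSTANT = 1 / NormalDist().inv_cdf(0.75)


def scaled_mad(values, constant=MAD_CONSTANT):
    """Median absolute deviation from the median, times a scaling constant."""
    center = median(values)
    return constant * median([abs(v - center) for v in values])


print(round(MAD_CONSTANT, 4))  # 1.4826
print(scaled_mad([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

If one implementation applied the constant and the other did not, every MAD-normalized value would differ by that same factor, so matching constants rules out one simple explanation for a systematic offset.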

Three other things to note in the analysis log linked in #3 (comment).

The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!
Also, the default cytominer_scripts/normalize is "robustize" but the default cytominer/normalize is "standardize".
- Sidenote: The version of cytominer/normalize used in processing the cytominer profiles was likely the August 15, 2016 version, which also uses "standardize". The February 23rd, 2017 cytominer_scripts/normalize also uses robustize.
I think the current implementation of cytominer_scripts/normalize samples profiles before normalization. (Maybe I am reading the code wrong though). Without knowing the seed used in the processing, there is no way to simulate this
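
To make the "robustize" versus "standardize" distinction concrete, here is a textbook-style sketch of the two operations on a single feature. It only borrows the names from this thread; it is an assumption that the real cytominer functions reduce to these formulas, and edge cases such as a zero MAD are not handled.

```python
from statistics import mean, median, stdev

MAD_CONSTANT = 1.4826  # scaling factor quoted in this thread


def standardize(values):
    """Center on the mean, scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def robustize(values):
    """Center on the median, scale by the scaled MAD."""
    med = median(values)
    mad = MAD_CONSTANT * median([abs(v - med) for v in values])
    return [(v - med) / mad for v in values]


# Made-up feature values with one outlier well
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
print(standardize(feature))
print(robustize(feature))
```

With an outlier present, the two methods put the same wells on very different scales, so a mismatch in the default method alone could produce large element-wise differences.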

shntnu · 2020-05-11T19:38:02Z

The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!

My apologies for the confusion (in case I introduced it), but we did do a whole-plate normalization when creating profiles for the "pseudo" batch 2016_04_01_a549_48hr_batch1_cmap_style.

This was the script. There are two different sets of files on S3.

One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control
The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate, but everything else upstream of normalization was identical to 2016_04_01_a549_48hr_batch1.

I think you are using 2016_04_01_a549_48hr_batch1. An easy way to check is to compute the median DMSO profile per plate; it should be all zeros (within some epsilon). Will do so.
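
The check described above can be sketched as follows. This is a toy illustration with made-up numbers and hypothetical helper names (`robustize_by`, `dmso_median_is_zero`), not the code from the attached notebook. It assumes robustize means centering on the median and scaling by the scaled MAD of the reference wells.

```python
import numpy as np


def robustize_by(features, reference_mask):
    """Center and scale every well using the median and scaled MAD
    computed from the reference wells only."""
    features = np.asarray(features, dtype=float)
    ref = features[reference_mask]
    med = np.median(ref, axis=0)
    mad = 1.4826 * np.median(np.abs(ref - med), axis=0)
    return (features - med) / mad


def dmso_median_is_zero(normalized, dmso_mask, eps=1e-6):
    """True if the median DMSO profile is zero for every feature (within eps)."""
    dmso_median = np.median(normalized[dmso_mask], axis=0)
    return bool(np.all(np.abs(dmso_median) < eps))


# Toy plate (made-up values): 6 wells x 2 features, first 3 wells are DMSO
plate = np.array(
    [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [8.0, 30.0], [9.0, 40.0], [10.0, 50.0]]
)
is_dmso = np.array([True, True, True, False, False, False])
all_wells = np.ones(len(plate), dtype=bool)

by_dmso = robustize_by(plate, is_dmso)     # normalized by DMSO control
by_plate = robustize_by(plate, all_wells)  # normalized by the whole plate

print(dmso_median_is_zero(by_dmso, is_dmso))   # True
print(dmso_median_is_zero(by_plate, is_dmso))  # False
```

A plate that passes the check was most likely normalized by DMSO control; a plate that fails was normalized against some other reference, such as the whole plate.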

For our notes: here are the two sets of files for a sample plate.

~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
2018-06-15 20:18:12 9778892027 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz
2018-06-19 16:05:52   31196865 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz

shntnu · 2020-05-11T19:40:08Z

One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control

The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate,

I can confirm this

See these two plots in the attached notebook

x2016_04_01_a549_48hr_batch1_cmap_style_SQ00015233_normalized_median_full_plate
x2016_04_01_a549_48hr_batch1_SQ00015233_normalized_median_dmso

Notebook resolving_normalization_issue.nb.html.zip

So if you use the Level 4a profiles from 2016_04_01_a549_48hr_batch1_cmap_style, I'm pretty sure they will be nearly identical to your output.

gwaybio · 2020-05-11T19:47:20Z

This could be a reason for the discrepancy!!

One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control

The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate,

I can confirm this

Looks like we are on to something here

Comparing DMSO normalized plates 👇 (after removing those pesky 10 nonuniform plates)

Check out those axes! 👀

gwaybio · 2020-05-11T20:02:11Z

I am much more comfortable with these comparisons 👍 a couple of takeaways:

The feature selected (4b) comparisons agree to within floating point precision, but there is some wobble. There is more wobble in the 4a profiles (before feature selection). I think this is why feature selection tends to work well in our cases.
- Our normalization procedures do not play nicely with all CellProfiler feature outputs.
- This will change experiment-to-experiment, and might even change between CellProfiler versions.
- CellProfiler features that are introduced with each analysis module could come alongside a metadata spreadsheet for data scientists that describes expected feature behavior and normalization strategies.
- I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.
I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!

One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.
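
One way this last comparison could look, as a sketch: keep only the features that survive pycytominer feature selection, pull the same columns from the cytominer 4a table, and summarize the absolute differences per feature. The function name, the `Metadata_` prefix convention, and the example feature names and values are assumptions for illustration, and rows are assumed to be aligned already.

```python
import pandas as pd


def compare_selected_to_normalized(pycyto_4b, cyto_4a, meta_prefix="Metadata_"):
    """Mean absolute difference per feature, restricted to the features
    that pycytominer kept in its level 4b profiles."""
    selected = [c for c in pycyto_4b.columns if not c.startswith(meta_prefix)]
    # Features pycytominer kept that the cytominer table does not contain
    missing = [c for c in selected if c not in cyto_4a.columns]
    shared = [c for c in selected if c in cyto_4a.columns]
    abs_diff = (pycyto_4b[shared] - cyto_4a[shared]).abs()
    return abs_diff.mean(), missing


# Made-up example: one selected feature, one feature dropped by selection
pycyto_4b = pd.DataFrame(
    {"Metadata_Well": ["A01", "A02"], "Cells_AreaShape_Area": [0.1, -0.2]}
)
cyto_4a = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02"],
        "Cells_AreaShape_Area": [0.1, -0.1],
        "Nuclei_Intensity_MeanIntensity_DNA": [5.0, 6.0],
    }
)

per_feature, missing = compare_selected_to_normalized(pycyto_4b, cyto_4a)
print(per_feature)
print(missing)
```

Unlike the intersection-based 4b plots, this uses only the pycytominer selection, so any noisy feature that pycytominer failed to remove would show up as a large per-feature difference.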

shntnu · 2020-05-11T20:09:17Z

I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!

Agree

shntnu · 2020-05-11T20:11:54Z

I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.

distributions of features at the aggregate level or single cell level?

gwaybio · 2020-05-11T20:25:54Z

distributions of features at the aggregate level or single cell level?

Aggregate level, but probably would work at single cell right? I haven't played with single cell yet so I don't know these distributions

gwaybio · 2020-05-11T21:35:30Z

One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.

Comparing Cytominer 4a to Pycytominer 4b

Results below. There are a couple of discrepancies, but not many more than we see in the 4b intersection.

Click to show figures

gwaybio · 2020-05-15T18:13:59Z

closed by merging #29

gwaybio mentioned this issue May 10, 2020

Comparing cytominer- and pycytominer-derived profiles #29

Merged

shntnu changed the title ~~Comparing Cytominer and Pyctominer Profiles~~ Comparing Cytominer and Pycytominer Profiles May 11, 2020

gwaybio closed this as completed May 15, 2020
