Comparing Cytominer and Pycytominer Profiles #28

Closed
gwaybio opened this issue May 10, 2020 · 16 comments

@gwaybio
Member

gwaybio commented May 10, 2020

In this issue, I will discuss results of step 3 outlined in #22 (comment)

Note that this is copied from a notebook that will be added in a future pull request. The details in this notebook will guide our discussion of the results.

Comparing Pycytominer and Cytominer Processing

We previously processed all of the Drug Repurposing Hub Cell Painting data using cytominer, an R-based tool for image-based profiling. In this repo, we reprocess the data with pycytominer. As the name suggests, pycytominer is a Python-based tool for image-based profiling.

We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.

We output several metrics comparing the two approaches.

Metrics

In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.

  1. Mean, median, and sum of element-wise differences
  2. Per feature mean, median, and sum of element-wise differences
  3. Feature selection procedure differences per feature (level 4b only)
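The first two metrics can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes both profile tables are pandas DataFrames whose rows are already aligned well-by-well, and the function name and feature names are hypothetical.

```python
import pandas as pd


def summarize_abs_diff(pycyto: pd.DataFrame, cyto: pd.DataFrame) -> dict:
    """Summarize element-wise absolute differences between two aligned tables."""
    # Compare only the features present in both tables, in a shared order
    shared = [col for col in pycyto.columns if col in cyto.columns]
    abs_diff = (pycyto[shared] - cyto[shared]).abs()

    # Metric 1: mean, median, and sum over every matrix entry
    flat = pd.Series(abs_diff.to_numpy().ravel())
    overall = {"mean": flat.mean(), "median": flat.median(), "sum": flat.sum()}

    # Metric 2: the same summaries per feature (one row per feature)
    per_feature = abs_diff.agg(["mean", "median", "sum"]).T
    return {"overall": overall, "per_feature": per_feature}


# Toy data: two hypothetical features measured in three wells
pycyto = pd.DataFrame(
    {"Cells_AreaShape_Area": [1.0, 2.0, 3.0], "Nuclei_Intensity_Mean": [0.5, 0.5, 0.5]}
)
cyto = pd.DataFrame(
    {"Cells_AreaShape_Area": [1.0, 2.5, 2.0], "Nuclei_Intensity_Mean": [0.5, 0.5, 1.0]}
)
result = summarize_abs_diff(pycyto, cyto)
print(result["overall"])
print(result["per_feature"])
```

Running this once per plate would give one point per plate for the overall metrics and one point per feature per plate for the per-feature metrics.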

In addition, we confirm alignment of the following metadata columns:

  • Well
  • Broad Sample Name
  • Plate

Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
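A check like the metadata alignment described above might look like the sketch below. The column names and the `BRD-0001` sample label are assumptions for illustration; the real profiles may name these columns differently.

```python
import pandas as pd

# Hypothetical column names for well, sample name, and plate
ALIGN_COLS = ["Metadata_Plate", "Metadata_Well", "Metadata_broad_sample"]


def metadata_is_aligned(pycyto, cyto, cols=ALIGN_COLS) -> bool:
    """Return True when both tables hold the same (plate, well, sample) rows."""
    # Sort both tables the same way so row order cannot cause a false mismatch
    left = pycyto[cols].sort_values(cols).reset_index(drop=True)
    right = cyto[cols].sort_values(cols).reset_index(drop=True)
    return left.equals(right)


pycyto_meta = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1"],
        "Metadata_Well": ["A01", "A02"],
        "Metadata_broad_sample": ["DMSO", "BRD-0001"],
    }
)
# Same rows in reverse order: still aligned
cyto_meta = pycyto_meta.iloc[::-1]
# One well label changed: no longer aligned
cyto_bad = pycyto_meta.assign(Metadata_Well=["A01", "A03"])

aligned = metadata_is_aligned(pycyto_meta, cyto_meta)
misaligned = metadata_is_aligned(pycyto_meta, cyto_bad)
print(aligned, misaligned)
```

Columns that are expected to differ, such as MOA and target annotations, are simply left out of `ALIGN_COLS`.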

Data Levels

Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.

| Data | Level | Comparison |
| :--- | :--- | :--- |
| Images | Level 1 | NA |
| SQLite File (single cell profiles) | Level 2 | NA |
| Aggregated Profiles with Well Information (metadata) | Level 3 | Yes |
| Normalized Aggregated Profiles with Metadata | Level 4a | Yes |
| Normalized and Feature Selected Aggregated Profiles with Metadata | Level 4b | Yes |
| Perturbation Profiles created Summarizing Replicates | Level 5 | No |

@gwaybio
Member Author

gwaybio commented May 10, 2020

Important Note: The plates and features are visualized in the same order in all figures.

EDIT: These comparisons were originally made incorrectly: I was comparing cytominer DMSO normalization to pycytominer whole-plate normalization. See #28 (comment) for the correct comparisons. In this edit, I also added a collapsible tag to these out-of-date figures.

Summary

Click to show figures

summary_metrics_full

summary_metrics_zoom

Figure Legend: The top plot above represents the complete summary within each plate. The bottom plot above zooms in to absolute difference < 10. Each point represents the mean/median/sum of absolute value differences comparing cytominer and pycytominer. I derived the metrics using every matrix entry per plate. There is one point per plate.

Interpretation

It is odd that a handful of plates sit around mean = 2.5; median = 2 in level 3 data. This level data is only derived from aggregating single cell profiles. In level 4a and level 4b data, the mean and median per plate difference sits around 0.5. In level 4a data, the extreme values in mean and sum are driven largely by only a few very large outliers (we don't see this in medians). These features are removed in the level 4b data. Some extreme outliers in the level 4a data may be explained by the handful of plates that are oddly different in level 3 data.

Within Level Summaries

Interpretation

On the surface, it looks like there are insurmountable differences between the two tools. However, upon closer inspection, the differences are not as large as they seem. There are indeed examples of extremely large differences, but only in a subset of the plates and a subset of the features. This view highlights the strange differences observed in a small subset of level 3 plates, and these plates continue to be different in level 4a data. However, these differences appear to dissolve in level 4b, indicating that they are likely explained by a relatively small subset of specific features. Likewise, the handful of extremely large level 4a differences also appear largely resolved in the level 4b data. This observation further emphasizes the need to perform an independent round of feature selection when merging level 4a plates. It also raises the question of whether we need to develop different normalization strategies per feature, since the inherent differences in feature distributions lead to differential responses to a single normalization strategy.

Level 3 - Aggregated Profiles

Click to show figures

level_3_metrics_per_feature

level_3_metrics_per_plate

Figure Legend: Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the bottom plot shows a per plate view.

Level 4a - Aggregated and Normalized Profiles

Click to show figures

level_4a_metrics_per_feature

level_4a_metrics_per_plate

Figure Legend: Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the bottom plot shows a per plate view.

Level 4b - Aggregated, Normalized, and Feature Selected Profiles

Click to show figures

level_4b_metrics_per_feature

level_4b_metrics_per_plate

level_4b_all_across_well_summary

Figure Legend: Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the middle plot shows a per plate view. The bottom plot represents the distribution of feature differences in total. Most differences are very near zero.

feature_select_summary

Figure Legend: Feature selection summary. This figure shows which features were selected, by plate, by each of the two tools.

@gwaybio
Member Author

gwaybio commented May 10, 2020

Recommendation

We knew at the outset that aligning the profiles derived from each tool was going to be an arduous task. We knew that some cytominer-derived profiles were normalized differently (#3 (comment)), which also implies that different plates could have been processed at different times with different methods. We do not know the extent to which different plates have been processed differently with cytominer. We do know that all plates have been uniformly processed with the same pycytominer pipeline.

Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can exist under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.

@gwaybio
Member Author

gwaybio commented May 10, 2020

The code to generate these results is provided in #29

cc'ing @niranjchandrasekaran @shntnu and @AnneCarpenter - any feedback/suggestions/comments are welcome!

@shntnu
Collaborator

shntnu commented May 10, 2020

This represents an incredible amount of work on your part! 💯

It is odd that a handful of plates sit around mean = 2.5; median = 2 in level 3 data.

These very likely correspond to the 10 plates listed here #3 (comment).

When creating the notebook, do leave those plates out.

In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.

Can you test if it is related to m.a.d.

R https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mad.html
Python https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html
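A dependency-free sketch of the m.a.d. itself may help when checking whether the two implementations agree. It assumes the scaled definition used by R's `mad()`, whose default `constant` is 1.4826; the function name here is only for illustration. SciPy's newer `median_abs_deviation` defaults to no scaling, so the scale setting is worth verifying against whichever version is installed.

```python
from statistics import median

# Consistency constant that makes the m.a.d. comparable to a standard
# deviation for normally distributed data (R's default `constant`)
NORMAL_CONSTANT = 1.4826


def mad(values, constant=NORMAL_CONSTANT):
    """Scaled median absolute deviation around the median."""
    center = median(values)
    return constant * median(abs(v - center) for v in values)


values = [1, 2, 3, 4, 100]
scaled = mad(values)
unscaled = mad(values, constant=1.0)
print(scaled, unscaled)
```

If one tool applied the constant and the other did not, every normalized value would differ by the same factor of 1.4826, which would show up as a systematic offset rather than a few outliers.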

@shntnu
Collaborator

shntnu commented May 10, 2020

We knew that some cytominer-derived profiles were normalized differently (#3 (comment))

Ah – they were aggregated differently, not normalized differently

@shntnu shntnu changed the title from "Comparing Cytominer and Pyctominer Profiles" to "Comparing Cytominer and Pycytominer Profiles" May 11, 2020
@shntnu
Collaborator

shntnu commented May 11, 2020

Recommendation

Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can exist under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.

I'm on board with this plan! 💯

@gwaybio
Member Author

gwaybio commented May 11, 2020

These very likely correspond to the 10 plates listed here #3 (comment).
When creating the notebook, do leave those plates out.

level_3_metrics_per_plate_uniformplates

level_3_metrics_per_feature_uniformplates

Confirmed that those are the 10 plates. Some features (not all) still do have a bit of a wobble.

Ah – they were aggregated differently, not normalized differently

Ah! I got it - makes sense.

In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.

Can you test if it is related to m.a.d.

Digging into this now. One thing I did quickly check is that both python and R implementations use the same scaling factor (constant = 1.4826).

Three other things to note in the analysis log linked in #3 (comment).

  1. The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!
  2. Also, the default cytominer_scripts/normalize is "robustize" but the default cytominer/normalize is "standardize".
    • Sidenote: The version of cytominer/normalize used in processing the cytominer profiles was likely the August 15, 2016 version, which also uses "standardize". The February 23rd, 2017 cytominer_scripts/normalize also uses robustize.
  3. I think the current implementation of cytominer_scripts/normalize samples profiles before normalization. (Maybe I am reading the code wrong though.) Without knowing the seed used in the processing, there is no way to simulate this.
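To make the first two points concrete, here is a rough sketch of how the choice of method ("standardize" vs. "robustize") and the choice of reference population (DMSO controls vs. whole plate) change the output. The formulas are the textbook definitions and are an assumption about what each option does; this is not the cytominer or pycytominer implementation, and the data are made up.

```python
from statistics import mean, median, stdev


def standardize(values, reference):
    """Center on the reference mean and scale by its standard deviation."""
    mu, sd = mean(reference), stdev(reference)
    return [(v - mu) / sd for v in values]


def robustize(values, reference, constant=1.4826):
    """Center on the reference median and scale by its scaled m.a.d."""
    med = median(reference)
    mad = constant * median(abs(r - med) for r in reference)
    return [(v - med) / mad for v in values]


# One feature measured in six wells; the first three are hypothetical DMSO wells
feature = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]
dmso = feature[:3]

whole_plate = robustize(feature, reference=feature)
by_dmso = robustize(feature, reference=dmso)
by_dmso_standardized = standardize(feature, reference=dmso)

print(whole_plate)
print(by_dmso)
print(by_dmso_standardized)
```

The same well gets a different normalized value under each setting, so comparing DMSO-normalized profiles from one tool against whole-plate-normalized profiles from the other would produce differences that have nothing to do with the tools themselves.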

@shntnu
Collaborator

shntnu commented May 11, 2020

  • The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!

My apologies for the confusion (in case I introduced it), but we did do a whole-plate normalization when creating profiles for the "pseudo" batch 2016_04_01_a549_48hr_batch1_cmap_style.

This was the script. There are two different sets of files on S3.

  • One corresponding to 2016_04_01_a549_48hr_batch1, which was robustized using the DMSO controls
  • The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustized using the whole plate; everything else upstream of normalization was identical to 2016_04_01_a549_48hr_batch1.

I think you are using 2016_04_01_a549_48hr_batch1. An easy way to check is to compute the median DMSO profile per plate; it should be all zeros (within some epsilon). Will do so.
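The check described above could be sketched like this. The column names, the `DMSO` label, and the tolerance are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd


def dmso_medians_near_zero(
    profiles,
    feature_cols,
    eps=1e-6,
    plate_col="Metadata_Plate",          # hypothetical column name
    sample_col="Metadata_broad_sample",  # hypothetical column name
    control_label="DMSO",
):
    """Per plate, report whether every feature's DMSO median is within eps of 0."""
    controls = profiles[profiles[sample_col] == control_label]
    medians = controls.groupby(plate_col)[feature_cols].median()
    return (medians.abs() <= eps).all(axis=1)


profiles = pd.DataFrame(
    {
        "Metadata_Plate": ["P1"] * 4 + ["P2"] * 4,
        "Metadata_broad_sample": ["DMSO", "DMSO", "DMSO", "BRD-0001"] * 2,
        # P1 looks DMSO-normalized (control median 0); P2 does not
        "feat_a": [-1.0, 0.0, 1.0, 5.0, 0.5, 0.7, 0.9, 5.0],
    }
)
check = dmso_medians_near_zero(profiles, ["feat_a"])
print(check)
```

Plates that come back `True` are consistent with normalization against the DMSO controls; plates normalized against the whole plate would generally come back `False`.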

For our notes: here are the two sets of files for a sample plate.

~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
2018-06-15 20:18:12 9778892027 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz
2018-06-19 16:05:52   31196865 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz

@shntnu
Collaborator

shntnu commented May 11, 2020

I can confirm this

See these two plots in the attached notebook

  • x2016_04_01_a549_48hr_batch1_cmap_style_SQ00015233_normalized_median_full_plate
  • x2016_04_01_a549_48hr_batch1_SQ00015233_normalized_median_dmso

Notebook resolving_normalization_issue.nb.html.zip

So if you use the Level 4a profiles from 2016_04_01_a549_48hr_batch1_cmap_style, I'm pretty sure they will be nearly identical to your output.

@gwaybio
Member Author

gwaybio commented May 11, 2020

This could be a reason for the discrepancy!!

I can confirm this

Looks like we are on to something here

Comparing DMSO normalized plates 👇 (after removing those pesky 10 nonuniform plates)

summary_metrics_full_uniformplates

summary_metrics_zoom_uniformplates

level_4a_metrics_per_feature

level_4b_metrics_per_feature

Check out those axes! 👀

@gwaybio
Member Author

gwaybio commented May 11, 2020

I am much more comfortable with these comparisons 👍 a couple of takeaways:

  • The feature selected (4b) comparisons agree to within floating point error, but there is some wobble. There is more wobble in the 4a profiles (before feature selection). I think this is why feature selection tends to work well in our cases.
    • Our normalization procedures do not play nicely with all CellProfiler feature outputs.
    • This will change experiment-to-experiment, and might even change between CellProfiler versions.
    • CellProfiler features that are introduced with each analysis module could come alongside a metadata spreadsheet for data scientists that describes expected feature behavior and normalization strategies.
    • I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.
  • I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!

One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.

@shntnu
Collaborator

shntnu commented May 11, 2020

  • I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!

Agree

@shntnu
Collaborator

shntnu commented May 11, 2020

  • I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.

distributions of features at the aggregate level or single cell level?

@gwaybio
Member Author

gwaybio commented May 11, 2020

distributions of features at the aggregate level or single cell level?

Aggregate level, but probably would work at single cell right? I haven't played with single cell yet so I don't know these distributions

@gwaybio
Member Author

gwaybio commented May 11, 2020

One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.

Comparing Cytominer 4a to Pycytominer 4b

Results below. There are a couple of discrepancies, but not many more than we see in the 4b intersection.

Click to show figures

pycytominer_select_metrics_per_feature

pycytominer_select_metrics_per_plate

@gwaybio
Member Author

gwaybio commented May 15, 2020

closed by merging #29

@gwaybio gwaybio closed this as completed May 15, 2020