
Stated aggregation method (mean) is inconsistent with that used in other projects #53

Closed
shntnu opened this issue Apr 29, 2020 · 9 comments



shntnu commented Apr 29, 2020

The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5). However, in other projects / papers / software, we decided to use median. This is a major issue and should be resolved!

Some notes

  1. In the LINCS dataset, we decided to use medians (see broadinstitute/lincs-cell-painting#3 (comment), "Adding Level 3-5 Cell Painting Data Questions"): we were following the lead of the previously-processed version of the dataset, in which we used median. When this project was first executed in 2017, cytominer_scripts used median as the default; this was later changed in broadinstitute/cytominer_scripts#18 ("Change default aggregation to be mean instead of median"; more on this below).
  2. pycytominer uses median by default for aggregation.
  3. In our 2018 experiments, we found that median performed better (that plot doesn't show mean).
  4. The PR broadinstitute/cytominer_scripts#18 – "Change default aggregation to be mean instead of median" – was meant to make cytominer_scripts/aggregate.R consistent with cytotools/aggregate.R. But it is unclear why cytotools (the new version of cytominer_scripts) used mean! I think this was because cytominer::aggregate used mean by default.
  5. Carmen Verdugo reported in June 2022: the profiling recipe does aggregation using median by default (here and here), while the profiling handbook uses collate.py to create the SQLite database and do the aggregation – and here is the key: collate.py hard-codes mean by default (here).
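For readers less familiar with the pipeline: the practical difference between the two defaults is just the reduction applied per well when forming level 3 profiles from single cells. A minimal pandas sketch (data and column names are hypothetical, loosely following CellProfiler-style naming):

```python
import pandas as pd

# Hypothetical level-2 (single-cell) data: one row per cell,
# with well metadata and a morphology feature.
cells = pd.DataFrame({
    "Metadata_Well": ["A01", "A01", "A01", "A02", "A02"],
    "Cells_AreaShape_Area": [100.0, 110.0, 400.0, 90.0, 95.0],
})

# Level 3 profiles: aggregate single cells per well, with either reduction.
mean_profiles = cells.groupby("Metadata_Well").mean(numeric_only=True)
median_profiles = cells.groupby("Metadata_Well").median(numeric_only=True)

# The outlier cell (400.0) pulls the mean of well A01 but not its median.
print(mean_profiles.loc["A01", "Cells_AreaShape_Area"])    # ~203.33
print(median_profiles.loc["A01", "Cells_AreaShape_Area"])  # 110.0
```

Everything downstream of this choice (normalization, collapsing to level 5) is unchanged; only the reduction differs.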
@shntnu shntnu changed the title Clearly state aggregation method Stated aggregation method (mean) is inconsistent with that used in other projects Apr 29, 2020

shntnu commented Jun 9, 2022

Moving an email conversation with @bethac07 to this thread

IIUC the profiling recipe should be updated to reflect our current practice of using mean instead of median

I agree with Beth that mean makes more sense at this level

The reason for that is that a lot of things may have partial penetrance (when using genetic reagents) or cause shifts in things like cell cycle phase; unless a phenotype affects 50% + 1 of the cells, you'll never see it with median aggregation. For any strong phenotype it shouldn't matter much which you use, but for one that's subtle and/or not universal, we feel mean captures it better.
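The penetrance argument can be made concrete with synthetic numbers (a toy sketch; all values hypothetical): if only 30% of the cells in a well respond, the per-well median barely moves while the mean shifts noticeably.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-cell feature values for one well: 70% of cells
# are unaffected (centered at 0) and 30% show a strong phenotype
# (centered at 5) -- i.e., a partially penetrant perturbation.
unaffected = rng.normal(loc=0.0, scale=1.0, size=700)
responders = rng.normal(loc=5.0, scale=1.0, size=300)
cells = np.concatenate([unaffected, responders])

# The median sits inside the unaffected majority, so the phenotype is
# largely invisible; the mean is pulled toward the responders
# (expected value 0.7 * 0 + 0.3 * 5 = 1.5).
print(f"median: {np.median(cells):.2f}")
print(f"mean:   {np.mean(cells):.2f}")
```

Once penetrance crosses 50%, the median jumps to the responding population, which is why the choice matters most for subtle or partially penetrant phenotypes.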

However we did notice this:

  > • In our 2018 experiments, we found that median performed better (that plot doesn't show mean).

Nevertheless, I think our default should just be whatever we are now doing in http://github.com/jump-cellpainting/ and whatever the Cimini lab has been using as their default. Are we using mean or median in JUMP, @bethac07? Is that consistent with whatever C-lab uses for profiling datasets? If so, let's update the recipe to reflect this.

@bethac07

> The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5).

The current handbook does not cover collapsing at all, so it does not take a position on the latter, but you're right that it does in the former.

> In our 2018 experiments, we found that median performed better (that plot doesn't show mean)

I'm curious whether that held for both the compound experiments and TA-ORF, do you recall? Because naively, due to penetrance, I would suspect that median might fare better for compounds but mean better for genetic perturbations.

> Nevertheless, I think our default should just be whatever we are now doing in http://github.com/jump-cellpainting/ and whatever the Cimini lab has been using as their default. Are we using mean or median in JUMP, @bethac07? Is that consistent with whatever C-lab uses for profiling datasets?

@ErinWeisbart can confirm, but I'm pretty sure at least for all of JUMP production she was using collate.py, which means she's been doing mean aggregation. Based on the config files for CPJUMP1/the pilots, the aggregation method is not set and so would be the default, so I guess it's median. We've been using mean as our default, but if that is based on me remembering the decision to move to mean incorrectly and thinking that was what should have been our default, we can certainly pivot to using median as the aggregation.


bethac07 commented Jun 14, 2022

[Figure: benchmark of aggregation strategies (mean/median × PCA/factor analysis) across the bioactives, TA-ORF, and CDRP datasets]

It's annoying that mean alone isn't in this graph, but to my eye:

Mean + PCA vs Median + PCA: mean outperforms median in the bioactives and TA-ORF; they are equivalent in CDRP.

Mean + Factor vs Median + Factor: the colors are maddeningly similar for this pair, and additionally it looks like the legend has 11 conditions while each plot has no more than 10, so they may never have been run head to head (or one has inadvertently not been plotted).

So what specific data are we basing median outperforming mean on? It def isn't published, at least not in that paper.


shntnu commented Jun 14, 2022

> So what specific data are we basing median outperforming mean on? It def isn't published, at least not in that paper.

Darn, I tried digging through https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/506658917/Experiments+-+Moment+based+profiling to find it, but no luck

I've asked Greg in case we did it in LINCS broadinstitute/lincs-cell-painting#22 (comment)

> The current handbook does not cover collapsing at all, so it does not take a position on the latter, but you're right that it does in the former.

Noted

> I'm curious whether that held for both the compound experiments and TA-ORF, do you recall? Because naively, due to penetrance, I would suspect that median might fare better for compounds but mean better for genetic perturbations.

No luck with my digging, so I don't have an answer here unfortunately

> @ErinWeisbart can confirm, but I'm pretty sure at least for all of JUMP production she was using collate.py, which means she's been doing mean aggregation. Based on the config files for CPJUMP1/the pilots, the aggregation method is not set and so would be the default, so I guess it's median.

Thanks for this background. So it sounds like there is a discrepancy between jump/source_4 (mean) and jump-pilot/source_4 (median) at the least?

> We've been using mean as our default, but if that is based on me remembering the decision to move to mean incorrectly and thinking that was what should have been our default, we can certainly pivot to using median as the aggregation.

  1. the majority of the recent datasets have been processed using mean
  2. mean is the current default in the Cimini lab
  3. Even though the plot in #53 (comment) does not show mean vs median head-to-head, PCA + mean vs PCA + median should be indicative: PCA + mean > PCA + median.

All this together makes me recommend mean as the default.

We might come up with more conclusive answers when we dig deeper into JUMP https://github.com/jump-cellpainting/develop-computational-pipeline/issues/57

@ErinWeisbart

It is correct that for JUMP Production I used collate.py.

@shntnu

shntnu commented Aug 15, 2022

I'll resolve this thread now because we've concluded that mean is the aggregation method of choice.

As noted above, we might come up with more conclusive answers when we dig deeper into JUMP https://github.com/jump-cellpainting/develop-computational-pipeline/issues/57 and related efforts, and we can revisit these decisions then.

@niranjchandrasekaran note that there is a discrepancy between the CPJUMP1 and the JUMP production datasets; see #53 (comment) for details.

@shntnu shntnu closed this as completed Aug 15, 2022

shntnu commented Aug 15, 2022

Oh, and @bethac07 @ErinWeisbart: please keep an eye out for this discrepancy; it's possible there are other places (other than the handbook + recipe) where this will need to be fixed.

I see that @carmendv has already updated this in https://github.com/broadinstitute/cellprofiler-on-Terra/pull/40 so that's great 👍

@AnneCarpenter

AnneCarpenter commented Aug 15, 2022

So do I understand correctly that we are choosing mean as our future default because that is what is used in JUMP production data? If so, that makes sense to me.

@bethac07

@AnneCarpenter Mean was used in jump-production, and we think mean is better. But all the JUMP pilots were done with median.
