Stated aggregation method (mean) is inconsistent with that used in other projects #53
Comments
Moving an email conversation with @bethac07 to this thread. IIUC, the profiling recipe should be updated to reflect our current practice of using mean instead of median.
I agree with Beth that
However we did notice this:
Nevertheless, I think our default should just be whatever we are now doing in http://github.com/jump-cellpainting/ and whatever the Cimini lab has been using as their default. Are we using |
The current handbook does not cover collapsing at all, so it does not take a position on the latter, but you're right that it does in the former.
I'm curious whether that held for both the compound experiments and TA-ORF, do you recall? Because naively, due to penetrance, I would suspect that median might fare better for compounds but mean better for genetic perturbations.
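To make the penetrance intuition concrete, here is a minimal sketch using only the Python standard library. The feature values and the 30% response rate are made-up illustration data, not measurements from any of the datasets discussed in this thread. It shows that when only a minority of cells in a well respond, the per-well mean shifts while the per-well median stays at the baseline.

```python
from statistics import mean, median

# Hypothetical values of one feature for the 10 cells in a well.
control = [0.0] * 10             # no perturbation: every cell at baseline
partial = [0.0] * 7 + [5.0] * 3  # incomplete penetrance: 3 of 10 cells respond
full = [5.0] * 10                # full penetrance: every cell responds

summary = {
    name: (mean(cells), median(cells))
    for name, cells in [("control", control), ("partial", partial), ("full", full)]
}

for name, (well_mean, well_median) in summary.items():
    print(f"{name:8s} mean={well_mean:.2f} median={well_median:.2f}")
```

In this toy case the partially penetrant well is indistinguishable from the control under the median but not under the mean, which is the behavior the hypothesis above relies on. Real wells are noisier, so this is only an illustration of the argument, not evidence for either default.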
@ErinWeisbart can confirm, but I'm pretty sure at least for all of JUMP production she was using |
It's annoying that mean alone isn't in this graph, but to my eye:

- Mean + PCA vs. Median + PCA: mean outperforms median in the bioactives + TA-ORF; they are equivalent in CDRP.
- Mean + Factor vs. Median + Factor: the colors are maddeningly similar for this pair, and additionally it looks like the legend has 11 conditions and each plot has no more than 10, so they may never have been run head to head (or one has inadvertently not been plotted).

So what specific data are we basing median outperforming mean on? It def isn't published, at least not in that paper.
Darn, I tried digging through https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/506658917/Experiments+-+Moment+based+profiling to find it, but no luck. I've asked Greg in case we did it in LINCS: broadinstitute/lincs-cell-painting#22 (comment)
Noted
No luck with my digging, so I don't have an answer here unfortunately
Thanks for this background. So it sounds like there is a discrepancy between
All this together makes me recommend mean as the default. We might come up with more conclusive answers when we dig deeper into JUMP: https://github.com/jump-cellpainting/develop-computational-pipeline/issues/57
It is correct that for JUMP Production I used |
I'll resolve this thread now because we've concluded that As noted above, we might come up with more conclusive answers when we dig deeper into JUMP https://github.com/jump-cellpainting/develop-computational-pipeline/issues/57 and related efforts, and we can revisit these decisions then. @niranjchandrasekaran note that there is a discrepancy between the CPJUMP1 and the JUMP production datasets; see #53 (comment) for details.
Oh, and @bethac07 @ErinWeisbart: please keep an eye out for this discrepancy; it's possible there are other places (other than the handbook + recipe) where this will need to be fixed. I see that @carmendv has already updated this in https://github.com/broadinstitute/cellprofiler-on-Terra/pull/40 so that's great 👍
So do I understand correctly that we are choosing mean as our future default because that is what is used in JUMP production data? If so, that makes sense to me.
@AnneCarpenter Mean was used in jump-production, and we think mean is better. But all the JUMP pilots were done with median.
The handbook uses `mean` for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5). However, in other projects / papers / software, we decided to use `median`. This is a major issue and should be resolved!

Some notes

- […] `median`. When this project was first executed in 2017, `cytominer_scripts` used `median` as default; this was later changed in "Change default aggregation to be mean instead of median" (broadinstitute/cytominer_scripts#18; more on this below).
- `pycytominer` uses `median` by default for aggregation.
- […] `median` performed better (that plot doesn't show `mean`).
- […] `mean` instead of `median` was to make `cytominer_scripts/aggregate.R` consistent with `cytotools/aggregate.R`. But it is unclear why `cytotools` (the new version of `cytominer_scripts`) used `mean`! I think this was because `cytominer::aggregate` used `mean` by default.
- […] `median` by default (here and here), while the profiling handbook uses collate.py for the creation of the sqlite and the aggregation, and here is the key: collate.py hard-coded `mean` by default (here).
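For readers unfamiliar with the aggregation step at issue, the following is a minimal sketch of collapsing single-cell values to one value per well with either operation. It uses only the Python standard library; the `aggregate` helper, the `cells` rows, and the well names are hypothetical and are not the API of `pycytominer`, `cytominer`, or collate.py.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical single-cell rows: (well, value of one feature).
cells = [
    ("A01", 1.0), ("A01", 2.0), ("A01", 9.0),
    ("A02", 4.0), ("A02", 4.0), ("A02", 5.0), ("A02", 7.0),
]

def aggregate(rows, operation):
    """Collapse single-cell values to one value per well using `operation`."""
    by_well = defaultdict(list)
    for well, value in rows:
        by_well[well].append(value)
    return {well: operation(values) for well, values in by_well.items()}

well_mean = aggregate(cells, mean)      # what a mean default would produce
well_median = aggregate(cells, median)  # what a median default would produce

print(well_mean)
print(well_median)
```

The two profiles differ whenever the per-well distribution is skewed (well A01 here), so tools with different defaults produce different level 3 profiles from the same single-cell data unless the operation is set explicitly.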