Comparing Cytominer and Pycytominer Profiles #28
Important Note: The plates and features are visualized in the same order in all figures.
EDIT: These comparisons were originally made incorrectly. I was comparing cytominer DMSO normalization to pycytominer whole-plate normalization. See #28 (comment) for the correct comparisons. In the edit I also added a collapsible tag to these out-of-date figures.
Summary
Figure Legend: The top plot represents the complete summary within each plate. The bottom plot zooms in to absolute differences < 10. Each point represents the mean/median/sum of the absolute value differences between cytominer and pycytominer. I derived the metrics using every matrix entry per plate. There is one point per plate.
Interpretation
It is odd that a handful of plates sit around
Within Level Summaries
Interpretation
On the surface, it looks like there are insurmountable differences between the two tools. On closer inspection, however, the differences are not as large as they seem. There are indeed examples of extremely large differences, but only in a subset of the plates and a subset of the features. This view highlights the strange differences observed in a small subset of level 3 plates, and these plates continue to differ in the level 4a data. However, these differences appear to dissolve in level 4b, indicating that they are likely explained by a relatively small subset of specific features. Likewise, the handful of extremely large level 4a differences also appear largely resolved in the level 4b data. This observation further emphasizes the need to perform an independent round of feature selection when merging level 4a plates. It also raises the question of whether we need to develop different normalization strategies per feature, since the inherent differences in feature distributions lead to differential responses to a single normalization strategy.
Level 3 - Aggregated Profiles
Figure Legend: Each point represents a per feature average within a single plate. The top plot shows a per feature view, while the bottom plot shows a per plate view.
Level 4a - Aggregated and Normalized Profiles
Figure Legend: Each point represents a per feature average within a single plate. The top plot shows a per feature view, while the bottom plot shows a per plate view.
Level 4b - Aggregated and Normalized Profiles
Figure Legend: Each point represents a per feature average within a single plate. The top plot shows a per feature view, while the middle plot shows a per plate view. The bottom plot represents the distribution of feature differences in total. Most differences are very near zero.
Figure Legend: Feature selection summary. This figure shows which features were selected, by plate, by each of the two tools.
Recommendation
We knew at the outset that aligning the profiles derived from each tool was going to be an arduous task. We knew that some cytominer-derived profiles were normalized differently (#3 (comment)), which also implies that different plates could have been processed at different times with different methods. We do not know the extent to which different plates have been processed differently with cytominer. We do know that all plates have been uniformly processed with the same pycytominer pipeline. Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can remain under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.
The code to generate these results is provided in #29.
cc'ing @niranjchandrasekaran @shntnu and @AnneCarpenter - any feedback/suggestions/comments are welcome!
This represents an incredible amount of work on your part! 💯
These very likely correspond to the 10 plates listed here #3 (comment). When creating the notebook, do leave those plates out.
Can you test if it is related to m.a.d.? See the R documentation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mad.html
Ah – they were aggregated differently, not normalized differently
I'm on board with this plan! 💯
Confirmed that those are the 10 plates. Some features (not all) still do have a bit of a wobble.
Ah! I got it - makes sense.
Digging into this now. One thing I did quickly check is that both python and R implementations use the same scaling factor (constant = 1.4826). Three other things to note in the analysis log linked in #3 (comment).
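For reference, here is a minimal sketch of where that scaling factor enters a median absolute deviation calculation. The `mad` function below is a hypothetical helper written for illustration, not code from either tool; it only mirrors the formula `constant * median(|x - median(x)|)` described in the R documentation linked above.

```python
import numpy as np


def mad(values, constant=1.4826):
    """Scaled median absolute deviation (hypothetical helper for illustration)."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    # The constant rescales the raw MAD so that it estimates the standard
    # deviation when the data are normally distributed.
    return constant * np.median(np.abs(values - center))


# Made-up values: the median is 3 and the absolute deviations are
# [2, 1, 0, 1, 97], so the raw MAD is 1 and the outlier has little effect.
example_mad = mad([1, 2, 3, 4, 100])
print(example_mad)
```

If two implementations agree on this constant, any remaining discrepancy would have to come from which wells feed the median and the deviations, or from an earlier processing step.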
My apologies for the confusion (in case I introduced it), but we did do a whole-plate normalization. This was the script. There are two different sets of files on S3.
I think you are using For our notes: here are the two sets of files for a sample plate.
I can confirm this. See these two plots in the attached notebook
Notebook resolving_normalization_issue.nb.html.zip So if you use the Level 4a profiles from |
Looks like we are on to something here. Comparing DMSO normalized plates 👇 (after removing those pesky 10 nonuniform plates). Check out those axes! 👀
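To make the distinction concrete, here is a rough sketch of how whole-plate and DMSO-only normalization can diverge on the same wells. It assumes a median/m.a.d. robust scaling, which this thread hints at but does not spell out; `robust_normalize`, the `is_dmso` mask, and the values are all made up for illustration and are not taken from either pipeline.

```python
import numpy as np


def robust_normalize(features, reference_mask=None, constant=1.4826):
    """Center and scale each feature column using reference wells.

    features: array of shape (wells, features).
    reference_mask: boolean array selecting reference wells (for example
    DMSO wells); None means every well on the plate is used.
    Note: a feature with zero spread in the reference wells is not handled.
    """
    features = np.asarray(features, dtype=float)
    if reference_mask is None:
        reference = features
    else:
        reference = features[np.asarray(reference_mask, dtype=bool)]
    center = np.median(reference, axis=0)
    scale = constant * np.median(np.abs(reference - center), axis=0)
    return (features - center) / scale


# One feature measured in five wells; the first three are DMSO (made-up data)
plate = np.array([[1.0], [2.0], [3.0], [10.0], [20.0]])
is_dmso = np.array([True, True, True, False, False])

whole_plate = robust_normalize(plate)
dmso_only = robust_normalize(plate, reference_mask=is_dmso)
print(whole_plate.ravel())
print(dmso_only.ravel())
```

The two outputs use different centers and scales, so comparing a DMSO-normalized profile against a whole-plate-normalized one shows large differences even when the underlying aggregated data match.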
I am much more comfortable with these comparisons 👍 a couple of takeaways:
One additional thing I should do before merging is to compare the pycytominer feature-selected profiles (4b) to the cytominer normalized profiles (4a). This will tell us whether the pycytominer feature selection removes noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.
Agree
distributions of features at the aggregate level or single cell level?
Aggregate level, but it would probably work at single cell too, right? I haven't played with single cell yet, so I don't know these distributions.
Comparing Cytominer 4a to Pycytominer 4b
Results below. There are a couple of discrepancies, but not much more than we see in the 4b intersection.
closed by merging #29
In this issue, I will discuss results of step 3 outlined in #22 (comment)
Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results
Comparing Pycytominer and Cytominer Processing
We have previously processed all of the Drug Repurposing Hub Cell Painting Data using cytominer. Cytominer is an R-based tool for image-based profiling. In this repo, we reprocess the data with pycytominer. As the name suggests, pycytominer is a Python-based tool for image-based profiling.
We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.
We output several metrics comparing the two approaches
Metrics
In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.
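A minimal sketch of this calculation for a single plate, with the per-plate mean, median, and sum taken over every matrix entry. The function name `summarize_abs_diff`, the `Metadata_Well` index, and the feature names and values are assumptions made for illustration; the notebook's actual code is in the linked pull request.

```python
import numpy as np
import pandas as pd


def summarize_abs_diff(pycyto_df, cyto_df, feature_cols):
    """Summarize element-wise absolute differences for one plate.

    Both data frames are assumed to share the same index (one row per well).
    """
    # Subtraction aligns rows on the index and columns on the feature names
    abs_diff = (pycyto_df[feature_cols] - cyto_df[feature_cols]).abs()
    values = abs_diff.to_numpy().ravel()
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "sum": float(np.sum(values)),
    }


# Toy plate with two wells and two features (made-up values)
wells = pd.Index(["A01", "A02"], name="Metadata_Well")
pycyto = pd.DataFrame({"feat_a": [1.0, 2.0], "feat_b": [0.5, -1.0]}, index=wells)
cyto = pd.DataFrame({"feat_a": [1.5, 2.0], "feat_b": [0.5, 1.0]}, index=wells)

summary = summarize_abs_diff(pycyto, cyto, ["feat_a", "feat_b"])
print(summary)
```

Indexing both frames by well before subtracting avoids silently comparing mismatched rows when the two tools write wells in a different order.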
In addition, we confirm alignment of the following metadata columns:
Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
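One way such an alignment check could look, sketched with pandas. The column names `Metadata_Plate` and `Metadata_Well` and the helper `metadata_aligned` are hypothetical placeholders, not the columns this notebook actually checks.

```python
import pandas as pd


def metadata_aligned(pycyto_df, cyto_df, metadata_cols):
    """Report, per metadata column, whether the two profiles agree row by row."""
    left = pycyto_df[metadata_cols].reset_index(drop=True)
    right = cyto_df[metadata_cols].reset_index(drop=True)
    # Series.equals is False when the values, their order, or the lengths differ
    return {col: bool(left[col].equals(right[col])) for col in metadata_cols}


# Toy profiles (made-up values): wells match, plate identifiers do not
pycyto_meta = pd.DataFrame(
    {"Metadata_Plate": ["P1", "P1"], "Metadata_Well": ["A01", "A02"]}
)
cyto_meta = pd.DataFrame(
    {"Metadata_Plate": ["P1", "P2"], "Metadata_Well": ["A01", "A02"]}
)

alignment = metadata_aligned(pycyto_meta, cyto_meta, ["Metadata_Plate", "Metadata_Well"])
print(alignment)
```

Running a check like this before computing differences makes it clear whether a large discrepancy reflects the processing itself or simply misordered rows.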
Data Levels
Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.