Comparing cytominer- and pycytominer-derived profiles #29

gwaybio · 2020-05-10T18:09:35Z

The motivation, results, figures, and interpretation are provided in #28.

Removing the 10 plates for which cytominer used a different aggregation strategy. I also add the comparison of pycytominer 4b to cytominer 4a.

niranjchandrasekaran · 2020-05-13T23:02:27Z

comparison/util.py

+    return (pycyto_df, cyto_df)
+
+
+def generate_output_filenames(output_dir, level, metrics=["median", "mean", "sum"]):


Should the name of the function be changed, given that it is also used to generate the file names that are read as input in 1.summarize-cytominer-tool-differences.py?

great catch - yes, it should be updated. Will do in the next commit!

fixed in 4abbbf5
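For illustration, a helper that serves both the writing and the reading script could look roughly like the sketch below. This is not the code in comparison/util.py: the name `build_metric_filenames` and the file-name pattern are hypothetical, and the tuple default is used here only to avoid a shared mutable default argument.

```python
import pathlib


def build_metric_filenames(output_dir, level, metrics=("median", "mean", "sum")):
    """Map each metric to a file path, usable for writing and for reading back.

    The function name and the file-name pattern are hypothetical.
    """
    output_dir = pathlib.Path(output_dir)
    return {
        metric: output_dir / f"comparison_{level}_{metric}.tsv.gz"
        for metric in metrics
    }


# The same call can be made in the script that writes the files
# and in the script that later reads them.
files = build_metric_filenames("results", "level_4a")
print(files["median"])
```

Because the function only builds paths, a neutral name that does not mention "output" fits both call sites.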

niranjchandrasekaran · 2020-05-13T23:06:29Z

comparison/scripts/nbconverted/0.get-cytominer-tool-differences.py

+
+pycytominer_dir = pathlib.Path("../profiles/backend/")
+cytominer_dir = pathlib.Path(
+    "/Users/gway/work/projects/"


For the purpose of reproducibility would it be better to not hardcode the path?

Absolutely better, I agree. However, there are limitations to what we can do to improve the reproducibility of the current cytominer profile data. I think the best thing we can do now (to both minimize time and maximize reproducibility) is to add documentation explaining why we need to hardcode this path.

As it currently stands and to my knowledge, there is no way for anyone outside the lab (i.e. without access to the imaging-platform S3 bucket) to reproduce this script. There are many more things we need to do in order to enable this - all of which are outside the scope of this PR.

comment added in 4a2edbc

Agreed 👍. You may consider including instructions like this provided you think it will take <15 mins to do so (and use Path.home()).

> (and use Path.home()).

Oh, this is great - I will update to use this.

> You may consider including instructions like this

I think linking to a publicly available repo with these steps is a good idea. Otherwise, I think adding these instructions is too much detail. Also, there are many ways of updating this path for it to work. Lastly, there are checks inside this script that will throw errors if plates are not aligned - i.e. it would be tough to add an incorrect path here and have the script silently fail. It should fail loudly with an incorrect path!
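A check of the kind described here, one that fails loudly when the two sets of plates do not match, might be sketched as follows. The function name and the plate identifiers are hypothetical; the actual checks in the script may differ.

```python
def check_plates_aligned(pycyto_plates, cyto_plates):
    """Raise if the two tools do not cover exactly the same plates."""
    pycyto_plates, cyto_plates = set(pycyto_plates), set(cyto_plates)
    if pycyto_plates != cyto_plates:
        only_pycyto = sorted(pycyto_plates - cyto_plates)
        only_cyto = sorted(cyto_plates - pycyto_plates)
        raise ValueError(
            f"Plates are not aligned. pycytominer only: {only_pycyto}; "
            f"cytominer only: {only_cyto}"
        )


# Matching plate sets pass silently, regardless of order.
check_plates_aligned(["plate_a", "plate_b"], ["plate_b", "plate_a"])
```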

Path.home() used in 76d2db3

wooo! reduced hardcoding 🎉
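As a rough sketch of the pattern discussed in this thread: build the path from `Path.home()` and raise immediately if the directory is missing. The directory names below the home directory and the `require_dir` helper are hypothetical, not the ones used in the script.

```python
import pathlib

# Hypothetical layout below the home directory; adjust to the local setup.
cytominer_dir = pathlib.Path.home() / "data" / "cytominer_profiles"


def require_dir(path):
    """Return the path if it is an existing directory, otherwise raise."""
    if not path.is_dir():
        raise FileNotFoundError(f"Expected a profile directory at {path}")
    return path


# In a script, calling require_dir(cytominer_dir) before any data is read
# turns a wrong path into an immediate, explicit error.
```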

niranjchandrasekaran · 2020-05-13T23:10:23Z

comparison/scripts/nbconverted/0.get-cytominer-tool-differences.py

+    median_diff = abs_diff.median()
+    sum_diff = abs_diff.sum()
+
+    complete_mean_diff = mean_diff.replace([np.inf, -np.inf], np.nan).dropna().mean()


(This question comes from my lack of knowledge of the range of values of each feature.) Why are there np.inf values in the mean/median/sum values? Do pycyto_df or cyto_df contain np.inf values?

Thanks @niranjchandrasekaran - indeed, you've touched upon a pretty important difference between Python- and R-based processing. Solving this problem is beyond the scope of this PR, but it does likely explain many of the small floating point differences between cytominer and pycytominer profiles.

I outlined the issue in cytomining/pycytominer#79 - although I was not sure where it belongs since it permeates so many different codebases.

> Why are there np.inf values in the mean/median/sum values? Do pycyto_df or cyto_df contain np.inf values?

Neither of them contains np.inf values. Some features are entirely NA in cytominer, but these are recoded as 0 in pycytominer. The np.inf appears in those specific columns after subtraction and summarization.

So, pycyto_df.subtract(cyto_df).abs().mean() will result in np.inf for features in the case described above.
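To show what the cleanup in the reviewed line does, here is a minimal sketch on made-up numbers. The feature names and values are hypothetical, and the example does not attempt to reproduce how the non-finite values arise in the real profiles. It only shows the effect of replacing them before averaging.

```python
import numpy as np
import pandas as pd

# Hypothetical per-feature mean differences; one entry is non-finite.
mean_diff = pd.Series(
    [0.1, 0.3, np.inf, np.nan], index=["feat_a", "feat_b", "feat_c", "feat_d"]
)

# Same pattern as the reviewed line: treat +/-inf as missing,
# drop missing entries, then average what remains.
complete_mean_diff = mean_diff.replace([np.inf, -np.inf], np.nan).dropna().mean()

# Without the cleanup, the single non-finite entry dominates the summary.
raw_mean_diff = mean_diff.mean()

print(complete_mean_diff, raw_mean_diff)
```

The summary is then computed only over features with finite differences, so it describes the features that can actually be compared.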

niranjchandrasekaran

@gwaygenomics I have made a couple of minor comments/suggestions. Everything else looks good.

also nesting results and figures in batch directory

gwaybio · 2020-05-14T19:29:02Z

Thanks for the review @niranjchandrasekaran ! I believe I have addressed all of your comments in the subsequent commits. PR ready for second round of review - should be good to merge after you give the ok.

niranjchandrasekaran

@gwaygenomics everything looks good! Merge away!

gwaybio added 7 commits May 10, 2020 12:36

add notebook to investigate tool differences

caee572

track .gz files

7516b90

add helper functions in separate script

449967b

add comparison results

1ba8f22

add plotnine to environment

40ea2ac

adding summary of cytominer tool differences

f08ccc4

add comparison figures

551079c

gwaybio requested a review from niranjchandrasekaran May 10, 2020 18:09

gwaybio mentioned this pull request May 10, 2020

Comparing Cytominer and Pycytominer Profiles #28

Closed

gwaybio added 7 commits May 11, 2020 15:51

point level 4a and 4b pycytominer profiles to dmso normalization

e890125

update comparison results for dmso normalization

624b1c0

add data collection comparing cytominer 4a to pycytominer 4b

3a68e50

rerun results pipeline after adding 4a to 4b comparison

0b691d1

only visualize uniform plates

42c8889

Removing the 10 plates for which cytominer used a different aggregation strategy. I also add the comparison of pycytominer 4b to cytominer 4a.

add updated figures after DMSO normalization comparison

8f24bbd

add comparison of cytominer 4a to pycytominer 4b in util

4e6cb25

niranjchandrasekaran reviewed May 13, 2020


niranjchandrasekaran requested changes May 13, 2020


gwaybio added 6 commits May 14, 2020 09:44

improve function name

4abbbf5

also nesting results and figures in batch directory

add comment about hardcoded cytominer path

4a2edbc

move result files to nested batch folder

2254f91

use Path.home()

76d2db3

rerun figure generation with updated paths

840558c

move figures to batch nest

610dc59

niranjchandrasekaran approved these changes May 14, 2020


gwaybio merged commit ab7311e into broadinstitute:master May 14, 2020

gwaybio deleted the compare-tools branch May 14, 2020 21:30
