Comparing cytominer- and pycytominer-derived profiles #29

Merged: 20 commits into broadinstitute:master from compare-tools on May 14, 2020

Member

@gwaybio gwaybio commented May 10, 2020

The motivation, results, figures, and interpretation are provided in #28.

return (pycyto_df, cyto_df)


def generate_output_filenames(output_dir, level, metrics=["median", "mean", "sum"]):

Should the name of the function be changed, given that it is also used to generate the file names that are read as input in 1.summarize-cytominer-tool-differences.py?

Member Author

great catch - yes, it should be updated. Will do in the next commit!

Member Author

fixed in 4abbbf5
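
To illustrate the naming concern raised above, here is a minimal sketch of a helper whose name does not imply "output", since the same paths are written by one script and read by another. The function name `build_metric_file_paths` and the file naming scheme are hypothetical, not what the PR uses. The sketch also uses a tuple for the default `metrics` argument, which avoids sharing a mutable default list between calls.

```python
import pathlib


def build_metric_file_paths(base_dir, level, metrics=("median", "mean", "sum")):
    """Map each metric name to a file path under base_dir.

    The file naming scheme below is a made-up example for illustration.
    """
    base_dir = pathlib.Path(base_dir)
    return {
        metric: base_dir / f"comparison_result_{level}_{metric}.tsv.gz"
        for metric in metrics
    }


# The same call can serve the writing script and the reading script.
paths = build_metric_file_paths("results", "level_3")
```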


pycytominer_dir = pathlib.Path("../profiles/backend/")
cytominer_dir = pathlib.Path(
"/Users/gway/work/projects/"

For the purpose of reproducibility, would it be better not to hardcode the path?

Member Author

absolutely better, I agree. However, there are limits to what we can do to improve the reproducibility of the current cytominer profile data. I think the best thing we can do now (to both minimize time and maximize reproducibility) is to add documentation explaining why we need to hardcode this path.

As it currently stands and to my knowledge, there is no way for anyone outside the lab (i.e. without access to the imaging-platform S3 bucket) to reproduce this script. There are many more things we need to do in order to enable this - all of which are outside the scope of this PR.

Member Author

comment added in 4a2edbc

Collaborator

@shntnu shntnu May 14, 2020

Agreed 👍. You may consider including instructions like this provided you think it will take <15 mins to do so (and use Path.home()).

Member Author

@gwaybio gwaybio May 14, 2020

> (and use Path.home()).

Oh, this is great - I will update to use this.

> You may consider including instructions like this

I think linking to a publicly available repo with these steps is a good idea. Otherwise, I think adding these instructions is too much detail. Also, there are many ways of updating this path for it to work. Lastly, there are checks inside this script that will throw errors if plates are not aligned - i.e. it would be tough to add an incorrect path here and have the script silently fail. It should fail loudly with an incorrect path!

Member Author

Path.home() used in 76d2db3

wooo! reduced hardcoding 🎉
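
The two ideas in this thread, building the path from `Path.home()` and failing loudly on a wrong path, can be sketched as follows. This is an illustration only, not the code from 76d2db3: the helper name `resolve_profile_dir` and the subdirectory names in the usage comment are hypothetical, and the real script relies on its own plate-alignment checks rather than this one.

```python
import pathlib


def resolve_profile_dir(*parts, base=None):
    """Join parts onto a base directory (the user's home by default).

    Raises FileNotFoundError if the resulting directory does not exist,
    so a wrong path fails loudly instead of silently.
    """
    base = pathlib.Path.home() if base is None else pathlib.Path(base)
    profile_dir = base.joinpath(*parts)
    if not profile_dir.is_dir():
        raise FileNotFoundError(f"Expected profile directory not found: {profile_dir}")
    return profile_dir


# Hypothetical usage; these subdirectory names are placeholders:
# cytominer_dir = resolve_profile_dir("work", "projects", "some-profile-folder")
```

Because the home directory is resolved at run time, the same script works for any user who keeps the data in the same place relative to their home directory.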

median_diff = abs_diff.median()
sum_diff = abs_diff.sum()

complete_mean_diff = mean_diff.replace([np.inf, -np.inf], np.nan).dropna().mean()

(This question comes from my lack of knowledge about the range of values of each feature.) Why are there np.inf values in the mean/median/sum values? Do pycyto_df or cyto_df contain np.inf values?

Member Author

Thanks @niranjchandrasekaran - indeed, you've touched upon a pretty important difference between Python- and R-based processing. Solving this problem is beyond the scope of this PR, but it likely explains many of the small floating point differences between cytominer and pycytominer profiles.

I outlined the issue in cytomining/pycytominer#79 - although I was not sure where it belongs since it permeates so many different codebases.

Member Author

> Why are there np.inf values in the mean/median/sum values? Do pycyto_df or cyto_df contain np.inf values?

Neither of them contains np.inf values. Some features are entirely NA in cytominer, but these are recoded as 0 in pycytominer. The np.inf appears in those specific columns after subtraction and summarization.

So, pycyto_df.subtract(cyto_df).abs().mean() will result in np.inf for features in the case described above.
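
A toy illustration of the cleanup step quoted in the diff, `mean_diff.replace([np.inf, -np.inf], np.nan).dropna().mean()`. The feature names and values are made up, and the inf entries are injected by hand rather than derived from real profiles, so this shows only how the cleanup behaves, not how the inf values arise.

```python
import numpy as np
import pandas as pd

# Made-up per-feature summary values; inf and NaN entries are injected by hand.
mean_diff = pd.Series(
    [0.1, np.inf, 0.3, -np.inf, np.nan],
    index=["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"],
)

# Recode +/-inf as NaN, then drop every NaN so only finite features remain.
finite_only = mean_diff.replace([np.inf, -np.inf], np.nan).dropna()

# Average over the finite features only (feature_a and feature_c here).
complete_mean_diff = finite_only.mean()
```

Note that the non-finite features are excluded from the average rather than counted as zero, so the summary describes only the features that could be compared.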

Member

@niranjchandrasekaran niranjchandrasekaran left a comment

@gwaygenomics I have made a couple of minor comments/suggestions. Everything else looks good.

Member Author

gwaybio commented May 14, 2020

Thanks for the review @niranjchandrasekaran ! I believe I have addressed all of your comments in the subsequent commits. PR ready for second round of review - should be good to merge after you give the ok.

Member

@niranjchandrasekaran niranjchandrasekaran left a comment

@gwaygenomics everything looks good! Merge away!

@gwaybio gwaybio merged commit ab7311e into broadinstitute:master May 14, 2020
@gwaybio gwaybio deleted the compare-tools branch May 14, 2020 21:30