-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparing cytominer- and pycytominer-derived profiles #29
Conversation
removing the 10 plates that cytominer used a different aggregation strategy. I also add the comparison to pycytominer 4b to cytominer 4a
comparison/util.py
Outdated
return (pycyto_df, cyto_df) | ||
|
||
|
||
def generate_output_filenames(output_dir, level, metrics=["median", "mean", "sum"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the name of the function be changed given that it is also used to generate file names are read as input in 1.summarize-cytominer-tool-differences.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch - yes, it should be updated. Will do in the next commit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 4abbbf5
|
||
pycytominer_dir = pathlib.Path("../profiles/backend/") | ||
cytominer_dir = pathlib.Path( | ||
"/Users/gway/work/projects/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the purpose of reproducibility would it be better to not hardcode the path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolutely better, I agree. However, there are limitations to what we can do to improve the current cytominer profile data reproducibility. I think the best thing that we can do now (to both minimize time and maximize reproducibility) is to add documentation to why we need to hardcode this path.
As it currently stands and to my knowledge, there is no way for anyone outside the lab (i.e. without access to the imaging-platform S3 bucket) to reproduce this script. There are many more things we need to do in order to enable this - all of which are outside the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment added in 4a2edbc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed 👍. You may consider including instructions like this provided you think it will take <15 mins to do so (and use Path.home()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and use Path.home()).
Oh, this is great - I will update to use this.
You may consider including instructions like this
I think linking to a publicly available repo with these steps is a good idea. Otherwise, I think adding these instructions is too much detail. Also, there are many ways of updating this path for it to work. Lastly, there are checks inside this script that will throw errors if plates are not aligned - i.e. it would be tough to add an incorrect path here and have the script silently fail. It should fail loudly with an incorrect path!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path.home()
used in 76d2db3
wooo! reduced hardcoding 🎉
median_diff = abs_diff.median() | ||
sum_diff = abs_diff.sum() | ||
|
||
complete_mean_diff = mean_diff.replace([np.inf, -np.inf], np.nan).dropna().mean() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This question comes from my lack of knowledge of what are the range of values of each feature) Why are there np.inf
values in the mean/median/sum values? Do pycto_df
or cyto_df
contain np.inf
values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @niranjchandrasekaran - indeed, you've touched upon a pretty important difference between python and R-based processing. Solving this problem is beyond scope of this PR, but it does likely explain many of the small floating point differences between cytominer and pycytominer profiles.
I outlined the issue in cytomining/pycytominer#79 - although I was not sure where it belongs since it permeates so many different codebases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are there np.inf values in the mean/median/sum values? Do pycto_df or cyto_df contain np.inf values?
Neither of them contain np.inf
values. Some features are entirely NA in cytominer, but this is recoded as 0 in pycytominer. The np.inf
happens in this specific column after subtraction and summarization.
so, pycyto_df.subtract(cyto_df).abs().mean()
will result in np.inf
for features with that case described above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gwaygenomics I have made a couple of minor comments/suggestions. Everything else looks good.
also nesting results and figures in batch directory
Thanks for the review @niranjchandrasekaran ! I believe I have addressed all of your comments in the subsequent commits. PR ready for second round of review - should be good to merge after you give the ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gwaygenomics everything looks good! Merge away!
The motivation, results, figures, interpretation, and figures are provided in #28