-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add collate.py #160
Add collate.py #160
Conversation
Codecov Report
@@ Coverage Diff @@
## master #160 +/- ##
==========================================
- Coverage 98.04% 95.71% -2.33%
==========================================
Files 50 53 +3
Lines 2403 2593 +190
==========================================
+ Hits 2356 2482 +126
- Misses 47 111 +64
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Notes from Slack below Gregory Way Today at 6:14 AM 18 replies Gregory Way 20 minutes ago Gregory Way 18 minutes ago Gregory Way 17 minutes ago Gregory Way 16 minutes ago Gregory Way 15 minutes ago Shantanu Singh 10 minutes ago Shantanu Singh 9 minutes ago Beth Cimini Shantanu Singh 8 minutes ago Gregory Way 7 minutes ago Gregory Way 7 minutes ago Shantanu Singh 7 minutes ago cytomining.github.io Shantanu Singh 7 minutes ago Shantanu Singh 3 minutes ago Gregory Way 3 minutes ago Shantanu Singh 3 minutes ago Gregory Way 2 minutes ago Shantanu Singh < 1 minute ago pycytominer/pycytominer/cyto_utils/collate.py Lines 141 to 157 in a2fa463
collate.py google/python-fire Shantanu Singh < 1 minute ago |
Additional insight- cytomining/profiling-handbook#59 (comment) |
@bethac07 do you have any thoughts on whether this ( |
So all this needs is tests. I can imagine a couple of things
@niranjchandrasekaran can comment on the value of the last one. I can write some dumb tests quickly, but presumably we want non-dumb tests. |
After a quick skim, I think it’s fine for this to live here in this repo. Greg’s concern was that this command line functionality would be a break from the API but I think that alone can be yanked into a separate repo if that bothers us too much.
|
I can help filter
I think this is the major deciding factor and @niranjchandrasekaran is the best positioned to decide if this is practical. If it is, then this seems like the best solution to me. If not, then we can exclude the following two options below right away because I think it is fine for
The next q is, assuming it lives in HOWEVER, if that ends up being a blocker (no one has the capacity) we can bump the decision to the BDFL of this package to decide if we declare cyto_utils to be a wild-west :) The only concern I have is that it might set a bad precedent that will eventually drive down code coverage
PS – If we do write a test, it is perfectly ok to test only for |
I think merging |
Finally got around to writing some tests; I also tracked down why adding image features outside of the recipe wasn't working. I know the thought was that this will eventually move to the recipe, and I still think there's a reasonable argument that it should, but I also think that it's not unreasonable that someone using pycytominer outside of the recipe may want to concatenate data, plus it may be a while until we get around to adding parallelization to the recipe. Since the CLI aspect was the part with concerns, I've separated that out; once we have a nice way to run this in the recipe, that file can just be deleted if need be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - I am indeed less concerned about merging this code in with tests. Although I will note that the test coverage drops about 5% (~50% test coverage of collate) - do you feel comfortable moving forward with this coverage?
I've made several specific comments and suggestions in line, that we should also discuss before merging.
I'll also outline some broad strokes comments below:
- The
collate
function makes several SQLite commands. I wonder if it would be easier to read/maintain the script if you abstract these calls to some sort of a SQLite command builder (like you have already withrun_check_errors
) - We should consider removing
cytominer-database
from a required dependency. I am just too shakey when it comes to this code base. Then again, its been stable for several years and hasn't broken... but on the flip side, adding it to the requirements explicitly introduces a versioned bond that might prevent (or stall) improvements to pycytominer.- In the same vein, do we need to specifically mention the
sqlite
dependency somewhere?
- In the same vein, do we need to specifically mention the
Vision
Once we're happy with the PR, I'm happy to merge this contribution into pycytominer. It's a solid short-term solution that works, and is practical given our current limitations (funding, long-term software maintenance support, etc.)
As @niranjchandrasekaran has previously mentioned, this code actually belongs in some sort of profiling recipe. This remains our long-term goal.
Another important note is that we may want to release a pycytominer version 0.1.5 prior to merging this contribution (this contribution can be pycytominer version 0.2), just to make sure pycytominer is completely up-to-date functionally, in an intermediate, but stable version.
I'm definitely interested in your thoughts on all aspects of this. Thanks for the PR!
I'm not sure which version you were looking at where you saw 5% code coverage drop, but the current one says it's about a 2.3% drop. I am personally fine with the amount of test coverage; I'd like to in a perfect world be able to test the aws download functionality but because we sync whole plates down at a time we would need to host the 4 test data sites somewhere public as a "pseudo-plate" which I think is probably overkill. I think all other comments here are addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bethac07 !
I made a couple more comments below. Let's resolve these before merge.
One last thing: What do you think about?
Another important note is that we may want to release a pycytominer version 0.1.5 prior to merging this contribution (this contribution can be pycytominer version 0.2), just to make sure pycytominer is completely up-to-date functionally, in an intermediate, but stable version.
I'll also loop in @d33bs about this decision point. Dave, does this release schedule sound like a reasonable strategy? (Beth, note that @d33bs's #203 PR is to a develop
branch, but his #197 is to master
.)
If the release strategy looks kosher, then we'll likely want to:
- merge Improve Memory Performance Within merge_single_cells #197
- Release v0.1.5
- merge this PR
- Release v0.2
- Work on
develop
for future releases
@niranjchandrasekaran - would you like to take a peek at this prior to merge? It's not necessary since we only require one maintainer approval, but I thought since you work with the profiling recipe more often than I do these days, that you might have some unique insights. |
This PR really doesn't touch any of the rest of pycytominer, so I don't think you necessarily need to version-pull-version (and I'm not actually sure that just the addition of the functionality is a full-blow 0.X.0 version bump), but your repo, your rules :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a brief look at the PR and every looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some minor edits, which i will commit now
- one-sentence-per-line markdown convention
- fixing one or two spacing and punctuation typos
Let's follow #207 prior to merging (thanks Niranj for the quick scan!)
merging now! 🎉 |
Updated instructions for when collating is pulled
Description
Add collate.py, which will run cytominer-database, database file indexing, and aggregation per our standard workflow.
This replaces collate.R in the previous cytominer_scripts repository.
Collation is added as a callable function from within pycytominer, but to facilitate running multiple plates in parallel with GNU-parallel, a command line interface is also provided. This is necessary since this step can take 6-18 hours for an individual plate and we often want to run 20+ plates at a time.
See also cytomining/profiling-handbook#59 (comment)
Commits should be squashed before merging.
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.