Add collate.py #160

Merged
gwaybio merged 23 commits into master from jump on Jun 17, 2022

Conversation

shntnu
Member

@shntnu shntnu commented Jul 29, 2021

Description

Add collate.py, which will run cytominer-database, database file indexing, and aggregation per our standard workflow.
This replaces collate.R in the previous cytominer_scripts repository.
Collation is added as a callable function from within pycytominer, but to facilitate running multiple plates in parallel with GNU-parallel, a command line interface is also provided. This is necessary since this step can take 6-18 hours for an individual plate and we often want to run 20+ plates at a time.
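
The per-plate parallelism described above can also be sketched in Python. This is only an illustration, not the PR's implementation: the PR relies on GNU parallel invoking a command line once per plate, while `collate_plate` below is a hypothetical stand-in for the real per-plate work and the batch and plate names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def collate_plate(batch, plate):
    """Hypothetical stand-in for the per-plate collation step.

    The real step can take hours per plate; here it only builds a label.
    """
    return f"{batch}/{plate}.sqlite"


def collate_batch(batch, plates, max_workers=4):
    """Run the per-plate step for many plates concurrently.

    Results come back in the same order as `plates`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda plate: collate_plate(batch, plate), plates))


if __name__ == "__main__":
    print(collate_batch("example_batch", ["plate_A", "plate_B", "plate_C"]))
```

Threads are used here only to keep the sketch short. Whether threads, processes, or an external tool such as GNU parallel is the better fit depends on how the real per-plate step spends its time.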

See also cytomining/profiling-handbook#59 (comment)

Commits should be squashed before merging.

What is the nature of your change?

  • Enhancement (adds functionality).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@codecov-commenter

codecov-commenter commented Jul 29, 2021

Codecov Report

Merging #160 (ead2547) into master (02d0522) will decrease coverage by 2.32%.
The diff coverage is 66.31%.

@@            Coverage Diff             @@
##           master     #160      +/-   ##
==========================================
- Coverage   98.04%   95.71%   -2.33%     
==========================================
  Files          50       53       +3     
  Lines        2403     2593     +190     
==========================================
+ Hits         2356     2482     +126     
- Misses         47      111      +64     
Flag Coverage Δ
unittests 95.71% <66.31%> (-2.33%) ⬇️
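
The percentages in the report can be re-derived from the hit and line counts it lists. The sketch below uses only those four numbers; the helper name `coverage_percent` is made up for illustration. The exact drop works out to roughly 2.32 points, which matches the summary sentence. The 2.33 in the table is consistent with subtracting the two displayed values (98.04 and 95.71), though that reading of how the report formats its numbers is an inference, not something the report states.

```python
def coverage_percent(hits, lines):
    """Return line coverage as a percentage."""
    return 100.0 * hits / lines


# Counts taken from the coverage diff above.
before = coverage_percent(2356, 2403)  # master
after = coverage_percent(2482, 2593)   # this PR
change = after - before

if __name__ == "__main__":
    print(f"before={before:.4f}% after={after:.4f}% change={change:.4f}")
```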

Impacted Files Coverage Δ
pycytominer/cyto_utils/collate_cmd.py 0.00% <0.00%> (ø)
setup.py 0.00% <ø> (ø)
pycytominer/cyto_utils/collate.py 54.73% <54.73%> (ø)
pycytominer/tests/test_cyto_utils/test_util.py 96.35% <90.00%> (-0.51%) ⬇️
pycytominer/cyto_utils/__init__.py 100.00% <100.00%> (ø)
pycytominer/cyto_utils/util.py 98.86% <100.00%> (+0.05%) ⬆️
pycytominer/tests/test_cyto_utils/test_collate.py 100.00% <100.00%> (ø)

@shntnu shntnu changed the title from "Jump" to "Add collate.py" Jul 29, 2021
@shntnu
Member Author

shntnu commented Jul 29, 2021

Notes from Slack below

Gregory Way Today at 6:14 AM
this is frightening to me 🙀
image.png
image.png

18 replies

Gregory Way 20 minutes ago
well, maybe starting to frighten

Gregory Way 18 minutes ago
I think JUMP is good to drive development (very good!), the fear is driving too fast without seatbelts and not considering other cars on the road that might benefit from various improvements

Gregory Way 17 minutes ago
but maybe the rules of the road aren't conducive to these kinds of equitable and safe improvements?

Gregory Way 16 minutes ago
if so, then we need to figure out a plan to make sure pycytominer has a sustainable future 🙂

Gregory Way 15 minutes ago
it's also possible JUMP has a merge plan - if so, then my fears can easily be calmed!

Shantanu Singh 10 minutes ago
It's a single file (mostly) https://github.com/cytomining/pycytominer/pull/160/files

Shantanu Singh 9 minutes ago
https://broadinstitute.slack.com/archives/C3QFQ3WQM/p1624420346066300

Beth Cimini
I'm having what I'm sure is probably a dumb pycytominer error- I'm trying to add the aggregation steps to collate.py so that we have a cleaner drop in replacement for the old collate.R . Issue in thread.
Thread in ip-profiling | Jun 22nd | View message

Shantanu Singh 8 minutes ago
^^ Copied that chat because that's some (but not all) of the context for the contrib
👍
1

Gregory Way 7 minutes ago
cool, ok, less worried 🙂

Gregory Way 7 minutes ago
i trust that means there is also a merge plan

Shantanu Singh 7 minutes ago
Essentially, the idea is to replace collate.R and create these new instructions here https://cytomining.github.io/profiling-handbook/create-profiles.html#create-database-backend

cytomining.github.io
Chapter 5 Create Profiles | Image-based Profiling Handbook
This is a handbook for processing image-based profiling datasets using CellProfiler and pycytominer

Shantanu Singh 7 minutes ago
i.e.
python3 pycytominer/cyto_utils/collate.py ${BATCH_ID} pycytominer/cyto_utils/ingest_config.ini {1} \

Shantanu Singh 3 minutes ago
Regarding the merge plan – it's quite possible we will move it out of there and into the future cytominer-database replacement because it doesn't quite fit into the pycytominer framework... but maybe it does fit in, given that pycytominer is kinda monolithic. Hm

Gregory Way 3 minutes ago
gotcha! This seems to be a pretty big API change too. Pycytominer hasn't supported command line interaction in the past (which is fine to introduce, it just complicates things)

Shantanu Singh 3 minutes ago
Exactly

Gregory Way 2 minutes ago
cool, glad to know there's a plan!

Shantanu Singh < 1 minute ago
Although the command line part can be addressed easily – we needn't have

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Collate CSVs')
    parser.add_argument('batch', help='Batch name to process')
    parser.add_argument('config', help='config file to pass to cytominer-database')
    parser.add_argument('plate', help='Plate name to process')
    parser.add_argument('--base', '--base-directory', dest='base_directory', default='../..', help='Base directory where the CSV files will be located')
    parser.add_argument('--column', default=None, help='An existing column to be explicitly copied to a Metadata_Plate column if Metadata_Plate was not set')
    parser.add_argument('--munge', action='store_true', default=False, help='Whether munge should be passed to cytominer-database, if True will break a single object CSV down by objects')
    parser.add_argument('--pipeline', default='analysis', help='A string used in path creation')
    parser.add_argument('--remote', default=None, help='A remote AWS directory, if set CSV files will be synced down from at the beginning and to which SQLite files will be synced up at the end of the run')
    parser.add_argument('--temp', default='/tmp', help='The temporary directory to be used by cytominer-databases for output')
    parser.add_argument('--overwrite', action='store_true', default=False, help='Whether or not to overwrite an sqlite that exists in the temporary directory if it already exists')
    args = parser.parse_args()
    collate(args.batch, args.config, args.plate, base_directory=args.base_directory, column=args.column, munge=args.munge, pipeline=args.pipeline, remote=args.remote, temp=args.temp, overwrite=args.overwrite)
in the code, and instead move that out into a standalone using https://github.com/google/python-fire

Shantanu Singh < 1 minute ago
I'll copy these comments to the PR so we have notes there
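
The idea raised at the end of the thread, deriving the command line from the function instead of hand-writing an argparse block, can be sketched with the standard library alone. This is not python-fire itself (with that library the equivalent is a call to `fire.Fire` on the function), and `collate_stub` is a hypothetical stand-in whose parameters are a subset of those shown in the argparse block above.

```python
import argparse
import inspect


def cli_from_function(func):
    """Build an argparse parser from a function's signature.

    Parameters without defaults become positional arguments, boolean
    defaults become store_true flags, and all other defaults become options.
    """
    parser = argparse.ArgumentParser(description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            parser.add_argument(name)
        elif isinstance(param.default, bool):
            parser.add_argument(f"--{name}", action="store_true", default=param.default)
        else:
            parser.add_argument(f"--{name}", default=param.default)
    return parser


def run(func, argv):
    """Parse argv against the derived parser and call the function."""
    args = cli_from_function(func).parse_args(argv)
    return func(**vars(args))


def collate_stub(batch, config, plate, base_directory="../..", munge=False, temp="/tmp"):
    """Hypothetical stand-in that echoes the arguments it received."""
    return {
        "batch": batch,
        "config": config,
        "plate": plate,
        "base_directory": base_directory,
        "munge": munge,
        "temp": temp,
    }


if __name__ == "__main__":
    print(run(collate_stub, ["example_batch", "ingest_config.ini", "plate_A", "--munge"]))
```

The point of this design is that the function stays free of command-line concerns, so the wrapper can live in a separate file or repository and be removed without touching the function.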

@bethac07
Member

Additional insight- cytomining/profiling-handbook#59 (comment)

@shntnu
Member Author

shntnu commented Mar 31, 2022

@bethac07 do you have any thoughts on whether this (collate) should exist as a separate tool, or is it worth wrapping up this PR? I know you are booked out but wondering what crumbs should be left on this PR for anyone who has the capacity to work on this

@bethac07
Member

bethac07 commented Mar 31, 2022

So all this needs is tests. I can imagine a couple of things

  1. We decide we can live without tests, since cyto_utils is a bit more wild-west, and then we rebase and pull
  2. I or somebody else writes some tests, and then we rebase and pull
  3. It moves to its own repo, which seems a bit over the top but fine, whatever, I don't actually care
  4. It moves somewhere else - cytominer-database repo? I don't love c-d specifically because it's got all that not-quite-working parquet stuff in it (this version uses the non-parquet version) but I can't think of anywhere else appropriate
  5. We work on adding parallelization to the recipe, which is essentially the only problem that this solves (backends take 8-12 hours apiece and the recipe doesn't currently support parallelization, so for, e.g., 20 plates we'd much rather have this in parallel than in series), and then we move anything useful (like auto-file-downloads) to the recipe repo and close this without merging.

@niranjchandrasekaran can comment on the value of the last one. I can write some dumb tests quickly, but presumably we want non-dumb tests.

@shntnu
Member Author

shntnu commented Mar 31, 2022 via email

@shntnu
Member Author

shntnu commented Apr 1, 2022

I can help filter

  5. We work on adding parallelization to the recipe, which is essentially the only problem that this solves (backends take 8-12 hours apiece and the recipe doesn't currently support parallelization, so for, e.g., 20 plates we'd much rather have this in parallel than in series), and then we move anything useful (like auto-file-downloads) to the recipe repo and close this without merging.

I think this is the major deciding factor and @niranjchandrasekaran is the best positioned to decide if this is practical. If it is, then this seems like the best solution to me.

If not, then we can exclude the following two options right away, because I think it is fine for collate to live in pycytominer. Greg’s concern was that this command-line functionality would be a break from the API, but I think that alone can be yanked into a separate repo if it bothers us too much. Or, we can delete the command-line functionality and use https://github.com/google/python-fire to create a command-line tool (and do this bit in the recipe), if @bethac07 thinks that is a fine way to go in general in such cases (we have a function for which we want to create a command-line tool).

  3. It moves to its own repo
  4. It moves somewhere else - cytominer-database repo?

The next q is, assuming it lives in pycytominer, does it really need to have a test? I think yes, so I'd go with 2. I say this because SQLite is often the point where things fail; testing that the index exists in the SQLite file would be a very valuable test (admittedly run_check_errors in the code already goes some distance to flag failure)

HOWEVER, if that ends up being a blocker (no one has the capacity) we can bump the decision to the BDFL of this package to decide if we declare cyto_utils to be a wild-west :) The only concern I have is that it might set a bad precedent that will eventually drive down code coverage

  1. We decide we can live without tests, since cyto_utils is a bit more wild-west, and then we rebase and pull
  2. I or somebody else writes some tests, and then we rebase and pull

PS – If we do write a test, it is perfectly ok to test only for remote=None
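
The kind of test suggested above, checking that an index exists in the SQLite file, might look like the following minimal sketch. The table name, column names, and index name are hypothetical and chosen only for illustration; the query against `sqlite_master` is standard SQLite. An in-memory database stands in for the file that collation would produce.

```python
import sqlite3


def index_names(connection):
    """Return the names of all indexes recorded in the SQLite schema table."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    ).fetchall()
    return {row[0] for row in rows}


def build_example_database():
    """Create an in-memory database with one hypothetical table and index."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE Image (TableNumber INTEGER, ImageNumber INTEGER)")
    connection.execute(
        "CREATE INDEX IF NOT EXISTS table_image_idx ON Image (TableNumber, ImageNumber)"
    )
    return connection


if __name__ == "__main__":
    print(index_names(build_example_database()))
```

A real test would open the SQLite file written by the collation step instead of building a database by hand, and would assert on whatever index names that step is expected to create.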

@niranjchandrasekaran
Member

I think merging collate.py and the recipe is what we should do. The recipe will benefit from collate.py's parallelization and I think it fits better in the recipe repo. Some more context in cytomining/profiling-recipe#30 (look for "Combining collate.py and the recipe"). We can continue the discussion there.

@bethac07 bethac07 marked this pull request as ready for review May 23, 2022 02:32
@bethac07
Member

bethac07 commented May 23, 2022

Finally got around to writing some tests; I also tracked down why adding image features outside of the recipe wasn't working.

I know the thought was that this will eventually move to the recipe, and I still think there's a reasonable argument that it should, but I also think that it's not unreasonable that someone using pycytominer outside of the recipe may want to concatenate data, plus it may be a while until we get around to adding parallelization to the recipe. Since the CLI aspect was the part with concerns, I've separated that out; once we have a nice way to run this in the recipe, that file can just be deleted if need be.

Member

@gwaybio gwaybio left a comment

Looks great - I am indeed less concerned about merging this code in with tests. Although I will note that the test coverage drops about 5% (~50% test coverage of collate) - do you feel comfortable moving forward with this coverage?

I've made several specific comments and suggestions in line, that we should also discuss before merging.

I'll also outline some broad strokes comments below:

  • The collate function makes several SQLite commands. I wonder if it would be easier to read/maintain the script if you abstract these calls to some sort of a SQLite command builder (like you have already with run_check_errors)
  • We should consider removing cytominer-database as a required dependency. I am just too shaky when it comes to this code base. Then again, it's been stable for several years and hasn't broken... but on the flip side, adding it to the requirements explicitly introduces a versioned bond that might prevent (or stall) improvements to pycytominer.
    • In the same vein, do we need to specifically mention the sqlite dependency somewhere?
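
The "SQLite command builder" idea in the first bullet could take a shape like the sketch below. It is an assumption-laden illustration, not the PR's code: `create_index_sql` and `run_sql_steps` are hypothetical helpers, the table and index names are made up, and the sketch uses Python's `sqlite3` module regardless of how the PR itself issues its SQLite commands.

```python
import sqlite3


def create_index_sql(index_name, table, columns):
    """Build a CREATE INDEX statement from trusted, caller-supplied identifiers."""
    column_list = ", ".join(columns)
    return f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} ({column_list});"


def run_sql_steps(connection, statements):
    """Execute statements in order, naming the failing statement on error."""
    for statement in statements:
        try:
            connection.execute(statement)
        except sqlite3.Error as error:
            raise RuntimeError(f"SQLite step failed: {statement}") from error
    connection.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    steps = [
        "CREATE TABLE Cells (TableNumber INTEGER, ImageNumber INTEGER);",
        create_index_sql("table_image_cells_idx", "Cells", ["TableNumber", "ImageNumber"]),
    ]
    run_sql_steps(connection, steps)
    print("all steps succeeded")
```

Collecting the statements in one list keeps the sequence of database operations readable in one place, and the single runner gives every step the same error reporting. Identifiers are interpolated directly into the SQL here, so this only suits names that the code controls.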

Vision

Once we're happy with the PR, I'm happy to merge this contribution into pycytominer. It's a solid short-term solution that works, and is practical given our current limitations (funding, long-term software maintenance support, etc.)

As @niranjchandrasekaran has previously mentioned, this code actually belongs in some sort of profiling recipe. This remains our long-term goal.

Another important note is that we may want to release a pycytominer version 0.1.5 prior to merging this contribution (this contribution can be pycytominer version 0.2), just to make sure pycytominer is completely up-to-date functionally, in an intermediate, but stable version.

I'm definitely interested in your thoughts on all aspects of this. Thanks for the PR!

@bethac07
Member

bethac07 commented Jun 8, 2022

I'm not sure which version you were looking at where you saw a 5% code coverage drop, but the current one says it's about a 2.3% drop. I am personally fine with the amount of test coverage. In a perfect world I'd like to be able to test the AWS download functionality, but because we sync whole plates down at a time, we would need to host the 4 test data sites somewhere public as a "pseudo-plate", which I think is probably overkill.

I think all other comments here are addressed.

Member

@gwaybio gwaybio left a comment

Thanks @bethac07 !

I made a couple more comments below. Let's resolve these before merge.

One last thing: what do you think about this?

Another important note is that we may want to release a pycytominer version 0.1.5 prior to merging this contribution (this contribution can be pycytominer version 0.2), just to make sure pycytominer is completely up-to-date functionally, in an intermediate, but stable version.

I'll also loop in @d33bs about this decision point. Dave, does this release schedule sound like a reasonable strategy? (Beth, note that @d33bs's #203 PR is to a develop branch, but his #197 is to master.)

If the release strategy looks kosher, then we'll likely want to:

  1. merge #197 (Improve Memory Performance Within merge_single_cells)
  2. Release v0.1.5
  3. merge this PR
  4. Release v0.2
  5. Work on develop for future releases

@gwaybio
Member

gwaybio commented Jun 8, 2022

@niranjchandrasekaran - would you like to take a peek at this prior to merge? It's not necessary since we only require one maintainer approval, but I thought since you work with the profiling recipe more often than I do these days, that you might have some unique insights.

@bethac07
Member

bethac07 commented Jun 8, 2022

  1. merge #197
  2. Release v0.1.5
  3. merge this PR
  4. Release v0.2
  5. Work on develop for future releases

This PR really doesn't touch any of the rest of pycytominer, so I don't think you necessarily need to version-pull-version (and I'm not actually sure that just the addition of the functionality is a full-blown 0.X.0 version bump), but your repo, your rules :)

Member

@niranjchandrasekaran niranjchandrasekaran left a comment

Took a brief look at the PR and everything looks good to me!

Member

@gwaybio gwaybio left a comment

I made some minor edits, which I will commit now

  1. one-sentence-per-line markdown convention
  2. fixing one or two spacing and punctuation typos

Let's follow #207 prior to merging (thanks Niranj for the quick scan!)

@gwaybio
Member

gwaybio commented Jun 17, 2022

merging now! 🎉

@gwaybio gwaybio merged commit f8ce3b4 into master Jun 17, 2022
@gwaybio gwaybio deleted the jump branch June 17, 2022 22:16
@shntnu shntnu restored the jump branch October 18, 2022 20:07
@kenibrewer kenibrewer deleted the jump branch November 7, 2023 13:31
shntnu added a commit to cytomining/profiling-handbook that referenced this pull request Mar 21, 2024
Updated instructions for when collating is pulled