Adding profiles to dvc #66
Comments
The bucket reference here is the one I had in mind: https://registry.opendata.aws/cell-painting-image-collection/ I will go pull up stuff now.
Hm – the only trouble is that the bucket is called To unblock you, I'd suggest we go ahead with depositing it at There's some chance we may need to change that, but at least it will keep you moving. IIUC the change is not too hard and you'd only need to modify the URL. The file pointer will remain the same as long as we keep the relative paths the same and don't modify the file (otherwise the md5 will change).
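To illustrate why the pointer survives a change of bucket URL but not a change to the file: the pointer records a hash of the file's contents, so it depends on the bytes and not on where they are stored. Below is a minimal sketch of that idea using plain `hashlib`. It is not DVC's own code, and the file name and contents are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical profile file, used only for illustration.
    profile = Path(tmp) / "profiles.csv"
    profile.write_bytes(b"well,feature\nA01,0.5\n")
    original = file_md5(profile)

    # Hashing the same bytes again gives the same digest, wherever they live.
    unchanged = file_md5(profile)

    # Editing the file changes the digest, so a pointer would change too.
    profile.write_bytes(b"well,feature\nA01,0.6\n")
    modified = file_md5(profile)

print("same content, same hash:", original == unchanged)
print("edited content, same hash:", original == modified)
```

The remote URL never enters the hash, which is why only the remote setting would need updating if the bucket moves.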
I am planning on adding all level 3-4 data to DVC, but keeping the level 5 and spherized profiles as Git LFS files. We use the level 3-4 data less frequently, and we often read the level 5 and spherized profiles directly from their GitHub URLs. While DVC also has a nifty way of interacting with DVC-tracked files from GitHub URLs in Python, it is not a drop-in replacement for reading directly from a URL. We get the best of both worlds by having the lower-level profiles on S3 and the more interactive files versioned through Git LFS.
Nice plan! For my notes: you can't directly access a DVC-versioned file via URL, because the pointer looks like this: https://github.com/gwaygenomics/grit-benchmark/blob/6b826a03456b5e0d6437aff99e17a407653c2568/1.calculate-metrics/cell-health/results/cell_health_grit_compartments.tsv.gz.dvc Whereas you can directly access the files via URL for Git LFS: https://github.com/gwaygenomics/grit-benchmark/blob/main/0.download-data/data/ceres.csv (click on "View raw").
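For reference, here is a rough sketch of turning a GitHub "blob" page URL into a URL that can be read directly. The helper name `github_blob_to_direct` is hypothetical. The `raw.githubusercontent.com` and `media.githubusercontent.com/media` patterns are assumptions based on how GitHub currently serves raw and Git LFS content, and the parsing assumes the branch or commit ref contains no slash.

```python
from urllib.parse import urlparse


def github_blob_to_direct(blob_url, lfs=False):
    """Convert a github.com/<owner>/<repo>/blob/<ref>/<path> URL to a direct URL.

    With lfs=True, return the media URL that serves the Git LFS object itself
    rather than the small LFS pointer file.
    """
    parts = urlparse(blob_url).path.strip("/").split("/")
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _, ref = parts[:4]  # assumes the ref has no "/" in it
    file_path = "/".join(parts[4:])
    if lfs:
        return (
            "https://media.githubusercontent.com/media/"
            f"{owner}/{repo}/{ref}/{file_path}"
        )
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}"


lfs_url = github_blob_to_direct(
    "https://github.com/gwaygenomics/grit-benchmark/blob/main/0.download-data/data/ceres.csv",
    lfs=True,
)
print(lfs_url)
```

Applying the same helper to the `.dvc` link would only yield the small pointer file, not the data, which is the difference noted above.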
BTW for level 5 + spherized – I am guessing it isn't practical to have them live in both DVC and Git LFS? I ask because it would be convenient to be able to get all the data from the bucket alone if one would like to do so. We needn't do that for this dataset, but I was just wondering if there's any path that will allow us to do so in the future.
My plan is to add this whole repo to the S3 bucket. We'll be able to access DVC files from where they live naturally, and we'll be able to access Git LFS files via the bucket or the media URL.
Oh, interesting – curious to see what you mean by "adding the repo to the S3 bucket"; I can wait for the PR, no need to explain right now.
I am working on this now.
Asks
@shntnu
I also remember you wanting me to document adding DVC to this repo somewhere else, but I cannot find the link to where you want me to document the steps. Can you provide me this pointer again? Thanks! (see cross references below)
Cross references
A couple cross-references to track history of DVC discussions: