Adding profiles to dvc #66

Closed
2 tasks done
gwaybio opened this issue May 31, 2021 · 7 comments

Labels
Data All things data

Comments

@gwaybio
Member

gwaybio commented May 31, 2021

I am working on this now.

Asks

@shntnu

  • Can you provide me with access (and a pointer) to which AWS bucket to use for permanent dvc storage and access?
  • I also remember you wanting me to document adding DVC to this repo somewhere else, but I cannot find the link to where you want me to document steps. Can you also provide me this pointer again? Thanks! (see cross references below)

Cross references

A couple cross-references to track history of DVC discussions:

@gwaybio added the Data label May 31, 2021
@shntnu
Collaborator

shntnu commented Jun 6, 2021

  • Can you provide me with access (and a pointer) to which AWS bucket to use for permanent dvc storage and access?

The bucket reference here is the one I had in mind

https://registry.opendata.aws/cell-painting-image-collection/

I will go pull up stuff now

@shntnu
Collaborator

shntnu commented Jun 6, 2021

Hm – the only trouble is that the bucket is called cytodata, which will be a bit odd. Crud. It's on my plate to create a new AWS Open Data Resource, but I don't have an ETA.

To unblock you, I'd suggest we go ahead with depositing it at s3://cellpainting-datasets instead. You already have credentials for that (same as our primary AWS account)

There's some chance we may need to change that, but at least it will keep you moving.

IIUC the change is not too hard

https://github.com/gwaygenomics/grit-benchmark/blob/a04d010b2f579d5dd0cfdc2c9222c2d7f02b9a84/.dvc/config#L4

and you'd only need to modify the URL.

The file pointer will remain the same as long as we keep the relative paths the same, and don't modify the file (otherwise md5 will change)

cytomining/profiling-template#13 (comment)
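
A minimal Python sketch of why the pointer survives a change of remote, assuming (as described above) that the pointer records an md5 of the file contents plus a relative path, and not the bucket URL. The helper name `file_md5`, the file name, and the sample bytes are illustrative only and are not part of DVC.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp) / "profiles.csv"  # hypothetical file name
    profile.write_bytes(b"well,feature\nA01,0.5\n")
    original = file_md5(profile)

    # Same bytes give the same hash, whichever remote URL is configured.
    unchanged = file_md5(profile)

    # Any edit to the contents changes the hash, and therefore the pointer.
    profile.write_bytes(b"well,feature\nA01,0.6\n")
    modified = file_md5(profile)

print(original == unchanged)  # True
print(original == modified)   # False
```

The hash depends only on the bytes of the file, which is why swapping the bucket URL in `.dvc/config` should not invalidate existing pointers.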

@gwaybio
Member Author

gwaybio commented Jun 16, 2021

I am planning on adding all level 3-4 data to dvc, but keeping level 5 and spherized profiles as git lfs files. We use the level 3-4 data less frequently, and we often read the level 5 and spherized profiles directly from their github urls.

While dvc also has a nifty way of directly interacting with dvc files from github urls in python, it is not a drop-in replacement for reading directly from a url.

We get the best of both worlds having the lower level profiles on s3 and the more interactive files versioned through git lfs.
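
For reference, a hedged sketch of the Python route mentioned above, using `dvc.api.open`, which takes a path tracked in a DVC repo plus the Git URL of that repo. It assumes `dvc` is installed with support for the remote in use and that the caller has read access to that remote. The function name `read_dvc_file` and the `opener` parameter are hypothetical, added here only so the function can be exercised without network access.

```python
def read_dvc_file(path, repo, rev=None, opener=None):
    """Return the text contents of a DVC-tracked file in a Git repo.

    path: path of the tracked file, relative to the repo root
    repo: Git URL (or local path) of the DVC-enabled repo
    rev:  optional Git revision (branch, tag, or commit)
    """
    if opener is None:
        # Imported lazily so this module loads even where dvc is absent.
        import dvc.api
        opener = dvc.api.open
    # dvc.api.open resolves the pointer and streams the data from the remote.
    with opener(path, repo=repo, rev=rev, mode="r") as handle:
        return handle.read()
```

A call would look something like `read_dvc_file("path/to/profiles.csv", repo="https://github.com/<org>/<repo>")`, with both arguments as placeholders. The caller needs `dvc` and the repo URL plus an in-repo path, whereas a plain URL reader needs only the URL, which is the sense in which this is not a drop-in replacement.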

@shntnu
Collaborator

shntnu commented Jun 16, 2021

We get the best of both worlds having the lower level profiles on s3 and the more interactive files versioned through git lfs.

Nice plan!

For my notes:

You can't directly access a DVC-versioned file via URL, because what lives on GitHub is only the pointer, which looks like this: https://github.com/gwaygenomics/grit-benchmark/blob/6b826a03456b5e0d6437aff99e17a407653c2568/1.calculate-metrics/cell-health/results/cell_health_grit_compartments.tsv.gz.dvc

Whereas for GitLFS you can directly access the file via URL: https://github.com/gwaygenomics/grit-benchmark/blob/main/0.download-data/data/ceres.csv (click on "View raw")
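
A small sketch of the "View raw" step done programmatically, assuming GitHub's usual `/blob/` to `/raw/` URL convention. The helper name `blob_to_raw_url` is hypothetical. As far as I understand, for a GitLFS-tracked file the resulting URL redirects to the actual data, while for a `.dvc` file it would return only the small pointer.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw_url(blob_url):
    """Rewrite a github.com '/blob/' page URL to its '/raw/' counterpart."""
    parts = urlsplit(blob_url)
    segments = parts.path.split("/")
    # Expected path layout: /<owner>/<repo>/blob/<rev>/<file path...>
    if parts.netloc != "github.com" or len(segments) < 6 or segments[3] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


lfs_page = (
    "https://github.com/gwaygenomics/grit-benchmark"
    "/blob/main/0.download-data/data/ceres.csv"
)
print(blob_to_raw_url(lfs_page))
```

The returned URL is the kind that can be handed straight to a CSV reader, which is the convenience being traded off against DVC here.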

@shntnu
Collaborator

shntnu commented Jun 16, 2021

BTW for level 5 + spherized – I am guessing it isn't practical to have them live in both DVC and GitLFS? I ask because it would be convenient to be able to get all the data from the bucket alone if one would like to do so.

We needn't do that for this dataset, but I was just wondering if there's any path that will allow us to do so in the future.

@gwaybio
Member Author

gwaybio commented Jun 16, 2021

My plan is to add this whole repo to the S3 bucket - we'll be able to access dvc files from where they live naturally, and we'll be able to access git lfs files via bucket or media url

@shntnu
Collaborator

shntnu commented Jun 16, 2021

My plan is to add this whole repo to the S3 bucket - we'll be able to access dvc files from where they live naturally, and we'll be able to access git lfs files via bucket or media url

Oh, interesting – curious to see what you mean by "adding the repo to the S3 bucket"; I can wait for the PR, no need to explain right now
