List public databases in our internal git-annex server #351
Comments
@mguaypaq I am wondering about this since I want to add new ground truth (GT) for spinal cord segmentation, and it would be simpler to just push to one of them! |
I think we can at least de-duplicate the image storage, using git-annex special remotes, like we currently do for spine-generic. For spine-generic:
In this case, we would have:
This would also be useful in the case where we want to take a public dataset, but change the file and folder names. (If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible.) |
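As a rough illustration of the de-duplication idea above, the sketch below builds the `git annex addurl` commands that would record a public URL for each large file, so that the internal server does not need to store a second copy. It assumes the public host serves each file at a stable HTTP(S) URL (git-annex's built-in web special remote), which may differ from how spine-generic is actually configured. The base URL and file path are hypothetical placeholders.

```python
from urllib.parse import quote


def addurl_commands(base_url, relative_paths):
    """Build one `git annex addurl` command per file.

    --fast records the URL without downloading the content now;
    --file sets the path the file gets inside the repository.
    """
    commands = []
    for path in relative_paths:
        # Percent-encode the path so spaces and similar characters stay valid.
        url = base_url.rstrip("/") + "/" + quote(path)
        commands.append(["git", "annex", "addurl", "--fast", f"--file={path}", url])
    return commands


if __name__ == "__main__":
    # Hypothetical bucket URL and file path, for illustration only.
    for cmd in addurl_commands(
        "https://example.org/public-bucket",
        ["sub-01/anat/sub-01_T1w.nii.gz"],
    ):
        print(" ".join(cmd))
```

The commands are only printed, not executed, so they can be reviewed before running them inside a clone of the dataset. Because `--file` is set explicitly, the same approach would also cover the case where file and folder names differ from the public copy.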
Ah, interesting. This is what we intended. Isn't it possible, given that git-annex only stores pointers for big files? What if we specify in |
This is more a limitation of (Neuro)gitea than git-annex. We could specify in

I think what you want is called a repository mirror, and it looks like Gitea does support it, which I didn't know about. Essentially, Gitea clones the existing repository (which must be publicly accessible), and it periodically pulls changes from the public source. I'm not sure yet how this would interact with git-annex, but I think it would only pull the un-annexed stuff (file names, folder names, commit messages, etc.) and the content availability information (that is, info about which servers have the large file contents). |
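For reference, a mirror like the one described above can in principle be created through Gitea's REST API rather than the web form. The sketch below only builds the request for the `POST /api/v1/repos/migrate` endpoint with `mirror` enabled. The field names are given as understood from the Gitea API and should be checked against the API page of the installed version. The server URL, source repository, owner, and repository name are hypothetical, and authentication (an access token) is deliberately left out.

```python
import json


def mirror_request(gitea_url, source_clone_url, repo_name, owner):
    """Return (endpoint, JSON body) for creating a pull mirror in Gitea."""
    endpoint = gitea_url.rstrip("/") + "/api/v1/repos/migrate"
    body = {
        "clone_addr": source_clone_url,  # public repository to mirror from
        "repo_name": repo_name,          # name of the mirror on the Gitea side
        "repo_owner": owner,             # user or organization owning the mirror
        "mirror": True,                  # keep pulling changes periodically
    }
    return endpoint, json.dumps(body)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    endpoint, body = mirror_request(
        "https://gitea.example.org",
        "https://github.com/example/public-dataset.git",
        "public-dataset",
        "datasets",
    )
    print(endpoint)
    print(body)
```

This says nothing about how the mirror would treat annexed content. As noted above, it would presumably carry only the git history and the location-tracking information.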
OK, thank you for the clarification. And why do we need to keep Gitea instead of just using git-annex? Is it for the visual display of the images in a web browser, and the possibility of opening issues/PRs? If so, does the lab use these features? (I personally don't, except for the Praxis dataset, but that's separate from data.neuro.polymtl.ca) |
Gitea handles a lot for us:
In particular, it's the reason we have a listing of datasets at all. If we didn't want gitea, we might as well list external datasets as just an HTML link on the lab manual. |
I am preparing to send one of our datasets to https://dandiarchive.org/. It's currently sitting in the internal data server. We should de-duplicate by deleting the data from the git-annex server. Has this been done before? I don't know what the procedure should be in this case. |
That depends on which parts you want to de-duplicate, and how much you trust dandiarchive to still be working in a few years.
|
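Before deleting anything from the internal server, it may help to check which files git-annex already knows to exist somewhere else. The sketch below filters the output of `git annex whereis --json` (one JSON record per line) and keeps the files that have at least one recorded copy outside the internal server. The UUID used to identify the internal server is a placeholder, and the check reflects only what git-annex has recorded, not whether the other host is still reachable.

```python
import json


def files_safe_to_drop(whereis_json_lines, internal_uuid):
    """Return files with at least one known copy outside the internal server.

    Each input line is one JSON record as printed by `git annex whereis --json`.
    """
    safe = []
    for line in whereis_json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the output
        record = json.loads(line)
        other_copies = [
            loc for loc in record.get("whereis", [])
            if loc.get("uuid") != internal_uuid
        ]
        if other_copies:
            safe.append(record["file"])
    return safe
```

By default, `git annex drop` also refuses to remove content unless it can verify that enough other copies exist, so a listing like this is mainly useful for reviewing the situation beforehand.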
Some of our datasets (e.g., spine-generic, whole-spine) are hosted on public cloud servers (e.g., AWS, OpenNeuro). To facilitate their reuse, it would be convenient to have them listed in our internal git-annex database. Is this possible?
Moreover, this approach would avoid duplicating datasets. For example, whole-spine is currently physically hosted on both our internal git-annex server and OpenNeuro.