List public databases in our internal git-annex server #351

Open · sandrinebedard opened this issue Feb 11, 2025 · 8 comments
@sandrinebedard (Member)

Some of our datasets (e.g., spine-generic, whole-spine) are hosted on public cloud servers (e.g., AWS, OpenNeuro). To facilitate their reusability, it would be convenient to have them listed in our internal git-annex database. Is this possible?

Moreover, this approach would avoid having duplicated datasets. E.g., currently, whole-spine is physically hosted both on our internal git-annex server and on OpenNeuro.

@sandrinebedard (Member, Author)

@mguaypaq I am wondering this since I want to add new GT for spinal cord segmentation, and it would be simpler to just push to one of them!

@mguaypaq (Member)

I think we can at least de-duplicate the image storage using git-annex special remotes, like we currently do for spine-generic, where:

  • The small files, folder structure, and pull requests are hosted on GitHub.
  • The contents of the large files are hosted on AWS.

In this case, we would have:

  • The small files, folder structure, and pull requests are hosted on data.neuro.polymtl.ca.
  • The contents of the large public files are hosted in the cloud, and are available from both the public repo and our internal repo.
  • If we have large private files (for example, derivatives) to add, those can still be stored on data.neuro.polymtl.ca. Git-annex is able to track multiple sources for different files (see the sketch below).
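
For concreteness, a minimal sketch of what that could look like on the command line; the remote names, bucket, and file path here are made up, and the real initremote parameters would depend on how the bucket is set up:

```
# Hypothetical: register a public S3 bucket as a git-annex special remote,
# readable without AWS credentials (bucket name is made up).
git annex initremote public-s3 type=S3 encryption=none \
    bucket=example-public-bucket public=yes

# Large private files (e.g. derivatives) stay on the internal server:
git annex copy derivatives/ --to origin

# git-annex tracks availability per file, so different files can live on
# different remotes:
git annex whereis sub-01/anat/sub-01_T1w.nii.gz
```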

This arrangement would also be useful in the case where we want to take a public dataset but change the file and folder names.

(If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible.)

@jcohenadad (Member)

> If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible

Ah, interesting. This is what we intended. Isn't it possible, given that git-annex only uses pointers for big files? What if we specify in .gitattributes that all files should be annexed?

@mguaypaq (Member)

> > If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible
>
> Ah, interesting. This is what we intended. Isn't it possible, given that git-annex only uses pointers for big files? What if we specify in .gitattributes that all files should be annexed?

This is more a limitation of (Neuro)gitea than git-annex.

It's true that we could specify in .gitattributes that git-annex should handle the contents of all files, not just big files (see the snippet below). But that's just the file contents, not the file and folder names (and not the history, branches, commit messages, etc.). Regardless of git-annex, there still needs to be a git repository.
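
For reference, annexing everything is a single .gitattributes line (this is the standard git-annex largefiles configuration):

```
* annex.largefiles=anything
```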

I think what you want is called a repository mirror, and it looks like Gitea does support it, which I didn't know about. Essentially, Gitea clones the existing repository (which must be publicly accessible), and it periodically pulls changes from the public source. I'm not sure yet how this would interact with git-annex, but I think it would only pull the un-annexed stuff (file names, folder names, commit messages, etc.) and the content availability information (that is, info about which servers have the large file contents).
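
If it does behave that way, working with a clone of such a mirror might look like this (sketch only; the mirror URL and file path are hypothetical):

```
# Clone the (hypothetical) Gitea mirror: this brings the git history and
# the git-annex availability info, but not the large file contents.
git clone git@data.neuro.polymtl.ca:mirrors/whole-spine
cd whole-spine
git annex init

# See which remotes claim to have a given file's content...
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# ...and fetch it from whichever public source is reachable.
git annex get sub-01/anat/sub-01_T1w.nii.gz
```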

@jcohenadad (Member)

> This is more a limitation of (Neuro)gitea than git-annex.

Ok, thank you for the clarification. And why do we need to keep Gitea instead of just using git-annex? Is it for the visual display of the images in a web browser, and the possibility to open issues/PRs? If so, does the lab use these features? (I personally don't, except for the Praxis dataset, but that's separate from data.neuro.polymtl.ca.)

@mguaypaq (Member)

Gitea handles a lot for us:

  • It's the entire web interface (so, listing datasets, browsing the contents and the README).
  • It does pull requests, which students/postdocs and I have been using a lot.
  • It handles user logins and authentication.
  • It handles read/write/admin permissions for the repositories.
  • It provides the endpoint that git clone/push/pull etc. talk to over SSH.
  • Eventually, it should handle automated checks and validation for pull requests.

In particular, it's the reason we have a listing of datasets at all. If we didn't want Gitea, we might as well list external datasets as just an HTML link in the lab manual.

@hermancollin (Member)

I am preparing to send one of our datasets to https://dandiarchive.org/. It's currently sitting on the internal data server. We should de-duplicate by deleting the data from the git-annex server. Has this been done before?

I don't know what the procedure should be in this case.

@mguaypaq (Member) commented Mar 3, 2025

> I am preparing to send one of our datasets to https://dandiarchive.org/. It's currently sitting on the internal data server. We should de-duplicate by deleting the data from the git-annex server. Has this been done before?

That depends on which parts you want to de-duplicate, and how much you trust dandiarchive to still be working in a few years.

  • The most radical option would be to delete your dataset completely from data.neuro.polymtl.ca, and then to replace it with a mirror of the version on https://github.com/dandisets. This would probably lose some of the git history of the dataset during development, and it would certainly lose the information on pull requests, etc.
  • Since dandiarchive supports datalad, another option would be to keep the git repository on data.neuro.polymtl.ca and simply add dandiarchive as a special remote in git-annex. Then we could de-duplicate only the files which are on dandiarchive, while keeping everything else (see the sketch below).
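
A rough sketch of that second option, assuming the dandiset's datalad repository ships a special remote we can enable (the remote name "dandi" and the file path are hypothetical):

```
# Make the dandiarchive storage known to this clone ("dandi" is a
# hypothetical remote name; the real one comes from the dandiset repo).
git annex enableremote dandi

# Confirm the public copy exists before dropping the internal one:
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# Drop the copy held on data.neuro.polymtl.ca; git-annex refuses unless
# enough copies remain elsewhere (annex.numcopies):
git annex drop sub-01/anat/sub-01_T1w.nii.gz --from origin
```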
