List public databases in our internal git-annex server #351

Open · sandrinebedard opened this issue Feb 11, 2025 · 8 comments
@sandrinebedard (Member)

Some of our datasets (e.g., spine-generic, whole-spine) are hosted on public cloud servers (e.g., AWS, OpenNeuro). To facilitate their reusability, it would be convenient to have them listed in our internal git-annex database. Is this possible?

Moreover, this approach would avoid having duplicated datasets. E.g., currently, whole-spine is physically hosted both on our internal git-annex server and on OpenNeuro.

@sandrinebedard (Member, Author)

@mguaypaq I am wondering this since I want to add new GT for spinal cord segmentation, and it would be simpler to just push to one of them!

@mguaypaq (Member)

I think we can at least de-duplicate the image storage using git-annex special remotes, like we currently do for spine-generic, where:

  • The small files, folder structure, and pull requests are hosted on GitHub.
  • The contents of the large files are hosted on AWS.

In this case, we would have:

  • The small files, folder structure, and pull requests are hosted on data.neuro.polymtl.ca.
  • The contents of the large public files are hosted in the cloud, and are available from both the public repo and our internal repo.
  • If we have large private files (for example, derivatives) to add, those can still be stored on data.neuro.polymtl.ca. Git-annex is able to track multiple sources for different files (see the sketch below).
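
For concreteness, a minimal sketch of what that could look like on the command line; the remote names, bucket, and file path here are made up, and the real initremote parameters would depend on how the bucket is set up:

```
# Hypothetical: register a public S3 bucket as a git-annex special remote,
# readable without AWS credentials (bucket name is made up).
git annex initremote public-s3 type=S3 encryption=none \
    bucket=example-public-bucket public=yes

# Large private files (e.g. derivatives) stay on the internal server:
git annex copy derivatives/ --to origin

# git-annex tracks availability per file, so different files can live on
# different remotes:
git annex whereis sub-01/anat/sub-01_T1w.nii.gz
```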

This arrangement would also be useful in the case where we want to take a public dataset but change the file and folder names.

(If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible.)

@jcohenadad (Member)

> If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible

Ah, interesting. This is what we intended. Isn't it possible, given that git-annex only uses pointers for big files? What if we specify in .gitattributes that all files should be annexed?

@mguaypaq (Member)

> > If you want to purely have a link, so that we don't even have to duplicate the small files and folder structure, but still have the repository show up in searches, I don't think that's possible
>
> Ah, interesting. This is what we intended. Isn't it possible, given that git-annex only uses pointers for big files? What if we specify in .gitattributes that all files should be annexed?

This is more a limitation of (Neuro)gitea than git-annex.

It's true that we could specify in .gitattributes that git-annex should handle the contents of all files, not just big files (see the snippet below). But that's just the file contents, not the file and folder names (and not the history, branches, commit messages, etc.). Regardless of git-annex, there still needs to be a git repository.
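
For reference, annexing everything is a single .gitattributes line (this is the standard git-annex largefiles configuration):

```
* annex.largefiles=anything
```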

I think what you want is called a repository mirror, and it looks like Gitea does support it, which I didn't know about. Essentially, Gitea clones the existing repository (which must be publicly accessible), and it periodically pulls changes from the public source. I'm not sure yet how this would interact with git-annex, but I think it would only pull the un-annexed stuff (file names, folder names, commit messages, etc.) and the content availability information (that is, info about which servers have the large file contents).
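
If it does behave that way, working with a clone of such a mirror might look like this (sketch only; the mirror URL and file path are hypothetical):

```
# Clone the (hypothetical) Gitea mirror: this brings the git history and
# the git-annex availability info, but not the large file contents.
git clone git@data.neuro.polymtl.ca:mirrors/whole-spine
cd whole-spine
git annex init

# See which remotes claim to have a given file's content...
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# ...and fetch it from whichever public source is reachable.
git annex get sub-01/anat/sub-01_T1w.nii.gz
```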

@jcohenadad (Member)

> This is more a limitation of (Neuro)gitea than git-annex.

Ok, thank you for the clarification. And why do we need to keep Gitea instead of just using git-annex? Is it for the visual display of the images in a web browser, and the possibility to open issues/PRs? If so, does the lab use these features? (I personally don't, except for the Praxis dataset, but that's separate from data.neuro.polymtl.ca.)

@mguaypaq (Member)

Gitea handles a lot for us:

  • It's the entire web interface (so, listing datasets, browsing the contents and the README).
  • It does pull requests, which students/postdocs and I have been using a lot.
  • It handles user logins and authentication.
  • It handles read/write/admin permissions for the repositories.
  • It provides the endpoint that git clone/push/pull etc. talk to over SSH.
  • Eventually, it should handle automated checks and validation for pull requests.

In particular, it's the reason we have a listing of datasets at all. If we didn't want Gitea, we might as well list external datasets as just an HTML link in the lab manual.

@hermancollin (Member)

I am preparing to send one of our datasets to https://dandiarchive.org/. It's currently sitting on the internal data server. We should de-duplicate by deleting the data from the git-annex server. Has this been done before?

I don't know what the procedure should be in this case.

@mguaypaq (Member) commented Mar 3, 2025

> I am preparing to send one of our datasets to https://dandiarchive.org/. It's currently sitting on the internal data server. We should de-duplicate by deleting the data from the git-annex server. Has this been done before?

That depends on which parts you want to de-duplicate, and how much you trust dandiarchive to still be working in a few years.

  • The most radical option would be to delete your dataset completely from data.neuro.polymtl.ca, and then to replace it with a mirror of the version on https://github.com/dandisets. This would probably lose some of the git history of the dataset during development, and it would certainly lose the information on pull requests, etc.
  • Since dandiarchive supports datalad, another option would be to keep the git repository on data.neuro.polymtl.ca and simply add dandiarchive as a special remote in git-annex. Then we could de-duplicate only the files which are on dandiarchive, while keeping everything else (see the sketch below).
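
A rough sketch of that second option, assuming the dandiset's datalad repository ships a special remote we can enable (the remote name "dandi" and the file path are hypothetical):

```
# Make the dandiarchive storage known to this clone ("dandi" is a
# hypothetical remote name; the real one comes from the dandiset repo).
git annex enableremote dandi

# Confirm the public copy exists before dropping the internal one:
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# Drop the copy held on data.neuro.polymtl.ca; git-annex refuses unless
# enough copies remain elsewhere (annex.numcopies):
git annex drop sub-01/anat/sub-01_T1w.nii.gz --from origin
```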
