Specify location to store metadata for registry clients #322

Open
thomashoneyman opened this issue Feb 2, 2022 · 5 comments

Comments

@thomashoneyman
Member

We have a Dhall specification for the metadata about a particular package as well as particular package versions:

https://github.com/purescript/registry/blob/master/v1/Metadata.dhall

This metadata is information we collect about a package or package version on top of the information provided by users via their package manifests. For example, we collect the package size, compute a hash, and so on. The spec asserts that the metadata for packages is stored in a directory named packages here in the registry repo:

https://github.com/purescript/registry#package-metadata
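
As a rough illustration of how a package manager might use such metadata, here is a minimal sketch that checks a downloaded tarball against a recorded size and hash. The field names `bytes` and `hash`, and the choice of SHA-256 with hex encoding, are assumptions made for this example; the actual format is whatever Metadata.dhall specifies.

```python
import hashlib


def verify_tarball(data: bytes, expected_bytes: int, expected_sha256: str) -> bool:
    """Check a downloaded tarball against the size and hash from the metadata.

    Assumption: the hash is a hex-encoded SHA-256 digest.
    """
    if len(data) != expected_bytes:
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Stand-in tarball contents and a hypothetical metadata entry derived from them.
tarball = b"example tarball contents"
metadata = {"bytes": len(tarball), "hash": hashlib.sha256(tarball).hexdigest()}

print(verify_tarball(tarball, metadata["bytes"], metadata["hash"]))
print(verify_tarball(tarball + b"!", metadata["bytes"], metadata["hash"]))
```

The size check runs first because it is cheap and catches truncated downloads before any hashing is done.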

However, during our registry call this morning in the #registry channel of the PureScript chat, we discussed mirroring information from the registry to other repositories, as we already do when we write the package manifests to the registry-index repository. The metadata about packages seems as useful to package managers as the contents of the manifests themselves, so we discussed whether the metadata should be mirrored to the registry-index as well.

The question is: should we mirror the metadata to the registry index or elsewhere, and if so, where should it be stored in the registry index?

@f-f
Member

f-f commented Feb 3, 2022

> The question is: should we mirror the metadata to the registry index or elsewhere, and if so, where should it be stored in the registry index?

I'll note that really this is about having a slight convenience: right now package managers would have to fetch both the registry repo (for reading the metadata) and the registry-index repo (for reading the manifests).
Mirroring the metadata to the registry-index repo would allow us to have access to all the information by only fetching the registry-index repo.

As a package-manager author I don't really mind fetching two repos.
It might be necessary to do it in any case: the registry-index is just a cache, while the registry is the source of truth.
Since caches can get out of date, the only way to get an up-to-date answer to the question "has version X of package Y been published?" is to look at the registry repo.
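
To make the staleness problem concrete, here is a small sketch. The `MetadataStore` type, the `can_publish` helper, and the package name and versions are all made up for illustration; they are not part of the registry's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class MetadataStore:
    """Hypothetical view of which versions of each package are published."""
    published: Dict[str, Set[str]] = field(default_factory=dict)

    def is_published(self, name: str, version: str) -> bool:
        return version in self.published.get(name, set())


def can_publish(store: MetadataStore, name: str, version: str) -> bool:
    """A version may be published only if the store has not seen it yet."""
    return not store.is_published(name, version)


# The registry repo (source of truth) already contains version 5.0.1 ...
registry = MetadataStore({"prelude": {"5.0.0", "5.0.1"}})
# ... but the cache was last synced before 5.0.1 was published.
stale_cache = MetadataStore({"prelude": {"5.0.0"}})

print(can_publish(registry, "prelude", "5.0.1"))     # False
print(can_publish(stale_cache, "prelude", "5.0.1"))  # True, which is wrong
```

Both stores answer the same question with the same logic; only the freshness of the data differs, which is why the check has to run against the source of truth.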

@thomashoneyman
Member Author

> I'll note that really this is about having a slight convenience: right now package managers would have to fetch both the registry repo (for reading the metadata) and the registry-index repo (for reading the manifests). Mirroring the metadata to the registry-index repo would allow us to have access to all the information by only fetching the registry-index repo.

What if we write the registry-index and metadata caches to the storage backend instead of a repository? Clients already have to fetch from the storage backend to get tarballs, and they already have to fetch from the registry repository to know what the storage backend URLs are. If we make them read the registry index and metadata caches from the registry-index repository then that makes three locations they have to look at.

In general, it seems like we could use the registry repository as the 'source of truth' (it's essentially our database), and then use the storage backend to store packages and the replica of registry repository information (the registry index, the package metadata). Package managers don't need to look at the registry repository unless there is reason to believe the cache is out of date.

As an aside: package managers also look at the registry repository to get the location of the storage backend. I'm struggling to remember why this is necessary. It appears that getPackageUrls will tell you the location of the package for different backends, which presumably will be stored in different locations, but I'm not aware of any plan to host different versions of packages for different backends. Why can't we just assert that packages are hosted at packages.purescript.org/name/version.tar.gz?

@f-f
Member

f-f commented Feb 7, 2022

> In general, it seems like we could use the registry repository as the 'source of truth' (it's essentially our database), and then use the storage backend to store packages and the replica of registry repository information (the registry index, the package metadata)

The reason why the registry-index is a git repository is actually for package-managers' convenience: in this way they can clone it once, and then get incremental updates. Having a static tarball on a CDN would be messier because HTTP content might be cached without us realizing, etc.
Instead git works very well for this, and I believe it would be more pleasant for users as well, since it should use less bandwidth.

> Package managers don't need to look at the registry repository unless there is reason to believe the cache is out of date.

Note that package managers must always look at the most up-to-date metadata (i.e. the registry repo), because stale data is just incorrect (e.g. think about how it could go "oh sure you can publish version Y of package X, I don't see it in my cache!"), and I can't think of a mechanism to figure out that the cache is outdated other than, well, trying to fetch the source.

> As an aside: package managers also look at the registry repository to get the location of the storage backend. I'm struggling to remember why this is necessary. It appears that getPackageUrls will tell you the location of the package for different backends, which presumably will be stored in different locations, but I'm not aware of any plan to host different versions of packages for different backends. Why can't we just assert that packages are hosted at packages.purescript.org/name/version.tar.gz?

There's a section of the spec about mirroring the Registry.
The basic idea is that chunks of the internet that we depend on (GitHub and DigitalOcean in this case) go down pretty often, and generally speaking we need to have a story to mirror all the content if we need to, also because we don't want to tie ourselves to a specific service, e.g. GitHub.
So the idea is that the main entrypoint is this repo (or one of its mirrors, e.g. I'd like to mirror this to gitlab.com at least), and from its content you would be able to figure out the possible locations where you'd fetch packages.
Note that the data on a storage backend is critical data - we fetch it once and then it's supposed to stay there forever. So we'll need at least backups, and why not expose some more copies as well? I feel like the question should not be "why do we want to mirror this" but rather "why wouldn't we want to mirror this?" 😄

So that chunk of code you link is so that package managers (or anyone else really) can figure out the URLs of where to fetch packages from all the different mirrors.
The registry-index exists as a single copy at the moment, and we could probably have a similar mechanism to link to its multiple mirrors here.
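
A minimal sketch of that idea, assuming every mirror uses the name/version.tar.gz layout mentioned earlier in this thread. The `candidate_urls` function, the `MIRRORS` list, the `https` scheme, and the second mirror URL are hypothetical and only illustrate the shape; they are not the registry's actual getPackageUrls.

```python
from typing import List

# Hypothetical mirror base URLs; only the first host is mentioned in this thread.
MIRRORS: List[str] = [
    "https://packages.purescript.org",
    "https://mirror.example.org/purescript",
]


def candidate_urls(name: str, version: str, mirrors: List[str] = MIRRORS) -> List[str]:
    """Return one tarball URL per mirror, in order of preference."""
    return [f"{base.rstrip('/')}/{name}/{version}.tar.gz" for base in mirrors]


for url in candidate_urls("prelude", "5.0.0"):
    print(url)
```

Keeping the mirror list as data, rather than hard-coding a single host in every client, is what would let the entrypoint repo add or drop mirrors without clients changing.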

So to get back to your initial question:

> If we make them read the registry index and metadata caches from the registry-index repository then that makes three locations they have to look at.

...maybe instead of trying to reduce the number of locations that package managers need to deal with (which I feel is pretty minimal right now, since all the different pieces do different things) we could focus on making sure that there is one entrypoint (this repo) from which all the information on how to get the rest of the things (and their mirrors!) can be easily figured out?

@thomashoneyman
Member Author

> The reason why the registry-index is a git repository is actually for package-managers' convenience: in this way they can clone it once, and then get incremental updates. Having a static tarball on a CDN would be messier because HTTP content might be cached without us realizing, etc.

That makes sense to me.

> Note that package managers must always look at the most up-to-date metadata (i.e. the registry repo), because stale data is just incorrect (e.g. think about how it could go "oh sure you can publish version Y of package X, I don't see it in my cache!"), and I can't think of a mechanism to figure out that the cache is outdated other than, well, trying to fetch the source.

In this case, it sounds like we don't bother publishing the metadata anywhere other than this repository? Since ultimately it has to be looked up here anyway?

> ...maybe instead of trying to reduce the number of locations that package managers need to deal with (which I feel is pretty minimal right now, since all the different pieces do different things) we could focus on making sure that there is one entrypoint (this repo) from which all the information on how to get the rest of the things (and their mirrors!) can be easily figured out?

That's fine with me: ensuring this entrypoint gives you what you need to find the other locations, and it's up to package managers to follow that.

> The basic idea is that chunks of the internet that we depend on (GitHub and DigitalOcean in this case) go down pretty often, and generally speaking we need to have a story to mirror all the content if we need to, also because we don't want to tie ourselves to a specific service, e.g. GitHub.

I'm curious how you see mirrors fitting in with things like metadata. If the metadata must be looked up from the registry repo, because caches can be out of date, then what is different about a mirror? What if a particular version of the registry fails to be mirrored over to GitLab, for example?

Do we assert that you can look up essentially everything except for mirrors from the registry repo, but you must look up metadata from this repository?

@f-f
Member

f-f commented Feb 12, 2022

> In this case, it sounds like we don't bother publishing the metadata anywhere other than this repository? Since ultimately it has to be looked up here anyway?

Yeah, I'm for this

> I'm curious how you see mirrors fitting in with things like metadata. If the metadata must be looked up from the registry repo, because caches can be out of date, then what is different about a mirror? What if a particular version of the registry fails to be mirrored over to GitLab, for example?
> Do we assert that you can look up essentially everything except for mirrors from the registry repo, but you must look up metadata from this repository?

This is now a distributed system, and we can't have all three CAP properties (consistency, availability, partition tolerance) at once, so we have to choose between

  • CP (the data is always consistent, but the system will not always be available)
  • and AP (the system is always available, but the data is only eventually consistent).

I think in the case of this repo we should go with CP: if this repo is unreachable, then it's pretty likely that everything else at GitHub is down, and you wouldn't be able to run the pipeline anyways.
Even if we wanted some kind of multi-master setup, we'd still have to go with CP, since AP would put us in the potential situation of two packages with the same name being published at the same time, which I'd rather not deal with 🙂

I'll note that while mirroring storage backends is about ensuring availability (i.e. we do AP there, since if you don't find a package on a mirror you can just try the next one, but you're sure it's there somewhere because the Metadata says so, and that is always consistent), mirroring this repo is more of an insurance policy: if we don't want to deal with GitHub and its CI anymore we can move somewhere else and the data will be already there.
We won't necessarily be doing that now, but if the need comes and we didn't plan for it all along, we risk that transition being painful. By planning for this explicitly now we are minimizing that risk of breakage.
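
The "just try the next one" behavior for storage mirrors could look roughly like the sketch below. The `fetch_from_mirrors` function, the `PackageUnavailable` error, and the mirror URLs are hypothetical; the download function is passed in as a parameter so the example runs without network access.

```python
from typing import Callable, Iterable, List


class PackageUnavailable(Exception):
    """Raised when no mirror could serve the requested tarball."""


def fetch_from_mirrors(urls: Iterable[str], fetch: Callable[[str], bytes]) -> bytes:
    """Try each mirror URL in order and return the first successful download."""
    errors: List[str] = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as err:  # covers connection errors and urllib's URLError
            errors.append(f"{url}: {err}")
    raise PackageUnavailable("; ".join(errors) or "no mirrors configured")


def fake_fetch(url: str) -> bytes:
    """Stand-in downloader: the first mirror is down, the second one works."""
    if "mirror-a" in url:
        raise ConnectionError("mirror-a is down")
    return b"tarball bytes"


urls = [
    "https://mirror-a.example.org/prelude/5.0.0.tar.gz",
    "https://mirror-b.example.org/prelude/5.0.0.tar.gz",
]
print(fetch_from_mirrors(urls, fake_fetch))
```

This only works because the metadata in the registry repo stays consistent: the client already knows the package exists somewhere, so a failure on one mirror is a reason to move on rather than a sign that the package was never published.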
