
Each of the pieces of missing functionality introduced above is described in greater detail later in this technical note.

This technote is explicitly not about so-called "user batch" services, for which the architecture is still too uncertain to design against.
It does reference user batch in a few places where commonalities are clear.

Minimum Viable Satellite Butlers
================================

QuantumGraph Generation on RemoteButler
---------------------------------------

The current ``QuantumGraph`` generation algorithm relies extensively on temporary tables, which are emulated in ``RemoteButler`` by re-running the original query each time the temporary table is used.
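
The cost of this emulation is easy to see in a minimal sketch (illustrative Python only, not the actual ``RemoteButler`` code): every downstream use of an emulated temporary table pays the full query cost again.

.. code-block:: python

   class MaterializedTempTable:
       """Server-side temporary table: the query runs once and
       subsequent reads are cheap."""

       def __init__(self, run_query):
           self._rows = list(run_query())  # executed exactly once

       def rows(self):
           return self._rows


   class EmulatedTempTable:
       """Client-side emulation: the original query is re-executed
       on every access, multiplying the load on the database."""

       def __init__(self, run_query):
           self._run_query = run_query

       def rows(self):
           return list(self._run_query())  # full re-execution each time
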
User Interfaces
---------------

We need high-level command-line and notebook-friendly tools for at least the following tasks (a sketch of how these tools might fit together appears after the list):

- Create a new, empty satellite data repository from the official data repository.
This needs to set up a datastore that can reference official datasets (with URL signing), and it may be convenient for it to always register a few instruments and/or skymaps, and maybe even write curated calibrations.

- Create and run a ``QuantumGraph`` with inputs from the official data repository or the personal data repository and outputs going to the personal data repository, using ``QuantumBackedButler``.
In the minimum-viable system the input source would strictly be one repository OR the other, but in the future we may be able to support combinations, and should plan for that in the UI.
I would recommend splitting this up into multiple subcommands, centered around a user-visible directory that holds the ``QuantumGraph`` file and some transfer metadata, similar to how BPS interacts with its "submit" directory.
  I think BPS is overall a better UI starting point than ``pipetask``, but this is something we should put some real design thought and vetting into.
  The working directory for run state could even be the directory of the new RUN collection within the datastore root, since deleting that manually prior to the transfer job would actually be fine, but that may be a little too close to the edge of what would *not* be fine for the user to do manually.

- Explicitly transfer (by referencing) datasets from the official data repository to a personal one, using user-provided queries to identify the datasets.
This may be best implemented by augmenting ``butler transfer-from`` with a notebook-friendly interface and a way to default the source data repository, since the personal repository has to know something about the official repository anyway to sign URLs.
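
As a concrete but entirely hypothetical illustration of how these tools might be used from a notebook, consider the sketch below.
``Butler``, ``transfer_from``, and ``query_datasets`` are existing ``daf_butler`` APIs; the satellite-creation helper, the subcommand names, and the repository labels are invented for illustration.

.. code-block:: python

   from lsst.daf.butler import Butler

   # 1. Create an empty satellite repository that can reference (and
   #    sign URLs for) datasets in the official repository.
   #    ``create_satellite_repo`` is a proposed helper, not an existing API:
   # satellite = create_satellite_repo("~/my_repo", official="official-repo")

   # 2. Build and run a QuantumGraph with QuantumBackedButler, staged
   #    through a user-visible submit-like directory; these subcommand
   #    names are placeholders:
   #
   #    $ butler-satellite build-graph SUBMIT_DIR --pipeline my.yaml
   #    $ butler-satellite run SUBMIT_DIR
   #    $ butler-satellite transfer-outputs SUBMIT_DIR

   # 3. Explicitly transfer (by reference) official datasets into the
   #    personal repository, using the existing transfer_from API.
   official = Butler("official-repo")  # placeholder repository label
   personal = Butler("~/my_repo", writeable=True)
   refs = official.query_datasets(
       "calexp", collections="some/collection", where="visit = 12345"
   )
   personal.transfer_from(official, refs, transfer="direct")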


Further Extensions
==================

Multi-Butler QuantumGraph Generation
------------------------------------

Generating a ``QuantumGraph`` whose overall inputs come from more than one data repository is a hard problem because the algorithm does not know in advance which of several constraints (the user-provided query string, the existence of various dimension records, and the existence of the input datasets in the collection search path) best constrains the set of data IDs that will go into the graph.
Inferring which constraint to start with is entirely analogous to planning the execution of a SQL query, so our ``QuantumGraph`` generation algorithm delegates to the butler database by forming a single query with all of those constraints.
When there are multiple databases, this is impossible.

This approach already fails for some hard ``QuantumGraph`` generation problems, leading to the creation of the ``--dataset-query-constraint`` argument to ``pipetask``, which allows the user to indicate which overall-input datasets (if any) are the best ones to use to constrain the graph.
Using this option successfully typically requires expert help, however, so it is more of a workaround than a real solution.

In order to get multi-Butler ``QuantumGraph`` generation working well for end users, we need to come up with and test heuristics for splitting up the "best constraint" problem into multiple queries, often one for each data repository, that must be executed in a particular order.
It may also involve requiring the user to provide better hints about which constraints are likely to be relevant, but if so, they need to be more intuitive than ``--dataset-query-constraint``.
This work should be driven by the concrete ``QuantumGraph`` problems we expect to encounter in practice, and we expect it to proceed incrementally, with support for *some* kinds of multi-Butler ``QuantumGraph`` problems available well before others; fully general support may never happen.
Many common problems should be solvable with a single constraint query against one of the two repositories (most often the personal data repository, because it is smaller), and the challenge is recognizing these cases and identifying which of the two repositories to use.
This needs to be done with care, because a bad initial constraint query can be extremely expensive, and too many of these could overload the ``RemoteButler`` server.
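
A minimal sketch of the single-constraint heuristic, with entirely hypothetical query methods standing in for the real APIs:

.. code-block:: python

   def build_graph_inputs(personal, official, where):
       """Resolve overall inputs for a two-repository QuantumGraph
       using one initial constraint query (hypothetical interfaces)."""
       # Phase 1: run the expensive constraint query against the
       # personal repository, which is usually far smaller.
       data_ids = personal.query_data_ids(where)  # hypothetical call
       # Phase 2: per-repository dataset lookups constrained by those
       # data IDs, so no unconstrained scan ever hits the big official
       # database (or overloads the RemoteButler server).
       inputs = list(official.find_datasets(data_ids))  # hypothetical call
       inputs += personal.find_datasets(data_ids)
       return inputs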


Publishing Collections
----------------------

To facilitate sharing between science users, we would ideally provide a way for users to "publish" datasets and collections back to the official data repository as a federated data product.
This does not need to be fully automated, in the sense that we may want project staff to sign off on any publishing request, and it does not need to be immediate.
This greatly mitigates the problem of writing to the (in general) highly-replicated official data repository database; while the ideal scenario is a fully frozen database, a system composed of a single read-write database server and many read-only replicas should be possible as well.

Aside from the work involved in setting up the appropriate kind of database replication, we need to make sure publishing does not break any caching in the ``RemoteButler`` server, and then write new API endpoints and client-side UIs for initiating, reviewing, and completing publish requests.

A major open question is how the transfers of file artifacts and metadata would actually work.
The simplest scenario involves the client packaging up all of the content to be transferred into a ``zip`` or ``tar`` file and pushing that to the ``RemoteButler`` server, where it would have to be staged somewhere until the transfer is approved and completed.
More efficient options in which the server pulls the content from the personal data repository are much more likely to be viable if we provide personal PostgreSQL databases and/or object storage, as discussed in the next section.
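
A sketch of that simplest scenario, assuming a hypothetical ``/publish`` endpoint and response schema on the ``RemoteButler`` server (``zipfile`` and ``requests`` are standard Python libraries):

.. code-block:: python

   import io
   import zipfile

   import requests


   def package_and_push(artifact_paths, manifest_json, server_url):
       """Bundle file artifacts plus a metadata manifest and push them
       to the server, where they are staged pending approval."""
       buf = io.BytesIO()
       with zipfile.ZipFile(buf, "w") as zf:
           zf.writestr("manifest.json", manifest_json)  # datasets, collections
           for path in artifact_paths:
               zf.write(path)  # file artifacts to stage server-side
       response = requests.post(
           f"{server_url}/publish",  # hypothetical endpoint
           data=buf.getvalue(),
           headers={"Content-Type": "application/zip"},
       )
       response.raise_for_status()
       return response.json()["request_id"]  # hypothetical response schema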

The publishing mechanism is probably something we should try to share with any future user-batch implementation in which user-generated data products land in a butler data repository, regardless of whether that data repository is the same RSP personal data repository this technote describes.

We ultimately want the publishing system to support multiple levels of access control, in which users or groups retain ownership of the datasets and collections they have published, and can grant access to other users and groups instead of making them world-readable.
These access controls need to be implemented in the ``RemoteButler`` server (queries should not return datasets a user does not have access to) and the URL signing server.
Where to store permission state such as access control lists (in the ``RemoteButler`` database or a separate one) is an open question.
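
A minimal sketch of the query-side rule stated above, with the ACL storage deliberately abstracted away since its location is an open question:

.. code-block:: python

   def filter_visible(refs, user, groups, acls):
       """Drop dataset refs whose owning RUN collection does not grant
       read access to this user or one of their groups (hypothetical
       ACL representation: a set of user/group names per collection)."""
       visible = []
       for ref in refs:
           acl = acls.get(ref.run, set())
           if user in acl or groups & acl:
               visible.append(ref)
       return visible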

External Storage
================

Personal PostgreSQL Databases
-----------------------------

Providing personal PostgreSQL database space to science users (with a direct SQL driver, not an HTTP intermediary) is something the project is considering for reasons beyond Butler support.
If that functionality becomes available, we should strongly consider using it to back personal data repositories instead of relying on SQLite.
Having a separate namespace for Butler and other personal tables (i.e. no "one namespace per user" rule) is the only requirement we believe Butler usage would impose on a general personal-PostgreSQL system.
PostgreSQL-backed data repositories are much more scalable and have received most of the focus in Butler optimization work.
They can also be centrally managed, which may help us provide user support, and, unlike RSP SQLite databases, they could plausibly also be used for user batch.
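
A sketch of how a personal PostgreSQL-backed repository might be created, assuming ``daf_butler``'s support for a registry ``namespace`` (schema) option and using an illustrative connection string:

.. code-block:: python

   from lsst.daf.butler import Butler, Config

   seed = Config(
       {
           "registry": {
               # Illustrative DSN for the user's personal database.
               "db": "postgresql://user@dbhost/userdb",
               # Dedicated schema so butler tables do not collide with
               # the user's other tables in the same database.
               "namespace": "butler_myrepo",
           }
       }
   )
   Butler.makeRepo("~/my_pg_repo", config=seed)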

Personal PostgreSQL-backed butler databases do have some disadvantages:

- They are much harder to completely delete and reset (something inexpert users will want to do quite often).
- They are harder to share with other users (sharing a full SQLite data repository via filesystem permissions or copying is not ideal, but it may be fine for simple, common cases where a full "publish" request seems like overkill).
- If file artifact storage is still on the RSP filesystem, it may be hard to maintain data repository consistency, since the database could be somewhat centrally managed while the file artifacts would not be centrally managed at all.

Personal Object Storage
-----------------------

Personal space in an object store could be much cheaper than RSP filesystem storage, but it requires more sophisticated URL signing and permissions to allow users and possibly groups to own files, with access mediated by the Butler client.
Note that this is still a ``DirectButler`` that interacts with a server only for URL signing, as in the case of personal data repositories referencing official datasets.
This is at least similar to the functionality needed for user and group ownership and sharing of published datasets, but it may not be identical.
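
A sketch of that mediation, assuming a hypothetical ownership lookup; ``generate_presigned_url`` is the real ``boto3`` API:

.. code-block:: python

   import boto3


   def sign_get(bucket, key, user, acl_lookup, expires=600):
       """Return a short-lived signed URL for a read, after checking
       that the user may access the object."""
       if user not in acl_lookup(bucket, key):  # hypothetical ACL check
           raise PermissionError(f"{user} may not read s3://{bucket}/{key}")
       s3 = boto3.client("s3")
       return s3.generate_presigned_url(
           "get_object",
           Params={"Bucket": bucket, "Key": key},
           ExpiresIn=expires,
       )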

As noted earlier, personal object storage works best when paired with personal PostgreSQL database space rather than a SQLite database.
At the very least, the SQLite database itself cannot be accessed through object storage (POSIX filesystem access is required), and maintaining consistency between database and file storage will be easier if both are on the RSP filesystem or both are more centrally managed.
