Control the concurrent running of registry jobs #696

Open
thomashoneyman opened this issue Aug 31, 2024 · 0 comments

The registry supports an API for publishing, unpublishing, and transferring packages, as well as for running a compiler matrix for one or all packages and creating new package sets. These operations touch shared resources stored on GitHub and can depend on one another, so when concurrent API calls arrive we must carefully control the order in which the registry schedules the resulting operations.

Previously we couldn't do anything about concurrent jobs because we relied on GitHub Actions, but now that there's a proper registry server we can control scheduling ourselves. Here are some factors to keep in mind:

  1. Jobs cannot be run concurrently because they may rely on shared Git resources.
    Most data about packages is stored in Git; no two jobs can run concurrently if they read from or write to shared Git resources due to the risk of conflicts. For example, publishing a package affects its metadata, manifest entry, and potentially the new-packages.json file, and it reads the metadata and manifest entries for its dependencies at their solved versions. For simplicity, we've started with the rule 'no jobs may run concurrently', but we may choose to allow some limited concurrency (for example, allow package set jobs to run concurrently with publish jobs).

  2. Jobs may have different priorities.
    Some jobs are more urgent than others. Publishing is more important than a package set update. Since jobs cannot be run concurrently, when multiple jobs arrive at once we put them into a queue. This queue must take the job's priority into account, such that if 10 jobs are in the queue and only one is a 'Publish' job, then it should be taken next regardless of the time it arrived.

  3. Jobs may depend on one another.
    Some jobs depend on other jobs. If two jobs arrive to publish packages A@1.0.0 and B@1.0.0, and package B depends on package A at a range that includes 1.0.0, then package A should be published before package B, regardless of the order they originally arrived in.

  4. Jobs may spawn other jobs.
    After a package version is published, we go on to compute the various compiler versions it is compatible with.

These factors mean we must queue jobs instead of starting them as soon as they arrive, and the queue must be more sophisticated than a simple FIFO — we have to consider the priority of the job and its potential dependencies on other jobs also in the queue. If we choose to allow limited concurrency, then we will also have to consider what resources the job uses and only run it so long as no other job uses the same resources.
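
To make this selection rule concrete, here is a minimal sketch in Haskell (used purely for illustration throughout this issue; it is not the registry's implementation language). The Job shape, its fields, and the priority type are assumptions rather than an existing API:

import Data.List (sortOn)
import Data.Time (UTCTime)

data Priority = PackagePriority | MatrixPriority | PackageSetPriority
  deriving (Eq, Ord, Show)

data Job = Job
  { jobId :: Int
  , priority :: Priority
  , createdAt :: UTCTime
  , dependsOn :: [Int] -- jobIds that must complete before this job may run
  }

-- Pick the next job: highest-priority bucket first, then arrival time,
-- skipping any job whose dependencies are still in the queue.
nextJob :: [Job] -> Maybe Job
nextJob queue =
  case sortOn (\j -> (priority j, createdAt j)) runnable of
    [] -> Nothing
    (j : _) -> Just j
  where
    queuedIds = map jobId queue
    runnable = filter (all (`notElem` queuedIds) . dependsOn) queue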

Job Types

There are three broad types of jobs that the registry runs:

  1. Package jobs
    A user may publish, unpublish, or transfer a package.

  2. Matrix jobs
    The registry tests all compiler versions against every package version that is published. These jobs are spawned in one of two ways. First, when a package version is published, the registry creates an independent compiler test job for every compiler version the registry supports. Second, when a new compiler version is released, registry maintainers can submit a matrix request to the API, which creates an independent test job for every package version in the registry at that compiler version.

  3. Package set jobs
    A user may submit a package set update that adds, upgrades, downgrades, or removes packages from the most recent package set (downgrade/remove can only be performed by registry maintainers).

Matrix jobs targeting many package versions are special because they cannot simply be added to the job queue in any order. Instead, matrix jobs must be topologically sorted and then inserted into the queue so that they can be popped from the queue in the correct order.
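
For instance, the ordering can come from a Kahn-style topological sort over the dependency graph of the batch. A minimal sketch, assuming we can build a map from each package version to the set of its dependencies within the batch (that map is a hypothetical input):

import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type PackageVersion = (String, String) -- (name, version)

-- Order package versions so that dependencies are tested before dependents.
topoSort :: Map.Map PackageVersion (Set.Set PackageVersion) -> [PackageVersion]
topoSort deps = go (Map.keysSet deps)
  where
    go remaining
      | Set.null remaining = []
      | otherwise =
          let isReady pv =
                Set.null
                  (Map.findWithDefault Set.empty pv deps `Set.intersection` remaining)
              ready = Set.filter isReady remaining
          in  if Set.null ready
                then Set.toList remaining -- dependency cycle: fall back to arbitrary order
                else Set.toList ready ++ go (remaining `Set.difference` ready)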

System Overview

Currently, the registry works like this: when a new job is submitted to the API, we fork a new process that adds the job to the jobs table, runs the job, and then marks it as completed. However, this is unsuitable given the restrictions described above.

Instead, I propose a different model. The registry API exposes endpoints where you can submit jobs, but instead of executing them, we immediately store each job in the database. The registry server, at startup, forks a long-running job executor process that is responsible for selecting and executing jobs. At each tick (perhaps every second), the executor follows these steps:

  1. Find the highest-priority job available. If none are available, sleep, then return to (1).
  2. Execute the job.
  3. Mark the job completed and return to (1).

We broadly group jobs into three priority buckets; within any given bucket, jobs are picked up in first-come-first-served order.

  1. Package Jobs (Publish/Unpublish/Transfer)
  2. Matrix Jobs
  3. Package Set Jobs
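
Putting the tick steps and the bucket ordering together, the executor loop might look roughly like the sketch below. The Job type and the database helpers are hypothetical stubs standing in for whatever the real implementation provides:

import Control.Concurrent (threadDelay)
import Control.Monad (forever)

data Job = Job { jobId :: String }

-- Hypothetical helpers: selection consults package jobs, then matrix jobs,
-- then package set jobs, oldest first within each bucket.
selectNextJob :: IO (Maybe Job)
selectNextJob = pure Nothing

markStarted, markCompleted :: Job -> IO ()
markStarted _ = pure ()
markCompleted _ = pure ()

runJob :: Job -> IO ()
runJob _ = pure ()

-- The tick loop: find the highest-priority available job, execute it, mark
-- it completed; sleep for one tick when the queue is empty.
executorLoop :: IO ()
executorLoop = forever $ do
  mJob <- selectNextJob
  case mJob of
    Nothing -> threadDelay 1000000 -- one-second tick
    Just job -> do
      markStarted job
      runJob job
      markCompleted job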

We include two mechanisms to prevent the job executor loop from either dropping jobs that were interrupted (such as by a server restart) or freezing if a job takes too long.

  1. At startup, the executor will scan the database for jobs that have been started but not completed. If there is only one job, it will execute the job. If there are multiple jobs (somehow), then it will choose one to execute and will reset the ‘started’ time for the others.
  2. When a job is started, the job executor holds on to the fiber that’s running the job and starts a timer. If the job exceeds the timeout, then the job executor will kill and retry the job. After N retries, the job is marked as failed.
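
The timeout-and-retry mechanism can lean on the standard System.Timeout.timeout, which kills the running action with an asynchronous exception once the limit passes. A sketch, where the retry count and time limit are parameters and the IO action stands in for a real job:

import System.Timeout (timeout)

-- Run a job with a time limit, retrying up to the given number of attempts.
-- Returns True if some attempt finished within the limit, False if the job
-- should be marked failed.
runWithRetries :: Int -> Int -> IO () -> IO Bool
runWithRetries attempts limitMicros action
  | attempts <= 0 = pure False
  | otherwise = do
      result <- timeout limitMicros action
      case result of
        Just () -> pure True
        Nothing -> runWithRetries (attempts - 1) limitMicros action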

We also give the job executor an extra responsibility, as it owns the queue: when the core repositories have too many commits (more than 100k), it should squash-commit the repositories before picking up the next job so as to make sure consumers of the repositories don’t have to fetch too much data.
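
The commit-count check could be as simple as shelling out to git. A sketch, assuming the relevant repository checkout is the current working directory and using the 100k threshold from above (git rev-list --count HEAD prints the number of commits reachable from HEAD):

import System.Process (readProcess)

-- True when the repository has accumulated enough commits to warrant a squash.
needsSquash :: IO Bool
needsSquash = do
  output <- readProcess "git" ["rev-list", "--count", "HEAD"] ""
  pure (read (takeWhile (/= '\n') output) > (100000 :: Int))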

Matrix Jobs

A few extra notes about matrix jobs: each one pairs a compiler version with a package version, and we compile the package with that compiler to record whether the pair is compatible. This happens either post-publish (we test all supported compilers against the new package version) or on a new compiler release (we test the new compiler against all package versions).

These jobs have a couple extra quirks to them:

  1. Matrix jobs are run individually but are conceptually a bulk action — either "all compilers against package version X" or "compiler version X against all package versions". The reason we don't actually run them in bulk is because...
  2. Matrix jobs are low priority, so package jobs like publishing, unpublishing, and transferring should take priority over matrix jobs.
  3. Matrix jobs can depend on each other, because solving a package's build plan consults the compilers metadata field of its dependencies. Thus, for example, when testing a new compiler against all packages we need to topologically sort the jobs.

The issue with splitting a conceptual bulk action into individual jobs targeting a package version / compiler version is that we must be sure the registry doesn't change in a way that would cause two matrix jobs against package X@1.0.0 to have different results for any reason other than compilation failure. For example, if we run a few jobs against X@1.0.0, and then a dependency has a new version published, the remaining jobs against X@1.0.0 may solve to different build plans and could start failing. Or, if we choose to solve a matrix job at creation time and store the result in the queue, and a dependency is later unpublished, the stored plan can fail.

I'm not entirely sure what the best approach is here. I think we can get away with:

  1. On a new compiler release, solve and topologically sort all the package versions, and then insert them into the matrix queue in order. We can then run the queue based on creation time alone.
  2. On a new package publish, solve the matrix jobs right then and store the resulting build plans in the database with each job, so that if the registry state changes in the meantime the build plans are not affected.
  3. On an unpublish job, verify whether this unpublish will affect any queued matrix jobs. If so, then update the solutions for those matrix jobs.

But I'm open to ideas to make this more robust.
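
To illustrate point (3), here is a sketch of the unpublish check. Every name below (queuedMatrixJobs, solveBuildPlan, updateJobPlan, and the record fields) is hypothetical rather than an existing API:

data PackageVersion = PackageVersion { pvName :: String, pvVersion :: String }
  deriving (Eq)

data MatrixJob = MatrixJob
  { mjId :: String
  , mjTarget :: PackageVersion
  , mjPlan :: [PackageVersion] -- the build plan stored with the job
  }

-- When a version is unpublished, re-solve any queued matrix job whose stored
-- build plan includes that version.
onUnpublish :: PackageVersion -> IO ()
onUnpublish removed = mapM_ resolve . filter affected =<< queuedMatrixJobs
  where
    affected job = removed `elem` mjPlan job
    resolve job = do
      newPlan <- solveBuildPlan (mjTarget job)
      updateJobPlan (mjId job) newPlan

-- Stubs so the sketch stands alone:
queuedMatrixJobs :: IO [MatrixJob]
queuedMatrixJobs = pure []

solveBuildPlan :: PackageVersion -> IO [PackageVersion]
solveBuildPlan _ = pure []

updateJobPlan :: String -> [PackageVersion] -> IO ()
updateJobPlan _ _ = pure ()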

Database Design

Here is the current jobs table:

CREATE TABLE jobs (
  jobId text primary key not null,
  jobType text not null,
  packageName text not null,
  ref text not null,
  createdAt text not null,
  finishedAt text,
  success integer not null default 0
);

This design is unsuitable for capturing the various job types and priority ordering we need. Instead, we should separate jobs into multiple tables by their coarse type (package job, matrix job, package set job). Priority selection then amounts to looking first at the package jobs table, then the matrix jobs table, and so on.

-- Common job information table
CREATE TABLE job_info (
  jobId TEXT PRIMARY KEY NOT NULL,
  createdAt TEXT NOT NULL,
  startedAt TEXT,
  finishedAt TEXT,
  success INTEGER NOT NULL DEFAULT 0
);

-- Package-oriented jobs (publish/unpublish/transfer)
CREATE TABLE package_jobs (
  jobId TEXT PRIMARY KEY NOT NULL,
  jobType TEXT NOT NULL,
  packageName TEXT NOT NULL,
  packageVersion TEXT NOT NULL,
  payload JSON NOT NULL,
  FOREIGN KEY (jobId) REFERENCES job_info (jobId)
);

-- Compiler matrix jobs (one compiler, all packages)
CREATE TABLE matrix_jobs (
  jobId TEXT PRIMARY KEY NOT NULL,
  packageName TEXT NOT NULL,
  packageVersion TEXT NOT NULL,
  compilerVersion TEXT NOT NULL,
  -- the build plan, which should be computed before the job is stored in the queue
  -- so that if multiple jobs targeting one package get interrupted by a higher-
  -- priority job then the build plan is not affected.
  payload JSON NOT NULL,
  FOREIGN KEY (jobId) REFERENCES job_info (jobId)
);

-- Package set jobs
CREATE TABLE package_set_jobs (
  jobId TEXT PRIMARY KEY NOT NULL,
  payload JSON NOT NULL,
  FOREIGN KEY (jobId) REFERENCES job_info (jobId)
); 
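
To make the priority selection concrete, here is one way the executor could read from these tables, sketched with the sqlite-simple library purely for illustration. Each query takes the oldest unstarted job from its table, and the tables are tried in priority order; for matrix jobs this relies on rows having been inserted in topological order, as described earlier:

{-# LANGUAGE OverloadedStrings #-}

import Database.SQLite.Simple (Connection, Only (..), Query, query_)

-- Try each table in priority order; within a table, take the oldest job that
-- has not been started.
selectNextJobId :: Connection -> IO (Maybe String)
selectNextJobId conn = firstHit queries
  where
    queries :: [Query]
    queries =
      [ "SELECT j.jobId FROM package_jobs p JOIN job_info j ON j.jobId = p.jobId \
        \WHERE j.startedAt IS NULL ORDER BY j.createdAt ASC LIMIT 1"
      , "SELECT j.jobId FROM matrix_jobs m JOIN job_info j ON j.jobId = m.jobId \
        \WHERE j.startedAt IS NULL ORDER BY j.createdAt ASC LIMIT 1"
      , "SELECT j.jobId FROM package_set_jobs s JOIN job_info j ON j.jobId = s.jobId \
        \WHERE j.startedAt IS NULL ORDER BY j.createdAt ASC LIMIT 1"
      ]
    firstHit [] = pure Nothing
    firstHit (q : qs) = do
      rows <- query_ conn q
      case rows of
        (Only found : _) -> pure (Just found)
        [] -> firstHit qs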

Open Questions

  • Did we ever figure out what to do if a package version is unpublished and other packages depend on that version specifically? Do we allow that? Do we cascade the breakage?

  • GitHub Actions are a loophole that allows jobs to be run concurrently regardless of what the server is doing. Any solution to this issue should also include proxying GitHub issue API calls to the registry server (#649).
