it: zero-g git

it aims to augment git with primitives to build integrated, cryptographically verifiable collaboration workflows around source code. It maintains the distributed property of git, not requiring a central server. it is transport agnostic, and permits data dissemination in client-server, federated, as well as peer-to-peer network topologies.

Table of Contents

1. Introduction
- 1.1. Motivation
- 1.2. Overview
2. Conventions and Terminology
3. Formats
- 3.1. Signed Values
- 3.2. Common Types
4. Identities
- 4.1. Metadata
- 4.2. Verification
5. Patches
6. Drops
7. Future work
References

1. Introduction

1.1. Motivation

The checks and balances of Free and Open Source Software (FOSS) rest on the ability for anyone to contribute to, or diverge from (“fork”), a line of development freely and cheaply. As FOSS is defined by the community developing it, this extends to all artefacts of communication and collaboration, not just the source code itself. In other words, an open development model is a transparent process.

It is easy to see that this model necessitates data sovereignty: control over the data implies controlling participation.

Traditionally, this property has been approximated by using internet email for collaboration. While its simplicity as a medium has its merits, email is clearly declining in popularity for our purpose. We attribute this mainly to two weaknesses. First, intended primarily as a free-form medium, email lacks the programmability of the web, impeding innovation in both tooling and services. Second, the protocol is inherently prone to abuse by permitting unsolicited messages, and the response measures implemented over the years have amplified monopolization: today, it takes significant effort and expertise to maintain a mail exchanger independent of large providers (let alone one which hosts a mailing list fanning out messages to a potentially large number of subscribers).

It is not obvious, however, what an alternative could look like on a protocol level. Among the tradeoffs to consider is the tension between openness, addressability and availability, and which of these takes priority depends heavily on the situation. It thus seems unlikely that this tension can be resolved once and for all. Instead, we recognise it as desirable to provide the user with choices of transport methods. Or, put differently, that “the network is optional”, as Kleppmann et al. have called for in their essay on "Local-first software".

Git is prototypical of the local-first idea and provides data sovereignty, but only as long as we do not consider bidirectional collaboration: git commits do not commute, and so concurrent modifications do not converge, but must be explicitly linearised. This is not satisfying if we want to eliminate both intermediaries and online rendezvous. It is tempting to design a source code management and collaboration system from the ground up with commutativity in mind, yet git is so ubiquitous that we feel we cannot forgo presenting a solution which preserves the ability to use its existing toolchain and ecosystem. It turns out that, while it would be difficult to retrofit git into a proper, idealised local-first application, it is perfectly suitable for hosting such an application which models the collaboration process itself.

1.2. Overview

it is essentially a collection of datatypes.

We start by establishing identities (Section 4), which for our purposes only need to certify ownership of public keys. By using an extensible, human-readable metadata format, we leave it to the user to bind the identity to external identifiers or extend it with “profile” information in order to convey a persona. As the metadata can be conveniently managed using git, it can be published easily.

it inherits the paradigm of most distributed version control systems, where changes are exchanged as small increments (“patches”, Section 5), but generalises the concept to include both source code changes and associated data such as commentary. An it patch is thus similar to an email message, but mandates the associated data to be structured (as opposed to free-form). Ordering with respect to related patches is determined via git’s commit graph, optionally allowing for sophisticated shared state objects to be constructed if a [CRDT]-based payload is used.

Patches are recorded onto a log structure (“drop”, Section 6), for which we define a representation as a git commit history. The patch contents are, however, not stored directly in this structure, but redistributed verbatim. This is done so as to reduce data dissemination to mostly (static) file transmission, which opens up more choices for alternative transport protocols and minimises resource consumption imposed by dynamic repacking.

The drop is responsible for ensuring that the dependencies (or: prerequisites) of a patch are satisfied before recording it, enforcing that the partial ordering of related patches can be recovered. Apart from that, a drop does not provide any ordering guarantees, which means that independent drops may converge even though their (commit) hashes differ.

Finally, a drop is secured by a trust delegation scheme which authorises operations modifying its state. It also serves as a PKI, allowing verification of all signed objects it refers to.

Networking is exemplified by a simple HTTP API (Section 6.7), hinting at alternative protocols where appropriate. We envisage patch submission to give rise to gateway services, which may be elaborated on in future revisions of this document.

2. Conventions and Terminology

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119] and [RFC8174] when, and only when, they appear in all capitals, as shown here.

Familiarity with git concepts and data structures is assumed, and git terminology is used without further explanation. Refer to the [gitglossary] where needed.

3. Formats

3.1. Signed Values

Signed data items in it are encoded as a subset of JSON which disallows floating point numbers, and requires string values and object keys to be UTF-8 encoded. Signatures are obtained over the SHA-512 hash of the canonical form of the JSON object (hashing is used to minimise the payload size, which may be sent to an agent process for signing).

JSON values SHOULD be stored in pretty-printed form, with object keys sorted lexicographically.

Empty optional fields SHOULD NOT be omitted from the output, but be set to null if the value is a custom type represented by a JSON string, or the neutral element of the JSON type otherwise.

Unless otherwise noted, JSON arrays SHALL be treated as sets.

Where JSON data is signed inline, it is wrapped in an object:

{
    "signed": OBJECT,
    "signatures": {
        KEYID: SIGNATURE,
        ...
    }
}

OBJECT: A JSON object. Its canonical form is obtained as per [Canonical-JSON].
KEYID: The identifier of the key signing the OBJECT, which is the SHA-256 hash of the canonical form of the key, in hexadecimal.
SIGNATURE: The hex-encoded signature of the SHA-512 hash of the canonical form of OBJECT.
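
The wrapper described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `canonical_bytes` approximates [Canonical-JSON] with sorted keys and compact separators (the real rules, for example on string escaping, may differ), and the entries of `signers` are hypothetical callables standing in for whatever performs the actual signing, such as an agent process.

```python
import hashlib
import json


def check_no_floats(value):
    """Reject floating point numbers anywhere in a JSON-like value."""
    if isinstance(value, float):
        raise ValueError("floating point numbers are not allowed")
    if isinstance(value, dict):
        for item in value.values():
            check_no_floats(item)
    elif isinstance(value, list):
        for item in value:
            check_no_floats(item)


def canonical_bytes(obj):
    """Approximate canonical form (assumption, see lead-in)."""
    check_no_floats(obj)
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def signing_digest(obj):
    """SHA-512 digest of the canonical form; this is the payload to sign."""
    return hashlib.sha512(canonical_bytes(obj)).digest()


def wrap_signed(obj, signers):
    """Build the inline wrapper; `signers` maps KEYID -> callable(digest) -> bytes."""
    digest = signing_digest(obj)
    signatures = {key_id: sign(digest).hex() for key_id, sign in signers.items()}
    return {"signed": obj, "signatures": signatures}
```

Because only the fixed-size digest is handed to the signer, the size of the request does not grow with the size of the signed object.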

3.2. Common Types

BLOB_HASH

Hash of the payload p, as if created by [git-hash-object]. That is, for a hash algorithm H:

H('blob ' || LEN(p) || NUL || p)

CONTENT_HASH

Dictionary of both the SHA-1 and SHA-256 BLOB_HASH of the referenced object^[1]:

{
    "sha1": BLOB_HASH,
    "sha2": BLOB_HASH
}
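
As an illustration, both values can be computed with Python's standard library. The sketch assumes that LEN(p) is written as a decimal ASCII number, which is how git encodes object headers; the function names are chosen for this example only.

```python
import hashlib


def blob_hash(payload: bytes, algo: str = "sha1") -> str:
    """Hash `payload` as a git blob: H('blob ' || LEN(p) || NUL || p)."""
    header = b"blob " + str(len(payload)).encode("ascii") + b"\x00"
    return hashlib.new(algo, header + payload).hexdigest()


def content_hash(payload: bytes) -> dict:
    """CONTENT_HASH dictionary carrying both the SHA-1 and SHA-256 BLOB_HASH."""
    return {
        "sha1": blob_hash(payload, "sha1"),
        "sha2": blob_hash(payload, "sha256"),
    }
```

For a file on disk, `blob_hash(data)` should agree with what `git hash-object` prints for the same contents in a SHA-1 repository.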

DATETIME

Date-time string in [RFC3339] format, e.g. “2022-08-23T14:48:00Z”.

OBJECT_ID

Hexadecimal git object id.

FMT_VERSION

Version of a datatype, in “dotted triple” format. The semantics loosely follows the "Semantic Versioning" convention, but gives no significance to leading zeroes. That is, a major version of 1.x does not indicate that it is more stable than 0.x, but that it is not forward compatible with 0.x.

URL

A URL as per the WHATWG specification.

VARCHAR(N)

A UTF-8 encoded string of at most length N (in bytes).

4. Identities

Like most decentralised systems, it relies on public key cryptography to ensure authenticity of data. In order to manage and distribute public keys, it defines a simple, JSON-based format which can conveniently be stored in git.

The subject of an it identity is not inherently a human, it could just as well be a machine user such as a CI- or merge bot, or a group of users extending ultimate trust to each other. Consequently, it should not be assumed that ownership of the keys constituting the identity lies with a single actor in the system. It is, however, illegal to reuse keys for multiple identities within the same context.

The context of an identity is generally a drop. Thus, a subject may create as many identities as they see fit (provided keys are not reused). Conversely, the custom attribute of an id.json document permits associating an it identity with external methods certifying the subject’s persona, such as custodial identity providers or [DID] controllers (for example by embedding a DID document in the custom section).

In general, it does not specify how trust is initially established in an identity.

Identities in it are self-certifying, in that the introduction or revocation of keys is signed by a threshold of the specified keys themselves. A threshold greater than one reduces the probability of identity compromise, even if a subset of its keys is compromised. For usability reasons, owners of personal identities may want to set the threshold to 2 and carry a certification key on a portable device.

For practical reasons, it is RECOMMENDED for implementations to use the widely deployed [OpenSSH] suite for signing purposes, including for git commits. Verification of SSH-signed git commits (available since git version 2.34) MUST be supported. Via the [ssh-agent] protocol, alternative tooling is not precluded. All key algorithms and signature schemes supported by OpenSSH MUST be supported by it implementations. To make it easy for users to visually match output from OpenSSH with id.json documents, keys are encoded in the format used by OpenSSH.

Additional key algorithms, signature schemes or public key encodings may be introduced in the future.

4.1. Metadata

Identity information is stored in a JSON file, conventionally named id.json. The file’s contents can be amended using a threshold signature scheme, and revisions are hash-linked to their predecessors.

The signed portion of the id.json file is defined as follows:

{
    "_type": "eagain.io/it/identity",
    "fmt_version": FMT_VERSION,
    "prev": CONTENT_HASH | null,
    "keys": [
        KEY,
        ...
    ],
    "roles": {
        "root": {
            "keys": [KEYID],
            "threshold": THRESHOLD
        }
    },
    "mirrors": [
        URL,
        ...
    ],
    "expires": DATETIME | null,
    "custom": CUSTOM
}

KEY

Public key in SSH encoding, specified in [RFC4253], [RFC5656] and [RFC8709]. The comment or label part after the base64-encoded key SHOULD be omitted in the document.

Example:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDtt6XEdNVInhiKkX+ccN++Bk8kccdP6SeBPg0Aq8XFo

THRESHOLD

An integer number of keys whose signatures are required in order to consider the identity metadata to be properly signed. Must be between 1 and the number of keys in the metadata file.

The current FMT_VERSION of id.json is: 1.0.0.

4.2. Verification

Verification of an identity history proceeds as follows:

1. Load the latest known id.json metadata
2. If the expires attribute is not null, check that the specified DATETIME does not lie in the past. If it does, abort and report an error.
3. Let k be the subset of keys which have a corresponding entry in the roles.root.keys set. Verify that at least roles.root.threshold of k have provided valid signatures
4. If prev is not null, load the corresponding previous revision of the metadata
5. Let k' be the subset of keys of the previous revision which have a corresponding entry in the roles.root.keys set (also of the previous revision). Verify that at least threshold of k' have provided valid signatures over the current revision
6. Repeat steps 4. and 5. until prev is null
7. Compute the SHA-256 hash over the canonical form of the initial revision. This is the identity id.
8. If a particular identity id was expected, check that it matches the computed one
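
The procedure can be sketched as a loop over the revision chain. The following Python is a minimal illustration, not a reference implementation. It assumes that each revision is available in the inline-signed wrapper form of Section 3.1, that the identity id is computed over the signed portion of the initial revision, and that the canonical form can be approximated with sorted keys and compact separators. The callables `load_prev` (resolves a prev value to the previous revision), `key_id` (maps a KEY to its KEYID) and `verify` (checks one signature over a digest) are hypothetical and must be supplied by the caller.

```python
import hashlib
import json
from datetime import datetime, timezone


def canonical(obj):
    # Assumption: approximates [Canonical-JSON]
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def valid_root_signatures(doc, keys_from, key_id, verify):
    """Count root keys of `keys_from` that validly signed doc["signed"]."""
    root = set(keys_from["roles"]["root"]["keys"])
    digest = hashlib.sha512(canonical(doc["signed"])).digest()
    candidates = {key_id(key): key for key in keys_from["keys"]}
    count = 0
    for kid, key in candidates.items():
        sig = doc["signatures"].get(kid)
        if kid in root and sig is not None and verify(key, digest, sig):
            count += 1
    return count


def verify_identity(doc, load_prev, key_id, verify, now=None, expected_id=None):
    """Return the identity id, or raise ValueError if verification fails."""
    now = now or datetime.now(timezone.utc)
    signed = doc["signed"]
    # Expiry check on the latest revision
    if signed["expires"] is not None:
        expires = datetime.fromisoformat(signed["expires"].replace("Z", "+00:00"))
        if expires < now:
            raise ValueError("identity metadata has expired")
    # The latest revision must be signed by a threshold of its own root keys
    own = valid_root_signatures(doc, signed, key_id, verify)
    if own < signed["roles"]["root"]["threshold"]:
        raise ValueError("threshold not met by the current revision's keys")
    # Each revision must also be signed by a threshold of its predecessor's keys
    while doc["signed"]["prev"] is not None:
        prev = load_prev(doc["signed"]["prev"])
        needed = prev["signed"]["roles"]["root"]["threshold"]
        if valid_root_signatures(doc, prev["signed"], key_id, verify) < needed:
            raise ValueError("threshold not met by the previous revision's keys")
        doc = prev
    # `doc` is now the initial revision
    identity = hashlib.sha256(canonical(doc["signed"])).hexdigest()
    if expected_id is not None and identity != expected_id:
        raise ValueError("identity id mismatch")
    return identity
```

Passing the signature check in as a callable keeps the sketch independent of any particular key algorithm or signing tool.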

5. Patches

A source code patch is traditionally a differential between source code files. In practice, however, such diffs are seldom exchanged without additional context, usually prose describing and motivating the change.

During the process of accepting a patch into the mainline history of a project, collaborators may leave comments on the original submission, reference points may be annotated (“tagged”), and revised versions of the patch may be submitted. The degree to which this process is formalised varies between projects, as does the preference for capturing it in formal datastructures such as ticketing systems. A common property of all these different contributions to a code base is that they can be seen as state transitions, where the git commit chain helpfully provides a way to establish a partial ordering.

it seeks to unify all kinds of contributions into a single exchange format, a bundle, which is already native to git. The semantics of a bundle, apart from causal ordering, is defined by its contents, which makes the format amenable for future extensions.

In that sense, it aspirationally uses the term “patch” in the generalised way described by theoretical work such as [Darcs], [Pijul], [CaPT], and [HoPT]. When describing the more mundane processing procedures, the term “patch bundle” is also used, alluding to the container format.

5.1. Bundles

A patch bundle is a git bundle of either version supported by git (v2 or v3). If v3 is used, only the object-format capability is recognised; specifying an object filter is illegal.

For compatibility with git, prerequisite object ids MUST refer to commit objects, even though the format specification permits any object type.

The pack data of the bundle MUST NOT contain objects unreachable from the tips specified in the bundle header.
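
The header-level rules can be checked without unpacking anything. The sketch below parses a bundle header as laid out in git's bundle format documentation (a signature line, optional `@` capability lines for v3, `-` prefixed prerequisites, then references, terminated by an empty line) and rejects any capability other than object-format. The function name and the returned dictionary are assumptions made for this example. It does not check that prerequisites are commits or that the pack is free of unreachable objects, as both require access to the object data.

```python
import re

OID = re.compile(r"[0-9a-f]{40}|[0-9a-f]{64}")
VERSIONS = {"# v2 git bundle": 2, "# v3 git bundle": 3}


def parse_bundle_header(data: bytes) -> dict:
    """Parse the textual header that precedes the pack data of a git bundle."""
    head, sep, _pack = data.partition(b"\n\n")
    if not sep:
        raise ValueError("unterminated bundle header")
    signature, *lines = head.decode("utf-8").split("\n")
    if signature not in VERSIONS:
        raise ValueError("not a v2 or v3 git bundle")
    header = {
        "version": VERSIONS[signature],
        "object_format": "sha1",
        "prerequisites": [],
        "references": {},
    }
    for line in lines:
        if line.startswith("@"):
            # Capabilities exist in v3 only; a filter capability ends up here too
            key, _, value = line[1:].partition("=")
            if header["version"] != 3 or key != "object-format":
                raise ValueError("unsupported capability: " + key)
            header["object_format"] = value
            continue
        if line.startswith("-"):
            # Prerequisite: "-" object id, optionally followed by a comment
            oid = line[1:].partition(" ")[0]
            header["prerequisites"].append(oid)
        else:
            # Reference: object id, space, refname
            oid, _, refname = line.partition(" ")
            header["references"][refname] = oid
        if not OID.fullmatch(oid):
            raise ValueError("malformed object id: " + oid)
    return header
```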

Note

Enforcing this rule on the receiving end of a patch bundle may not be practical in some circumstances. Unreachable objects will automatically be purged if and when snapshots are taken (which imply repacking), but it is worth noting that there might be security implications of redistributing patch bundles which have not been verified to adhere to this rule, as it is possible to “hide” arbitrary objects in the bundle.

The bundle references may contain zero or more branches, tags or notes. A topic ref MUST be present. If identities need to be added or updated, zero or more ids refs may be present whose target either resolves directly to an updated id.json, or is peelable^[2] to a tree containing the updated document in a blob named id.json at the root.

Where more than one occurrence is permissible, the receiver MAY limit the total number of occurrences (see also Section 6.4.1).

More formally, the permissible references are (in ABNF notation):

refname  = topic / *identity / *branch / *tag / *note

topic    = "refs/it/topics/" TOPIC_ID
identity = "refs/it/ids/" [IDENTITY_ID]
branch   = "refs/heads/" name
tag      = "refs/tags/" name
note     = "refs/notes/" name

TOPIC_ID: SHA-256 hash, in hexadecimal. The preimage is opaque to it, but should be chosen by the initiator of a topic such that the probability of collisions with independently initiated topics is very low (for example the contents of the initial message combined with a random nonce).
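
A receiver might classify the references of an incoming bundle along these lines. The sketch assumes that IDENTITY_ID is a hexadecimal SHA-256 hash (as computed in Section 4.2) and that it is always present in an ids ref. It checks only the refname prefix for branches, tags and notes rather than full git refname well-formedness. `new_topic_id` follows the suggestion above of hashing the initial message together with a random nonce; the 16-byte nonce length is an arbitrary choice for this example.

```python
import hashlib
import os
import re

HEX64 = "[0-9a-f]{64}"
PATTERNS = {
    "topic": re.compile("refs/it/topics/" + HEX64),
    "identity": re.compile("refs/it/ids/" + HEX64),
    "branch": re.compile("refs/heads/.+"),
    "tag": re.compile("refs/tags/.+"),
    "note": re.compile("refs/notes/.+"),
}


def classify(refname):
    """Return the kind of a permissible refname, or None if it is not permitted."""
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(refname):
            return kind
    return None


def check_refs(refnames):
    """Raise ValueError unless all refs are permissible and a topic ref is present."""
    kinds = [classify(name) for name in refnames]
    if None in kinds:
        raise ValueError("bundle contains a reference outside the permitted set")
    if "topic" not in kinds:
        raise ValueError("bundle does not contain a topic ref")
    return kinds


def new_topic_id(initial_message: bytes) -> str:
    """Derive a fresh TOPIC_ID from the initial message and a random nonce."""
    return hashlib.sha256(initial_message + os.urandom(16)).hexdigest()
```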

The pack data section of a bundle MAY be encrypted using either [age] or GPG.

5.2. Topics

A topic is conceptually similar to a mailing list thread or structured data such as a “Pull Request”, in that it groups together related information. The stable identifier of a topic is a SHA-256 hash, the preimage of which is opaque to it.

A patch bundle MUST contain a topic commit history (refs/it/topics/…) containing objects which represent interactions such as free-form comments, code review annotations, attestations (“signoffs”) or results from CI services. The set of all histories referring to the same topic identifier forms a directed acyclic graph (DAG), usually a tree, yielding a partial order of topic entries.

If topic entries form a [CRDT], sophisticated “mutable” state objects can be constructed, resembling concepts commonly managed in a centralised fashion such as “Issues”, “Task trackers” or automated merge queues. However, not all workflows require this level of sophistication (namely the ability to change state collaboratively), and traversing a DAG of semi-structured, easily parseable data in topological order is sufficient. Examples of this include mailing-list style conversations or archives of external communication systems.

Hence, it mandates that topic histories can have one of two types: message based or CRDT based.

Message based topics consist of a single JSON object per commit, found in a file named m at the root of the commit’s tree. A message based topic is represented by its commit graph.
CRDT based topics consist of a single change object per commit, found in a file named c at the root of the commit’s tree. CRDT based topics are represented by a single object, to which changes are applied in the topological order of the commit graph.

Note

The [Automerge] CRDT is chosen for its generality. Future versions of this document may allow for other CRDTs to be used.

The exact encoding of Automerge changes for use with it is still under consideration. Since binary operation payloads are likely to be undesirable for the intended use, it may be preferable to define a textual encoding (such as JSON), which would make the stored data easier to inspect without specialised tooling.

Changing the type of a topic is illegal, and should result in the offending patch being rejected, or omitted during topic traversal.

In both paradigms, authenticity of authorship is inferred from the cryptographic signature of the individual commits. Dependencies, respectively reply-to relationships, are expressed as commit parents.
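
Since dependencies are commit parents, a topic can be replayed by visiting its commits so that every parent precedes its children. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) follows. The input mapping from commit id to parent ids is assumed to have been extracted from the commit graph beforehand, and the function name is chosen for this example.

```python
from graphlib import TopologicalSorter


def topic_order(parents):
    """Order topic commits so that every parent precedes its children.

    `parents` maps a commit id to the ids of its parents. Parents that
    are not themselves part of the topic history are ignored.
    """
    known = set(parents)
    graph = {
        commit: [p for p in commit_parents if p in known]
        for commit, commit_parents in parents.items()
    }
    # static_order() yields a node only after all of its predecessors
    return list(TopologicalSorter(graph).static_order())
```

For a message based topic the resulting order is a valid reading order of the conversation. For a CRDT based topic it is one order in which the change objects could be applied. Concurrent entries have no fixed relative position, which is why the result is only one of possibly many valid orderings.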

Note that no type or schema information is imposed. It is up to the client to interpret the data payload, and potentially omit unknown entries from the output.

5.3. Equivalence

Depending on context, two patch bundles are considered equivalent if:

1. The set of bundle reference targets is equal

   This means that the bundles logically carry the same information, which is preserved even if repacked (e.g. when snapshots are used). This equivalence is captured in the BUNDLE_HEADS value, which is the value a patch submitter signs and which determines whether a patch has been received before by a drop.

2. The union of the reference targets and prerequisite objects is equal

   When applied to an existing object database, the packfiles require the same objects to be present, and result in the same (reachable) state afterwards, and so are for practical purposes “the same”.

   However, packfile generation is not formally guaranteed to be deterministic across git implementations or -versions, namely the order of entries in the file. For long-term storage, patch bundles are thus referred to by their BUNDLE_HASH.

3. Or, the exact file contents are equal

   When downloading bundles from untrusted sources, or from content-addressable storage, the checksum of the exact contents should be verified. This information is preserved as the BUNDLE_CHECKSUM.

6. Drops

A drop (as in -box or deadletter-) is a hash-linked log which timestamps the reception of patches. In git terms, it is a history of (single-parent) commits, where integrity is ensured through git itself. To add authenticity, drops carry additional metadata which is secured using a scheme based on The Update Framework Specification (TUF).

A drop also carries all identities needed to verify cryptographic signatures on metadata, patches, and optionally git commits^[3], thus forming a PKI. Identities are themselves updated through patches.

Importantly, the drop history does not carry the patch payload itself. Patch bundles are kept and redistributed as received, and so can make heavy use of content distribution networks. At the same time, the drop history itself remains fairly small even if not delta-encoded. Together, this allows even public drops to be operated on relatively constrained hardware.

A drop is a strictly local-first concept — the drop history may never leave a single machine. In order to be able to accept patch proposals, however, a drop may make itself externally addressable, for example by exposing an HTTP API (see Section 6.7).

It is important to note that drop histories, even if they logically describe the same project, are not in principle required to converge. In git terms, this means that two drop histories may refer to the same set of patch bundles, but differ in the ordering of the commits (or other parameters which change the commit identity). Conversely, the respective sets of patch bundles may also be distinct, to the extent permitted by the connectivity requirement (see Section 6.4).

An exception to this rule are mirrors, whose network addresses are published as part of the drop metadata: the addresses listed therein are interchangeable, i.e. obtaining the drop history from any of them MUST result in the exact same state.

Instead of or in addition to exposing a public means of patch submission, drops may aggregate patches from other drops. That is, they may follow other drops just like a normal git remote, and apply patch records to their own history. By specifying alternates in the metadata, a drop promises to aggregate submissions from those locations. Aggregation is, however, not limited to published alternates: for example, a contributor may maintain their own private drop recording only the patches created by that contributor. Another drop for the same project may be made aware of a mirror URL for that private drop, and update itself from there periodically.

6.1. Metadata

The authenticity of drops is ensured by a trust delegation scheme derived from [TUF]. There, a role-based threshold signature scheme is used to prove authenticity of updates to certain parts of an abstract “repository”, including the metadata containing the trust delegations themselves.

For our purposes, some of the properties of a “repository” are upheld by git itself, while other roles are specific to it. There are four roles defined for it drops:

Root role
Snapshot role
Mirrors role
Branch roles

Like in TUF, the mirrors role is optional. Also like TUF, we note that it is possible to instantiate a drop with a single identity (and even with a single key) — which is not considered to be secure, but may be convenient in some scenarios.

Root role: The root role delegates trust to specific identities trusted for all other roles, by means of being eligible to sign updates to the drop.json metadata file.

Delegating to identities instead of directly to keys permits rotating the respective keys independently, thus weakening the requirement for air-gapped storage of all root keys.
Snapshot role: The snapshot role permits signing commits onto the drop history.

This applies mainly to new records. Note that such commits may also include updates to the metadata files; a snapshot signature does not, however, render those updates valid, as their signatures are verified independently.

The snapshot role is typically granted to machine users on public drop servers.

Snapshot signatures are regular git commit signatures. Pending a practical method to obtain multiple signatures on a git commit, threshold values other than 1 are not currently supported.
Mirrors role: The mirrors role permits signing the mirrors.json and alternates.json metadata files.

This role is optional, as not all drop operators may find it practical or useful to publish signed mirrors/alternates lists.
Branch roles: Branch roles are keyed by concrete reference names, which the listed identities are trusted to update (see Section 6.6).

The metadata files establishing the scheme are described in the following sections.

6.1.1. `drop.json`

The drop.json metadata file is signed by the root role and indicates which identities are authorised for all roles, including the root role itself.

The signed portion of the drop.json metadata file is defined as follows:

{
    "_type": "eagain.io/it/drop",
    "fmt_version": FMT_VERSION,
    "description": DESCRIPTION,
    "prev": CONTENT_HASH | null,
    "roles": {
        "root": ROLE,
        "snapshot": ROLE,
        "mirrors": ROLE,
        "branches": {
            REFNAME: ANNOTATED_ROLE,
            ...
        }
    },
    "custom": CUSTOM
}

ANNOTATED_ROLE

Like a ROLE, but with an additional field description of type DESCRIPTION.

{
    "ids": [
        [IDENTITY_ID],
        ...
    ],
    "threshold": THRESHOLD,
    "description": DESCRIPTION
}

CUSTOM

An arbitrary JSON object carrying user-defined data. To avoid conflicts, it is RECOMMENDED to key custom objects by a URL-like identifier. For example:

{
    "custom": {
        "eagain.io/it/emojicoin": {
            "insert-here": "lol1u2vgx76adff"
        }
    }
}

DESCRIPTION

A UTF-8 string with a maximum length of 128 bytes, i.e. a VARCHAR(128).

REFNAME

A full git refname (i.e. starting with “refs/”), well-formed as per [git-check-ref-format].

ROLE

Dictionary of a set of identity ids assigned to that role, paired with a threshold. I.e.:

{
    "ids": [
        [IDENTITY_ID],
        ...
    ],
    "threshold": THRESHOLD
}

Example:

{
    "ids": [
        "671e27d4cce92f747106c7da90bcc2be7072909afa304d008eb8ecbfdebfbfe2"
    ],
    "threshold": 1
}

The current FMT_VERSION of drop.json is: 0.2.0.

6.1.2. `mirrors.json`

The mirrors.json file is signed by the mirrors role. It describes known network addresses of read-only copies of the drop, which its operators believe to be kept in sync with the drop within a reasonable time window.

The signed portion of the mirrors.json file is defined as follows:

{
    "_type": "eagain.io/it/mirrors",
    "fmt_version": FMT_VERSION,
    "mirrors": [
        MIRROR,
        ...
    ],
    "expires": DATETIME | null
}

MIRROR

A dictionary describing a mirror.

{
    "url": URL,
    "kind": MIRROR_KIND,
    "custom": CUSTOM
}

MIRROR_KIND

Hint at what retrieval method is offered by the mirror. Unknown values MUST be accepted during parsing and signature verification. Defined values are:

bundled: the mirror is expected to serve patch bundles at the well-known HTTP endpoint relative to url, if url denotes an HTTP URL
packed: the mirror is a plain git server, but the client may reify bundles by requesting the appropriate objects over the regular git network protocol
sparse: the mirror does not host bundle data at all, only the drop history. This can be useful in constrained environments such as peer-to-peer storage if (and only if) the record.json entries specify stable bundle URIs.
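
Because unknown kinds must survive parsing, a client would typically filter for the kinds it knows how to use rather than reject the whole list. A small sketch, with the function name and the `supported` parameter being assumptions of this example:

```python
KNOWN_KINDS = ("bundled", "packed", "sparse")


def usable_mirrors(mirrors, supported=KNOWN_KINDS):
    """Return (kind, url) pairs for the mirrors this client can make use of.

    Entries with a kind outside `supported` are skipped silently, so that
    kinds defined by a later revision do not cause parsing to fail.
    """
    return [
        (mirror["kind"], mirror["url"])
        for mirror in mirrors
        if mirror.get("kind") in supported
    ]
```

A client without a git network transport could, for instance, pass `supported=("bundled",)` to restrict itself to mirrors serving bundle files.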

The current FMT_VERSION of mirrors.json is: 0.2.0.

6.1.3. `alternates.json`

The alternates.json file is signed by the mirrors role. It describes known network addresses of writeable (e.g. via HTTP) drops where patches pertaining to the same project may be submitted. The method of submission is described by the alternate’s URL. A drop publishing an alternates.json file implicitly promises to aggregate patches from the alternates listed, although it is free to do so only selectively.

The signed portion of the alternates.json file is defined as follows:

{
    "_type": "eagain.io/it/alternates",
    "fmt_version": FMT_VERSION,
    "alternates": [
        URL,
        ...
    ],
    "custom": CUSTOM,
    "expires": DATETIME | null
}

The current FMT_VERSION of alternates.json is: 0.2.0.

6.2. Verification

To verify a drop, the drop.json metadata file must be verified first:

1. From the latest known commit of the drop history, load the drop.json file
2. For each identity id in the root role of the file, resolve the corresponding identity and verify it^[4]
3. Verify that no key is being referenced by more than one identity
4. Verify that the drop.json file is signed by a threshold of identities as specified in the threshold attribute of the root role. Signatures by multiple keys from the same identity are allowed, but don’t count toward the threshold.
5. If prev is not null, load the corresponding previous revision of the metadata
6. Verify that the threshold specified in the previous revision is met on the current revision, loading and verifying additional identities as needed
7. Repeat steps 5. and 6. until prev is null
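
The two identity-level checks (no key shared between identities, and a threshold counted in identities rather than keys) can be sketched as follows. The sketch reads the threshold rule as "each identity counts at most once, however many of its keys have signed", which is an interpretation rather than something the text spells out. It also assumes that the identities have already been verified and that `valid_signers` holds the KEYIDs whose signatures were checked elsewhere; all names are chosen for this example.

```python
def key_owners(identities):
    """Map each KEYID to its identity id; raise if a key is shared.

    `identities` maps an identity id to the set of KEYIDs it contains.
    """
    owner = {}
    for identity_id, key_ids in identities.items():
        for key_id in key_ids:
            if owner.setdefault(key_id, identity_id) != identity_id:
                raise ValueError("key referenced by more than one identity: " + key_id)
    return owner


def threshold_met(role, identities, valid_signers):
    """Check a ROLE's threshold against a set of validly signing KEYIDs."""
    owner = key_owners({i: identities[i] for i in role["ids"]})
    # Collapse signing keys to their identities, so each identity counts once
    signing_identities = {owner[k] for k in valid_signers if k in owner}
    return len(signing_identities) >= role["threshold"]
```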

Having obtained a verified drop.json metadata file, it can now be verified that the head commit of the drop history is signed by a key belonging to an identity which is assigned the snapshot role.

If a mirrors.json and/or alternates.json is present in the head commit’s tree, it should be verified as follows:

1. Load the metadata file
2. If the expires attribute is not null, check that the specified DATETIME does not lie in the past
3. For each identity id in the mirrors role of the drop.json file, resolve the corresponding identity and verify it^[4]
4. Verify that the metadata file is signed by a threshold of identities as specified in the threshold attribute of the mirrors role. Signatures by multiple keys from the same identity are allowed, but don’t count toward the threshold.

Verification of mirror- and alternates-lists MAY be deferred until they are actually used. Failure to verify mirrors.json or alternates.json does not render the drop metadata invalid.

6.3. History representation

A drop history is stored as a git commit history. Initially, it contains only the metadata, organised in a tree with the following layout:

Figure 1. Drop metadata tree

.
|-- drop.json
|-- mirrors.json
|-- alternates.json
`-- ids
    |-- identity-id
    |   `-- id.json
    `-- ...

Note

In this document, tree entries are ordered for legibility, which is not necessarily how they are ordered by git.

In Figure 1, the mirrors.json and alternates.json files are optional. The ids hierarchy contains at least all identities needed to verify the metadata files, where the id.json file represents the most recent revision of the identity. It is up to the implementation how to make previous revisions available, although most are expected to opt for a “folded” representation where previous revisions are stored as files in a subdirectory.

A commit which updates metadata files may carry a free-form commit message. Data created by a previous patch commit SHOULD be removed from the tree.

To record a patch, the record.json is written to the tree adjacent to the other metadata files. If the patch contains identity updates, the ids subtree is updated accordingly.

The patch topic is written as a trailer keyed “Re:”, as shown in Figure 2. This allows patches for a particular topic to be collected from the drop history without having to access objects deeper than the commit.

Figure 2. Simplified topic commit

commit ccd1fd5736bed6fb6342e34c9d8cbc2b9db7f326
Author: Kim Altintop <kim@eagain.io>
Date:   Mon Dec 12 10:47:32 2022 +0100

    Re: 1fdc53e27b01b440839ff1b6c14ef81c3d63d0f2b39aae8fb4abd0b565ea0b10
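
Extracting the topic from a commit message then needs nothing more than a line-anchored pattern match. The sketch below is a simplification: it accepts a “Re:” line anywhere in the message, whereas git itself only treats the final paragraph as trailers. The function name is an assumption of this example.

```python
import re

# A "Re:" line carrying a hexadecimal SHA-256 topic id
TRAILER = re.compile(r"^Re:[ \t]*([0-9a-f]{64})[ \t]*$", re.MULTILINE)


def topic_of(commit_message):
    """Return the topic id named by the last "Re:" line, or None."""
    matches = TRAILER.findall(commit_message)
    return matches[-1] if matches else None
```

Grouping drop commits by the returned value yields all patches recorded for a topic without reading any tree or blob.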

Lastly, the BUNDLE_HEADS (cf. Section 5.3) are written to a file heads adjacent to the record.json file in the tree. Provided appropriate atomicity measures, this provides a reasonably efficient way to determine if a patch has been received before by simply probing the object database for existence of the corresponding BLOB_HASH.

6.3.1. Location-independent storage

Since the drop history only stores metadata, it should be suitable for location-independent storage inheriting some of git’s data model, e.g. [IPFS], [Hypercore], or [SSB]. Those systems come with their own limitations, perhaps the most severe one in our context being the lack of a reliable and efficient way to propagate contributions from unknown identities back to the root drop. Thus, exact mappings are deferred to a future revision of this document.

We note, however, that distributing git bundle snapshots of the drop history itself over protocols which support some form of name resolution (such as [IPNS]) may present an attractive bandwidth-sharing mechanism.

6.4. Recording patches

Once a patch has passed validation, its reception is recorded in the drop history as a file containing metadata about the patch. The file’s schema may be extended over time, where the currently defined properties are:

Figure 3. record.json

{
    "bundle": {
        "len": BUNDLE_SIZE,
        "hash": BUNDLE_HASH,
        "checksum": BUNDLE_CHECKSUM,
        "prerequisites": [
            OBJECT_ID,
            ...
        ],
        "references": {
            REFNAME: OBJECT_ID,
            ...
        },
        "encryption": "age" | "gpg",
        "uris": [
            URL,
            ...
        ]
    },
    "signature": {
        "signer": CONTENT_HASH,
        "signature": SIGNATURE
    }
}

BUNDLE_SIZE: Size in bytes of the bundle file as received.
BUNDLE_HASH: SHA-256 hash over the sorted set of object ids (in bytes) referenced by the bundle, i.e. both the prerequisites and reference heads.
BUNDLE_CHECKSUM: BLAKE3 hash over the bundle file as received^[5].
BUNDLE_SIGNATURE: Signature over the BUNDLE_HEADS, in hexadecimal.
BUNDLE_HEADS: SHA-256 hash over the sorted set of object ids (in bytes) of the reference heads of the bundle (i.e. without the prerequisites).
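
The two SHA-256 values above could be computed roughly as follows. This is a sketch under stated assumptions: object ids are decoded from hexadecimal to raw bytes, deduplicated, sorted bytewise, and concatenated without separators before hashing. The function names are hypothetical, and Section 5.3 remains the authoritative definition.

```python
import hashlib
from typing import Dict, Iterable


def hash_object_ids(object_ids_hex: Iterable[str]) -> str:
    """SHA-256 over the sorted set of object ids, taken as raw bytes."""
    # Assumption: ids are deduplicated and sorted bytewise, then concatenated.
    raw_ids = sorted({bytes.fromhex(oid) for oid in object_ids_hex})
    digest = hashlib.sha256()
    for oid in raw_ids:
        digest.update(oid)
    return digest.hexdigest()


def bundle_hash(prerequisites: Iterable[str], references: Dict[str, str]) -> str:
    """BUNDLE_HASH: covers both the prerequisites and the reference heads."""
    return hash_object_ids(list(prerequisites) + list(references.values()))


def bundle_heads(references: Dict[str, str]) -> str:
    """BUNDLE_HEADS: covers the reference heads only."""
    return hash_object_ids(references.values())
```

Since the input is treated as a sorted set, the result does not depend on the order in which references appear in the bundle header.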

The signature field captures the signature made by the submitter of the patch. Multiple signatures may be supported in a future revision of this document.

The uris field enumerates alternate network addresses from which the bundle file may be downloaded. Since the recorded information is immutable, this is mainly intended for content-based addresses, such as IPFS CIDs.

Additionally, the drop will want to record the hashed reference heads in an efficiently retrievable form, such that it can be quickly determined if a patch has been received before (see Section 5.3, Section 6.3). Similarly for the patch topic.

6.4.1. Validation

Accepting a patch for inclusion in the drop history is subject to validation rules, some of which depend on preferences or policies. A public drop server will want to apply stricter rules before accepting a patch than a user who is applying a patch to a local (unpublished) drop history.

The mandatory validations are:

1. The bundle file MUST be available locally before creating a log entry
2. The bundle MUST be connected, i.e. its prerequisite objects must be present in bundles received prior to the one under consideration
3. The bundle MUST NOT have been received before (cf. Section 5.3)
4. The bundle MUST conform to the conventions specified in Section 5
5. The bundle MUST be signed and the signer’s (i.e. submitter’s) identity resolvable, either from the drop state or the bundle contents (or both)
6. If the bundle contains identity updates, they MUST pass verification and MUST NOT diverge from their previously recorded history (if any)

Note: Validation 5 entails that a patch submission message must carry the CONTENT_HASH of the submitter’s identity head revision.
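
The connectivity and duplicate checks among the mandatory validations can be sketched as simple set operations. The names below are hypothetical, and the sketch assumes the drop keeps an in-memory set of object ids known from earlier bundles and a set of previously recorded BUNDLE_HEADS values; a real implementation would more likely probe the object database.

```python
from typing import Iterable, Set


class ValidationError(Exception):
    """Raised when a bundle fails a mandatory validation."""


def check_connected_and_new(
    prerequisites: Iterable[str],
    heads_hash: str,
    known_objects: Set[str],
    seen_heads: Set[str],
) -> None:
    # The bundle MUST be connected: all prerequisites were received earlier.
    missing = sorted(set(prerequisites) - known_objects)
    if missing:
        raise ValidationError(f"bundle is not connected, missing: {missing}")
    # The bundle MUST NOT have been received before.
    if heads_hash in seen_heads:
        raise ValidationError("bundle has been received before")
```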

Additional RECOMMENDED validations include:

restricting the size in bytes of the patch bundle
restricting the number of references a bundle can convey
restricting the number of commits, or total number of objects a bundle can contain
rejecting patches whose topic is not properly signed by the submitter, does not cleanly apply to a merged history of previously received patches on the same topic, or contains otherwise invalid data

Beyond that, a drop may also decide to reject a patch if it is encrypted, or if its contents do not pass content analysis proper (e.g. Bayesian filtering).

6.5. Snapshots

Over time, a drop will accumulate many small patch bundles. Repacking them into larger bundles is likely to reclaim storage space by offering more opportunities for delta compression. It can also benefit data synchronisation (especially non-incremental synchronisation) by avoiding excessive network roundtrips.

In principle, a drop could employ a dynamic repacking scheme, and either serve larger than requested data when individual bundles are requested, or offer a way to dynamically discover snapshotted alternatives via the bundle-uri negotiation mechanism (see Section 6.7.1). This would, however, preclude drops which delegate bundle storage entirely (such as packed or sparse mirrors) from benefiting from this optimisation. Therefore, we define a convention for publishing snapshots as patches on the drop itself.

A snapshot is a patch posted to the well-known topic SHA256("snapshots"), i.e.:

2b36a6e663158ffd942c174de74dbe163bfdb1b18f6d0ffc647e00647abca9bb
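
A well-known topic identifier of this kind could be derived as sketched below. The function name is hypothetical, and the sketch assumes the topic name is hashed as UTF-8 bytes without a trailing newline and rendered as lowercase hexadecimal.

```python
import hashlib


def topic_id(name: str) -> str:
    """Hex-encoded SHA-256 of a well-known topic name."""
    # Assumption: UTF-8 encoding, no trailing newline.
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


snapshots_topic = topic_id("snapshots")
```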

A snapshot bundle may either capture the entire history of the drop, or depend on an earlier snapshot. The bundle references capture all references of the patch bundles received prior to the snapshot, up until the previous snapshot if the snapshot is incremental. In order to be unique within the snapshot bundle, the patch bundle references are rewritten as follows:

Strip the refs/ prefix
Prepend refs/it/bundles/BUNDLE_HEADS/

For example:

refs/it/bundles/107e80b2287bc763d7a64bee9bc4401e12778c55925265255d4f2a38296262b8/heads/main 77ce512aa813988bdca54fa2ba5754f3a46c71f3
refs/it/bundles/107e80b2287bc763d7a64bee9bc4401e12778c55925265255d4f2a38296262b8/it/topics/c44c20434bfdaa0384b67d48d6c3bb36d755b87576027671f606c404b09d9774 65cdd5234e310efc1cb0afbc7de0a2786e6dd582
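
The two rewriting steps can be sketched as a small helper. The function name is hypothetical; rejecting names outside the refs/ namespace is an assumption, as this document does not say how such names should be handled.

```python
def snapshot_ref_name(refname: str, bundle_heads: str) -> str:
    """Rewrite a patch bundle reference for inclusion in a snapshot bundle."""
    prefix = "refs/"
    if not refname.startswith(prefix):
        # Assumption: names outside refs/ are rejected rather than passed through.
        raise ValueError(f"not a fully qualified reference name: {refname!r}")
    # Strip the refs/ prefix, then prepend refs/it/bundles/BUNDLE_HEADS/.
    return f"refs/it/bundles/{bundle_heads}/{refname[len(prefix):]}"
```

For instance, applied to refs/heads/main with the BUNDLE_HEADS value from the example above, this yields the first reference name shown there.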

The payload of the topic entry associated with a snapshot is not defined normatively. It is RECOMMENDED to use a message-based topic, where a payload schema could be:

{
    "_type": "eagain.io/it/notes/checkpoint",
    "kind": "snapshot",
    "refs": {
        REFNAME: OBJECT_ID,
        ...
    }
}

Taking a snapshot implies privileged access to the drop repository, and snapshots can only be submitted by the snapshot role.

After publishing a snapshot, a drop MAY prune patch bundles recorded prior to the snapshot, possibly after a grace period (for example, by only pruning bundles older than the N-1st snapshot). When synchronising with a drop, clients which encounter a snapshot record should thus prefer fetching only snapshots from this point on in the drop history.

6.6. Mergepoints

It is often useful for a drop to convey cryptographically verifiable reference points for contributors to base source code changes on, i.e. long-running branches.

While the process of agreeing on what changes are to be finalised into such branches can vary widely between projects, and could even involve the evaluation of CRDT state, the final statement can be reduced to restricting the set of allowed signers of a patch bundle (which updates a certain set of branches). This is what the branch roles in the drop.json metadata file are for: they make certain identities eligible for submitting mergepoints affecting named long-running branches.

A mergepoint is a patch posted to the well-known topic SHA256("merges"), i.e.:

c44c20434bfdaa0384b67d48d6c3bb36d755b87576027671f606c404b09d9774

A mergepoint bundle may contain one or more references matching exactly the names specified in the drop’s branch roles, and MUST only be accepted if the submitters’ identities are allowed as per the role definition.

As with snapshots, the topic payload is not defined normatively. It is RECOMMENDED to use a message-based topic, where a payload schema could be:

{
    "_type": "eagain.io/it/notes/checkpoint",
    "kind": "merge",
    "refs": {
        REFNAME: OBJECT_ID,
        ...
    }
}

Upon encountering a mergepoint properly signed by the applicable branch roles, a client may update the targets of a local representation of the mergepoint references iff the local targets are in the ancestry path of the mergepoint targets.
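
The update rule amounts to a fast-forward check. The following sketch models the commit graph as a plain mapping from commit id to parent ids, which is an assumption made for illustration; all names are hypothetical. Creating a local reference that does not exist yet is also an assumption, as the rule above only covers existing local targets.

```python
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant` (or equal to it)."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def apply_mergepoint(
    parents: Dict[str, List[str]],
    local: Dict[str, str],
    mergepoint: Dict[str, str],
) -> Dict[str, str]:
    """Return the local refs after applying a verified mergepoint."""
    updated = dict(local)
    for ref, target in mergepoint.items():
        current = local.get(ref)
        # Assumption: refs without a local counterpart are simply created.
        if current is None or is_ancestor(parents, current, target):
            updated[ref] = target
    return updated
```

A local branch that has diverged from the mergepoint target is left untouched, so the user can reconcile it manually.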

6.7. HTTP API

Drops MAY expose an HTTP API for accepting and serving patch bundles. Drops listed as alternates in the drop metadata MUST conform to this API (endpoint paths are interpreted as relative to the alternate URL). The defined endpoints of the API are as follows:

6.7.1. Fetching patch bundles

GET /bundles/bundle-hash[.bundle|.uris]

Without a file extension suffix, this endpoint conforms to the git [bundle-uri] specification: the server may either respond by sending the bundle file identified by bundle-hash, or a bundle list.

When responding with a bundle list:

mode MUST be any
<id> segments MUST be treated as opaque by the client
entries specifying a filter MUST be ignored by the client

In addition to regular uri values (relative, http://, https://), ipfs:// URLs are accepted. If encountered, a client MAY rewrite them to gateway URLs to fetch the bundle from.
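
Rewriting an ipfs:// URL to a gateway URL might look like the sketch below. The function name and the default gateway host are assumptions chosen for illustration; the sketch uses the path-style gateway layout, in which the content identifier follows an /ipfs/ path segment.

```python
def to_gateway_url(uri: str, gateway: str = "https://ipfs.io") -> str:
    """Rewrite ipfs:// URLs to path-style gateway URLs; pass others through."""
    scheme = "ipfs://"
    if not uri.startswith(scheme):
        return uri
    # Everything after the scheme (CID and optional path) goes under /ipfs/.
    return f"{gateway.rstrip('/')}/ipfs/{uri[len(scheme):]}"
```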

By specifying the .bundle suffix, a client instructs the server to either respond with the bundle file, or a 404 status, but never with a bundle list. Correspondingly, by specifying .uris, the server MUST respond with a bundle list, or a 404 status, but never with a bundle file.

Figure 4. Example bundle list

[bundle]
    version = 1
    mode = any
    heuristic = creationToken

[bundle "8aea1a1c20b09ed9ad4737adc6319203d65a0026ac86873f84f7961bd42f132c"]
    uri = /bundles/6c4d3d4e4db8e37c698f891e94780c63e1b94f87c67925cd30163915c7d7923e.bundle

[bundle "816dc1231cb1b82a91144ebb9e325c3655f4b4da30f806a84fa86fdb06ca9c04"]
    uri = https://it.example.com/bundles/6c4d3d4e4db8e37c698f891e94780c63e1b94f87c67925cd30163915c7d7923e.bundle
    creationToken = 1670838467

[bundle "f4ecc80c9339ecdbc2a8f4c0d19e990f8ee9631e6b7f3e044b86c35fe69505d3"]
    uri = ipfs://QmVTw4vVFWkPBN6ZT7To4BHoNBfaBNjVJ17wK15tci6bn1
    creationToken = 1670839391
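
A client could read a bundle list like the one in Figure 4 along the following lines. This is a deliberately simplified sketch, not a full parser for git's configuration syntax: it handles only section headers and key = value lines, treats identifiers as opaque strings, and drops entries that specify a filter. The function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches `[bundle]` and `[bundle "<id>"]` section headers.
SECTION = re.compile(r'^\[bundle(?:\s+"([^"]*)")?\]$')


def parse_bundle_list(text: str) -> Dict[str, Dict[str, str]]:
    """Map each opaque bundle <id> to its keys, e.g. uri and creationtoken."""
    bundles: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("["):
            match = SECTION.match(line)
            ident = match.group(1) if match else None
            # The global [bundle] section and unknown sections are skipped.
            current = bundles.setdefault(ident, {}) if ident is not None else None
            continue
        if current is not None and "=" in line:
            key, _, value = line.partition("=")
            # Keys are lowercased, since git config variable names are case-insensitive.
            current[key.strip().lower()] = value.strip()
    # Entries specifying a filter are ignored; entries without a uri are unusable.
    return {i: e for i, e in bundles.items() if "filter" not in e and "uri" in e}
```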

6.7.2. Submitting patches

POST /patches
HEADER_SIGNATURE

HEADER_SIGNATURE

A BUNDLE_SIGNATURE and the corresponding identity CONTENT_HASH, encoded suitably for use as an HTTP header value:

X-it-signature: s1={BLOB_HASH}; s2={BLOB_HASH}; sd={BUNDLE_SIGNATURE}

The body of this request is a bundle file. The bundle signature is transmitted as an HTTP header, allowing for the bundle file to be streamed directly from disk.
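
Formatting and parsing the header value could be sketched as follows. The function names are hypothetical. The sketch assumes that s1 and s2 carry the two BLOB_HASH components of the signer's CONTENT_HASH, that sd carries the hexadecimal BUNDLE_SIGNATURE, and that fields are separated by a semicolon with optional surrounding whitespace.

```python
from typing import Dict


def format_signature_header(s1: str, s2: str, sd: str) -> str:
    """Build the value of the X-it-signature header."""
    return f"s1={s1}; s2={s2}; sd={sd}"


def parse_signature_header(value: str) -> Dict[str, str]:
    """Split an X-it-signature value into its s1, s2 and sd fields."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if not sep:
            raise ValueError(f"malformed field: {part!r}")
        fields[key] = val
    missing = {"s1", "s2", "sd"} - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return fields
```

Keeping the signature out of the request body means a server can verify the header before it starts reading the bundle file.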

Once the drop server has received the request body, it attempts to record the patch, and responds with the corresponding record.json document, or an error.

Optionally, the server MAY accept a request of the form:

POST /patches/request-pull
Content-Type: application/x-www-form-urlencoded
HEADER_SIGNATURE

url=URL

If accepted, the server attempts to fetch the bundle file from the URL given in the form field before continuing as if the bundle had been submitted directly in the request body. Otherwise, the server responds with an error code in the 4xx range to indicate that this method of submission is not supported.

7. Future work

We found that git bundles are a simple yet effective container format. They are, however, not extensible: git, being the reference implementation, rejects bundles whose header does not exactly conform to the specified format. While compatibility with upstream git was a design goal for the current iteration of it, we may want to evolve the format independently in the future, e.g. by embedding cryptographic signatures directly in the file.

We have deliberately not mandated strict schema checking for topic payloads or CRDT objects, although we acknowledge that interoperability will eventually demand that some method be devised. Since the design space is quite large, ranging from static schema definitions to runtime evaluation of a dynamic interpreter, this would have been well beyond the scope of the current specification.

Acknowledgements

The author would like to thank Alex Good for a perpetual supply of ideas worth considering.

Copyright notice

Copyright © 2022-2023 Kim Altintop. This work is made available under the Creative Commons Attribution 4.0 International License. To the extent portions of it are incorporated into source code, such portions in the source code are licensed under either the Apache License 2.0 or the MIT license at your option.

References

[RFC2119]: https://datatracker.ietf.org/doc/html/rfc2119
[RFC3339]: https://datatracker.ietf.org/doc/html/rfc3339#section-5.6
[RFC4253]: https://datatracker.ietf.org/doc/html/rfc4253
[RFC5656]: https://datatracker.ietf.org/doc/html/rfc5656
[RFC8174]: https://datatracker.ietf.org/doc/html/rfc8174
[RFC8709]: https://datatracker.ietf.org/doc/html/rfc8709
[ssh-agent]: https://datatracker.ietf.org/doc/html/draft-miller-ssh-agent

[automerge-change]: https://alexjg.github.io/automerge-storage-docs/#change-reference
[Canonical-JSON]: http://wiki.laptop.org/go/Canonical_JSON
[DID]: https://www.w3.org/TR/did-core
[semver]: https://semver.org
[TUF]: https://theupdateframework.github.io/specification/latest
[WHATWG-URL]: https://url.spec.whatwg.org

[Apache-2]: https://www.apache.org/licenses/LICENSE-2.0
[CC-BY-SA-4]: https://creativecommons.org/licenses/by/4.0
[MIT]: https://spdx.org/licenses/MIT.html

[bundle-uri]: https://git-scm.com/docs/bundle-uri
[git-check-ref-format]: https://git-scm.com/docs/git-check-ref-format
[git-format-patch]: https://git-scm.com/docs/git-format-patch
[git-hash-object]: https://git-scm.com/docs/git-hash-object
[git-interpret-trailers]: https://git-scm.com/docs/git-interpret-trailers
[git]: https://git-scm.com
[gitformat-bundle]: https://git-scm.com/docs/gitformat-bundle
[gitglossary]: https://git-scm.com/docs/gitglossary

[Darcs]: https://en.wikibooks.org/wiki/Understanding_Darcs/Patch_theory
[Pijul]: https://pijul.org/manual/theory.html
[CaPT]: https://arxiv.org/abs/1311.3903
[HoPT]: https://www.cambridge.org/core/journals/journal-of-functional-programming/article/homotopical-patch-theory/42AD8BB8A91688BCAC16FD4D6A2C3FE7

[age]: https://age-encryption.org/v1
[Automerge]: https://automerge.org
[CRDT]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
[Hypercore]: https://hypercore-protocol.org
[IPFS-GATEWAY]: https://docs.ipfs.tech/concepts/ipfs-gateway
[IPFS]: https://ipfs.tech
[IPNS]: https://docs.ipfs.tech/concepts/ipns
[local-first]: https://www.inkandswitch.com/local-first/
[OpenSSH]: https://www.openssh.com
[SSB]: https://scuttlebutt.nz

1. Hashing with both the SHA-1 and SHA-256 algorithms allows internally-linked data to roam between git repositories with different object formats. We hope that when and if git introduces support for a new hash algorithm post SHA-256, it will also have interoperability implemented. Otherwise, the burden will fall on it implementations.

2. "Peeling" is git jargon for dereferencing the natural target of a git object until an object of the desired type is found.

3. it does not prescribe whether commits or tags pertaining to source code histories must be cryptographically signed. Due to the non-commutativity of git commits (their identity changes when reordered), it is highly dependent on the development model whether author signatures are preserved in published histories. Thus, we leave it to users to decide if signatures should be applied at the git level, or other forms of attestation (e.g. via topic entries) are employed.

4. Normally, identities must be resolvable within the same tree as the drop metadata. However, resolution may be substituted if e.g. the client believes it has more up-to-date identity data elsewhere.

5. BLAKE3 is a tree hash which remains stable in verified streaming mode (Bao), even with variable chunk sizes. This means that it is a good choice for long-term content addressability, in particular in location-independent storage. We don’t use BLAKE3 elsewhere, however, in order to maximise compatibility with git.
