Skip to content

Commit

Permalink
Merge branch 'tb/cruft-packs'
Browse files Browse the repository at this point in the history
A mechanism to pack unreachable objects into a "cruft pack",
instead of ejecting them into loose form to be reclaimed later, has
been introduced.

* tb/cruft-packs:
  sha1-file.c: don't freshen cruft packs
  builtin/gc.c: conditionally avoid pruning objects via loose
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: support generating a cruft pack
  builtin/pack-objects.c: --cruft with expiration
  reachable: report precise timestamps from objects in cruft packs
  reachable: add options to add_unseen_recent_objects_to_traversal
  builtin/pack-objects.c: --cruft without expiration
  builtin/pack-objects.c: return from create_object_entry()
  t/helper: add 'pack-mtimes' test-tool
  pack-mtimes: support writing pack .mtimes files
  chunk-format.h: extract oid_version()
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  pack-mtimes: support reading .mtimes files
  Documentation/technical: add cruft-packs.txt
  • Loading branch information
gitster committed Jun 3, 2022
2 parents 37d4ae5 + a613164 commit a50036d
Show file tree
Hide file tree
Showing 32 changed files with 1,853 additions and 102 deletions.
1 change: 1 addition & 0 deletions Documentation/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,7 @@ TECH_DOCS += MyFirstObjectWalk
TECH_DOCS += SubmittingPatches
TECH_DOCS += ToolsForGit
TECH_DOCS += technical/bundle-format
TECH_DOCS += technical/cruft-packs
TECH_DOCS += technical/hash-function-transition
TECH_DOCS += technical/http-protocol
TECH_DOCS += technical/index-format
Expand Down
21 changes: 14 additions & 7 deletions Documentation/config/gc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -81,14 +81,21 @@ gc.packRefs::
to enable it within all non-bare repos or it can be set to a
boolean value. The default is `true`.

gc.cruftPacks::
Store unreachable objects in a cruft pack (see
linkgit:git-repack[1]) instead of as loose objects. The default
is `false`.

gc.pruneExpire::
When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
Override the grace period with this config variable. The value
"now" may be used to disable this grace period and always prune
unreachable objects immediately, or "never" may be used to
suppress pruning. This feature helps prevent corruption when
'git gc' runs concurrently with another process writing to the
repository; see the "NOTES" section of linkgit:git-gc[1].
When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
cruft packs via `gc.cruftPacks` or `--cruft`). Override the
grace period with this config variable. The value "now" may be
used to disable this grace period and always prune unreachable
objects immediately, or "never" may be used to suppress pruning.
This feature helps prevent corruption when 'git gc' runs
concurrently with another process writing to the repository; see
the "NOTES" section of linkgit:git-gc[1].

gc.worktreePruneExpire::
When 'git gc' is run, it calls
Expand Down
9 changes: 9 additions & 0 deletions Documentation/config/repack.txt
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,12 @@ repack.updateServerInfo::
If set to false, linkgit:git-repack[1] will not run
linkgit:git-update-server-info[1]. Defaults to true. Can be overridden
when true by the `-n` option of linkgit:git-repack[1].

repack.cruftWindow::
repack.cruftWindowMemory::
repack.cruftDepth::
repack.cruftThreads::
Parameters used by linkgit:git-pack-objects[1] when generating
a cruft pack and the respective parameters are not given over
the command line. See similarly named `pack.*` configuration
variables for defaults and meaning.
5 changes: 5 additions & 0 deletions Documentation/git-gc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
be performed as well.


--cruft::
When expiring unreachable objects, pack them separately into a
cruft pack instead of storing the loose objects as loose
objects.

--prune=<date>::
Prune loose objects older than date (default is 2 weeks ago,
overridable by the config variable `gc.pruneExpire`).
Expand Down
30 changes: 30 additions & 0 deletions Documentation/git-pack-objects.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ SYNOPSIS
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=<n>] [--depth=<n>]
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
[--cruft] [--cruft-expiration=<time>]
[--stdout [--filter=<filter-spec>] | <base-name>]
[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>

Expand Down Expand Up @@ -95,6 +96,35 @@ base-name::
Incompatible with `--revs`, or options that imply `--revs` (such as
`--all`), with the exception of `--unpacked`, which is compatible.

--cruft::
Packs unreachable objects into a separate "cruft" pack, denoted
by the existence of a `.mtimes` file. Typically used by `git
repack --cruft`. Callers provide a list of pack names and
indicate which packs will remain in the repository, along with
which packs will be deleted (indicated by the `-` prefix). The
contents of the cruft pack are all objects not contained in the
surviving packs which have not exceeded the grace period (see
`--cruft-expiration` below), or which have exceeded the grace
period, but are reachable from an other object which hasn't.
+
When the input lists a pack containing all reachable objects (and lists
all other packs as pending deletion), the corresponding cruft pack will
contain all unreachable objects (with mtime newer than the
`--cruft-expiration`) along with any unreachable objects whose mtime is
older than the `--cruft-expiration`, but are reachable from an
unreachable object whose mtime is newer than the `--cruft-expiration`).
+
Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
options which imply `--revs`. Also incompatible with `--max-pack-size`;
when this option is set, the maximum pack size is not inferred from
`pack.packSizeLimit`.

--cruft-expiration=<approxidate>::
If specified, objects are eliminated from the cruft pack if they
have an mtime older than `<approxidate>`. If unspecified (and
given `--cruft`), then no objects are eliminated.

--window=<n>::
--depth=<n>::
These two options affect how the objects contained in
Expand Down
11 changes: 11 additions & 0 deletions Documentation/git-repack.txt
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,17 @@ to the new separate pack will be written.
Also run 'git prune-packed' to remove redundant
loose object files.

--cruft::
Same as `-a`, unless `-d` is used. Then any unreachable objects
are packed into a separate cruft pack. Unreachable objects can
be pruned using the normal expiry rules with the next `git gc`
invocation (see linkgit:git-gc[1]). Incompatible with `-k`.

--cruft-expiration=<approxidate>::
Expire unreachable objects older than `<approxidate>`
immediately instead of waiting for the next `git gc` invocation.
Only useful with `--cruft -d`.

-l::
Pass the `--local` option to 'git pack-objects'. See
linkgit:git-pack-objects[1].
Expand Down
123 changes: 123 additions & 0 deletions Documentation/technical/cruft-packs.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
= Cruft packs

The cruft packs feature offer an alternative to Git's traditional mechanism of
removing unreachable objects. This document provides an overview of Git's
pruning mechanism, and how a cruft pack can be used instead to accomplish the
same.

== Background

To remove unreachable objects from your repository, Git offers `git repack -Ad`
(see linkgit:git-repack[1]). Quoting from the documentation:

[quote]
[...] unreachable objects in a previous pack become loose, unpacked objects,
instead of being left in the old pack. [...] loose unreachable objects will be
pruned according to normal expiry rules with the next 'git gc' invocation.

Unreachable objects aren't removed immediately, since doing so could race with
an incoming push which may reference an object which is about to be deleted.
Instead, those unreachable objects are stored as loose objects and stay that way
until they are older than the expiration window, at which point they are removed
by linkgit:git-prune[1].

Git must store these unreachable objects loose in order to keep track of their
per-object mtimes. If these unreachable objects were written into one big pack,
then either freshening that pack (because an object contained within it was
re-written) or creating a new pack of unreachable objects would cause the pack's
mtime to get updated, and the objects within it would never leave the expiration
window. Instead, objects are stored loose in order to keep track of the
individual object mtimes and avoid a situation where all cruft objects are
freshened at once.

This can lead to undesirable situations when a repository contains many
unreachable objects which have not yet left the grace period. Having large
directories in the shards of `.git/objects` can lead to decreased performance in
the repository. But given enough unreachable objects, this can lead to inode
starvation and degrade the performance of the whole system. Since we
can never pack those objects, these repositories often take up a large amount of
disk space, since we can only zlib compress them, but not store them in delta
chains.

== Cruft packs

A cruft pack eliminates the need for storing unreachable objects in a loose
state by including the per-object mtimes in a separate file alongside a single
pack containing all loose objects.

A cruft pack is written by `git repack --cruft` when generating a new pack.
linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
is a classic all-into-one repack, meaning that everything in the resulting pack is
reachable, and everything else is unreachable. Once written, the `--cruft`
option instructs `git repack` to generate another pack containing only objects
not packed in the previous step (which equates to packing all unreachable
objects together). This progresses as follows:

1. Enumerate every object, marking any object which is (a) not contained in a
kept-pack, and (b) whose mtime is within the grace period as a traversal
tip.

2. Perform a reachability traversal based on the tips gathered in the previous
step, adding every object along the way to the pack.

3. Write the pack out, along with a `.mtimes` file that records the per-object
timestamps.

This mode is invoked internally by linkgit:git-repack[1] when instructed to
write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
of packs which will not be deleted by the repack; in other words, they contain
all of the repository's reachable objects.

When a repository already has a cruft pack, `git repack --cruft` typically only
adds objects to it. An exception to this is when `git repack` is given the
`--cruft-expiration` option, which allows the generated cruft pack to omit
expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
later on.

It is linkgit:git-gc[1] that is typically responsible for removing expired
unreachable objects.

== Caution for mixed-version environments

Repositories that have cruft packs in them will continue to work with any older
version of Git. Note, however, that previous versions of Git which do not
understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
all of the objects in it. In other words, do not expect older (pre-cruft pack)
versions of Git to interpret or even read the contents of the `.mtimes` file.

Note that having mixed versions of Git GC-ing the same repository can lead to
unreachable objects never being completely pruned. This can happen under the
following circumstances:

- An older version of Git running GC explodes the contents of an existing
cruft pack loose, using the cruft pack's mtime.
- A newer version running GC collects those loose objects into a cruft pack,
where the .mtime file reflects the loose object's actual mtimes, but the
cruft pack mtime is "now".

Repeating this process will lead to unreachable objects not getting pruned as a
result of repeatedly resetting the objects' mtimes to the present time.

If you are GC-ing repositories in a mixed version environment, consider omitting
the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
leaving the `gc.cruftPacks` configuration unset until all writers understand
cruft packs.

== Alternatives

Notable alternatives to this design include:

- The location of the per-object mtime data, and
- Storing unreachable objects in multiple cruft packs.

On the location of mtime data, a new auxiliary file tied to the pack was chosen
to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
support for optional chunks of data, it may make sense to consolidate the
`.mtimes` format into the `.idx` itself.

Storing unreachable objects among multiple cruft packs (e.g., creating a new
cruft pack during each repacking operation including only unreachable objects
which aren't already stored in an earlier cruft pack) is significantly more
complicated to construct, and so aren't pursued here. The obvious drawback to
the current implementation is that the entire cruft pack must be re-written from
scratch.
19 changes: 19 additions & 0 deletions Documentation/technical/pack-format.txt
Original file line number Diff line number Diff line change
Expand Up @@ -294,6 +294,25 @@ Pack file entry: <+

All 4-byte numbers are in network order.

== pack-*.mtimes files have the format:

All 4-byte numbers are in network byte order.

- A 4-byte magic number '0x4d544d45' ('MTME').

- A 4-byte version identifier (= 1).

- A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).

- A table of 4-byte unsigned integers. The ith value is the
modification time (mtime) of the ith object in the corresponding
pack by lexicographic (index) order. The mtimes count standard
epoch seconds.

- A trailer, containing a checksum of the corresponding packfile,
and a checksum of all of the above (each having length according
to the specified hash function).

== multi-pack-index (MIDX) files have the following format:

The multi-pack-index files refer to multiple pack-files and loose objects.
Expand Down
2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -740,6 +740,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
TEST_BUILTINS_OBJS += test-oidmap.o
TEST_BUILTINS_OBJS += test-oidtree.o
TEST_BUILTINS_OBJS += test-online-cpus.o
TEST_BUILTINS_OBJS += test-pack-mtimes.o
TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
Expand Down Expand Up @@ -996,6 +997,7 @@ LIB_OBJS += oidtree.o
LIB_OBJS += pack-bitmap-write.o
LIB_OBJS += pack-bitmap.o
LIB_OBJS += pack-check.o
LIB_OBJS += pack-mtimes.o
LIB_OBJS += pack-objects.o
LIB_OBJS += pack-revindex.o
LIB_OBJS += pack-write.o
Expand Down
10 changes: 9 additions & 1 deletion builtin/gc.c
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ static const char * const builtin_gc_usage[] = {

static int pack_refs = 1;
static int prune_reflogs = 1;
static int cruft_packs = 0;
static int aggressive_depth = 50;
static int aggressive_window = 250;
static int gc_auto_threshold = 6700;
Expand Down Expand Up @@ -152,6 +153,7 @@ static void gc_config(void)
git_config_get_int("gc.auto", &gc_auto_threshold);
git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
git_config_get_bool("gc.autodetach", &detach_auto);
git_config_get_bool("gc.cruftpacks", &cruft_packs);
git_config_get_expiry("gc.pruneexpire", &prune_expire);
git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
git_config_get_expiry("gc.logexpiry", &gc_log_expire);
Expand Down Expand Up @@ -331,7 +333,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
{
if (prune_expire && !strcmp(prune_expire, "now"))
strvec_push(&repack, "-a");
else {
else if (cruft_packs) {
strvec_push(&repack, "--cruft");
if (prune_expire)
strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
} else {
strvec_push(&repack, "-A");
if (prune_expire)
strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
Expand Down Expand Up @@ -551,6 +557,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
N_("prune unreferenced objects"),
PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
PARSE_OPT_NOCOMPLETE),
Expand Down Expand Up @@ -670,6 +677,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
die(FAILED_RUN, repack.v[0]);

if (prune_expire) {
/* run `git prune` even if using cruft packs */
strvec_push(&prune, prune_expire);
if (quiet)
strvec_push(&prune, "--no-progress");
Expand Down
Loading

0 comments on commit a50036d

Please sign in to comment.