From 799e1e739f70d603d511f399aba802133dd0e37f Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Fri, 27 Sep 2024 00:26:53 -0700 Subject: [PATCH 01/13] perf: report unused inputs for the tar rule The `mtree` spec passed to the `tar` rule very often selects a subset of the inputs made available through the `srcs` attribute. In many cases, these subsets do not break down cleanly along dependency-tree lines and there is no simple way just pass less content to the `tar` rule. One prominent example where this occurs is when constructing the tars for OCI image layers. For instance when [building a Python-based container image](https://github.com/bazel-contrib/rules_oci/blob/main/docs/python.md), we might want to split the Python interpreter, third-party dependencies, and application code into their own layers. This is done by [filtering the `mtree_spec`](https://github.com/aspect-build/bazel-examples/blob/85cb2aaf8c6e51d5e9e086cc94b94ab896903fb0/oci_python_image/py_layer.bzl#L39). However, in the operation to construct a `tar` from a subsetted mtree, it is usually still an unsubsetted tree of `srcs` that gets passed. As a result, the subset tarball is considered dependent upon a larger set of sources than is strictly necessary. This over-scoping runs counter to a very common objective associated with breaking up an image into layers - isolating churn to a smaller slice of the application. Because of the spurious relationships established in Bazel's dependency graph, all tars get rebuilt anytime any content in the application gets changed. Tar rebuilds can even be triggered by changes to files that are completely filtered-out from all layers of the container. Redundent creation of archive content is usually not too computationally intensive, but the archives can be quite large in some cases, and avoiding a rebuild might free up gigabytes of disk and/or network bandwidth for better use. In addition, eliminating the spurious dependency edges removes erroneous constraints applied to the build action schedule; these tend to push all Tar-building operations towards the end of a build, even when some archive construction could be scheduled much earlier. ## Risk assessment and mitigation The `unused_inputs_list` mechanism used to report spurious dependency relationships is a bit difficult to use. Reporting an actually-used input as unused can create difficult to diagnose problems down the line. However, the behaviour of the `mtree`-based `tar` rule is sufficiently simple and self-contained that I am fairly confident that this rule's used/unused set can be determined accurately in a maintainable fashion. Out of an abundance of caution I have gated this feature behind a default-off flag. The `tar` rule will continue to operate as it had before - typically over-reporting dependencies - unless the `--@aspect_bazel_lib//lib:tar_compute_unused_inputs` flag is passed. ### Filter accuracy The `vis` encoding used by the `mtree` format to resiliently handle path names has a small amount of "play" to it - it is reversable but the encoded representation of a string is not unique. Two unequal encoded strings might decode to the same value; this can happen when at least one of the encoded strings contains unnecessary escapes that are nevertheless honoured by the decoder. The unused-inputs set is determined using a filter that compares `vis`-encoded strings. In the presence of non-canonically-encoded paths, false-mismatches can lead to falsely reporting that an input is unused. The only `vis`-encoded path content that is under the control of callers is the `mtree` content itself; all other `vis`-encoded strings are constructed internally to this package, not exposed publicly, and are all derived using the `lib/private/tar.bzl%_vis_encode` function; all of these paths are expected to compare exactly. Additionally, it is expected that many/most users will use this package's helpers (e.g. `mtree_spec`) when crafting their mtree content; such content is also safe. It is only when the user crafts their own mtree, or modifies an mtree spec's `content=` fields' encoding in some way, that a risk of inaccurate reporting arises. The chances for this are expected to be minor since this seems like an inconvenient and not-particularly-useful thing for a user to go out of their way to do. --- lib/BUILD.bazel | 5 ++ lib/private/tar.bzl | 112 +++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 111 insertions(+), 6 deletions(-) diff --git a/lib/BUILD.bazel b/lib/BUILD.bazel index c1d808754..dce61bd2b 100644 --- a/lib/BUILD.bazel +++ b/lib/BUILD.bazel @@ -31,6 +31,11 @@ bool_flag( build_setting_default = True if is_bzlmod_enabled() else False, ) +bool_flag( + name = "tar_compute_unused_inputs", + build_setting_default = False, +) + config_setting( name = "enable_runfiles", values = {"enable_runfiles": "true"}, diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index 7e4866aaf..5c52911ec 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -1,5 +1,6 @@ "Implementation of tar rule" +load("@bazel_skylib//rules:common_settings.bzl", "BuildSettingInfo") load("//lib:paths.bzl", "to_repository_relative_path") TAR_TOOLCHAIN_TYPE = "@aspect_bazel_lib//lib:tar_toolchain_type" @@ -84,6 +85,7 @@ _tar_attrs = { doc = "Compress the archive file with a supported algorithm.", values = _ACCEPTED_COMPRESSION_TYPES, ), + "_compute_unused_inputs": attr.label(default = Label("//lib:tar_compute_unused_inputs")), } _mtree_attrs = { @@ -125,6 +127,89 @@ def _calculate_runfiles_dir(default_info): return manifest.short_path[:-9] fail("manifest path {} seems malformed".format(manifest.short_path)) +def _fmt_all_inputs_line(file): + # The tar.all_inputs.txt file has a two columns: + # 1. vis-encoded paths of the files, used in comparison + # 2. un-vis-encoded paths of the files, used for reporting back to Bazel after filtering + path = file.path + return _vis_encode(path) + " " + path + +def _fmt_keep_inputs_line(file): + # The tar.keep_inputs.txt file has a single column of vis-encoded paths of the files to keep. + return _vis_encode(file.path) + +def _configured_unused_inputs_file(ctx, srcs, keep): + """ + Compute the unused_inputs_list, if configured. + + Args: + ctx: `tar` rule context. Must provide `mtree` and `_compute_unused_inputs` attrs , and a `coreutils_toolchain_type` toolchain. + srcs: sequence or depset. The set of all input sources being provided to the `tar` rule. + keep: sequence or depset. A hardcoded set of sources to consider "used" regardless of whether or not they appear in the mtree. + + Returns: file or None. List of inputs unused by the `Tar` action. + """ + if not ctx.attr._compute_unused_inputs[BuildSettingInfo].value: + return None + + coreutils = ctx.toolchains["@aspect_bazel_lib//lib:coreutils_toolchain_type"].coreutils_info.bin + + all_inputs = ctx.actions.declare_file(ctx.attr.name + ".all_inputs.txt") + keep_inputs = ctx.actions.declare_file(ctx.attr.name + ".keep_inputs.txt") + unused_inputs = ctx.actions.declare_file(ctx.attr.name + ".unused_inputs.txt") + + ctx.actions.write( + output = all_inputs, + content = ctx.actions.args() + .set_param_file_format("multiline") + .add_all( + srcs, + map_each = _fmt_all_inputs_line, + ), + ) + ctx.actions.write( + output = keep_inputs, + content = ctx.actions.args() + .set_param_file_format("multiline") + .add_all( + keep, + map_each = _fmt_keep_inputs_line, + ), + ) + + # Unused inputs are inputs that: + # * are in the set of ALL_INPUTS + # * are not found in any content= keyword in the MTREE + # * are not in the hardcoded KEEP_INPUTS set + # + # Comparison and filtering of ALL_INPUTS is performed in the vis-encoded representation, stored in field 1, + # before being written out in the un-vis-encoded form Bazel understands, from field 2. + ctx.actions.run_shell( + outputs = [unused_inputs], + inputs = [all_inputs, keep_inputs, ctx.file.mtree], + tools = [coreutils], + command = ''' + "$COREUTILS" join -v 1 \\ + <("$COREUTILS" sort -u "$ALL_INPUTS") \\ + <("$COREUTILS" sort -u \\ + <(grep -o '\\bcontent=\\S*' "$MTREE" | "$COREUTILS" cut -c 9-) \\ + "$KEEP_INPUTS" \\ + ) \\ + | "$COREUTILS" cut -d' ' -f 2- \\ + > "$UNUSED_INPUTS" + ''', + env = { + "COREUTILS": coreutils.path, + "ALL_INPUTS": all_inputs.path, + "KEEP_INPUTS": keep_inputs.path, + "MTREE": ctx.file.mtree.path, + "UNUSED_INPUTS": unused_inputs.path, + }, + mnemonic = "UnusedTarInputs", + ) + + return unused_inputs + def _tar_impl(ctx): bsdtar = ctx.toolchains[TAR_TOOLCHAIN_TYPE] inputs = ctx.files.srcs[:] @@ -151,17 +236,29 @@ def _tar_impl(ctx): src[DefaultInfo].files_to_run.repo_mapping_manifest for src in ctx.attr.srcs ] - inputs.extend([m for m in repo_mappings if m != None]) + repo_mappings = [m for m in repo_mappings if m != None] + inputs.extend(repo_mappings) + + srcs_runfiles = [ + src[DefaultInfo].default_runfiles.files + for src in ctx.attr.srcs + ] + + unused_inputs_file = _configured_unused_inputs_file( + ctx, + srcs = depset(direct = ctx.files.srcs + repo_mappings, transitive = srcs_runfiles), + keep = [ctx.file.mtree, bsdtar.tarinfo.binary], + ) + if unused_inputs_file: + inputs.append(unused_inputs_file) ctx.actions.run( executable = bsdtar.tarinfo.binary, - inputs = depset(direct = inputs, transitive = [bsdtar.default.files] + [ - src[DefaultInfo].default_runfiles.files - for src in ctx.attr.srcs - ]), + inputs = depset(direct = inputs, transitive = [bsdtar.default.files] + srcs_runfiles), outputs = [out], arguments = [args], mnemonic = "Tar", + unused_inputs_list = unused_inputs_file, ) return DefaultInfo(files = depset([out]), runfiles = ctx.runfiles([out])) @@ -284,5 +381,8 @@ tar = rule( doc = "Rule that executes BSD `tar`. Most users should use the [`tar`](#tar) macro, rather than load this directly.", implementation = tar_lib.implementation, attrs = tar_lib.attrs, - toolchains = [tar_lib.toolchain_type], + toolchains = [ + tar_lib.toolchain_type, + "@aspect_bazel_lib//lib:coreutils_toolchain_type", + ], ) From 359512a7952a84e265763df1b184e91fe574cf1a Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 00:05:27 -0700 Subject: [PATCH 02/13] Also include other bsdtar toolchain files in keep set --- lib/private/tar.bzl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index 5c52911ec..5836c9677 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -247,7 +247,7 @@ def _tar_impl(ctx): unused_inputs_file = _configured_unused_inputs_file( ctx, srcs = depset(direct = ctx.files.srcs + repo_mappings, transitive = srcs_runfiles), - keep = [ctx.file.mtree, bsdtar.tarinfo.binary], + keep = depset(direct = [ctx.file.mtree, bsdtar.tarinfo.binary], transitive = [bsdtar.default.files]), ) if unused_inputs_file: inputs.append(unused_inputs_file) From f1492cf75c1fc5dce2e7aa0a4643afeda5855d9c Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 09:26:09 -0700 Subject: [PATCH 03/13] Add tri-state attribute to control unused-inputs behaviour This control surface provides for granular control of the feature. The interface is selected to mirror the common behaviour of `stamp` attributes. --- lib/private/tar.bzl | 56 ++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 53 insertions(+), 3 deletions(-) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index 5836c9677..afbaca564 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -85,7 +85,40 @@ _tar_attrs = { doc = "Compress the archive file with a supported algorithm.", values = _ACCEPTED_COMPRESSION_TYPES, ), - "_compute_unused_inputs": attr.label(default = Label("//lib:tar_compute_unused_inputs")), + "compute_unused_inputs": attr.int( + doc = """ +Whether to discover and prune input files that will not contribute to the archive. + +Unused inputs are discovered by comparing the set of input files in `srcs` to the set +of files referenced by `mtree`. Files not used for content by the mtree specification +will not be read by the `tar` tool when creating the archive and can be pruned from the +input set using the `unused_inputs_list` +[mechanism](https://bazel.build/contribute/codebase#input-discovery). + +Benefits: pruning unused input files can reduce the amount of work the build system must +perform. Pruned files are not included in the action cache key; changes to them do not +invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need +to block on the availability of pruned inputs, which can increase the available +parallelism of builds. Pruned files do not need to be transfered to remote-execution +workers, which can reduce network costs. + +Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The +comparison performed between `srcs` and `mtree` is currently inexact and may fail to +handle handwritten or externally-derived mtree specifications. However, it is safe to use +this feature when the lines found in `mtree` are derived from one or more `mtree_spec` +rules, filtered and/or merged on whole-line basis only. + +Possible values: + + - `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. + - `compute_unused_inputs = 0`: Never discover or prune unused inputs. + - `stamp = -1`: Discovery and pruning of unused inputs is controlled by the + --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. + """, + default = -1, + values = [-1, 0, 1], + ), + "_compute_unused_inputs_flag": attr.label(default = Label("//lib:tar_compute_unused_inputs")), } _mtree_attrs = { @@ -127,6 +160,23 @@ def _calculate_runfiles_dir(default_info): return manifest.short_path[:-9] fail("manifest path {} seems malformed".format(manifest.short_path)) +def _is_unused_inputs_enabled(attr): + """Determine whether or not to compute unused inputs. + + Args: + attr: `tar` rule ctx.attr struct. Must provide `_tar_attrs`. + + Returns: bool. Whether the unused_inputs_list file should be computed. + """ + if attr.compute_unused_inputs == 1: + return True + elif attr.compute_unused_inputs == 0: + return False + elif attr.compute_unused_inputs == -1: + return attr._compute_unused_inputs_flag[BuildSettingInfo].value + else: + fail("Unexpected `compute_unused_inputs` value: {}".format(attr.compute_unused_inputs)) + def _fmt_all_inputs_line(file): # The tar.all_inputs.txt file has a two columns: # 1. vis-encoded paths of the files, used in comparison @@ -143,13 +193,13 @@ def _configured_unused_inputs_file(ctx, srcs, keep): Compute the unused_inputs_list, if configured. Args: - ctx: `tar` rule context. Must provide `mtree` and `_compute_unused_inputs` attrs , and a `coreutils_toolchain_type` toolchain. + ctx: `tar` rule context. Must provide `_tar_attrs` and a `coreutils_toolchain_type` toolchain. srcs: sequence or depset. The set of all input sources being provided to the `tar` rule. keep: sequence or depset. A hardcoded set of sources to consider "used" regardless of whether or not they appear in the mtree. Returns: file or None. List of inputs unused by the `Tar` action. """ - if not ctx.attr._compute_unused_inputs[BuildSettingInfo].value: + if not _is_unused_inputs_enabled(ctx.attr): return None coreutils = ctx.toolchains["@aspect_bazel_lib//lib:coreutils_toolchain_type"].coreutils_info.bin From a2d0a85edabbef38d8a453e97734e35e4c09ed7a Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 09:56:50 -0700 Subject: [PATCH 04/13] Add bzl_library level dep --- lib/private/BUILD.bazel | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/lib/private/BUILD.bazel b/lib/private/BUILD.bazel index b4159239d..b4b4d3884 100644 --- a/lib/private/BUILD.bazel +++ b/lib/private/BUILD.bazel @@ -281,7 +281,10 @@ bzl_library( name = "tar", srcs = ["tar.bzl"], visibility = ["//lib:__subpackages__"], - deps = ["@aspect_bazel_lib//lib:paths"], + deps = [ + "@aspect_bazel_lib//lib:paths", + "@bazel_skylib//rules:common_settings", + ], ) bzl_library( From 59161455cea13b1a9fe71ff8bc486554895200c2 Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 09:57:15 -0700 Subject: [PATCH 05/13] Update docs --- docs/tar.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/tar.md b/docs/tar.md index 16c95960e..a4e818c2e 100644 --- a/docs/tar.md +++ b/docs/tar.md @@ -78,7 +78,7 @@ Create an mtree specification to map a directory hierarchy. See https://man.free ## tar_rule
-tar_rule(name, srcs, out, args, compress, mode, mtree)
+tar_rule(name, srcs, out, args, compress, compute_unused_inputs, mode, mtree)
 
Rule that executes BSD `tar`. Most users should use the [`tar`](#tar) macro, rather than load this directly. @@ -93,6 +93,7 @@ Rule that executes BSD `tar`. Most users should use the [`tar`](#tar) macro, rat | out | Resulting tar file to write. If absent, `[name].tar` is written. | Label | optional | `None` | | args | Additional flags permitted by BSD tar; see the man page. | List of strings | optional | `[]` | | compress | Compress the archive file with a supported algorithm. | String | optional | `""` | +| compute_unused_inputs | Whether to discover and prune input files that will not contribute to the archive.

Unused inputs are discovered by comparing the set of input files in `srcs` to the set of files referenced by `mtree`. Files not used for content by the mtree specification will not be read by the `tar` tool when creating the archive and can be pruned from the input set using the `unused_inputs_list` [mechanism](https://bazel.build/contribute/codebase#input-discovery).

Benefits: pruning unused input files can reduce the amount of work the build system must perform. Pruned files are not included in the action cache key; changes to them do not invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need to block on the availability of pruned inputs, which can increase the available parallelism of builds. Pruned files do not need to be transfered to remote-execution workers, which can reduce network costs.

Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The comparison performed between `srcs` and `mtree` is currently inexact and may fail to handle handwritten or externally-derived mtree specifications. However, it is safe to use this feature when the lines found in `mtree` are derived from one or more `mtree_spec` rules, filtered and/or merged on whole-line basis only.

Possible values:

- `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. - `compute_unused_inputs = 0`: Never discover or prune unused inputs. - `stamp = -1`: Discovery and pruning of unused inputs is controlled by the --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. | Integer | optional | `-1` | | mode | A mode indicator from the following list, copied from the tar manpage:

- create: Create a new archive containing the specified items. - append: Like `create`, but new entries are appended to the archive. Note that this only works on uncompressed archives stored in regular files. The -f option is required. - list: List archive contents to stdout. - update: Like `append`, but new entries are added only if they have a modification date newer than the corresponding entry in the archive. Note that this only works on uncompressed archives stored in regular files. The -f option is required. - extract: Extract to disk from the archive. If a file with the same name appears more than once in the archive, each copy will be extracted, with later copies overwriting (replacing) earlier copies. | String | optional | `"create"` | | mtree | An mtree specification file | Label | required | | From 916ee9fdada50be78d4dc8991e59c2d2ec9eeaef Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 10:03:10 -0700 Subject: [PATCH 06/13] pre-commit --- docs/tar.md | 2 +- lib/private/BUILD.bazel | 2 +- lib/private/tar.bzl | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/tar.md b/docs/tar.md index a4e818c2e..4661dab51 100644 --- a/docs/tar.md +++ b/docs/tar.md @@ -93,7 +93,7 @@ Rule that executes BSD `tar`. Most users should use the [`tar`](#tar) macro, rat | out | Resulting tar file to write. If absent, `[name].tar` is written. | Label | optional | `None` | | args | Additional flags permitted by BSD tar; see the man page. | List of strings | optional | `[]` | | compress | Compress the archive file with a supported algorithm. | String | optional | `""` | -| compute_unused_inputs | Whether to discover and prune input files that will not contribute to the archive.

Unused inputs are discovered by comparing the set of input files in `srcs` to the set of files referenced by `mtree`. Files not used for content by the mtree specification will not be read by the `tar` tool when creating the archive and can be pruned from the input set using the `unused_inputs_list` [mechanism](https://bazel.build/contribute/codebase#input-discovery).

Benefits: pruning unused input files can reduce the amount of work the build system must perform. Pruned files are not included in the action cache key; changes to them do not invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need to block on the availability of pruned inputs, which can increase the available parallelism of builds. Pruned files do not need to be transfered to remote-execution workers, which can reduce network costs.

Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The comparison performed between `srcs` and `mtree` is currently inexact and may fail to handle handwritten or externally-derived mtree specifications. However, it is safe to use this feature when the lines found in `mtree` are derived from one or more `mtree_spec` rules, filtered and/or merged on whole-line basis only.

Possible values:

- `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. - `compute_unused_inputs = 0`: Never discover or prune unused inputs. - `stamp = -1`: Discovery and pruning of unused inputs is controlled by the --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. | Integer | optional | `-1` | +| compute_unused_inputs | Whether to discover and prune input files that will not contribute to the archive.

Unused inputs are discovered by comparing the set of input files in `srcs` to the set of files referenced by `mtree`. Files not used for content by the mtree specification will not be read by the `tar` tool when creating the archive and can be pruned from the input set using the `unused_inputs_list` [mechanism](https://bazel.build/contribute/codebase#input-discovery).

Benefits: pruning unused input files can reduce the amount of work the build system must perform. Pruned files are not included in the action cache key; changes to them do not invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need to block on the availability of pruned inputs, which can increase the available parallelism of builds. Pruned files do not need to be transferred to remote-execution workers, which can reduce network costs.

Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The comparison performed between `srcs` and `mtree` is currently inexact and may fail to handle handwritten or externally-derived mtree specifications. However, it is safe to use this feature when the lines found in `mtree` are derived from one or more `mtree_spec` rules, filtered and/or merged on whole-line basis only.

Possible values:

- `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. - `compute_unused_inputs = 0`: Never discover or prune unused inputs. - `stamp = -1`: Discovery and pruning of unused inputs is controlled by the --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. | Integer | optional | `-1` | | mode | A mode indicator from the following list, copied from the tar manpage:

- create: Create a new archive containing the specified items. - append: Like `create`, but new entries are appended to the archive. Note that this only works on uncompressed archives stored in regular files. The -f option is required. - list: List archive contents to stdout. - update: Like `append`, but new entries are added only if they have a modification date newer than the corresponding entry in the archive. Note that this only works on uncompressed archives stored in regular files. The -f option is required. - extract: Extract to disk from the archive. If a file with the same name appears more than once in the archive, each copy will be extracted, with later copies overwriting (replacing) earlier copies. | String | optional | `"create"` | | mtree | An mtree specification file | Label | required | | diff --git a/lib/private/BUILD.bazel b/lib/private/BUILD.bazel index b4b4d3884..2e9654044 100644 --- a/lib/private/BUILD.bazel +++ b/lib/private/BUILD.bazel @@ -283,7 +283,7 @@ bzl_library( visibility = ["//lib:__subpackages__"], deps = [ "@aspect_bazel_lib//lib:paths", - "@bazel_skylib//rules:common_settings", + "@bazel_skylib//rules:common_settings", ], ) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index afbaca564..1402fe7d4 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -99,7 +99,7 @@ Benefits: pruning unused input files can reduce the amount of work the build sys perform. Pruned files are not included in the action cache key; changes to them do not invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need to block on the availability of pruned inputs, which can increase the available -parallelism of builds. Pruned files do not need to be transfered to remote-execution +parallelism of builds. Pruned files do not need to be transferred to remote-execution workers, which can reduce network costs. Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The @@ -286,7 +286,7 @@ def _tar_impl(ctx): src[DefaultInfo].files_to_run.repo_mapping_manifest for src in ctx.attr.srcs ] - repo_mappings = [m for m in repo_mappings if m != None] + repo_mappings = [m for m in repo_mappings if m != None] inputs.extend(repo_mappings) srcs_runfiles = [ From e8f81bef51414eab67916c3981df5977e4099770 Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 10:04:41 -0700 Subject: [PATCH 07/13] Add reminder to change flag default on major-version bump --- lib/BUILD.bazel | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/BUILD.bazel b/lib/BUILD.bazel index dce61bd2b..44b312801 100644 --- a/lib/BUILD.bazel +++ b/lib/BUILD.bazel @@ -31,6 +31,7 @@ bool_flag( build_setting_default = True if is_bzlmod_enabled() else False, ) +# TODO(3.0): change default to True. bool_flag( name = "tar_compute_unused_inputs", build_setting_default = False, From d60b03c02c456b632be97ba35340e5e94c230503 Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sat, 5 Oct 2024 11:03:18 -0700 Subject: [PATCH 08/13] Add note about how to make unused input computation exactly correct --- lib/private/tar.bzl | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index 1402fe7d4..ee9d66564 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -234,6 +234,9 @@ def _configured_unused_inputs_file(ctx, srcs, keep): # # Comparison and filtering of ALL_INPUTS is performed in the vis-encoded representation, stored in field 1, # before being written out in the un-vis-encoded form Bazel understands, from field 2. + # + # TODO: Make comparison exact by converting all inputs to a canonical vis-encoded form before comparing. + # See also: https://github.com/bazel-contrib/bazel-lib/issues/794 ctx.actions.run_shell( outputs = [unused_inputs], inputs = [all_inputs, keep_inputs, ctx.file.mtree], From 6d72717eaab3863699f12a13984c65b3f26f7bdb Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sun, 6 Oct 2024 10:45:24 -0700 Subject: [PATCH 09/13] Add a test for unused_inputs listing --- lib/private/tar.bzl | 13 +++++++++- lib/tests/tar/BUILD.bazel | 54 ++++++++++++++++++++++++++++++++++++++- lib/tests/tar/asserts.bzl | 42 ++++++++++++++++++++++++++++++ 3 files changed, 107 insertions(+), 2 deletions(-) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index ee9d66564..c561d1dab 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -314,7 +314,18 @@ def _tar_impl(ctx): unused_inputs_list = unused_inputs_file, ) - return DefaultInfo(files = depset([out]), runfiles = ctx.runfiles([out])) + # TODO(3.0): Always return a list of providers. + default_info = DefaultInfo(files = depset([out]), runfiles = ctx.runfiles([out])) + if unused_inputs_file: + return [ + default_info, + OutputGroupInfo( + # exposed for testing + _unused_inputs_file = depset([unused_inputs_file]), + ), + ] + else: + return default_info def _mtree_line(file, type, content = None, uid = "0", gid = "0", time = "1672560000", mode = "0755"): spec = [ diff --git a/lib/tests/tar/BUILD.bazel b/lib/tests/tar/BUILD.bazel index 300f56eb1..d2b499668 100644 --- a/lib/tests/tar/BUILD.bazel +++ b/lib/tests/tar/BUILD.bazel @@ -3,7 +3,7 @@ load("@aspect_bazel_lib//lib:diff_test.bzl", "diff_test") load("@aspect_bazel_lib//lib:tar.bzl", "mtree_mutate", "mtree_spec", "tar") load("@aspect_bazel_lib//lib:testing.bzl", "assert_archive_contains") load("@bazel_skylib//rules:write_file.bzl", "write_file") -load(":asserts.bzl", "assert_tar_listing") +load(":asserts.bzl", "assert_tar_listing", "assert_unused_listing") # The examples below work with both source files and generated files. # Here we generate a file to use in the examples. @@ -413,3 +413,55 @@ diff_test( file1 = ":strip_prefix14", file2 = "expected14.mtree", ) + +############# +# Example 15: mtree subsetting and unused-input pruning. + +copy_directory( + name = "unused_srcdir", + src = "srcdir", + out = "unused", +) + +mtree_spec( + name = "mtree15", + srcs = [ + ":treeartifact", + ], +) + +tar( + name = "tar15", + srcs = [ + "treeartifact", + ":mtree15", # Not in output archive, but cannot be pruned. + ":unused_srcdir", + ], + out = "15.tar", + compute_unused_inputs = 1, + mtree = ":mtree15", +) + +assert_tar_listing( + name = "test_unused_inputs_ignored", + actual = ":tar15", + expected = [ + "drwxr-xr-x 0 0 0 0 Jan 1 2023 lib/", + "drwxr-xr-x 0 0 0 0 Jan 1 2023 lib/tests/", + "drwxr-xr-x 0 0 0 0 Jan 1 2023 lib/tests/tar/", + "drwxr-xr-x 0 0 0 0 Jan 1 2023 lib/tests/tar/treeartifact/", + "-rwxr-xr-x 0 0 0 0 Jan 1 2023 lib/tests/tar/treeartifact/info", + "-rwxr-xr-x 0 0 0 0 Jan 1 2023 lib/tests/tar/treeartifact/pkg", + "-rwxr-xr-x 0 0 0 1 Jan 1 2023 lib/tests/tar/treeartifact/space in name.txt", + ], +) + +assert_unused_listing( + name = "test_unused_inputs_listed", + actual = ":tar15", + expected = [ + "lib/tests/tar/unused/info", + "lib/tests/tar/unused/pkg", + "lib/tests/tar/unused/space in name.txt", + ], +) diff --git a/lib/tests/tar/asserts.bzl b/lib/tests/tar/asserts.bzl index a35e19ab1..2884e5fd0 100644 --- a/lib/tests/tar/asserts.bzl +++ b/lib/tests/tar/asserts.bzl @@ -31,3 +31,45 @@ def assert_tar_listing(name, actual, expected): file2 = expected_listing, timeout = "short", ) + +# buildifier: disable=function-docstring +def assert_unused_listing(name, actual, expected): + actual_listing = native.package_relative_label("_{}_actual_listing".format(name)) + actual_shortnames = native.package_relative_label("_{}_actual_shortnames".format(name)) + actual_shortnames_file = native.package_relative_label("_{}.actual_shortnames".format(name)) + expected_listing = native.package_relative_label("_{}_expected".format(name)) + expected_listing_file = native.package_relative_label("_{}.expected".format(name)) + + native.filegroup( + name = actual_listing.name, + output_group = "_unused_inputs_file", + srcs = [actual], + testonly = True, + ) + + # Trim platform-specific bindir prefix from unused inputs listing. E.g. + # bazel-out/darwin_arm64-fastbuild/bin/lib/tests/tar/unused/info + # -> + # lib/tests/tar/unused/info + native.genrule( + name = actual_shortnames.name, + srcs = [actual_listing], + cmd = "sed 's!^bazel-out/[^/]*/bin/!!' $< >$@", + testonly = True, + outs = [actual_shortnames_file], + ) + + write_file( + name = expected_listing.name, + testonly = True, + out = expected_listing_file, + content = expected + [""], + newline = "unix", + ) + + diff_test( + name = name, + file1 = actual_shortnames, + file2 = expected_listing, + timeout = "short", + ) From c859573195e83327071e460d5ea50b27a35df65b Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sun, 6 Oct 2024 10:56:42 -0700 Subject: [PATCH 10/13] Support alternate contents= form This is accepted by bsdtar/libarchive. In fact `contents=` is the only of the pair documented in `mtree(5)`; `content=` is an undocumented alternate form supported by libarchive. --- lib/private/tar.bzl | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index c561d1dab..ca82f982a 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -229,12 +229,15 @@ def _configured_unused_inputs_file(ctx, srcs, keep): # Unused inputs are inputs that: # * are in the set of ALL_INPUTS - # * are not found in any content= keyword in the MTREE + # * are not found in any content= or contents= keyword in the MTREE # * are not in the hardcoded KEEP_INPUTS set # # Comparison and filtering of ALL_INPUTS is performed in the vis-encoded representation, stored in field 1, # before being written out in the un-vis-encoded form Bazel understands, from field 2. # + # Note: bsdtar (libarchive) accepts both content= and contents= to identify source file: + # ref https://github.com/libarchive/libarchive/blob/a90e9d84ec147be2ef6a720955f3b315cb54bca3/libarchive/archive_read_support_format_mtree.c#L1640 + # # TODO: Make comparison exact by converting all inputs to a canonical vis-encoded form before comparing. # See also: https://github.com/bazel-contrib/bazel-lib/issues/794 ctx.actions.run_shell( @@ -242,13 +245,13 @@ def _configured_unused_inputs_file(ctx, srcs, keep): inputs = [all_inputs, keep_inputs, ctx.file.mtree], tools = [coreutils], command = ''' - "$COREUTILS" join -v 1 \\ - <("$COREUTILS" sort -u "$ALL_INPUTS") \\ - <("$COREUTILS" sort -u \\ - <(grep -o '\\bcontent=\\S*' "$MTREE" | "$COREUTILS" cut -c 9-) \\ - "$KEEP_INPUTS" \\ - ) \\ - | "$COREUTILS" cut -d' ' -f 2- \\ + "$COREUTILS" join -v 1 \\ + <("$COREUTILS" sort -u "$ALL_INPUTS") \\ + <("$COREUTILS" sort -u \\ + <(grep -o '\\bcontents\\?=\\S*' "$MTREE" | "$COREUTILS" cut -d'=' -f 2-) \\ + "$KEEP_INPUTS" \\ + ) \\ + | "$COREUTILS" cut -d' ' -f 2- \\ > "$UNUSED_INPUTS" ''', env = { From 515377c99a624acef13b3422555d2e9106db052b Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Sun, 6 Oct 2024 11:16:20 -0700 Subject: [PATCH 11/13] Don't try to prune the unprunable Bazel's interpretation of unused_inputs_list cannot accomodate certain things in filenames. These are also likely to mess up our own line-oriented protocol in the shellscript that produces this file. --- lib/private/tar.bzl | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index ca82f982a..812bdccbd 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -177,8 +177,19 @@ def _is_unused_inputs_enabled(attr): else: fail("Unexpected `compute_unused_inputs` value: {}".format(attr.compute_unused_inputs)) -def _fmt_all_inputs_line(file): - # The tar.all_inputs.txt file has a two columns: +def _is_unprunable(file): + # Some input files cannot be pruned because their name will be misinterpreted by Bazel when reading the unused_inputs_list. + # * Filenames containing newlines will be mangled in the line-oriented format of the file. + # * Filenames with leading or trailing whitespace will be mangled by the call to String.trim(). + # ref https://github.com/bazelbuild/bazel/blob/678b01a512c0b28c87067cdf5a4e0224a82716c0/src/main/java/com/google/devtools/build/lib/analysis/actions/StarlarkAction.java#L357 + p = file.path + return p[0].isspace() or p[-1].isspace() or "\n" in p or "\r" in p + +def _fmt_pruanble_inputs_line(file): + if _is_unprunable(file): + return None + + # The tar.prunable_inputs.txt file has a two columns: # 1. vis-encoded paths of the files, used in comparison # 2. un-vis-encoded paths of the files, used for reporting back to Bazel after filtering path = file.path @@ -204,17 +215,17 @@ def _configured_unused_inputs_file(ctx, srcs, keep): coreutils = ctx.toolchains["@aspect_bazel_lib//lib:coreutils_toolchain_type"].coreutils_info.bin - all_inputs = ctx.actions.declare_file(ctx.attr.name + ".all_inputs.txt") + prunable_inputs = ctx.actions.declare_file(ctx.attr.name + ".prunable_inputs.txt") keep_inputs = ctx.actions.declare_file(ctx.attr.name + ".keep_inputs.txt") unused_inputs = ctx.actions.declare_file(ctx.attr.name + ".unused_inputs.txt") ctx.actions.write( - output = all_inputs, + output = prunable_inputs, content = ctx.actions.args() .set_param_file_format("multiline") .add_all( srcs, - map_each = _fmt_all_inputs_line, + map_each = _fmt_pruanble_inputs_line, ), ) ctx.actions.write( @@ -228,11 +239,11 @@ def _configured_unused_inputs_file(ctx, srcs, keep): ) # Unused inputs are inputs that: - # * are in the set of ALL_INPUTS + # * are in the set of PRUNABLE_INPUTS # * are not found in any content= or contents= keyword in the MTREE # * are not in the hardcoded KEEP_INPUTS set # - # Comparison and filtering of ALL_INPUTS is performed in the vis-encoded representation, stored in field 1, + # Comparison and filtering of PRUNABLE_INPUTS is performed in the vis-encoded representation, stored in field 1, # before being written out in the un-vis-encoded form Bazel understands, from field 2. # # Note: bsdtar (libarchive) accepts both content= and contents= to identify source file: @@ -242,11 +253,11 @@ def _configured_unused_inputs_file(ctx, srcs, keep): # See also: https://github.com/bazel-contrib/bazel-lib/issues/794 ctx.actions.run_shell( outputs = [unused_inputs], - inputs = [all_inputs, keep_inputs, ctx.file.mtree], + inputs = [prunable_inputs, keep_inputs, ctx.file.mtree], tools = [coreutils], command = ''' "$COREUTILS" join -v 1 \\ - <("$COREUTILS" sort -u "$ALL_INPUTS") \\ + <("$COREUTILS" sort -u "$PRUNABLE_INPUTS") \\ <("$COREUTILS" sort -u \\ <(grep -o '\\bcontents\\?=\\S*' "$MTREE" | "$COREUTILS" cut -d'=' -f 2-) \\ "$KEEP_INPUTS" \\ @@ -256,7 +267,7 @@ def _configured_unused_inputs_file(ctx, srcs, keep): ''', env = { "COREUTILS": coreutils.path, - "ALL_INPUTS": all_inputs.path, + "PRUNABLE_INPUTS": prunable_inputs.path, "KEEP_INPUTS": keep_inputs.path, "MTREE": ctx.file.mtree.path, "UNUSED_INPUTS": unused_inputs.path, From 747eb6f39a19223705797f3259629a3248840285 Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Tue, 8 Oct 2024 15:05:18 -0700 Subject: [PATCH 12/13] Fix copy-paste error Whoops. Co-authored-by: Sahin Yort --- lib/private/tar.bzl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/private/tar.bzl b/lib/private/tar.bzl index 812bdccbd..3e96ef136 100644 --- a/lib/private/tar.bzl +++ b/lib/private/tar.bzl @@ -112,7 +112,7 @@ Possible values: - `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. - `compute_unused_inputs = 0`: Never discover or prune unused inputs. - - `stamp = -1`: Discovery and pruning of unused inputs is controlled by the + - `compute_unused_inputs = -1`: Discovery and pruning of unused inputs is controlled by the --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. """, default = -1, From af6c856f1953a9bfaa3257f0959baadf94f76e67 Mon Sep 17 00:00:00 2001 From: Peter Lobsinger Date: Tue, 8 Oct 2024 15:06:10 -0700 Subject: [PATCH 13/13] Rerun docs update --- docs/tar.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tar.md b/docs/tar.md index 4661dab51..ccdda5288 100644 --- a/docs/tar.md +++ b/docs/tar.md @@ -93,7 +93,7 @@ Rule that executes BSD `tar`. Most users should use the [`tar`](#tar) macro, rat | out | Resulting tar file to write. If absent, `[name].tar` is written. | Label | optional | `None` | | args | Additional flags permitted by BSD tar; see the man page. | List of strings | optional | `[]` | | compress | Compress the archive file with a supported algorithm. | String | optional | `""` | -| compute_unused_inputs | Whether to discover and prune input files that will not contribute to the archive.

Unused inputs are discovered by comparing the set of input files in `srcs` to the set of files referenced by `mtree`. Files not used for content by the mtree specification will not be read by the `tar` tool when creating the archive and can be pruned from the input set using the `unused_inputs_list` [mechanism](https://bazel.build/contribute/codebase#input-discovery).

Benefits: pruning unused input files can reduce the amount of work the build system must perform. Pruned files are not included in the action cache key; changes to them do not invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need to block on the availability of pruned inputs, which can increase the available parallelism of builds. Pruned files do not need to be transferred to remote-execution workers, which can reduce network costs.

Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The comparison performed between `srcs` and `mtree` is currently inexact and may fail to handle handwritten or externally-derived mtree specifications. However, it is safe to use this feature when the lines found in `mtree` are derived from one or more `mtree_spec` rules, filtered and/or merged on whole-line basis only.

Possible values:

- `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. - `compute_unused_inputs = 0`: Never discover or prune unused inputs. - `stamp = -1`: Discovery and pruning of unused inputs is controlled by the --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. | Integer | optional | `-1` | +| compute_unused_inputs | Whether to discover and prune input files that will not contribute to the archive.

Unused inputs are discovered by comparing the set of input files in `srcs` to the set of files referenced by `mtree`. Files not used for content by the mtree specification will not be read by the `tar` tool when creating the archive and can be pruned from the input set using the `unused_inputs_list` [mechanism](https://bazel.build/contribute/codebase#input-discovery).

Benefits: pruning unused input files can reduce the amount of work the build system must perform. Pruned files are not included in the action cache key; changes to them do not invalidate the cache entry, which can lead to higher cache hit rates. Actions do not need to block on the availability of pruned inputs, which can increase the available parallelism of builds. Pruned files do not need to be transferred to remote-execution workers, which can reduce network costs.

Risks: pruning an actually-used input file can lead to unexpected, incorrect results. The comparison performed between `srcs` and `mtree` is currently inexact and may fail to handle handwritten or externally-derived mtree specifications. However, it is safe to use this feature when the lines found in `mtree` are derived from one or more `mtree_spec` rules, filtered and/or merged on whole-line basis only.

Possible values:

- `compute_unused_inputs = 1`: Always perform unused input discovery and pruning. - `compute_unused_inputs = 0`: Never discover or prune unused inputs. - `compute_unused_inputs = -1`: Discovery and pruning of unused inputs is controlled by the --[no]@aspect_bazel_lib//lib:tar_compute_unused_inputs flag. | Integer | optional | `-1` | | mode | A mode indicator from the following list, copied from the tar manpage:

- create: Create a new archive containing the specified items. - append: Like `create`, but new entries are appended to the archive. Note that this only works on uncompressed archives stored in regular files. The -f option is required. - list: List archive contents to stdout. - update: Like `append`, but new entries are added only if they have a modification date newer than the corresponding entry in the archive. Note that this only works on uncompressed archives stored in regular files. The -f option is required. - extract: Extract to disk from the archive. If a file with the same name appears more than once in the archive, each copy will be extracted, with later copies overwriting (replacing) earlier copies. | String | optional | `"create"` | | mtree | An mtree specification file | Label | required | |