Middle-ground vendoring option using a local registry #46

Closed

Conversation

@cormacrelf commented May 22, 2024

The problem is a Goldilocks one.

  1. `vendor = false` is quick & easy for managing deps and running buckify, but pretty bad day to day.

    • Doesn't work offline, as buck2 constantly issues HEAD requests
    • Terrible DX annoyance with "too many open files" errors due to buck trying to download 1000 crates at once. The standard start to your day looks like "run buck2 build about 15 times until, by random chance, the scheduler manages to get past those errors"
    • Those crates get downloaded again, and again, and again
    • reindeer buckify takes 2 seconds or so. Pretty convenient.
  2. `[vendor]` ... is slow for managing deps and running buckify.

    • Neat for small projects
    • Also probably neat for Meta with y'all's funky EdenFS etc.
    • But middle ground is bad
      • Middle = a vendor directory of 1000 crates, 1.2 GB, 50k source files, mostly dupes of the windows crates, which can't be pinned to one single version, etc.
      • reindeer vendor takes 35 seconds
      • reindeer buckify takes 20 seconds
      • git status takes 150ms
      • The vendor folder wrecks git performance simply by its existence.
    • Build experience is perfect, works offline, etc.

I think we need a solution for the middle ground:

  • `vendor = "local-registry"`, using https://github.com/dhovart/cargo-local-registry
  • reindeer vendor ultimately just writes a bunch of .crate files into vendor/, which are just gzipped tarballs
  • .crate files are stored in git, but using git-lfs if you like. Suddenly windows-0.48.0.crate is just a blob, and your diffs are much simpler when you modify deps.
  • Some buck2 rule to extract them. (There is no prelude rule that can do this with strip_prefix and sub_targets support, but prelude's extract_archive could probably have that added.) A sketch of the wiring follows this list.
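
To make that concrete, here is a rough sketch of how one vendored crate might be wired up in a generated BUCK file. This is my illustration rather than reindeer's actual output: the crate name, version, paths, and the `cargo.rust_library` attribute spellings are all made up, and `extract_archive` is the rule proposed in the next comment.

```python
# Illustrative sketch only, not reindeer's real buckify output.
# vendor/serde-1.0.0.crate is a gzipped tarball tracked in git (or git-lfs).
extract_archive(
    name = "serde-1.0.0.crate",
    src = "vendor/serde-1.0.0.crate",
    # crates.io tarballs contain a single serde-1.0.0/ top-level directory
    strip_prefix = "serde-1.0.0",
    sub_targets = ["src/lib.rs"],
)

cargo.rust_library(
    name = "serde",
    srcs = [":serde-1.0.0.crate"],
    crate_root = ":serde-1.0.0.crate[src/lib.rs]",
    edition = "2018",
    visibility = ["PUBLIC"],
)
```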

Outcomes:

  • Offline works (although it doesn't handle cargo `package = { git = "..." }` deps yet).
  • reindeer vendor and reindeer buckify both take 2 seconds
  • git status takes 20ms
  • Buck builds are a compromise, but a pretty great one. It still has to extract the tarballs when you want to build things. But at least buck won't ever actually extract windows-0.48.0.crate on linux, and you only pay for what you build.
  • The DX annoyance factor during builds is back to zero. No more too many open files errors.
  • DX annoyance when updating deps is acceptable.

Problems:

  • Relies on https://github.com/dhovart/cargo-local-registry being installed. Note, however, that this is a single-file binary; I think if you rewrote it without the dependency on the cargo crate it would be maybe a 2-file crate, and we could use it as a library.
  • I think storing the local registry's index folder (nested ab/ ac/ ah/ ... folders) might be a little bit annoying if you're making concurrent edits on different branches. But you can always regenerate.

@cormacrelf commented May 22, 2024

And here is an `extract_archive` rule, based on the prelude's `http_archive`, that makes all this work:

```python
def _tar_strip_prefix_flags(strip_prefix: [str, None]) -> list[str]:
    if strip_prefix:
        # count nonempty path components in the prefix
        count = len(filter(lambda c: c != "", strip_prefix.split("/")))
        # also pass the prefix itself, so tar only extracts that subtree
        return ["--strip-components=" + str(count), strip_prefix]
    return []

def _unarchive_cmd(
        archive: Artifact,
        strip_prefix: [str, None]) -> (cmd_args, bool):
    # Simplified from the prelude version: gzipped tarballs only, POSIX only.
    unarchive_cmd = cmd_args(
        "tar",
        "-xzf",
        archive,
        _tar_strip_prefix_flags(strip_prefix),
    )
    # The second element is needs_strip_prefix: tar already strips the
    # prefix via --strip-components, so no post-extraction copy is needed.
    return unarchive_cmd, False

def _extract_archive_impl(ctx: AnalysisContext) -> list[Provider]:
    archive = ctx.attrs.src

    # No need to prefer local: unlike http_archive, this is not a downloaded object.
    prefer_local = False
    unarchive_cmd, needs_strip_prefix = _unarchive_cmd(archive, ctx.attrs.strip_prefix)
    # Kept for structural parity with http_archive; this rule is POSIX-only.
    exec_is_windows = False

    output_name = ctx.label.name
    output = ctx.actions.declare_output(output_name, dir = True)
    script_output = ctx.actions.declare_output(output_name + "_tmp", dir = True) if needs_strip_prefix else output
    if exec_is_windows:
        ext = "bat"
        mkdir = "md {}"
        interpreter = []
    else:
        ext = "sh"
        mkdir = "mkdir -p {}"
        interpreter = ["/bin/sh"]
    exclude_flags = []
    script, _ = ctx.actions.write(
        "unpack.{}".format(ext),
        [
            cmd_args(script_output, format = mkdir),
            cmd_args(script_output, format = "cd {}"),
            cmd_args([unarchive_cmd] + exclude_flags, delimiter = " ").relative_to(script_output),
        ],
        is_executable = True,
        allow_args = True,
    )
    exclude_hidden = []
    ctx.actions.run(
        cmd_args(interpreter + [script]).hidden(exclude_hidden + [archive, script_output.as_output()]),
        category = "extract_archive",
        prefer_local = prefer_local,
    )

    if needs_strip_prefix:
        ctx.actions.copy_dir(output.as_output(), script_output.project(ctx.attrs.strip_prefix))

    return [DefaultInfo(
        default_output = output,
        sub_targets = {
            path: [DefaultInfo(default_output = output.project(path))]
            for path in ctx.attrs.sub_targets
        },
    )]

extract_archive = rule(
    impl = _extract_archive_impl,
    attrs = {
        "src": attrs.source(),
        "strip_prefix": attrs.option(attrs.string(), default = None),
        "sub_targets": attrs.list(attrs.string(), default = [], doc = """
            A list of filepaths within the archive to be made accessible as sub-targets.
            For example if we have an http_archive with `name = "archive"` and
            `sub_targets = ["src/lib.rs"]`, then other targets would be able to refer
            to that file as `":archive[src/lib.rs]"`.
        """),
    },
)
```
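
Spelling out the docstring's `sub_targets` example with this rule (the load path and file names here are hypothetical):

```python
load("//defs:extract_archive.bzl", "extract_archive")  # wherever the rule lives

extract_archive(
    name = "archive",
    src = "my-archive.tar.gz",
    strip_prefix = "my-archive-1.0",
    sub_targets = ["src/lib.rs"],
)

# ":archive" is the extracted directory tree;
# ":archive[src/lib.rs]" resolves to that single file within it.
```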

@cormacrelf force-pushed the feature/local-registry branch from 2eb5ad4 to d8c14c7 on May 22, 2024
@cormacrelf force-pushed the feature/local-registry branch from d8c14c7 to e1d2c53 on May 22, 2024
@cormacrelf force-pushed the feature/local-registry branch from cb858a3 to a7b0188 on September 4, 2024
@dmezh commented Sep 26, 2024

@cormacrelf I think your analysis is on point; there is a gap here that the tarballs fill nicely.

@dmezh commented Sep 26, 2024

Maybe @jsgf?

@rbtcollins commented

> Terrible DX annoyance with "too many open files" errors due to buck trying to download 1000 crates at once

This seems like something buck2 should be handling: wide graphs of any sort will easily have far more work available than the machine can perform, and scheduling within those limits is a core feature. Is there some reason that's not realistic or possible here?

We use bazel (but I'm interested in buck2) with a substantial number of Rust crates, including 1000 third-party crates. For bazel we use rules_rust and vendor the metadata for each crate. It works great: builds are fast, and there is minimal data in git (one build file per crate). What stops reindeer + buck2 from also being great with this style of approach?

@cormacrelf commented

Yes. For non-vendored mode, it would be a big step in the right direction if the caching of the compressed crate download + decompress steps could be shared across buck configurations. At the moment, if you build in two different configurations (e.g. release mode), it will download and decompress the entire crate graph twice, as the http_archive targets are all configured again.

It is supposedly possible to do this with Buck anonymous targets, according to the documentation, but it's more of a note in the margins IIRC; I don't know how to actually do it. Regular anonymous targets are only shared between uses within the same buck configuration, which is for things like "only compile this erlang file once".
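
For reference, a minimal sketch of regular anon-target usage, which (as noted) only dedupes within a single configuration. The API details are from memory and have shifted between buck2 versions, so treat this as a shape rather than authoritative code:

```python
def _fetch_impl(ctx: AnalysisContext) -> list[Provider]:
    # Placeholder action; a real version would download ctx.attrs.url.
    out = ctx.actions.write("url.txt", ctx.attrs.url)
    return [DefaultInfo(default_output = out)]

_fetch = rule(impl = _fetch_impl, attrs = {"url": attrs.string()})

def _consumer_impl(ctx: AnalysisContext):
    def use(providers):
        return [DefaultInfo(default_output = providers[DefaultInfo].default_outputs[0])]

    # Identical (rule, attrs) pairs resolve to one shared anon target,
    # but only among callers analysed in the same configuration.
    return ctx.actions.anon_target(_fetch, {"url": ctx.attrs.url}).promise.map(use)

consumer = rule(impl = _consumer_impl, attrs = {"url": attrs.string()})
```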

@rbtcollins commented

That's quite surprising; I would have expected the same action with the same CAS inputs to get a cache hit and become a no-op, even if their incremental computation nodes are separate. :( Are the buck2 team aware? Should we file a bug?

@cormacrelf commented

Everything is a bit better when you use RE, because everything does eventually get cached. I wrote this PR before I had RE configured, so I haven't tried going back to non-vendored with RE. There is no way around having to download every change to your cargo deps (which is often a good 50 crates at a time) in every configuration you use, but if it's cached it might be okay.

Most paths in buck-out/ include the configuration hash, including http_archive's output paths. Paths for anonymous targets, at least those I've managed to make, live in a namespace and include a hash of their inputs instead of e.g. `path/to/package/__target_name__`, but they still have the configuration hash at the start.

  • Fully vendored = nothing to cache, your source files are not in buck-out/, everything just uses those paths directly
  • No vendoring = http_archive's output path includes configuration hash, and so do all the intermediates (downloaded .crate tarballs). Bad caching
  • This PR = the tarballs are vendored, so they don't have to be downloaded twice. But the outputs still include a configuration hash, so the decompress step has to be done once per configuration.

@dtolnay left a comment

Thank you!

@facebook-github-bot commented

@dtolnay has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot commented

@dtolnay merged this pull request in 8f90c1f.

@rbtcollins commented

Thanks @cormacrelf. When you say RE results in caching, I presume the concept there is that the output of an action is cached by the RE CAS cache. This is perhaps more of a buck2 than a reindeer question, but does buck2 not have a local CAS cache?

@cormacrelf commented Jan 8, 2025

It does not. Buck does have a local database of materialized artifacts, but it is not a CAS store and it is not persistent across daemon restarts.

Bazel has a few sandboxing modes, some of which amount to a somewhat automatic "local remote execution". With buck you can run your own RE locally (e.g. with docker-compose & nativelink) and configure it to do anything between CAS-only and taking over all execution. I have been doing that and it's been great, but it is a lot of setup.

@rbtcollins commented

@cormacrelf thanks! Has anyone written a guide?

@cormacrelf commented Jan 8, 2025

Not that I know of. You can just use hosted RE services. I only do LRE because the internet connection I have is not up to the task. The hard part is always making good toolchains, which is unrelated to where your RE happens to run. You would think Nix would make it easy but it is not.

I have a thing that dumps a Nix shell environment to a file that can be sourced as an entrypoint script in the RE image for RE actions; it also normalises & hashes the environment and dumps the hashes to a .buckconfig.d/ file. Buck reads that configuration and attaches the hashes as a hidden dependency of each compiler invocation, so when you e.g. upgrade clang or rustc or some shared library in the RE image, you automatically invalidate large swaths of the action cache. I think I should open source that.
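
The hidden-dependency part might look something like the sketch below. This is my reconstruction of the idea, not the author's actual code: the `nix_hashes` buckconfig section, the rule, and the macro are all invented for illustration.

```python
# Sketch only. Assumes a generated .buckconfig.d/ file along the lines of:
#
#   [nix_hashes]
#   rust = sha256-abc123...

def _compile_impl(ctx: AnalysisContext) -> list[Provider]:
    out = ctx.actions.declare_output(ctx.label.name + ".o")
    # Materialise the hash as a file and attach it as a hidden input: when
    # the Nix closure changes, the hash changes, so every action hiding this
    # file gets a new action key and stale cache entries are bypassed.
    hash_file = ctx.actions.write("env_hash.txt", ctx.attrs.env_hash)
    ctx.actions.run(
        cmd_args(["cc", "-c", ctx.attrs.src, "-o", out.as_output()])
            .hidden([hash_file]),
        category = "compile",
    )
    return [DefaultInfo(default_output = out)]

_compile_one = rule(
    impl = _compile_impl,
    attrs = {
        "src": attrs.source(),
        "env_hash": attrs.string(default = ""),
    },
)

def compile_one(name, src):
    _compile_one(
        name = name,
        src = src,
        # read_config is evaluated when the BUCK file loads, so the
        # generated .buckconfig.d/ file feeds the hash in here.
        env_hash = read_config("nix_hashes", "rust", "unknown"),
    )
```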

facebook-github-bot pushed a commit to facebook/buck2-prelude that referenced this pull request Jan 9, 2025
Summary:
This is to support facebookincubator/reindeer#46, attn dtolnay probably

Two new features:

- `extract_archive` now shares its unarchive implementation (and available attrs) with `http_archive`, which was a lot more mature. So it gets `strip_prefix` and `sub_targets` etc. And windows support.
- Both of them now allow a dict for `sub_targets`. This wasn't needed for that reindeer PR, but I did use it for a golang version of that reindeer vendoring strategy. Think `{ "embed_files": [ "default.html", "default.css" ] }`.

X-link: facebook/buck2#836

Reviewed By: zertosh

Differential Revision: D67976137

Pulled By: dtolnay

fbshipit-source-id: c2efb5851d6f05cfbe4fe061b456e548be5438a5
facebook-github-bot pushed a commit to facebook/buck2 that referenced this pull request Jan 9, 2025
(Commit message identical to the facebook/buck2-prelude commit above; Pull Request resolved: facebook/buck2#836.)