Middle-ground vendoring option using a local registry #46

Closed

Conversation

@cormacrelf commented May 22, 2024

The problem is a Goldilocks one.

  1. `vendor = false` is quick & easy for managing deps and running buckify, but pretty bad day to day.

    • Doesn't work offline, as buck2 constantly issues HEAD requests
    • Terrible DX annoyance with "too many open files" errors due to buck trying to download 1000 crates at once. The standard start to your day looks like "run buck2 build about 15 times until, by random chance, the scheduler manages to get past those errors"
    • Those crates get downloaded again, and again, and again
    • reindeer buckify takes 2 seconds or so. Pretty convenient.
  2. `[vendor]` ... is slow for managing deps and running buckify.

    • Neat for small projects
    • Also probably neat for Meta with y'all's funky EdenFS etc.
    • But middle ground is bad
      • Middle = a vendor directory of 1000 crates, 1.2 GB, 50k source files, mostly dupes of the windows crates, which can't be pinned to one single version, etc.
      • reindeer vendor takes 35 seconds
      • reindeer buckify takes 20 seconds
      • git status takes 150ms
      • The vendor folder wrecks git performance simply by its existence.
    • Build experience is perfect, works offline, etc.

I think we need a solution for the middle ground:

  • `vendor = "local-registry"`, using https://github.com/dhovart/cargo-local-registry
  • reindeer vendor ultimately just writes a bunch of .crate files into vendor/, which are just gzipped tarballs
  • .crate files are stored in git, but using git-lfs if you like. Suddenly windows-0.48.0.crate is just a blob, and your diffs are much simpler when you modify deps.
  • Some buck2 rule to extract them. (There is no prelude rule that can do this with strip_prefix and sub_targets support, but prelude's extract_archive could probably have that added.) A sketch of the wiring follows this list.
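
To make that concrete, here is a rough sketch of how one vendored crate might be wired up in a generated BUCK file. This is my illustration rather than reindeer's actual output: the crate name, version, paths, and the `cargo.rust_library` attribute spellings are all made up, and `extract_archive` is the rule proposed in the next comment.

```python
# Illustrative sketch only, not reindeer's real buckify output.
# vendor/serde-1.0.0.crate is a gzipped tarball tracked in git (or git-lfs).
extract_archive(
    name = "serde-1.0.0.crate",
    src = "vendor/serde-1.0.0.crate",
    # crates.io tarballs contain a single serde-1.0.0/ top-level directory
    strip_prefix = "serde-1.0.0",
    sub_targets = ["src/lib.rs"],
)

cargo.rust_library(
    name = "serde",
    srcs = [":serde-1.0.0.crate"],
    crate_root = ":serde-1.0.0.crate[src/lib.rs]",
    edition = "2018",
    visibility = ["PUBLIC"],
)
```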

Outcomes:

  • Offline works (although it doesn't handle cargo `package = { git = "..." }` deps yet).
  • reindeer vendor and reindeer buckify both take 2 seconds
  • git status takes 20ms
  • Buck builds are a compromise, but a pretty great one. It still has to extract the tarballs when you want to build things. But at least buck won't ever actually extract windows-0.48.0.crate on linux, and you only pay for what you build.
  • The DX annoyance factor during builds is back to zero. No more too many open files errors.
  • DX annoyance when updating deps is acceptable.

Problems:

  • Relies on https://github.com/dhovart/cargo-local-registry being installed. Note, however, that this is a single-file binary; I think if you rewrote it without the dependency on the cargo crate it would be maybe a 2-file crate, and we could use it as a library.
  • I think storing the local registry's index folder (nested ab/ ac/ ah/ ... folders) might be a little bit annoying if you're making concurrent edits on different branches. But you can always regenerate.

@cormacrelf commented May 22, 2024

And here is an `extract_archive` rule, based on the prelude's `http_archive`, that makes all this work:

```python
def _tar_strip_prefix_flags(strip_prefix: [str, None]) -> list[str]:
    if strip_prefix:
        # count nonempty path components in the prefix
        count = len(filter(lambda c: c != "", strip_prefix.split("/")))
        # also pass the prefix itself, so tar only extracts that subtree
        return ["--strip-components=" + str(count), strip_prefix]
    return []

def _unarchive_cmd(
        archive: Artifact,
        strip_prefix: [str, None]) -> (cmd_args, bool):
    # Simplified from the prelude version: gzipped tarballs only, POSIX only.
    unarchive_cmd = cmd_args(
        "tar",
        "-xzf",
        archive,
        _tar_strip_prefix_flags(strip_prefix),
    )
    # The second element is needs_strip_prefix: tar already strips the
    # prefix via --strip-components, so no post-extraction copy is needed.
    return unarchive_cmd, False

def _extract_archive_impl(ctx: AnalysisContext) -> list[Provider]:
    archive = ctx.attrs.src

    # No need to prefer local: unlike http_archive, this is not a downloaded object.
    prefer_local = False
    unarchive_cmd, needs_strip_prefix = _unarchive_cmd(archive, ctx.attrs.strip_prefix)
    # Kept for structural parity with http_archive; this rule is POSIX-only.
    exec_is_windows = False

    output_name = ctx.label.name
    output = ctx.actions.declare_output(output_name, dir = True)
    script_output = ctx.actions.declare_output(output_name + "_tmp", dir = True) if needs_strip_prefix else output
    if exec_is_windows:
        ext = "bat"
        mkdir = "md {}"
        interpreter = []
    else:
        ext = "sh"
        mkdir = "mkdir -p {}"
        interpreter = ["/bin/sh"]
    exclude_flags = []
    script, _ = ctx.actions.write(
        "unpack.{}".format(ext),
        [
            cmd_args(script_output, format = mkdir),
            cmd_args(script_output, format = "cd {}"),
            cmd_args([unarchive_cmd] + exclude_flags, delimiter = " ").relative_to(script_output),
        ],
        is_executable = True,
        allow_args = True,
    )
    exclude_hidden = []
    ctx.actions.run(
        cmd_args(interpreter + [script]).hidden(exclude_hidden + [archive, script_output.as_output()]),
        category = "extract_archive",
        prefer_local = prefer_local,
    )

    if needs_strip_prefix:
        ctx.actions.copy_dir(output.as_output(), script_output.project(ctx.attrs.strip_prefix))

    return [DefaultInfo(
        default_output = output,
        sub_targets = {
            path: [DefaultInfo(default_output = output.project(path))]
            for path in ctx.attrs.sub_targets
        },
    )]

extract_archive = rule(
    impl = _extract_archive_impl,
    attrs = {
        "src": attrs.source(),
        "strip_prefix": attrs.option(attrs.string(), default = None),
        "sub_targets": attrs.list(attrs.string(), default = [], doc = """
            A list of filepaths within the archive to be made accessible as sub-targets.
            For example if we have an http_archive with `name = "archive"` and
            `sub_targets = ["src/lib.rs"]`, then other targets would be able to refer
            to that file as `":archive[src/lib.rs]"`.
        """),
    },
)
```
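
Spelling out the docstring's `sub_targets` example with this rule (the load path and file names here are hypothetical):

```python
load("//defs:extract_archive.bzl", "extract_archive")  # wherever the rule lives

extract_archive(
    name = "archive",
    src = "my-archive.tar.gz",
    strip_prefix = "my-archive-1.0",
    sub_targets = ["src/lib.rs"],
)

# ":archive" is the extracted directory tree;
# ":archive[src/lib.rs]" resolves to that single file within it.
```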

@cormacrelf force-pushed the feature/local-registry branch from 2eb5ad4 to d8c14c7 on May 22, 2024
@cormacrelf force-pushed the feature/local-registry branch from d8c14c7 to e1d2c53 on May 22, 2024
@cormacrelf force-pushed the feature/local-registry branch from cb858a3 to a7b0188 on September 4, 2024
@dmezh commented Sep 26, 2024

@cormacrelf I think your analysis is on point; there is a gap here that the tarballs fill nicely.

@dmezh commented Sep 26, 2024

Maybe @jsgf?

@rbtcollins commented

> Terrible DX annoyance with "too many open files" errors due to buck trying to download 1000 crates at once

This seems like something buck2 should be handling: wide graphs of any sort will easily have far more work available than the machine can perform, and scheduling within those limits is a core feature. Is there some reason that's not realistic or possible here?

We use bazel (but I'm interested in buck2) with a substantial number of Rust crates, including 1000 third-party crates. For bazel we use rules_rust and vendor the metadata for each crate. It works great: builds are fast, and there is minimal data in git (one build file per crate). What stops reindeer + buck2 from also being great with this style of approach?

@cormacrelf commented

Yes. For non-vendored mode, it would be a big step in the right direction if the caching of the compressed crate download + decompress steps could be shared across buck configurations. At the moment, if you build in two different configurations (e.g. release mode), it will download and decompress the entire crate graph twice, as the http_archive targets are all configured again.

It is supposedly possible to do this with Buck anonymous targets, according to the documentation, but it's more of a note in the margins IIRC; I don't know how to actually do it. Regular anonymous targets are only shared between uses within the same buck configuration, which is for things like "only compile this erlang file once".
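
For reference, a minimal sketch of regular anon-target usage, which (as noted) only dedupes within a single configuration. The API details are from memory and have shifted between buck2 versions, so treat this as a shape rather than authoritative code:

```python
def _fetch_impl(ctx: AnalysisContext) -> list[Provider]:
    # Placeholder action; a real version would download ctx.attrs.url.
    out = ctx.actions.write("url.txt", ctx.attrs.url)
    return [DefaultInfo(default_output = out)]

_fetch = rule(impl = _fetch_impl, attrs = {"url": attrs.string()})

def _consumer_impl(ctx: AnalysisContext):
    def use(providers):
        return [DefaultInfo(default_output = providers[DefaultInfo].default_outputs[0])]

    # Identical (rule, attrs) pairs resolve to one shared anon target,
    # but only among callers analysed in the same configuration.
    return ctx.actions.anon_target(_fetch, {"url": ctx.attrs.url}).promise.map(use)

consumer = rule(impl = _consumer_impl, attrs = {"url": attrs.string()})
```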

@rbtcollins commented

That's quite surprising; I would have expected the same action with the same CAS inputs to get a cache hit and become a no-op, even if their incremental computation nodes are separate. :( Are the buck2 team aware? Should we file a bug?

@cormacrelf commented

Everything is a bit better when you use RE, because everything does eventually get cached. I wrote this PR before I had RE configured, so I haven't tried going back to non-vendored with RE. There is no way around having to download every change to your cargo deps (which is often a good 50 crates at a time) in every configuration you use, but if it's cached it might be okay.

Most paths in buck-out/ include the configuration hash, including http_archive's output paths. Paths for anonymous targets, at least those I've managed to make, live in a namespace and include a hash of their inputs instead of e.g. `path/to/package/__target_name__`, but they still have the configuration hash at the start.

  • Fully vendored = nothing to cache, your source files are not in buck-out/, everything just uses those paths directly
  • No vendoring = http_archive's output path includes configuration hash, and so do all the intermediates (downloaded .crate tarballs). Bad caching
  • This PR = the tarballs are vendored, so they don't have to be downloaded twice. But the outputs still include a configuration hash, so the decompress step has to be done once per configuration.

@dtolnay left a comment

Thank you!

@facebook-github-bot commented

@dtolnay has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot commented

@dtolnay merged this pull request in 8f90c1f.

@rbtcollins commented

Thanks @cormacrelf. When you say RE results in caching, I presume the concept there is that the output of an action is cached by the RE CAS cache. This is perhaps more of a buck2 than a reindeer question, but does buck2 not have a local CAS cache?

@cormacrelf commented Jan 8, 2025

It does not. Buck does have a local database of materialized artifacts, but it is not a CAS store and it is not persistent across daemon restarts.

Bazel has a few sandboxing modes, some of which amount to a somewhat automatic "local remote execution". With buck you can run your own RE locally (e.g. with docker-compose & nativelink) and configure it to do anything between CAS-only and taking over all execution. I have been doing that and it's been great, but it is a lot of setup.

@rbtcollins commented

@cormacrelf thanks! Has anyone written a guide?

@cormacrelf commented Jan 8, 2025

Not that I know of. You can just use hosted RE services. I only do LRE because the internet connection I have is not up to the task. The hard part is always making good toolchains, which is unrelated to where your RE happens to run. You would think Nix would make it easy but it is not.

I have a thing that dumps a Nix shell environment to a file that can be sourced as an entrypoint script in the RE image for RE actions; it also normalises & hashes the environment and dumps the hashes to a .buckconfig.d/ file. Buck reads that configuration and attaches the hashes as a hidden dependency of each compiler invocation, so when you e.g. upgrade clang or rustc or some shared library in the RE image, you automatically invalidate large swaths of the action cache. I think I should open source that.
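
The hidden-dependency part might look something like the sketch below. This is my reconstruction of the idea, not the author's actual code: the `nix_hashes` buckconfig section, the rule, and the macro are all invented for illustration.

```python
# Sketch only. Assumes a generated .buckconfig.d/ file along the lines of:
#
#   [nix_hashes]
#   rust = sha256-abc123...

def _compile_impl(ctx: AnalysisContext) -> list[Provider]:
    out = ctx.actions.declare_output(ctx.label.name + ".o")
    # Materialise the hash as a file and attach it as a hidden input: when
    # the Nix closure changes, the hash changes, so every action hiding this
    # file gets a new action key and stale cache entries are bypassed.
    hash_file = ctx.actions.write("env_hash.txt", ctx.attrs.env_hash)
    ctx.actions.run(
        cmd_args(["cc", "-c", ctx.attrs.src, "-o", out.as_output()])
            .hidden([hash_file]),
        category = "compile",
    )
    return [DefaultInfo(default_output = out)]

_compile_one = rule(
    impl = _compile_impl,
    attrs = {
        "src": attrs.source(),
        "env_hash": attrs.string(default = ""),
    },
)

def compile_one(name, src):
    _compile_one(
        name = name,
        src = src,
        # read_config is evaluated when the BUCK file loads, so the
        # generated .buckconfig.d/ file feeds the hash in here.
        env_hash = read_config("nix_hashes", "rust", "unknown"),
    )
```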

facebook-github-bot pushed a commit to facebook/buck2-prelude that referenced this pull request Jan 9, 2025
Summary:
This is to support facebookincubator/reindeer#46, attn dtolnay probably

Two new features:

- `extract_archive` now shares its unarchive implementation (and available attrs) with `http_archive`, which was a lot more mature. So it gets `strip_prefix` and `sub_targets` etc. And windows support.
- Both of them now allow a dict for `sub_targets`. This wasn't needed for that reindeer PR, but I did use it for a golang version of that reindeer vendoring strategy. Think `{ "embed_files": [ "default.html", "default.css" ] }`.

X-link: facebook/buck2#836

Reviewed By: zertosh

Differential Revision: D67976137

Pulled By: dtolnay

fbshipit-source-id: c2efb5851d6f05cfbe4fe061b456e548be5438a5
facebook-github-bot pushed a commit to facebook/buck2 that referenced this pull request Jan 9, 2025
(Commit message identical to the facebook/buck2-prelude commit above; Pull Request resolved: facebook/buck2#836.)