Middle-ground vendoring option using a local registry #46
Conversation
And here is an `extract_archive` rule based on prelude's `http_archive` that makes all this work.

```python
def _tar_strip_prefix_flags(strip_prefix: [str, None]) -> list[str]:
    if strip_prefix:
        # Count nonempty path components in the prefix.
        count = len(filter(lambda c: c != "", strip_prefix.split("/")))
        return ["--strip-components=" + str(count), strip_prefix]
    return []

def _unarchive_cmd(
        # ext_type: str,
        # exec_is_windows: bool,
        archive: Artifact,
        strip_prefix: [str, None]) -> (cmd_args, bool):
    unarchive_cmd = cmd_args(
        "tar",
        "-xzf",
        archive,
        _tar_strip_prefix_flags(strip_prefix),
    )

    # tar applies --strip-components itself, so no separate copy_dir step
    # is needed afterwards (hence needs_strip_prefix = False).
    return unarchive_cmd, False

def _extract_archive_impl(ctx: AnalysisContext) -> list[Provider]:
    archive = ctx.attrs.src

    # No need to prefer local; unlike http_archive, this is not a
    # downloaded object.
    prefer_local = False

    unarchive_cmd, needs_strip_prefix = _unarchive_cmd(archive, ctx.attrs.strip_prefix)
    exec_is_windows = False

    output_name = ctx.label.name
    output = ctx.actions.declare_output(output_name, dir = True)
    script_output = ctx.actions.declare_output(output_name + "_tmp", dir = True) if needs_strip_prefix else output

    if exec_is_windows:
        ext = "bat"
        mkdir = "md {}"
        interpreter = []
    else:
        ext = "sh"
        mkdir = "mkdir -p {}"
        interpreter = ["/bin/sh"]

    exclude_flags = []
    script, _ = ctx.actions.write(
        "unpack.{}".format(ext),
        [
            cmd_args(script_output, format = mkdir),
            cmd_args(script_output, format = "cd {}"),
            cmd_args([unarchive_cmd] + exclude_flags, delimiter = " ").relative_to(script_output),
        ],
        is_executable = True,
        allow_args = True,
    )

    exclude_hidden = []
    ctx.actions.run(
        cmd_args(interpreter + [script]).hidden(exclude_hidden + [archive, script_output.as_output()]),
        category = "extract_archive",
        prefer_local = prefer_local,
    )

    if needs_strip_prefix:
        ctx.actions.copy_dir(output.as_output(), script_output.project(ctx.attrs.strip_prefix))

    return [DefaultInfo(
        default_output = output,
        sub_targets = {
            path: [DefaultInfo(default_output = output.project(path))]
            for path in ctx.attrs.sub_targets
        },
    )]

extract_archive = rule(
    impl = _extract_archive_impl,
    attrs = {
        "src": attrs.source(),
        "strip_prefix": attrs.option(attrs.string(), default = None),
        "sub_targets": attrs.list(attrs.string(), default = [], doc = """
            A list of filepaths within the archive to be made accessible as sub-targets.
            For example if we have an http_archive with `name = "archive"` and
            `sub_targets = ["src/lib.rs"]`, then other targets would be able to refer
            to that file as `":archive[src/lib.rs]"`.
        """),
    },
)
```
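For context, a target using the rule might look like the following (hypothetical names and paths):

```python
# Hypothetical usage: unpack a checked-in tarball, dropping its top-level
# "third-party-1.0/" directory, and expose one file as a sub-target.
extract_archive(
    name = "third_party_src",
    src = "third-party-1.0.tar.gz",
    strip_prefix = "third-party-1.0",
    sub_targets = ["src/lib.rs"],
)
```

Other targets can then refer to the extracted file as `":third_party_src[src/lib.rs]"`, per the `sub_targets` doc above.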
@cormacrelf I think your analysis is on point; there is a gap here that the tarballs fill nicely.
Maybe @jsgf?
This seems like something buck2 should be handling: wide graphs of any sort will easily have far more work available than the machine can perform, and scheduling within those limits is a core feature. Is there some reason that's not realistic or possible here? We use Bazel (but I'm interested in buck2) with a substantial amount of Rust and 1000 third-party crates. For Bazel we use rules_rust and vendor the metadata for each crate: it works great, builds are fast, and there is minimal data in git (one build file per crate). What stops reindeer + buck2 also being great with this style of approach?
Yes, for non-vendored mode, it would be a big step in the right direction if the caching of the compressed crate download + decompress steps could be shared across buck configurations. At the moment, if you build in two different configurations (e.g. release mode), it will download and decompress the entire crate graph twice, as the http_archive targets are all configured again. It is supposedly possible to do this with Buck anonymous targets, according to the documentation, but it's more of a note in the margins IIRC; I don't know how to actually do it. Regular anonymous targets are only shared between uses in the same buck configuration, which is for things like "only compile this erlang file once".
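For reference, the rough shape of the anonymous-target API is sketched below. This is a minimal, untested sketch with invented names, and, as noted, anon targets may only deduplicate within one configuration, which is precisely the limitation in question:

```python
# Sketch: an anon target keyed on the archive's content hash rather than on
# a configured dep, in the hope that identical archives resolve to one node.
ExtractedInfo = provider(fields = ["out"])

def _anon_extract_impl(ctx: AnalysisContext) -> list[Provider]:
    # Placeholder action; a real version would run the tar command from the
    # extract_archive rule above.
    out = ctx.actions.write("extracted.marker", ctx.attrs.sha256)
    return [DefaultInfo(default_output = out), ExtractedInfo(out = out)]

_anon_extract = rule(impl = _anon_extract_impl, attrs = {
    # The content hash is the only attr, so it is the whole anon-target key.
    "sha256": attrs.string(),
})

def _consumer_impl(ctx: AnalysisContext):
    def finish(providers) -> list[Provider]:
        return [DefaultInfo(default_output = providers[ExtractedInfo].out)]

    # The impl returns a promise that resolves to this target's providers.
    return ctx.actions.anon_target(_anon_extract, {
        "sha256": ctx.attrs.sha256,
    }).promise.map(finish)
```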
That's quite surprising; I would have expected the same action with the same CAS inputs to get a cache hit and become a no-op, even if their incremental computation nodes are separate. :( Are the buck2 team aware? Should we file a bug?
Everything is a bit better when you use RE, because everything does eventually get cached. I wrote this PR before I had RE configured, so I haven't tried going back to non-vendored with RE. There is no way around having to download every change to your cargo deps (which is often a good 50 crates at a time) in every configuration you use, but if it's cached it might be okay. Most paths in buck-out/ include the configuration hash, including …
Thank you!
@dtolnay has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks @cormacrelf. When you say RE results in caching, I presume the concept there is that the output of an action is cached by the RE CAS cache; this is perhaps more of a buck2 question than a reindeer one, but does buck2 not have a local CAS cache?
It does not. Buck does have a local […]. Bazel has a few sandboxing modes, some of which amount to "local remote execution" that's somewhat automatic. With Buck you can run your own RE locally (e.g. with docker-compose & nativelink) and configure it to do anything from CAS-only up to taking over executing everything. I have been doing that and it's been great, but it's a lot of setup.
@cormacrelf thanks! Has anyone written a guide?
Not that I know of. You can just use hosted RE services; I only do LRE because the internet connection I have is not up to the task. The hard part is always making good toolchains, which is unrelated to where your RE happens to run. You would think Nix would make it easy, but it does not. I have a thing that dumps a Nix shell environment to a file that can be sourced as an entrypoint script in the RE image for RE actions; it also normalises & hashes that environment and dumps the hashes to a .buckconfig.d/ file. Buck reads that configuration and attaches the hashes as a hidden dependency of each compiler invocation, so when you e.g. upgrade clang or rustc or some shared library in the RE image, you automatically invalidate large swaths of the action cache. I think I should open source that.
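A minimal sketch of that invalidation trick, assuming the Nix dump step writes an `env_hash` key into a `.buckconfig.d/` file (all names below are invented):

```python
# Assumed: a .buckconfig.d/ file contains [re_env] env_hash = <hash>,
# regenerated whenever the RE image's toolchain environment changes.
_TOOLCHAIN_HASH = read_config("re_env", "env_hash", "")

def _compile_impl(ctx: AnalysisContext) -> list[Provider]:
    out = ctx.actions.declare_output(ctx.label.name + ".o")

    # Materialize the hash and attach it as a hidden input: upgrading
    # clang/rustc in the RE image changes this file's content, which
    # changes every compile action's cache key.
    hash_seed = ctx.actions.write("toolchain.hash", _TOOLCHAIN_HASH)

    ctx.actions.run(
        cmd_args("cc", "-c", ctx.attrs.src, "-o", out.as_output()).hidden(hash_seed),
        category = "compile",
    )
    return [DefaultInfo(default_output = out)]

compile_one = rule(impl = _compile_impl, attrs = {"src": attrs.source()})
```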
Summary: This is to support facebookincubator/reindeer#46, attn @dtolnay probably.

Two new features:

- `extract_archive` now shares its unarchive implementation (and available attrs) with `http_archive`, which was a lot more mature. So it gets `strip_prefix`, `sub_targets`, etc., and Windows support.
- Both of them now allow a dict for `sub_targets`. This wasn't needed for that reindeer PR, but I did use it for a golang version of that reindeer vendoring strategy. Think `{ "embed_files": ["default.html", "default.css"] }`.

X-link: facebook/buck2#836
Pull Request resolved: #836
Reviewed By: zertosh
Differential Revision: D67976137
Pulled By: dtolnay
fbshipit-source-id: c2efb5851d6f05cfbe4fe061b456e548be5438a5
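To illustrate the dict form of `sub_targets` from the commit message (a hypothetical target; the `:web_assets[embed_files]` referencing syntax is my assumption):

```python
# Hypothetical usage of the dict form: the key names a sub-target that
# exposes several files from the archive at once, presumably consumable
# as ":web_assets[embed_files]".
extract_archive(
    name = "web_assets",
    src = "assets.tar.gz",
    sub_targets = {
        "embed_files": ["default.html", "default.css"],
    },
)
```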
The problem is a goldilocks one.

`vendor = false`

- is quick & easy to manage deps & buckify, but pretty bad day to day.
- `reindeer buckify` takes 2 seconds or so. Pretty convenient.

`[vendor] ...`

- is slow to manage deps & buckify.
- `vendor` directory of 1000 crates, 1.2 GB, 50k source files. Mostly from dupes of the windows crates, which can't be pinned to one single version etc.
- `reindeer vendor` takes 35 seconds
- `reindeer buckify` takes 20 seconds
- `git status` takes 150ms

I think we need a solution for the middle ground:

- `vendor = "local-registry"`, using https://github.com/dhovart/cargo-local-registry
- `reindeer vendor` ultimately just writes a bunch of .crate files into vendor/, which are just gzipped tarballs (see the sketch after this list)
- `windows-0.48.0.crate` is just a blob. Your diffs are much simpler when you modify deps. Etc.
- (This needs `strip_prefix` and `sub_targets` support, but prelude's `extract_archive` could probably have that added.)

Outcomes:

- It works, although it doesn't handle cargo `package = { git = "..." }` deps yet.
- `reindeer vendor` and `reindeer buckify` both take 2 seconds
- `git status` takes 20ms
- You never extract `windows-0.48.0.crate` on linux, and you only pay for what you build.

Problems:

- If it were built on the `cargo` crate, cargo-local-registry would be maybe a 2-file crate. And we could use it as a library.
- The `index` folder (nested `ab/ ac/ ah/ ...` folders) might be a little bit annoying if you're making concurrent edits on different branches. But you can always regenerate.
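Concretely, buckified output under this scheme might look roughly like the following sketch. The attribute values and the `rust_library` shape are invented for illustration; the real generated rules are up to reindeer:

```python
# Sketch: the crate stays a gzipped .crate blob in vendor/ and is only
# extracted when something actually builds it.
extract_archive(
    name = "windows-0.48.0.crate",
    src = "vendor/windows-0.48.0.crate",
    strip_prefix = "windows-0.48.0",
    sub_targets = ["src/lib.rs"],
)

rust_library(
    name = "windows-0.48.0",
    srcs = [":windows-0.48.0.crate"],
    crate = "windows",
    crate_root = ":windows-0.48.0.crate[src/lib.rs]",
    edition = "2021",
)
```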