
[Discussion] Make CAS blobs tied to ActionKey to improve sharding locality #296

Open
alexofortune opened this issue May 26, 2024 · 12 comments

Comments

@alexofortune

alexofortune commented May 26, 2024

Hi there! Opening a discussion issue to see what other folks think about this.

Right now the CAS keyspace is basically hashes of blob contents. If your CAS is a sharded system, that means an action's outputs can end up split across multiple shards. It also means that losing just one shard forces a lot of actions to rebuild, because many of them lost maybe one or two output files that happened to live on that shard. And it means that, generally speaking, the latency of most FindMissingBlobs calls depends on the tail latency across multiple shards, which is undesirable.

My initial gut feeling is that if we keyed blobs by a combination of action key + output key instead of by raw output hashes, we could get much better shard locality (colocating all of an action's outputs on a single shard). It comes with some rough edges (for example, multiple actions might produce the same output, so we would see duplication), but in practice I don't think that would be much of a concern.
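
Roughly, the locality argument in a toy Go sketch (shard count, hash function, and all key values here are arbitrary placeholders, not anything from the actual protocol):

```go
// Toy sketch only: not part of REAPI. Compares where an action's outputs land
// when shards are chosen by content hash versus by the producing action's key.
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 16 // arbitrary shard count

// shardOf maps an arbitrary key string to one of numShards shards.
func shardOf(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numShards
}

func main() {
	actionKey := "hash-of-action-abc"                     // hypothetical action key
	outputs := []string{"hash-of-lib.o", "hash-of-lib.d"} // hypothetical output content hashes

	for _, contentHash := range outputs {
		fmt.Printf("content-hash routing: %-14s -> shard %d\n", contentHash, shardOf(contentHash))
		fmt.Printf("action-key routing:   %-14s -> shard %d\n", contentHash, shardOf(actionKey))
	}
}
```

With content-hash routing the two outputs can land on different shards; with action-key routing they always land on the same one.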

@EdSchouten
Collaborator

EdSchouten commented May 26, 2024

How does this work with remote execution? If an output file gets placed in the input root of a subsequent action, how would the worker know which shard to query? It would need to know the “action key” of the action that yielded that file. And an input root can contain many files, each belonging to a different originating action.

@alexofortune
Author

alexofortune commented May 26, 2024

That's correct. Basically, the idea is to address files with a different kind of digest: instead of passing raw content hashes as we do right now, we would pass a combination of action key + content hash to reference these files. The annoying part is that this would probably require something like a new FileDigest message, used in OutputFile and FileNode.
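
For illustration only, the shape of such a composite reference might look like this (FileDigest here is a hypothetical Go struct standing in for a new proto message; none of this exists in REAPI today):

```go
// Hypothetical sketch only: FileDigest is not a real REAPI message. It shows
// the shape of a composite reference that keeps the content hash for
// integrity while adding the producing action's digest for shard routing.
package main

import "fmt"

// Digest mirrors the usual REAPI digest shape: a hash plus a size in bytes.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// FileDigest is the hypothetical replacement that OutputFile and FileNode
// would carry: ActionDigest picks the shard, ContentDigest verifies the bytes.
type FileDigest struct {
	ActionDigest  Digest
	ContentDigest Digest
}

func main() {
	fd := FileDigest{
		ActionDigest:  Digest{Hash: "9577a8bf...", SizeBytes: 147}, // placeholder values
		ContentDigest: Digest{Hash: "f2ca1bb6...", SizeBytes: 2048},
	}
	fmt.Printf("route by %s, verify by %s\n", fd.ActionDigest.Hash, fd.ContentDigest.Hash)
}
```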

@mostynb
Contributor

mostynb commented May 26, 2024

This feels like a server-side detail, which could be mitigated entirely on the cache server (if it is important).

@alexofortune
Author

@mostynb How would you mitigate that with the way the protocol works right now? I'm very much open to suggestions for solving this problem without resorting to changes to how RE works.

@mostynb
Contributor

mostynb commented May 26, 2024

There are a bunch of different ways you could try to mitigate this; it's hard to suggest a single best solution without knowing your requirements. I would probably start by storing blobs on multiple shards, with a fixed mapping known to the cache frontend.
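
One possible shape of that fixed mapping, as a rough sketch (two copies per blob; shard count and hash function are chosen arbitrarily here):

```go
// Sketch only: a cache-frontend mapping that writes every blob to a primary
// shard and one fixed secondary shard, so losing a single shard loses nothing.
package main

import (
	"fmt"
	"hash/crc32"
)

const numShards = 8 // arbitrary shard count

// shardsFor deterministically maps a digest hash to its primary and secondary
// shards; the frontend can always recompute this without extra metadata.
func shardsFor(digestHash string) (primary, secondary uint32) {
	p := crc32.ChecksumIEEE([]byte(digestHash)) % numShards
	return p, (p + 1) % numShards
}

func main() {
	p, s := shardsFor("9577a8bfed904bd55390ba203f1a233be1981b36fa945537eb3be5b2446de031")
	fmt.Printf("write to shard %d, replicate to shard %d\n", p, s)
}
```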

@sluongng
Contributor

Right now the CAS keyspace is basically hashes of blob contents. If your CAS is a sharded system, that means an action's outputs can end up split across multiple shards. It also means that losing just one shard forces a lot of actions to rebuild, because many of them lost maybe one or two output files that happened to live on that shard. And it means that, generally speaking, the latency of most FindMissingBlobs calls depends on the tail latency across multiple shards, which is undesirable.

Let's focus on the problem here.

My understanding is that your current CAS server implementation shards storage across multiple hosts. If one host goes down, the CAS entries on it are considered lost and the related AC entries are invalidated, requiring those actions to be re-executed.

A typical way to handle a single node going down is to replicate the data across multiple nodes. For example, databases commonly support primary/replica setups where, if the primary goes down, a replica can take over. Some more advanced DB systems also replicate data with a write quorum, creating an HA deployment.

Would any of these existing solutions address your problem? 🤔

@alexofortune
Author

Hi @sluongng

Yes, we do maintain durability using replicas, and yes, it does help. However, I feel it is still somewhat unwieldy: in particular, if you want to do AC validation and therefore need to check for the existence of CAS blobs, you end up fanning out requests to multiple (possibly all) shards, and your latency depends on the tail latency of those requests.

@sluongng
Contributor

I don't get why a request needs to be routed to all shards. Each shard should have its own routing key, right?

For example, in git and git-lfs it's very common to split blob storage across multiple shards, with the first 2-4 characters of the sha256 hash serving as the routing key. Even if you deploy a primary and a replica for each of these shards, the existence check is still relatively cheap.
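
For illustration, prefix-based routing could look like this sketch (prefix width and shard count are arbitrary):

```go
// Sketch only: git/git-lfs style routing, where the leading hex characters of
// the content hash select the shard.
package main

import (
	"fmt"
	"strconv"
)

const numShards = 16 // arbitrary shard count

// shardForHash maps a sha256 hex string to a shard using its first two chars.
func shardForHash(hexHash string) (int, error) {
	prefix, err := strconv.ParseUint(hexHash[:2], 16, 16)
	if err != nil {
		return 0, err
	}
	return int(prefix) % numShards, nil
}

func main() {
	shard, err := shardForHash("9577a8bfed904bd55390ba203f1a233be1981b36fa945537eb3be5b2446de031")
	if err != nil {
		panic(err)
	}
	fmt.Println("shard:", shard) // 0x95 % 16 = 5
}
```

Since the mapping is deterministic, any frontend or proxy can recompute it without talking to the shards.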

I think what might help from a REAPI perspective would be to support a canonical header that can be used as a routing key. The problem with gRPC requests is that it can be expensive for load balancers to deserialize them to extract the digest value for routing purposes. In that case, we could put the sha256 value into a header to help route the request to the correct shard.

@ulrfa

ulrfa commented May 28, 2024

Maybe different use cases (e.g. regarding AC validation) lead in different directions?

My interest goes more in the opposite direction: splitting huge individual files into smaller parts and spreading them across different shards, in order to reduce load spikes on specific shards. (I don't use AC validation and instead rely on Bazel handling missing CAS blobs via --experimental_remote_cache_eviction_retries.)

However, if some kind of separate routing key is added, don't forget to also consider use cases such as CAS blobs referenced from BEP.

@sluongng
Contributor

@ulrfa let's keep this discussion at the protocol level and avoid implementation specifics as much as possible (e.g. BEP is specific to Bazel). The large-blob-splitting conversation is ongoing over in #282, so let's not get side-tracked.

At the end of the day, most of these requests are ByteStream Read/Write. So here is an AI-generated diagram (with some modifications by me) of how such a setup would work.

[diagram omitted]

Let's say that instead of ByteStream you want to do FindMissing based on some content marker (e.g. AC, BEP, Tree, etc.). Then the CAS server implementation could split the FindMissing digests into multiple requests according to the routing key calculation, right?
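
A rough sketch of that split, reusing the prefix routing above (types are simplified; a real frontend would issue the per-shard calls in parallel and merge the results):

```go
// Sketch only: a frontend splitting one FindMissingBlobs request into
// per-shard batches based on the same prefix routing key.
package main

import (
	"fmt"
	"strconv"
)

const numShards = 16 // arbitrary shard count

// Digest is a simplified stand-in for the REAPI Digest message.
type Digest struct {
	Hash      string
	SizeBytes int64
}

func shardOf(d Digest) int {
	prefix, _ := strconv.ParseUint(d.Hash[:2], 16, 16)
	return int(prefix) % numShards
}

// splitByShard groups requested digests by owning shard; the frontend would
// then issue one FindMissingBlobs call per shard and merge the answers.
func splitByShard(digests []Digest) map[int][]Digest {
	batches := make(map[int][]Digest)
	for _, d := range digests {
		batches[shardOf(d)] = append(batches[shardOf(d)], d)
	}
	return batches
}

func main() {
	req := []Digest{
		{Hash: "9577a8bf00000000000000000000000000000000000000000000000000000000", SizeBytes: 4},
		{Hash: "00d1c2e300000000000000000000000000000000000000000000000000000000", SizeBytes: 10},
	}
	for shard, batch := range splitByShard(req) {
		fmt.Printf("shard %d gets %d digest(s)\n", shard, len(batch))
	}
}
```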

And because each shard can be a primary/replica pair, losing one primary should simply trigger replica promotion; no action needs to re-run in that scenario.

However, the routing/sharding key is a bit hard to extract right now. The ByteStream RPC (and other RPCs) does include it in the resource name of the request, but that requires the load balancer/gateway implementation to be aware of the gRPC payload, deserialize it, and extract the key. Perhaps a standardized HTTP header, such as reapi-routing-digest: 9577a8bfed904bd55390ba203f1a233be1981b36fa945537eb3be5b2446de031/4, could make it much easier for different load balancers/gateways to route traffic to the correct shard.

@alexofortune
Author

All of this is right @sluongng, but it's exactly what I'm trying to avoid: the system's latency becomes the tail latency of its slowest shard, so all it takes is one shard having a bad time to have an outsized impact. That's why I was looking for alternative ways to partition.

Also, I'd be supportive of adding something even more general to the protocol to help with routing/sharding, such as reapi-routing-digest.

@sluongng
Contributor

sluongng commented Jun 7, 2024

the system's latency becomes the tail latency of its slowest shard, so all it takes is one shard having a bad time to have an outsized impact. That's why I was looking for alternative ways to partition.

I discussed this concern with Alex on Slack.
The concern mainly applies to the FindMissing RPC, which is valid.

Note that I am mostly proposing this routing key for ByteStream Read/Write, not FindMissing. If an action digest is needed to support routing of FindMissing, then perhaps we could also add it as another header.

So right now we are looking at two additional, optional headers/metadata entries to assist routing:

reapi-routing-blob-digest
reapi-routing-action-digest

These should be plaintext, not protobuf binary-encoded like the current RequestMetadata, so that they can be regex-matched by typical reverse proxy solutions such as Nginx, Caddy, HAProxy, etc.

Right? 🤔
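
For illustration, a client could attach both proposed headers as plain-text gRPC metadata like this sketch (the header names follow the proposal above and are not part of any released spec; the digest values are placeholders):

```go
// Sketch only: attaches the two proposed routing headers as plain-text gRPC
// metadata so a reverse proxy can match them without decoding protobuf.
// The header names and values are from this proposal, not any released spec.
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/metadata"
)

func main() {
	ctx := metadata.AppendToOutgoingContext(context.Background(),
		"reapi-routing-blob-digest", "9577a8bfed904bd55390ba203f1a233be1981b36fa945537eb3be5b2446de031/4",
		"reapi-routing-action-digest", "00a1b2c3d4e5f60718293a4b5c6d7e8f90a1b2c3d4e5f60718293a4b5c6d7e8f/142", // placeholder
	)

	// Any ByteStream Read/Write or FindMissingBlobs call made with ctx would
	// carry these headers; here we just read them back to show the shape.
	md, _ := metadata.FromOutgoingContext(ctx)
	fmt.Println(md.Get("reapi-routing-blob-digest"))
	fmt.Println(md.Get("reapi-routing-action-digest"))
}
```

A reverse proxy could then match on those header values directly, without touching the protobuf payload.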
