[Discussion] Make CAS blobs tied to ActionKey to improve sharding locality #296
Comments
How does this work with remote execution? If an output file gets placed in the input root of a subsequent action, how would the worker know which shard to query? It would need to know the “action key” of the action that yielded that file. And an input root can contain many files, each belonging to a different originating action.
That's correct - basically, the idea is to use a different digest to address files: instead of passing raw content hashes like we do right now, we would pass a combination of action key + content hash to reference these files. The annoying part is that it would probably require something like a new FileDigest message, which would then be used in OutputFile and FileNode.
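For illustration, a minimal sketch of what such a composite digest could look like in Go (the fileDigest type and its shard helper below are hypothetical, not part of REAPI):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// fileDigest is a hypothetical composite address for a CAS blob: it carries
// both the content hash and the key of the action that produced the blob.
// Only the content hash identifies the bytes; the action key would only
// influence placement/routing.
type fileDigest struct {
	ActionKey   string // digest of the Action that produced this output
	ContentHash string // plain SHA-256 of the blob contents
	SizeBytes   int64
}

// shard picks a shard based on the originating action, so that all outputs
// of one action land on the same shard.
func (d fileDigest) shard(numShards uint64) uint64 {
	h := sha256.Sum256([]byte(d.ActionKey))
	return binary.BigEndian.Uint64(h[:8]) % numShards
}

func main() {
	out := fileDigest{
		ActionKey:   "3f9a...", // action digest (truncated for the example)
		ContentHash: "ab12...", // content digest of one output file
		SizeBytes:   4096,
	}
	fmt.Println("shard:", out.shard(16))
}
```

Whether the content hash alone or the full pair identifies the blob on the wire is exactly the part that would need a new message in OutputFile and FileNode.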
This feels like a server-side detail, which could be mitigated entirely on the cache server side (if it is important).
@mostynb How would you mitigate that with the way the protocol works right now? I'm very much open to suggestions on how to solve this problem without resorting to changes to how RE works.
There are a bunch of different ways you could try to mitigate this; it's hard to suggest a single best solution without knowing your requirements. I would probably start by trying to store blobs on multiple shards with a fixed mapping known to the cache frontend.
Let's focus on the problem here. My understanding is that your current server-side CAS implementation shards storage between multiple hosts. If one host were to go down, the CAS entries on it would be considered lost, and related AC entries would be invalidated, thus requiring the Action to be re-executed. A typical way to handle one node going down is to replicate the data between multiple nodes. For example, databases typically support primary/replica setups where, if the primary goes down, a replica can be used instead. Some more advanced DB systems also replicate data with a quorum on write, creating an HA deployment. Would any of these existing solutions address your problem? 🤔
Hi @sluongng Yes, we do maintain durability using replicas, and yes, it does help. However, I feel like it is still kind of unwieldy - especially if you want to do AC validation and thus need to check for the existence of CAS blobs, you will fan out requests to multiple (possibly all) shards, relying on the tail latency of said requests.
I don't get why a request needs to be routed to all shards. Each shard should have its own routing key, right? For example, in git and git-lfs, it's very common to split the storage of blobs onto multiple shards, with the first 2-4 characters of the sha256 hash being the routing key for each shard. Let's say that for each of these shards you deploy a primary and a replica; the existence check is then still relatively cheap.

I think what might help from a REAPI perspective would be to support a canonical header that can be used as a routing key. The problem with gRPC requests is that it could be expensive for load balancers to deserialize the requests to get the digest value for routing purposes. In that case, we could put the sha256 value into a header to help with routing the request to the correct shard.
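As a sketch of that prefix-based scheme (the shardForDigest helper and the fixed shard count are assumptions for illustration):

```go
package main

import (
	"fmt"
	"strconv"
)

// shardForDigest routes a blob by the first two hex characters of its
// SHA-256 hash, the same kind of fan-out git and git-lfs use on disk.
// Two hex characters give 256 buckets, which are then mapped onto the
// available shards.
func shardForDigest(sha256Hex string, numShards int) (int, error) {
	prefix, err := strconv.ParseInt(sha256Hex[:2], 16, 32)
	if err != nil {
		return 0, err
	}
	return int(prefix) % numShards, nil
}

func main() {
	digest := "ab12cd34..." // SHA-256 of the blob, hex encoded
	shard, _ := shardForDigest(digest, 8)
	fmt.Println("route to shard", shard)
}
```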
Maybe different use cases (e.g. regarding AC validation) lead in different directions? My interest is more in the opposite direction: splitting huge individual files into smaller parts and spreading them among different shards, in order to reduce spikes on specific shards. (I don't use AC validation and instead rely on bazel handling missing CAS blobs via --experimental_remote_cache_eviction_retries.) However, if adding some kind of separate routing key, then don't forget to also consider use cases such as CAS blobs referenced from the BEP.
@ulrfa let's keep this discussion at the protocol level and avoid implementation specifics as much as possible (i.e. BEP is specific to Bazel). The large-blob splitting conversation is ongoing over in #282, so let's not side-track.

At the end of the day, most of these requests are ByteStream Read/Write. So here is an AI-generated diagram (with some modification by me) of how such a setup would work. Let's say that instead of ByteStream, you want to do FindMissing based on some content marker (i.e. AC, BEP, Tree, etc...). Then the CAS server implementation could split up the FindMissing digests into multiple requests according to the routing key calculation, right? And because each shard could be a set of primary and replica, losing one primary should simply trigger replica promotion. No Action would need to re-run in such a scenario. However, extracting …
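For illustration, the frontend-side fan-out could look roughly like this (a sketch using plain local types instead of the generated REAPI stubs; shardFor stands in for whatever routing-key calculation the deployment uses):

```go
package main

import "fmt"

// digest stands in for the REAPI Digest message in this sketch.
type digest struct {
	Hash      string
	SizeBytes int64
}

// shardFor is a placeholder routing-key calculation, e.g. based on the
// leading characters of the hash.
func shardFor(d digest, numShards int) int {
	return int(d.Hash[0]) % numShards
}

// splitByShard groups one FindMissingBlobs request into per-shard batches;
// each batch would then be sent to its shard's primary (or, on failure, its
// replica), and the missing sets unioned afterwards.
func splitByShard(digests []digest, numShards int) map[int][]digest {
	batches := make(map[int][]digest)
	for _, d := range digests {
		s := shardFor(d, numShards)
		batches[s] = append(batches[s], d)
	}
	return batches
}

func main() {
	req := []digest{
		{Hash: "ab12...", SizeBytes: 10},
		{Hash: "cd34...", SizeBytes: 20},
		{Hash: "af56...", SizeBytes: 30},
	}
	for shard, batch := range splitByShard(req, 4) {
		fmt.Printf("shard %d gets %d digests\n", shard, len(batch))
	}
}
```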
All of this is right @sluongng, but it's exactly what I'm trying to avoid, since the system's tail latency will be determined by the slowest shard, so all it takes is one shard having a bad time to have an outsized impact - which is why I was looking for alternative ways to partition. Also, I'd be supportive of adding something even more general to the protocol to help with routing/sharding, such as …
Discussed with Alex on Slack about this concern. Noted that I am mostly proposing this routing key for ByteStream Read/Write and not FindMissing. If an action digest is needed to support the routing of FindMissing, then perhaps we could also add it as another header. So right now we are looking into two additional (optional) headers/metadata to assist routing:

- the blob digest, to route ByteStream Read/Write calls
- the action digest, to route FindMissing (and similar) calls

These should be plaintext, not protobuf binary encoded like the current RequestMetadata, so that they can be regex-matched by typical reverse proxy solutions like Nginx, Caddy, HAProxy, etc. Right? 🤔
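For illustration, attaching such plaintext routing metadata from a Go client might look roughly like this (the header names are placeholders, not agreed-upon protocol additions):

```go
package main

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// withRoutingHeaders attaches hypothetical plaintext routing metadata to an
// outgoing gRPC call, so an L7 proxy (Nginx, Caddy, HAProxy, ...) can pick a
// shard by matching header values without decoding the protobuf payload.
func withRoutingHeaders(ctx context.Context, blobDigest, actionDigest string) context.Context {
	return metadata.AppendToOutgoingContext(ctx,
		"x-cas-blob-digest", blobDigest, // hypothetical: routing key for ByteStream Read/Write
		"x-cas-action-digest", actionDigest, // hypothetical: routing key for FindMissingBlobs
	)
}

func main() {
	ctx := withRoutingHeaders(context.Background(), "ab12...", "3f9a...")
	_ = ctx // the context would then be passed to the ByteStream / CAS RPC calls
}
```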
Hi there! Opening a discussion issue to see what other folks think about this.
Right now the CAS keyspace is basically content hashes. This means that if your CAS is a sharded system, actions can have their outputs split among multiple shards. This also means that if you lose just one of your shards, you will force *a lot* of actions to rebuild, as many of them lost maybe one or two output files that happened to be on that shard. It also means that, generally speaking, the latency of most FindMissingBlobs calls is going to depend on the tail latency over multiple shards, which is undesirable.
My initial gut feeling is that if we could key by a combination of an action key + output key instead of raw output hashes, then we could use that to get better shard locality (colocate all of an action's outputs in a single shard). It comes with some rough edges (for example, multiple actions might produce the same output, and we would see duplication), but in practice I don't think that would be much of a concern.
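As a back-of-the-envelope illustration of the blast-radius argument (assuming each action has k outputs spread uniformly over N shards versus all k colocated on one shard; the numbers are made up):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const shards = 32.0
	for _, outputs := range []float64{1, 5, 20} {
		// Content-hash sharding: the action survives the loss of one shard
		// only if none of its outputs happened to land on that shard.
		spread := 1 - math.Pow(1-1/shards, outputs)
		// Action-keyed sharding: all outputs sit on one shard, so the action
		// is affected only if that particular shard is the one lost.
		colocated := 1 / shards
		fmt.Printf("%2.0f outputs: affected %.1f%% (spread) vs %.1f%% (colocated)\n",
			outputs, spread*100, colocated*100)
	}
}
```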