Local disk cache #529
base: master
Conversation
This is ready for review now. I'm not sure about the lint errors -- it complains about unwrapped errors from other packages, but from what I can see in the code, we do this everywhere, so I'm choosing to ignore them unless asked otherwise.
go/pkg/diskcas/diskcas.go
newSize := uint64(atomic.AddInt64(&d.sizeBytes, dg.Size))
if newSize > d.maxCapacityBytes {
    select {
    case d.gcReq <- atomic.AddUint64(&d.gcTick, 1):
The only reason we count the GC cycles here is for testing: I couldn't find another way to wait for GC to finish in tests.
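(For illustration, a minimal sketch of how such a tick counter can be awaited in a test -- gcDoneTick and waitForGC are illustrative names, not the PR's actual API; the GC goroutine would publish the tick it last finished with atomic.StoreUint64 at the end of each round:)

func (d *DiskCache) waitForGC(tick uint64) {
    // Spin until the GC goroutine reports it has processed this tick.
    for atomic.LoadUint64(&d.gcDoneTick) < tick {
        time.Sleep(time.Millisecond)
    }
}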
Update: Brave has tried this change and sees no difference, because we still need to do an Action Cache RPC. I will now add an optional Action Cache to this change to see how much improvement they get with it.
Thanks for the PR, Ola! I generally like and agree with having a local disk cache feature, but I want to ask some due diligence questions.
Hey Kousik and Anas! Good questions!
1 - @goodov from Brave has tried this out on a slow connection -- maybe Aleksey can comment on how it compares to racing, because I know he has tried that as well (although it's not yet the default setting we recommend to customers). I'm wondering if there's a way for me to artificially create a VM with a slow connection to try it out -- maybe get a VM in the farthest possible region from our backend?...
2 - I tried a Chromium build on Linux with a 50GB cache size, and the actual cache size was 26GB. Which, btw, implies that the RBE download metrics displayed at the end of the build are way off (it usually says ...).
3 - Good point! I didn't think the output files would be locally modified (do they? Do you know which actions do that?). I'll add handling of this case and a unit test. But I'm struggling to imagine this being a race condition unless it's literally two concurrent actions that output the same file -- and if that's the case, isn't it already a race condition in the build itself?...
4 - Yes, agreed; I did it in one go because I wanted @goodov to be able to give it a try easily. I think this can naturally break into 3 PRs: the first introduces the diskcache with CAS-only support plus unit tests, the second adds Action Cache functionality plus unit tests, and the third introduces the client flags plus instrumentation to actually use the diskcache in the client. Does that sound good?
Thank you!
Updates:
I'm leaving this PR here for reference to the whole thing.
Sorry I'm late. I'm trying to understand the context of such a cache within past discussions and future requirements.
Ah, interesting. Any chance you've been able to spot the cause of this mismatch?
}
it := &qitem{
    key: k,
    lat: fileInfoToAccessTime(info),
Why use the last access time instead of the last modified time?
I don't believe that using the file system to track this information is a good idea. My implementations saved this as a dumb JSON file, and that was very simple. I don't care about the actual timestamp, just the ordering; the file is a flat list of files in LRU eviction order. This also removes the need for the priorityQueue type.
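(For illustration, a rough sketch of that alternative, with made-up names and layout -- persist the eviction order as a flat JSON list, least recently used first, so no timestamps or priority queue are needed:)

type lruState struct {
    Files []string `json:"files"` // relative paths, LRU-first eviction order
}

func saveLRU(path string, s *lruState) error {
    b, err := json.Marshal(s)
    if err != nil {
        return err
    }
    return os.WriteFile(path, b, 0600)
}

func loadLRU(path string) (*lruState, error) {
    b, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    s := &lruState{}
    return s, json.Unmarshal(b, s)
}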
go/pkg/diskcache/diskcache.go
d.mu.Lock()
it := heap.Pop(d.queue).(*qitem)
d.mu.Unlock()
size, err := d.getItemSize(it.key)
I don't understand why this does a stat on the file when the size is already in the digest.
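(A sketch of what this comment suggests; digest.Digest in remote-apis-sdks carries both the hash and the size, so no file system call is needed:)

// No stat needed: the digest already records the blob's size.
func (d *DiskCache) getItemSize(k key) int64 {
    return k.digest.Size
}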
go/pkg/diskcache/diskcache.go
diskOpsStart := time.Now()
// We only delete the files, and not the prefix directories, because the prefixes are not worth worrying about.
if err := os.Remove(d.getPath(it.key)); err != nil {
    log.Errorf("Error removing file: %v", err)
It'd be good if the errors were accumulated, so that Shutdown() could be renamed to Close() and return the accumulated errors.
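(A hedged sketch of that idea using errors.Join from Go 1.20+; the err field and recordErr helper are assumptions, not part of the PR:)

// recordErr accumulates background GC errors under the cache's mutex.
func (d *DiskCache) recordErr(err error) {
    d.mu.Lock()
    d.err = errors.Join(d.err, err)
    d.mu.Unlock()
}

// Close stops the GC goroutine (omitted here) and returns the accumulated
// errors, matching the usual io.Closer convention.
func (d *DiskCache) Close() error {
    d.mu.Lock()
    defer d.mu.Unlock()
    return d.err
}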
    return key{digest: dg, isCas: !isAc}, nil
}

func (d *DiskCache) getPath(k key) string {
Optional nit: I would have preferred files to have a uniform file name, "%s-%d" % (hash, size), instead of having the size as an extension. Then you could use ".cas" as the marker for CAS files. IMHO it would be a bit cleaner.
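(A sketch of the proposed scheme; d.root is an assumed field name, and the PR's actual layout keeps the size as an extension instead:)

func (d *DiskCache) getPath(k key) string {
    // Uniform "<hash>-<size>" stem; ".cas" alone marks CAS entries.
    name := fmt.Sprintf("%s-%d", k.digest.Hash, k.digest.Size)
    if k.isCas {
        name += ".cas"
    }
    return filepath.Join(d.root, name)
}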
if q.n == cap(q.items) {
    // Resize the queue.
    old := q.items
    q.items = make([]*qitem, 2*cap(old)) // Initial capacity needs to be > 0.
newCap := 2 * cap(old)
if newCap == 0 {
    newCap = 256
}
Then you can remove the comment.
type key struct {
    digest digest.Digest
    isCas  bool
Is isCas meant to enable a performance optimization later? I don't see any behavior change based on this member.
Thank you, Marc-Antoine -- I'm still tinkering with the whole thing; I just added stats and benchmark tests (not pushed yet), and I'm trying out different variations to see what's fastest. I hope I can get it all done by Monday.
@@ -129,6 +140,11 @@ func (c *Client) DownloadOutputs(ctx context.Context, outs map[string]*TreeOutput
    if err := cache.Update(absPath, md); err != nil {
        return fullStats, err
    }
    if c.diskCache != nil {
        if err := c.diskCache.StoreCas(output.Digest, absPath); err != nil {
Do you think it could be done as part of DownloadFiles()? Otherwise this adds serial latency.
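(One way to avoid the serial latency, sketched with golang.org/x/sync/errgroup; the downloadedOutput type and field names are illustrative, not the client's actual API:)

type downloadedOutput struct {
    digest  digest.Digest
    absPath string
}

// storeOutputs writes all downloaded outputs to the disk cache concurrently,
// so the stores overlap with each other instead of running one after another.
func storeOutputs(dc *diskcache.DiskCache, outs []downloadedOutput) error {
    var eg errgroup.Group
    for _, o := range outs {
        o := o // capture a per-iteration copy (pre-Go 1.22 loop semantics)
        eg.Go(func() error {
            return dc.StoreCas(o.digest, o.absPath)
        })
    }
    return eg.Wait()
}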
I'm really sorry, but I have to admit to myself and others that I apparently have zero spare cycles to put into this in the foreseeable future :-(
Implements an optional local disk cache (LRU, CAS + Action Cache) used for remote action outputs.
Goma has this feature, and many of our Reclient customers have requested it (usually because of slow internet connections).
In my benchmarks on Linux with a relatively good internet connection, this did not reduce the runtime of a fully cached build by a statistically significant amount, but it didn't hurt either.
The current implementation does GC asynchronously, to save time. The priority queue implementation using the container/heap library is taken from the Go documentation, except that I improved it a bit to double the array size when needed instead of extending it by one element at a time.
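(For reference, a sketch of the doubling growth described above; the exact priorityQueue fields are assumptions based on the snippets quoted earlier in this thread:)

func (q *priorityQueue) Push(x any) {
    if q.n == cap(q.items) {
        newCap := 2 * cap(q.items)
        if newCap == 0 {
            newCap = 256 // default capacity for an empty queue
        }
        items := make([]*qitem, newCap)
        copy(items, q.items)
        q.items = items // doubling keeps growth amortized O(1) per push
    }
    q.items[q.n] = x.(*qitem)
    q.n++
}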