Improve memory usage of the "stat cache" #1878
There's no "smoking gun" of memory hoarding, but fixes include:
- When a server stops reporting to the director, delete the corresponding stat cache object.
- Actually run the cache cleanup goroutine for the TTL cache. Previously, no cleanup occurred except when the cache hit capacity.
- Ensure the cache always has a capacity - programmatically prevent the administrator from configuring an unlimited size.
- Keep a reference to the cache instead of a copy of it in the `statUtil` map.
- Decrease the cache sizes and the number of concurrent goroutines.
Slightly tweak the error handling logic so that if one or more origins return a 404 for an object (and none return a success or an indeterminate failure like 403 / 500), the object is declared as not existing. Previously, if all origins returned a 404, the logic declared the query as having an insufficient number of responses and redirected the client to one of the origins that had just returned the 404.
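The decision rule described in this commit message can be sketched as a small pure function. The names `statOutcome` and `classifyStatResponses` are hypothetical, and the real director code works with richer response objects than bare status codes.

```go
package main

import "net/http"

// statOutcome is a hypothetical summary of what the queried origins reported.
type statOutcome int

const (
	outcomeFound         statOutcome = iota // at least one origin has the object
	outcomeNotFound                         // only 404s were seen
	outcomeIndeterminate                    // no usable answer (e.g. 403, 500, or no responses)
)

// classifyStatResponses reduces the HTTP status codes returned by the origins
// to a single outcome. The object is declared missing only when at least one
// origin returned 404 and none returned a success or an indeterminate failure.
func classifyStatResponses(codes []int) statOutcome {
	sawNotFound := false
	sawOther := false
	for _, code := range codes {
		switch {
		case code >= 200 && code < 300:
			return outcomeFound
		case code == http.StatusNotFound:
			sawNotFound = true
		default:
			sawOther = true
		}
	}
	if sawNotFound && !sawOther {
		return outcomeNotFound
	}
	return outcomeIndeterminate
}
```

Under this sketch, two origins both answering 404 yield `outcomeNotFound` rather than a redirect to one of them.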
The `log.SetLevel` command should be left untouched as the logging is actually managed via hooks.
Various unit tests redirected the logrus logging to a byte buffer but didn't reset things back to normal afterwards. This resulted in unbounded memory growth in other unit tests (plus missing test output to stderr!).
The client IP cache did not run the eviction goroutine, causing an unbounded pileup of cache entries. This also causes the usage of the client IP to not extend the expiry time, allowing clients to gently "bounce around" between locations.
This additional stress test generates a large number of object requests against the director and measures the golang heap usage before/after the requests. The goal is to ensure the memory usage stays between known fenceposts, hopefully acting as an early warning that the stat cache code has gone wrong.
The `TestCache` test incremented a counter in one goroutine during HTTP request processing and read it from the main test thread to see if a request was finished. Since this was not an atomic counter, the change in the counter could get propagated at any arbitrary time in the future, meaning the latest value need not be visible to the main thread. This PR switches to an atomic, fixing the race condition. Additionally, set the TTL in the mock cache to a large value to avoid potentially triggering a preemptive cache refresh. This test has been flaky; hopefully this fixes the issue.
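The switch to an atomic counter can be sketched as follows. The type `requestCounter` and the function `serveConcurrently` are hypothetical stand-ins for the test's handler and request loop, and `atomic.Int32` assumes Go 1.19 or newer.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// requestCounter counts finished requests. Using atomic.Int32 means a value
// written by a handler goroutine is safely visible to the goroutine reading it.
type requestCounter struct {
	finished atomic.Int32
}

// handle stands in for the HTTP handler marking one request as finished.
func (r *requestCounter) handle() {
	r.finished.Add(1)
}

// Finished returns the number of requests completed so far.
func (r *requestCounter) Finished() int32 {
	return r.finished.Load()
}

// serveConcurrently simulates n handler goroutines and waits for all of them.
func serveConcurrently(r *requestCounter, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.handle()
		}()
	}
	wg.Wait()
}
```

With a plain `int` in place of `atomic.Int32`, the same code would be reported by `go test -race` as a data race.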
Appears that the prior run of `testSyncUploadNone` was leaving around state causing spurious test failures.
Since it touches the Note that this fixes three different test failures (one linter issue that snuck into main, a random failure, and a race failure).
// The `directorResponse` variable indicates we think this response came from a director
// process, not a proxy / ingress like traefik.
directorResponse := false
Can you double check whether this approach will mesh well with my open PR for detecting Director reboots in the client?
#1890
// we don't want to permit an unbounded number of queries due to potential
// memory usage.
if concLimit <= 0 {
concLimit = 100
This might be a good spot to log a warning telling the user that a) they provided a bad value and b) a different default is being set.
// "unbounded" (bad) and a negative value gets cast to uint64,
// becoming an effectively unbounded number (also bad)
if cap <= 0 {
cap = 100
Same comment here. Can we log that we're overriding a user-provided config?
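One way to act on these two review comments is to route both overrides through a helper that warns before substituting the default. This is a sketch only: `effectiveLimit`, `defaultLimit`, and the parameter name passed in are hypothetical, and it uses the standard library `log` package rather than the project's logrus setup.

```go
package main

import "log"

// defaultLimit mirrors the fallback of 100 used in the snippets under review.
const defaultLimit = 100

// effectiveLimit returns configured when it is positive. Otherwise it logs a
// warning naming the setting and the rejected value, then returns the default,
// so the administrator can see that their configuration was overridden.
func effectiveLimit(name string, configured int) int {
	if configured <= 0 {
		log.Printf("warning: %s is set to %d, which would be unbounded; using default of %d instead",
			name, configured, defaultLimit)
		return defaultLimit
	}
	return configured
}
```

Both the concurrency limit and the cache capacity could then be set with a single call each, keeping the warning text consistent.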
@@ -331,7 +331,9 @@ func TestClient(t *testing.T) {
_, err = client.DoGet(ctx, "pelican://"+param.Server_Hostname.GetString()+":"+strconv.Itoa(param.Server_WebPort.GetInt())+"/test/hello_world.txt.1",
filepath.Join(tmpDir, "hello_world.txt.1"), false, client.WithToken(token), client.WithCaches(cacheUrl), client.WithAcquireToken(false))
require.Error(t, err)
assert.Equal(t, "failed download from local-cache: server returned 404 Not Found", err.Error())
// TODO (bbockelm, 10-Jan-2025): It's surprising that the `client.DoGet` above is querying the director then the local cache. |
Can you link to an issue?
This PR tidies up the "stat cache" code, attempting to get it to the point where it can be enabled in production again.
Fixes/changes include: