Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve memory usage of the "stat cache" #1878

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

bbockelm
Copy link
Collaborator

This PR tidies up the "stat cache" code, attempting to get it to the point where it can be enabled in production again.

Fixes/changes include:

  • Addition of a "memory stress test" that performs as many transfers as possible within a fixed time interval. The goal is to fill the contents of the stat cache and check for leaks.
  • Several small memory hoarding / leak issues fixed:
    • The admin can no longer accidentally make the cache unbounded by setting the size to 0 and the TTLcache goroutine eviction is now run.
    • The client IP cache similarly grew unbounded as the TTL cache goroutine is run. Various log lines have been tidied up.
    • The stat cache is cleared after a server disappears.
  • A prometheus metric was added to record the number of entries in the caches, not just the number of caches.
  • If all the potential origins respond saying they do not have the object, then the director responds with a 404. Previously, in this case it would claim to have insufficient responses and redirect to the origin that sent the 404 in the first place.
  • Unit tests are updated for where error messages have changed.

There's no "smoking gun" of memory hoarding but fixes include:
- When a server stops reporting to the director, delete the corresponding
  stat cache object.
- Actually run the cache cleanup goroutine for the TTL cache.  Previously,
  no cleanup occurred except when the cache hit capacity.
- Ensure the cache always has a capacity - programmatically prevent the
  administrator from configuring an unlimited size.
- Keep a reference to the cache instead of a copy of it in the `statUtil`
  map.
- Decrease the cache sizes and the number of concurrent goroutines.
@bbockelm bbockelm added the bug Something isn't working label Jan 10, 2025
Slightly tweak the error handling logic so if one-or-more origin
returns a 404 for an object (and none return a success or an
indeterminate failure like 403 / 500) then declare the object as
not existing.

Previously, if all origins returned a 404, then the logic declared
the query as having an insufficient number of responses and redirected
the client to one of the origins that just provided the 404.
@bbockelm bbockelm force-pushed the stat_cache_cleanup branch 2 times, most recently from f579edd to 0911e68 Compare January 10, 2025 21:51
@bbockelm bbockelm linked an issue Jan 10, 2025 that may be closed by this pull request
The `log.SetLevel` command should be left untouched as the logging
is actually managed via hooks.
Various unit tests redirected the logrus logging to a byte buffer
but didn't reset things back to normal afterwards.

This resulted in unbounded memory growth in other unit tests (plus
missing test output to stderr!).
The client IP cache did not run the eviction goroutine, causing
an unbounded pileup of cache entries.

This also causes the usage of the client IP to not extend the expiry
time, allowing clients to gently "bounce around" between locations.
This additional stress test generates a large number of object requests
against the director and measures the golang heap usage before/after
the requests.  The goal is to ensure the memory usage stays between
known fenceposts, hopefully acting as an early warning that the
stat cache code has gone wrong.
The `TestCache` test incremented a counter in one goroutine during
HTTP request processing and read it from the main test thread to
see if a request was finished.

Since this was not an atomic counter, the change in the counter could
get propagated at any arbitrary time in the future, meaning the latest
value need not be visible to the main thread.  This PR switches to
an atomic, fixing the race condition.

Additionally, set a TTL in the mock cache to a large value to avoid
potentially triggering a preemptive cache refresh.

This test has been flakey; hopefully this fixes the issue.
Appears that the prior run of `testSyncUploadNone` was leaving around
state causing spurious test failures.
@bbockelm
Copy link
Collaborator Author

Since it touches the clientIP code, I nominated @jhiemstrawisc as the reviewer...

Note that this fixes three different test failures (one linter issue that snuck into main, a random failure, and a race failure).


// The `directorResponse` variable indicates we think this response came from a director
// process, not a proxy / ingress like traefik.
directorResponse := false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you double check whether this approach will mesh well with my open PR for detecting Director reboots in the client?
#1890

// we don't want to permit an unbounded number of queries due to potential
// memory usage.
if concLimit <= 0 {
concLimit = 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good spot to log a warning to tell the user that they a) provided a bad default and b) a different default is being set.

// "unbounded" (bad) and a negative value gets cast to uint64,
// becoming an effectively unbounded number (also bad)
if cap <= 0 {
cap = 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here. Can we log that we're overriding a user-provided config?

@@ -331,7 +331,9 @@ func TestClient(t *testing.T) {
_, err = client.DoGet(ctx, "pelican://"+param.Server_Hostname.GetString()+":"+strconv.Itoa(param.Server_WebPort.GetInt())+"/test/hello_world.txt.1",
filepath.Join(tmpDir, "hello_world.txt.1"), false, client.WithToken(token), client.WithCaches(cacheUrl), client.WithAcquireToken(false))
require.Error(t, err)
assert.Equal(t, "failed download from local-cache: server returned 404 Not Found", err.Error())
// TODO (bbockelm, 10-Jan-2025): It's surprising that the `client.DoGet` above is querying the director then the local cache.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link to an issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleanup memory usage in director
2 participants