-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
propfind invalidates filecache #496
base: dev
Are you sure you want to change the base?
Conversation
... so that the server can redirect us without needing to read the data.
... so that the server can redirect us without needing to read the data.
gcs uses md5 as the etag, and we store the etag from gcs, not the one from the 307 response from Valhalla.
0228170
to
ac4706f
Compare
src/filecache.c
Outdated
@@ -386,7 +386,7 @@ static void get_fresh_fd(filecache_t *cache, | |||
// If we're in saint mode, don't go to the server | |||
if (pdata != NULL && | |||
((flags & O_TRUNC) || use_local_copy || | |||
(pdata->last_server_update == 0) || (time(NULL) - pdata->last_server_update) <= STAT_CACHE_NEGATIVE_TTL)) { | |||
(pdata->last_server_update == 0) || pdata->last_server_update > FILECACHE_INVALIDATED)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to make the implicit assumption that there's been a recent PROPFIND invalidation pass on the directory prior to reaching here. While seems reasonable given expected kernel behavior, I'm not sure how I feel about relying on that ordering for correctness. I think I need to give this some thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we have options to de-risk the change. The plan was always to use valhalla log data to find customers with a high number of 304s and less-than-elite plan/tier and use them for a canary release that would be closely monitored. We could additionally add a softer age out, like 60 seconds or so. That way any unexpected behavior would correct itself fairly quickly, while also letting the invalidation logic have a potentially significant effect on the number of 304s. Files that are actually being updated many times a minute would have problems, but I imagine sites with that behavior aren't performing very well anyway. As we gain confidence we could grow the age out interval.
We could also add another type of cache entry for "when did we last update the cache", which would let us log warnings and bump statsd counters when for some reason we did not update our caches recently before entering get_fresh_fd
. I would think that, even with races to update this "meta-cache" entry, our worst case result of keeping the oldest update time would give us an update time that's inaccurate by a few seconds, 2 or 3 at most. We could introduce telemetry for this before actually introducing the invalidation logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about checking the directory's last freshening and using the later of the directory and file freshness timestamps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As clarified in slack, I feel great about this. Also, I learned that the stat cache actually has a special entry for this, which why it's even possible to set the query parameter for progressive propfinds. I've rebased on the change that allows me to build docker images and debug them, mostly so I could test this change (cursory testing show that it works).
I'm not explicitly trying for the latter of two timestamps, I'm just setting a grace period for the validity of a directory update that would have invalidated some file cache records, and falling back on the logic that was already there. I'm introducing separate defines for separate concepts in an effort to remove a trap where someone could intentionally change "the time after which a negative entry in the stat cache is no longer valid" and unwittingly change a bunch of other things. All these things are 2
(seconds) ... for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and the latest commit is my latest stab at doing this.
09d654d
to
45a65de
Compare
So we can run them in kube.
124b7ca
to
e9e6178
Compare
e9e6178
to
3b3ad15
Compare
Before this change, we only serve from the file cache if the content was updated by a call to the server in the last 2 seconds, or the thread is in saint mode. The reason is that we do a conditional get using the If-None-Match header and the Etag from the last response for that file path, hoping to get a relatively fast 304 most of the time. However, querying total duration of requests broken down by cluster, method, and status shows that our 304 GETs are using just as much time in process as normal GETs and PUTs due to sheer volume. This change invalidates the file cache for a path when a PROPFIND shows that the path has been updated in Valhalla. The mechanism for invalidation is to set the timestamp in `filecache_pdata->last_server_update` to 1 for the path in question, and to serve all paths from file cache if the entry isn't invalidated. This arrangement affords serving invalidated content while in saint mode, and aggressive removal of known invalid content by the file cache cleanup mechanism. An optimization we could consider is to use the etag to locate the file cache content, in order to provide deduplication and potentially better cache coherency for the case where the same file content moves between paths. We would need to add the etag to the propfind data, parse it out, pass it to the callback, and use that to evaluate the freshness of the file cache data, and also change the logic for naming a cache file to rename the file after its hash is known, but before publishing file cache data to leveldb. Since the etag and the filename are both properties of the file cache data, we would need to either change the format of the stat cache data, add another entry type, or add a concept of "unfilled" file cache data.
3b3ad15
to
7c24280
Compare
eae4795
to
add5fa8
Compare
934aa78
to
ea7ba94
Compare
Before this change, we only serve from the file cache if the content was updated
by a call to the server in the last 2 seconds, or the thread is in saint mode.
The reason is that we do a conditional get using the If-None-Match header and
the Etag from the last response for that file path, hoping to get a relatively
fast 304 most of the time. However, querying total duration of requests broken
down by cluster, method, and status shows that our 304 GETs are using just as
much time in process as normal GETs and PUTs due to sheer volume.
This change invalidates the file cache for a path when a PROPFIND shows that the
path has been updated in Valhalla. The mechanism for invalidation is to set the
timestamp in
filecache_pdata->last_server_update
to 1 for the path inquestion, and to serve all paths from file cache if the entry isn't invalidated.
This arrangement affords serving invalidated content while in saint mode, and
aggressive removal of known invalid content by the file cache cleanup mechanism.
An optimization we could consider is to use the etag to locate the file cache
content, in order to provide deduplication and potentially better cache
coherency for the case where the same file content moves between paths. We would
need to add the etag to the propfind data, parse it out, pass it to the
callback, and use that to evaluate the freshness of the file cache data, and
also change the logic for naming a cache file to rename the file after its hash
is known, but before publishing file cache data to leveldb. Since the etag and
the filename are both properties of the file cache data, we would need to either
change the format of the stat cache data, add another entry type, or add a
concept of "unfilled" file cache data.