Bloom filters to avoid KV writes #27
My costs have come down quite a bit since July (when my Cloudflare Workers bill was a pricey $92.72) and recently hit a low of $13.50 for November. I'm cautiously optimistic that I won't have to take more drastic steps. I will monitor the bills for December 2021 through February 2022 before I make a decision.
Edit: turns out my savings were caused by a bug! I guess my bills are going to return to like $100 per month again.
Since your update earlier today, I have spent some time looking into how Bloom filters work and how we could implement one here. I found a good whitepaper on a key-value Bloom filter implementation that you might want to check out if you haven't seen it already. It focuses mostly on a "new" Bloom filter design, but the concepts it discusses could probably be applied, at least in part, to any Bloom filter. The biggest hurdle I'm seeing is simply how this data structure will be stored. If we want it to persist across multiple datacenters, the only in-house way of doing so, as far as I can see, is Workers KV itself. That would, at least in my mind, undermine the entire point of this endeavor; still, if it reduced the monthly bills, it would probably be worthwhile.
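For concreteness, here is a minimal sketch of the data structure being discussed. The size (1024 bits), hash count (3), and the seeded FNV-1a hashing are arbitrary choices of mine for illustration, not anything taken from the whitepaper:

```typescript
// Minimal Bloom filter sketch: a bit array plus k seeded hash functions.
// No false negatives are possible; false positives are, at a rate that
// grows as more keys are added.
class BloomFilter {
  private bits: Uint8Array;

  constructor(private m: number = 1024, private k: number = 3) {
    this.bits = new Uint8Array(Math.ceil(m / 8)); // all bits start at 0
  }

  // Seeded FNV-1a hash, reduced modulo the filter size.
  private hash(value: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.m;
  }

  add(value: string): void {
    for (let i = 0; i < this.k; i++) {
      const idx = this.hash(value, i);
      this.bits[idx >> 3] |= 1 << (idx & 7); // set bit idx to 1
    }
  }

  has(value: string): boolean {
    for (let i = 0; i < this.k; i++) {
      const idx = this.hash(value, i);
      if ((this.bits[idx >> 3] & (1 << (idx & 7))) === 0) return false;
    }
    return true; // all k bits set: "probably seen before"
  }

  // The raw bits could be persisted as a single KV value.
  serialize(): Uint8Array {
    return this.bits;
  }
}
```

The whole filter serializes to 128 bytes here, which is the storage question above: persisting it across datacenters via KV means every update to the filter is itself a KV write.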
I mean, we'd have to read the Bloom filter on every miss, but the number of writes gets smaller as more entries in the filter are populated (with the probability of an entry being in the filter approaching 1, assuming a fixed size and that it is not periodically forgotten). Still, implementing a Bloom filter is probably worth it, but it has to be balanced between two things:
I think implementing a Bloom filter to reduce "one-hit wonders" and KV writes is actually pretty straightforward. In my mind, it would work like this:

1. We reserve a section of KV for the Bloom filter, with all bits initially set to 0.
2. When a request comes in (and makes it past the cache), we check the Bloom filter to see whether the key has been seen before.
3. If it hasn't, we retrieve the data from Mojang's API as usual and add the key to the Bloom filter, setting bits to 1 where required, but we do not yet flush the data to KV.
4. The next time a request for that key comes in, the Bloom filter reports it as seen, so we check KV for it. If KV contains the data, perfect; otherwise we go out for the data again, and this time we flush it to KV.

This is slightly different from the implementation described in the original issue: it checks whether the key has been requested before, not whether it hit the cache. Checking against cache hits instead would be trivial, though I don't know how that would affect cache effectiveness or the number of writes. There are two main things to consider with this:
Additionally, I'm struggling to see how a Bloom filter would reduce the rate of Mojang's 429 errors, although since you mentioned it, I'm sure the 429 rate would benefit from the inclusion of a Bloom filter.
This is fine - for reference, writes are roughly half of the read count.
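To put rough numbers on that (the per-operation prices here are my assumption of the Workers KV paid-tier rates at the time, consistent with the 10x ratio mentioned below):

```typescript
// Rough cost model: writes are ~half the read count but cost 10x per op,
// so they dominate the bill. Prices are assumptions, not quoted figures.
const READ_PRICE = 0.5 / 1e6; // $ per KV read
const WRITE_PRICE = 5.0 / 1e6; // $ per KV write

function monthlyCost(reads: number): { readCost: number; writeCost: number } {
  const writes = reads / 2; // "writes are roughly half of the read count"
  return { readCost: reads * READ_PRICE, writeCost: writes * WRITE_PRICE };
}

// e.g. 20M reads/month -> 10M writes/month: write cost is 5x the read cost,
// i.e. ~83% of the KV bill, which is why cutting writes is the target.
const { readCost, writeCost } = monthlyCost(20e6);
```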
That's fine - we can just expire the Bloom filter every day. We lose little by essentially discarding it.
Actually, it would probably increase the rate.
Okay, so still cheaper for sure.
While that makes sense, wouldn't it require re-allocating space in KV for the Bloom filter, requiring yet more writes? Although I assume that would still probably be fewer writes than there are now.
Ah I see, I misinterpreted what you said before, that makes sense.
Workers KV writes cost 10x as much as KV reads. This leads to rather large bills, something I do not desire. Therefore, I intend to implement a Bloom filter that works like so:
On every request, we check the local cache first. If it hits, the Bloom filter indicates the key was requested at least once before, and the key doesn't already exist in KV, we flush the data to KV. Otherwise, we do a KV get, and if that succeeds, store the result in the local cache. Failing that, we go out to the original source but cache the result locally only.
This will need some more optimization.