Bloom filters to avoid KV writes #27

astei · 2021-07-03T20:06:38Z

Workers KV writes cost 10x as much as KV reads. This leads to rather large bills, something I do not desire. Therefore, I intend to implement a Bloom filter that works like so:

On every request that doesn't immediately hit a local cache, a cache request will hit the local cache first. If it does hit and the Bloom filter indicates that the request was hit at least once before and the key doesn't already exist in KV, we will flush the data to KV. Otherwise, we will do a KV get and if that succeeds, store it in the local cache. Otherwise, go out to the original source but cache locally only.

This will need some more optimization.

astei · 2021-12-20T06:27:45Z

My costs have come down quite a bit since July (where my Cloudflare Workers bill was a pricy $92.72) and recently hit a low of $13.50 for November. I'm cautiously optimistic that I won't have to take more drastic steps. I will monitor the bills for December 2021 through February 2022 before I make a decision.

astei · 2022-02-19T21:07:04Z

Edit: turns out my savings were caused by a bug! I guess my bills are going to return to like $100 per month again.

TibiNonEst · 2022-02-20T05:02:07Z

Since your update earlier today, I have spent a bit of time looking into how Bloom filters work, and moreover, how we could implement them here. I found a good whitepaper on a key-value Bloom filter implementation that you might want to check out if you haven't seen it already. It is mostly focused on a "new" Bloom filter implementation, but the concepts discussed could probably, at least in part, be applied to any Bloom filter. Suffice it to say, the biggest hurdle I'm seeing is simply how this data structure is going to be stored. If we want it to persist between multiple datacenters, the only in-house provided way of doing so, as far as I can see, is using Workers KV itself. This would, at least in my mind, undermine the entire point of this endeavor, however, if this would reduce monthly bills it would probably still be worthwhile.

astei · 2022-02-20T23:49:32Z

I mean, we'd have to read the Bloom filter on every miss, but the number of writes gets smaller as more entries in the filter are populated (with the probability of an entry being in the filter approaching 1, assuming a fixed size and that it is not periodically forgotten). Still, implementing a Bloom filter is probably worth it, but it has to be balanced between two things:

Mojang 429 rates: we still get 429 errors pretty often, although we're now caching entries for an entire 24 hours.
The actual rate of "one-hit wonders": I have gathered little data about this on our traffic specifically, but the theory behind implementing a Bloom filter for this purpose has a long history (Akamai's success is in part attributable to Bloom filters, and Cloudflare probably uses Bloom filters internally to prevent "one-hit wonders" from being saved to disk).

TibiNonEst · 2022-02-21T05:25:35Z

I think implementing a Bloom filter to reduce "one-hit wonders" and KV writes is actually pretty straightforward. In my mind, we populate a section of the KV for the Bloom filter all set initially to 0s. When a request comes in (and makes it past the cache) it checks against the Bloom filter whether or not it has seen it before. If it hasn't, it retrieves the data from Mojang's API as usual and adds that request to the Bloom filter, writing 1s where required (it does not yet flush the data to the KV). Next time a request for this comes in, it will check against the Bloom filter whether or not it has been requested before, if it has it will check the KV for it. If the KV contains the data, perfect, otherwise it will go out for the data again but this time it will flush it to the KV.

This is slightly different from the implementation described in the original issue as it doesn't check whether or not it hits the cache, just whether or not it has been requested before, however, checking against cache hits instead would be trivial; I don't know how this would affect cache effectiveness nor the number of writes.

There are two main things to consider with this:

While reducing KV writes, this effectively doubles the number of KV reads needing to read both the Bloom filter and KV for most values. Reads are much cheaper, I know, but I don't know what your reads vs writes numbers currently look like so I can't do the math on how this would affect billing.
More importantly, KV-cached data does need to expire, it's a cache after all. This means we'd either have to delete values from the filter or accept that it will eventually become redundant. Notably, normal Bloom filters do not allow for deletions from the set, we'd have to use something such as a Counting Bloom filter (although this allows for false negatives) or a cuckoo filter. If we are to delete values, we cannot do so by using an expirationTtl like with normal KV writes, but manually due to the limitations of such filters.

Additionally, I'm struggling to see how a Bloom filter would reduce the rate of Mojang's 429 errors, although I'm sure since you mentioned it, it would benefit from the inclusion of a Bloom filter.

astei · 2022-02-21T05:43:26Z

While reducing KV writes, this effectively doubles the number of KV reads needing to read both the Bloom filter and KV for most values. Reads are much cheaper, I know, but I don't know what your reads vs writes numbers currently look like so I can't do the math on how this would affect billing.

This is fine - for reference, writes are roughly half of the read count.

More importantly, KV-cached data does need to expire, it's a cache after all. This means we'd either have to delete values from the filter or accept that it will eventually become redundant.

That's fine - we can just expire the Bloom filter every day. We lose little by essentially discarding it.

Additionally, I'm struggling to see how a Bloom filter would reduce the rate of Mojang's 429 errors, although I'm sure since you mentioned it, it would benefit from the inclusion of a Bloom filter.

Actually, it would probably increase the rate.

TibiNonEst · 2022-02-21T05:51:16Z

This is fine - for reference, writes are roughly half of the read count.

Okay, so still cheaper for sure.

That's fine - we can just expire the Bloom filter every day. We lose little by essentially discarding it.

While that makes sense, wouldn't it require re-allocating space on the KV for the Bloom filter requiring yet more writes? Although, I assume this would still probably be fewer than there are now. ~~Also, KV delete requests are charged at the same rate as write requests although not out of the same pool.~~ I guess in this instance using built-in expiration features does work.

Actually, it would probably increase the rate.

Ah I see, I misinterpreted what you said before, that makes sense.

astei added the enhancement New feature or request label Jul 3, 2021

TibiNonEst mentioned this issue Feb 21, 2022

Add Bloom filter to KV reads/writes #43

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bloom filters to avoid KV writes #27

Bloom filters to avoid KV writes #27

astei commented Jul 3, 2021

astei commented Dec 20, 2021

astei commented Feb 19, 2022

TibiNonEst commented Feb 20, 2022 •

edited

Loading

astei commented Feb 20, 2022

TibiNonEst commented Feb 21, 2022 •

edited

Loading

astei commented Feb 21, 2022

TibiNonEst commented Feb 21, 2022 •

edited

Loading

Bloom filters to avoid KV writes #27

Bloom filters to avoid KV writes #27

Comments

astei commented Jul 3, 2021

astei commented Dec 20, 2021

astei commented Feb 19, 2022

TibiNonEst commented Feb 20, 2022 • edited Loading

astei commented Feb 20, 2022

TibiNonEst commented Feb 21, 2022 • edited Loading

astei commented Feb 21, 2022

TibiNonEst commented Feb 21, 2022 • edited Loading

TibiNonEst commented Feb 20, 2022 •

edited

Loading

TibiNonEst commented Feb 21, 2022 •

edited

Loading

TibiNonEst commented Feb 21, 2022 •

edited

Loading