Add Bloom filter to KV reads/writes #43

TibiNonEst · 2022-02-21T17:22:15Z

Closes #27

Currently, this throws a RangeError: Maximum call stack size exceeded but I'm heading out and don't have time to debug it until later. Also, expiration needs to be fixed because currently, it takes the current time when a write happens + 24 hours, but if we update the value in the Bloom filter it won't expire at the same time.

Having a package-lock.json and yarn-lock.json is redundant and can be confusing to deployments so I just regenerated the lock file for a "clean" start.

TibiNonEst · 2022-02-21T23:43:34Z

Doing bugfixes, currently, it's throwing CompileError: WebAssembly.instantiate(): Wasm code generation disallowed by embedder. I believe this may be because of the hashing library I'm using which is WASM-based for speed reasons. Switching to the JS version will make hash calls on the Bloom filter significantly slower but this error message seems a bit cryptic to me.

astei · 2022-02-21T23:49:42Z

We already have a WASM module in this project. Why not wire up a small function call for it? There's even a bunch of fast hashes implemented as a Rust crate, including xxHash.

TibiNonEst · 2022-02-21T23:51:19Z

Hmm that may make more sense. I am currently using xxHash but through JS bindings, moving it into the WASM module might fix this.

- Removed infinite loop - Bloom filter expires midnight UTC everyday

TibiNonEst · 2022-02-22T01:27:06Z

That fixed it! This implementation increases the KV read amounts by a lot so it will probably still require a little bit of optimization, but KV writes only go once a request made it past cache 2 times in the same day (UTC) by means of a Bloom filter. Honestly, the writes might require more optimization, however, the fewer KV writes we make the more likely Mojang will return 429s because we will be requesting directly from them more.

Currently, the Bloom filter only allocates 10 rows of the KV for its filter. THIS WAS FOR TESTING ONLY. The final production value will depend on how many requests we get a day and the desired error rate, I may be able to make this dynamic but it is not yet. This will need to be changed before merging. Aside from that, 3 hashes seems like a good amount, although I don't know what the tradeoffs are of raising or lowering that.

TibiNonEst · 2022-02-22T01:36:45Z

Also, "requests" currently are separate for username lookups and profile lookups just because of how the KV integration was initially written, I'll probably change this to reduce KV writes by a factor of 2 unless there's some reason I shouldn't.

Edit: this may not be realistic since profile and username lookups are stored separately in the KV and sometimes profile lookups will happen without username lookups

I wish I could have these inherit from the same interface or abstract class but TypeScript doesn't let you define static methods in those.

Reduces writes by removing writing to the same KV key twice

TibiNonEst · 2022-02-23T02:51:53Z

I don't know how much more optimized this can get. I think the only thing left to determine is how we are going to allocate rows and hashes for the Bloom filter. I did make it dynamic, so it takes a desired false positive tolerance and the expected number of items in the Bloom filter to calculate the number of KV pairs needed and how many hashes to apply to each. For testing, I set the false positive tolerance to 5% (0.05) and the expected number of items to 5, but this should be changed for production use. The expected number of items is probably just going to be the average amount of items added to the Bloom filter in a given day (because it expires every day at UTC midnight). What I don't know, however, is how we want to set this value, currently, it's hardcoded but maybe there's a way of querying it from Cloudflare's API.

The only other thing I can think of is maybe improving how it checks whether or not a Bloom filter exists in the KV. Currently, it checks await KVDirect.get('bloom:0') === null and reallocates the KV rows needed if it doesn't exist, but I don't know if this is ideal.

TibiNonEst · 2022-02-28T04:26:03Z

I don't really know what else can be added to improve the performance of this, it seems fairly efficient and gets the job done. I don't know how many more requests to Mojang this will make (and subsequent 429 error responses) but my assumption is it will be a lot. This shouldn't be merged until the values for epsilon and n are decided on, but the code should be good to go.

astei · 2022-03-03T06:50:27Z

I'm going to hold off on merging this for now. Apparently we're under the limit where Cloudflare charges for KV writes. I will have to keep an eye on this.

astei · 2023-09-13T05:45:19Z

I'm back to this, then. I received a very large bill from Cloudflare last month.

astei · 2023-09-13T05:48:46Z

Unfortunately I do not think this will help much with the cost issue. We're still issuing a KV write to update the Bloom filter, and this is by far the most expensive bill line item:

Workers requests: $45.50
Read operations: $25.50
Write operations: $50.00
Storage: $1.50
Workers Bundled monthly fee: $5.00

TibiNonEst · 2023-09-13T14:50:44Z

It's been quite a while since I wrote this but tbh I treated it more as a learning exercise than anything. Looking back at the code now, all this would do is increase the amount of KV writes overall. Where a bloom filter would be useful is if you could store it in less expensive storage that can act as a cache control layer before the more expensive KV. Of course, doing that here would pretty much entirely defeat the purpose of caching in the KV because the latency of accessing the Bloom filter would be slower than a KV read.

TibiNonEst · 2023-09-13T14:53:20Z

The Bloom filter I implemented was pretty much just acting as a "seen before" filter to exclude writes for any one-off requests, but doing so includes a KV write itself 🙃

Cherry · 2024-03-03T19:05:16Z

Thanks folks. I'm going to close this for now, as costs are less of a concern. If there's any general performance wins that can be done from implementing something like this, I'd be open to it.

TibiNonEst added 2 commits February 21, 2022 11:43

Update dependencies

9c17eec

Having a package-lock.json and yarn-lock.json is redundant and can be confusing to deployments so I just regenerated the lock file for a "clean" start.

Add Bloom filter to KV reads/writes

88e2dfc

TibiNonEst changed the title ~~Bloom filter~~ Add Bloom filter to KV reads/writes Feb 21, 2022

TibiNonEst added 2 commits February 21, 2022 18:54

Bug fixes + housekeeping

d3611bd

- Removed infinite loop - Bloom filter expires midnight UTC everyday

Move xxHash to WASM module + bugflixes

4d6af4a

TibiNonEst added 5 commits February 22, 2022 11:12

Move KV direct methods into their own class

1251cfc

I wish I could have these inherit from the same interface or abstract class but TypeScript doesn't let you define static methods in those.

Move Bloom filter into separate file

89c74fe

Dynamically calculate m and k

ed7d980

Return Set from getIndexes()

6f14c7b

Reduces writes by removing writing to the same KV key twice

Remove redundant KV get requests

056c33e

TibiNonEst marked this pull request as ready for review February 28, 2022 04:22

Cherry closed this Mar 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Bloom filter to KV reads/writes #43

Add Bloom filter to KV reads/writes #43

TibiNonEst commented Feb 21, 2022

TibiNonEst commented Feb 21, 2022

astei commented Feb 21, 2022

TibiNonEst commented Feb 21, 2022

TibiNonEst commented Feb 22, 2022

TibiNonEst commented Feb 22, 2022 •

edited

Loading

TibiNonEst commented Feb 23, 2022

TibiNonEst commented Feb 28, 2022

astei commented Mar 3, 2022

astei commented Sep 13, 2023

astei commented Sep 13, 2023

TibiNonEst commented Sep 13, 2023

TibiNonEst commented Sep 13, 2023

Cherry commented Mar 3, 2024

Add Bloom filter to KV reads/writes #43

Add Bloom filter to KV reads/writes #43

Conversation

TibiNonEst commented Feb 21, 2022

TibiNonEst commented Feb 21, 2022

astei commented Feb 21, 2022

TibiNonEst commented Feb 21, 2022

TibiNonEst commented Feb 22, 2022

TibiNonEst commented Feb 22, 2022 • edited Loading

TibiNonEst commented Feb 23, 2022

TibiNonEst commented Feb 28, 2022

astei commented Mar 3, 2022

astei commented Sep 13, 2023

astei commented Sep 13, 2023

TibiNonEst commented Sep 13, 2023

TibiNonEst commented Sep 13, 2023

Cherry commented Mar 3, 2024

TibiNonEst commented Feb 22, 2022 •

edited

Loading