Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Bloom filter to KV reads/writes #43

Closed
wants to merge 9 commits into from

Conversation

TibiNonEst
Copy link
Contributor

Closes #27

Currently, this throws a RangeError: Maximum call stack size exceeded but I'm heading out and don't have time to debug it until later. Also, expiration needs to be fixed because currently, it takes the current time when a write happens + 24 hours, but if we update the value in the Bloom filter it won't expire at the same time.

Having a package-lock.json and yarn-lock.json is redundant and can be confusing to deployments so I just regenerated the lock file for a "clean" start.
@TibiNonEst TibiNonEst changed the title Bloom filter Add Bloom filter to KV reads/writes Feb 21, 2022
@TibiNonEst
Copy link
Contributor Author

Doing bugfixes, currently, it's throwing CompileError: WebAssembly.instantiate(): Wasm code generation disallowed by embedder. I believe this may be because of the hashing library I'm using which is WASM-based for speed reasons. Switching to the JS version will make hash calls on the Bloom filter significantly slower but this error message seems a bit cryptic to me.

@astei
Copy link
Collaborator

astei commented Feb 21, 2022

We already have a WASM module in this project. Why not wire up a small function call for it? There's even a bunch of fast hashes implemented as a Rust crate, including xxHash.

@TibiNonEst
Copy link
Contributor Author

Hmm that may make more sense. I am currently using xxHash but through JS bindings, moving it into the WASM module might fix this.

- Removed infinite loop
- Bloom filter expires midnight UTC everyday
@TibiNonEst
Copy link
Contributor Author

That fixed it! This implementation increases the KV read amounts by a lot so it will probably still require a little bit of optimization, but KV writes only go once a request made it past cache 2 times in the same day (UTC) by means of a Bloom filter. Honestly, the writes might require more optimization, however, the fewer KV writes we make the more likely Mojang will return 429s because we will be requesting directly from them more.

Currently, the Bloom filter only allocates 10 rows of the KV for its filter. THIS WAS FOR TESTING ONLY. The final production value will depend on how many requests we get a day and the desired error rate, I may be able to make this dynamic but it is not yet. This will need to be changed before merging. Aside from that, 3 hashes seems like a good amount, although I don't know what the tradeoffs are of raising or lowering that.

@TibiNonEst
Copy link
Contributor Author

TibiNonEst commented Feb 22, 2022

Also, "requests" currently are separate for username lookups and profile lookups just because of how the KV integration was initially written, I'll probably change this to reduce KV writes by a factor of 2 unless there's some reason I shouldn't.

Edit: this may not be realistic since profile and username lookups are stored separately in the KV and sometimes profile lookups will happen without username lookups

I wish I could have these inherit from the same interface or abstract class but TypeScript doesn't let you define static methods in those.
Reduces writes by removing writing to the same KV key twice
@TibiNonEst
Copy link
Contributor Author

I don't know how much more optimized this can get. I think the only thing left to determine is how we are going to allocate rows and hashes for the Bloom filter. I did make it dynamic, so it takes a desired false positive tolerance and the expected number of items in the Bloom filter to calculate the number of KV pairs needed and how many hashes to apply to each. For testing, I set the false positive tolerance to 5% (0.05) and the expected number of items to 5, but this should be changed for production use. The expected number of items is probably just going to be the average amount of items added to the Bloom filter in a given day (because it expires every day at UTC midnight). What I don't know, however, is how we want to set this value, currently, it's hardcoded but maybe there's a way of querying it from Cloudflare's API.

The only other thing I can think of is maybe improving how it checks whether or not a Bloom filter exists in the KV. Currently, it checks await KVDirect.get('bloom:0') === null and reallocates the KV rows needed if it doesn't exist, but I don't know if this is ideal.

@TibiNonEst TibiNonEst marked this pull request as ready for review February 28, 2022 04:22
@TibiNonEst
Copy link
Contributor Author

I don't really know what else can be added to improve the performance of this, it seems fairly efficient and gets the job done. I don't know how many more requests to Mojang this will make (and subsequent 429 error responses) but my assumption is it will be a lot. This shouldn't be merged until the values for epsilon and n are decided on, but the code should be good to go.

@astei
Copy link
Collaborator

astei commented Mar 3, 2022

I'm going to hold off on merging this for now. Apparently we're under the limit where Cloudflare charges for KV writes. I will have to keep an eye on this.

@astei
Copy link
Collaborator

astei commented Sep 13, 2023

I'm back to this, then. I received a very large bill from Cloudflare last month.

@astei
Copy link
Collaborator

astei commented Sep 13, 2023

Unfortunately I do not think this will help much with the cost issue. We're still issuing a KV write to update the Bloom filter, and this is by far the most expensive bill line item:

  • Workers requests: $45.50
  • Read operations: $25.50
  • Write operations: $50.00
  • Storage: $1.50
  • Workers Bundled monthly fee: $5.00

@TibiNonEst
Copy link
Contributor Author

It's been quite a while since I wrote this but tbh I treated it more as a learning exercise than anything. Looking back at the code now, all this would do is increase the amount of KV writes overall. Where a bloom filter would be useful is if you could store it in less expensive storage that can act as a cache control layer before the more expensive KV. Of course, doing that here would pretty much entirely defeat the purpose of caching in the KV because the latency of accessing the Bloom filter would be slower than a KV read.

@TibiNonEst
Copy link
Contributor Author

The Bloom filter I implemented was pretty much just acting as a "seen before" filter to exclude writes for any one-off requests, but doing so includes a KV write itself 🙃

@Cherry
Copy link
Member

Cherry commented Mar 3, 2024

Thanks folks. I'm going to close this for now, as costs are less of a concern. If there's any general performance wins that can be done from implementing something like this, I'd be open to it.

@Cherry Cherry closed this Mar 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bloom filters to avoid KV writes
3 participants