Add Bloom filter to KV reads/writes #43
Having both a package-lock.json and a yarn.lock is redundant and can confuse deployments, so I regenerated the lock file for a "clean" start.
Doing bugfixes. Currently, it's throwing
We already have a WASM module in this project. Why not wire up a small function call for it? There's even a bunch of fast hashes implemented as a Rust crate, including xxHash.
Hmm, that may make more sense. I'm currently using xxHash through JS bindings; moving it into the WASM module might fix this.
- Removed infinite loop
- Bloom filter expires at midnight UTC every day
That fixed it!

This implementation increases the number of KV reads by a lot, so it will probably still need some optimization. KV writes, however, now only happen once a request has made it past the cache twice in the same UTC day, which the Bloom filter tracks. The writes might need more optimization too, but the fewer KV writes we make, the more often we request directly from Mojang and the more likely Mojang is to return 429s.

Currently, the Bloom filter only allocates 10 rows of the KV for its filter. THIS WAS FOR TESTING ONLY. The final production value will depend on how many requests we get per day and the desired error rate. I may be able to make this dynamic, but it isn't yet. This will need to be changed before merging. Aside from that, 3 hashes seems like a good number, although I don't know what the tradeoffs of raising or lowering it are.
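On the question of what raising or lowering the hash count does: a rough way to see the tradeoff is the textbook false positive approximation for a Bloom filter, p ≈ (1 − e^(−kn/m))^k. The sketch below is not taken from this PR. It assumes the filter behaves like `m` independent slots holding `n` items with `k` hashes each, and the function names and the numbers plugged in (10 slots, 5 items) are purely illustrative.

```typescript
// Textbook approximation of a Bloom filter's false positive rate.
// m: number of slots (bits), n: items inserted, k: hash functions per item.
// Assumption: hashes are uniform and independent, which real hashes only approximate.
function falsePositiveRate(m: number, n: number, k: number): number {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

// The (non-integer) hash count that minimizes the rate for a given m and n.
function optimalHashCount(m: number, n: number): number {
  return (m / n) * Math.LN2;
}

// Illustrative numbers only: 10 slots, 5 items in a day.
for (const k of [1, 2, 3, 4]) {
  console.log(`k=${k}`, falsePositiveRate(10, 5, k).toFixed(3));
}
console.log("optimal k ≈", optimalHashCount(10, 5).toFixed(2));
```

The takeaway under these assumptions is that more hashes only help while the filter is sparse. Once the filter is small relative to the number of items, each extra hash sets more slots and pushes the false positive rate up, and every extra hash is also one more slot to read on each lookup.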
Also, "requests" are currently counted separately for username lookups and profile lookups, just because of how the KV integration was initially written. I'll probably change this to cut KV writes in half unless there's some reason I shouldn't. Edit: this may not be realistic, since profile and username lookups are stored separately in the KV and profile lookups sometimes happen without username lookups.
I wish I could have these inherit from the same interface or abstract class but TypeScript doesn't let you define static methods in those.
Reduces writes by no longer writing to the same KV key twice
I don't know how much more optimized this can get. I think the only thing left to determine is how we are going to allocate rows and hashes for the Bloom filter.

I did make it dynamic: it takes a desired false positive tolerance and the expected number of items in the Bloom filter, and from those calculates the number of KV pairs needed and how many hashes to apply to each. For testing, I set the false positive tolerance to 5% (0.05) and the expected number of items to 5, but these should be changed for production use. The expected number of items will probably just be the average number of items added to the Bloom filter in a given day (because it expires every day at UTC midnight). What I don't know, however, is how we want to set this value. Currently it's hardcoded, but maybe there's a way of querying it from Cloudflare's API.

The only other thing I can think of is improving how it checks whether or not a Bloom filter exists in the KV. Currently, it checks
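For readers following along, sizing a Bloom filter from a false positive tolerance and an expected item count is usually done with the standard formulas m = ⌈−n·ln(p) / (ln 2)²⌉ and k = round((m/n)·ln 2). The sketch below uses those formulas; it is not the code from this PR. `bloomParams` and `BloomParams` are hypothetical names, and `slots` counts abstract filter positions, since the conversation does not show how the PR maps them onto KV pairs.

```typescript
// Hypothetical result shape: how many slots the filter needs and how many hashes to use.
interface BloomParams {
  slots: number;
  hashes: number;
}

// Standard Bloom filter sizing from expected item count (n) and false positive tolerance (p).
function bloomParams(expectedItems: number, falsePositiveRate: number): BloomParams {
  if (expectedItems <= 0 || falsePositiveRate <= 0 || falsePositiveRate >= 1) {
    throw new RangeError("expectedItems must be > 0 and falsePositiveRate must be in (0, 1)");
  }
  // m = ceil(-n * ln(p) / (ln 2)^2)
  const slots = Math.ceil((-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2));
  // k = round((m / n) * ln 2), with at least one hash
  const hashes = Math.max(1, Math.round((slots / expectedItems) * Math.LN2));
  return { slots, hashes };
}

// The testing values mentioned above: 5 expected items at a 5% tolerance.
console.log(bloomParams(5, 0.05));
```

With the testing values from the comment above (5 items, 0.05 tolerance), these formulas give 32 slots and 4 hashes. A production value for `expectedItems` would still have to come from real traffic numbers.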
I don't really know what else can be added to improve the performance of this; it seems fairly efficient and gets the job done. I don't know how many more requests to Mojang this will make (and how many subsequent 429 error responses), but my assumption is that it will be a lot. This shouldn't be merged until the values for
I'm going to hold off on merging this for now. Apparently we're under the limit where Cloudflare charges for KV writes. I will have to keep an eye on this.
I'm back to this, then. I received a very large bill from Cloudflare last month.
Unfortunately I do not think this will help much with the cost issue. We're still issuing a KV write to update the Bloom filter, and this is by far the most expensive bill line item:
It's been quite a while since I wrote this, but tbh I treated it more as a learning exercise than anything. Looking back at the code now, all this would do is increase the number of KV writes overall. A Bloom filter would be useful if you could store it in less expensive storage that acts as a cache control layer in front of the more expensive KV. Of course, doing that here would pretty much defeat the purpose of caching in the KV, because accessing the Bloom filter would be slower than a KV read.
The Bloom filter I implemented was pretty much just acting as a "seen before" filter to exclude writes for any one-off requests, but doing so includes a KV write itself 🙃
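To make the "seen before" idea and its cost concrete, here is a minimal in-memory sketch. It is an illustration under assumptions, not the PR's code: the filter lives in a `Uint8Array` rather than in KV, `fnv1a` is a simple stand-in for the xxHash mentioned earlier, and `SeenBeforeFilter`, `seenBefore`, and `filterWrites` are hypothetical names. The counter marks where a KV-backed version would have to persist the filter.

```typescript
// 32-bit FNV-1a, used only as a stand-in for xxHash in this sketch.
function fnv1a(input: string, seed: number): number {
  let hash = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class SeenBeforeFilter {
  private readonly bits: Uint8Array;
  private readonly slots: number;
  private readonly hashes: number;
  // Counts filter updates; in a KV-backed filter each one would be a KV write.
  filterWrites = 0;

  constructor(slots: number, hashes: number) {
    this.slots = slots;
    this.hashes = hashes;
    this.bits = new Uint8Array(slots);
  }

  // Derives `hashes` positions from two base hashes (double hashing).
  private positions(key: string): number[] {
    const h1 = fnv1a(key, 0);
    const h2 = (fnv1a(key, 0x9e3779b9) | 1) >>> 0; // force an odd, non-negative step
    const result: number[] = [];
    for (let i = 0; i < this.hashes; i++) {
      result.push((h1 + i * h2) % this.slots);
    }
    return result;
  }

  // Returns true if the key was probably seen before; otherwise records it and returns false.
  seenBefore(key: string): boolean {
    const positions = this.positions(key);
    if (positions.every((p) => this.bits[p] === 1)) return true;
    for (const p of positions) this.bits[p] = 1;
    this.filterWrites++;
    return false;
  }
}

const filter = new SeenBeforeFilter(1024, 3);
console.log(filter.seenBefore("Notch")); // false: first request, filter updated instead of the cache
console.log(filter.seenBefore("Notch")); // true: second request would now be written to the cache
console.log(filter.filterWrites); // 1
```

Every first-time key still costs one filter update, which is the point made above: if the filter itself lives in KV, a one-off request trades a cache write for a filter write rather than avoiding a write altogether.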
Thanks, folks. I'm going to close this for now, as costs are less of a concern. If there are any general performance wins to be had from implementing something like this, I'd be open to it.
Closes #27
Currently, this throws a `RangeError: Maximum call stack size exceeded`, but I'm heading out and don't have time to debug it until later. Also, expiration needs to be fixed: currently it is set to the time of the write plus 24 hours, so when we update a value in the Bloom filter, that entry no longer expires at the same time as the rest.
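One way to keep every filter entry on the same schedule is to expire at a fixed wall-clock boundary instead of "write time + 24 hours". The sketch below computes the next UTC midnight as seconds since the Unix epoch, which is the unit Workers KV's `put` accepts in its `expiration` option. `nextUtcMidnightSeconds` is a hypothetical helper name, and the `FILTER_KV` binding in the trailing comment is an assumption, not something from this repository.

```typescript
// Returns the first UTC midnight strictly after `now`, in seconds since the Unix epoch.
function nextUtcMidnightSeconds(now: Date = new Date()): number {
  // Date.UTC normalizes day overflow, so "day + 1" rolls over months and years correctly.
  const midnightMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.floor(midnightMs / 1000);
}

console.log(nextUtcMidnightSeconds(new Date("2021-12-31T23:59:59Z")));

// Possible use with a KV namespace binding (FILTER_KV is a hypothetical name):
// await FILTER_KV.put(key, value, { expiration: nextUtcMidnightSeconds() });
```

Because the timestamp is absolute, rewriting an entry later in the day leaves its expiry unchanged. It is worth checking how KV treats expirations only a few seconds away, since a write just before midnight would produce a very short lifetime.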