[Experimental] Worker based expiration strategy. #447
One of the things that makes IdentityCache (IDC) hard to run at scale is how much load it can reflect onto the database during high-load events. Most of the load comes from over-fetching during cache misses while the application using IDC is enduring a high-throughput event. A possible solution for this is to change our cache population model. Currently IDC will try to find an object (and its embedded associations) in memcached; if the object is not in the cache, IDC will try to find the objects in the database. This simple mental model seems good enough for most things; however, it can cause lots of redundant fetches from multiple workers during a high-throughput event. The problem can also be seen in memcached itself: IDC uses `cas` to avoid writing stale data, and during high-throughput events the `cas` calls can become quite slow [I'll find some sample data to back this up if this goes anywhere beyond a WIP patch; not meant to be merged].

Changing our cache population model is a big refactor, so I consider we can do this in two steps:

1. Change the cache expiration model
2. Change the cache population model

This PR is/may be a prototype to see what a worker-based expiration implementation looks like. The idea is the following:

1. Make the expiration strategy configurable
2. In the worker-based expiration model, simply do _nothing_ to expire blobs inline (a rough sketch of 1 and 2 follows below)
3. Implement a worker process that is meant to be long-lived and receive binlog-type events
4. Test how fast Ruby can process incoming events; if it turns out to be too slow to keep up with high-throughput events, try different concurrency models (pre-fork, threads, evented?), implement the worker in native code, or implement the worker in a completely different language and port the blob -> key logic from Ruby to that language
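To make points 1 and 2 concrete, here is a minimal sketch of what a pluggable strategy could look like. None of these names (`expiration_strategy`, `InlineExpirationStrategy`, `WorkerExpirationStrategy`) exist in IDC today; this is an assumption about the shape of the API, not part of this patch:

```ruby
# Hypothetical sketch; none of these names exist in IdentityCache today.
module IdentityCache
  class InlineExpirationStrategy
    # Current behaviour: delete the blob from memcached right away.
    def expire(cache_key)
      IdentityCache.cache.delete(cache_key)
    end
  end

  class WorkerExpirationStrategy
    # Worker-based model: do nothing inline; a separate long-lived
    # binlog-consuming process is responsible for expiry.
    def expire(_cache_key)
      # no-op
    end
  end

  class << self
    attr_writer :expiration_strategy

    def expiration_strategy
      @expiration_strategy ||= InlineExpirationStrategy.new
    end
  end
end

# e.g. in an initializer:
# IdentityCache.expiration_strategy = IdentityCache::WorkerExpirationStrategy.new
```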
```ruby
# This is meant to run a long lived process along these lines.
# Parse args from the command line / env, then call:
#
#   IdentityCache::BinlogExpirationWorker.new(*args).run
```
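For point 3, the run loop might look something like this. The event source, the event shape, and the `cache_keys_for` mapping are all assumptions for illustration; this patch only stubs out the entry point above:

```ruby
# Hypothetical sketch of the worker's run loop; the event source and
# the blob -> key mapping are assumptions, not part of this patch.
module IdentityCache
  class BinlogExpirationWorker
    def initialize(event_source)
      # event_source: anything that yields row-change events
      # (database, table, primary key) from the binlog or a CDC stream.
      @event_source = event_source
    end

    def run
      @event_source.each_event do |event|
        # Map a row-change event to the memcached key(s) IDC would
        # have written for that record, then expire them.
        cache_keys_for(event).each do |key|
          IdentityCache.cache.delete(key)
        end
      end
    end

    private

    # Placeholder: a real implementation would reuse IDC's own
    # key-generation logic per cached model and embedded association.
    def cache_keys_for(_event)
      []
    end
  end
end
```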
In a previous hack day project, @pushrax and I explored IdentityCache binlog-based expiration. It consisted of two parts: one to create expiration rules from the cached associations (https://github.com/Shopify/shopify/compare/identity_cache_expiry_rules) and a binlog reader that used those rules to invalidate the cache (https://github.com/Shopify/binlog-cache-invalidator).
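For readers without access to those links, an expiration rule might conceptually look something like the following. This shape is purely illustrative (the real rule format lives in the linked projects, and real IDC keys also encode a cache version and schema hash):

```ruby
# Purely illustrative rule shape: "when a row in `table` changes,
# delete the cache key built from these columns".
ExpirationRule = Struct.new(:table, :key_template, :columns) do
  def cache_key_for(row)
    format(key_template, *row.values_at(*columns))
  end
end

rule = ExpirationRule.new(
  "products",
  "IDC:blob:Product:%s", # simplified; real IDC keys encode more than the id
  [:id]
)
rule.cache_key_for(id: 42) # => "IDC:blob:Product:42"
```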
Oh cool, thanks @dylanahsmith! @ignacio-chiazzo and I might take a stab at a second experiment using CDC instead of raw logs, but I think the code/principles can be adapted.
If CDC is more likely to be delayed (because the message needs to flow from the binlog to Kafka and then to the consumer), wouldn't we get more benefit from getting closer to the metal and consuming from the raw binlog?
Did someone say CDC? 😁 Anyway, we have an SLO [REDACTED] -- TL;DR: the Kafka+Sieve parts add around 500ms on top of the MySQL replication lag, and the replication lag is by far the single biggest contributor to the delay. We intend to make CDC read from writers to remove the MySQL replication lag, but we're not there yet.

That's the steady state; if there were an incident or a shop move, then (for a given shop's data) you would see delayed records. That said, we're solving a bunch of hard problems for you, so doing things "raw" means having to keep state on the MySQL schema and (depending on your service) shop moves, etc.
> If CDC is more likely to be delayed (because the message needs to flow from the binlog to Kafka and then to the consumer), wouldn't we get more benefit from getting closer to the metal and consuming from the raw binlog?
Why is that a problem? I think the fundamental problem with IDC is the assumption of freshness.