[Experimental] Worker based expiration strategy. #447

Draft · wants to merge 1 commit into base: main

Conversation

@camilo (Contributor) commented Mar 30, 2020

One of the things that makes IdentityCache (IDC) hard to run at scale is how much load it can
reflect onto the database during high-load events; most of that load comes from over-fetching
during misses while the application using IDC is enduring a high-throughput event.

A possible solution is to change our cache population model. Currently IDC tries to find
an object (and its embedded associations) in memcached; if the object is not in the cache, IDC
looks the object up in the database. This simple mental model seems good enough for most things,
but it can cause a lot of fetches from multiple workers during a high-throughput event. The
problem can also be seen in memcached itself: IDC uses `cas` to avoid writing stale data, and
during high-throughput events the `cas` calls can become quite slow [I'll find some sample data
to back this up if this goes anywhere beyond a WIP patch; not meant to be merged].
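
To make the current model concrete, here is a minimal, self-contained sketch of that fetch-through flow, using in-memory hashes to stand in for memcached and the database; all names here are illustrative assumptions, not IDC's real internals.

```ruby
require "json"

# Toy stand-ins for memcached and the database.
CACHE = {}
DATABASE = { 1 => { "id" => 1, "name" => "widget" } }

def fetch(id)
  key = "IDC:blob:Record:#{id}" # key format made up for illustration
  blob = CACHE[key]
  return JSON.parse(blob) if blob

  # Cache miss: every worker that misses goes to the database, which is
  # where the over-fetching during high-throughput events comes from.
  record = DATABASE[id]

  # Real IDC uses memcached's `cas` (check-and-set) here so racing
  # workers don't overwrite each other with stale data; under high
  # throughput those cas round-trips themselves become slow.
  CACHE[key] = JSON.generate(record)
  record
end

fetch(1) # miss: reads the DB and fills the cache
fetch(1) # hit: served straight from the cache
```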

Changing our cache population model is a big refactor, so I think we can do it in two steps.

  1. Change the cache expiration model
  2. Change the cache population model

This PR is a prototype to see what a worker-based expiration implementation looks like.

The idea is the following:

  1. Make the expiration strategy configurable (see the sketch after this list)
  2. In the worker-based expiration model, simply do nothing to expire blobs inline
  3. Implement a worker process that is meant to be long-lived and receive binlog-type events
  4. Test how fast Ruby can process incoming events; if it turns out to be too slow to keep
     up with high-throughput events, try different concurrency models (pre-fork, threads,
     evented (?)), implement the worker in native code, or implement the worker in a
     completely different language and port the blob -> key mapping logic from Ruby to
     that language
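
As a rough sketch of points 1 and 2, the expiration strategy could be a small object that IDC delegates to; the class and method names below are assumptions about a possible shape, not this PR's actual API.

```ruby
module IdentityCache
  # Today's behaviour: expire blobs inline, in the writing process.
  class InlineExpirationStrategy
    def expire(record)
      record.expire_cache
    end
  end

  # Worker-based model: do nothing inline; a long-lived worker consuming
  # binlog-type events is responsible for expiring the affected blobs.
  class WorkerExpirationStrategy
    def expire(_record)
      # intentionally a no-op
    end
  end

  class << self
    attr_writer :expiration_strategy

    def expiration_strategy
      @expiration_strategy ||= InlineExpirationStrategy.new
    end
  end
end

# e.g. in an initializer:
IdentityCache.expiration_strategy = IdentityCache::WorkerExpirationStrategy.new
```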

@camilo (Contributor Author) commented Mar 30, 2020

cc @ignacio-chiazzo

```ruby
# This is meant to run a long-lived process along these lines:
# parse args from the command line / env, then call
#
# IdentityCache::BinlogExpirationWorker.new(*args).run
```
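
To make that comment concrete, here is a rough skeleton of how such a long-lived worker might be shaped; the event source and the blob -> key mapping are hypothetical placeholders, not code from this PR.

```ruby
module IdentityCache
  class BinlogExpirationWorker
    def initialize(event_source, cache)
      @event_source = event_source # yields binlog-type row events
      @cache = cache               # e.g. a memcached client
    end

    # Long-lived loop: consume row events and expire the affected blobs.
    def run
      @event_source.each_event do |event|
        cache_keys_for(event).each { |key| @cache.delete(key) }
      end
    end

    private

    # Map a changed row (table + primary key) to the cache keys it
    # invalidates; this is the blob -> key logic from point 4 of the PR
    # description. The key format here is made up for illustration.
    def cache_keys_for(event)
      ["IDC:blob:#{event.table}:#{event.primary_key}"]
    end
  end
end
```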
@dylanahsmith (Contributor) commented:

In a previous hack day project, @pushrax and I explored IdentityCache binlog-based expiration. It consisted of two parts: one to create expiration rules from the cached associations (https://github.com/Shopify/shopify/compare/identity_cache_expiry_rules) and a binlog reader that used those rules to invalidate the cache (https://github.com/Shopify/binlog-cache-invalidator).
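
Roughly, an expiration rule maps a table's row images to the cache keys that embed them. A hypothetical shape (the actual format in those linked projects may differ):

```ruby
# Hypothetical shape of an expiration rule, for illustration only.
ExpirationRule = Struct.new(:table, :key_template) do
  # Given a binlog row image (a hash of column => value), produce the
  # cache key to invalidate.
  def cache_key(row)
    format(key_template, id: row.fetch("id"))
  end
end

rule = ExpirationRule.new("products", "IDC:blob:Product:%<id>s")
rule.cache_key({ "id" => 42 }) # => "IDC:blob:Product:42"
```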

@camilo (Contributor Author) replied:

oh cool, thanks @dylanahsmith! @ignacio-chiazzo and I might take a stab at a second experiment using CDC instead of raw binlogs, but I think the code/principles can be adapted.

A contributor commented:

If CDC is more likely to be delayed (because the message needs to flow from the binlog to Kafka and then to the consumer), wouldn't we benefit more from getting closer to the metal and consuming the raw binlog directly?

@insom commented Apr 2, 2020:

Did someone say CDC? 😁 Anyway, we have an SLO [REDACTED] -- TL;DR: the Kafka+Sieve parts add around 500ms on top of the MySQL replication lag, and the MySQL replication lag is by far the single biggest contributor to the delay. We intend to make CDC read from writers to remove the MySQL replication lag, but we're not there yet.

That's the steady state; if there were an incident or a shop move, then (for a given shop's data) you would see delayed records. That said, we're solving a bunch of hard problems for you, so doing things "raw" means having to keep state on the MySQL schema and (depending on your service) shop moves, etc.

@camilo (Contributor Author) replied:

> If CDC is more likely to be delayed (because the message needs to flow from the binlog to Kafka and then to the consumer), wouldn't we benefit more from getting closer to the metal and consuming the raw binlog directly?

Why is that a problem? I think the fundamental problem with IDC is the assumption of freshness.
