okcomputer: Sidekiq worker count check #2216

Open · wants to merge 3 commits into main

Conversation

@jmartin-sul (Member) commented on Mar 31, 2023:

Why was this change made? 🤔

Fail the health check if a worker box has an unexpected number of Sidekiq worker processes or threads.
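
For context, a minimal sketch of what a check like this could look like using OkComputer's custom check API and Sidekiq::ProcessSet (the class name, registry key, and expected counts below are illustrative assumptions, not the PR's actual implementation):

```ruby
# config/initializers/okcomputer.rb (hypothetical location)
require 'sidekiq/api'
require 'socket'

class SidekiqWorkerCountCheck < OkComputer::Check
  EXPECTED_PROCESS_COUNT = 2       # illustrative values only
  EXPECTED_THREADS_PER_PROCESS = 5

  def check
    # only look at Sidekiq processes reporting in from this host
    local = Sidekiq::ProcessSet.new.select { |process| process['hostname'] == Socket.gethostname }
    expected_threads = EXPECTED_PROCESS_COUNT * EXPECTED_THREADS_PER_PROCESS
    actual_threads = local.sum { |process| process['concurrency'] }

    if local.size == EXPECTED_PROCESS_COUNT && actual_threads == expected_threads
      mark_message "#{local.size} Sidekiq processes / #{actual_threads} worker threads (as expected)"
    else
      mark_failure
      mark_message "expected #{EXPECTED_PROCESS_COUNT} processes / #{expected_threads} threads, " \
                   "found #{local.size} processes / #{actual_threads} threads"
    end
  end
end

OkComputer::Registry.register 'sidekiq-worker-count', SidekiqWorkerCountCheck.new
```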

How was this change tested? 🤨

After rebasing on the latest main (to pick up the upgrade to Sidekiq 7), I deployed this branch to QA and tested the following scenarios:

- web host (doesn't have workers, so doesn't run this new check)

  [screenshot: Screen Shot 2023-03-31 at 12 35 38 PM]

- worker VM

  - normal operation (exactly the expected number of management processes and worker threads)

    [screenshot: Screen Shot 2023-03-31 at 12 35 51 PM]

  - some but not all worker management processes running

    [screenshot: Screen Shot 2023-03-31 at 12 39 52 PM]

  - no worker management processes running

    [screenshot: Screen Shot 2023-03-31 at 12 40 28 PM]

  - more than the expected number of worker management processes running

    This might happen when a long-running job spans a deployment, though in this case I faked it by starting an extra process manually on the VM. Happy to kick off a long-running CV job and re-deploy for a more realistic test; I just wanted to avoid interfering with the cocina-models release testing that was in progress at the time, since infra integration tests are very sensitive to delays.

    [screenshot: Screen Shot 2023-03-31 at 12 44 00 PM]

⚡ ⚠ If this change has cross-service impact, or if it changes code used internally for cloud replication, run the integration test preassembly_image_accessioning_spec.rb against stage (it tests preservation), and/or test in the stage environment, in addition to specs. The main classes relevant to replication are ZipmakerJob, DeliveryDispatcherJob, *DeliveryJob, ResultsRecorderJob, and DruidVersionZip; see here for an overview diagram of the replication pipeline. ⚡

@jmartin-sul jmartin-sul changed the title [HOLD] okcomputer: Sidekiq worker count check okcomputer: Sidekiq worker count check Mar 31, 2023
@jmartin-sul jmartin-sul marked this pull request as ready for review March 31, 2023 23:32
@jmartin-sul (Member, Author) commented:

Rebased on latest main, and deployed the rebased branch to stage for some quick testing to make sure all was still working (the happy path works, and the check fails if I manually kill some Sidekiq processes).

@justinlittman (Contributor) left a comment:

Curious why you're checking thread counts (instead of just process counts)?

@jmartin-sul (Member, Author) commented:

> Curious why you're checking thread counts (instead of just process counts)?

Some combination of things: it was pretty easy to get at that info (both the actual and expected thread counts), it seemed potentially useful at a glance for infra FR and ops folks, and we've seen occasional Sidekiq worker issues, but not often enough to have a clear idea of what the most common misbehavior modes are.

So I figured (a little more) info == better for triaging alerts. I was also thinking that the Sidekiq API is pretty stable and well documented, and this check is unlikely to need changes in the near future, so even though checking thread counts adds a bit of complexity over just checking process counts, it didn't seem like complexity we'd have to pay much for going forward. And if, e.g., Sidekiq 8 or 9 breaks this, we can simplify then if fixing doesn't seem worth the effort.
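
(For reference, the per-process info mentioned above is straightforward to pull from the Sidekiq API; a quick console sketch, assuming a running Sidekiq deployment:)

```ruby
require 'sidekiq/api'

# each Sidekiq process reports its configured thread count ('concurrency')
# and how many of those threads are currently running jobs ('busy')
Sidekiq::ProcessSet.new.each do |process|
  puts "#{process['hostname']} pid=#{process['pid']} " \
       "concurrency=#{process['concurrency']} busy=#{process['busy']}"
end
```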

@justinlittman (Contributor) commented:

My concern is that this is just making work for our future selves to troubleshoot. Do we have any indication that there is value in monitoring this, or that Sidekiq does not repair damaged threads on its own? Do we know this won't set off false alarms? How will our future selves know why we were monitoring threads in the first place if it breaks?

Might be worth getting some additional opinions on this.

@jmartin-sul (Member, Author) commented:

Backburnered this after it was reviewed. In the year or so this change has been sitting, we haven't seen any issues with Sidekiq processes running the wrong number of worker threads compared to what's configured. We did, however, run into an issue in hydra_etd along the lines of what Justin worried about above. After discussing the hydra_etd issue in Slack with Aaron and Laura, I was convinced to simplify the hydra_etd check to just look for the expected process count (and to give a bit of helpful guidance about when to worry if there's an alert for too many workers).

Next unscheduled week, I'll circle back to this PR and simplify it in the same way as https://github.com/sul-dlss/hydra_etd/pull/1689.
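
Roughly, the simplified check would just compare the local Sidekiq process count against an expected value, along these lines (a sketch only; the class name and expected count are hypothetical, and this is not the code from the hydra_etd PR):

```ruby
require 'sidekiq/api'
require 'socket'

class SidekiqProcessCountCheck < OkComputer::Check
  EXPECTED_PROCESS_COUNT = 2 # hypothetical value

  def check
    actual = Sidekiq::ProcessSet.new.count { |process| process['hostname'] == Socket.gethostname }

    if actual == EXPECTED_PROCESS_COUNT
      mark_message "#{actual} Sidekiq processes running (as expected)"
    else
      mark_failure
      # note: more processes than expected can be benign, e.g. a long-running job spanning a deploy
      mark_message "expected #{EXPECTED_PROCESS_COUNT} Sidekiq processes, found #{actual}"
    end
  end
end
```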
