okcomputer: Sidekiq worker count check #2216
base: main
Conversation
Force-pushed from 8b55361 to b9779bf, then from b9779bf to c49af28.
Rebased on the latest main.
Curious why you're checking thread counts (instead of just process counts)?
Some combination of: it was pretty easy to get at that info (both the actual and expected thread counts); it seemed potentially useful at a glance for infra FR and ops folks; and we've seen occasional Sidekiq worker issues, but not often enough to have a great idea of the most common misbehavior modes, so I figured a little more info is better for triaging alerts. I was also thinking that the Sidekiq API is pretty stable and well documented, and this check is unlikely to need changes in the near future, so even though checking thread counts adds a bit of complexity over just checking process counts, it didn't seem like complexity we'd have to pay much for going forward. And if, say, Sidekiq 8 or 9 breaks this, we can simplify it if fixing doesn't seem worth the effort.
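For illustration only, here is a minimal sketch of pulling those numbers out of the Sidekiq API, assuming Sidekiq 7's `Sidekiq::ProcessSet`. This is not the code in the PR, and the expected values (from the deployed configuration) are not shown:

```ruby
# Rough sketch (not the PR's actual code): read the running Sidekiq process
# count and the total worker thread count via the public Sidekiq API. Each
# ProcessSet entry reports the concurrency (thread count) it was started with.
require 'sidekiq/api'

process_set   = Sidekiq::ProcessSet.new
process_count = process_set.size
thread_count  = process_set.sum { |process| process['concurrency'] }

puts "#{process_count} Sidekiq process(es) reporting #{thread_count} worker thread(s)"
```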
My concern is that this is just making work for our future selves to troubleshoot. Do we have any indication that there is value in monitoring this, or that Sidekiq does not repair any damaged threads? Do we know this won't set off false alarms? How will our future selves know why we were monitoring threads in the first place if it breaks? Might be worth getting some additional opinions on this.
Backburnered this since it was reviewed. In the year or so this change has hung out, we haven't seen any issues with Sidekiq processes running the wrong number of worker threads compared to what's configured. We did run into an issue in hydra_etd along the lines of what Justin worried about above. After discussing the hydra_etd issue in Slack with Aaron and Laura, I was convinced to simplify the hydra_etd check to just look for the expected process count (and to give a bit of helpful guidance about when to worry if there's an alert for too many workers). Next unscheduled week, I'll circle back to this PR and simplify it in the same way as https://github.com/sul-dlss/hydra_etd/pull/1689
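To make the planned simplification concrete, here is a hypothetical sketch of a process-count-only check; it is not the actual hydra_etd code, and the expected count passed to the initializer stands in for however that value would really be configured:

```ruby
require 'okcomputer'
require 'sidekiq/api'

# Hypothetical sketch of the simplified, process-count-only check. The failure
# message hints that one extra process can be benign right after a deploy,
# while a long-running job finishes on the old process.
class SidekiqProcessCountCheck < OkComputer::Check
  def initialize(expected_processes)
    @expected_processes = expected_processes
  end

  def check
    actual = Sidekiq::ProcessSet.new.size
    return mark_message("#{actual} Sidekiq process(es) running, as expected") if actual == @expected_processes

    mark_failure
    mark_message "expected #{@expected_processes} Sidekiq process(es), found #{actual} " \
                 '(an extra process may be benign briefly during a deploy while a long-running job finishes)'
  end
end
```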
Why was this change made? 🤔
Fail the health check if a worker box has an unexpected number of Sidekiq worker processes or threads.
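As a rough illustration of that idea (a minimal sketch, not the code in this PR), an OkComputer custom check could compare both counts; the expected numbers below are hypothetical placeholders for whatever the deployed Sidekiq configuration specifies:

```ruby
require 'okcomputer'
require 'sidekiq/api'

# Minimal sketch of a check comparing the Sidekiq process count and the total
# worker thread count (the per-process concurrency reported via the Sidekiq
# API) against expected values. The expected numbers are hypothetical
# placeholders; a real check would derive them from the deployed configuration.
class SidekiqWorkerCountCheck < OkComputer::Check
  EXPECTED_PROCESSES = 2           # hypothetical placeholder
  EXPECTED_THREADS_PER_PROCESS = 5 # hypothetical placeholder

  def check
    processes = Sidekiq::ProcessSet.new
    actual_processes = processes.size
    actual_threads = processes.sum { |process| process['concurrency'] }
    expected_threads = EXPECTED_PROCESSES * EXPECTED_THREADS_PER_PROCESS

    if actual_processes == EXPECTED_PROCESSES && actual_threads == expected_threads
      mark_message "#{actual_processes} Sidekiq process(es) with #{actual_threads} worker thread(s), as expected"
    else
      mark_failure
      mark_message "expected #{EXPECTED_PROCESSES} process(es) / #{expected_threads} thread(s), " \
                   "found #{actual_processes} / #{actual_threads}"
    end
  end
end

# Registration would presumably be conditional on the host actually running
# Sidekiq workers (per the QA testing notes below); shown unconditionally here.
OkComputer::Registry.register 'sidekiq_worker_count', SidekiqWorkerCountCheck.new
```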
How was this change tested? 🤨
After rebasing on the latest main (to pick up the upgrade to Sidekiq 7), I deployed this branch to QA and tested the following scenarios:
- web host (doesn't have workers, so doesn't run this new check)
- worker VM:
  - normal operation (exactly the expected number of management processes and worker threads)
  - some but not all worker management processes running
  - no worker management processes running
  - more than the expected number of worker management processes running; this might happen when a long-running job spans a deployment, though in this case I faked it by starting an extra process manually on the VM. Happy to kick off a long-running CV job and re-deploy for a more realistic test; I'd just wanted to avoid messing with the cocina-models release testing that was in progress at the time, since infra integration tests are very sensitive to delays.
⚡ ⚠ If this change has cross-service impact, or if it changes code used internally for cloud replication, run the integration test preassembly_image_accessioning_spec.rb against stage (it tests preservation), and/or test in the stage environment, in addition to specs. The main classes relevant to replication are ZipmakerJob, DeliveryDispatcherJob, *DeliveryJob, ResultsRecorderJob, and DruidVersionZip; see here for an overview diagram of the replication pipeline. ⚡