Add debugging and telemetry stats #532
The ability to configure some logging output here would go a long way. We're running in AWS as an ECS Service, so log lines once an hour showing that kind of data would help. Alternatively, a "log job run metrics" option that let each job emit some metrics to stdout would also do the trick.
Thanks for opening up #911. That is really helpful! Also, I feel bad because now that I'm seeing what I wrote in this issue I'm like "hmm, what was I thinking?" 😓 Here are some thoughts that I want to bounce off of you. Sorry for the stream of consciousness:
Telemetry is tricky and a lot to get right. There's getting the metrics right, and then there's exposing them. Here's a stream of consciousness back for you. A lot of good work has been done by other people on other queue tools, and "export metrics to Prometheus" is a common pattern. https://www.rabbitmq.com/monitoring.html#rabbitmq-metrics is a great example. It's probably not a word-for-word copy, but good_job seems to basically have the "Cluster", "Node", and "Queue" pieces that RabbitMQ does. From the RabbitMQ docs:
I'd expect "current" metrics to be available from the processes within GoodJob (including both counters and gauges), but I would want "write metrics to the database all the time" turned off to keep my database traffic down. (I know you have some other open issues about performance in large/high-volume installations of good_job; I'd avoid making that worse with metrics.) Once you've got the metrics figured out, there are a couple of ways to expose them:
Scheduler#stats and Process.current_state both look interesting. I'm mainly interested in queue depth and latency for "dashboard"-level graphs, and some of those more detailed things would be helpful for diagnosing problems. As I mentioned, we're running CloudWatch off of stdout logs, so if GoodJob goes the Prometheus route I'd have to wire up something to get from that endpoint into CloudWatch. The easiest "v0" starting point may be to wire up the existing metrics to a Prometheus-compatible HTTP endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?
I'd prefer instead to build the minimum-satisficeable solution that solves your needs. People who need Prometheus can ask for it and be involved in that design. I'm imagining that the data you would need for CloudWatch would be the same metrics as for a Prometheus endpoint, so I think the steps here would be:
Yep, that all sounds perfectly reasonable. The metrics I'm interested in are essentially/currently the cluster-wide metrics that RabbitMQ exposes here: https://www.rabbitmq.com/monitoring.html#cluster-wide-metrics. We're using a single queue, but once we add more queues the "Individual Queue Metrics" will be interesting. From poking around the good_job source, most of these could probably each be a single SQL query (maybe with some index changes?) that is run when .metrics() is called, and for me, the ability to schedule (every 5 minutes?) the contents of .metrics() getting logged to stdout would do the trick.
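A minimal sketch of that "schedule it and log to stdout" idea, assuming GoodJob's cron feature is enabled; `MetricsLoggerJob` and the log format are hypothetical, and the query reads the `good_jobs` table directly rather than any official `.metrics()` API:

```ruby
# config/initializers/good_job.rb -- assumes GoodJob's cron feature
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    queue_metrics: {
      cron: "*/5 * * * *",      # every 5 minutes
      class: "MetricsLoggerJob" # hypothetical job defined below
    }
  }
end

# app/jobs/metrics_logger_job.rb
class MetricsLoggerJob < ApplicationJob
  def perform
    # Count unfinished jobs straight from the good_jobs table; column names
    # match the SQL shared elsewhere in this thread.
    pending = ActiveRecord::Base.connection.select_value(
      "SELECT COUNT(*) FROM good_jobs WHERE finished_at IS NULL"
    )
    $stdout.puts({ good_job_pending: pending, at: Time.current.iso8601 }.to_json)
  end
end
```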
@ckdake could you pick 3 metrics from that list to start with? That would help me scope it down to an end-to-end implementation. Also, some idle musings on this topic:
To give an example, here's one of my apps, which has:
So in total that means I have:
A great start would be letting me answer "how many items have been added to this queue?", "how many items are in this queue?", and "how many items have been processed in this queue?". A helpful next step would be "how long does it take to process each item in this queue?" and "are there any stuck jobs?" (a count of individual jobs "older" than some threshold).
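For reference, those first three questions map onto fairly simple counts over the `good_jobs` table (columns as used in the SQL shared later in this thread); a rough sketch, not an existing GoodJob API:

```ruby
# Per-queue counts: "added" = every row ever enqueued, "depth" = not yet
# finished, "processed" = finished (with or without an error recorded).
# Assumes job records are preserved rather than deleted on completion.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT queue_name,
         COUNT(*)                                        AS added,
         COUNT(*) FILTER (WHERE finished_at IS NULL)     AS depth,
         COUNT(*) FILTER (WHERE finished_at IS NOT NULL) AS processed
  FROM good_jobs
  GROUP BY queue_name
SQL
rows.each { |row| puts row.inspect }
```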
Thanks. I think that's 5 😆 And apologies for being super pedantic here, but I want to surface the design questions to work through 😄
It's a FIFO queue, so the first 3 would work :)
Not sure if this is much help, but I just glued this together from the scopes on the good_job models:

```sql
WITH cte_statuses AS (
  SELECT
    CASE
      WHEN finished_at IS NOT NULL
        THEN (CASE WHEN error IS NULL THEN 'succeeded' ELSE 'discarded' END)
      WHEN COALESCE(scheduled_at, created_at) <= NOW()
        THEN (CASE WHEN pg_locks.locktype IS NULL THEN 'queued' ELSE 'running' END)
      WHEN (serialized_params ->> 'executions')::integer > 1
        THEN 'retried'
      ELSE
        'scheduled'
    END AS status,
    queue_name,
    priority
  FROM good_jobs
  LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory'
    AND pg_locks.objsubid = 1
    AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(32)::int
    AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(64) << 32)::bit(32)::int
  WHERE retried_good_job_id IS NULL
)
SELECT status_list.status, queues.queue_name, COUNT(cte_statuses.status)
FROM
  (VALUES ('succeeded'), ('discarded'), ('queued'), ('running'), ('retried'), ('scheduled')) AS status_list(status)
  CROSS JOIN (SELECT DISTINCT queue_name FROM good_jobs) AS queues
  LEFT JOIN cte_statuses ON cte_statuses.status = status_list.status
    AND cte_statuses.queue_name = queues.queue_name  -- also match on queue so counts are per queue
GROUP BY
  status_list.status,
  queues.queue_name;
```

The results get fed into metrics in our monitoring tool for pattern analysis, reporting, alerting, etc.
@jtannas FYI, I want to share it, but I'll also warn that I still consider those private objects, so they're subject to change; write a test 😅
@bensheldon It would be pretty useful to keep track of how many times a given job has been executed and, even more, to count retries and errors on a per-job basis. #984 added overall counts. What do you think about incorporating this into metrics?
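Per-job-class retry and error counts can at least be approximated from data already in the table (the `serialized_params ->> 'executions'` field used in the SQL above); a sketch under that assumption:

```ruby
# Approximate per-job-class stats from the serialized ActiveJob payload.
# 'executions' > 1 is treated as "has been retried", matching the SQL above;
# this only sees rows still present in good_jobs.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT serialized_params ->> 'job_class' AS job_class,
         COUNT(*) AS jobs,
         COUNT(*) FILTER (WHERE (serialized_params ->> 'executions')::integer > 1) AS retried,
         COUNT(*) FILTER (WHERE error IS NOT NULL) AS errored
  FROM good_jobs
  GROUP BY serialized_params ->> 'job_class'
SQL
rows.each { |row| puts row.inspect }
```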
I think we can use ActiveJob's instrumentation events for some metrics. For example, I believe execution time can be measured via https://guides.rubyonrails.org/active_support_instrumentation.html#active-job. By the way, I wanted to expose metrics to Prometheus but realized that it's not so straightforward. Metrics such as execution time should be measured per process so they can be scraped by Prometheus, but metrics such as queue length are global; it's strange to return those from an individual process. A separate exporter may be needed for those metrics.
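A sketch of that instrumentation idea, using the standard `perform.active_job` event; the log format and field names here are just placeholders:

```ruby
# config/initializers/job_metrics.rb
# Subscribes to Active Job's built-in "perform.active_job" notification and
# emits one structured log line per execution. Where the numbers end up
# (statsd, a Prometheus client, plain stdout) is up to the application.
ActiveSupport::Notifications.subscribe("perform.active_job") do |event|
  job = event.payload[:job]
  Rails.logger.info(
    {
      event: "job_performed",
      job_class: job.class.name,
      queue_name: job.queue_name,
      duration_ms: event.duration.round(1),
      error: event.payload[:exception_object]&.class&.name
    }.to_json
  )
end
```

Per-process numbers like execution time fall out of this naturally; the global, database-backed metrics (queue depth, latency) would still need a single exporter or a scheduled query, as noted above.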
This removes the health check logic from the ProbeServer and renames the ProbeServer to UtilityServer, which accepts any Rack-based app. The health check and catchall logic are moved into simple Rack middleware that users can compose however they like, and that can be used to preserve existing health check behavior while transitioning to a more general-purpose utility server. All in all, this pattern will allow users to add whatever functionality they like to GoodJob's web server by composing Rack apps and using GoodJob's configuration to pass in their own Rack apps, e.g.:

```
config.good_job.middleware = Rack::Builder.app do
  use GoodJob::Middleware::MyCustomMiddleware
  use GoodJob::Middleware::PrometheusExporter
  use GoodJob::Middleware::Healthcheck
  run GoodJob::Middleware::CatchAll
end

config.good_job.middleware_port = 7001
```

This could help resolve:
* bensheldon#750
* bensheldon#532
Just dropping in to add that we're also investigating options for reporting job latency and queued count, per queue, to CloudWatch in the interest of alerting and autoscaling. For our use case, the latency of the last started job in the queue is fine. For a little context, we run a few different queues (critical, high, default, deferrable), and we really want to know if our critical queue is getting behind by more than a few minutes, whereas we're fine with our deferrable queue just permanently growing, if that makes sense.
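For the "is the critical queue getting behind" case, one rough signal is the age of the oldest runnable-but-unfinished job per queue (columns as in the SQL above); a sketch, not an existing GoodJob API:

```ruby
# Oldest runnable, unfinished job per queue, in seconds. A large value on
# "critical" would be the alerting/autoscaling signal; "deferrable" can be
# given a much higher threshold or ignored entirely.
latencies = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT queue_name,
         EXTRACT(EPOCH FROM (NOW() - MIN(COALESCE(scheduled_at, created_at)))) AS latency_seconds
  FROM good_jobs
  WHERE finished_at IS NULL
    AND COALESCE(scheduled_at, created_at) <= NOW()
  GROUP BY queue_name
SQL
latencies.each { |row| puts "#{row['queue_name']}: #{row['latency_seconds'].to_f.round}s" }
```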
💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.
Just to note, this isn't in the latest release version of |
* Make health probe server more general purpose

  This removes the health check logic from the ProbeServer and renames the ProbeServer to UtilityServer, which accepts any Rack-based app. The health check and catchall logic are moved into simple Rack middleware that users can compose however they like, and that can be used to preserve existing health check behavior while transitioning to a more general-purpose utility server. All in all, this pattern will allow users to add whatever functionality they like to GoodJob's web server by composing Rack apps and using GoodJob's configuration to pass in their own Rack apps, e.g.:

  ```
  config.good_job.middleware = Rack::Builder.app do
    use GoodJob::Middleware::MyCustomMiddleware
    use GoodJob::Middleware::PrometheusExporter
    use GoodJob::Middleware::Healthcheck
    run GoodJob::Middleware::CatchAll
  end

  config.good_job.middleware_port = 7001
  ```

  This could help resolve: #750, #532

* Use new API
* Revert server name change

  We decided that leaving the original ProbeServer name better sets expectations. See: #1079 (review). This also splits out middleware testing into separate specs.

* Restore original naming

  This also helps ensure that the existing behavior and API remain intact.

* Appease linters
* Add required message for mock
* Make test description relevant
* Allow for handler to be injected into ProbeServer
* Add WEBrick handler
* Add WEBrick as a development dependency
* Add WEBrick tests and configuration
* Add idle_timeout method to mock
* Namespace server handlers
* Warn and fall back when WEBrick isn't loadable

  Since the probe server has the option to use WEBrick as a server handler, but this library doesn't have WEBrick as a dependency, we want to throw a warning when WEBrick is configured but not in the load path. This will also gracefully fall back to the built-in HTTP server.

* inspect load path
* Account for multiple webrick entries in $LOAD_PATH
* try removing load path test
* Force error on require to initiate test (as opposed to manipulating the load path)
* Handle explicit nils in initialization
* Allow probe_handler to be set in configuration
* Add documentation for probe server customization
* Appease linter
* retrigger CI
* Rename `probe_server_app` to `probe_app`; make handler name a symbol; rename Rack middleware/app for clarity
* Update readme to have relevant app example
* Fix readme grammar

Co-authored-by: Ben Sheldon [he/him] <[email protected]>
Have been getting GH Issues recently that would benefit from some basic stats and system info. It should be easy for someone to dump them to help debug issues.
Imagining a method like
GoodJob.debug
to be run from a console, imagining the following fields to start. Seems like it would also be useful to have some counters on a
Scheduler.stats
too.
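For what it's worth, a purely hypothetical sketch of the kind of dump such a method could produce; none of this is an existing GoodJob API, and the fields are only a guess at what would be useful:

```ruby
# Hypothetical console helper -- every field here is illustrative, not an
# existing GoodJob method.
module GoodJobDebug
  def self.call
    {
      ruby_version: RUBY_VERSION,
      rails_version: Rails.version,
      good_job_version: GoodJob::VERSION,
      jobs_total: ActiveRecord::Base.connection.select_value(
        "SELECT COUNT(*) FROM good_jobs"
      ),
      jobs_unfinished: ActiveRecord::Base.connection.select_value(
        "SELECT COUNT(*) FROM good_jobs WHERE finished_at IS NULL"
      )
    }
  end
end

pp GoodJobDebug.call
```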