
Add debugging and telemetry stats #532

Open
bensheldon opened this issue Feb 27, 2022 · 16 comments
@bensheldon
Owner

Have been getting GH Issues recently that would benefit from some basic stats and system info. It should be easy for someone to dump them to help debug issues.

Imagining a method like GoodJob.debug to be run from a console, with the following fields to start:

  • GoodJob version
  • Ruby Version
  • Rails Version
  • Rails Env
  • Execution mode
  • Queues
  • Polling Interval
  • Notifier listening?
  • Processes table (consider: maybe this is all data that should be going into the Processes table)

Seems like it would also be useful to have some counters on a Scheduler.stats too:

  • Successfully executed jobs
  • Failed executed jobs
  • Job query misses (e.g. queried for a job but there were no executable jobs)
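
Roughly, as a sketch of what such a debug dump might return (the GoodJob.debug method doesn't exist yet, and the configuration accessors below are guesses at where the data lives, not a settled API):

```ruby
# Hypothetical sketch only; not an existing GoodJob method, and the
# configuration/notifier accessors are assumptions for illustration.
module GoodJob
  def self.debug
    {
      good_job_version: GoodJob::VERSION,
      ruby_version: RUBY_VERSION,
      rails_version: Rails.version,
      rails_env: Rails.env,
      execution_mode: GoodJob.configuration.execution_mode,
      queues: GoodJob.configuration.queue_string,
      poll_interval: GoodJob.configuration.poll_interval,
      notifier_listening: GoodJob::Notifier.instances.any?(&:listening?),
      process_count: GoodJob::Process.count,
    }
  end
end
```
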
@ckdake
Contributor

ckdake commented Apr 3, 2023

The ability to configure some logging output here would go a long way. We're running in AWS as an ECS Service: log lines once an hour that showed data like either successful_jobs_counter=1234 or successful_jobs_since_last_logline=123 would let us use CloudWatch to build some of the metrics that we'd find handy.

Alternatively, a "log job run metrics" option that lets each job spit some metrics out to stdout, like name=Job1, exit_code=1, queue_latency=1234ms, execution_duration=123ms, would be great too.
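
(Not something good_job does today, but for the sake of discussion, something close to that log line can already come from a plain ActiveJob around_perform callback in application code; the field names below just mirror the made-up ones above.)

```ruby
require "time"

# Sketch: one structured log line per job execution, written to stdout.
# Uses only standard ActiveJob callbacks; nothing GoodJob-specific.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    # enqueued_at is an ISO8601 string in older Rails versions and a Time in newer ones
    enqueued_at = job.enqueued_at
    enqueued_at = Time.iso8601(enqueued_at) if enqueued_at.is_a?(String)
    queue_latency_ms = enqueued_at ? ((Time.current - enqueued_at) * 1000).round : nil

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    exit_code = 0
    begin
      block.call
    rescue StandardError
      exit_code = 1
      raise
    ensure
      duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
      $stdout.puts "name=#{job.class.name} exit_code=#{exit_code} " \
                   "queue_latency=#{queue_latency_ms}ms execution_duration=#{duration_ms}ms"
    end
  end
end
```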

@bensheldon
Owner Author

Thanks for opening up #911. That is really helpful!

Also, I feel bad because now that I'm seeing what I wrote in this issue I'm like "hmm, what was I thinking?" 😓

Here are some thoughts that I want to bounce off of you. Sorry for the stream of consciousness:

  • Should these be incrementers or gauges? If they're gauges then there needs to be a mechanism to "flush/reset" the values every interval.
  • What should be triggering the stats dump? Should this be invoked from the command line, or should it be logged by GoodJob itself? If it's by GoodJob itself, should that log come from somewhere that already exists (for example, it gets logged every time the Scheduler tries to pick up a job), or would a developer overload an existing mechanism, e.g. an ActiveJob extension, or cron (which would need to take a Proc; currently it only allows jobs), or some new self-contained, threaded TimerSet object that would wake up to log stats? I don't feel very strongly either way, but each has its tradeoffs.
  • Then there is the content of the info being dumped:
    • Would this only be intended for processes that have GoodJob and are executing jobs (e.g. the CLI or async web processes), or any process that has the GoodJob constant?
    • Schedulers are the smallest unit to track. A process that is executing jobs could have multiple schedulers. Though maybe the recently created Capsule is the correct unit to track; a capsule can have multiple schedulers, and a process could have multiple capsules if someone was initializing them manually. Maybe the Capsule is the thing that has stats on it, and multiple schedulers increment a shared Stats object owned by the Capsule. (I'm imagining in this case the stats are like "number of jobs finished/errored/discarded/retried since creation"; a rough sketch follows this list.)
    • Scheduler#stats has some good stuff. I wonder if that would be a thing to extend, and also inserted into the Process.current_state. That would definitely surface a lot more details... though the Process state doesn't get flushed to the database currently except at startup.
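
To sketch that shared-Stats-object idea (the class name, counter names, and API are made up; only the shape matters: a process-local, thread-safe counter holder that schedulers increment and something else snapshots):

```ruby
require "concurrent"

# Hypothetical sketch of a stats object a Capsule could own and that
# multiple Schedulers could increment; not part of GoodJob's API.
module GoodJob
  class CapsuleStats
    COUNTERS = %i[finished errored discarded retried empty_queries].freeze

    def initialize
      @created_at = Time.now
      @counters = COUNTERS.to_h { |name| [name, Concurrent::AtomicFixnum.new(0)] }
    end

    # Called from scheduler threads, e.g. stats.increment(:finished)
    def increment(name)
      @counters.fetch(name).increment
    end

    # Point-in-time snapshot for logging, the Process record, or an HTTP endpoint.
    def to_h
      @counters.transform_values(&:value).merge(since: @created_at)
    end
  end
end
```
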

@ckdake
Contributor

ckdake commented Apr 4, 2023

Telemetry is tricky and a lot to get right. There's getting the metrics right, and then exposing them. Here's a stream of consciousness back for you.

A lot of good work has been done by other people on other queue tools, and "export metrics to Prometheus" is a common pattern. https://www.rabbitmq.com/monitoring.html#rabbitmq-metrics is a great example. It's probably not something to copy word-for-word, but good_job seems to basically have the "Cluster" and "Node" and "Queue" pieces that rabbitmq does.

From the rabbitmq docs:

Most of the metrics represent point-in-time absolute values. Some represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are most useful when compared to their previous values and historical mean/percentile values.

I'd expect "current" metrics to be available from the processes within GoodJob (including both counters and gauges), but I would want to have "Write metrics all the time to the database" turned off to keep my database traffic down. (I know you have some other open issues about performance in large/high-volume installations of good_job -> I'd avoid making that worse with metrics)

Once you've got the metrics figured out, there are a couple of ways to expose them

  1. Exporting to a prometheus compatible endpoint that prometheus can poll would probably make a lot of people happy. When something polls that endpoint, it would calculate what it needs to (e.g. doing the 'flush' as needed) and return the results.
  2. Supporting writing the metrics on a schedule, via a cron (maybe as a job itself?), to stdout would make an overlapping set of people happy. This would need to be configurable, and because time is hard, including a timestamp with the metrics would be important, and the exact moment the log line is written, less so. It wouldn't be great for these to get bogged down in situations with high load/throughput/queue-pressure, so maybe not "as a job".

Scheduler#stats and Process.current_state both look interesting. I'm mainly interested in queue depth and latency for "dashboard" level graphs, and some of those more detailed things would be helpful for diagnosing problems.

As I mentioned, we're running CloudWatch off of stdout logs, so if GoodJob goes the prometheus route I'd have to wire up something to get from that endpoint into CloudWatch.

The easiest 'v0' starting point may be to wire up the existing metrics to a prometheus compatible http endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?
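
For illustration, a prometheus-compatible endpoint is just a Rack app that renders the text exposition format, something like this (the metric name and the query behind it are guesses, not existing good_job code):

```ruby
# Sketch: a minimal Rack app exposing one per-queue gauge in the Prometheus
# text exposition format. The query is an illustrative guess at "queued".
class GoodJobMetricsApp
  def call(_env)
    queued_by_queue = GoodJob::Job
      .where(finished_at: nil, performed_at: nil)
      .group(:queue_name)
      .count

    body = +"# TYPE good_job_queued_jobs gauge\n"
    queued_by_queue.each do |queue_name, count|
      body << %(good_job_queued_jobs{queue="#{queue_name}"} #{count}\n)
    end

    [200, { "Content-Type" => "text/plain; version=0.0.4" }, [body]]
  end
end
```

Something like that could be mounted wherever a Rack app can run, with or without the GoodJob web UI.
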

@bensheldon
Owner Author

The easiest 'v0' starting point may be to wire up the existing metrics to a prometheus compatible http endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?

I'd prefer instead to build the minimum-satisficeable solution that solves your needs. People who need Prometheus can ask for it and be involved in that design.

I'm imagining that the data that you would need for cloudwatch would be the same metrics for a Prometheus endpoint, so I think the steps here would be:

  1. Identify the metrics.
  2. Figure out the implementation for metrics.
    • If they require storage or are heavy to fetch (e.g. making database calls), I might ask whether there is an alternative.
    • I'm imagining they would need to be process-local, and the metrics system would need to sum/aggregate them to generate the cluster view.
  3. Then identify how to do the transport (e.g. logging, emitting, fetching).

@ckdake
Contributor

ckdake commented Apr 12, 2023

Yep, that all sounds perfectly reasonable. The metrics I'm interested in are essentially/currently the cluster-wide metrics that rabbitmq exposes here: https://www.rabbitmq.com/monitoring.html#cluster-wide-metrics We're using a single queue, but once we add more queues the "Individual Queue Metrics" will be interesting.

From poking around the good_job source, most of these could probably each just be a single SQL query (maybe with some index changes?) that is run when .metrics() is called, and for me the ability to schedule (every 5 minutes?) the contents of .metrics() getting logged to stdout would do the trick.
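
For example, the first couple could look something like this (a sketch only; the .metrics method is hypothetical and the queries are unoptimized guesses built from the good_jobs columns):

```ruby
# Hypothetical sketch of per-queue counts; not GoodJob API.
module GoodJob
  def self.metrics
    base = GoodJob::Job.group(:queue_name)
    {
      # "waiting to execute": unfinished, not yet started, and past their scheduled time
      queued: base.where(finished_at: nil, performed_at: nil)
                  .where("COALESCE(scheduled_at, created_at) <= NOW()")
                  .count,
      # monotonically increasing total of finished jobs per queue
      processed: base.where.not(finished_at: nil).count,
    }
  end
end
```

Dumping that hash to stdout every five minutes (from a job, cron, or a separate thread) would then be the transport piece.
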

@bensheldon
Owner Author

bensheldon commented Apr 21, 2023

@ckdake could you pick 3 metrics from that list to start with? That would help me scope it down to an end-to-end implementation.

Also, some idle musings on this topic:

To give an example:

[Screenshot: 2023-04-21 at 8:29 AM]

Here's one of my apps that has:

  • 2 processes.
  • Each process runs 2 threadpools, each with:
    • 2 threads to pull from the whenever queue (high latency)
    • 2 threads to pull from any queue except whenever (low latency)

So that means I have total:

  • whenever: 4 threads
  • everything else: 4 threads? I think that's the complicatedness I'm wondering about.

@ckdake
Contributor

ckdake commented Apr 21, 2023

A great start would be letting me answer "how many items have been added to this queue?" "how many items are in this queue?" and "how many items have been processed in this queue?". A helpful next question would be "how long does it take to process each item in this queue?" and "are there any stuck jobs?" (count of individual jobs "older" than some threshold).

@bensheldon
Owner Author

Thanks. I think that's 5 😆 And apologies for being super pedantic here, but I want to surface the design questions to work through 😄

  • "how many items are in this queue?" This seems simplest to break down. I assume "in this queue" is specifically "waiting to execute", and would exclude future-scheduled jobs. Is that right?
  • "how many items have been added to this queue?" What would be the (default) time-period for this query? e.g. "how many items have been added to this in [in the last N minutes; since T]?"
    • Would future-scheduled jobs be considered "added" at their scheduled time, or when created?
    • In terms of "items" there is a small difference between jobs and retries/executions, because a job might be initially enqueued and execute on one queue, then retry and execute multiple more times on another queue. From the operational/scheduler perspective, it doesn't matter (they're all items to execute) but that might not be the mental model of the person enqueueing them. e.g. items in queues >= jobs enqueued. Would you need to distinguish between "Jobs enqueued" and "jobs+retries enqueued?"
  • "how many items have been processed in this queue?" Similar to the previous query.
  • "how long does it take to process each item in this queue?"
    • total_latency = queuing_latency + runtime_latency. Do you mean total_latency, or the breakdown? (Queuing latency is useful for identifying a lack of capacity; total latency is useful for calculating capacity for meeting SLOs.)
    • I assume you mean this as a calculation across multiple items. So an average, or p95, or max?
    • Same question of previous query, of "over what time period?"
    • Something that has bedeviled me on latency is how to handle null values. e.g. if a job is in queue but has not started executing, should I exclude the record from the calculation, or should I calculate it as COALESCE(performed_at, NOW()) - scheduled_at?
  • "are there any stuck jobs?" I think this is interesting/complicated.
    • I wonder if the previous question would be equivalently alertable e.g. "What is the MAX runtime latency?" and then you could set an alert threshold in your tool.

@ckdake
Contributor

ckdake commented Apr 25, 2023

It's a FIFO queue, so the first 3 would work :)

  1. for "how many in the queue". We are not (currently) using good_job to schedule jobs to happen at a particular time. It is the "waiting to execute" that will be most interesting there for starters for me. (gauge metric)
  2. I'm expecting "how many added" to be a total counter metric, increasing as each job is added. Not coupled to a time period. Given that we're processing 100s of jobs and not 1000000000s of jobs, this is fine for our use case.
    • Future scheduled jobs would be entirely separate, and only counted here when they are queued for execution.
    • "retries" is a whole additional level of metrics. I would at some point like to know error/retry rates, but thats probably round 2
  3. I'm expecting "how many processed" to be a total counter too.
  4. "latency"-> good question, I'm interested in all of the pieces, a good first version would be "time from when job is enqueued for execution until job is completed". Most of our jobs are super quick and are instrumented already with logs for job started -> job finished, so the piece good_job knows is how long they were queued before starting. mean,p95,max would all be helpful, and this would need a time range. if that time range approximately matched the polling interval of the monitoring tool, that might be ideal. Many tools use a very short polling interval (prometheus can do continual scraping, and requires special consideration over 2 minutes to avoid gaps in graphs), but I'd be good with a 5min rolling window. I'd expect the lantency to be for "completed jobs", so for you null example, that wouldn't be included in this metric. If the queue is piling up and not executing, I'd expect that to land in the first two metrics as a visible trend.
  5. Stuck. Yep, complicated to get right, but alerting on the max individual latency would do it. I'm not sure there's a great design to surface ALL latencies for ALL jobs from good_job to a monitoring tool, but surfacing a max_latency_in_past_X_timeframe from good_job could work.
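
For the latency piece, as a sketch of that scope (completed jobs only, trailing 5-minute window, per queue; Postgres-specific and not existing good_job code; column names come from the good_jobs table):

```ruby
# Sketch: queue latency stats for jobs that finished in the last 5 minutes.
# Postgres-only (percentile_cont); not an existing GoodJob query.
latency = "performed_at - COALESCE(scheduled_at, created_at)"
sql = <<~SQL
  SELECT queue_name,
         AVG(#{latency})                                          AS mean_queue_latency,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY #{latency}) AS p95_queue_latency,
         MAX(#{latency})                                          AS max_queue_latency
  FROM good_jobs
  WHERE performed_at IS NOT NULL
    AND finished_at >= NOW() - INTERVAL '5 minutes'
  GROUP BY queue_name
SQL
GoodJob::Job.connection.select_all(sql).to_a
```

The "stuck" alert in item 5 would need the complementary query over unfinished jobs, e.g. MAX(NOW() - COALESCE(scheduled_at, created_at)) where finished_at IS NULL.
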

@jtannas

jtannas commented May 18, 2023

Not sure if this is much help, but I just glued this together from the scopes on GoodJob::Job to give a summary count for the jobs on each queue.

WITH cte_statuses AS (
  SELECT
    CASE
      WHEN finished_at IS NOT NULL
        THEN (CASE WHEN error IS NULL THEN 'succeeded' ELSE 'discarded' END)
      WHEN COALESCE(scheduled_at, created_at) <= NOW()
        THEN (CASE WHEN pg_locks.locktype IS NULL THEN 'queued' ELSE 'running' END)
      WHEN (serialized_params ->> 'executions')::integer > 1
        THEN 'retried'
      ELSE
        'scheduled'
    END AS status,
    queue_name,
    priority
  FROM good_jobs
    LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory'
    AND pg_locks.objsubid = 1
    AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(32)::int
    AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(64) << 32)::bit(32)::int
  WHERE retried_good_job_id IS NULL
)
SELECT status_list.status, queues.queue_name, COUNT(cte_statuses.status)
FROM 
  (VALUES ('succeeded'), ('discarded'), ('queued'), ('running'), ('retried'), ('scheduled')) AS status_list(status)
  CROSS JOIN (SELECT DISTINCT queue_name FROM good_jobs) AS queues
  LEFT JOIN cte_statuses
    ON cte_statuses.status = status_list.status
    -- also match on queue_name so the counts are per-queue rather than repeated globally
    AND cte_statuses.queue_name = queues.queue_name
GROUP BY
  status_list.status,
  queues.queue_name;

The results get fed into metrics in our monitoring tool for pattern analysis, reporting, alerting, etc...
That plus the duration information we can parse out of the ActiveJob logs gets us what we need.

@bensheldon
Owner Author

@jtannas fyi, rubygems.org does an interesting thing using the GoodJob::JobsFilter's queries: https://github.com/rubygems/rubygems.org/blob/63f6ba818aaa21138a82c7d57da8380d261c5193/app/jobs/good_job_statsd_job.rb#L4-L10

I want to share it, but also warn that I still consider those private objects, so they're subject to change (write a test!) 😅

@AndersGM
Contributor

AndersGM commented Jul 7, 2023

@bensheldon It would be pretty useful to keep track of how many times a given job has been executed and, even more, to count retries and errors on a per-job basis. #984 added overall counts. What do you think about incorporating this into metrics?

@shouichi
Contributor

I think we can use ActiveJob's instrumentation events for some metrics. For example, I believe execution time can be measured via perform.active_job. Because GoodJob is designed for ActiveJob, maybe they can be utilized?

https://guides.rubyonrails.org/active_support_instrumentation.html#active-job.
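
For example (standard ActiveSupport::Notifications API; where the measurement goes, whether a logger, StatsD, or something else, is up to the application):

```ruby
# Sketch: record execution time per job class from ActiveJob's built-in
# "perform.active_job" instrumentation event.
ActiveSupport::Notifications.subscribe("perform.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.payload[:job]
  Rails.logger.info(
    "job_class=#{job.class.name} queue=#{job.queue_name} duration_ms=#{event.duration.round(1)}"
  )
end
```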

By the way, I wanted to expose metrics to Prometheus but realized that it's not so straightforward. Metrics such as execution time should be measured per process, so they can be scraped by Prometheus. But metrics such as queue length are global. It's strange to return those metrics from the individual process. An exporter may be needed for those metrics.

jklina added a commit to jklina/good_job that referenced this issue Sep 17, 2023
This removes the health check logic from the ProbeServer and renames the
ProbeServer to UtilityServer that accepts any Rack based app.

The health check and catchall logic are moved into simple Rack middleware
that can be composed by users however they like and be used to preserve
existing health check behavior while transitioning to a more general
purpose utility server.

All in all, this pattern will allow users to add whatever functionality
they like to GoodJob's web server by composing Rack apps and using
GoodJob's configuration to pass in users' Rack apps, i.e.:

```
config.good_job.middleware = Rack::Builder.app do
  use GoodJob::Middleware::MyCustomMiddleware
  use GoodJob::Middleware::PrometheusExporter
  use GoodJob::Middleware::Healthcheck
  run GoodJob::Middleware::CatchAll
end
config.good_job.middleware_port = 7001
```

This could help resolve:

* bensheldon#750
* bensheldon#532
@bagedevimo

Just dropping in to add that we're also investigating options for reporting job latency & queued count, per queue, to CloudWatch in the interest of alerting and autoscaling. For our use case, the latency of the last started job in the queue is fine.

For a little context, we run a few different queues (critical, high, default, deferrable) and we really want to know if our critical queue is getting behind by more than a few minutes, and we'd like for our deferrable queue to just be permanently growing, if that makes sense.


@simi
Contributor

simi commented Nov 25, 2023

@thatsmydoing

💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.

Just to note, this isn't in the latest release version of prometheus_exporter (2.0.8).

bensheldon added a commit that referenced this issue Jan 23, 2024
* Make health probe server more general purpose

This removes the health check logic from the ProbeServer and renames the
ProbeServer to UtilityServer that accepts any Rack based app.

The health check and catchall logic are moved into simple Rack middleware
that can be composed by users however they like and be used to preserve
existing health check behavior while transitioning to a more general
purpose utility server.

All in all, this pattern will allow users to add whatever functionality
they like to GoodJob's web server by composing Rack apps and using
GoodJob's configuration to pass in users' Rack apps, i.e.:

```
config.good_job.middleware = Rack::Builder.app do
  use GoodJob::Middleware::MyCustomMiddleware
  use GoodJob::Middleware::PrometheusExporter
  use GoodJob::Middleware::Healthcheck
  run GoodJob::Middleware::CatchAll
end
config.good_job.middleware_port = 7001
```

This could help resolve:

* #750
* #532

* Use new API

* Revert server name change

We decided that leaving the original ProbeServer name better sets
expectations. See:

#1079 (review)

This also splits out middleware testing into separate specs.

* Restore original naming

This also helps ensure that the existing behavior and API remain intact.

* Appease linters

* Add required message for mock

* Make test description relevant

* Allow for handler to be injected into ProbeServer

* Add WEBrick handler

* Add WEBrick as a development dependency

* Add WEBrick tests and configuration

* Add idle_timeout method to mock

* Namespace server handlers

* Warn and fallback when WEBrick isn't loadable

Since the probe server has the option to use WEBrick as a server
handler, but this library doesn't have WEBrick as a dependency,
we want to throw a warning when WEBrick is configured, but not in the
load path. This will also gracefully fallback to the built in HTTP
server.

* inspect load path

* Account for multiple webrick entries in $LOAD_PATH

* try removing load path test

* For error on require to initiate test

As opposed to manipulating the load path.

* Handle explicit nils in initialization

* Allow probe_handler to be set in configuration

* Add documentation for probe server customization

* Appease linter

* retrigger CI

* Rename `probe_server_app` to `probe_app`; make handler name a symbol; rename Rack middleware/app for clarity

* Update readme to have relevant app example

* Fix readme grammar

---------

Co-authored-by: Ben Sheldon [he/him] <[email protected]>