
Add debugging and telemetry stats #532

Open
bensheldon opened this issue Feb 27, 2022 · 16 comments
@bensheldon
Owner

Have been getting GH Issues recently that would benefit from some basic stats and system info. It should be easy for someone to dump them to help debug issues.

Imagining a method like GoodJob.debug to be run from a console, with the following fields to start:

  • GoodJob version
  • Ruby Version
  • Rails Version
  • Rails Env
  • Execution mode
  • Queues
  • Polling Interval
  • Notifier listening?
  • Processes table (consider: maybe this is all data that should be going into the Processes table)

Seems like it would also be useful to have some counters on a Scheduler.stats too:

  • Successfully executed jobs
  • Failed executed jobs
  • Job query misses (e.g. queried for a job but there were no executable jobs)
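
Roughly, as a sketch of what such a debug dump might return (the GoodJob.debug method doesn't exist yet, and the configuration accessors below are guesses at where the data lives, not a settled API):

```ruby
# Hypothetical sketch only; not an existing GoodJob method, and the
# configuration/notifier accessors are assumptions for illustration.
module GoodJob
  def self.debug
    {
      good_job_version: GoodJob::VERSION,
      ruby_version: RUBY_VERSION,
      rails_version: Rails.version,
      rails_env: Rails.env,
      execution_mode: GoodJob.configuration.execution_mode,
      queues: GoodJob.configuration.queue_string,
      poll_interval: GoodJob.configuration.poll_interval,
      notifier_listening: GoodJob::Notifier.instances.any?(&:listening?),
      process_count: GoodJob::Process.count,
    }
  end
end
```
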
@ckdake
Contributor

ckdake commented Apr 3, 2023

The ability to configure some logging output here would go a long way. We're running in AWS as an ECS Service: log lines once an hour that showed data like either successful_jobs_counter=1234 or successful_jobs_since_last_logline=123 would let us use CloudWatch to build some of the metrics that we'd find handy.

Alternatively, a "log job run metrics" option that lets each job spit some metrics out to stdout, like name=Job1, exit_code=1, queue_latency=1234ms, execution_duration=123ms, would be great too.
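
(Not something good_job does today, but for the sake of discussion, something close to that log line can already come from a plain ActiveJob around_perform callback in application code; the field names below just mirror the made-up ones above.)

```ruby
require "time"

# Sketch: one structured log line per job execution, written to stdout.
# Uses only standard ActiveJob callbacks; nothing GoodJob-specific.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    # enqueued_at is an ISO8601 string in older Rails versions and a Time in newer ones
    enqueued_at = job.enqueued_at
    enqueued_at = Time.iso8601(enqueued_at) if enqueued_at.is_a?(String)
    queue_latency_ms = enqueued_at ? ((Time.current - enqueued_at) * 1000).round : nil

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    exit_code = 0
    begin
      block.call
    rescue StandardError
      exit_code = 1
      raise
    ensure
      duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
      $stdout.puts "name=#{job.class.name} exit_code=#{exit_code} " \
                   "queue_latency=#{queue_latency_ms}ms execution_duration=#{duration_ms}ms"
    end
  end
end
```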

@bensheldon
Owner Author

Thanks for opening up #911. That is really helpful!

Also, I feel bad because now that I'm seeing what I wrote in this issue I'm like "hmm, what was I thinking?" 😓

Here are some thoughts that I want to bounce off of you. Sorry for the stream of consciousness:

  • Should these be incrementers or gauges? If they're gauges then there needs to be a mechanism to "flush/reset" the values every interval.
  • What should be triggering the stats dump? Should this be invoked from the command line, or should it be logged by GoodJob itself? If it's by GoodJob itself, should that log come from somewhere that already exists (for example, it gets logged every time the Scheduler tries to pick up a job), or would a developer overload an existing mechanism, e.g. an ActiveJob extension, or cron (which would need to take a Proc; currently it only allows jobs), or some new self-contained, threaded TimerSet object that would wake up to log stats? I don't feel very strongly either way, but each has its tradeoffs.
  • Then there is the content of the info being dumped:
    • Would this only be intended for processes that have GoodJob and are executing jobs (e.g. the CLI or async web processes), or any process that has the GoodJob constant?
    • Schedulers are the smallest unit to track. A process that is executing jobs could have multiple schedulers. Though maybe the recently created Capsule is the correct unit to track; a capsule can have multiple schedulers, and a process could have multiple capsules if someone was initializing them manually. Maybe the Capsule is the thing that has stats on it, and multiple schedulers increment a shared Stats object owned by the Capsule. (I'm imagining in this case the stats are like "number of jobs finished/errored/discarded/retried since creation"; a rough sketch follows this list.)
    • Scheduler#stats has some good stuff. I wonder if that would be a thing to extend, and also inserted into the Process.current_state. That would definitely surface a lot more details... though the Process state doesn't get flushed to the database currently except at startup.
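
To sketch that shared-Stats-object idea (the class name, counter names, and API are made up; only the shape matters: a process-local, thread-safe counter holder that schedulers increment and something else snapshots):

```ruby
require "concurrent"

# Hypothetical sketch of a stats object a Capsule could own and that
# multiple Schedulers could increment; not part of GoodJob's API.
module GoodJob
  class CapsuleStats
    COUNTERS = %i[finished errored discarded retried empty_queries].freeze

    def initialize
      @created_at = Time.now
      @counters = COUNTERS.to_h { |name| [name, Concurrent::AtomicFixnum.new(0)] }
    end

    # Called from scheduler threads, e.g. stats.increment(:finished)
    def increment(name)
      @counters.fetch(name).increment
    end

    # Point-in-time snapshot for logging, the Process record, or an HTTP endpoint.
    def to_h
      @counters.transform_values(&:value).merge(since: @created_at)
    end
  end
end
```
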

@ckdake
Contributor

ckdake commented Apr 4, 2023

Telemetry is tricky and a lot to get right. There's getting the metrics right, and then exposing them. Here's a stream of consciousness back for you.

A lot of good work has been done by other people on other queue tools, and "export metrics to Prometheus" is a common pattern. https://www.rabbitmq.com/monitoring.html#rabbitmq-metrics is a great example. It's probably not something to copy word-for-word, but good_job seems to basically have the "Cluster" and "Node" and "Queue" pieces that rabbitmq does.

From the rabbitmq docs:

Most of the metrics represent point-in-time absolute values. Some represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are most useful when compared to their previous values and historical mean/percentile values.

I'd expect "current" metrics to be available from the processes within GoodJob (including both counters and gauges), but I would want to have "Write metrics all the time to the database" turned off to keep my database traffic down. (I know you have some other open issues about performance in large/high-volume installations of good_job -> I'd avoid making that worse with metrics)

Once you've got the metrics figured out, there are a couple of ways to expose them

  1. Exporting to a prometheus compatible endpoint that prometheus can poll would probably make a lot of people happy. When something polls that endpoint, it would calculate what it needs to (e.g. doing the 'flush' as needed) and return the results.
  2. Supporting writing the metrics on a schedule, via a cron (maybe as a job itself?), to stdout would make an overlapping set of people happy. This would need to be configurable, and because time is hard, including a timestamp with the metrics would be important, and the exact moment the log line is written, less so. It wouldn't be great for these to get bogged down in situations with high load/throughput/queue-pressure, so maybe not "as a job".

Scheduler#stats and Process.current_state both look interesting. I'm mainly interested in queue depth and latency for "dashboard" level graphs, and some of those more detailed things would be helpful for diagnosing problems.

As I mentioned, we're running CloudWatch off of stdout logs, so if GoodJob goes the prometheus route I'd have to wire up something to get from that endpoint into CloudWatch.

The easiest 'v0' starting point may be to wire up the existing metrics to a prometheus compatible http endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?
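
For illustration, a prometheus-compatible endpoint is just a Rack app that renders the text exposition format, something like this (the metric name and the query behind it are guesses, not existing good_job code):

```ruby
# Sketch: a minimal Rack app exposing one per-queue gauge in the Prometheus
# text exposition format. The query is an illustrative guess at "queued".
class GoodJobMetricsApp
  def call(_env)
    queued_by_queue = GoodJob::Job
      .where(finished_at: nil, performed_at: nil)
      .group(:queue_name)
      .count

    body = +"# TYPE good_job_queued_jobs gauge\n"
    queued_by_queue.each do |queue_name, count|
      body << %(good_job_queued_jobs{queue="#{queue_name}"} #{count}\n)
    end

    [200, { "Content-Type" => "text/plain; version=0.0.4" }, [body]]
  end
end
```

Something like that could be mounted wherever a Rack app can run, with or without the GoodJob web UI.
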

@bensheldon
Owner Author

The easiest 'v0' starting point may be to wire up the existing metrics to a prometheus compatible http endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?

I'd prefer instead to build the minimum-satisficeable solution that solves your needs. People who need Prometheus can ask for it and be involved in that design.

I'm imagining that the data that you would need for cloudwatch would be the same metrics for a Prometheus endpoint, so I think the steps here would be:

  1. Identify the metrics.
  2. Figure out the implementation for metrics.
    • If they require storage or are heavy to fetch (e.g. making database calls), I might ask whether there is an alternative.
    • I'm imagining they would need to be process-local, and the metrics system would need to sum/aggregate them to generate the cluster view.
  3. Then identify how to do the transport (e.g. logging, emitting, fetching).

@ckdake
Contributor

ckdake commented Apr 12, 2023

Yep, that all sounds perfectly reasonable. The metrics I'm interested in are essentially/currently the cluster-wide metrics that rabbitmq exposes here: https://www.rabbitmq.com/monitoring.html#cluster-wide-metrics We're using a single queue, but once we add more queues the "Individual Queue Metrics" will be interesting.

From poking around the good_job source, most of these could probably each just be a single SQL query (maybe with some index changes?) that is run when .metrics() is called, and for me the ability to schedule (every 5 minutes?) the contents of .metrics() getting logged to stdout would do the trick.
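
For example, the first couple could look something like this (a sketch only; the .metrics method is hypothetical and the queries are unoptimized guesses built from the good_jobs columns):

```ruby
# Hypothetical sketch of per-queue counts; not GoodJob API.
module GoodJob
  def self.metrics
    base = GoodJob::Job.group(:queue_name)
    {
      # "waiting to execute": unfinished, not yet started, and past their scheduled time
      queued: base.where(finished_at: nil, performed_at: nil)
                  .where("COALESCE(scheduled_at, created_at) <= NOW()")
                  .count,
      # monotonically increasing total of finished jobs per queue
      processed: base.where.not(finished_at: nil).count,
    }
  end
end
```

Dumping that hash to stdout every five minutes (from a job, cron, or a separate thread) would then be the transport piece.
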

@bensheldon
Owner Author

bensheldon commented Apr 21, 2023

@ckdake could you pick 3 metrics from that list to start with? That would help me scope it down to an end-to-end implementation.

Also, some idle musings on this topic:

To give an example:

[Screenshot: 2023-04-21 at 8:29 AM]

Here's one of my apps that has:

  • 2 processes.
  • Each process runs 2 threadpools, each with:
    • 2 threads to pull from the whenever queue (high latency)
    • 2 threads to pull from any queue except whenever (low latency)

So that means I have total:

  • whenever: 4 threads
  • everything else: 4 threads? I think that's the complicatedness I'm wondering about.

@ckdake
Contributor

ckdake commented Apr 21, 2023

A great start would be letting me answer "how many items have been added to this queue?" "how many items are in this queue?" and "how many items have been processed in this queue?". A helpful next question would be "how long does it take to process each item in this queue?" and "are there any stuck jobs?" (count of individual jobs "older" than some threshold).

@bensheldon
Owner Author

Thanks. I think that's 5 😆 And apologies for being super pedantic here, but I want to surface the design questions to work through 😄

  • "how many items are in this queue?" This seems simplest to break down. I assume "in this queue" is specifically "waiting to execute", and would exclude future-scheduled jobs. Is that right?
  • "how many items have been added to this queue?" What would be the (default) time-period for this query? e.g. "how many items have been added to this in [in the last N minutes; since T]?"
    • Would future-scheduled jobs be considered "added" at their scheduled time, or when created?
    • In terms of "items" there is a small difference between jobs and retries/executions, because a job might be initially enqueued and execute on one queue, then retry and execute multiple more times on another queue. From the operational/scheduler perspective, it doesn't matter (they're all items to execute) but that might not be the mental model of the person enqueueing them. e.g. items in queues >= jobs enqueued. Would you need to distinguish between "Jobs enqueued" and "jobs+retries enqueued?"
  • "how many items have been processed in this queue?" Similar to the previous query.
  • "how long does it take to process each item in this queue?"
    • total_latency = queuing_latency + runtime_latency. Do you mean total_latency, or the breakdown? (Queuing latency is useful for identifying a lack of capacity; total latency is useful for calculating capacity for meeting SLOs.)
    • I assume you mean this as a calculation across multiple items. So an average, or p95, or max?
    • Same question of previous query, of "over what time period?"
    • Something that has bedeviled me on latency is how to handle null values. e.g. if a job is in queue but has not started executing, should I exclude the record from the calculation, or should I calculate it as COALESCE(performed_at, NOW()) - scheduled_at?
  • "are there any stuck jobs?" I think this is interesting/complicated.
    • I wonder if the previous question would be equivalently alertable e.g. "What is the MAX runtime latency?" and then you could set an alert threshold in your tool.

@ckdake
Contributor

ckdake commented Apr 25, 2023

It's a FIFO queue, so the first 3 would work :)

  1. for "how many in the queue". We are not (currently) using good_job to schedule jobs to happen at a particular time. It is the "waiting to execute" that will be most interesting there for starters for me. (gauge metric)
  2. I'm expecting "how many added" to be a total counter metric, increasing as each job is added. Not coupled to a time period. Given that we're processing 100s of jobs and not 1000000000s of jobs, this is fine for our use case.
    • Future scheduled jobs would be entirely separate, and only counted here when they are queued for execution.
    • "retries" is a whole additional level of metrics. I would at some point like to know error/retry rates, but thats probably round 2
  3. I'm expecting "how many processed" to be a total counter too.
  4. "latency"-> good question, I'm interested in all of the pieces, a good first version would be "time from when job is enqueued for execution until job is completed". Most of our jobs are super quick and are instrumented already with logs for job started -> job finished, so the piece good_job knows is how long they were queued before starting. mean,p95,max would all be helpful, and this would need a time range. if that time range approximately matched the polling interval of the monitoring tool, that might be ideal. Many tools use a very short polling interval (prometheus can do continual scraping, and requires special consideration over 2 minutes to avoid gaps in graphs), but I'd be good with a 5min rolling window. I'd expect the lantency to be for "completed jobs", so for you null example, that wouldn't be included in this metric. If the queue is piling up and not executing, I'd expect that to land in the first two metrics as a visible trend.
  5. Stuck. Yep, complicated to get right, but alerting on the max individual latency would do it. I'm not sure there's a great design to surface ALL latencies for ALL jobs from good_job to a monitoring tool, but surfacing a max_latency_in_past_X_timeframe from good_job could work.
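
For the latency piece, as a sketch of that scope (completed jobs only, trailing 5-minute window, per queue; Postgres-specific and not existing good_job code; column names come from the good_jobs table):

```ruby
# Sketch: queue latency stats for jobs that finished in the last 5 minutes.
# Postgres-only (percentile_cont); not an existing GoodJob query.
latency = "performed_at - COALESCE(scheduled_at, created_at)"
sql = <<~SQL
  SELECT queue_name,
         AVG(#{latency})                                          AS mean_queue_latency,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY #{latency}) AS p95_queue_latency,
         MAX(#{latency})                                          AS max_queue_latency
  FROM good_jobs
  WHERE performed_at IS NOT NULL
    AND finished_at >= NOW() - INTERVAL '5 minutes'
  GROUP BY queue_name
SQL
GoodJob::Job.connection.select_all(sql).to_a
```

The "stuck" alert in item 5 would need the complementary query over unfinished jobs, e.g. MAX(NOW() - COALESCE(scheduled_at, created_at)) where finished_at IS NULL.
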

@jtannas

jtannas commented May 18, 2023

Not sure if this is much help, but I just glued this together from the scopes on GoodJob::Job to give a summary count for the jobs on each queue.

WITH cte_statuses AS (
  SELECT
    CASE
      WHEN finished_at IS NOT NULL
        THEN (CASE WHEN error IS NULL THEN 'succeeded' ELSE 'discarded' END)
      WHEN COALESCE(scheduled_at, created_at) <= NOW()
        THEN (CASE WHEN pg_locks.locktype IS NULL THEN 'queued' ELSE 'running' END)
      WHEN (serialized_params ->> 'executions')::integer > 1
        THEN 'retried'
      ELSE
        'scheduled'
    END AS status,
    queue_name,
    priority
  FROM good_jobs
    LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory'
    AND pg_locks.objsubid = 1
    AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(32)::int
    AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(64) << 32)::bit(32)::int
  WHERE retried_good_job_id IS NULL
)
SELECT status_list.status, queues.queue_name, COUNT(cte_statuses.status)
FROM 
  (VALUES ('succeeded'), ('discarded'), ('queued'), ('running'), ('retried'), ('scheduled')) AS status_list(status)
  CROSS JOIN (SELECT DISTINCT queue_name FROM good_jobs) AS queues
  LEFT JOIN cte_statuses
    ON cte_statuses.status = status_list.status
    -- also match on queue_name so the counts are per-queue rather than repeated globally
    AND cte_statuses.queue_name = queues.queue_name
GROUP BY
  status_list.status,
  queues.queue_name;

The results get fed into metrics in our monitoring tool for pattern analysis, reporting, alerting, etc...
That plus the duration information we can parse out of the ActiveJob logs gets us what we need.

@bensheldon
Owner Author

@jtannas fyi, rubygems.org does an interesting thing using the GoodJob::JobsFilter's queries: https://github.com/rubygems/rubygems.org/blob/63f6ba818aaa21138a82c7d57da8380d261c5193/app/jobs/good_job_statsd_job.rb#L4-L10

I want to share it, but also warn that I still consider those private objects, so they're subject to change (write a test!) 😅

@AndersGM
Contributor

AndersGM commented Jul 7, 2023

@bensheldon It would be pretty useful to keep track of how many times a given job has been executed and, even more, to count retries and errors on a per-job basis. #984 added overall counts. What do you think about incorporating this into metrics?

@shouichi
Contributor

I think we can use ActiveJob's instrumentation events for some metrics. For example, I believe execution time can be measured via perform.active_job. Because GoodJob is designed for ActiveJob, maybe they can be utilized?

https://guides.rubyonrails.org/active_support_instrumentation.html#active-job.
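
For example (standard ActiveSupport::Notifications API; where the measurement goes, whether a logger, StatsD, or something else, is up to the application):

```ruby
# Sketch: record execution time per job class from ActiveJob's built-in
# "perform.active_job" instrumentation event.
ActiveSupport::Notifications.subscribe("perform.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.payload[:job]
  Rails.logger.info(
    "job_class=#{job.class.name} queue=#{job.queue_name} duration_ms=#{event.duration.round(1)}"
  )
end
```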

By the way, I wanted to expose metrics to Prometheus but realized that it's not so straightforward. Metrics such as execution time should be measured per process, so they can be scraped by Prometheus. But metrics such as queue length are global. It's strange to return those metrics from the individual process. An exporter may be needed for those metrics.

jklina added a commit to jklina/good_job that referenced this issue Sep 17, 2023
This removes the health check logic from the ProbeServer and renames the
ProbeServer to UtilityServer that accepts any Rack based app.

The health check and catchall logic are moved into simple Rack middleware
that can be composed by users however they like and be used to preserve
existing health check behavior while transitioning to a more general
purpose utility server.

All in all, this pattern will allow users to add whatever functionality
they like to GoodJob's web server by composing Rack apps and using
GoodJob's configuration to pass in users' Rack apps, i.e.:

```
config.good_job.middleware = Rack::Builder.app do
  use GoodJob::Middleware::MyCustomMiddleware
  use GoodJob::Middleware::PrometheusExporter
  use GoodJob::Middleware::Healthcheck
  run GoodJob::Middleware::CatchAll
end
config.good_job.middleware_port = 7001
```

This could help resolve:

* bensheldon#750
* bensheldon#532
@bagedevimo

Just dropping in to add that we're also investigating options for reporting job latency & queued count, per queue, to CloudWatch in the interest of alerting and autoscaling. For our use case, the latency of the last started job in the queue is fine.

For a little context, we run a few different queues (critical, high, default, deferrable) and we really want to know if our critical queue is getting behind by more than a few minutes, and we'd like for our deferrable queue to just be permanently growing, if that makes sense.


@simi
Contributor

simi commented Nov 25, 2023

@thatsmydoing

💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.

Just to note, this isn't in the latest release version of prometheus_exporter (2.0.8).

bensheldon added a commit that referenced this issue Jan 23, 2024
* Make health probe server more general purpose

This removes the health check logic from the ProbeServer and renames the
ProbeServer to UtilityServer that accepts any Rack based app.

The health check and catchall logic are moved into simple Rack middleware
that can be composed by users however they like and be used to preserve
existing health check behavior while transitioning to a more general
purpose utility server.

All in all, this pattern will allow users to add whatever functionality
they like to GoodJob's web server by composing Rack apps and using
GoodJob's configuration to pass in users' Rack apps, i.e.:

```
config.good_job.middleware = Rack::Builder.app do
  use GoodJob::Middleware::MyCustomMiddleware
  use GoodJob::Middleware::PrometheusExporter
  use GoodJob::Middleware::Healthcheck
  run GoodJob::Middleware::CatchAll
end
config.good_job.middleware_port = 7001
```

This could help resolve:

* #750
* #532

* Use new API

* Revert server name change

We decided that leaving the original ProbeServer name better sets
expectations. See:

#1079 (review)

This also splits out middleware testing into separate specs.

* Restore original naming

This also helps ensure that the existing behavior and API remain intact.

* Appease linters

* Add required message for mock

* Make test description relevant

* Allow for handler to be injected into ProbeServer

* Add WEBrick handler

* Add WEBrick as a development dependency

* Add WEBrick tests and configuration

* Add idle_timeout method to mock

* Namespace server handlers

* Warn and fallback when WEBrick isn't loadable

Since the probe server has the option to use WEBrick as a server
handler, but this library doesn't have WEBrick as a dependency,
we want to throw a warning when WEBrick is configured, but not in the
load path. This will also gracefully fallback to the built in HTTP
server.

* inspect load path

* Account for multiple webrick entries in $LOAD_PATH

* try removing load path test

* For error on require to initiate test

As opposed to manipulating the load path.

* Handle explicit nils in initialization

* Allow probe_handler to be set in configuration

* Add documentation for probe server customization

* Appease linter

* retrigger CI

* Rename `probe_server_app` to `probe_app`; make handler name a symbol; rename Rack middleware/app for clarity

* Update readme to have relevant app example

* Fix readme grammar

---------

Co-authored-by: Ben Sheldon [he/him] <[email protected]>