Batches don't work in production #1555

jiri1337 · 2024-11-28T09:25:26Z

Rails version 7.1.4
Ruby version 3.3.5
GoodJob version 4.3.0

Hello,
first of all, we are using multitenancy with row-level security (RLS), so it is possible that we messed something up. The strange thing is that batches work perfectly in development mode with both the async and inline adapters.

Here is how we monkey-patched the JobPerformer:

module GoodJob
  # https://github.com/bensheldon/good_job/blob/v3.29.3/lib/good_job/job_performer.rb#L32
  class JobPerformer
    alias original_next next

    def next
      ApplicationRecord.with_restore_previous_tenant do |tenant_ops|
        original_next do |execution|
          tenant = Tenant.find(execution.tenant_id)
          tenant_ops.setup_request_store_with_tenant(tenant)
          tenant_ops.setup_database_with_tenant(tenant.id)

          yield(execution) if block_given?
        end
      end
    end
  end
end

This has so far worked flawlessly with regular jobs.

Now, we have introduced a complex batch (similar to https://github.com/bensheldon/good_job?tab=readme-ov-file#complex-batches).

Then we have a batch job like this:

class FinalizeBillingRunJob < ApplicationJob
  queue_as :default

  def perform(batch, _context)
    billing_run = BillingRun.find(batch.properties[:billing_run_id])

    if batch.properties[:stage].nil?
      billing_run.generate(...)

      batch.enqueue(stage: 1) do
        billing_run.invoices.each do |invoice|
          ::GenerateDocumentPDFJob.perform_later(document: invoice, ...)
        end
      end

    elsif batch.properties[:stage] == 1
      # attempt to deliver via email (if possible)
      billing_run.invoices.each do |invoice|
        ...
      end

      # generate PDFs for all invoices
      billing_run.generate_pdf
    end
  end
end

We invoke it from our controller by calling:

GoodJob::Batch.enqueue(on_finish: FinalizeBillingRunJob, billing_run_id: ...)

This works perfectly for me locally. Once we deploy it to an instance, however, the batch only runs once (no error is raised, and it completes successfully). The second time we run a batch, the first job (FinalizeBillingRunJob with a nil stage) is enqueued but never picked up by the scheduler. It stays pending/queued forever or, funnily enough, until we restart the instance, at which point it is picked up immediately and completes without an error.

We are using Puma with the async adapter in production and have implemented the suggested changes from https://github.com/bensheldon/good_job?tab=readme-ov-file#execute-jobs-async--in-process.

I would very much appreciate it if you could point out what might have gone wrong.

bensheldon · 2024-12-02T17:48:58Z

hmm, that's really strange!

How are the job and batch records being tenant'ed? I could imagine that maybe the jobs and batch records are being placed on a different database, and thus aren't able to be queried from the current context.

jiri1337 · 2024-12-03T09:58:29Z

Every table has a tenant_id column. We use a single database and rely on row-level security. The name setup_database_with_tenant(tenant.id) might be misleading: it essentially only executes SET app.tenant_id = #{tenant_id}; at the database level (PostgreSQL).
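To make the statement above concrete, here is a minimal sketch of how such a statement might be built. The method name `tenant_set_sql` is hypothetical and the sketch assumes integer tenant IDs; the actual helper is not shown in this thread.

```ruby
# Hypothetical helper (not from the original code): builds the statement
# that scopes a PostgreSQL session to one tenant for row-level security.
# Assumes tenant IDs are integers.
def tenant_set_sql(tenant_id)
  # Integer() raises for non-numeric input, so arbitrary text is never
  # interpolated into the SQL string.
  id = Integer(tenant_id)
  "SET app.tenant_id = #{id};"
end

puts tenant_set_sql(42)
```

Because the setting lives on the database session, whichever connection runs a query has to have had this statement executed on it first.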

Regular jobs always work and are processed immediately, even while a job created earlier via a batch is still stuck in the queue. That surprises me, because I would expect the scheduler to be implemented as a queue data structure, with no skipping.

I tried searching the project for differences in how jobs work in development and production but couldn't find anything significant.

Edit:
It seems enabling poll_interval solves this problem completely. Is LISTEN/NOTIFY not fully supported on batches?
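
For anyone landing here with the same symptom, here is a minimal sketch of turning on polling as a fallback alongside LISTEN/NOTIFY. It assumes the standard `config.good_job` settings from the GoodJob README; the file location and the 30-second value are illustrative choices, not taken from this thread.

```ruby
# config/environments/production.rb (illustrative location)
Rails.application.configure do
  # Run jobs inside the web process, as described above.
  config.good_job.execution_mode = :async

  # Poll for queued jobs every 30 seconds in addition to LISTEN/NOTIFY.
  # 30 is an arbitrary example value, not a recommendation.
  config.good_job.poll_interval = 30
end
```

The same setting can also be supplied through the GOOD_JOB_POLL_INTERVAL environment variable.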

github-project-automation bot added this to GoodJob Backlog v2 Nov 28, 2024

github-project-automation bot moved this to Inbox in GoodJob Backlog v2 Nov 28, 2024
