Batches don't work in production #1555

Open
jiri1337 opened this issue Nov 28, 2024 · 2 comments

Comments

@jiri1337

jiri1337 commented Nov 28, 2024

Rails version 7.1.4
Ruby version 3.3.5
GoodJob version 4.3.0

Hello,
First of all, we use multitenancy with row-level security (RLS), so it is possible that we messed something up ourselves. The weird thing is that batches work perfectly in development mode with both the async and inline adapters.

Here is how we monkey-patched the JobPerformer:

module GoodJob
  # https://github.com/bensheldon/good_job/blob/v3.29.3/lib/good_job/job_performer.rb#L32
  class JobPerformer
    alias original_next next

    def next
      ApplicationRecord.with_restore_previous_tenant do |tenant_ops|
        original_next do |execution|
          tenant = Tenant.find(execution.tenant_id)
          tenant_ops.setup_request_store_with_tenant(tenant)
          tenant_ops.setup_database_with_tenant(tenant.id)

          yield(execution) if block_given?
        end
      end
    end
  end
end

This has so far worked flawlessly with regular jobs.

Now, we have introduced a complex batch (similar to https://github.com/bensheldon/good_job?tab=readme-ov-file#complex-batches).

Then we have a batch job like this:

class FinalizeBillingRunJob < ApplicationJob
  queue_as :default

  def perform(batch, _context)
    billing_run = BillingRun.find(batch.properties[:billing_run_id])

    if batch.properties[:stage].nil?
      billing_run.generate(...)

      batch.enqueue(stage: 1) do
        billing_run.invoices.each do |invoice|
          ::GenerateDocumentPDFJob.perform_later(document: invoice, ...)
        end
      end

    elsif batch.properties[:stage] == 1
      # attempt to deliver via email (if possible)
      billing_run.invoices.each do |invoice|
        ...
      end

      # generate PDFs for all invoices
      billing_run.generate_pdf
    end
  end
end

We invoke it from our controller by calling:

GoodJob::Batch.enqueue(on_finish: FinalizeBillingRunJob, billing_run_id: ...)

This works perfectly for me locally. Once we deploy this to an instance, the batch only runs once (no error is raised, it completes successfully). The second time you run a batch, the first job (FinalizeBillingRunJob with a nil stage) is queued but never picked up by the scheduler. It hangs as pending/queued forever, or, funnily enough, until we restart the instance. Then it gets picked up immediately and completes without an error.

We use Puma with the async adapter in production and have implemented the suggested changes from https://github.com/bensheldon/good_job?tab=readme-ov-file#execute-jobs-async--in-process

I would very much appreciate it if you could point out what could have gone wrong.

@bensheldon
Owner

hmm, that's really strange!

How are the job and batch records being tenant'ed? I could imagine that maybe the jobs and batch records are being placed on a different database, and thus aren't able to be queried from the current context.

@jiri1337
Author

jiri1337 commented Dec 3, 2024

Every table has a tenant_id column. We use a single database and rely on row-level security. The name setup_database_with_tenant(tenant.id) might be misleading: it essentially only executes SET app.tenant_id = #{tenant_id}; at the database level (PostgreSQL).
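
For illustration, the tenant switch described above could look roughly like the following. This is only a sketch, not the actual implementation: `FakeConnection` is a hypothetical stand-in for a `pg` connection so the snippet runs without a database, and passing the connection as an explicit argument is an assumption. It also swaps the interpolated `SET` for PostgreSQL's `set_config` function with a bound parameter, so the tenant id is not spliced into the SQL string.

```ruby
# Hypothetical stand-in for a PG::Connection; it only records what would be sent.
FakeConnection = Struct.new(:statements) do
  def exec_params(sql, params)
    statements << [sql, params]
  end
end

# Sketch of a tenant switch. Only the setting name app.tenant_id comes from
# the thread; the method signature is an assumption.
def setup_database_with_tenant(connection, tenant_id)
  # set_config(name, value, is_local): is_local = false makes the value
  # last for the whole session, like a plain SET.
  connection.exec_params(
    "SELECT set_config('app.tenant_id', $1, false)",
    [tenant_id.to_s]
  )
end

conn = FakeConnection.new([])
setup_database_with_tenant(conn, 42)
p conn.statements
```

Because the value is session-wide, whichever connection a job thread checks out keeps that tenant until it is set again or reset.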

The strange thing is that regular jobs always work and are processed immediately. This remains true even when there is a stuck or queued job that was created via a batch earlier. I find this surprising because I would expect the scheduler to process jobs as a strict queue, with no skipping.

I tried searching the project for differences in how the jobs work in development and production but couldn't find any significant differences.

Edit:
It seems enabling poll_interval solves this problem completely. Is LISTEN/NOTIFY not fully supported on batches?
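
For anyone hitting the same symptom, the workaround mentioned in the edit above might be configured roughly as follows. This is a sketch only: the 30-second value is an arbitrary assumption, not the setting used in this report, and the file path is the conventional Rails location.

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Run GoodJob inside the web process, as described in the report.
  config.good_job.execution_mode = :async

  # Poll for queued jobs every 30 seconds (assumed value) as a fallback
  # when a LISTEN/NOTIFY wake-up is not received.
  config.good_job.poll_interval = 30
end
```

The same setting can also be supplied through the `GOOD_JOB_POLL_INTERVAL` environment variable.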
