make the default JOB_EVENT_BUFFER_SECONDS 1 seconds #14335

Merged


jainnikhil30
Contributor

@jainnikhil30 jainnikhil30 commented Aug 11, 2023

This follows up on an experiment that increased JOB_EVENT_BUFFER_SECONDS from 0.1 seconds to 1 second. Under high load, with many events being generated, the default of 0.1 seconds leaves us with many small bins, which is undesirable. With 1 second we get larger bins. This is clearly visible in the following Grafana graphs:

  • With 0.1 seconds:

[Grafana graph: 0.1 sec]

  • With 1 second:

[Grafana graph: 1 sec]

Thus, making 1 second the default for JOB_EVENT_BUFFER_SECONDS makes sense.
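To make the bin-size effect concrete, here is a toy model rather than AWX code. It assumes a batch is flushed once the buffer window has elapsed since the batch's first event, and it assumes a steady load of one event per millisecond. The function name `batch_sizes` and the millisecond units are illustrative choices, not anything from the callback receiver.

```python
def batch_sizes(arrival_ms, buffer_ms):
    """Return the size of each flushed batch.

    Toy model: a batch is flushed once buffer_ms has elapsed
    since the arrival of its first event.
    """
    sizes = []
    batch_start = None
    count = 0
    for t in arrival_ms:
        # Close the current batch if its window has expired.
        if batch_start is not None and t - batch_start >= buffer_ms:
            sizes.append(count)
            batch_start, count = None, 0
        if batch_start is None:
            batch_start = t
        count += 1
    if count:
        sizes.append(count)  # flush whatever is left at the end
    return sizes


# Assumed load: one event per millisecond for two seconds.
arrivals = list(range(2000))
small = batch_sizes(arrivals, buffer_ms=100)
large = batch_sizes(arrivals, buffer_ms=1000)
print(len(small), max(small))
print(len(large), max(large))
```

Under these assumed numbers, the 0.1 second window gives 20 batches of 100 events each, while the 1 second window gives 2 batches of 1000 events each.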

ISSUE TYPE
  • Bug, Docs Fix or other nominal change

@AlanCoding
Member

Let me go ahead and make some comments public - specifically why I'm not worried about this change for responsiveness.

When I think of responsiveness, I have manual behavior in mind. If you are trying to do something by clicking, you presumably go to a job template and click the "launch" button. In that case, it is unlikely that other background jobs are actively making progress at that moment, because most long-running jobs, while doing work, spend their time inside heavy tasks.

If your playbook runs fast (or just runs in bursts, which will always be true), then you'll see a delay due to the Redis read timeout:

res = self.redis.blpop(self.queue_name, timeout=1)

What's probably normal and predictable (as the user is watching the standard out) is that a handful of events come in, go into the buffer, and then there's nothing else to read. The worker then stays in the read until the timeout is hit, and only after that does it flush. So that 1 second read timeout is the main thing that delays the user seeing the first events. This is particularly true with a significant number of callback workers (in practice there are at least 4) that split the events between themselves.
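The sequence described above can be sketched with a simulated clock. This is a simplified model, not the actual callback worker: `FakeQueue` is a hypothetical stand-in for the Redis list, `run_worker` is an illustrative name, and times are in milliseconds. It assumes the worker flushes either when a read times out or when the buffer window has elapsed.

```python
from collections import deque


class FakeQueue:
    """Hypothetical stand-in for the Redis list, driven by a simulated clock (ms)."""

    def __init__(self, events):
        self.events = deque(events)  # (arrival_ms, payload), sorted by arrival
        self.now = 0

    def blpop(self, timeout_ms):
        # Return the next payload if it arrives within the timeout, else None.
        if self.events and self.events[0][0] <= self.now + timeout_ms:
            arrival, payload = self.events.popleft()
            self.now = max(self.now, arrival)
            return payload
        self.now += timeout_ms
        return None


def run_worker(queue, read_timeout_ms, buffer_ms, reads):
    """Read `reads` times; flush on a read timeout or when buffer_ms has elapsed."""
    buffer, flushes = [], []
    last_flush = queue.now
    for _ in range(reads):
        item = queue.blpop(read_timeout_ms)
        if item is not None:
            buffer.append(item)
        due = queue.now - last_flush >= buffer_ms
        if buffer and (item is None or due):
            flushes.append((queue.now, list(buffer)))
            buffer.clear()
            last_flush = queue.now
    return flushes


# A short burst of three events, then silence.
burst = FakeQueue([(0, "a"), (10, "b"), (20, "c")])
flushes = run_worker(burst, read_timeout_ms=1000, buffer_ms=1000, reads=4)
print(flushes)
```

In this model the three events arrive within 20 ms, but the single flush happens at 1020 ms, once the fourth read times out. That is the first-event delay in question, and it comes from the read timeout rather than from the buffer window.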

I think it's worth considering how we can increase responsiveness. For example, if the last read was known to be a timeout, we could flush on seeing the first new event (only if stdout is not empty). That would alert the user to a batch of events coming in. However, to back this up with measurements, we should have some first-event timing benchmarks.
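As a rough sketch of that idea (not something implemented in this PR), the model below takes a list of read results and compares flush times with and without the proposed rule. The function name `flush_times`, the `eager_after_idle` flag, and the event shape with a `stdout` key are all hypothetical, and the worker is assumed to start out idle, as if its previous read had timed out.

```python
def flush_times(reads, buffer_ms, eager_after_idle):
    """Return (time_ms, batch_size) for each flush.

    reads: list of (time_ms, event); event is None for a read timeout,
    otherwise a dict with a "stdout" key (hypothetical event shape).
    """
    buffer, flushes = [], []
    last_flush = 0
    last_read_timed_out = True  # assumption: the worker starts out idle
    for now, event in reads:
        if event is None:
            force = True  # a read timeout always flushes
        else:
            buffer.append(event)
            # Proposed rule: flush right away on the first event after an
            # idle period, but only if it carries some stdout.
            force = (
                eager_after_idle
                and last_read_timed_out
                and bool(event.get("stdout"))
            )
        if buffer and (force or now - last_flush >= buffer_ms):
            flushes.append((now, len(buffer)))
            buffer = []
            last_flush = now
        last_read_timed_out = event is None
    return flushes


reads = [
    (0, {"stdout": "PLAY [all]"}),
    (10, {"stdout": "TASK [ping]"}),
    (20, {"stdout": "ok: [host1]"}),
    (1020, None),  # read timed out after the burst
]
current = flush_times(reads, buffer_ms=1000, eager_after_idle=False)
proposed = flush_times(reads, buffer_ms=1000, eager_after_idle=True)
print(current)
print(proposed)
```

With these made-up timings, the current behavior flushes all three events at 1020 ms, while the proposed rule flushes the first event at 0 ms and the remaining two at 1020 ms. Real first-event timing benchmarks would still be needed to show the benefit in practice.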

Member

@AlanCoding AlanCoding left a comment


those are some big 🪣 s

@jainnikhil30 jainnikhil30 merged commit 4cd9016 into ansible:devel Aug 12, 2023
@jainnikhil30 jainnikhil30 deleted the increase_the_job_event_buffer_seconds branch August 12, 2023 02:19
djyasin pushed a commit to djyasin/awx that referenced this pull request Sep 16, 2024
djyasin pushed a commit to djyasin/awx that referenced this pull request Nov 11, 2024