When a test times out, it sometimes jumps from runner to runner multiple times #365
Comments
Now displaying as "Taken by small-runner-20 on 2020-05-19 14:17 UTC".
Runners treat jobs that another runner is still executing as stalled. Some of those jobs are about to time out; however, a new runner marks the job as stalled and re-runs it while the first runner is still running it. GitHub doesn't receive an error or failure from the first runner, only the notification that a new runner is picking up the job.
Issue: #365
Signed-off-by: Armando Neto <[email protected]>
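A minimal sketch of a more conservative stall check, assuming the status description keeps the "Taken by <runner> on <timestamp> UTC" format quoted above. The function names, the `task_timeout` argument, and the 10-minute grace margin are hypothetical and are not taken from the PR-CI code base; the idea is only that a job should not be treated as stalled until its own timeout plus a margin has elapsed.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches descriptions such as:
#   Taken by small-runner-20 on 2020-05-19 14:17 UTC
TAKEN_RE = re.compile(
    r"Taken by (?P<runner>\S+) on (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) UTC"
)


def parse_taken(description):
    """Return (runner, taken_at) or None if the description does not match."""
    match = TAKEN_RE.search(description)
    if match is None:
        return None
    taken_at = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M")
    return match.group("runner"), taken_at.replace(tzinfo=timezone.utc)


def is_stalled(description, now, task_timeout, grace=timedelta(minutes=10)):
    """Treat a job as stalled only after its timeout plus a grace margin.

    `grace` is a hypothetical safety margin so that a job that is about to
    time out on its original runner is not re-run by another runner.
    """
    parsed = parse_taken(description)
    if parsed is None:
        return False  # unknown format: do not steal the job
    _, taken_at = parsed
    return now - taken_at > task_timeout + grace
```

With a one-hour task timeout, a job taken at 14:17 UTC would be left alone at 15:20 UTC and only considered stalled after 15:27 UTC.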
The issue is fixed, so I am closing this ticket.
Hi @netoarmando. What if we put a message broker between the runners and GitHub instead of using a polling mechanism?
I'm drafting something along these lines. It could be an optional feature for PR-CI runners and use the prci-automation host, similar to the Vagrant catalog. In the meantime, I've opened #403 to improve the data we collect to troubleshoot this issue.
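As a rough illustration of the broker idea, the sketch below uses Python's standard `queue.Queue` as a stand-in for a real message broker. This is an assumption made for illustration only; the thread does not say which broker, if any, would be used, and `run_workers` is a hypothetical helper. The point is that a queue hands each job to exactly one consumer, so runners do not need to poll a shared status and race to claim it.

```python
import queue
import threading


def run_workers(jobs, runner_names):
    """Distribute jobs to runners through a queue; return (job, runner) pairs."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    handled = []
    handled_lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # get_nowait() removes the job atomically, so no other
                # worker can receive the same job.
                job = job_queue.get_nowait()
            except queue.Empty:
                return
            with handled_lock:
                handled.append((job, name))
            job_queue.task_done()

    threads = [threading.Thread(target=worker, args=(name,)) for name in runner_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

Which runner receives which job depends on scheduling, but every job is delivered exactly once.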
Removing the sleep when locking a task reduces the time between polling the status and setting a new one, which lowers the chances of a race condition. "Taken" is now split into "Locked" and "Taken", which can also help us troubleshoot future issues. The "pending for rerun" status now includes the runner that set it and when it was set.
Issue: #365
Signed-off-by: Armando Neto <[email protected]>
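The two-phase "Locked" then "Taken" flow might look roughly like the sketch below. `StatusStore` is a hypothetical in-memory stand-in for the commit status API with last-write-wins semantics, and the "Locked by" and "Taken by" strings are simplified assumptions; this is not the actual PR-CI implementation.

```python
class StatusStore:
    """Hypothetical stand-in for a commit status API: the last write wins."""

    def __init__(self):
        self.description = "pending"

    def read(self):
        return self.description

    def write(self, description):
        self.description = description


def try_take(store, runner):
    """Try to claim a task; return True if this runner may execute it."""
    if store.read() != "pending":
        return False  # someone else already locked or took the task

    # Phase 1: announce the lock immediately. No sleep between the poll
    # above and this write keeps the race window as short as possible.
    store.write(f"Locked by {runner}")

    # Phase 2: confirm the lock is still ours before starting work.
    if store.read() != f"Locked by {runner}":
        return False  # another runner overwrote the lock in the meantime

    store.write(f"Taken by {runner}")
    return True
```

Because the stand-in has no atomic compare-and-set, two runners can still interleave between the read and the write. The sketch narrows the window without closing it, which matches the "lower the chances" wording in the commit message.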
I am not sure if this is already known, but today I observed a timed-out job being picked up and executed by 3 runners simultaneously:
Tests on runners 4, 7, 9 are still running.
The reason the task was set to rerun is
PRCI suspects something is wrong:
@wladich That's the issue. I've collected similar data where a job was picked up by multiple runners even when the status clearly locks it to another runner. It's hard to reproduce since we are using GitHub's API as a database.
[my previous comment updated with more logs] Strange things:
Ideas:
Here are the log lines containing "#5662" on all runners, starting at the moment the commit was pushed:
Definitely a problem:
It took small-runner-3 5 minutes to realize it cannot process the locked task, and small-runner-12 marked the task as stale 50 seconds after it was locked.
What's the next message after
Sorry, I did not notice that these were different tasks - "/build" vs "/test_trust", so it is another problem.
So it started building right after "successfully locked".
Aha, I think I found one issue: the system time on small-runner-12 and small-runner-3 differs by 2 minutes!
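To illustrate how clock skew could make a fresh lock look stale, here is a small sketch. The 2-minute skew and the 50-second elapsed time come from the observations in this thread; the staleness threshold, the date, and the function name are hypothetical values chosen only to make the arithmetic visible.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the real value used by the runners is not
# stated in this thread.
STALE_AFTER = timedelta(minutes=2, seconds=30)


def looks_stale(locked_at, observer_now, stale_after=STALE_AFTER):
    """Return True if the lock appears older than the threshold."""
    return observer_now - locked_at > stale_after


# Arbitrary reference time, stamped by the runner that took the lock.
locked_at = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
elapsed = timedelta(seconds=50)  # real time since the lock was taken
skew = timedelta(minutes=2)      # observer's clock runs 2 minutes ahead

synced_now = locked_at + elapsed  # observer with a synchronized clock
skewed_now = synced_now + skew    # observer with a skewed clock

print(looks_stale(locked_at, synced_now))  # False: 50 s is under the threshold
print(looks_stale(locked_at, skewed_now))  # True: appears to be 2 min 50 s old
```

An observer whose clock runs ahead sees every lock as older than it really is, so any age-based staleness check fires early by the amount of the skew.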
Some statistics before correcting the time:
I have updated the chrony config on all runners and the time is now synced. Let's wait a few days and check the statistics again.
Updated failure statistics
Fixing time synchronization seems to have little or no effect on the issue.
Here are some parts of log files illustrating one task being locked and executed on multiple runners:
An example is freeipa/freeipa#4701 which was created May 18, 9:46 GMT+2 but was taken by multiple runners and is still running now on a different runner (Taken by small-runner-15 on 2020-05-19 13:14 UTC).
This is annoying as more than 1 day later the test is not finished. I would rather see the result as "Time-out".