JobStatusLite crashing while tracking jobs - unable to locate daemon #9703
Comments
@mapsacosta Maria, would you know whether we had any HTCondor-level intervention with the FNAL production schedds yesterday? From the traceback above, it was around 15:47:55 FNAL time.
Hi @amaltaro HTCondor 8.9.6 kicked in with the production puppet push yesterday around that time. That might explain why the agent went mad.
This vocms0283 (JobSubmitter)
Failure 1 (on 12 June 2024)
Failure 2 (on 8 October 2024)
vocms0282 (JobStatusLite)
Failure 1 (on * October 2024)
FYI @amaltaro
Thank you for updating this ticket, Ahmed. For JobSubmitter, I feel like we could wrap this call. Then in JobStatusLite it is likely something similar as well. @hassan11196 if you feel like trying to provide a PR for this, please let us know and we can help you out as needed.
Hi @amaltaro The submitJobs call is already wrapped in a try-except clause where the transaction is rolled back. The BossAirException is then raised again.
You suggest to wrap the Thanks
That is correct. Another alternative could be to simply have a string match in the generic exception block here (in the algorithm method), and if the error is about the condor daemon, we then do not re-raise that exception upstream.
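The string-match alternative described above could look roughly like the sketch below. It is not the actual WMCore code: `run_cycle`, `is_condor_daemon_error` and `DAEMON_ERROR_MARKERS` are hypothetical names, and the marker text is only assumed from the issue title ("unable to locate daemon"); the real HTCondor message may be worded differently.

```python
import logging

# Assumption: substrings taken from the issue title; the real error text may differ.
DAEMON_ERROR_MARKERS = ("unable to locate", "daemon")


def is_condor_daemon_error(exc):
    """Return True if the exception message looks like a daemon lookup failure."""
    msg = str(exc).lower()
    return all(marker in msg for marker in DAEMON_ERROR_MARKERS)


def run_cycle(work):
    """Run one component cycle (hypothetical stand-in for the algorithm method).

    Daemon lookup errors are logged and swallowed so the component survives
    until the next cycle; any other exception is re-raised upstream.
    """
    try:
        work()
    except Exception as ex:
        if is_condor_daemon_error(ex):
            logging.warning("Condor daemon not reachable, skipping cycle: %s", ex)
            return False
        raise
    return True
```

Matching on message text is fragile if the wording changes between HTCondor releases, so the markers would need to be checked against the real traceback before relying on this.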
Impact of the bug
WMAgents
Describe the bug
I've seen this problem twice over the last 24h, so here is an issue to get it fixed.
JobStatusLite crashes when it's getting an instance of the htcondor schedd daemon object:
Maybe it's just a coincidence, but it first happened with submit5, now with submit4 (it might be worth checking with Maria whether there was any condor_schedd outage or so).
How to reproduce it
Not sure
Expected behavior
The component should try to recreate the schedd object. If that fails again, it should gracefully skip the cycle and try again in the next component execution.
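A minimal sketch of that behavior, assuming the schedd object is obtained through some factory callable: `make_schedd`, `track_jobs`, `get_schedd_with_retry` and `poll_once` are hypothetical names used for illustration only, not functions from WMCore or the HTCondor bindings.

```python
import logging


def get_schedd_with_retry(make_schedd, attempts=2):
    """Call make_schedd up to `attempts` times; return None if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return make_schedd()
        except Exception as ex:
            logging.warning("Attempt %d/%d to get the schedd failed: %s",
                            attempt, attempts, ex)
    return None


def poll_once(make_schedd, track_jobs):
    """Run one tracking cycle; skip it (return False) if no schedd is available."""
    schedd = get_schedd_with_retry(make_schedd)
    if schedd is None:
        logging.error("Schedd unavailable, skipping this cycle")
        return False
    track_jobs(schedd)
    return True
```

With the default of two attempts this is "try once, recreate once, then skip"; the caller simply invokes `poll_once` again on the next component execution.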
Additional context and error message
Traceback from the logs: