JobStatusLite crashing while tracking jobs - unable to locate daemon #9703

Open
amaltaro opened this issue May 20, 2020 · 6 comments · May be fixed by #12172

Comments

@amaltaro
Contributor

Impact of the bug
WMAgents

Describe the bug
I've seen this problem twice over the last 24h, so here is an issue to get it fixed.
JobStatusLite crashes when it's getting an instance of the htcondor schedd daemon object:

Unable to locate local daemon

Maybe it's just a coincidence, but it first happened with submit5 and now with submit4 (it might be worth checking with Maria whether there was a condor_schedd outage).

How to reproduce it
Not sure

Expected behavior
The component should try to recreate the schedd object. If that fails again, it should gracefully skip the cycle and try again in the next component execution.
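
A minimal sketch of that retry-then-skip behaviour, not the actual WMCore code. The helper name get_schedd_with_retry and its parameters are hypothetical, and it assumes the failure surfaces as a RuntimeError (or a subclass); adjust retry_on if the HTCondor bindings in use raise something else. The factory is passed in so the sketch runs without HTCondor installed.

```python
import logging
import time


def get_schedd_with_retry(schedd_factory, retries=1, delay=0.0,
                          retry_on=(RuntimeError,)):
    """
    Call schedd_factory() and retry up to `retries` extra times if it raises
    one of the exceptions in `retry_on`.

    Returns the object built by the factory, or None if every attempt failed,
    so that the caller can skip the current cycle instead of crashing.
    """
    for attempt in range(retries + 1):
        try:
            return schedd_factory()
        except retry_on as ex:
            logging.warning("Attempt %d to create the schedd object failed: %s",
                            attempt + 1, str(ex))
            if attempt < retries:
                # Give the daemon a moment before trying again
                time.sleep(delay)
    return None
```

In the plugin this could be called as get_schedd_with_retry(htcondor.Schedd), with the track or submit method returning early when the result is None.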

Additional context and error message
Traceback from the logs:

2020-05-19 15:47:55,290:140664759113472:INFO:BossAirAPI:About to look for 1219 loadedJobs.
2020-05-19 15:47:55,315:140664759113472:ERROR:BossAirAPI:Unhandled exception while tracking jobs for plugin SimpleCondorPlugin!
Unable to locate local daemon
2020-05-19 15:47:55,434:140664759113472:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMCore.BossAir.StatusPoller.StatusPoller object at 0x7fef1131cf90> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while tracking jobs for plugin SimpleCondorPlugin!
Unable to locate local daemon
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : track
        ClassInstance : None
        FileName : /data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/BossAir/BossAirAPI.py
        ClassName : None
        LineNumber : 486
        ErrorNr : 0

Traceback: 
  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/BossAir/BossAirAPI.py", line 473, in track
    localRunning, localChanges, localCompletes = pluginInst.track(jobs=jobsToTrack[plugin])

  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 216, in track
    schedd = htcondor.Schedd()

<@---------- WMException End ----------@>  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 182, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)

  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/Database/DBExceptionHandler.py", line 39, in wrapper
    return f(*args, **kwargs)
  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/Utils/Timers.py", line 24, in wrapper
    res = func(*arg, **kw)
  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/BossAir/StatusPoller.py", line 68, in algorithm
    self.checkStatus()
  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/BossAir/StatusPoller.py", line 92, in checkStatus
    runningJobs = self.bossAir.track()
  File "/data/srv/wmagent/v1.3.0/sw/slc7_amd64_gcc630/cms/wmagent/1.3.0/lib/python2.7/site-packages/WMCore/BossAir/BossAirAPI.py", line 486, in track
    raise BossAirException(msg)

2020-05-19 15:47:55,435:140664759113472:INFO:Harness:>>>Terminating worker threads
@amaltaro
Contributor Author

@mapsacosta Maria, would you know whether we had any HTCondor-level intervention with the FNAL production schedds yesterday? From the traceback above, it was around 15:47:55 FNAL time.

@mapsacosta

Hi @amaltaro

HTCondor 8.9.6 kicked in with the production puppet push yesterday around that time. That might explain why the agent went mad.

@hassan11196
Member

The "Unable to locate local daemon" error has resurfaced, most recently in two CERN agents. Agent version: 2.3.4.

vocms0283 (JobSubmitter)

Failure 1 (on 12 June 2024)

2024-06-12 16:05:04,202:140414616356608:INFO:JobSubmitterPoller:Have 1 packages to submit.
2024-06-12 16:05:04,202:140414616356608:INFO:JobSubmitterPoller:Have 75 jobs to submit.
2024-06-12 16:05:04,202:140414616356608:INFO:JobSubmitterPoller:Done assigning site locations.
2024-06-12 16:05:04,293:140414616356608:ERROR:BossAirAPI:Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
Unable to locate local daemon
2024-06-12 16:05:04,339:140414616356608:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7fb4e1174520> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
Unable to locate local daemon
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : submit
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 396
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 382, in submit
    localSuccess, localFailure = pluginInst.submit(jobs=jobsToSubmit,

  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 157, in submit
    schedd = htcondor.Schedd()

  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)

<@---------- WMException End ----------@>  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 830, in algorithm
    self.submitJobs(jobsToSubmit=jobsToSubmit)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 757, in submitJobs
    successList, failList = self.bossAir.submit(jobs=jobList)
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 396, in submit
    raise BossAirException(msg)

2024-06-12 16:05:04,340:140414616356608:INFO:Harness:>>>Terminating worker threads
2024-06-12 16:05:04,340:140414616356608:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7fb4e1174520> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
Unable to locate local daemon
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : submit
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 396
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 382, in submit
    localSuccess, localFailure = pluginInst.submit(jobs=jobsToSubmit,

  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 157, in submit
    schedd = htcondor.Schedd()

  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)

<@---------- WMException End ----------@>
Backtrace:
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 209, in __call__
    raise ex
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 830, in algorithm
    self.submitJobs(jobsToSubmit=jobsToSubmit)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 757, in submitJobs
    successList, failList = self.bossAir.submit(jobs=jobList)
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 396, in submit
    raise BossAirException(msg)

2024-06-12 16:05:04,361:140414616356608:INFO:BaseWorkerThread:Worker thread <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7fb4e1174520> terminated
2024-06-13 12:31:32,938:140559179552576:INFO:Harness:>>>Starting: JobSubmitter<<<

Failure 2 (On 8 October 2024)

2024-10-08 16:35:18,780:139761333335808:INFO:JobSubmitterPoller:Have 5 packages to submit.
2024-10-08 16:35:18,780:139761333335808:INFO:JobSubmitterPoller:Have 913 jobs to submit.
2024-10-08 16:35:18,780:139761333335808:INFO:JobSubmitterPoller:Done assigning site locations.
2024-10-08 16:35:19,109:139761333335808:ERROR:BossAirAPI:Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
Unable to locate local daemon
2024-10-08 16:35:19,165:139761333335808:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7f1ccc4165e0> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
Unable to locate local daemon
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : submit
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 396
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 382, in submit
    localSuccess, localFailure = pluginInst.submit(jobs=jobsToSubmit,

  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 157, in submit
    schedd = htcondor.Schedd()

  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)

<@---------- WMException End ----------@>  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 830, in algorithm
    self.submitJobs(jobsToSubmit=jobsToSubmit)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 757, in submitJobs
    successList, failList = self.bossAir.submit(jobs=jobList)
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 396, in submit
    raise BossAirException(msg)

2024-10-08 16:35:19,165:139761333335808:INFO:Harness:>>>Terminating worker threads
2024-10-08 16:35:19,165:139761333335808:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7f1ccc4165e0> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
Unable to locate local daemon
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : submit
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 396
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 382, in submit
    localSuccess, localFailure = pluginInst.submit(jobs=jobsToSubmit,

  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 157, in submit
    schedd = htcondor.Schedd()

  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)

<@---------- WMException End ----------@>
Backtrace:
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 209, in __call__
    raise ex
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 830, in algorithm
    self.submitJobs(jobsToSubmit=jobsToSubmit)
  File "/usr/local/lib/python3.8/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 757, in submitJobs
    successList, failList = self.bossAir.submit(jobs=jobList)
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 396, in submit
    raise BossAirException(msg)

2024-10-08 16:35:19,407:139761333335808:INFO:BaseWorkerThread:Worker thread <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7f1ccc4165e0> terminated
2024-10-09 22:24:16,210:140040839120704:INFO:Harness:>>>Starting: JobSubmitter<<<

vocms0282 (JobStatusLite)

Failure 1 (on 8 October 2024)

2024-10-08 16:21:41,029:140714201671424:INFO:StatusPoller:Running job status poller algorithm...
2024-10-08 16:21:44,311:140714201671424:INFO:BossAirAPI:About to start building running jobs
2024-10-08 16:32:20,290:140714201671424:INFO:BossAirAPI:About to look for 74052 loadedJobs.
2024-10-08 16:32:20,638:140714201671424:ERROR:BossAirAPI:Unhandled exception while tracking jobs for plugin SimpleCondorPlugin!
Unable to locate local daemon
2024-10-08 16:32:20,721:140714201671424:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMCore.BossAir.StatusPoller.StatusPoller object at 0x7ffaa7a168e0> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while tracking jobs for plugin SimpleCondorPlugin!
Unable to locate local daemon
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : track
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 486
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 473, in track
    localRunning, localChanges, localCompletes = pluginInst.track(jobs=jobsToTrack[plugin])

  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 208, in track
    schedd = htcondor.Schedd()

  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)

<@---------- WMException End ----------@>  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/StatusPoller.py", line 70, in algorithm
    self.checkStatus()
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/StatusPoller.py", line 94, in checkStatus
    runningJobs = self.bossAir.track()
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 486, in track
    raise BossAirException(msg)

2024-10-08 16:32:20,721:140714201671424:INFO:Harness:>>>Terminating worker threads
2024-10-08 16:32:20,721:140714201671424:ERROR:BaseWorkerThread:Error in event loop (2): <WMCore.BossAir.StatusPoller.StatusPoller object at 0x7ffaa7a168e0> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while tracking jobs for plugin SimpleCondorPlugin!
Unable to locate local daemon
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : track
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 486
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 473, in track
    localRunning, localChanges, localCompletes = pluginInst.track(jobs=jobsToTrack[plugin])

  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 208, in track
    schedd = htcondor.Schedd()

  File "/usr/local/lib/python3.8/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)

<@---------- WMException End ----------@>
Backtrace:
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 209, in __call__
    raise ex
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/StatusPoller.py", line 70, in algorithm
    self.checkStatus()
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/StatusPoller.py", line 94, in checkStatus
    runningJobs = self.bossAir.track()
  File "/usr/local/lib/python3.8/site-packages/WMCore/BossAir/BossAirAPI.py", line 486, in track
    raise BossAirException(msg)

2024-10-08 16:32:20,912:140714201671424:INFO:BaseWorkerThread:Worker thread <WMCore.BossAir.StatusPoller.StatusPoller object at 0x7ffaa7a168e0> terminated
2024-10-09 22:19:26,079:140094430082880:INFO:Harness:>>>Starting: JobStatusLite<<<

FYI @amaltaro

@amaltaro
Contributor Author

Thank you for updating this ticket, Ahmed.

For JobSubmitter, I think we could wrap this call
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L757
in a try/except and raise a WMException (or another exception) when we hit the error reported above.

Then, in the algorithm method, we need to catch it, roll back any transaction that might be ongoing, log the relevant error message, and quit the current component cycle without crashing the component.

JobStatusLite likely needs something similar.
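
A rough sketch of that flow, under stated assumptions. PollerSketch, CondorDaemonUnavailable and the transaction object are hypothetical, the two exception classes are simplified stand-ins for the WMCore ones, and recognising the error by its message text is only one possible way to detect it.

```python
import logging


class WMException(Exception):
    """Simplified stand-in for the WMCore base exception."""


class BossAirException(WMException):
    """Simplified stand-in for the exception seen in the tracebacks."""


class CondorDaemonUnavailable(WMException):
    """Hypothetical exception: the local schedd could not be located."""


DAEMON_ERROR = "Unable to locate local daemon"


class PollerSketch:
    """Hypothetical, heavily reduced poller; not the real JobSubmitterPoller."""

    def __init__(self, bossAir, transaction):
        self.bossAir = bossAir
        self.transaction = transaction

    def submitJobs(self, jobsToSubmit):
        try:
            return self.bossAir.submit(jobs=jobsToSubmit)
        except BossAirException as ex:
            if DAEMON_ERROR in str(ex):
                # Translate into a more specific, recoverable exception
                raise CondorDaemonUnavailable(str(ex)) from ex
            raise

    def algorithm(self, jobsToSubmit):
        """Return True if the cycle completed, False if it was skipped."""
        try:
            self.submitJobs(jobsToSubmit)
        except CondorDaemonUnavailable as ex:
            # Recoverable: roll back, log, and wait for the next cycle
            self.transaction.rollback()
            logging.error("Skipping this cycle, schedd unavailable: %s", str(ex))
            return False
        except WMException:
            # Anything else keeps the current behaviour: roll back and re-raise
            self.transaction.rollback()
            raise
        return True
```

The more specific except clause has to come first, since the hypothetical CondorDaemonUnavailable is itself a WMException.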

@hassan11196 if you feel like trying to provide a PR for this, please let us know and we can help you out as needed.

@hassan11196
Member

Hi @amaltaro
Sure, I can provide a PR. Here is my understanding of the issue; please correct me if I am wrong.

The submitJobs call is already wrapped in a try-except clause where the transaction is rolled back:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L830-L843

The BossAirException raised by self.bossAir.submit(jobs=jobList) should be caught by the except WMException: clause; the transaction is then rolled back and the BossAirException is raised again.

You suggest wrapping self.bossAir.submit(jobs=jobList) in a try-except clause and raising a descriptive WMException, which is then caught by the try-except clause already present in algorithm?

Thanks

@amaltaro
Contributor Author

That is correct.

An alternative would be a simple string match in the generic exception block of the algorithm method:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobSubmitter/JobSubmitterPoller.py#L835

If the error is about the condor daemon, we do not re-raise that exception upstream.
I am slightly more inclined towards the former implementation, though, since it gives more meaningful exception handling.
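
For comparison, the string-match alternative could look roughly like this. It is only a sketch: run_cycle, submit_jobs and rollback are hypothetical names standing in for the body of the algorithm method, and the matched text is taken from the logs above.

```python
import logging

DAEMON_ERROR = "Unable to locate local daemon"


def run_cycle(submit_jobs, rollback):
    """
    Run one component cycle.

    submit_jobs and rollback are hypothetical callables standing in for the
    submission step and the transaction rollback.
    Returns True if the cycle completed, False if it was skipped.
    """
    try:
        submit_jobs()
    except Exception as ex:
        rollback()
        if DAEMON_ERROR in str(ex):
            # Condor daemon problem: log it and do not re-raise upstream
            logging.error("Condor daemon not reachable, skipping cycle: %s", str(ex))
            return False
        # Any other error is still propagated to the caller
        raise
    return True
```

This needs no new exception class, but it ties the handling to the exact wording of the error message.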

Projects
Status: In Progress
3 participants