
SystemError, Xenon, ssh adaptor, session down #78

Closed
marcvdijk opened this issue Sep 27, 2018 · 1 comment
@marcvdijk
Member

When running an MDStudio workflow that includes lie_md component endpoints that use Cerise, a SystemError is sometimes raised.

System specs: OSX 10.11.6, lie_md running 'standalone' under Python 3.6, Cerise client targeting the Binac cluster GPU queue.

History leading up to the error: the call is made from lie_workflow, running a solvent-ligand MD simulation. At least the first run of the workflow, starting from a clean Cerise specialization docker (created the first time by lie_md), finishes without the SystemError being raised. Only in the second or a later run, using the same Cerise specialization docker that is still running, is the following SystemError raised (from cerise_backend.log):

[2018-09-27 12:18:12.978] [DEBUG] State is now SystemError [cerise.back_end.execution_manager]
[2018-09-27 12:18:12.978] [DEBUG] Deleting job 272923c41d334e93a1efa95360583772 [cerise.back_end.execution_manager]
[2018-09-27 12:18:12.982] [CRITICAL] An internal error occurred when processing job 272923c41d334e93a1efa95360583772 [cerise.back_end.execution_manager]
[2018-09-27 12:18:12.982] [CRITICAL] Traceback (most recent call last):
  File "cerise/../cerise/back_end/execution_manager.py", line 152, in _process_jobs
    self._delete_job(job_id, job)
  File "cerise/../cerise/back_end/execution_manager.py", line 74, in _delete_job
    self._remote_files.delete_job(job_id)
  File "cerise/../cerise/back_end/xenon_remote_files.py", line 219, in delete_job
    self._rm_remote_dir(job_id, '')
  File "cerise/../cerise/back_end/xenon_remote_files.py", line 372, in _rm_remote_dir
    self._x_recursive_delete(x_remote_path)
  File "cerise/../cerise/back_end/xenon_remote_files.py", line 418, in _x_recursive_delete
    if self._x.files().exists(x_remote_path):
jpype._jexception.nl.esciencecenter.xenon.XenonExceptionPyRaisable: nl.esciencecenter.xenon.XenonException: ssh adaptor: session is down
 [cerise.back_end.execution_manager]

The lie_md output leading up to this point:

2018-09-27T14:12:18+0200 Crossbar host is: localhost
2018-09-27T14:12:18+0200 Collecting logs on session "MDWampApi"
2018-09-27T14:12:18+0200 Uploaded schemas for MDWampApi
2018-09-27T14:12:18+0200 MDWampApi: 2 procedures successfully registered
2018-09-27T14:12:37+0200 starting liemd task_id: 4602185954418892
2018-09-27T14:12:37+0200 store output in: /tmp/mdstudio/lie_md/4602185954418892
2018-09-27T14:12:37+0200 Searching for pending jobs in DB
2018-09-27T14:12:37+0200 There are no pending jobs!
2018-09-27T14:12:37+0200 Created a new Cerise-client service
2018-09-27T14:12:37+0200 Creating Cerise-client job
2018-09-27T14:12:37+0200 Only ligand_file defined, perform SOLVENT-LIGAND MD
2018-09-27T14:12:37+0200 CWL worflow is: /Users/mvdijk/Documents/WorkProjects/liestudio-master/lie_md/lie_md/data/solvent_ligand.cwl
2018-09-27T14:12:37+0200 Running the job in a remote machine using docker: mdstudio/cerise-mdstudio-binac:gpu
2018-09-27T14:12:39+0200 Added service to mongoDB
2018-09-27T14:12:39+0200 There was an error: SystemError
2018-09-27T14:12:39+0200 Cerise log stored at: /tmp/mdstudio/lie_md/4602185954418892/cerise.log
2018-09-27T14:12:39+0200 removing job: 272923c41d334e93a1efa95360583772 from Cerise-client
2018-09-27T14:12:39+0200 Extracting output from: /tmp/mdstudio/lie_md/4602185954418892
@LourensVeen
Member

Ah, looks like the SSH connection went down, and Cerise doesn't automatically reconnect. That's a known issue (see #25); it should of course try to reconnect and continue. IIRC, I actually put that functionality into Cerulean, so it should come for free with the switch from Xenon to Cerulean. I'll get to that ASAP.
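For illustration, the reconnect-and-retry behaviour described above could be sketched as a small wrapper around any remote operation. This is a hypothetical sketch, not Cerise, Xenon, or Cerulean API: `ConnectionDown` stands in for the "ssh adaptor: session is down" XenonException, and `reconnect` for whatever re-establishes the session.

```python
import time


class ConnectionDown(Exception):
    """Placeholder for an 'ssh adaptor: session is down' error."""


def with_reconnect(operation, reconnect, retries=3, delay=0.0):
    """Run operation(); if the session drops, reconnect and retry.

    operation: a zero-argument callable doing remote work (hypothetical).
    reconnect: a zero-argument callable that re-establishes the session.
    retries:   how many times to retry before giving up.
    delay:     seconds to wait before each reconnect attempt.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionDown:
            if attempt == retries:
                raise  # give up and propagate the original error
            time.sleep(delay)
            reconnect()


# Usage sketch: an operation that fails once, then succeeds after reconnect.
calls = {"n": 0}

def flaky_delete():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionDown("session is down")
    return "deleted"

result = with_reconnect(flaky_delete, reconnect=lambda: None)
# result == "deleted"; the first failure was retried transparently.
```

The point is only that the failing call (here, the recursive delete in `_x_recursive_delete`) gets a second chance after reconnecting, instead of bubbling up as a fatal SystemError.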
