sshj-ssh sudo timeout issue #15

Open
eagle-rr opened this issue May 18, 2023 · 15 comments


@eagle-rr

After upgrading from 4.7 to 4.10, the Linux nodes hit the following error after running for 5 minutes:

net.sf.expectit.ExpectIOException: Expect operation fails (timeout: 30000000 ms) for matcher: regexp('~.*\$')
at net.sf.expectit.ExpectImpl.expectIn(ExpectImpl.java:106)
at net.sf.expectit.AbstractExpectImpl.expectIn(AbstractExpectImpl.java:57)
at net.sf.expectit.AbstractExpectImpl.expect(AbstractExpectImpl.java:61)
at com.plugin.sshjplugin.sudo.SudoCommand.runSudoCommand(SudoCommand.java:93)
at com.plugin.sshjplugin.model.SSHJExec.execute(SSHJExec.java:96)
at com.plugin.sshjplugin.SSHJNodeExecutorPlugin.executeCommand(SSHJNodeExecutorPlugin.java:250)
at com.dtolabs.rundeck.core.execution.ExecutionServiceImpl.executeCommand(ExecutionServiceImpl.java:427)
at com.dtolabs.rundeck.core.execution.ExecutionServiceImpl.executeCommand(ExecutionServiceImpl.java:390)
at com.dtolabs.rundeck.core.execution.workflow.steps.node.impl.ExecNodeStepExecutor.executeNodeStep(ExecNodeStepExecutor.java:56)
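
For reference, the matcher named in the error, regexp('~.*\$'), looks for a `~` followed somewhere later by a literal `$`. Below is a minimal sketch of what that pattern accepts. It uses plain `java.util.regex` with find-style matching rather than the plugin's ExpectIt setup, and the sample prompt strings are assumptions for illustration, not output captured from the affected nodes. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: reproduces the regex text from the stack trace with java.util.regex.
// This is not the plugin's ExpectIt matcher, and the sample prompts are hypothetical.
class PromptPatternSketch {
    // Same pattern text as in the error: '~' followed eventually by a literal '$'.
    static final Pattern PROMPT = Pattern.compile("~.*\\$");

    // find() succeeds if the pattern occurs anywhere in the buffered output.
    static boolean seesPrompt(String output) {
        return PROMPT.matcher(output).find();
    }

    public static void main(String[] args) {
        // A typical non-root bash prompt contains both '~' and '$'.
        System.out.println(seesPrompt("[user@host ~]$ "));
        // A typical root prompt ends in '#', so there is no '$' to find.
        System.out.println(seesPrompt("[root@host ~]# "));
        // A sudo password prompt contains neither character.
        System.out.println(seesPrompt("[sudo] password for user: "));
    }
}
```

If the remote side never prints text containing that pattern, the expect call has nothing to match and waits until its timeout expires.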

My related project settings are:

project.keep-alive-interval=5
...
project.retry-counter=3
project.retry-enable=true
project.ssh-authentication=password
project.ssh-command-timeout=0
project.ssh-connect-timeout=0
project.ssh-keypath=/var/lib/rundeck/.ssh/id_rsa
project.ssh.user=${option.username}
project.sudo-command-enabled=true
...
service.FileCopier.default.provider=sshj-scp
service.NodeExecutor.default.provider=sshj-ssh

The commands are run via sudo.

I can reliably reproduce this error by setting up a job that runs the following inline script on remote servers:

#!/bin/bash
my_start_date=`date`
my_max=10000
my_curr_try=0
my_sleep=480

while [ $my_curr_try -lt $my_max ]
do
  my_curr_date=`date`
  (( my_curr_try = my_curr_try + 1 ))
  echo "Start: ${my_start_date} Curr: ${my_curr_date} Try: ${my_curr_try} of ${my_max}"
  echo "Sleeping ${my_sleep}"
  sleep ${my_sleep}
done

Using invocation string: sudo su - root
File Extension: .sh

Rundeck release 4.10.2 (sshj plugin version 0.1.4)

@rpo-fr

rpo-fr commented May 23, 2023

Hi,
I think I have the same problem you describe.

@alekunlp

This is still an issue even after upgrading from 4.70 to 4.15.

@MegaDrive68k

Hi folks,

I tried to reproduce this issue without success. Here are the config and steps from my latest attempt, so you can compare them with your environment:

  1. Rundeck 4.16 war-based installation (SSHJ 0.1.8), out of the box, without any code modification.

  2. A new fresh project with the following config:

#Wed Sep 13 17:32:09 CLST 2023
#edit below
project.disable.executions=false
project.disable.schedule=false
project.execution.history.cleanup.batch=500
project.execution.history.cleanup.enabled=false
project.execution.history.cleanup.retention.days=60
project.execution.history.cleanup.retention.minimum=50
project.execution.history.cleanup.schedule=0 0 0 1/1 * ? *
project.jobs.gui.groupExpandLevel=1
project.later.executions.disable.value=0
project.later.executions.disable=false
project.later.executions.enable.value=
project.later.executions.enable=false
project.later.schedule.disable.value=
project.later.schedule.disable=false
project.later.schedule.enable.value=
project.later.schedule.enable=false
project.name=ProjectEXAMPLE
project.nodeCache.enabled=true
project.nodeCache.firstLoadSynch=true
project.output.allowUnsanitized=false
project.retry-counter=3
project.ssh-authentication=privateKey
resources.source.1.type=local
resources.source.2.config.file=/home/user/Programs/rundeck/model_sources/centos_8_default.xml
resources.source.2.config.format=resourcexml
resources.source.2.config.requireFileExists=true
resources.source.2.config.writeable=true
resources.source.2.type=file
service.FileCopier.default.provider=sshj-scp
service.NodeExecutor.default.provider=sshj-ssh

As you can see, I'm using the private key method to access the remote node. I also tried the filesystem private key, as in the original ticket's config:

project.ssh-keypath=/var/lib/rundeck/.ssh/id_rsa

But this doesn't work on SSHJ 0.1.8 (a known bug, fixed in version 0.1.9).

  3. A remote node (Rocky Linux 8). This is the model source:
<?xml version="1.0" encoding="UTF-8"?>

<project>
  <node name="node00" description="Node 00" tags="" hostname="192.168.56.10" osArch="amd64" osFamily="unix" osName="Linux" osVersion="4.18.0-372.26.1.el8_6.x86_64" username="vagrant" ssh-key-storage-path="keys/rundeck"/>
</project>

As you can see, the private key is defined in the ssh-key-storage-path parameter.

  4. This is the job (pointing to the remote node). It uses @eagle-rr's sudo su - root as the script interpreter:
- defaultTab: nodes
  description: Just an example.
  executionEnabled: true
  id: 5e2a904e-1048-4c21-9595-c9a1b03a6999
  loglevel: INFO
  name: HelloWorld
  nodeFilterEditable: false
  nodefilters:
    dispatch:
      excludePrecedence: true
      keepgoing: false
      rankOrder: ascending
      successOnEmptyNodeFilter: false
      threadcount: '1'
    filter: 'name: node00 '
  nodesSelectedByDefault: true
  plugins:
    ExecutionLifecycle: null
  scheduleEnabled: true
  sequence:
    commands:
    - fileExtension: .sh
      interpreterArgsQuoted: false
      script: |-
        #!/bin/bash
        my_start_date=`date`
        my_max=10000
        my_curr_try=0
        my_sleep=480

        while [ $my_curr_try -lt $my_max ]
        do
          my_curr_date=`date`
          (( my_curr_try = my_curr_try + 1 ))
          echo "Start: ${my_start_date} Curr: ${my_curr_date} Try: ${my_curr_try} of ${my_max}"
          echo "Sleeping ${my_sleep}"
          sleep ${my_sleep}
        done
      scriptInterpreter: sudo su - root
    keepgoing: false
    strategy: node-first
  uuid: 5e2a904e-1048-4c21-9595-c9a1b03a6999

Screenshot_2023-09-13_17-36-52

Also, I tried a simple job to make sure that sudo is working:

- defaultTab: nodes
  description: ''
  executionEnabled: true
  id: 8384ac4b-1df3-43e2-9e0c-fb68f53b8763
  loglevel: INFO
  name: TestSUDO
  nodeFilterEditable: false
  nodefilters:
    dispatch:
      excludePrecedence: true
      keepgoing: false
      rankOrder: ascending
      successOnEmptyNodeFilter: false
      threadcount: '1'
    filter: 'name: node00 '
  nodesSelectedByDefault: true
  plugins:
    ExecutionLifecycle: null
  scheduleEnabled: true
  sequence:
    commands:
    - fileExtension: .sh
      interpreterArgsQuoted: false
      script: whoami
      scriptInterpreter: sudo su - root
    keepgoing: false
    strategy: node-first
  uuid: 8384ac4b-1df3-43e2-9e0c-fb68f53b8763

Screenshot_2023-09-13_17-47-17

Ok, this works on my env but I'm curious about what things are missing here.

  1. Could you try using a fresh project/job like my example?
  2. Could you clarify what elements are different in your SSH server and project configuration?

I'm pretty sure that I'm missing something.

Thanks!

@eagle-rr
Author

In our scenario, we are not allowed to use SSH keys or passwordless sudo. I don't see the following in your node definition:

ssh-password-option: option.sshPassword
sudo-command-enabled: true
sudo-password-option: option.sshPassword

Are you using passwordless sudo access as well?

I don't see any options in your job prompting for username/password (unless I am overlooking it).

I can try again setting up a new project and job; but we cannot use keys or password-less sudo.

@eagle-rr
Author

Our /etc/sudoers contains the following:

## Command Aliases
## These are groups of related commands...


Defaults    requiretty
Defaults   !visiblepw
Defaults    env_reset

which is why the tty is so important.

@eagle-rr
Author

Also, our /etc/ssh/sshd_config has the following settings:

ClientAliveInterval 300
ClientAliveCountMax 0

which is why the keepalive is important.
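
To spell out the timing: the test script sleeps for 480 seconds without producing output, which is longer than the 300-second ClientAliveInterval, while a client keep-alive every 5 seconds would keep traffic flowing. The sketch below is a deliberately simplified model of that arithmetic, not sshd's actual logic. It assumes the server drops a session once it has seen no traffic for the configured interval, which is how these two settings are being used here; the exact effect of ClientAliveCountMax 0 depends on the OpenSSH version. The class and method names are hypothetical.

```java
// Simplified model, not sshd's real implementation: assumes the server drops a session
// once it has seen no traffic for idleLimitSeconds. Real behavior of
// ClientAliveInterval/ClientAliveCountMax depends on the OpenSSH version.
class IdleCutoffSketch {
    // keepAliveSeconds <= 0 means the client sends no keep-alives at all.
    static boolean survives(int idleLimitSeconds, int silentSeconds, int keepAliveSeconds) {
        // Longest gap without traffic: the keep-alive interval if keep-alives are enabled,
        // otherwise the whole silent period of the command.
        int longestGap = keepAliveSeconds > 0
                ? Math.min(keepAliveSeconds, silentSeconds)
                : silentSeconds;
        return longestGap < idleLimitSeconds;
    }

    public static void main(String[] args) {
        // Values from this issue: ClientAliveInterval 300, sleep 480, keep-alive-interval 5.
        System.out.println("with keep-alive:    " + survives(300, 480, 5));
        System.out.println("without keep-alive: " + survives(300, 480, 0));
    }
}
```

Under this model the session only outlives the sleep when keep-alives are actually being sent.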

@eagle-rr
Author

I created a "sshtest" project with:

project.description=
project.disable.executions=false
project.disable.schedule=false
project.execution.history.cleanup.batch=500
project.execution.history.cleanup.enabled=false
project.execution.history.cleanup.retention.days=60
project.execution.history.cleanup.retention.minimum=50
project.execution.history.cleanup.schedule=0 0 0 1/1 * ? *
project.jobs.gui.groupExpandLevel=1
project.keep-alive-interval=5
project.label=
project.later.executions.disable=false
project.later.executions.enable=false
project.later.schedule.disable=false
project.later.schedule.enable=false
project.name=sshtest
project.nodeCache.enabled=true
project.nodeCache.firstLoadSynch=true
project.output.allowUnsanitized=false
project.retry-counter=3
project.ssh-authentication=password
project.ssh-keypath=/var/lib/rundeck/.ssh/id_rsa
resources.source.1.type=local
resources.source.2.config.file=/u01/rundeck_nodes.json
resources.source.2.config.format=resourcejson
resources.source.2.type=file
service.FileCopier.default.provider=sshj-scp
service.NodeExecutor.default.provider=sshj-ssh

My test node is basically:

[
  {
    "nodename": "testserveer1.mytest.com",
    "type": "Node",
    "tags": [

    ],
    "os_version": "Oracle Linux Server 7.9",
    "os": "Linux",
    "os_name": "Linux",
    "hostname": "testserveer1.mytest.com",
    "osFamily": "unix",
    "sudo-command-enabled": "true",
    "sudo-password-option": "option.sshPassword",
    "username": "${option.username}",
    "password-option": "option.sshPassword",
    "ssh-password-option": "option.sshPassword",
    "ssh-command-timeout": "0",
    "ssh-connect-timeout": "0"
  }
]

@eagle-rr
Author

My test job is:

<joblist>
  <job>
    <context>
      <options preserveOrder='true'>
        <option name='username'>
          <label>username</label>
        </option>
        <option name='sshPassword' secure='true'>
          <label>sshPassword</label>
        </option>
      </options>
    </context>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <dispatch>
      <excludePrecedence>true</excludePrecedence>
      <keepgoing>false</keepgoing>
      <rankOrder>ascending</rankOrder>
      <successOnEmptyNodeFilter>false</successOnEmptyNodeFilter>
      <threadcount>100</threadcount>
    </dispatch>
    <executionEnabled>true</executionEnabled>
    <id>5d2c406c-1cca-4c2a-9e51-f2e6f4e6ea35</id>
    <loglevel>INFO</loglevel>
    <name>test job</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <nodefilters>
      <filter>os_name: Linux</filter>
    </nodefilters>
    <nodesSelectedByDefault>true</nodesSelectedByDefault>
    <plugins />
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='false' strategy='node-first'>
      <command>
        <description>date</description>
        <exec>date</exec>
      </command>
      <command>
        <description>Sleep Test</description>
        <fileExtension>.sh</fileExtension>
        <script><![CDATA[#!/bin/bash
my_start_date=`date`
my_max=10000
my_curr_try=0
my_sleep=480

echo "Starting the script...."

while [ $my_curr_try -lt $my_max ]
do
  my_curr_date=`date`
  (( my_curr_try = my_curr_try + 1 ))
  echo "Start: ${my_start_date} Curr: ${my_curr_date} Try: ${my_curr_try} of ${my_max}"
  echo "Sleeping ${my_sleep}"
  sleep ${my_sleep}
done
]]></script>
        <scriptargs />
        <scriptinterpreter>sudo su - root</scriptinterpreter>
      </command>
    </sequence>
    <uuid>5d2c406c-1cca-4c2a-9e51-f2e6f4e6ea35</uuid>
  </job>
</joblist>

@eagle-rr
Author

Sadly, even with a new project and job, sudo fails until I add

session.allocateDefaultPTY();

into SSHJExec.java and build a new plugin file.

I've yet to figure out how to get KeepAliveRunners to show up after the "sleep" command with the new version of the plugin.

@MegaDrive68k

Thanks for the complete information @eagle-rr!

I reproduced the issue partially:

  1. I now have an SSH server that allows only password authentication, with no passwordless sudo configured. I tested it manually first.

  2. I'm using this model source file:

node00:
  ssh-password-option: option.sshPassword
  nodename: node00
  hostname: 192.168.56.10
  osFamily: unix
  sudo-password-option: option.sudoPassword
  description: Rocky8
  ssh-authentication: password
  sudo-command-enabled: 'true'
  tags: centos
  username: vagrant

And this simple job definition "just for testing":

- defaultTab: nodes
  description: ''
  executionEnabled: true
  id: 9534837e-353a-4280-b5b9-22992a705933
  loglevel: INFO
  name: SingleCommand
  nodeFilterEditable: false
  nodefilters:
    dispatch:
      excludePrecedence: true
      keepgoing: false
      rankOrder: ascending
      successOnEmptyNodeFilter: false
      threadcount: '1'
    filter: 'name: node00 '
  nodesSelectedByDefault: true
  options:
  - name: sshPassword
    secure: true
    storagePath: keys/passwd
  - name: sudoPassword
    secure: true
    storagePath: keys/passwd
  plugins:
    ExecutionLifecycle: null
  scheduleEnabled: true
  sequence:
    commands:
    - fileExtension: .sh
      interpreterArgsQuoted: false
      script: whoami
      scriptInterpreter: sudo su - root
    keepgoing: false
    strategy: node-first
  uuid: 9534837e-353a-4280-b5b9-22992a705933

I ran the job successfully using the JSch node executor ("SSH"):

Screenshot_2023-09-13_19-47-28

But with SSHJ, the job keeps executing "forever".

Screenshot_2023-09-13_19-56-20

I'm still looking into it.

@eagle-rr
Author

eagle-rr commented Sep 14, 2023

Yes - the running job should eventually time out. Basically, the sudo prompt's expect is waiting forever due to no tty present.

@MegaDrive68k

Exactly, thanks for providing the context, I've reproduced it.

@eagle-rr
Author

Glad you could reproduce it. My only work-around for this tty part was to add into SSHJExec.java the following "allocateDefaultPTY()" entry:

            session = ssh.startSession();
            session.allocateDefaultPTY();

Granted, it should probably check if "EnablePTY" is enabled in project settings and such before allocating pty.
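
A rough sketch of that kind of guard follows. `PtyCapableSession` is a hypothetical stand-in for the SSHJ session object so that the example compiles without the plugin, and `enablePty` is an assumed flag name; how the plugin would actually read such a setting from the project or node configuration is not shown.

```java
// Sketch only. PtyCapableSession is a hypothetical stand-in for the SSHJ session object,
// and "enablePty" is an assumed setting name, not a confirmed plugin property.
interface PtyCapableSession {
    void allocateDefaultPTY() throws Exception;
}

class ConditionalPtySketch {
    // Allocate a PTY only when the (assumed) setting asks for it; returns whether it did.
    static boolean maybeAllocatePty(PtyCapableSession session, boolean enablePty) throws Exception {
        if (!enablePty) {
            return false;
        }
        session.allocateDefaultPTY();
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Fake session that only reports the request, for demonstration.
        PtyCapableSession fake = () -> System.out.println("PTY requested");
        maybeAllocatePty(fake, true);   // calls allocateDefaultPTY()
        maybeAllocatePty(fake, false);  // leaves the session untouched
    }
}
```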

However, after allocateDefaultPTY() runs successfully and sudo works, the sleep in the command will eventually time out. Comparing the same job in 4.7 and the newer 4.16, I see the following debug entries after the sleep command with v0.1.2 of the sshj plugin:

Sleeping 3600
[net.schmizz.keepalive.KeepAliveRunner] Sending keep-alive since 5 seconds elapsed
[net.schmizz.keepalive.KeepAliveRunner] Received response from server to our keep-alive.
[net.schmizz.sshj.connection.ConnectionImpl] Making global request for `keepalive@openssh.com`
[net.schmizz.keepalive.KeepAliveRunner] Sending keep-alive since 5 seconds elapsed
[net.schmizz.keepalive.KeepAliveRunner] Received response from server to our keep-alive.
[net.schmizz.sshj.connection.ConnectionImpl] Making global request for `keepalive@openssh.com`
[net.schmizz.keepalive.KeepAliveRunner] Sending keep-alive since 5 seconds elapsed
[net.schmizz.keepalive.KeepAliveRunner] Received response from server to our keep-alive.
[net.schmizz.sshj.connection.ConnectionImpl] Making global request for `keepalive@openssh.com`

Comparing versions 0.1.2 and 0.1.9, it looks like the connect method in SSHJBase.java had the following lines changed from:

        SSHJAuthentication authentication = new SSHJAuthentication(sshjConnection, pluginLogger);
        final DefaultConfig config = SSHJDefaultConfig.init().getConfig();
        config.setLoggerFactory(new SSHJPluginLoggerFactory(pluginLogger));
        config.setKeepAliveProvider(KeepAliveProvider.KEEP_ALIVE);

        pluginLogger.log(3, "["+getPluginName()+"] init SSHClient" );
        pluginLogger.log(3, "["+getPluginName()+"] setting timeouts" );

to just:

        SSHJAuthentication authentication = new SSHJAuthentication(sshjConnection, pluginLogger);

        pluginLogger.log(3, "["+getPluginName()+"] setting timeouts" );

Wouldn't the removal of the following line be what prevents the KeepAliveRunner from starting?

config.setKeepAliveProvider(KeepAliveProvider.KEEP_ALIVE);

@eagle-rr
Author

Any further questions or thoughts on this?

@eagle-rr
Author

eagle-rr commented Nov 1, 2023

As we are at that time again (quarterly patching), are there any further questions or thoughts on this?
