Skip to content
This repository has been archived by the owner on Mar 19, 2022. It is now read-only.

fix: try to rescue Net::SSH::Disconnect #536

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

azrle
Copy link

@azrle azrle commented Dec 23, 2018

Hi @matschaffer,

I sometimes encounter Net::SSH::Disconnect when trying to run chef-solo on slow machines.

/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/packet_stream.rb:89:in `next_packet': connection closed by remote host (Net::SSH::Disconnect)
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:193:in `block in poll_message'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:188:in `loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:188:in `poll_message'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:474:in `dispatch_incoming_packets'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:225:in `preprocess'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:206:in `process'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `block in loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/channel.rb:269:in `wait'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/knife-solo/ssh_connection.rb:81:in `run_command'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/knife-solo/ssh_command.rb:300:in `run_command'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/knife-solo/ssh_command.rb:277:in `stream_command'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:338:in `cook'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:108:in `block in run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:233:in `time'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/knife-solo-0.7.0.pre3/lib/chef/knife/solo_cook.rb:88:in `run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/knife.rb:443:in `block in run_with_pretty_exceptions'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/local_mode.rb:44:in `with_server_connectivity'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/knife.rb:442:in `run_with_pretty_exceptions'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/knife.rb:219:in `run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/lib/chef/application/knife.rb:156:in `run'
        from /opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.18.31/bin/knife:25:in `<top (required)>'
        from /opt/chef/embedded/bin/knife:22:in `load'
        from /opt/chef/embedded/bin/knife:22:in `<main>'

After some investigation, I found that the server indeed closed the connection since it has been idle for a while. The reason why the connection becomes idle is simply that the machine (chef-solo runs on) is slow and sometimes it takes time to do works. Hence, I try to set the ssh_keepalive and ssh_keepalive_interval to a reasonable value in order to send keepalive message when the connection is idle. However it does not work as expected.

As the investigation goes, I found that keepalive and keepalive_interval only work for the case that you are running a slow command on the remote or you do Net::SSH::Session::loop because net-ssh is based on event and so is the keepalive of net-ssh. I am not sure if there is any way to keep the connection alive and reuse the connection efficiently.

However, it may be good to rescue from Net::SSH::Disconnect and retry the command. The caveat is that a command might be executed on the remote more than once.

Not sure are these issues are related to this one. #272 #332

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 89.995% when pulling 9f1af2a on azrle:fix/net_ssh_disconnect into ce8c223 on matschaffer:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 89.995% when pulling 9f1af2a on azrle:fix/net_ssh_disconnect into ce8c223 on matschaffer:master.

@matschaffer
Copy link
Owner

Have you checked if chef exits on disconnect? If not, seems like retry could be problematic

@azrle
Copy link
Author

azrle commented Dec 23, 2018

In my case, it does not exist.
The run_command tried to run chef on a closed SSH session and generated Net::SSH::Disconnect.
Hence, the chef did not even start.

Checking Chef version...
Resolving cookbook dependencies...
Uploading the kitchen...
Generating solo config...
ERROR: Net::SSH::Disconnect: connection closed by remote hostRunning Chef: sudo chef-solo -c /var/tmp/chef-solo_xxx/solo.rb -j /var/tmp/chef-solo_xxx/dna.jso
n -W

However, risk exists as Net::SSH::Disconnect may happen under other conditions.

@matschaffer
Copy link
Owner

That's surprising to see that part of the run take so long.

Anyway if it helps your situation I'd be fine with a configurable retry. But seems risky as a default.

@azrle
Copy link
Author

azrle commented Dec 23, 2018

It also surprises me too. I am going to conduct another investigation why it takes some much time.
Maybe resolving dependencies part and rsync part.

Considering this error should not be a common case, and this fix is not a proper fix,
I will try to add an option to support this workaround. Does it sound good to you?

@matschaffer
Copy link
Owner

matschaffer commented Dec 24, 2018 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants