-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Syndic 3006 uses synchronous ReqChannel #64552
Comments
Actually sending results over these persistent req channels sucks even after changing it from ReqChannel to AsyncReqChannel - it causes delays and duplicate results sent to MoMs under moderate stress (we tested with big result payloads but not very frequent). Reverting the code back to how 3003 worked - separate connection established for every return_pub_multi - made our syndics fly again. We tried that on a ~100k host installation and had to revert to 3003 multiple times. I'm talking Salt 3005.1 with #64032 (plus our attempts to fix it as mentioned above) but I don't think 3006 differs much in that respect. |
|
this should affect latest version of the minion since it is using the same approach. @dwoz any idea if there was a bug about this and was fixed ? |
The minion code is not exactly the same as Minion and Syndic classes now have their own _send_req_async() and _send_req_sync() methods but behaves in a dodgy way too. Despite keeping a connection to 4506 open I can see it opens new ones during each job return too, which can even be seen with a simple netstat -an displaying connections in TIME_WAIT state after processing a job. I'm currently trying to make sense of it. |
Hmm, it actually seems that while returning to the master 3005.1 minion closes the existing connection to 4506 and opens a new one. I took the trouble to watch the traffic and it seems that the existing one gets closed and the result is returned through a newly open connection. With that in mind it seems that 3005.1 just keeps reopening connections to 4506, it just doesn't close them until it opens a new one (!?) which makes it keep these connection hanging. That's might be what syndic does too although I had an impression that the syndic kept the same connection open but I could get confused. Unfortunately my 3005.1 syndics now use 3003.5-like _send_req_async() (just with the new salt.channel.client.AsyncReqChannel.factory() instead of salt.transport.client.AsyncReqChannel) and I don't feel like testing the wonky upstream version again. When I first upgraded to 3005.1 I thought that it kept persistent connections to 4506 to speed up sending job results but now it seems more that it's just some hanging connections that don't get closed and likely eat up resources both at the minion and the master. |
The reason of the above behaviour lies in https://github.com/saltstack/salt/blob/master/salt/transport/zeromq.py#L909 connect() and send() methods of RequestClient and connect() and _init_socket() methods of AsyncReqMessageClient https://github.com/saltstack/salt/blob/master/salt/transport/zeromq.py#L551 which closes that previosly hanging connection in _init_socket(). That connect() upon send() was added by another PR of @dwoz's 62d26a3 . Honestly I can't see a point of reusing these channel objects since they need to establish new connections every time anyway. Probably the least the current code could do calling self.close() in salt.transport.zeromq.RequestClient.send() but what's the benefit of reusing these channel objects anyway? |
Actually salt.transport.zeromq.RequestClient was not even meant for closing manually and then reusing - since close() sets self.stream to None and then _init_socket() tries to call close again if hasattr(self, "stream"):, it obviously ends up in an unhandled exception. That makes me wonder more about the purpose of all these changes... It worked good enough in 3003 creating that AsyncReqChannel every time it returned and with statement taking care of closing the connection. I tried to make that send() close the connection but gave up (it ended up with zmq "Operation not supported"). The deeper I get the clearer it gets this code has been bastardised too much recently. It used to be quite reliable in 3003. You can't even close that connection properly because self.message_client.close() destroys too much (calls self.context.term()), so it only works if you mimic what that salt.transport.zeromq.AsyncReqMessageClient._init_socket() function does in that if hasattr() at the very beginning. |
@szjur This should have been resolve in 3006.4. Can you please confirm it is no longer an issue? |
Description
5249eac made a change that syndic uses ReqChannel instead of AsyncReqChannel for returning to MoMs. From the code it seems that AsyncReqChannel is the right one for async operations.
Setup
Salt 3005.1 with #64032 backport applied. Multiple MoMs with multiple syndics.
Steps to Reproduce the behavior
Debugging and testing the code seems to show that the change makes returning events block on sending to masters more than when it uses AsyncReqChannel.
Versions Report
3005.1 with #64032
The text was updated successfully, but these errors were encountered: