fix: fix pipeline congestion shortcut synchronization #1740
Conversation
eclipse-zenoh#1627 introduced a congestion shortcut mechanism, but it was not correctly synchronized. Users experienced situations in which the congested flag was set and never reset, causing all subsequent messages to be dropped (the publisher becomes effectively dead).

The congestion flag is set after a message's deadline is reached, while it is unset when batches are refilled, using only relaxed atomic operations. Moreover, after the deadline is reached, there is no further check of the queue.

The most obvious problematic flow is the following: between the instant the thread is woken up because the deadline has been reached and the instant the congested flag is set, the tx task may be unblocked and all the batches sent and refilled. The congested flag would then be set afterwards, leaving no batch to refill that could unset it. This flow seems hard to hit when there are many batches in the queue, but it is still theoretically possible. Moreover, when a fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before the congested flag is set; this is precisely the case in which the bug was observed, which may indicate the described flow is the cause, though this is not certain.

The proposed fix adds an additional synchronization step after setting the congested flag: we retry to push the message, this time with an already expired deadline. If the batches were refilled, the message should be able to be sent and the congestion flag will be reset. If batches were in fact refilled but the message was fragmented and still ends up being dropped, it still means batches have been pushed and will be refilled later, which will still unset the congested flag.
Supersedes #1737
The bug reported in #1738 is solved by the edge case explained in the comments of this PR.
LGTM.
@@ -806,9 +806,31 @@ impl TransmissionPipelineProducer {
    let mut deadline = Deadline::new(wait_time, max_wait_time);
    // Lock the channel. We are the only one that will be writing on it.
    let mut queue = zlock!(self.stage_in[idx]);
    let sent = queue.push_network_message(&mut msg, priority, &mut deadline)?;
    // Check again for congestion in case it happens when blocking on the mutex.
    if self.status.is_congested(priority) {
@Mallets are you ok with this addition?
Yes