fix: fix pipeline congestion shortcut synchronization #1740

wyfo · 2025-01-24T08:20:22Z

#1627 introduced a congestion shortcut mechanism, but this one was not correctly synchronized. There was indeed situations experienced by users in which congested flag was set and never reset, which implies a drop of all successive messages (the publisher becomes kind of dead).

The congestion flag is in fact set after the deadline of a message is reached, while it is unset when batches were refilled, only with relaxed atomic operations. Also, after the deadline is reached, there is no further check of the queue.

The most obvious synchronization flow here is that between the instant where the thread is waken up because the deadline has been reached, and the one where the congested flag is set, it is possible that the tx task is unblocked and all the batches are sent and refilled. The congested flags would been set after, so there would be no batch to refill to unset it back. This flow seems hard to achieve when there are many batches in the queue, but it is still theoretically possible. And when fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before setting the congested flag; this is precisely the case where the bug was observed by the way, so it may indicate the described flow is the reason behind, but it's not sure.

The proposed fix adds a additional synchronization step after setting the congested flag: we retry to push the message, this time with an already expired deadline. If the batches were refilled, the message should be able to be sent and the congestion flag will be reset.

If batches were in fact refilled, but the message was fragmented and still ends by being dropped, it still means batches have been pushed and will be refilled later, still unsetting the congested flag.

eclipse-zenoh#1627 introduced a congestion shortcut mechanism, but this one was not correctly synchronized. There was indeed situations experienced by users in which congested flag was set and never reset, which implies a drop of all successive messages (the publisher becomes kind of dead). The congestion flag is in fact set after the deadline of a message is reached, while it is unset when batches were refilled, only with relaxed atomic operations. Also, after the deadline is reached, there is no further check of the queue. The most obvious synchronization flow here is that between the instant where the thread is waken up because the deadline has been reached, and the one where the congested flag is set, it is possible that the tx task is unblocked and all the batches are sent and refilled. The congested flags would been set after, so there would be no batch to refill to unset it back. This flow seems hard to achieve when there are many batches in the queue, but it is still theoretically possible. And when fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before setting the congested flag; this is precisely the case where the bug was observed by the way, so it may indicate the described flow is the reason behind, but it's not sure. The proposed fix adds a additional synchronization step after setting the congested flag: we retry to push the message, this time with an already expired deadline. If the batches were refilled, the message should be able to be sent and the congestion flag will be reset. If batches were in fact refilled, but the message was fragmented and still ends by being dropped, it still means batches have been pushed and will be refilled later, still unsetting the congested flag.

wyfo · 2025-01-24T08:21:05Z

Supersede #1737

io/zenoh-transport/src/common/pipeline.rs

oteffahi

The bug reported in #1738 is solved by the edge-case explained in the comments of this PR.
LGTM.

wyfo · 2025-01-24T11:58:59Z

io/zenoh-transport/src/common/pipeline.rs

@@ -806,9 +806,31 @@ impl TransmissionPipelineProducer {
        let mut deadline = Deadline::new(wait_time, max_wait_time);
        // Lock the channel. We are the only one that will be writing on it.
        let mut queue = zlock!(self.stage_in[idx]);
-        let sent = queue.push_network_message(&mut msg, priority, &mut deadline)?;
+        // Check again for congestion in case it happens when blocking on the mutex.
+        if self.status.is_congested(priority) {


@Mallets are you ok with this addition?

wyfo added the bug Something isn't working label Jan 24, 2025

wyfo requested review from Mallets, oteffahi and yellowhatter January 24, 2025 08:20

oteffahi reviewed Jan 24, 2025

View reviewed changes

io/zenoh-transport/src/common/pipeline.rs Outdated Show resolved Hide resolved

Fix typo

d5d45ec

oteffahi approved these changes Jan 24, 2025

View reviewed changes

fix: typo

47955b4

Mallets approved these changes Jan 24, 2025

View reviewed changes

Mallets enabled auto-merge (squash) January 24, 2025 11:31

wyfo commented Jan 24, 2025

View reviewed changes

Mallets disabled auto-merge January 24, 2025 13:06

Mallets merged commit eb3a7a4 into eclipse-zenoh:main Jan 24, 2025
12 of 14 checks passed

wyfo mentioned this pull request Jan 24, 2025

Publisher can get stuck dropping all messages with CongestionControl::Drop #1738

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fix pipeline congestion shortcut synchronization #1740

fix: fix pipeline congestion shortcut synchronization #1740

wyfo commented Jan 24, 2025

wyfo commented Jan 24, 2025

oteffahi left a comment

wyfo Jan 24, 2025

Mallets Jan 24, 2025

fix: fix pipeline congestion shortcut synchronization #1740

fix: fix pipeline congestion shortcut synchronization #1740

Conversation

wyfo commented Jan 24, 2025

wyfo commented Jan 24, 2025

oteffahi left a comment

Choose a reason for hiding this comment

wyfo Jan 24, 2025

Choose a reason for hiding this comment

Mallets Jan 24, 2025

Choose a reason for hiding this comment