Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix pipeline congestion shortcut synchronization #1740

Conversation

wyfo
Copy link
Contributor

@wyfo wyfo commented Jan 24, 2025

#1627 introduced a congestion shortcut mechanism, but this one was not correctly synchronized. There was indeed situations experienced by users in which congested flag was set and never reset, which implies a drop of all successive messages (the publisher becomes kind of dead).

The congestion flag is in fact set after the deadline of a message is reached, while it is unset when batches were refilled, only with relaxed atomic operations. Also, after the deadline is reached, there is no further check of the queue.

The most obvious synchronization flow here is that between the instant where the thread is waken up because the deadline has been reached, and the one where the congested flag is set, it is possible that the tx task is unblocked and all the batches are sent and refilled. The congested flags would been set after, so there would be no batch to refill to unset it back. This flow seems hard to achieve when there are many batches in the queue, but it is still theoretically possible. And when fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before setting the congested flag; this is precisely the case where the bug was observed by the way, so it may indicate the described flow is the reason behind, but it's not sure.

The proposed fix adds a additional synchronization step after setting the congested flag: we retry to push the message, this time with an already expired deadline. If the batches were refilled, the message should be able to be sent and the congestion flag will be reset.

If batches were in fact refilled, but the message was fragmented and still ends by being dropped, it still means batches have been pushed and will be refilled later, still unsetting the congested flag.

eclipse-zenoh#1627 introduced a congestion shortcut mechanism, but this one was not
correctly synchronized. There was indeed situations experienced by
users in which congested flag was set and never reset, which implies a
drop of all successive messages (the publisher becomes kind of dead).

The congestion flag is in fact set after the deadline of a message is
reached, while it is unset when batches were refilled, only with
relaxed atomic operations. Also, after the deadline is reached, there
is no further check of the queue.

The most obvious synchronization flow here is that between the instant
where the thread is waken up because the deadline has been reached, and
the one where the congested flag is set, it is possible that the tx
task is unblocked and all the batches are sent and refilled. The
congested flags would been set after, so there would be no batch to
refill to unset it back. This flow seems hard to achieve when there are
many batches in the queue, but it is still theoretically possible. And
when fragmentation chain is dropped in the middle, pushing the
ephemeral batch takes more time before setting the congested flag; this
is precisely the case where the bug was observed by the way, so it may
indicate the described flow is the reason behind, but it's not sure.

The proposed fix adds a additional synchronization step after setting
the congested flag: we retry to push the message, this time with an
already expired deadline. If the batches were refilled, the message
should be able to be sent and the congestion flag will be reset.

If batches were in fact refilled, but the message was fragmented and
still ends by being dropped, it still means batches have been pushed
and will be refilled later, still unsetting the congested flag.
@wyfo wyfo added the bug Something isn't working label Jan 24, 2025
@wyfo
Copy link
Contributor Author

wyfo commented Jan 24, 2025

Supersede #1737

Copy link
Contributor

@oteffahi oteffahi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug reported in #1738 is solved by the edge-case explained in the comments of this PR.
LGTM.

@Mallets Mallets enabled auto-merge (squash) January 24, 2025 11:31
@@ -806,9 +806,31 @@ impl TransmissionPipelineProducer {
let mut deadline = Deadline::new(wait_time, max_wait_time);
// Lock the channel. We are the only one that will be writing on it.
let mut queue = zlock!(self.stage_in[idx]);
let sent = queue.push_network_message(&mut msg, priority, &mut deadline)?;
// Check again for congestion in case it happens when blocking on the mutex.
if self.status.is_congested(priority) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mallets are you ok with this addition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@Mallets Mallets disabled auto-merge January 24, 2025 13:06
@Mallets Mallets merged commit eb3a7a4 into eclipse-zenoh:main Jan 24, 2025
12 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants