Redid handling of link down to ensure tear down of old path #583

Open · wants to merge 2 commits into master

Conversation

@Ktmi commented Dec 4, 2024

Closes #517

Summary

Replaces the procedure for handling link down events to ensure that the old_path is torn down before the changes are fully committed to the database.
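
For context, a minimal sketch of the ordering the summary describes. The helper names (send_flow_mods, sync_to_db) are placeholders, not mef_eline's actual API, and this is one plausible ordering rather than the PR's exact code:

```python
def switch_paths_sketch(evc, send_flow_mods, sync_to_db):
    """Hedged sketch: remove the old current_path's flows before the new state
    is fully committed to the database, so a failure in between cannot leave
    stale flows behind that the database no longer tracks."""
    old_path = evc.current_path
    new_path = evc.failover_path

    send_flow_mods("install", new_path)   # bring the new path up first
    send_flow_mods("remove", old_path)    # then tear down the old path's flows
    evc.current_path = new_path
    sync_to_db(evc)                       # persist only after the teardown succeeded
```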

Local Tests

This still needs some tweaking, but the general shape of the solution is here. I will have to develop a test harness that can reliably replicate the old issue.

E2E Tests

As mentioned under Local Tests, this issue wasn't caught by the E2E tests, so a test harness needs to be developed.

@Ktmi requested a review from a team as a code owner on December 4, 2024 16:48
@viniarck (Member) left a comment

@Ktmi I read in the description that you're still working on it, and there are still TODOs in the code. Overall, it's headed in the right direction. Yes, first and foremost we definitely need a local way of reproducing this, either deterministically or with a high likelihood, so we can assess with concrete data how it performs in terms of correctness and failover convergence (control plane and data plane switchover performance). For the e2e tests, we may not need an exact reproduction, but we do need one that closely simulates a double failure, plus other cases such as failing both current_path and failover_path when they're not fully disjoint.

Other than that, let me know when you need a final review. I did a partial review and flagged the things that were fairly obvious; for the other parts I won't assume their status or what happened, since I don't know whether you'll still address them based on what we discussed.

# traditional way
if (
with ExitStack() as exit_stack:
exit_stack.enter_context(self._lock)
Member:

handle_link_down will always be executed single-threaded; self._lock was meant only for the consistency-check execute loop, so I wouldn't recommend mixing it in here. Until we get to the consistency part, I wouldn't touch it at all.

Each evc.lock, though, yes: for those we need to make sure that while they're acquired by this single-threaded execution, no other concurrent threads are performing other side effects. The ExitStack() usage is great; if it's going to be used, it also needs to ensure that deletions are executed atomically. So, please review this.

Member:

In summary: all affected EVCs need to build/map their flows and perform the flow removals and installations with the bulk requests that we discussed in the issue.
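
A hedged sketch of the pattern these two comments describe, assuming each EVC exposes a threading.Lock as evc.lock (evc.lock and ExitStack come from the discussion above; build_removals, build_installs, and send_bulk are hypothetical placeholders, not mef_eline's or flow_manager's real API):

```python
from contextlib import ExitStack

def redeploy_affected_evcs(affected_evcs, build_removals, build_installs, send_bulk):
    """Illustrative only: hold every affected EVC's lock while the old flows are
    removed and the new flows are installed, so no other thread mutates those
    EVCs in between, and both operations go out as bulk requests."""
    with ExitStack() as stack:
        for evc in affected_evcs:
            stack.enter_context(evc.lock)  # a Lock works as a context manager

        # Build/map all flow mods first...
        removals = [mod for evc in affected_evcs for mod in build_removals(evc)]
        installs = [mod for evc in affected_evcs for mod in build_installs(evc)]

        # ...then push them while the locks are still held, so removals and
        # installations are atomic with respect to other EVC operations.
        send_bulk("remove", removals)
        send_bulk("install", installs)
```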

@Ktmi (Author):

Alright, I'll change it to a separate lock. A lock is necessary here if we were to have multiple threads that wanted to hold multiple EVC locks, so I'll add one that is named explicitly for that purpose.

):
evcs_normal.append(evc)
)):
redeploy.append(evc)
Member:

if evc.is_affected_by_link(link) is True
and evc.is_failover_path_affected_by_link(link) is True

The failover_path needs to be cleaned up too. Otherwise this can end up keeping a failover_path that isn't UP. Can you review this conditional and test it?
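
A hedged sketch of the classification this conditional implies; the predicates come from the snippet above, but the bucket names and handling are assumptions for illustration, not the PR's actual code:

```python
def classify_affected_evcs(evcs, link):
    """Illustrative only: an EVC whose failover_path uses the downed link needs
    that failover_path cleaned up (flows removed, path rebuilt) even when its
    current_path stays up, instead of keeping a failover_path that isn't UP."""
    redeploy = []        # current_path uses the link: traffic must move off it
    clean_failover = []  # failover_path uses the link: remove its flows, rebuild it

    for evc in evcs:
        if evc.is_affected_by_link(link):
            redeploy.append(evc)
        if evc.is_failover_path_affected_by_link(link):
            clean_failover.append(evc)
    return redeploy, clean_failover
```

An EVC can land in both lists; that is the double-failure case (current_path and failover_path both hit) that the review asks to cover in tests.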

@Ktmi (Author):

Well, that's what the old logic said to do here. But I do see your point: looking at the original code, it would never tear down the failover path. It would just tear down and then set up the current path, and try to deploy a failover path if one didn't already exist.
