Redid handling of link down to ensure tear down of old path #583
base: master
Conversation
@Ktmi I read in the description that you're still working on it and that there are TODOs in the code. Overall, it's headed in the right direction. First and foremost, we definitely need a local way of reproducing this, either deterministically or with high likelihood, so we can assess with concrete data how it performs in terms of correctness and failover convergence (control plane and data plane switchover/flows performance). For e2e, we may not need an exactly matching test, but we do need one that closely simulates a double failure, plus other cases such as failing both current_path and failover_path when they're not fully disjoint.
Other than that, let me know when you need a final review. I did a partial review covering the points that were fairly obvious; for the other parts I won't assume their status or what happened, since I don't know whether you'll still address them based on what we discussed.
# traditional way
if (
with ExitStack() as exit_stack:
    exit_stack.enter_context(self._lock)
handle_link_down will always be executed single threaded, and self._lock was meant only for the consistency-check execute loop, so I wouldn't recommend mixing it in here; until we get to the consistency part I wouldn't touch it at all. Each evc.lock, though, yes: for those we need to make sure that while this single-threaded execution holds them, no other concurrent threads are performing other side effects. The ExitStack() usage is great; if it's going to be used, it also needs to ensure that deletions are executed atomically. So, please review this.
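For illustration, here's a minimal sketch of what holding several per-EVC locks through ExitStack could look like. The `Evc` class and the `handle_link_down` shape below are simplified stand-ins for this sketch, not the napp's actual code:

```python
from contextlib import ExitStack
from threading import Lock


class Evc:
    """Illustrative stand-in for an EVC with a per-instance lock."""

    def __init__(self, name):
        self.name = name
        self.lock = Lock()


def handle_link_down(affected_evcs):
    with ExitStack() as exit_stack:
        # Acquire every affected EVC's lock up front; ExitStack releases them
        # all (in reverse order) even if an exception is raised mid-way.
        for evc in affected_evcs:
            exit_stack.enter_context(evc.lock)
        # ... build flows, send bulk removals/installations, update storage ...
        print("holding locks for", [evc.name for evc in affected_evcs])


handle_link_down([Evc("evc_a"), Evc("evc_b")])
```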
In summary: all affected EVCs need to build/map their flows and perform the flow removals and installations with the bulk request we discussed in the issue.
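As a rough illustration of the bulk idea (not the actual flow_manager payload or API), the flows from all affected EVCs could be grouped by switch so that each switch receives a single bulk removal and a single bulk installation request:

```python
from collections import defaultdict


def group_flows_by_switch(flow_mods):
    """flow_mods: iterable of (dpid, flow_dict) pairs collected from all EVCs.

    The flow dict shape here is a placeholder, not the real flow format.
    """
    by_switch = defaultdict(list)
    for dpid, flow in flow_mods:
        by_switch[dpid].append(flow)
    return by_switch


removals = [
    ("00:00:00:00:00:00:00:01", {"match": {"in_port": 1}}),
    ("00:00:00:00:00:00:00:01", {"match": {"in_port": 2}}),
    ("00:00:00:00:00:00:00:02", {"match": {"in_port": 3}}),
]
for dpid, flows in group_flows_by_switch(removals).items():
    # One bulk request per switch instead of one request per EVC/flow.
    print(f"bulk delete on {dpid}: {len(flows)} flows")
```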
Alright, I'll change it to a separate lock. A lock is necessary here if we were to have multiple threads that wanted to hold multiple EVC locks, so I'll add one that is named explicitly for that purpose.
):
    evcs_normal.append(evc)
)):
    redeploy.append(evc)
If evc.is_affected_by_link(link) is True and evc.is_failover_path_affected_by_link(link) is True, then failover_path needs to be cleaned up too. Otherwise this can end up keeping a failover_path that isn't UP. Can you review this conditional and test it?
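A minimal sketch of the classification this comment is asking for: when both the current path and the failover path use the failed link, the EVC must be redeployed and its now-broken failover_path must be torn down as well. The two method names match the ones quoted above; the stub class and the bucket names are purely illustrative:

```python
from dataclasses import dataclass


@dataclass
class StubEvc:
    """Illustrative stand-in exposing the two checks quoted above."""
    current_hit: bool
    failover_hit: bool

    def is_affected_by_link(self, link):
        return self.current_hit

    def is_failover_path_affected_by_link(self, link):
        return self.failover_hit


def classify_evcs(evcs, link):
    swap_to_failover, redeploy, clean_failover = [], [], []
    for evc in evcs:
        current_hit = evc.is_affected_by_link(link)
        failover_hit = evc.is_failover_path_affected_by_link(link)
        if current_hit and failover_hit:
            # Double hit: failover can't be used and must also be cleaned up.
            redeploy.append(evc)
            clean_failover.append(evc)
        elif current_hit:
            swap_to_failover.append(evc)
        elif failover_hit:
            # Only the failover path is broken; rebuild it, current stays up.
            clean_failover.append(evc)
    return swap_to_failover, redeploy, clean_failover


print(classify_evcs([StubEvc(True, True), StubEvc(True, False)], link=None))
```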
Well, that's what the old logic said to do here. But I do see your point. Looking at the original code, it would never tear down the failover path: it would just tear down and then set up the current path, and try to deploy a failover path if one didn't already exist.
Closes #517
Summary
Replaces the procedure for handling link down events to ensure that the old_path is torn down before changes are fully committed to the database.
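As a very rough sketch of the ordering this aims for (all function names below are placeholders for this sketch, not mef_eline's real methods):

```python
def remove_path_flows(path):
    print(f"removing flows for {path}")


def deploy_path(evc, path):
    print(f"deploying {path} for {evc['name']}")


def sync_to_db(evc):
    print(f"persisting {evc['name']} with current_path={evc['current_path']}")


def redeploy_after_link_down(evc, new_path):
    old_path = evc["current_path"]
    remove_path_flows(old_path)   # tear the old path down in the data plane first
    deploy_path(evc, new_path)    # then install the new current path
    evc["current_path"] = new_path
    sync_to_db(evc)               # commit to the database only after teardown


redeploy_after_link_down({"name": "evc_a", "current_path": "path_1"}, "path_2")
```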
Local Tests
This still needs some tweaking, but the general shape of the solution is here. A test harness that can reliably reproduce the old issue still needs to be developed.
E2E Tests
As mentioned under Local Tests, this issue wasn't caught by the E2E tests, so a test harness needs to be developed there as well.