[WFDISC-53] Node starvation is possible if time to establish connection is non-negligible #68

Closed

tadamski
Contributor

@tadamski tadamski commented Feb 25, 2024

https://issues.redhat.com/browse/WFDISC-53

One node in the cluster can starve all the others if connection creation time is substantial. We encountered this scenario in an environment in which LDAP was used as a credential store:

  1. During the first invocation a connection is opened to the node (node1) that was configured in the EJB client. The connection is established, the EJB client channel is created, and information about the applications on node1 is provided to the client. Information about the other nodes in the cluster is also provided to the client.
  2. During subsequent invocations discovery runs as follows: the DiscoveryAttempt for node1 runs immediately (its connection is already established) and adds a result to the discovery queue. The DiscoveryAttempts for all other nodes in the cluster start creating connections to those nodes. The result from node1 is accepted and cancellation of all other DiscoveryAttempts starts. Cancellation happens before a connection to any other node is established, so information about the applications on the other nodes is never provided to the client.

The above scenario repeats for all subsequent invocations, and node1 is selected 100% of the time.

I was able to avoid the starvation by delaying discovery cancellation for a configured amount of time, as seen in this PR. With this delay, node1 still wins the first invocations, but connections are established to all other nodes, and during subsequent invocations they all have an equal chance of providing the result to the queue.
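To make the race easier to reason about, here is a minimal deterministic model of it. It is only a sketch: `DiscoveryRaceSketch`, `Node`, `runRound` and `simulate` are hypothetical names, not code from this PR or from wildfly-discovery, and the connect time and grace period are assumed values in milliseconds. A grace period of 0 stands for the current immediate cancellation, and a positive value stands for the delayed cancellation described above.

```java
import java.util.ArrayList;
import java.util.List;

final class DiscoveryRaceSketch {

    /** Hypothetical client-side view of one cluster node. */
    static final class Node {
        final String name;
        final long connectMillis; // assumed time needed to establish a connection
        boolean connected;

        Node(String name, long connectMillis, boolean connected) {
            this.name = name;
            this.connectMillis = connectMillis;
            this.connected = connected;
        }
    }

    /**
     * Models one discovery round. A connected node reports at t = 0, an
     * unconnected one only after connectMillis. The earliest report wins, and
     * attempts still pending graceMillis after the win are cancelled, which
     * discards their half-open connections.
     */
    static void runRound(List<Node> nodes, long graceMillis) {
        long winTime = Long.MAX_VALUE;
        for (Node n : nodes) {
            winTime = Math.min(winTime, n.connected ? 0 : n.connectMillis);
        }
        for (Node n : nodes) {
            // The connection survives only if it completes before cancellation.
            if (!n.connected && n.connectMillis - winTime <= graceMillis) {
                n.connected = true;
            }
        }
    }

    /** Runs several invocations and returns how many nodes end up connected. */
    static int simulate(int invocations, long connectMillis, long graceMillis) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("node1", connectMillis, true)); // configured node, already connected
        nodes.add(new Node("node2", connectMillis, false));
        nodes.add(new Node("node3", connectMillis, false));
        for (int i = 0; i < invocations; i++) {
            runRound(nodes, graceMillis);
        }
        int connected = 0;
        for (Node n : nodes) {
            if (n.connected) {
                connected++;
            }
        }
        return connected;
    }

    public static void main(String[] args) {
        System.out.println(simulate(100, 200, 0));   // immediate cancellation
        System.out.println(simulate(100, 200, 500)); // cancellation delayed past the connect time
    }
}
```

With immediate cancellation the model prints 1 (only node1 is ever connected, however many invocations run), and with a grace period longer than the connect time it prints 3. The model also shows the weakness of the delay: it only helps if the configured value exceeds the real connect time.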

@fl4via @dmlloyd Do you recall if there is some elegant way to implement that in remoting?

@dmlloyd
Member

dmlloyd commented Feb 26, 2024

I believe that the best solution to this problem is to eliminate the cancellation of connection attempts described in step 2 of the scenario. Let the connection attempts complete, so that on the next request those nodes are either connected or else there is still an in-progress connection. I don't think there's any need to time them out as long as we're reusing the connection attempts properly (and I think there is a separate configuration item for connect timeout, isn't there?).
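A rough sketch of that idea, using plain `CompletableFuture` rather than the actual discovery or remoting classes: keep one shared connection attempt per node, take the first result, and never cancel the rest. Everything named here (`NoCancelDiscoverySketch`, the `connector` function, node names as strings) is a hypothetical stand-in for illustration, not an existing API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class NoCancelDiscoverySketch {

    // One shared connection attempt per node, reused across invocations.
    private final Map<String, CompletableFuture<String>> attempts = new ConcurrentHashMap<>();

    // Hypothetical hook that starts connecting to a node.
    private final Function<String, CompletableFuture<String>> connector;

    NoCancelDiscoverySketch(Function<String, CompletableFuture<String>> connector) {
        this.connector = connector;
    }

    /**
     * Completes with the first node whose connection is ready. Attempts that
     * lose the race are deliberately not cancelled, so they stay in progress
     * (or finish) and are picked up again by the next call.
     */
    CompletableFuture<Object> discover(List<String> nodes) {
        CompletableFuture<?>[] pending = nodes.stream()
                .map(node -> attempts.computeIfAbsent(node, connector))
                .toArray(CompletableFuture[]::new);
        // Note: anyOf also completes if the first attempt to finish has failed;
        // a real implementation would have to skip and evict failed attempts.
        return CompletableFuture.anyOf(pending);
    }

    /** Number of nodes with a successfully established connection. */
    long connectedCount() {
        return attempts.values().stream()
                .filter(f -> f.isDone() && !f.isCompletedExceptionally())
                .count();
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> handles = new ConcurrentHashMap<>();
        NoCancelDiscoverySketch discovery = new NoCancelDiscoverySketch(node -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            handles.put(node, f); // completed by hand below to mimic slow connects
            return f;
        });
        List<String> cluster = Arrays.asList("node1", "node2", "node3");

        CompletableFuture<Object> first = discovery.discover(cluster);
        handles.get("node1").complete("node1"); // node1 wins the first round
        System.out.println(first.join() + " connected=" + discovery.connectedCount());

        handles.get("node2").complete("node2"); // slow connects finish later, uncancelled
        handles.get("node3").complete("node3");
        System.out.println("connected=" + discovery.connectedCount());
    }
}
```

Because the per-node futures are cached, a second `discover` call does not start new connection attempts. It sees the attempts left over from the previous call, which is the "either connected or still in progress" state described above.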

@tadamski tadamski changed the title [WFDISC-53] Add configuration parameter that enables discovery cancel… [WFDISC-53] Do not cancel pending discovery invocations Mar 1, 2024
@tadamski tadamski changed the title [WFDISC-53] Do not cancel pending discovery invocations [WFDISC-53] Node starvation is possible if time to establish connection is non-negligible Mar 1, 2024
@tadamski
Contributor Author

tadamski commented Mar 1, 2024

I'm going to close this pull request and open a new one, under a new JIRA, that removes the cancellation. I want to keep WFDISC-53 with the description of this bug scenario for future reference.


2 participants