[WFDISC-53] Node starvation is possible if time to establish connection is non-negligible #68

tadamski · 2024-02-25T22:34:38Z

https://issues.redhat.com/browse/WFDISC-53

One node in the cluster can be starved if connection creation time is substantial. We encountered this scenario in an environment in which LDAP was used as a credential store:

1. During the first invocation, a connection is opened to the node (node1) that was configured in the EJB client. The connection is established, the EJB client channel is created, and information about the applications on node1 is provided to the client. Information about the other nodes in the cluster is also provided to the client.
2. During subsequent invocations, discovery runs as follows: the DiscoveryAttempt for node1 runs immediately (its connection is already established) and adds a result to the discovery queue. The DiscoveryAttempts for all other nodes in the cluster start creating connections to those nodes. The result from node1 is accepted, and cancellation of all other DiscoveryAttempts begins. Cancellation happens before a connection to any other node is established, so information about the applications on the other nodes is never provided to the client.

The above scenario repeats for all subsequent invocations, and node1 is selected 100% of the time.

I was able to make this scenario work by delaying discovery cancellation for a configured amount of time, as seen in this PR. With this delay, node1 still wins the first invocations, but connections are established to all other nodes, and during subsequent invocations they all have an equal chance of providing a result to the queue.
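To make the race easier to reason about, here is a minimal, deterministic model of it. It is only a sketch and not the wildfly-discovery code: `StarvationModel`, `simulate` and the `cancelBeforeConnect` flag are hypothetical names, and the model assumes that establishing a connection always takes longer than it takes an already-connected node to answer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the scenario above; all names are illustrative. */
public class StarvationModel {

    /**
     * Runs the given number of discovery rounds and returns, for each round,
     * the set of nodes that could offer a result because they were connected.
     */
    public static List<Set<String>> simulate(List<String> nodes, String configuredNode,
                                             int invocations, boolean cancelBeforeConnect) {
        Set<String> connected = new HashSet<>();
        // Assumption: the configured node is connected by the time discovery starts.
        connected.add(configuredNode);

        List<Set<String>> candidatesPerRound = new ArrayList<>();
        for (int i = 0; i < invocations; i++) {
            // Only nodes with an established connection can answer in this round.
            candidatesPerRound.add(new HashSet<>(connected));
            if (!cancelBeforeConnect) {
                // Pending attempts are given time to finish, so every node
                // is connected before the next round begins.
                connected.addAll(nodes);
            }
            // Otherwise the slow attempts are cancelled before they complete
            // and the connected set never grows.
        }
        return candidatesPerRound;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3");
        System.out.println("cancel early: " + simulate(nodes, "node1", 3, true));
        System.out.println("let finish:   " + simulate(nodes, "node1", 3, false));
    }
}
```

With `cancelBeforeConnect` set to `true`, node1 is the only candidate in every round. With it set to `false`, node1 is still the only candidate in the first round, but all nodes are candidates from the second round on.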

@fl4via @dmlloyd Do you recall whether there is an elegant way to implement that in remoting?


dmlloyd · 2024-02-26T14:01:06Z

I believe that the best solution to this problem is to eliminate the connection cancellation process initiated by scenario part 2. Let the connection attempts complete, so that on the next request those nodes are either connected or else there is still an in-progress connection. I don't think there's any need to time them out as long as we're reusing the connection attempts property (and I think there is a separate configuration item for connect timeout, isn't there?).
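As a rough illustration of that idea, the sketch below keeps one connection attempt per node and never cancels it, so an attempt started in one discovery round can still be pending or already finished in the next. It uses plain `java.util.concurrent` and none of the Remoting or wildfly-discovery APIs. `PendingConnections`, `discover` and the `connector` function are hypothetical, and retrying failed attempts or applying a connect timeout is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch: connection attempts outlive the round that started them. */
public class PendingConnections {
    // One attempt per node, shared by every discovery round.
    private final Map<String, CompletableFuture<String>> attempts = new ConcurrentHashMap<>();
    // Assumed to start an asynchronous connection and return immediately.
    private final Function<String, CompletableFuture<String>> connector;

    public PendingConnections(Function<String, CompletableFuture<String>> connector) {
        this.connector = connector;
    }

    /** Returns the nodes that can answer now; unfinished attempts keep running. */
    public List<String> discover(List<String> nodes) {
        for (String node : nodes) {
            // Start an attempt only if none exists yet (connected or in progress).
            attempts.computeIfAbsent(node, connector);
        }
        return nodes.stream()
                .filter(node -> {
                    CompletableFuture<String> attempt = attempts.get(node);
                    return attempt.isDone() && !attempt.isCompletedExceptionally();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        CompletableFuture<String> slow = new CompletableFuture<>();
        PendingConnections connections = new PendingConnections(
                node -> node.equals("node1") ? CompletableFuture.completedFuture(node) : slow);
        List<String> nodes = List.of("node1", "node2");
        System.out.println(connections.discover(nodes)); // node2 is still connecting
        slow.complete("node2");                          // the attempt finishes later
        System.out.println(connections.discover(nodes)); // both nodes can answer now
    }
}
```

The point of the design is that the first result ends the discovery round but does not end the other attempts, so a slow node becomes eligible as soon as its connection is ready.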

tadamski · 2024-03-01T13:54:29Z

I'm going to close this pull request and open one that removes cancellation under a new JIRA. I want to keep WFDISC-53, with the description of this bug scenario, for future reference.

[WFDISC-53] Add configuration parameter that enables discovery cancel…

38781f1

…lation delay

tadamski changed the title ~~[WFDISC-53] Add configuration parameter that enables discovery cancel…~~ [WFDISC-53] Do not cancel pending discovery invocations Mar 1, 2024

tadamski changed the title ~~[WFDISC-53] Do not cancel pending discovery invocations~~ [WFDISC-53] Node starvation is possible if time to establish connection is non-negligible Mar 1, 2024

tadamski closed this Mar 1, 2024

tadamski deleted the WFDISC-53 branch March 1, 2024 14:36

tadamski mentioned this pull request Mar 1, 2024

[WFDISC-54] Do not cancel DiscoveryAttempt connection creation if oth… #69

Merged
