Storage/RPC Related Resource Management Updates for Connectionless EPs #10837

iziemba · 2025-03-03T13:29:49Z

Storage libfabric users using a connectionless EP do not expect the EP to become disabled if an RDMA operation fails. The current libfabric documentation states that if various resource management enabled errors occur on an endpoint, the endpoint becomes disabled and must be re-enabled before it can be reused. While this makes sense for a connected endpoint (e.g., a TCP socket or an IB RC QP), it does not make sense for connectionless endpoints. Consider the following RDM EP example:

1. RPC client sends RKEY to RPC server.
2. RPC client dies, resulting in RKEY now being invalid.
3. RPC server issues RMA to RPC client.
4. RMA operation fails due to unmatched RMA.
5. RPC server RDM EP is disabled.

The above shows that a single RPC client crashing can cause the server endpoint to become disabled. This would impact the RPC server's connectivity to all other RPC clients that are still up.
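The scenario above can be sketched as a small standalone model. This is only an illustration under stated assumptions: `model_server_ep`, `model_rma`, and the two policy names are hypothetical and are not libfabric types or calls. The model contrasts the currently documented behavior (any failure disables the endpoint) with the behavior this PR argues for (only the failing operation is affected).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PEERS 4

/* Hypothetical model of an RPC server's connectionless endpoint.
 * None of these names exist in libfabric. */
struct model_server_ep {
    bool enabled;                     /* endpoint accepts new operations */
    bool peer_alive[MODEL_MAX_PEERS]; /* false once a client has died */
};

enum model_rm_policy {
    MODEL_DISABLE_ON_ANY_ERROR, /* behavior as currently documented */
    MODEL_ISOLATE_PEER_ERRORS   /* behavior argued for in this PR */
};

/* Issue an RMA to one peer. Returns 0 on success, -1 on failure. */
int model_rma(struct model_server_ep *ep, size_t peer,
              enum model_rm_policy policy)
{
    assert(ep != NULL && peer < MODEL_MAX_PEERS);

    if (!ep->enabled)
        return -1; /* a disabled endpoint rejects every operation */
    if (ep->peer_alive[peer])
        return 0;  /* the RKEY is still valid, so the RMA succeeds */

    /* The peer died, so the RMA fails. Only the first policy lets
     * that failure take down the whole endpoint. */
    if (policy == MODEL_DISABLE_ON_ANY_ERROR)
        ep->enabled = false;
    return -1;
}
```

With `MODEL_DISABLE_ON_ANY_ERROR`, one failed RMA to a dead client makes every later call fail, including calls to healthy clients. With `MODEL_ISOLATE_PEER_ERRORS`, only the RMA to the dead client fails.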

This PR addresses this problem by:

- Clarifying which resource management errors disable connected and connectionless endpoints
- Defining the unreachable EP resource management scenario

man/fi_domain.3.md

Previously, any resource management error would result in an endpoint being disabled. Storage libfabric users expect that a connectionless endpoint will not be disabled if an RDMA operation results in an error. For example, if a client sends a server a bad remote key and the server issues an RMA read/write to this key, then under the current resource management definition the endpoint will be disabled. An RDMA failure to a single connectionless peer should not impact the endpoint's connectivity to other peers. Fix this by clarifying exactly which errors will cause a connectionless endpoint to transition into a disabled state. Signed-off-by: Ian Ziemba <[email protected]>

The resource management unreachable EP case addresses issuing RDMA operations to a connectionless EP which cannot be reached. Examples include no-route-to-host or the target NIC being down. Defining this behavior is important for storage use-cases where NICs may unexpectedly disappear. Signed-off-by: Ian Ziemba <[email protected]>

shefty

minor comments on wording

shefty · 2025-03-04T20:20:05Z

man/fi_domain.3.md

-the endpoint must be re-enabled before it will accept new data transfer
-operations.  For connected endpoints, the connection is torn down and
-must be re-established.
+When a resource management error occurs on an connected endpoint, the endpoint


... a connected endpoint...

shefty · 2025-03-04T20:22:14Z

man/fi_domain.3.md

-must be re-established.
+When a resource management error occurs on an connected endpoint, the endpoint
+will transition into a disabled state and the connection torn down. While
+transitioning to disabled, any queued and inflight operations will be dropped.


"A disabled endpoint will drop any queued or inflight operations."

shefty · 2025-03-04T20:24:46Z

man/fi_domain.3.md

+When a resource management error occurs on an connected endpoint, the endpoint
+will transition into a disabled state and the connection torn down. While
+transitioning to disabled, any queued and inflight operations will be dropped.
+Connection must be re-established for endpoint to be usable.


I'd consider removing this sentence. It may not be possible to reconnect the same endpoint, forcing the use of a new one.

shefty · 2025-03-04T20:32:24Z

man/fi_domain.3.md

+transitioning to disabled, any queued and inflight, local transmit operations
+will be dropped. Endpoints targeting a disabled EP must adhere to the Target EP
+behavior. If the endpoint becomes disabled, the endpoint must be re-enabled
+before it will accept new data transfer operations.


The behavior of resource management errors on connectionless endpoints depends on the type of error. If RM is disabled and one of the following errors occurs, the endpoint will be disabled: Tx Ctx, Rx Ctx, Tx CQ, or Rx CQ. For other errors (Target EP, No Rx Buffer, etc.), the operation may fail, but the endpoint will remain enabled. A disabled endpoint will drop or fail any queued or inflight operations. ...
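As a way to sanity-check the suggested wording, the rule can be written as a small decision function. This is a sketch under assumptions, not provider code: the `model_` enums and `model_ep_disabled_after` are hypothetical names, the error list is limited to the cases named in the comment above, and resource management is assumed to be disabled as that wording states. The connected-endpoint branch follows the diff text that any resource management error disables a connected endpoint.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not libfabric definitions. */
enum model_ep_kind { MODEL_EP_CONNECTED, MODEL_EP_CONNECTIONLESS };

enum model_rm_error {
    MODEL_RM_TX_CTX,      /* transmit context overrun */
    MODEL_RM_RX_CTX,      /* receive context overrun */
    MODEL_RM_TX_CQ,       /* transmit completion queue overrun */
    MODEL_RM_RX_CQ,       /* receive completion queue overrun */
    MODEL_RM_TARGET_EP,   /* error at the target endpoint */
    MODEL_RM_NO_RX_BUFFER /* no receive buffer at the target */
};

/* Returns true if the local endpoint ends up disabled after the error.
 * Assumes resource management is disabled on the domain. */
bool model_ep_disabled_after(enum model_ep_kind kind, enum model_rm_error err)
{
    assert(kind == MODEL_EP_CONNECTED || kind == MODEL_EP_CONNECTIONLESS);

    /* Connected endpoints: any RM error disables the endpoint. */
    if (kind == MODEL_EP_CONNECTED)
        return true;

    /* Connectionless endpoints: only local queue overruns disable it. */
    switch (err) {
    case MODEL_RM_TX_CTX:
    case MODEL_RM_RX_CTX:
    case MODEL_RM_TX_CQ:
    case MODEL_RM_RX_CQ:
        return true;
    default:
        /* Target EP, No Rx Buffer, etc.: the operation may fail,
         * but the endpoint stays enabled. */
        return false;
    }
}
```

Under this reading, a bad remote key or a missing receive buffer at one peer never changes the state of a connectionless endpoint; only overruns of the endpoint's own contexts or CQs do.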

shefty · 2025-03-04T20:36:53Z

man/fi_domain.3.md

+  operations targeting an unreachable endpoint will have operation dropped. For
+  FI_EP_RDM, target operations targeting an unreachable endpoint will result in
+  a transmit error. A provider may choose to set the completion error code to
+  FI_EHOSTUNREACH signaling to user the target endpoint is unreachable.


I'd hesitate to call out a specific error code, as that depends on getting some sort of feedback from the network or target node to be accurate.
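One way an application could cope with that uncertainty is to treat any transmit error as grounds to suspect the peer, and to use a host-unreachable code only as an extra hint when the provider happens to supply it. The sketch below is illustrative only: `model_peer` and `model_on_tx_error` are hypothetical application-side names, and the POSIX `EHOSTUNREACH` value stands in for `FI_EHOSTUNREACH` on the assumption that the two values match.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer bookkeeping kept by the application. */
struct model_peer {
    bool suspect;          /* stop issuing new operations to this peer */
    bool unreachable_hint; /* provider reported a host-unreachable code */
};

/* Record a failed transmit completion for a single peer. The decision to
 * suspect the peer does not depend on which error code was reported. */
void model_on_tx_error(struct model_peer *peer, int err)
{
    assert(peer != NULL);

    peer->suspect = true;
    /* EHOSTUNREACH stands in for FI_EHOSTUNREACH (assumed equal). */
    peer->unreachable_hint = (err == EHOSTUNREACH);
}
```

Because the handler never requires a particular code, it behaves the same whether or not the provider can get feedback from the network or target node.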

iziemba requested review from soumagne and j-xiong March 3, 2025 13:29

iziemba mentioned this pull request Mar 3, 2025

Improve libfabric RDM EP for storage #10798

Open

iziemba requested a review from shefty March 3, 2025 15:45

soumagne approved these changes Mar 3, 2025

j-xiong added the for-2.1.x label Mar 3, 2025

j-xiong reviewed Mar 3, 2025

iziemba added 2 commits March 4, 2025 12:54

iziemba force-pushed the update_rdm_rm branch from 53fc9da to c254ace Compare March 4, 2025 19:00

iziemba requested a review from j-xiong March 4, 2025 19:00

j-xiong approved these changes Mar 4, 2025

shefty reviewed Mar 4, 2025
