
Storage/RPC Related Resource Management Updates for Connectionless EPs #10837

Open
wants to merge 2 commits into
base: main

iziemba
Contributor

@iziemba iziemba commented Mar 3, 2025

Storage users of libfabric who use a connectionless EP do not expect the EP to become disabled if an RDMA operation fails. The current libfabric documentation states that if any of various resource-management-enabled errors occurs on an endpoint, the endpoint becomes disabled and must be re-enabled before it can be reused. While this makes sense for a connected endpoint (e.g., a TCP socket or an IB RC QP), it does not make sense for connectionless endpoints. Consider the following RDM EP example:

  1. RPC client sends RKEY to RPC server.
  2. RPC client dies resulting in RKEY now being invalid.
  3. RPC server issues RMA to RPC client.
  4. RMA operation fails due to unmatched RMA.
  5. RPC server RDM EP is disabled.

The above shows that a single crashed RPC client can cause the server endpoint to become disabled. This would break the RPC server's connectivity to all other RPC clients that are still up.
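The scenario above can be sketched as a small self-contained model. This is not libfabric code: `model_ep`, `rm_policy`, `model_rma`, and `model_server_round` are hypothetical names invented for illustration, and the two policies only approximate the behavior this PR attributes to the current documentation versus the behavior it proposes for connectionless endpoints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these exist in libfabric. */
enum rm_policy {
    POLICY_DISABLE_ON_ANY_ERROR, /* behavior the PR attributes to the current docs */
    POLICY_KEEP_EP_ENABLED       /* behavior proposed for connectionless EPs */
};

struct model_ep {
    bool enabled;
};

/* Issue one modeled RMA to a peer; returns true if it succeeds. */
bool model_rma(struct model_ep *ep, bool peer_alive, enum rm_policy policy)
{
    assert(ep != NULL);
    if (!ep->enabled)
        return false; /* a disabled EP accepts no new operations */
    if (peer_alive)
        return true;
    /* The peer crashed, so its RKEY is stale and the RMA fails. */
    if (policy == POLICY_DISABLE_ON_ANY_ERROR)
        ep->enabled = false;
    return false;
}

/*
 * The server issues one RMA to each of n_peers clients in index order;
 * the client at index `crashed` is dead. Returns the number of RMAs
 * that succeed.
 */
size_t model_server_round(enum rm_policy policy, size_t n_peers, size_t crashed)
{
    struct model_ep ep = { .enabled = true };
    size_t ok = 0;

    for (size_t i = 0; i < n_peers; i++) {
        if (model_rma(&ep, i != crashed, policy))
            ok++;
    }
    return ok;
}
```

With four clients and client 0 crashed, the first policy leaves the server unable to reach any client, while the second still reaches the three live ones.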

This PR addresses this problem by:

  • Clarifying which resource management errors disable connected and connectionless endpoints
  • Defining the unreachable EP resource management scenario

iziemba added 2 commits March 4, 2025 12:54
Previously, any resource management error would result in an endpoint
being disabled. Storage libfabric users have the expectation that
connectionless endpoints will not be disabled if an RDMA operation
results in an error. For example, if a client sends a server a bad
remote key and the server issues an RMA read/write to this key, by the
current resource management definition, the endpoint will be disabled.
An RDMA failure to a single connectionless peer should not impact the
endpoint's connectivity to other peers.

Fix this by clarifying exactly which errors will cause a connectionless
endpoint to transition into a disabled state.

Signed-off-by: Ian Ziemba <[email protected]>
Resource management unreachable EP addresses the issue of issuing RDMA
operations to a connectionless EP which cannot be reached. Examples
include no-route-to-host or target NIC down. Defining this behavior is
important for storage use-cases where NICs may unexpectedly disappear.

Signed-off-by: Ian Ziemba <[email protected]>
@iziemba iziemba requested a review from j-xiong March 4, 2025 19:00
Member

@shefty shefty left a comment


minor comments on wording

the endpoint must be re-enabled before it will accept new data transfer
operations. For connected endpoints, the connection is torn down and
must be re-established.
When a resource management error occurs on an connected endpoint, the endpoint

... a connected endpoint...

must be re-established.
When a resource management error occurs on an connected endpoint, the endpoint
will transition into a disabled state and the connection torn down. While
transitioning to disabled, any queued and inflight operations will be dropped.

"A disabled endpoint will drop any queued or inflight operations."

When a resource management error occurs on an connected endpoint, the endpoint
will transition into a disabled state and the connection torn down. While
transitioning to disabled, any queued and inflight operations will be dropped.
Connection must be re-established for endpoint to be usable.

I'd consider removing this sentence. It may not be possible to reconnect the same endpoint, forcing the use of a new one.

transitioning to disabled, any queued and inflight, local transmit operations
will be dropped. Endpoints targeting a disabled EP must adhere to the Target EP
behavior. If the endpoint becomes disabled, the endpoint must be re-enabled
before it will accept new data transfer operations.

The behavior of resource management errors on connectionless endpoints depends
on the type of error. If RM is disabled and one of the following errors
occurs, the endpoint will be disabled: Tx Ctx, Rx Ctx, Tx CQ, or Rx CQ. For other errors
(Target EP, No Rx Buffer, etc.), the operation may fail, but the endpoint will remain enabled.
A disabled endpoint will drop or fail any queued or inflight operations.  ...
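The rule in the suggested wording above can be read as a small decision table. The following is a minimal sketch of that reading, not libfabric code: the enum names and `model_rm_error_disables_ep` are hypothetical, the error classes are only the ones named in this comment, and the connected-endpoint branch assumes the diff text under review (any resource management error disables a connected endpoint).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these exist in libfabric. */
enum model_ep_kind { MODEL_EP_CONNECTED, MODEL_EP_CONNECTIONLESS };

enum model_rm_error {
    MODEL_RM_TX_CTX,
    MODEL_RM_RX_CTX,
    MODEL_RM_TX_CQ,
    MODEL_RM_RX_CQ,
    MODEL_RM_TARGET_EP,
    MODEL_RM_NO_RX_BUFFER
};

/* Returns true if, under this reading, the error disables the endpoint. */
bool model_rm_error_disables_ep(enum model_ep_kind kind, enum model_rm_error err)
{
    /* Assumption: any RM error disables a connected endpoint. */
    if (kind == MODEL_EP_CONNECTED)
        return true;

    switch (err) {
    case MODEL_RM_TX_CTX:
    case MODEL_RM_RX_CTX:
    case MODEL_RM_TX_CQ:
    case MODEL_RM_RX_CQ:
        return true;  /* local resource errors disable the endpoint */
    case MODEL_RM_TARGET_EP:
    case MODEL_RM_NO_RX_BUFFER:
        return false; /* the operation may fail; the endpoint stays enabled */
    }
    assert(0 && "unknown error class");
    return false;
}
```

Under this reading, a connectionless endpoint is disabled only by errors in its own local resources, which is what keeps a single bad peer from affecting the others.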

operations targeting an unreachable endpoint will have operation dropped. For
FI_EP_RDM, target operations targeting an unreachable endpoint will result in
a transmit error. A provider may choose to set the completion error code to
FI_EHOSTUNREACH signaling to user the target endpoint is unreachable.

I'd hesitate to call out a specific error code, as that depends on getting some sort of feedback from the network or target node to be accurate.
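One way an application could accommodate this concern is to treat any transmit error as a failure of that one operation and use the error code only as an optional hint. The sketch below is a hypothetical model, not libfabric code: `model_tx_verdict` and `model_classify_tx_error` are invented names, the plain POSIX `EHOSTUNREACH` stands in for `FI_EHOSTUNREACH`, and it assumes completion errors are reported as positive errno-style values.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical verdicts for a single transmit completion. */
enum model_tx_verdict {
    MODEL_TX_OK,
    MODEL_TX_FAILED,                 /* failed; the cause is not known */
    MODEL_TX_FAILED_UNREACHABLE_HINT /* failed; provider hinted peer is unreachable */
};

/*
 * Classify a completion error value. Assumption: 0 means success and
 * errors are positive errno-style values.
 */
enum model_tx_verdict model_classify_tx_error(int err)
{
    assert(err >= 0);
    if (err == 0)
        return MODEL_TX_OK;
    /* The hint is optional: a provider may not be able to report it. */
    if (err == EHOSTUNREACH)
        return MODEL_TX_FAILED_UNREACHABLE_HINT;
    return MODEL_TX_FAILED;
}
```

Because both failure verdicts are handled per operation, the application behaves correctly whether or not the provider can determine that the target is unreachable.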

4 participants