Storage/RPC Related Resource Management Updates for Connectionless EPs #10837
base: main
Previously, any resource management error would result in an endpoint being disabled. Storage libfabric users expect that connectionless endpoints will not be disabled if an RDMA operation results in an error. For example, if a client sends a server a bad remote key and the server issues an RMA read/write to this key, by the current resource management definition, the endpoint will be disabled. RDMA failure to a single connectionless peer should not impact an endpoint's connectivity to other peers. Fix this by clarifying exactly which errors will cause a connectionless endpoint to transition into a disabled state. Signed-off-by: Ian Ziemba <[email protected]>
Resource management unreachable EP addresses the issue of issuing RDMA operations to a connectionless EP which cannot be reached. Examples include no-route-to-host or target NIC down. Defining this behavior is important for storage use-cases where NICs may unexpectedly disappear. Signed-off-by: Ian Ziemba <[email protected]>
minor comments on wording
the endpoint must be re-enabled before it will accept new data transfer
operations. For connected endpoints, the connection is torn down and
must be re-established.
When a resource management error occurs on an connected endpoint, the endpoint
... a connected endpoint...
must be re-established.
When a resource management error occurs on an connected endpoint, the endpoint
will transition into a disabled state and the connection torn down. While
transitioning to disabled, any queued and inflight operations will be dropped.
"A disabled endpoint will drop any queued or inflight operations."
When a resource management error occurs on an connected endpoint, the endpoint
will transition into a disabled state and the connection torn down. While
transitioning to disabled, any queued and inflight operations will be dropped.
Connection must be re-established for endpoint to be usable.
I'd consider removing this sentence. It may not be possible to reconnect the same endpoint, forcing the use of a new one.
man/fi_domain.3.md
Outdated
transitioning to disabled, any queued and inflight, local transmit operations
will be dropped. Endpoints targeting a disabled EP must adhere to the Target EP
behavior. If the endpoint becomes disabled, the endpoint must be re-enabled
before it will accept new data transfer operations.
The behavior of resource management errors on connectionless endpoints depends
on the type of error. If RM is disabled and one of the following errors
occurs, the endpoint will be disabled: Tx Ctx, Rx Ctx, Tx CQ, or Rx CQ. For other errors
(Target EP, No Rx Buffer, etc.), the operation may fail, but the endpoint will remain enabled.
A disabled endpoint will drop or fail any queued or inflight operations. ...
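To make the proposed classification concrete, here is a minimal sketch in C of the rule as worded above, covering only the case where RM is disabled. The enum and function names are hypothetical and exist only for illustration; they are not libfabric identifiers, and the "etc." category of other errors is represented only by the two named examples.

```c
#include <stdbool.h>

/* Hypothetical labels for the resource management error classes named in
 * the suggested wording. These are illustration-only names, not libfabric
 * identifiers. */
enum rm_error {
    RM_ERR_TX_CTX,       /* transmit context overrun */
    RM_ERR_RX_CTX,       /* receive context overrun */
    RM_ERR_TX_CQ,        /* transmit completion queue overrun */
    RM_ERR_RX_CQ,        /* receive completion queue overrun */
    RM_ERR_TARGET_EP,    /* target endpoint resources exhausted */
    RM_ERR_NO_RX_BUFFER, /* no receive buffer posted at the target */
};

/* Models the suggested rule for a connectionless endpoint with RM disabled:
 * local context and CQ errors disable the endpoint, while other errors may
 * fail the individual operation but leave the endpoint enabled. */
bool rm_error_disables_connectionless_ep(enum rm_error err)
{
    switch (err) {
    case RM_ERR_TX_CTX:
    case RM_ERR_RX_CTX:
    case RM_ERR_TX_CQ:
    case RM_ERR_RX_CQ:
        return true;
    case RM_ERR_TARGET_EP:
    case RM_ERR_NO_RX_BUFFER:
        return false;
    }
    /* Unknown values: assume the endpoint stays enabled (an assumption of
     * this sketch, not something the proposed text specifies). */
    return false;
}
```

Under this reading, a failure caused by a single peer (for example Target EP) never takes the local endpoint away from its other peers.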
operations targeting an unreachable endpoint will have operation dropped. For
FI_EP_RDM, target operations targeting an unreachable endpoint will result in
a transmit error. A provider may choose to set the completion error code to
FI_EHOSTUNREACH signaling to user the target endpoint is unreachable.
I'd hesitate to call out a specific error code, as that depends on getting some sort of feedback from the network or target node to be accurate.
Storage libfabric users using a connectionless EP do not expect the EP to become disabled if an RDMA operation fails. The current libfabric documentation states that if various resource management enabled errors occur on an endpoint, the endpoint becomes disabled and must be re-enabled to be reusable. While this makes sense for a connected endpoint (e.g., TCP socket and IB RC QP), this does not make sense for connectionless endpoints. Consider the following RDM EP example:
The above shows that a single crashing RPC client can cause the server endpoint to become disabled. This would impact the RPC server's connectivity to all other RPC clients that are still up.
This PR addresses this problem by