prov/lnx: Introducing the LINKx (lnx) provider

The LINKx (lnx) provider offers a framework by which multiple providers
can be linked together and presented as one provider to the application.
This abstracts away the details of the traffic providers from the
application. This iteration of the provider allows linking only two
providers: shm and another provider, e.g. CXI or RXM. The composite
providers which are linked together need to support the peer
infrastructure.

Currently the provider supports creating a unique chain of
fabric->domain->ep. It doesn't support creating multiple domains per
fabric or multiple endpoints per domain. This will be addressed in
follow-up updates to the provider.

This iteration mainly focuses on supporting Open MPI's MTL path, which
uses the libfabric tagged APIs. It has been tested with linking shm and
cxi, and shm and rxm.

Future work will include:
- Supporting 1:N of fabric:domain and domain:endpoint, etc.
- Hardware offload support
- Arbitrary provider linking
- Memory caching and registration
- Full libfabric API support
- Multi-Rail feature

In order to use the lnx provider the user needs to:
  export FI_LNX_PROV_LINKS="shm+<inter-node provider>"
ex:
  export FI_LNX_PROV_LINKS="shm+cxi"
or
  export FI_LNX_PROV_LINKS="shm+tcp;ofi_rxm"

This results in the lnx provider returning all available links to the
application, which can then select the most appropriate one to use.

Signed-off-by: Amir Shehata <[email protected]>
ofiwg · Oct 24, 2024 · 0fbf4f4
1 parent 8457ba6
commit 0fbf4f4
Showing 21 changed files with 5,562 additions and 7 deletions.
diff --git a/Makefile.am b/Makefile.am
@@ -485,6 +485,7 @@ include prov/sm2/Makefile.include
 include prov/tcp/Makefile.include
 include prov/ucx/Makefile.include
 include prov/lpp/Makefile.include
+include prov/lnx/Makefile.include
 include prov/hook/Makefile.include
 include prov/hook/perf/Makefile.include
 include prov/hook/trace/Makefile.include

diff --git a/configure.ac b/configure.ac
@@ -1125,6 +1125,7 @@ FI_PROVIDER_SETUP([hook_debug])
 FI_PROVIDER_SETUP([hook_hmem])
 FI_PROVIDER_SETUP([dmabuf_peer_mem])
 FI_PROVIDER_SETUP([opx])
+FI_PROVIDER_SETUP([lnx])
 FI_PROVIDER_FINI
 dnl Configure the .pc file
 FI_PROVIDER_SETUP_PC

diff --git a/include/ofi.h b/include/ofi.h
@@ -297,6 +297,7 @@ enum ofi_prov_type {
 	OFI_PROV_UTIL,
 	OFI_PROV_HOOK,
 	OFI_PROV_OFFLOAD,
+	OFI_PROV_LNX,
 };
 
 /* Restrict to size of struct fi_provider::context (struct fi_context) */

diff --git a/include/ofi_prov.h b/include/ofi_prov.h
@@ -211,6 +211,17 @@ MRAIL_INI ;
 #  define MRAIL_INIT NULL
 #endif
 
+#if (HAVE_LNX) && (HAVE_LNX_DL)
+#  define LNX_INI FI_EXT_INI
+#  define LNX_INIT NULL
+#elif (HAVE_LNX)
+#  define LNX_INI INI_SIG(fi_lnx_ini)
+#  define LNX_INIT fi_lnx_ini()
+LNX_INI ;
+#else
+#  define LNX_INIT NULL
+#endif
+
 #if (HAVE_PERF) && (HAVE_PERF_DL)
 #  define HOOK_PERF_INI FI_EXT_INI
 #  define HOOK_PERF_INIT NULL

diff --git a/include/ofi_util.h b/include/ofi_util.h
@@ -1172,9 +1172,11 @@ void ofi_fabric_remove(struct util_fabric *fabric);
  * Utility Providers
  */
 
-#define OFI_NAME_DELIM	';'
+#define OFI_NAME_LNX_DELIM ':'
+#define OFI_NAME_DELIM ';'
 #define OFI_UTIL_PREFIX "ofi_"
 #define OFI_OFFLOAD_PREFIX "off_"
+#define OFI_LNX "lnx"
 
 static inline int ofi_has_util_prefix(const char *str)
 {
@@ -1186,6 +1188,16 @@ static inline int ofi_has_offload_prefix(const char *str)
 	return !strncasecmp(str, OFI_OFFLOAD_PREFIX, strlen(OFI_OFFLOAD_PREFIX));
 }
 
+static inline int ofi_is_lnx(const char *str)
+{
+	return !strncasecmp(str, OFI_LNX, strlen(OFI_LNX));
+}
+
+static inline int ofi_is_linked(const char *str)
+{
+	return (strcasestr(str, OFI_LNX)) ? 1 : 0;
+}
+
 int ofi_get_core_info(uint32_t version, const char *node, const char *service,
 		      uint64_t flags, const struct util_prov *util_prov,
 		      const struct fi_info *util_hints,
@@ -1201,6 +1213,7 @@ int ofi_get_core_info_fabric(const struct fi_provider *prov,
 			     struct fi_info **core_info);
 
 
+char *ofi_strdup_link_append(const char *head, const char *tail);
 char *ofi_strdup_append(const char *head, const char *tail);
 // char *ofi_strdup_head(const char *str);
 // char *ofi_strdup_tail(const char *str);

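The two helpers added in this hunk match differently: `ofi_is_lnx()` is a case-insensitive prefix test, while `ofi_is_linked()` is a case-insensitive substring test. The standalone sketch below illustrates that difference under some assumptions. The function names `is_lnx_prefix` and `contains_lnx` are hypothetical stand-ins, the sample strings are made up for illustration, and the substring search is a portable loop rather than `strcasestr()` (which is a GNU/BSD extension) so the sketch compiles without feature-test macros.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>

#define LNX_NAME "lnx" /* same literal as OFI_LNX in the diff */

/* Prefix test, modelled on ofi_is_lnx(): 1 if str starts with "lnx". */
static int is_lnx_prefix(const char *str)
{
	assert(str != NULL);
	return !strncasecmp(str, LNX_NAME, strlen(LNX_NAME));
}

/*
 * Substring test, modelled on ofi_is_linked(): 1 if "lnx" occurs anywhere
 * in str. Hypothetical portable replacement for strcasestr().
 */
static int contains_lnx(const char *str)
{
	size_t n = strlen(LNX_NAME);

	assert(str != NULL);
	for (; *str; str++) {
		if (!strncasecmp(str, LNX_NAME, n))
			return 1;
	}
	return 0;
}
```

With these definitions, a made-up name such as `"foo:lnx"` fails the prefix test but passes the substring test, and a name such as `"tcp;ofi_rxm"` fails both. Note that the substring form also accepts any name that merely contains the three letters, which may or may not be intended.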
diff --git a/include/rdma/fabric.h b/include/rdma/fabric.h
@@ -340,6 +340,7 @@ enum {
 	FI_PROTO_SM2,
 	FI_PROTO_CXI_RNR,
 	FI_PROTO_LPP,
+	FI_PROTO_LNX,
 };
 
 enum {

diff --git a/include/rdma/fi_errno.h b/include/rdma/fi_errno.h
@@ -114,7 +114,7 @@ extern "C" {
 //#define	FI_EADV			EADV		/* Advertise error */
 //#define	FI_ESRMNT		ESRMNT		/* Srmount error */
 //#define	FI_ECOMM		ECOMM		/* Communication error on send */
-//#define	FI_EPROTO		EPROTO		/* Protocol error */
+#define	FI_EPROTO		EPROTO			/* Protocol error */
 //#define	FI_EMULTIHOP		EMULTIHOP	/* Multihop attempted */
 //#define	FI_EDOTDOT		EDOTDOT		/* RFS specific error */
 //#define	FI_EBADMSG		EBADMSG		/* Not a data message */

diff --git a/man/fi_lnx.7.md b/man/fi_lnx.7.md
@@ -0,0 +1,157 @@
+---
+layout: page
+title: fi_lnx(7)
+tagline: Libfabric Programmer's Manual
+---
+{% include JB/setup %}
+
+# NAME
+
+fi_lnx \- The LINKx (LNX) Provider
+
+# OVERVIEW
+
+The LNX provider is designed to link two or more providers, allowing
+applications to seamlessly use multiple providers or NICs. This provider uses
+the libfabric peer infrastructure to aid in the use of the underlying providers.
+This version of the provider currently supports linking the libfabric
+shared memory provider for intra-node traffic and another provider for
+inter-node traffic. Future releases of the provider will allow linking any
+number of providers and provide the users with the ability to influence
+the way the providers are utilized for traffic load.
+
+# SUPPORTED FEATURES
+
+This release contains an initial implementation of the LNX provider that
+offers the following support:
+
+*Endpoint types*
+: The provider supports only endpoint type *FI_EP_RDM*.
+
+*Endpoint capabilities*
+: LNX is a passthrough layer on the send path. On the receive path LNX
+  utilizes the peer infrastructure to create shared receive queues (SRQ).
+  Receive requests are placed on the SRQ instead of on the core provider
+  receive queue. When the provider receives a message it queries the SRQ for
+  a match. If one is found the receive request is completed; otherwise the
+  message is placed on the LNX shared unexpected queue (SUQ). Further receive
+  requests query the SUQ for matches.
+  The first release of the provider only supports tagged and RMA operations.
+  Other message types will be supported in future releases.
+
+*Modes*
+: The provider does not require the use of any mode bits.
+
+*Progress*
+: LNX utilizes the peer infrastructure to provide a shared completion
+  queue. Each linked provider still needs to handle its own progress.
+  Completion events will however be placed on the shared completion queue,
+  which is passed to the application for access.
+
+*Address Format*
+: LNX wraps the linked providers' addresses in one common binary blob.
+  It does not alter the linked providers' address formats. It wraps
+  them into a LNX structure which is then flattened and returned to the
+  application. This is passed between different nodes. The LNX provider
+  is able to parse the flattened format and operate on the different links.
+  This assumes that nodes in the same group are all using the same version of
+  the provider with the exact same links. For example, you can't have one node
+  linking SHM+CXI while another links SHM+RXM.
+
+*Message Operations*
+: LNX is designed to intercept message operations such as fi_tsenddata
+  and, based on specific criteria, forward the operation to the appropriate
+  provider. For the first release, LNX will only support linking the SHM
+  provider for intra-node traffic and another provider (e.g. CXI) for
+  inter-node traffic. The LNX send operation looks at the destination and,
+  based on whether the destination is local or remote, selects the provider
+  to forward the operation to. The receive case has been described earlier.
+
+*Using the Provider*
+: In order to use the provider the user needs to set the FI_LNX_PROV_LINKS
+  environment variable to the linked providers in the following format:
+  shm+<prov>. This will allow LNX to report back to the application in the
+  fi_getinfo() call the different links which can be selected. Since there
+  can be multiple domains per provider, LNX reports every combination of
+  possible links. For example, if there are two CXI interfaces on the machine
+  LNX will report back shm+cxi0 and shm+cxi1. The application can then
+  select based on its own criteria the link it wishes to use.
+  The application typically uses the PCI information in the fi_info
+  structure to select the interface to use. A common selection criterion is
+  the interface nearest the core the process is bound to. In order to make
+  this determination, the application requires the PCI information about the
+  interface. For this reason LNX forwards the PCI information for the
+  inter-node provider in the link to the application.
+
+# LIMITATIONS AND FUTURE WORK
+
+*Hardware Support*
+: LNX doesn't support hardware offload, e.g. hardware tag matching. This is
+  an inherent limitation when using the peer infrastructure. Due to the use
+  of a shared receive queue which linked providers need to query when
+  a message is received, any hardware offload which requires sending the
+  receive buffers to the hardware directly will not work with the shared
+  receive queue. The shared receive queue provides two advantages: 1) reduce
+  memory usage, 2) coordinate the receive operations. For #2 this is needed
+  when receiving from FI_ADDR_UNSPEC. In this case both providers which are
+  part of the link can race to gain access to the receive buffer. It is
+  a future effort to determine a way to use hardware tag matching and other
+  hardware offload capability with LNX.
+
+*Limited Linking*
+: This release of the provider supports linking the SHM provider for intra-node
+  operations and another provider which supports the FI_PEER capability for
+  inter-node operations. It is a future effort to expand to linking
+  arbitrary sets of providers.
+
+*Memory Registration*
+: As part of the memory registration operation, different hardware can perform
+  hardware-specific steps such as memory pinning. Because the
+  memory registration APIs do not specify the source or destination
+  addresses, it is not possible for LNX to determine which provider to
+  forward the memory registration to. LNX, therefore, registers the memory
+  with all linked providers. This might not be efficient and might have
+  unforeseen side effects. A better method is needed to support memory
+  registration. One option is to have a memory registration cache in LNX
+  to avoid expensive operations.
+
+*Operation Types*
+: This release of LNX supports tagged and RMA operations only. Future
+  releases will expand the support to other operation types.
+
+*Multi-Rail*
+: Future design effort is being planned to support utilizing multiple interfaces
+  for traffic simultaneously. This can be over homogeneous interfaces or over
+  heterogeneous interfaces.
+
+# RUNTIME PARAMETERS
+
+The *LNX* provider checks for the following environment variables:
+
+*FI_LNX_PROV_LINKS*
+: This environment variable is used to specify which providers to link. This
+  must be set in order for the LNX provider to return a list of fi_info
+  blocks in the fi_getinfo() call. The format which must be used is:
+  <prov1>+<prov2>+... As mentioned earlier, LNX currently supports linking
+  only two providers, the first of which is SHM, followed by one other
+  provider for inter-node operations.
+
+*FI_LNX_DISABLE_SHM*
+: By default this environment variable is set to 0. However, the user can
+  set it to 1, in which case the SHM provider will not be used. This can be
+  useful for debugging and performance analysis. The SHM provider will
+  naturally be used for all intra-node operations. Therefore, to test SHM in
+  isolation with LNX, the processes can be limited to the same node only.
+
+*FI_LNX_USE_SRQ*
+: Shared Receive Queues are an integral part of the peer infrastructure, but
+  they have the limitation of not using hardware offload, such as tag
+  matching. SRQ is needed to support the FI_ADDR_UNSPEC case. If the application
+  is sure this will never be the case, then it can turn off SRQ support by
+  setting this environment variable to 0. It is 1 by default.
+
+# SEE ALSO
+
+[`fabric`(7)](fabric.7.html),
+[`fi_provider`(7)](fi_provider.7.html),
+[`fi_getinfo`(3)](fi_getinfo.3.html)
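
The receive path described under *Endpoint capabilities* in the man page above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the provider's implementation: every name (`toy_lnx`, `toy_post_recv`, `toy_msg_arrived`, and so on) is hypothetical, matching is on an exact 64-bit tag with no ignore bits or source address, queues are fixed-size arrays, and no message ordering is preserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_QUEUE_CAP 16 /* arbitrary capacity for the sketch */

struct toy_entry {
	uint64_t tag;
	int in_use;
};

struct toy_queue {
	struct toy_entry slots[TOY_QUEUE_CAP];
};

/* One shared receive queue (SRQ) and one shared unexpected queue (SUQ). */
struct toy_lnx {
	struct toy_queue srq;
	struct toy_queue suq;
};

/* Store tag in the first free slot. Returns 0 on success, -1 if full. */
static int toy_queue_put(struct toy_queue *q, uint64_t tag)
{
	size_t i;

	assert(q != NULL);
	for (i = 0; i < TOY_QUEUE_CAP; i++) {
		if (!q->slots[i].in_use) {
			q->slots[i].tag = tag;
			q->slots[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Remove the first entry with a matching tag. Returns 1 if one was found. */
static int toy_queue_take(struct toy_queue *q, uint64_t tag)
{
	size_t i;

	assert(q != NULL);
	for (i = 0; i < TOY_QUEUE_CAP; i++) {
		if (q->slots[i].in_use && q->slots[i].tag == tag) {
			q->slots[i].in_use = 0;
			return 1;
		}
	}
	return 0;
}

/*
 * Application posts a receive: an already-arrived message on the SUQ
 * completes it at once; otherwise the request waits on the SRQ.
 * Returns 1 if completed, 0 if queued, -1 if the SRQ is full.
 */
static int toy_post_recv(struct toy_lnx *lnx, uint64_t tag)
{
	if (toy_queue_take(&lnx->suq, tag))
		return 1;
	return toy_queue_put(&lnx->srq, tag);
}

/*
 * A linked provider reports an incoming message: a waiting request on the
 * SRQ is completed; otherwise the message is parked on the SUQ.
 * Returns 1 if matched, 0 if unexpected, -1 if the SUQ is full.
 */
static int toy_msg_arrived(struct toy_lnx *lnx, uint64_t tag)
{
	if (toy_queue_take(&lnx->srq, tag))
		return 1;
	return toy_queue_put(&lnx->suq, tag);
}
```

Starting from a zero-initialized `struct toy_lnx`, posting a receive for tag 7 queues it on the SRQ, and a later arrival with tag 7 completes it. An arrival with tag 9 that has no posted receive lands on the SUQ and is consumed by the next `toy_post_recv()` for tag 9. Because both queues are shared, it does not matter which linked provider delivered the message, which is the coordination the man page describes for the FI_ADDR_UNSPEC case.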