-
Notifications
You must be signed in to change notification settings - Fork 388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ofi 2.0 changes #10009
Ofi 2.0 changes #10009
Conversation
Several defines and values should not have been exposed in the public header files. Remove or move the definitions into internal headers. This removes the chance of possible conflicts with application definitions and API breakage. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Remove support for asynchronous insertions and AV_MAP. The format of the fi_addr_t value will either be indexed based in the standard case or provider defined in more advanced use cases, based on the AV configuration (such as using auth_keys). Signed-off-by: Sean Hefty <[email protected]>
Remove FI_BUFFERED_RECV as an exported API option. Since it's currently used internally between mrail and rxm, make it an internal only option. It has a limited use case for multirail over rxm over connected endpoints where shared receive queues are not available. With shared receive queues, the feature wouldn't be needed, as mrail could own the buffers outright. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Remove overly complicated threading models and focus on specific models to allow better alignment between application designs and provider implementation. Use FI_THREAD_DOMAIN as the preferred lockless threading model for standard endpoints. Signed-off-by: Sean Hefty <[email protected]>
Combine data and control progress into one progress option. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Completions are always unordered. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Field was deprecated and only serves as a placeholder for compatility. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Support for wait sets adds significant complexity to the provider implementation and is basically an abstraction around the OS constructs for poll/epoll (on Linux). Remove the feature from the API, but keep the internal implementation for now. This allows providers to move away from wait set support. Note that blocking support and support for native wait objects (e.g. epoll fd's) are still supported by the API. Only the wait set abstraction is removed, which allows providers control over creating wait objects. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Setting control=FI_PROGRESS_CONTROL_UNIFIED, data=FI_PROGRESS_MANUAL, and | ||
threading=FI_THREAD_DOMAIN/FI_THREAD_COMPLETION allows Libfabric to remove all | ||
locking in the critical data progress path. | ||
: This progress model indicates that the user will synchronize progressing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove control progress, aren't we assuming data and ctrl progress are unified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a question I would like to discuss. It is possible that both use the same model (auto or manual) but progress independently. But does it make practical sense to keep them separated? Or are there technical difficulties for all providers to support unified progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @shefty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intent was to remove the arbitrary separation between data and control progress. There's only progress, which can be manual or auto. Heck, even trying to determine if an async operation that generates a completion meets the requirement for being manual versus auto is non-trivial.
The only "non-data" asynchronous operation that is left is connection establishment. Such operations are reported to an EQ, which already provides separation from data transfer completions. Connections require user intervention in order to complete. I don't know what auto progress really means there.
03514ce
to
a77882b
Compare
AWS CI failed in ubertest as
|
@shijin-aws Thanks. Similarly, the failure in Intel CI is related to trying to run removed test (fi_poll). |
#define FI_VARIABLE_MSG 0ULL | ||
#define FI_NOTIFY_FLAGS_ONLY 0ULL | ||
#define FI_RESTRICTED_COMP 0ULL | ||
#define FI_EP_SOCK_STREAM FI_EP_UNSPEC | ||
#define FI_ORDER_NONE 0ULL | ||
#define FI_MR_BASIC (FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY) | ||
#define FI_MR_SCALABLE 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining FI_MR_BASIC and FI_MR_SCALABLE allow the app to compile, but the behavior isn't the same. FI_MR_BASIC, specifically will result is a broken app. IMO, it's better to fail the compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep these temporarily for current CI testing. Will remove before merge.
Makefile.am
Outdated
@@ -294,7 +293,7 @@ real_man_pages = \ | |||
man/man7/fi_arch.7 \ | |||
man/man7/fi_direct.7 \ | |||
man/man7/fi_guide.7 \ | |||
man/man7/fi_hook.7 \ | |||
man/man7/fi_hook.7 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tab got inserted here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to the OPX provider look good to me.
include/rdma/fabric.h
Outdated
enum { | ||
FI_TAG_BITS, | ||
FI_TAG_HPC, | ||
FI_TAG_AI, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we wanting HPC and AI or MPI and CCL? The latter are closer aligned to what the definitions correspond to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MPI and CCL. Forgot to change that.
|
||
The AI tag format does not require searching for matching receive | ||
buffers, only directing the message to the correct virtual message queue | ||
based on to the payload identifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra 'to' in the text that isn't needed
: 16-bit half precision floating point value (IEEE 754-2008). | ||
|
||
*FI_BFLOAT16* | ||
: 16-bit brain floating point value (IEEE 754-2008). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
brain floating point - I love IEEE
Poll sets are a simple iterator around progressing multiple objects. Remove from the API to reduce provider complexity. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
It is only implemented in the sockets provider as a proof of concept. The sockets provider itself is deprecated. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Remove FI_MR_BASIC/SCALABLE/UNSPEC. These were deprecated in version 1.5. Remove FI_LOCAL_MR, which was an earlier version of FI_MR_LOCAL and also deprecated. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Feature is not implemented natively by providers. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Remove FI_ORDER_NONE (flag that's 0) and FI_ORDER_STRICT (which isn't a flag, and covers only a portion of the valid flags). Remove FI_ORDER_DATA, which will not be supported by version 2 in order to allow for greater provider optimization. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
They are still used by MPICH which is needed by current CI runs. Signed-off-by: Jianxin Xiong <[email protected]>
Remove the option of directing transmit and receive completions to separate CQs for the same endpoint. The option adds complexity at the provider and application levels. This is largely the result of needing SW based protocols for certain operations, such as tag matching. This either makes it necessary for the app to drive progress across multiple CQs, or the provider emulates the application's CQs in SW. This change updates the man page only. Provider developers are left to update their code bases separately. Signed-off-by: Sean Hefty <[email protected]>
Disallow users hand-crafting or hand-copying their own fi_info structs. This allows the library to allocate hidden fields for internal use. Plus, there's no need to apps to do this, given that the API call is way easier to use. Signed-off-by: Sean Hefty <[email protected]>
Add a new call that takes fi_info as input, which provides consistency with the domain and endpoint open calls. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Subsystem filtering isn't useful (or likely ever used). Remove the enum fi_log_subsys. Instead convert the API to accept int flags. This maintains API compatibility. Update all current FI_LOG_xxx subsys values to 0. This avoids needing to update the providers to the new API, forcing them to pass in 0 for the flags. No actual flag values are defined. Those become a placeholder for future options. The logging checks are simplified by this change. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Introduce the concept of peer groups. A peer group is a set of peers that are communicating together for some specific set of tasks. Peer groups provide a lower-level mapping of HPC and AI communicators. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Jianxin Xiong <[email protected]>
Allow specifying precise tag formatting options. The mem_tag_format takes as input a set of bit fields. In practice, this ends up being unusable to implement, resulting in the entire tag simply being masked with ignore bits. When the mem_tag_format value only has the lower bits set (< 256), interpret the format as specific options. Two new options are defined, one aligned with MPI and the other with CCLs. This information can be used by providers to optimize for the separate use cases. Signed-off-by: Sean Hefty <[email protected]>
Add 16-bit and 8-bit floating point datatypes that are often used for machine learning and AI applications. Add atomic diff operation. Signed-off-by: Jianxin Xiong <[email protected]>
Today, `ep_attr->max_msg_size` and `tx_attr->inject_size` define the transport's size limitation for a single data transfer operation. It lacks the flexibility to allow providers having different limits on different categories of operations (msg/tagged/rma/atomic). At the same time, applications can't express usage models that have very different size limits (e.g, only need small send/recv, but want large RMA). Here a set of options are added to the fi_getopt()/fi_setopt() interface. These options allow the maximum message size and inject size be retrieved or set independently for msg/tagged/rma/atomci operations. The existing definition of `ep_attr->max_msg_size` and `tx_attr->inject_size` remain unchanged -- they are the upper bound of the limits for all the data transfer functions. A typical application will call fi_getopt() to retrieve the default values first, and call fi_setopt() if lower values are desired. If -FI_ENOPROTOOPT is returned, `ep_attr->max_msg_size` and `tx_attr->inject_size` should be used. Signed-off-by: Jianxin Xiong <[email protected]>
Dear all, Are there any published schedule for the 2.0 API/library release and/or any a specific public mailing list or meeting where the ideas/release plans are discussed? Thank you very much ! |
For 8f17f05, I think it doesn't have to be ofi 2.0 only, can we make a separate PR that can be merged to main and we can release it in a future 1.x release |
There is a roadmap here: Discussions are usually held on the bi-weekly ofiwg meeting on Tuesdays mornings: https://www.openfabrics.org/my-calendar/ |
@shijin-aws Yes, we can do that. |
Here are the proposed changes for OFI 2.0, most of which are cherry-picked from #9384 and modified to fit current tip of main branch. Since the sockets and usnic provider have not been removed as suggested by #9384, extra work is needed to keep them in sync with the changes.
More commits are going to be added to this PR. The merging of this PR will embark the forking of
v1.x-main
.