forked from ofiwg/libfabric
uring enablement #8
Open
ooststep wants to merge 337 commits into main from uring
+31,077
−13,786
ooststep force-pushed the uring branch 3 times, most recently from 9c8b7eb to 9ebf395 on August 14, 2024 18:34
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.4.0 to 4.4.1. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@5076954...604373d) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.26.10 to 3.26.11. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@e2b3eaf...6db8d63) --- updated-dependencies: - dependency-name: github/codeql-action dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
When zcpy rx is on, both inject_msg_size and inject_rma_size should be reported as inline buf size. Signed-off-by: Shi Jin <[email protected]>
A provider may update the memory region that is added to accommodate, for instance, alignment of the region to a larger page boundary. In such cases, the MR cache info used to search the cache should use the updated region. This allows the provider to avoid walking /proc/pid/smaps when the underlying kernel component can determine the backing page size more efficiently. Signed-off-by: Steve Welch <[email protected]> Signed-off-by: Ian Ziemba <[email protected]>
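The region adjustment described in this commit can be pictured with a minimal sketch. It assumes the provider widens the requested range to a page boundary before the cache lookup; `struct region` and `align_region` are hypothetical names for illustration and are not part of libfabric.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address range tracked by an MR cache entry. */
struct region {
    uintptr_t base;
    size_t len;
};

/*
 * Widen a region so that it starts and ends on a page boundary.
 * page_size must be a power of two.
 */
static struct region align_region(struct region in, size_t page_size)
{
    uintptr_t mask = (uintptr_t)page_size - 1;
    uintptr_t start = in.base & ~mask;
    uintptr_t end = (in.base + in.len + mask) & ~mask;
    struct region out = { start, (size_t)(end - start) };

    return out;
}
```

A cache search keyed on the widened region finds the entry that was actually inserted, where a search keyed on the caller's original range could miss it.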
Applications should use the FI_MR_DMABUF API to pass the dmabuf fd and offset so that Libfabric registers the mr via dmabuf. The only exception is synapseai, because dmabuf is the only way to register a Gaudi device buffer and it was implemented before the FI_MR_DMABUF API. Keep this behavior unchanged for compatibility. Signed-off-by: Shi Jin <[email protected]>
psm3 is failing onecclgpu because of a missing package. Disable it until the package dependency is resolved. Signed-off-by: Zach Dworkin <[email protected]>
Remove deprecated FI_MR_BASIC flag Signed-off-by: Tadeusz Struk <[email protected]>
This patch adds the missing inband sync in ft_fabric_init_cm to handle the case where rx buffers are not pre-posted by the application. The default behaviour in fabtests is to pre-post a rx buffer. This change enables fabtests using ft_fabric_init_cm to consume the posted receive with an inband sync by setting the test option FT_OPT_NO_PRE_POSTED_RX. Similar changes have been made to ft_init_fabric ofiwg#10394 Signed-off-by: Nikhil Nanal <[email protected]>
efa_mr_hmem_setup previously always called ofi_hmem_dev_register on all FI_HMEM_CUDA calls, regardless of the presence of FI_MR_DMABUF in flags. When gdrcopy is enabled, this means deconstructing the fi_mr_dmabuf into a struct iovec from its {base, offset, len} 3-tuple, then passing the resulting iovec to gdr_pin followed by gdr_map. A dmabuf cannot be exported by the nvidia module without an implicit promise that the address space is already reserved and mapped in the current pid, of appropriate size and alignment, and that all pages/ranges backing it can be made available to an importer. All requirements are enforced by the cuda APIs used to acquire one. At best, calls to libgdrcopy here are unnecessary for dmabufs, and at worst the pgprots set by gdrdrv are different enough from the ones set up by cuda proper to cause issues, or the redundant mappings become costly for the driver to maintain. Prior to this patch, apps could only prevent these gdr_map calls on dmabuf arguments by disabling gdrcopy entirely through environment variables before launch. But apps may wish to use fi_mr_regattr with dmabuf arguments in the default case, while still reserving the right to call fi_mr_regattr with iov arguments on the same domain, where the gdr flow may still be desired. This makes that possible. Signed-off-by: Nicholas Sielicki <[email protected]>
fi_multinode command line arguments changed. Update script to accommodate the change. Signed-off-by: Amir Shehata <[email protected]>
set FT_OPT_ADDR_IS_OOB by default. It enables out of band address exchange which is needed by CXI. Signed-off-by: Amir Shehata <[email protected]>
Signed-off-by: Peinan Zhang <[email protected]>
Add clarification in the man page indicating that the owner is responsible for creating unique fi_peer_*_contexts for each peer and that the peers are only allowed to set the peer ops of that context. Signed-off-by: Alexia Ingerson <[email protected]>
The peer API has been updated to specify that the owner must allocate the peer's fid_peer_srx. The shm implementation was allocating its own internal fid_peer_srx. This updates the shm implementation to assume it has a unique fid_peer_srx and updates the imported fid_peer_srx peer_ops, saving a pointer to the fid_peer_srx instead of the internal fid_ep, which required a wrapper function to get back to the fid_peer_srx. It also returns an internal fid_ep for the created srx, which is used by the owner to close the srx. Even though shm doesn't need anything attached to the internal fid_ep, it is there for consistency and to track the domain reference counting for errors. This patch also moves the srx specific functions into smr_domain where they belong. Signed-off-by: Alexia Ingerson <[email protected]>
The previous definition of the peer API didn't specify who allocated the second peer structure (the one referenced by the peer). The shm implementation was choosing to duplicate the imported srx and set it internally. The new definition specifies that the owner handle the duplication of the peer resource which is then imported into the peer to just set. Shm has been updated accordingly but efa needs to be updated to create a second peer_srx and set the fields to the original one for the peer to reference the owner_ops correctly. This also adds a missing fi_close for the shm srx resource Signed-off-by: Alexia Ingerson <[email protected]> Signed-off-by: Shi Jin <[email protected]>
Signed-off-by: OFIWG Bot <[email protected]>
Signed-off-by: Zach Dworkin <[email protected]>
Signed-off-by: Zach Dworkin <[email protected]>
In order to receive unmap events, uffd uses 'mode missing' when registering memory regions. This implies getting page fault events as well. So handle them by returning a zero-filled page. Page faults come in 3 flavors: reads, writes and writes to protected pages. The only ones we can handle are writes to non-backed pages. Signed-off-by: Mike Uttormark <[email protected]> Signed-off-by: Ian Ziemba <[email protected]>
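The fault triage described here can be sketched as a small predicate. This is an illustration under assumptions, not the code from the commit: it reads the `UFFD_PAGEFAULT_FLAG_WRITE` and `UFFD_PAGEFAULT_FLAG_WP` bits that the kernel reports in the page-fault message flags, and `fault_is_resolvable` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/*
 * Decide whether a page-fault event can be resolved with a zero-filled page.
 * Per the commit text, only writes to non-backed pages qualify: the WRITE
 * flag is set and the write-protect (WP) flag is clear.
 */
static bool fault_is_resolvable(uint64_t pagefault_flags)
{
    bool is_write = (pagefault_flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0;
    bool is_wp = (pagefault_flags & UFFD_PAGEFAULT_FLAG_WP) != 0;

    return is_write && !is_wp;
}
```

A monitor thread would call such a predicate on the flags of each `UFFD_EVENT_PAGEFAULT` message and treat the other two flavors as errors.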
Bumps [actions/checkout](https://github.com/actions/checkout) from 4.2.0 to 4.2.1. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@d632683...eef6144) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.26.11 to 3.26.13. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@6db8d63...f779452) --- updated-dependencies: - dependency-name: github/codeql-action dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.4.1 to 4.4.3. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@604373d...b4b15b8) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Itai Masuari <[email protected]>
Fix incorrect atomic LOR on complex numbers. The values were incorrectly getting ANDed together instead of ORed. This went unnoticed because the code was very difficult to read. This also refactors the logical checks with a helper function to make it more readable and less error-prone. Signed-off-by: Alexia Ingerson <[email protected]>
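As a sketch of what a helper-based logical OR on complex values might look like, assuming C99 `<complex.h>` types rather than libfabric's own ofi_complex definitions (the names `complex_is_true` and `complex_lor` are hypothetical):

```c
#include <complex.h>
#include <stdbool.h>

/* A complex value is logically true when it is not equal to zero. */
static bool complex_is_true(double complex v)
{
    return v != 0;
}

/* Logical OR of two complex operands; the result is 1+0i or 0+0i. */
static double complex complex_lor(double complex a, double complex b)
{
    return (complex_is_true(a) || complex_is_true(b)) ? 1.0 : 0.0;
}
```

Naming the truth test once keeps the operator visible at the call site, which is the readability point the commit makes: an `&&` in place of `||` is much easier to spot here than inside an expanded real/imaginary comparison.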
This allows fabtests to make use of atomic validation code. There were many Windows atomics bugs, inconsistencies, and missing definitions. This patch also cleans up the entire ofi_atomic.c implementation for unix and windows. The following changes are included: - Separate fill and check based on real or complex types, as setting and reading complexes on windows is not allowed (not native datatype, abstracted). Complex versions use eq and set functions specific for complexes defined in osd.h - Remove duplicated ofi_complex definitions in ofi_atomic (already in osd.h file) - Add general check_atomic and fill_atomic calls and use them in ubertest - Add EXPAND ( x ) x define to work nicely with windows VA_ARGS handling - Fix inconsistency with ofi_complex_type/or naming ('complex' should always come first) - Fix inconsistency with op names "equ" and "mul" -> "eq" and "prod" - Add missing lxor complex op definitions on Windows Signed-off-by: Alexia Ingerson <[email protected]>
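The `EXPAND` define mentioned in the list is a common workaround for the traditional MSVC preprocessor, which forwards `__VA_ARGS__` to a nested macro as a single argument. A minimal sketch of the pattern follows; `PICK_THIRD` and `COUNT_ARGS` are hypothetical names chosen for illustration and do not come from ofi_atomic.

```c
/* Identity macro: forces one more expansion pass over its argument. */
#define EXPAND(x) x

/* Select the third argument of a variadic list. */
#define PICK_THIRD(a, b, c, ...) c

/*
 * Evaluates to 2 when called with two arguments and to 1 when called with one.
 * On conforming preprocessors EXPAND is a no-op; on the traditional MSVC
 * preprocessor it makes __VA_ARGS__ split into separate arguments again.
 */
#define COUNT_ARGS(...) EXPAND(PICK_THIRD(__VA_ARGS__, 2, 1, 0))
```

With this in place, `COUNT_ARGS(a)` yields 1 and `COUNT_ARGS(a, b)` yields 2 on both kinds of preprocessor.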
To properly validate atomic data, we need host bounce buffers for the result and compare buffers in addition to the regular bounce buffer for the tx/rx bufs. This adds two extra bufs allocated only for atomic purposes and adds hmem support to the common atomic validation path. It also renames the alloc/free_tx_buf calls to generic alloc/free_host_bufs which allocates all three buffers at once. Signed-off-by: Alexia Ingerson <[email protected]>
Signed-off-by: Jianxin Xiong <[email protected]>
Signed-off-by: Jessie Yang <[email protected]>
pingpong doesn't support FI_MR_ENDPOINT today, so the mr is associated with domain instead of ep. It is unsafe to close mr before closing ep because it can cause an EBUSY error when there are outstanding recvs of the mr posted to the ep/qp. This patch fixes this issue by moving the mr close after the ep close. Signed-off-by: Shi Jin <[email protected]>
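The ordering problem can be shown with a toy model. Everything below is hypothetical (`toy_mr`, `toy_ep`, and both close functions are illustrative, not libfabric types or calls); it only assumes, as the commit states, that an mr cannot be closed while receives posted to the endpoint still reference it.

```c
#include <errno.h>

/* Toy model, not libfabric types: an MR counts receives that still use it. */
struct toy_mr {
    int posted_recvs;
    int closed;
};

struct toy_ep {
    struct toy_mr *mr;
    int closed;
};

/* Closing the endpoint flushes its outstanding receives. */
static int toy_ep_close(struct toy_ep *ep)
{
    ep->mr->posted_recvs = 0;
    ep->closed = 1;
    return 0;
}

/* Closing the MR fails while receives that reference it are outstanding. */
static int toy_mr_close(struct toy_mr *mr)
{
    if (mr->posted_recvs > 0)
        return -EBUSY;
    mr->closed = 1;
    return 0;
}
```

In this model, closing the mr first returns `-EBUSY`, while closing the endpoint and then the mr succeeds, which mirrors the reordering in the patch.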
Signed-off-by: Zach Dworkin <[email protected]>
Uplevel pre-build directory so that it is not scp'd Signed-off-by: Zach Dworkin <[email protected]>
Signed-off-by: Zach Dworkin <[email protected]>
Signed-off-by: Zach Dworkin <[email protected]>
Put slow stages first so they start executing and other tests can complete in parallel while the slow one is running. Signed-off-by: Zach Dworkin <[email protected]>
ooststep force-pushed the uring branch 4 times, most recently from fedf1c5 to 7617ac3 on December 17, 2024 22:25
…ions fixed fabtests send and recv functions to use uint64_t instead of int as the flags argument type, as the underlying fi calls use uint64_t. Removed declaration of unused function ft_writemsg from shared.h. Also fixed functions calling ft_sendmsg and ft_recvmsg to use uint64_t for flags. Signed-off-by: Nikhil Nanal <[email protected]>
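Why the parameter type matters can be seen with a short sketch. It assumes a 32-bit `int` and uses a made-up flag, `DEMO_FLAG_HIGH`, that sits above bit 31; neither the flag nor the helper names come from fabtests.

```c
#include <stdint.h>

/* Hypothetical flag that lives above bit 31; not a real libfabric value. */
#define DEMO_FLAG_HIGH (1ULL << 40)

/* Old-style helper: the int parameter cannot carry bits above 31. */
static uint64_t forward_flags_int(int flags)
{
    return (uint64_t)flags;
}

/* Fixed helper: the parameter type matches the underlying 64-bit flags. */
static uint64_t forward_flags_u64(uint64_t flags)
{
    return flags;
}
```

Passing `DEMO_FLAG_HIGH` through the `int` version loses the bit before it reaches the callee, while the `uint64_t` version forwards it unchanged.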
Look up all teams and users in the ofiwg github team. If the submitter is not in the list of users, deny them. Signed-off-by: Zach Dworkin <[email protected]>
lpp includes stdatomic.h but does not include a check for it in the configure so can cause a build to fail on a system without it Signed-off-by: Alexia Ingerson <[email protected]>
Could result in a peer getting incorrectly unmapped Signed-off-by: Alexia Ingerson <[email protected]>
This commit fixes the following bugs in neuron fabtests 1. The neuron accelerator detection is broken on some OSs because the full path of the executable `neuron-ls` was not used 2. Before this commit, each pytest worker was assigned a single neuron core. This works on multi node tests but fails on single node tests because a neuron core can only be opened by a single process. This commit assigns two different neuron cores to each pytest worker for client-server tests: one for the server and one for the client. Trn1 has 2 cores per neuron device and Trn2 has 8 cores per neuron device, so this assignment works for both. 3. When running in serial mode, the env var PYTEST_XDIST_WORKER is not set, so the NEURON_RT_VISIBLE_CORES env var is also not set. This causes the server to occupy all neuron cores and the client fails. So this commit assigns device 0 to the server and client when running with one worker. Signed-off-by: Sai Sunku <[email protected]>
Before this change, the EFA AV entry contained a reference to efa_rdm_peer which is specific to a given endpoint. This member also prevented binding a single AV to multiple endpoints. This change removes efa_rdm_peer from AV entry by adding a hashmap to the endpoint that maps fi_addr to efa_rdm_peer. And it also enables multiple EFA endpoints to bind to the same AV. Co-authored-by: Shi Jin <[email protected]> Signed-off-by: Sai Sunku <[email protected]>
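The idea of moving peer state from the AV entry into a per-endpoint map can be sketched as follows. This is a simplified illustration, not the EFA provider's data structure: `demo_peer`, `demo_ep`, and the fixed-size open-addressing table are all assumptions, and the key is a plain `uint64_t` because `fi_addr_t` is a 64-bit integer.

```c
#include <stddef.h>
#include <stdint.h>

#define PEER_MAP_SLOTS 64 /* power of two; hypothetical fixed capacity */

/* Hypothetical per-endpoint peer state; stands in for efa_rdm_peer. */
struct demo_peer {
    uint64_t addr;
    int in_use;
    int next_msg_id;
};

/* Each endpoint owns its own map, so one AV can serve many endpoints. */
struct demo_ep {
    struct demo_peer peers[PEER_MAP_SLOTS];
};

/* Multiplicative hash; the top 6 bits select one of the 64 slots. */
static size_t slot_of(uint64_t addr)
{
    return (size_t)((addr * 0x9E3779B97F4A7C15ULL) >> 58);
}

/* Return the peer for addr, creating it on first use; NULL if the map is full. */
static struct demo_peer *demo_ep_get_peer(struct demo_ep *ep, uint64_t addr)
{
    size_t start = slot_of(addr);

    for (size_t i = 0; i < PEER_MAP_SLOTS; i++) {
        struct demo_peer *p = &ep->peers[(start + i) % PEER_MAP_SLOTS];

        if (p->in_use && p->addr == addr)
            return p;
        if (!p->in_use) {
            p->in_use = 1;
            p->addr = addr;
            p->next_msg_id = 0;
            return p;
        }
    }
    return NULL;
}
```

Two endpoints that look up the same address get independent peer objects, which is what allows them to share a single AV.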
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.27.6 to 3.27.9. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@aa57810...df409f7) --- updated-dependencies: - dependency-name: github/codeql-action dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Seth Zegelstein <[email protected]>
Signed-off-by: OFIWG Bot <[email protected]>
Remove all CXI_MAP_IOVA_ALLOC references from libfabric. Signed-off-by: Soumendu Satapathy <[email protected]>
Currently, when the local side supports unsolicited write recv while the peer doesn't support it, the peer will crash because it expects to get a valid wr_id for the IBV_WC_RECV_RDMA_WITH_IMM op code. This peer crash can cause confusing error messages on the sender side's cq while it is still sending data to the peer. When the local side doesn't support unsolicited write recv while the peer supports it, the local side will also get a cq error for the rdma op, reported as "Unexpected status". This patch makes the initiator of rdma write imm detect the unsolicited write recv support status on both sides. If there is an inconsistency, the initiator will return an error with a clear message describing the mitigation. Signed-off-by: Shi Jin <[email protected]>
efa_fork_support_enable_if_requested was moved to EFA_INI, so efa_fork_support_install_fork_handler can be registered at any stage that is later. Move efa_fork_support_install_fork_handler back to efa_domain_open to avoid installing fork handler for non-EFA provider during fi_getinfo's provider discovery process. Signed-off-by: Jessie Yang <[email protected]>
This commit disables most Intel CI and should not be merged.
we may receive uring events before we're fully connected so don't try to progress rx until that connection is established
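A guard of this kind might look like the sketch below. The state names, the `demo_conn` structure, and the choice to count early events and replay them once connected are all assumptions made for illustration; the commit only states that rx is not progressed until the connection is established.

```c
#include <stdbool.h>

/* Hypothetical connection states; the real provider's states may differ. */
enum demo_conn_state { DEMO_CONNECTING, DEMO_CONNECTED };

struct demo_conn {
    enum demo_conn_state state;
    int deferred_rx_events; /* events seen before the connection was up */
    int rx_progressed;      /* events actually handed to rx progress */
};

/* Called for each receive-side completion event. */
static bool demo_on_rx_event(struct demo_conn *c)
{
    if (c->state != DEMO_CONNECTED) {
        c->deferred_rx_events++; /* remember it, but do not progress rx yet */
        return false;
    }
    c->rx_progressed++;
    return true;
}

/* Called once the connection handshake finishes. */
static void demo_on_connected(struct demo_conn *c)
{
    c->state = DEMO_CONNECTED;
    c->rx_progressed += c->deferred_rx_events; /* replay what arrived early */
    c->deferred_rx_events = 0;
}
```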
the previously used io_uring_prep_readv function does not support flags; instead, flags were being passed as an offset, triggering an illegal seek error
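The same class of error can be reproduced without liburing. In liburing, the last parameter of `io_uring_prep_readv` is a file offset, whereas `io_uring_prep_recv` takes a flags argument. The sketch below is only an analogy using plain system calls: `pread` stands in for the offset-taking call and `recv` for a flag-taking one, and both function names are hypothetical. Sockets are not seekable, so the offset-taking call fails with `ESPIPE` ("Illegal seek").

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from a socket with an offset-taking call; returns the errno observed. */
static int read_with_flags_as_offset(void)
{
    int sv[2];
    char byte = 'x', out = 0;
    int err = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return errno;
    if (write(sv[0], &byte, 1) != 1) {
        err = errno;
    } else if (pread(sv[1], &out, 1, (off_t)MSG_DONTWAIT) < 0) {
        /* The flags value landed in the offset slot; sockets cannot seek. */
        err = errno;
    }
    close(sv[0]);
    close(sv[1]);
    return err;
}

/* Read from a socket with a flag-taking call; returns the byte or -1. */
static int read_with_recv(void)
{
    int sv[2];
    char byte = 'x', out = 0;
    int result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[0], &byte, 1) == 1 &&
        recv(sv[1], &out, 1, MSG_DONTWAIT) == 1)
        result = out;
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

The first helper reports `ESPIPE`, while the second delivers the byte, because the flags reach a parameter that is actually interpreted as flags.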
multishot is not supported on older kernels (prior to 5.19) and is unreliable in early 6.x kernels. For now, use single-shot and re-submit
No description provided.