[v1.22.x] prov/efa: Handle recv cancel for zero copy recv #10218

Merged on Jul 25, 2024 (1 commit)


shijin-aws
Contributor

Backport of #10215.

Currently, posted receives are not tracked in zero-copy mode,
which breaks various ep operations, including fi_cancel. This patch fixes the
issue by introducing a user_recv_rxe_list that tracks posted user receives
and implementing the fi_cancel operation for this list.

Signed-off-by: Shi Jin <[email protected]>
(cherry picked from commit 2cffc27)
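
The tracking idea in the description can be illustrated with a minimal sketch. It assumes a sentinel-based doubly linked list and cancellation by user context. Only the name `user_recv_rxe_list` comes from the description; `struct rxe`, `struct ep`, `ep_init`, `ep_post_recv`, `rxe_untrack`, `ep_cancel`, and the `-1` "not found" return value are hypothetical and do not reflect the actual EFA provider code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive entry; not the provider's real structure. */
struct rxe {
    void *context;      /* user context supplied when the recv was posted */
    int cancelled;      /* set to 1 once the entry has been cancelled */
    struct rxe *prev;
    struct rxe *next;
};

/* Hypothetical endpoint; the list head is a sentinel node. */
struct ep {
    struct rxe user_recv_rxe_list;
};

/* Initialize the endpoint with an empty (self-linked) list. */
void ep_init(struct ep *ep)
{
    struct rxe *head = &ep->user_recv_rxe_list;

    head->context = NULL;
    head->cancelled = 0;
    head->prev = head;
    head->next = head;
}

/* Track a newly posted user recv by appending it to the list tail. */
void ep_post_recv(struct ep *ep, struct rxe *rxe, void *context)
{
    struct rxe *head = &ep->user_recv_rxe_list;

    assert(rxe != NULL);
    rxe->context = context;
    rxe->cancelled = 0;
    rxe->prev = head->prev;
    rxe->next = head;
    head->prev->next = rxe;
    head->prev = rxe;
}

/* Stop tracking an entry, e.g. on completion or cancellation. */
void rxe_untrack(struct rxe *rxe)
{
    rxe->prev->next = rxe->next;
    rxe->next->prev = rxe->prev;
    rxe->prev = rxe;
    rxe->next = rxe;
}

/*
 * Cancel the first tracked recv whose context matches.
 * Returns 0 on success, -1 (placeholder error code) if nothing matches.
 */
int ep_cancel(struct ep *ep, void *context)
{
    struct rxe *head = &ep->user_recv_rxe_list;
    struct rxe *cur;

    for (cur = head->next; cur != head; cur = cur->next) {
        if (cur->context == context) {
            rxe_untrack(cur);
            cur->cancelled = 1;
            return 0;
        }
    }
    return -1;
}
```

Without such a list, a cancel request has nothing to search, which is the gap the description refers to. A real implementation would also need to report the cancellation to the application and release provider resources, which this sketch omits.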
@shijin-aws
Contributor Author

shijin-aws commented Jul 24, 2024

@j-xiong Does this have to be merged by 5 pm to catch the 1.22.0 release, or is merging #10215 enough?

@j-xiong
Contributor

j-xiong commented Jul 24, 2024

@shijin-aws No. It's enough to have the PR here. I can monitor the CI and get it merged before I create my updated version of #10211.

@shijin-aws
Contributor Author

@j-xiong Awesome, thx!

@j-xiong
Contributor

j-xiong commented Jul 25, 2024

@shijin-aws What is the AWS CI error about? Is it the same "out of memory" error we saw in #10216?

@shijin-aws
Contributor Author

Yeah, not sure why. Either way, we may need to reduce the message size in this test:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.66.252 'timeout 1800 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10218-debug/install/fabtests/bin/fi_unexpected_msg -e rdm -M 1024 -I 5 -v -S 1048576 -D cuda -i 0 -p shm -E=9228'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.66.252 'timeout 1800 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10218-debug/install/fabtests/bin/fi_unexpected_msg -e rdm -M 1024 -I 5 -v -S 1048576 -D cuda -i 0 -p shm -E=9228 172.31.66.252'"'"''
client_stdout:
[error] fabtests:common/hmem_cuda.c:305: cudaMalloc failed: cudaErrorMemoryAllocation out of memory
timeout: the monitored command dumped core

client returncode: 255
server_stdout:

server returncode: 124

@shijin-aws
Contributor Author

I opened PR #10219 to remove that 1M test; we'll see how it goes.

@j-xiong
Contributor

j-xiong commented Jul 25, 2024

bot:aws:retest

@j-xiong merged commit c21b2ad into ofiwg:v1.22.x on Jul 25, 2024
8 of 9 checks passed