Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/shm,rxm,efa, fabtests: fix min multi recv size setting #10618

Merged
merged 4 commits into from
Dec 10, 2024

Conversation

aingerson
Copy link
Contributor

Fixes #10591

shm uses the util srx and sets the minimum multi receive size through
the srx. However, the srx code doesn't get initialized until the endpoint
gets enabled. So if the application calls setopt (before FI_ENABLE), this
will segfault because the srx has not been initialized. Instead, we need
to save the multi recv size in the shm endpoint to be valid during setopt
and then pass that into the util_srx creation to set the multi recv size

Signed-off-by: Alexia Ingerson <[email protected]>
rxm uses the util srx and sets the minimum multi receive size through
the srx. However, the srx code doesn't get initialized until the endpoint
gets enabled. So if the application calls setopt (before FI_ENABLE), this
will segfault because the srx has not been initialized. Instead, we need
to save the multi recv size in the rxm endpoint to be valid during setopt
and then pass that into the util_srx creation to set the multi recv size

Signed-off-by: Alexia Ingerson <[email protected]>

if (level != FI_OPT_ENDPOINT)
return -FI_ENOPROTOOPT;

if (optname == FI_OPT_MIN_MULTI_RECV) {
srx = smr_ep->srx->ep_fid.fid.context;
srx->min_multi_recv_size = *(size_t *)optval;
smr_ep->min_multi_recv_size = *(size_t *)optval;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think EFA provider has the same bug: I was copying it from shm :(

https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_ep_fiops.c#L1666-L1667

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijin-aws Oops! Sorry!! I will add the fix to efa as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the EFA code, it seems enough to remove L1666 and L1667

@aingerson aingerson requested a review from shijin-aws December 9, 2024 20:23
j-xiong
j-xiong previously approved these changes Dec 9, 2024
@shijin-aws
Copy link
Contributor

shijin-aws commented Dec 9, 2024

@aingerson Would you mind replacing your efa commit by shijin-aws@9eb3c79. I fixed a bug in our unit test which makes this bug not caught earlier. Thanks!

@shijin-aws shijin-aws changed the title prov/shm, rxm: fix min multi recv size setting for shm and rxm prov/shm, rxm, efa: fix min multi recv size setting for shm and rxm Dec 9, 2024
@shijin-aws shijin-aws changed the title prov/shm, rxm, efa: fix min multi recv size setting for shm and rxm prov/shm, rxm, efa: fix min multi recv size setting Dec 9, 2024
@j-xiong
Copy link
Contributor

j-xiong commented Dec 9, 2024

Intel CI failed with fi_multi_recv. The reason is that the test calls fi_setopt() after enabling the EP, which won't be effective with the new code.

@j-xiong j-xiong dismissed their stale review December 10, 2024 00:08

issues found in test

@shijin-aws
Copy link
Contributor

Intel CI failed with fi_multi_recv. The reason is that the test calls fi_setopt() after enabling the EP, which won't be effective with the new code.

Yikes, that explained why the bug was not discovered earlier.

@shijin-aws
Copy link
Contributor

AWS CI failure should be fixed after ingesting shijin-aws@9eb3c79

shijin-aws and others added 2 commits December 9, 2024 17:39
efa uses the util srx and sets the minimum multi receive size through
the srx. However, the srx code doesn't get initialized until the endpoint
gets enabled. So if the application calls setopt (before FI_ENABLE), this
will segfault because the srx has not been initialized. Instead, we need
to save the multi recv size in the efa endpoint to be valid during setopt
and then pass that into the util_srx creation to set the multi recv size

Signed-off-by: Alexia Ingerson <[email protected]>
Signed-off-by: Shi Jin <[email protected]>
… enable

fi_setopt has to be called before enabling an endpoint. This adds an opt arg
to allow setting this option in the common code like the other EP options.

Signed-off-by: Alexia Ingerson <[email protected]>
@aingerson
Copy link
Contributor Author

aingerson commented Dec 10, 2024

Man, that was just one after another issue. This new version should fix it now - I picked your patch @shijin-aws and I also added a patch in fabtests to call setopt before enable instead of after

@aingerson aingerson requested a review from j-xiong December 10, 2024 01:48
@aingerson aingerson changed the title prov/shm, rxm, efa: fix min multi recv size setting prov/shm,rxm,efa, fabtests: fix min multi recv size setting Dec 10, 2024
@shijin-aws shijin-aws merged commit 9e711f9 into ofiwg:main Dec 10, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prov/shm: fi_setopt causes a segmentation fault when setting FI_OPT_MIN_MULTI_RECV
3 participants