Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hmem/rocr: Add ROCr dmabuf support under verbs #9618

Merged
merged 2 commits into from
Dec 5, 2023

Conversation

AtlantaPepsi
Copy link
Contributor

  • Added get_dmabuf_fd for ROCr under HMEM interface, it acquires dmabuf handle through HSA runtime and is now invoked by verbs providers.
  • 3 conditions for support of this functionality: ROCr support, Linux kernel support, and a user set environmental variable FI_HMEM_ROCR_USE_DMABUF (please let us know if naming is ok) as a switch that will be checked during runtime
  • currently for verbs, if dmabuf handle cannot be successfully acquired for any reason, it will opt to not register any memory region and just crash. We are not sure if this is intended or a bug, but tentatively we fixed this and make it invoke failover routine in case of failure.

@AtlantaPepsi
Copy link
Contributor Author

@edgargabriel @nusislam for additional review

src/hmem_rocr.c Outdated
bool dmabuf_support = false, dmabuf_kernel = false;

char *s;
if (!(s = getenv("FI_HMEM_ROCR_USE_DMABUF")) || !atoi(s)) goto out;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use fi_param_define() and fi_param_get_*() to define and read environment variables. Only use getenv for variables defined externally (e.g. MPI).

src/hmem_rocr.c Outdated
}

#if HAVE_HSA_AMD_PORTABLE_EXPORT_DMABUF
const char kernel_opt1[] = "CONFIG_DMABUF_MOVE_NOTIFY=y";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't interleave declarations with statements. Either move the declarations up or add braces around the entire block. Same for the next function.

src/hmem_rocr.c Outdated
}

int rocr_hmem_get_dmabuf_fd(const void *addr, uint64_t size, int *dmabuf_fd,
uint64_t *offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment is off.

src/hmem_rocr.c Outdated
Comment on lines 1070 to 1071

char *s;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the empty line after this line.

src/hmem_rocr.c Outdated
// HSA_AMD_SYSTEM_INFO_DMABUF_SUPPORTED = 0x204, we use the number instead of var name
// for backward compatibility reasons.
hsa_ret = hsa_ops.hsa_system_get_info((hsa_system_info_t) 0x204, &dmabuf_support);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: here we usually don't add the empty line between the call and the return value checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the return value checking clause after this should also have { at same line?

int rocr_hmem_get_dmabuf_fd(const void *addr, uint64_t size, int *dmabuf_fd,
uint64_t *offset)
{
static bool supported = false, checked = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an empty line here.

src/hmem_rocr.c Outdated
{
static bool supported = false, checked = false;
if (!checked)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move { to the end of the previous line.

src/hmem_rocr.c Outdated
hsa_ret = hsa_ops.hsa_amd_portable_export_dmabuf(addr, size, dmabuf_fd, offset);

if (hsa_ret != HSA_STATUS_SUCCESS)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

src/hmem_rocr.c Outdated
@@ -1054,6 +1063,100 @@ int rocr_dev_reg_copy_from_hmem(uint64_t handle, void *dest, const void *src,
return FI_SUCCESS;
}

bool rocr_is_dmabuf_supported(void)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function intended to be used anywhere else? If not, maybe define it as static. If yes, consider moving the checked logic from the caller side to here to avoid duplicated checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense and I made it static, the hmem_cuda counterpart also have this function but non-static, I wonder if it had a specific reason there?

@shijin-aws
Copy link
Contributor

bot:aws:retest

@AtlantaPepsi
Copy link
Contributor Author

@j-xiong thank you for the review and comments, could you take a look at the updated commit as well?

Copy link
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please squash the third commit into the first one and update the PR by "git push -f". We don't do auto-squash here and don't expect one commit fixing another commit in the same PR.

src/hmem_rocr.c Outdated
return -FI_EOPNOTSUPP;

#if HAVE_HSA_AMD_PORTABLE_EXPORT_DMABUF
hsa_status_t hsa_ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@j-xiong would you please check again?

@j-xiong
Copy link
Contributor

j-xiong commented Dec 4, 2023

Windows build error:

  C:\projects\libfabric\src\hmem_rocr.c(42,1): fatal error C1083: Cannot open include file: 'sys/utsname.h': No such file or directory [C:\projects\libfabric\libfabric.vcxproj]

@j-xiong
Copy link
Contributor

j-xiong commented Dec 4, 2023

I think this #include line can be moved below #if HAVE_ROCR.

@shijin-aws
Copy link
Contributor

bot:aws:retest

@AtlantaPepsi
Copy link
Contributor Author

AtlantaPepsi commented Dec 4, 2023

bot:aws:retest

I think this #include line can be moved below #if HAVE_ROCR.

@j-xiong how about now?
Is windows build a part of CI, too? didn't spot it

@j-xiong
Copy link
Contributor

j-xiong commented Dec 4, 2023

@AtlantaPepsi Let's wait for the CI test result. FYI, the appveyor CI is the windows build test.

@AtlantaPepsi
Copy link
Contributor Author

@j-xiong I think all checks have been passed

@j-xiong j-xiong merged commit 0b65c16 into ofiwg:main Dec 5, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants