Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lapack][blas][cuda] Update host task impl to use enqueue_native_command #572

Merged
merged 11 commits into from
Oct 8, 2024

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Sep 19, 2024

Description

Update host task impl to use enqueue_native_command for blas/lapack using the cuda backend (cublas/cusolver). I did both backends in a single PR because the cusolver backend uses the cublas backend of oneMKL.

The sycl_ext_codeplay_enqueue_native_command extension reduces latency wrt the host_task for native library submissions, and allows integration with sycl task_graph / events. See https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc
for details.

This extension has already been shown to lead to considerable performance improvements for applications that call oneMKL, such as Gromacs for the oneMKL fft backend. We expect similar improvements for the lapack and blas backends implemented here.

I had to update the lapack tests because they previously relied on the synchronous behaviour of the native calls due to the fact we had to sync the native streams, since previously with standard host_task we are not able to integrate the native event into the sycl task_graph/ sycl::event.
I did not need to update the blas tests since they already take into account asynchronous behaviour.

Checklist

#216 is for the most part fixed, but technically this PR maximally enables ooo queue interoperability so we can say that this
fixes #216

All Submissions

  • Do all unit tests pass locally? Attach a log.

I've added a test for each backend for each of the possible codepaths:

  • SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND is defined: "..native_command"
  • SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND is not defined so use previous code path with standard host_task

test_main_blas_ct_host_task.txt
test_main_blas_rt_host_task.txt
test_main_lapack_rt_native_command.txt
test_main_lapack_ct_native_command.txt
test_main_lapack_ct_host_task.txt
test_main_lapack_rt_host_task.txt
test_main_blas_ct_res_native_command.txt
test_main_blas_rt_res_native_command.txt

See SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND for details.

Signed-off-by: JackAKirk <[email protected]>
See SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND extension document for
details.

Generalize helpers funcs and use them for blas l1, l2, l3, batch

Signed-off-by: JackAKirk <[email protected]>
cublas_native_named_func

Signed-off-by: JackAKirk <[email protected]>
Copy link
Contributor

@Rbiessy Rbiessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the PR

src/blas/backends/cublas/cublas_batch.cpp Outdated Show resolved Hide resolved
@JackAKirk
Copy link
Contributor Author

@hdelan please review this when you are back.

@JackAKirk
Copy link
Contributor Author

JackAKirk commented Sep 25, 2024

I have a small patch ready to update cublas backend just a little to implement missing GEMV_BATCH.
This I think puts cublas backend to a status where everything that maps directly between oneMKL and cublas APIs is supported to some degree (some types remain unimplemented, such as bfloat16/some mixed precisions already identified in the issues board etc).
Is it OK for me to add it here, to save the PR review overhead?
@Rbiessy what do you think?

@Rbiessy
Copy link
Contributor

Rbiessy commented Sep 25, 2024

I would prefer to have a separate PR to make it clearer which commit implements what.

@JackAKirk
Copy link
Contributor Author

I would prefer to have a separate PR to make it clearer which commit implements what.

Yeah OK, I'll be patient, thanks.

@JackAKirk
Copy link
Contributor Author

Hi @oneapi-src/onemkl-blas-write @oneapi-src/onemkl-lapack-write
would it be possible for you to review this?

Thanks

Copy link
Contributor

@hdelan hdelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BLAS part looks good, thank you. I'm less familiar with the LAPACK part.

Copy link
Contributor

@sknepper sknepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please provide more details about why the LAPACK tests required updating? I see your comment in the description of the PR, but I still have a few questions. Thanks!
Also, if the LAPACK tests do need updating, how will that impact testing the other (non-cuSOLVER) backends?

tests/unit_tests/lapack/source/gebrd.cpp Outdated Show resolved Hide resolved
tests/unit_tests/lapack/source/gebrd.cpp Outdated Show resolved Hide resolved
tests/unit_tests/lapack/source/gebrd.cpp Outdated Show resolved Hide resolved
this dep check is overzealous because it enforces that a dependent event cannot be submitted to run on the native device queue but not
completed before a later event it is dependent upon has also been marked
running on the device. This is not part of the sycl spec and
unnecessarily slows down execution.

Signed-off-by: JackAKirk <[email protected]>
These funcs are async in the cusolver backend.

Signed-off-by: JackAKirk <[email protected]>
Copy link
Contributor

@sknepper sknepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work here to improve performance, @JackAKirk !

Signed-off-by: JackAKirk <[email protected]>
This reverts commit 61c9a53.
Signed-off-by: JackAKirk <[email protected]>
@Rbiessy Rbiessy merged commit 7adfbcc into uxlfoundation:develop Oct 8, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CUSOLVER] cuSOLVER handler does not support multiple streams
5 participants