
DPC++ compile bug #483

Closed
horrorChen opened this issue Dec 5, 2023 · 9 comments · Fixed by #484

@horrorChen

In file portBLAS/include/blas_meta.h, the header file

#include <ext/oneapi/experimental/sycl_complex.hpp>

no longer exists in DPC++; instead you can use

#include <ext/oneapi/experimental/complex/complex.hpp>

which is located in llvm-dpcpp/build/include/sycl/ext/oneapi/experimental/complex/complex.hpp.

@horrorChen

horrorChen commented Dec 7, 2023

I also found another compile error when using DPC++ on an NVIDIA A100, but I can't fix it.

I want to test the performance of portBLAS GEMM on tensor cores, so I revised cmake/Modules/FindDPCPP.cmake, adding two lines to set the macro NVIDIA_GPU:

  if (${start_idx} AND ${sm_val} GREATER_EQUAL "80")
    add_definitions(-DSB_ENABLE_JOINT_MATRIX=1)
    add_definitions(-DNVIDIA_GPU=1)
    list(APPEND DPCPP_FLAGS "-Xclang;-cl-mad-enable")
    list(APPEND DPCPP_FLAGS "-DSYCL_EXT_ONEAPI_MATRIX_VERSION=4")
    list(APPEND DPCPP_FLAGS "-DSB_ENABLE_JOINT_MATRIX=1")
    list(APPEND DPCPP_FLAGS "-DNVIDIA_GPU=1")
  endif()

Meanwhile, I revised the joint_matrix_mad API call in src/operations/blas3/gemm_local_joint_matrix.hpp:830 to match the updated DPC++ signature:

        // reg_res[frag] = joint_matrix_mad(sg, inA, inB, reg_res[frag]);
        joint_matrix_mad(sg, reg_res[frag], inA, inB, reg_res[frag]);

After that I compiled portBLAS and the samples with

$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF -DCMAKE_BUILD_TYPE=Debug
$ ninja

and got this error:

portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: use of undeclared identifier 'get_wi_data'
  562 |             get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |             ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:607:9: error: use of undeclared identifier 'get_wi_data'
  607 |         get_wi_data(sg, float_out)[i] =
      |         ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:576:40: error: use of undeclared identifier 'get_wi_data'
  576 |                 static_cast<element_t>(get_wi_data(sg, reg_res[frag])[i]);
      |                                        ^

In DPC++'s sycl/include/sycl/ext/oneapi/matrix/matrix-unified.hpp, sycl::ext::oneapi::detail::get_wi_data is only used for Intel GPUs; for NVIDIA and AMD GPUs, SYCL accesses the object inside a joint_matrix jm directly via jm.matrix_impl.wi_marray, but that doesn't work for portBLAS. If I define the macro

#define get_wi_data(sg, jm) jm.matrix_impl.wi_marray

and compile, I will get another error

portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: no member named 'matrix_impl' in 'sycl::ext::oneapi::experimental::matrix::joint_matrix<sycl::sub_group, float, sycl::ext::oneapi::experimental::matrix::use::accumulator, 16, 16>'
  562 |             get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |             ^               ~~~~~~~~~
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:33:32: note: expanded from macro 'get_wi_data'
   33 | #define get_wi_data(sg, jm) jm.matrix_impl.wi_marray
      |                             ~~ ^

It seems that, since the code in gemm_local_joint_matrix.hpp is compiled as device code in DPC++, it cannot access the matrix_impl struct.

I would like to learn how to fix this so I can test GEMM performance on tensor cores.

@horrorChen

Is anyone working on this issue?🤔

@muhammad-tanvir-1211

Hi @horrorChen,
Thank you for opening this issue. The complex header problem has been addressed in #484. The joint_matrix support in portBLAS is out of date and needs to be refactored to match the latest changes in DPC++. We are currently working on addressing this and will put up the changes for review on portBLAS soon.
As for the changes you made to the FindDPCPP.cmake file: you don't need to add NVIDIA_GPU as a definition there; you can get the same behaviour by passing the -DTUNING_TARGET=NVIDIA_GPU cmake flag in your build command. Thanks.

@muhammad-tanvir-1211 muhammad-tanvir-1211 self-assigned this Dec 15, 2023
@horrorChen

Thanks for your reply @muhammad-tanvir-1211.

Actually, I found that get_wi_data is used to read the result of $AB$ and compute $\alpha AB + \beta C$ through the joint_matrix struct, but this doesn't have to be done through joint_matrix: the data can also be loaded with the joint_matrix_load API and the calculation done from there. The idea may be helpful.

Looking forward to your update.

@muhammad-tanvir-1211

Hi @horrorChen
The PR with the joint_matrix fix (#491) is now up; please build the library with the new changes and let us know if anything is still causing issues. Thanks.

@horrorChen

Hi @muhammad-tanvir-1211
I have tested the PR and sample_gemm works, but something went wrong before compilation completed.

I used the commands below.

$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF
$ ninja

The error output is:

[419/420] Linking CXX shared library libportblas.so.0.1.0
FAILED: libportblas.so.0.1.0 
.....
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach: 
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach: 
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)

Did you ever encounter this problem?

@muhammad-tanvir-1211

muhammad-tanvir-1211 commented Jan 8, 2024

Hi @horrorChen
Yes, sorry, I forgot to mention this earlier. A few symbols are missing from the compiler's NVIDIA backend; this PR (intel/llvm#12218) fixes the linker error above.

@horrorChen

Hi @muhammad-tanvir-1211
Thanks for the pointer. portBLAS now works well in my environment after updating DPC++.
Grateful for your work.

@hdelan

hdelan commented Jan 11, 2024

intel/llvm#12218 has just been merged.
