
DPC++ compile bug #483

Closed
horrorChen opened this issue Dec 5, 2023 · 9 comments · Fixed by #484

@horrorChen

In file portBLAS/include/blas_meta.h, the header file

#include <ext/oneapi/experimental/sycl_complex.hpp>

no longer exists in DPC++; instead you can use

#include <ext/oneapi/experimental/complex/complex.hpp>

which is located in llvm-dpcpp/build/include/sycl/ext/oneapi/experimental/complex/complex.hpp.

@horrorChen

horrorChen commented Dec 7, 2023

I also found another compile error when using DPC++ on an NVIDIA A100, but I can't fix it.

I want to test the performance of portBLAS GEMM on tensor cores, so I revised cmake/Modules/FindDPCPP.cmake, adding two lines to set the macro NVIDIA_GPU:

  if (${start_idx} AND ${sm_val} GREATER_EQUAL "80")
    add_definitions(-DSB_ENABLE_JOINT_MATRIX=1)
    add_definitions(-DNVIDIA_GPU=1)
    list(APPEND DPCPP_FLAGS "-Xclang;-cl-mad-enable")
    list(APPEND DPCPP_FLAGS "-DSYCL_EXT_ONEAPI_MATRIX_VERSION=4")
    list(APPEND DPCPP_FLAGS "-DSB_ENABLE_JOINT_MATRIX=1")
    list(APPEND DPCPP_FLAGS "-DNVIDIA_GPU=1")
  endif()

Meanwhile, I revised the joint_matrix_mad API call in src/operations/blas3/gemm_local_joint_matrix.hpp:830 to match the updated DPC++ signature:

        // reg_res[frag] = joint_matrix_mad(sg, inA, inB, reg_res[frag]);
        joint_matrix_mad(sg, reg_res[frag], inA, inB, reg_res[frag]);

After that I compiled portBLAS and the samples with

$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF -DCMAKE_BUILD_TYPE=Debug
$ ninja

and got this error:

portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: use of undeclared identifier 'get_wi_data'
  562 |             get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |             ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:607:9: error: use of undeclared identifier 'get_wi_data'
  607 |         get_wi_data(sg, float_out)[i] =
      |         ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:576:40: error: use of undeclared identifier 'get_wi_data'
  576 |                 static_cast<element_t>(get_wi_data(sg, reg_res[frag])[i]);
      |                                        ^

In DPC++'s sycl/include/sycl/ext/oneapi/matrix/matrix-unified.hpp, sycl::ext::oneapi::detail::get_wi_data is only used for Intel GPUs; for NVIDIA and AMD GPUs, SYCL accesses the object inside a joint_matrix jm directly via jm.matrix_impl.wi_marray, but that doesn't work for portBLAS. If I define the macro

#define get_wi_data(sg, jm) jm.matrix_impl.wi_marray

and compile, I will get another error

portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: no member named 'matrix_impl' in 'sycl::ext::oneapi::experimental::matrix::joint_matrix<sycl::sub_group, float, sycl::ext::oneapi::experimental::matrix::use::accumulator, 16, 16>'
  562 |             get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |             ^               ~~~~~~~~~
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:33:32: note: expanded from macro 'get_wi_data'
   33 | #define get_wi_data(sg, jm) jm.matrix_impl.wi_marray
      |                             ~~ ^

It seems that, since the code in gemm_local_joint_matrix.hpp is compiled as device code in DPC++, it cannot access the matrix_impl struct.

I would like to learn how to fix this so I can test GEMM performance on tensor cores.

@horrorChen

Is anyone working on this issue?🤔

@muhammad-tanvir-1211

Hi @horrorChen,
Thank you for opening this issue. The complex header problem has been addressed in #484. The joint_matrix support in portBLAS is out of date and needs to be refactored to match the latest changes in DPC++. We are currently working on addressing this and will put up the changes for review on portBLAS soon.
As for the changes you made to the FindDPCPP.cmake file: you don't need to add NVIDIA_GPU as a definition there; you can get the same behaviour by passing the -DTUNING_TARGET=NVIDIA_GPU cmake flag in your build command. Thanks.

@muhammad-tanvir-1211 muhammad-tanvir-1211 self-assigned this Dec 15, 2023
@horrorChen

Thanks for your reply @muhammad-tanvir-1211.

Actually, I found that get_wi_data is used to read the result of $AB$ and compute $\alpha AB + \beta C$ through the joint_matrix struct, but this doesn't have to be done through joint_matrix: the data can also be loaded with the joint_matrix_load API and the calculation done from there. The idea may be helpful.

Looking forward to your update.

@muhammad-tanvir-1211

Hi @horrorChen
The PR with the joint_matrix fix (#491) is now up; please build the library with the new changes and let us know if anything is still causing issues. Thanks.

@horrorChen

Hi @muhammad-tanvir-1211
I have tested the PR and sample_gemm works, but something went wrong before compilation completed.

I used the commands below.

$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF
$ ninja

The error output is:

[419/420] Linking CXX shared library libportblas.so.0.1.0
FAILED: libportblas.so.0.1.0 
.....
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach: 
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach: 
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)

Did you ever encounter this problem?

@muhammad-tanvir-1211

muhammad-tanvir-1211 commented Jan 8, 2024

Hi @horrorChen
Yes, sorry, I forgot to mention this earlier. A few symbols are missing from the compiler's NVIDIA backend; this PR (intel/llvm#12218) fixes the linker error above.

@horrorChen

Hi @muhammad-tanvir-1211
Thanks for the pointer. portBLAS now works well in my environment after updating DPC++.
Grateful for your work.

@hdelan

hdelan commented Jan 11, 2024

intel/llvm#12218 has just been merged.
