Add new batch_gemm types #466

AidanBeltonS · 2024-04-03T14:17:56Z

Description

This adds new data types for the gemm_batch operation, to better be in line with the oneMKL spec. The types added are <half, half, float, float>, <int8, int8, float, float>, and <int8, int8, int32, float>.

New testing is added for these data types. Tests where the scalar type does not match the input type require a higher tolerance as the reference calculation is being performed at a much higher precision.

Test logs:
rocblas_test_log.txt
cublas_test_log.txt

I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers.

Fixes # (GitHub issue)
#446

Checklist

All Submissions

Do all unit tests pass locally? Attach a log.
[x]Have you formatted the code using clang-format?

New interfaces

Have you provided motivation for adding a new feature as part of RFC and
it was accepted? # (RFC)

New features

Have you provided motivation for adding a new feature?
Have you added relevant tests?

AidanBeltonS · 2024-04-03T14:18:34Z

@Rbiessy, cc

mmeterel · 2024-04-03T16:24:13Z

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

hjabird

Since the reference cblas implementation doesn't support some of the operations that are being added (as I understand it), is the new functionality actually tested?

src/blas/backends/cublas/cublas_helper.hpp

src/blas/backends/portblas/portblas_batch.cxx

src/blas/backends/rocblas/rocblas_helper.hpp

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

Rbiessy · 2024-04-04T12:37:05Z

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:

/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'

Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:

$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)

We can use 2024.0 for the tests for now. Aidan is running more tests.

src/blas/backends/portblas/portblas_batch.cxx

mmeterel · 2024-04-04T15:41:48Z

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.
"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:

/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'

Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:

$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)

We can use 2024.0 for the tests for now. Aidan is running more tests.

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

mmeterel · 2024-04-04T16:09:12Z

@andrewtbarker Will you be able to help with this review?

andrewtbarker · 2024-04-04T16:33:11Z

@andrewtbarker Will you be able to help with this review?

Sure, I will take a look.

andrewtbarker

Thanks for the PR, there is a lot of good work here. Most of my comments are just about style and naming consistency.

src/blas/backends/cublas/cublas_batch.cpp

src/blas/backends/rocblas/rocblas_batch.cpp

src/blas/function_table.hpp

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

tests/unit_tests/blas/batch/gemm_batch_stride_usm.cpp

tests/unit_tests/blas/batch/gemm_batch_usm.cpp

andrewtbarker · 2024-04-04T22:33:16Z

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

Yes, this should have been fixed in #445 . If not we should fix it.

mmeterel · 2024-04-04T22:55:13Z

Have you tested the PR with hipSYCL/AdaptiveSYCL? Can you please add the logs?

AidanBeltonS · 2024-04-08T15:54:09Z

No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases.
mkl_test_log.txt
netlib_test_log.txt
port_blas_test_logs.txt

MKL tests error:
mkl_test_log.txt

andrewtbarker · 2024-04-10T17:12:11Z

MKL has some failing tests due to tolerating which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

AidanBeltonS · 2024-04-12T13:48:33Z

MKL has some failing tests due to tolerating which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

The failures it Dot are due to error

[ RUN      ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
relative error = 1.83849e-08 absolute error = 1.24863e-07 limit = 3.01315e-13
Difference in result: DPC++ 6.79159 vs. Reference 6.79159
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/level1/dot.cpp:157: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (1 ms)

DotU is an odd one, it does not appear to be related to my changes however

[ RUN      ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
Caught synchronous SYCL exception during DOTU:
The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Max 1100':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE) -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
OpenCL status: sycl:7
unknown file: Failure
C++ exception with description "Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)" thrown in the test body.
[  FAILED  ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (0 ms)

AidanBeltonS · 2024-04-12T13:49:11Z

I have resolved all but one issue with GemmBatch's tests. The CPU MKL implementation has significant amounts of error compared to the GPU. I believe there may be a fundamental difference in the precision of the calculation for the CPU. One possible fix would be to increase the tolerance significantly just for the CPU. Im not a fan of this approach as it is a bit of a brute force solution. Does anyone have any recommendations on how they would like to see this handled?

[ RUN      ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y
relative error = 0.000911658 absolute error = 0.00168478 limit = 0.000333786
Difference in entry (58,119): DPC++ 1.84973 vs. Reference 1.84804
relative error = 0.000812301 absolute error = 0.00121021 limit = 0.000333786
Difference in entry (0,124): DPC++ 1.49107 vs. Reference 1.48986
relative error = 0.000534697 absolute error = 0.000857353 limit = 0.000333786
Difference in entry (17,144): DPC++ 1.60258 vs. Reference 1.60344
relative error = 0.000527185 absolute error = 0.00049144 limit = 0.000333786
Difference in entry (52,186): DPC++ -0.932689 vs. Reference -0.932197
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/batch/gemm_batch_usm.cpp:408: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y, where GetParam() = (0x56845d0, 1-byte object <01>) (331 ms)

mmeterel · 2024-04-12T16:02:25Z

No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt

MKL tests error: mkl_test_log.txt

Can you please test hipSYCL backend as well?

andrewtbarker

Some suggestions to make interpreting failed test results easier - I flagged a few places but there are similar issues in most of the new tests.

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

andrewtbarker · 2024-04-12T16:46:24Z

Does anyone have any recommendations on how they would like to see this handled?

If, as we suspect, the CPU backend is doing accumulation in double while the GPU backend does it in float, one option would be changing what reference gemm from tests/unit_tests/blas/include/reference_blas_templates.hpp we call (might need to add a reference gemm in that file).

andrewtbarker · 2024-05-01T23:39:12Z

What is the status here? As I see it we have three outstanding items:

AdaptiveCpp testing
Test names (my most recent review, minor)
Failure in RealIntRealScalarPrecision

(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

Rbiessy · 2024-05-02T10:01:53Z

Hi @andrewtbarker, I have updated the status by email as it was easier to discuss issues with testing AdaptiveCpp on the CI. In short there are a few issues @AidanBeltonS will need to look at once he is back from Holiday next week!

AidanBeltonS · 2024-05-13T15:48:41Z

What is the status here? As I see it we have three outstanding items:
1. AdaptiveCpp testing

2. Test names (my most recent review, minor)

3. Failure in `RealIntRealScalarPrecision`
(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

I have addressed items 2. and 3.
To resolve 3 I am scaling the tolerance by the possible input range from int8 matricies. i.e. 256
I have yet to test this with AdaptiveCpp, Ill start looking at that shortly

src/include/error_helper.hpp

andrewtbarker

I think we're good to go. Thanks for sticking with this one!

AidanBeltonS · 2024-06-05T11:16:00Z

I have also disabled the in8, float combination for MKLCPU/GPU as I found similar precision issues.
#506 will be updated to reflect this

tests/unit_tests/blas/batch/gemm_batch_stride.cpp

tests/unit_tests/blas/include/test_common.hpp

Rbiessy · 2024-06-24T08:46:40Z

I have confirmed the tests pass with AdaptiveCpp on AMD and Nvidia HW.

Add support for more batch_gemm types to follow the specification. Some combination using int8 are disabled on some backends due to precision issue.

Rbiessy self-assigned this Apr 3, 2024

Rbiessy requested a review from mmeterel April 3, 2024 15:23

hjabird self-assigned this Apr 3, 2024

hjabird reviewed Apr 4, 2024

View reviewed changes

Rbiessy reviewed Apr 4, 2024

View reviewed changes

src/blas/backends/portblas/portblas_batch.cxx Outdated Show resolved Hide resolved

andrewtbarker reviewed Apr 4, 2024

View reviewed changes

andrewtbarker mentioned this pull request Apr 4, 2024

What is the supported precision for gemm in oneMKL open source project ? #310

Open

AidanBeltonS force-pushed the add_new_batch_gemm_types branch from e436f1c to 20a7057 Compare April 8, 2024 15:38

andrewtbarker reviewed Apr 12, 2024

View reviewed changes

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved

Rbiessy reviewed May 13, 2024

View reviewed changes

src/include/error_helper.hpp Outdated Show resolved Hide resolved

Rbiessy mentioned this pull request May 14, 2024

[BLAS] Netlib backend fails to compile with AdaptiveCpp #485

Open

andrewtbarker approved these changes May 20, 2024

View reviewed changes

Rbiessy previously approved these changes May 27, 2024

View reviewed changes

Aidan added 12 commits June 5, 2024 10:09

Clang-foramt

f9d2b85

Fix allocator msitake

a4d7916

Add check matrix instantiation

f0e3004

Add src/include to rocBlas include path

e83fa1e

Increase tolerancing for int8_t inputs

2f5fa02

Change test names

bcc3a29

Rename error_helper to dtype_string

5963b39

Disable int8, gemm_batch for cublas

6a3860f

Make check_equal_matrix static

a00af3e

Format dtype_string

6588324

Set gemm_batch int8, float to unimplemented

3641906

Add exceptions header

7dee1c1

AidanBeltonS force-pushed the add_new_batch_gemm_types branch from 6e70f68 to 7dee1c1 Compare June 5, 2024 11:14

Aidan added 2 commits June 14, 2024 16:07

Remove check_equal_matrix specialization

c31b635

Remove static/inline

28e098a

Rbiessy reviewed Jun 17, 2024

View reviewed changes

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved

tests/unit_tests/blas/batch/gemm_batch_stride.cpp Outdated Show resolved Hide resolved

Aidan added 2 commits June 18, 2024 16:47

refactor almost_equal_matrix check

b7a650f

Remove static

a7b21c0

Rbiessy reviewed Jun 19, 2024

View reviewed changes

tests/unit_tests/blas/include/test_common.hpp Outdated Show resolved Hide resolved

Aidan added 2 commits June 20, 2024 11:33

Fix int32_t check

ead997f

Pass error_mag through all functions

164ed6e

Rbiessy approved these changes Jun 21, 2024

View reviewed changes

Rbiessy merged commit 6433690 into uxlfoundation:develop Jun 24, 2024
6 checks passed

zhiweij1 mentioned this pull request Jul 4, 2024

[SYCLomatic] Support more gemm_batch data type combinations when using opensource oneMKL and fix a type casting bug oneapi-src/SYCLomatic#2121

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new batch_gemm types #466

Add new batch_gemm types #466

AidanBeltonS commented Apr 3, 2024

AidanBeltonS commented Apr 3, 2024

mmeterel commented Apr 3, 2024

hjabird left a comment

Rbiessy commented Apr 4, 2024

mmeterel commented Apr 4, 2024

mmeterel commented Apr 4, 2024

andrewtbarker commented Apr 4, 2024

andrewtbarker left a comment

andrewtbarker commented Apr 4, 2024

mmeterel commented Apr 4, 2024

AidanBeltonS commented Apr 8, 2024

andrewtbarker commented Apr 10, 2024

AidanBeltonS commented Apr 12, 2024

AidanBeltonS commented Apr 12, 2024

mmeterel commented Apr 12, 2024

andrewtbarker left a comment

andrewtbarker commented Apr 12, 2024

andrewtbarker commented May 1, 2024 •

edited

Loading

Rbiessy commented May 2, 2024

AidanBeltonS commented May 13, 2024

andrewtbarker left a comment

AidanBeltonS commented Jun 5, 2024 •

edited

Loading

Rbiessy commented Jun 24, 2024

Add new batch_gemm types #466

Add new batch_gemm types #466

Conversation

AidanBeltonS commented Apr 3, 2024

Description

Checklist

All Submissions

New interfaces

New features

AidanBeltonS commented Apr 3, 2024

mmeterel commented Apr 3, 2024

hjabird left a comment

Choose a reason for hiding this comment

Rbiessy commented Apr 4, 2024

mmeterel commented Apr 4, 2024

mmeterel commented Apr 4, 2024

andrewtbarker commented Apr 4, 2024

andrewtbarker left a comment

Choose a reason for hiding this comment

andrewtbarker commented Apr 4, 2024

mmeterel commented Apr 4, 2024

AidanBeltonS commented Apr 8, 2024

andrewtbarker commented Apr 10, 2024

AidanBeltonS commented Apr 12, 2024

AidanBeltonS commented Apr 12, 2024

mmeterel commented Apr 12, 2024

andrewtbarker left a comment

Choose a reason for hiding this comment

andrewtbarker commented Apr 12, 2024

andrewtbarker commented May 1, 2024 • edited Loading

Rbiessy commented May 2, 2024

AidanBeltonS commented May 13, 2024

andrewtbarker left a comment

Choose a reason for hiding this comment

AidanBeltonS commented Jun 5, 2024 • edited Loading

Rbiessy commented Jun 24, 2024

andrewtbarker commented May 1, 2024 •

edited

Loading

AidanBeltonS commented Jun 5, 2024 •

edited

Loading