Vectorize the cmath matrix operations #116
Conversation
So far I have only optimized the addition and constant multiplication.
VTune analyzer result with the detray integration test (RK propagation & constant B-field), after optimization: [profiler screenshot]
Is this single or double precision? You may not see that much of a speedup in double precision.
It's double. I will check single later.
Are you compiling with support for vector extensions enabled?
Maybe not. I thought a Release build would do that. Do you know how to enable it? Maybe this?
You can also ask the compiler to tell you about the vectorization optimizations it did.
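For example, with GCC or Clang something like the following enables the host CPU's vector extensions and prints remarks about which loops were vectorized (the flags shown are the common ones; `matrix_ops.cpp` is a placeholder file name, not a file in this repo):

```shell
# GCC: enable the host's vector ISA and report the loops it vectorized
g++ -O3 -march=native -fopt-info-vec-optimized -c matrix_ops.cpp

# Clang: equivalent vectorization remarks
clang++ -O3 -march=native -Rpass=loop-vectorize -c matrix_ops.cpp
```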
Another thing to consider is data alignment, maybe.
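As a sketch of the alignment point: a column of 4 doubles is exactly 32 bytes, so forcing 32-byte alignment allows the compiler to emit aligned AVX loads and stores. The `col4` type here is made up for illustration, not an algebra-plugins type:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical example: over-align the column storage to one AVX
// register (32 bytes) so vector loads need no unaligned-access path.
struct alignas(32) col4 {
  std::array<double, 4> v;  // 4 doubles = 32 bytes = one AVX register
};

static_assert(alignof(col4) == 32, "column storage should be 32-byte aligned");
```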
Intel Advisor report: [screenshot] As expected, it reports a bunch of non-vectorized operations in algebra-plugins/math/cmath/include/algebra/math/algorithms/matrix/inverse/partial_pivot_lud.hpp (line 67 at 9e684de).
Partial pivot LU decomposition is written in a totally anti-vectorized manner: each sub-vector of the matrix represents a column of the matrix, yet we do the Gaussian elimination row-wise. We will need to rewrite the partial pivot LUD later with column-wise Gaussian elimination.
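For reference, Gaussian elimination can be organized column-wise so that every inner loop walks down a contiguous column. A minimal sketch, assuming a plain array-of-columns layout and no pivoting (the real partial-pivot LUD additionally needs the row-interchange bookkeeping); the names `col_t`, `mat_t`, and `lu_columnwise` are illustrative, not the actual cmath code:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using col_t = std::array<double, 4>;
using mat_t = std::array<col_t, 4>;  // mat[j] is column j (column-major)

// In-place LU factorization without pivoting, written column-wise:
// both inner loops run down contiguous columns and can be vectorized.
// Afterwards the strict lower triangle holds L (unit diagonal implied)
// and the upper triangle holds U.
void lu_columnwise(mat_t& m) {
  for (std::size_t k = 0; k < 4; ++k) {
    const double inv_pivot = 1.0 / m[k][k];
    // Scale the sub-column below the pivot: contiguous access.
    for (std::size_t i = k + 1; i < 4; ++i) m[k][i] *= inv_pivot;
    // Rank-1 update of the trailing submatrix, one column at a time.
    for (std::size_t j = k + 1; j < 4; ++j) {
      const double m_kj = m[j][k];  // element U(k, j)
      for (std::size_t i = k + 1; i < 4; ++i)
        m[j][i] -= m[k][i] * m_kj;  // contiguous, vectorizable
    }
  }
}
```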
Force-pushed from de06258 to 536b045
I will skip the partial pivot LUD optimization in this PR as it is going to require a significant amount of time.
Force-pushed from 536b045 to e2f4951
Force-pushed from e2f4951 to 2360abe
Free lunch time
What's changed?
Just to help with understanding what's going on in the new matrix multiplication.
It should be noted that the cmath matrix is made of multiple arrays, each of which represents one column of the matrix.
Imagine that we multiply the 4x4 matrices A and B to obtain the matrix C. If we do the matrix multiplication the usual way, as in the following figure, we have to pick out the rows of A, which cannot be vectorized:
So I came up with the idea of doing the matrix multiplication column-wise, by multiplying a column of A by an element of B:
The following figure shows how a column of C is calculated in the new matrix multiplication.
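The column-wise scheme can be sketched as follows, assuming a plain array-of-columns layout (the names `col_t`, `mat_t`, and `multiply`, and the fixed 4x4 size, are illustrative, not the actual cmath types): column j of C is the sum over k of A's column k scaled by the scalar B(k, j).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using col_t = std::array<double, 4>;
using mat_t = std::array<col_t, 4>;  // mat[j] is column j (column-major)

// Column-wise multiplication: C(:, j) = sum_k A(:, k) * B(k, j).
// The inner loop runs down one contiguous column of A and C, so the
// compiler can vectorize it; no row of A is ever gathered.
mat_t multiply(const mat_t& A, const mat_t& B) {
  mat_t C{};  // zero-initialized
  for (std::size_t j = 0; j < 4; ++j) {
    for (std::size_t k = 0; k < 4; ++k) {
      const double b_kj = B[j][k];  // element B(k, j)
      for (std::size_t i = 0; i < 4; ++i)
        C[j][i] += A[k][i] * b_kj;  // contiguous, vectorizable
    }
  }
  return C;
}
```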