
Remove UNI_OMP guards surrounding #pragma omp... #526

Merged: 4 commits into dev-master on Dec 2, 2024

Conversation

@IvanaGyro (Collaborator) commented Nov 30, 2024

Remove UNI_OMP guards surrounding #pragma omp...

Compilers don't stop compiling when they can't recognize the token
following #pragma, so we don't need a compile definition to guard
#pragma omp... statements.
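
For example (untested), this compiles with or without -fopenmp; a compiler that doesn't recognize the omp token simply ignores the pragma, at most emitting a -Wunknown-pragmas warning:

    #include <vector>

    void scale(std::vector<double> &v, double a) {
    #pragma omp parallel for  // ignored when OpenMP is disabled
      for (std::size_t i = 0; i < v.size(); ++i) v[i] *= a;
    }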

This is part of the work for #517.

Remove find_package(GTest) from the root CMake file

We already use include(GoogleTest), the newer mechanism, to discover the
tests.

Remove -fopenmp flag when only enabling MKL

We no longer guard #pragma omp... statements, so they take effect
whenever the -fopenmp compiler flag is set. The set() function only
affects targets defined in the same scope (the same directory) and in
subdirectories processed after the set() call. Setting the -fopenmp
flag when the user enables USE_MKL but not USE_OMP would surprise the
user.

By default, MKL links against its own OpenMP runtime (not gomp) for
threading. The build has no problem when the -fopenmp compiler flag is
not set and MKL is the provider of BLAS and LAPACK. This claim holds for
MKL versions later than 2023.2.0; I am not sure whether it also holds
for older versions of MKL.
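
One quick way to check this (untested sketch; requires the MKL headers): mkl_get_max_threads() reports the thread count MKL's own runtime will use, even in a binary built without -fopenmp:

    #include <cstdio>
    #include <mkl.h>

    int main() {
      // MKL threads via its own OpenMP runtime, independent of -fopenmp.
      std::printf("MKL max threads: %d\n", mkl_get_max_threads());
      return 0;
    }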

codecov bot commented Nov 30, 2024

Codecov Report

Attention: Patch coverage is 16.19048% with 792 lines in your changes missing coverage. Please review.

Project coverage is 17.58%. Comparing base (21efcbc) to head (672b7e9).
Report is 9 commits behind head on dev-master.

Files with missing lines Patch % Lines
src/backend/linalg_internal_cpu/Cpr_internal.cpp 0.00% 264 Missing ⚠️
src/backend/linalg_internal_cpu/Add_internal.cpp 6.06% 248 Missing ⚠️
src/backend/linalg_internal_cpu/Mul_internal.cpp 3.27% 207 Missing ⚠️
src/backend/utils_internal_cpu/GetElems_cpu.cpp 0.00% 11 Missing ⚠️
src/backend/utils_internal_cpu/SetArange_cpu.cpp 18.18% 9 Missing ⚠️
...end/utils_internal_cpu/GetElems_contiguous_cpu.cpp 27.27% 8 Missing ⚠️
src/backend/linalg_internal_cpu/Abs_internal.cpp 0.00% 7 Missing ⚠️
src/utils/vec_range.cpp 50.00% 5 Missing ⚠️
src/Bond.cpp 20.00% 4 Missing ⚠️
...ckend/linalg_internal_cpu/Inv_inplace_internal.cpp 0.00% 4 Missing ⚠️
... and 17 more
Additional details and impacted files
@@              Coverage Diff               @@
##           dev-master     #526      +/-   ##
==============================================
+ Coverage       16.54%   17.58%   +1.04%     
==============================================
  Files             211      211              
  Lines           48829    44733    -4096     
  Branches        18900    14941    -3959     
==============================================
- Hits             8080     7868     -212     
+ Misses          36603    32751    -3852     
+ Partials         4146     4114      -32     


@manuschneider (Collaborator) left a comment

Seems good for now. But #517 suggests removing #pragma omp parallel for schedule(dynamic) everywhere, so the issue still needs some work.

@ianmccul (Contributor) commented Dec 2, 2024

> Seems good for now. But #517 suggests removing #pragma omp parallel for schedule(dynamic) everywhere, so the issue still needs some work.

Yes, all of the #pragma omp code I saw uses schedule(dynamic) in a way that makes it much slower than not using OpenMP at all. There are too many omp calls to check all of them by hand, but all the loops I saw had a 1:1 memory-access-to-computation ratio, i.e. every loop iteration loads new memory. These don't benefit much from OpenMP, because they can just about saturate the memory bandwidth even on one CPU core. Replacing schedule(dynamic) with schedule(static) would certainly be better than the current situation, but that will at best give a few percent speedup. I suggest simply removing all of the #pragma omp lines completely.

Perhaps try a test case with 3 benchmarks:

  1. the code as it is
  2. globally replace schedule(dynamic) with schedule(static)
  3. disable OpenMP, either via the compiler flag or by globally removing all #pragma omp ... lines

I tried (1) and (3), using C = A+B for A, B, C being 10000x10000 tensors of Double. bench-add-noomp links against cytnx compiled without OpenMP; bench-add is the same code linking against an OpenMP build.

$ ./bench-add-noomp
Average time for Cytnx Tensor: 0.212623 seconds
$ OMP_NUM_THREADS=1 ./bench-add
Average time for Cytnx Tensor: 0.781158 seconds
$ OMP_NUM_THREADS=10 ./bench-add
Average time for Cytnx Tensor: 3.15877 seconds

It gets worse the more threads you add because they are all fighting each other for the same L1 cache entries.
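
For reference, a standalone sketch of the same memory-bound pattern (untested; not the actual bench-add source), which can be built once with -fopenmp and once without to reproduce the comparison:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      // Same shape as the benchmark above: C = A + B over 10000x10000 doubles.
      const std::size_t n = 10000ull * 10000ull;
      std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

      auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for schedule(dynamic)  // the pattern under discussion
      for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
      auto t1 = std::chrono::steady_clock::now();

      std::printf("add: %.6f seconds\n",
                  std::chrono::duration<double>(t1 - t0).count());
      return c[0] == 3.0 ? 0 : 1;
    }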

Get the correct number of CPU cores when OpenMP is not used.
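
A sketch of one portable way to do this (untested; the commit itself may use a different mechanism): std::thread::hardware_concurrency() reports the hardware thread count without needing OpenMP.

    #include <thread>

    // Sketch only; returns the number of hardware threads, which may
    // differ from the number of physical cores.
    inline unsigned cpu_core_count() {
      unsigned n = std::thread::hardware_concurrency();
      return n == 0 ? 1 : n;  // the call may return 0 when unknown
    }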
@IvanaGyro (Collaborator, Author) commented

Yes, the issue is still there. Because I cannot confirm that all of the #pragma omp code is redundant, I didn't remove it by metaprogramming. I will check it by hand, but this will take some time.

@IvanaGyro merged commit dd72877 into dev-master on Dec 2, 2024 (2 checks passed)
@IvanaGyro deleted the remove-uni-omp-pr branch on December 2, 2024 at 13:27
@ianmccul (Contributor) commented Dec 2, 2024

This code is very suspicious:

void MaxMin_internal_u64(boost::intrusive_ptr<Storage_base> &out,
                         const boost::intrusive_ptr<Storage_base> &ten,
                         const cytnx_uint64 &Nelem, const char &type) {
  cytnx_uint64 *_ten = (cytnx_uint64 *)ten->Mem;
  cytnx_uint64 *_out = (cytnx_uint64 *)out->Mem;
  if (type == 'x') {
#ifdef UNI_OMP
    vector<cytnx_uint64> buf;
    unsigned int Nproc;
#pragma omp parallel
    {
      if (omp_get_thread_num() == 0) Nproc = omp_get_num_threads();
    }
    buf = vector<cytnx_uint64>(Nproc, numeric_limits<cytnx_uint64>::min());
#pragma omp parallel for schedule(dynamic)
    for (cytnx_uint64 n = 0; n < Nelem; n++) {
      if (_ten[n] > buf[omp_get_thread_num()]) buf[omp_get_thread_num()] = _ten[n];
    }
    for (int i = 1; i < Nproc; i++) {
      if (buf[i] > buf[0]) buf[0] = buf[i];
    }
    _out[0] = buf[0];
#else
    _out[0] = numeric_limits<cytnx_uint64>::min();
    for (cytnx_uint64 n = 0; n < Nelem; n++) {
      if (_ten[n] > _out[0]) _out[0] = _ten[n];
    }
#endif

It looks like an attempt to do a parallel reduction. However, it declares a parallel for inside a parallel region, which is nested parallelism. This is allowed in OpenMP, and although by default the nested parallel region doesn't use additional threads, it is possible for a user to configure OpenMP to use $n$ threads in the outer parallel region and $n\times m$ threads in the inner region. If this code were run using nested parallelism (omp_set_nested(1)), then the inner parallel for loop would run with $n$ independent thread teams, each having a thread index from 0 to $m-1$. That is probably not the intent of the code. Sorry, I misread that. The parallel regions are not nested.

Constructing a std::vector of length omp_get_num_threads() looks like an attempt to reproduce the functionality of private variables in OpenMP (but slower, since the elements of buf are at successive memory locations, so they most likely share the same cache lines in the L1 cache, and the different threads will be fighting each other to access them). It is much better to simply have a local variable, which will be stored on the stack of each thread. A variable in a local scope inside the parallel region is automatically private to each thread; otherwise you can specify that it is private when declaring the parallel region. E.g. (untested):

    uint64_t global_max = std::numeric_limits<uint64_t>::min();
    #pragma omp parallel
    {
        // Private variable for each thread's local maximum
        uint64_t local_max = std::numeric_limits<uint64_t>::min();

        // Each thread iterates over part of the array
        #pragma omp for
        for (std::size_t n = 0; n < Nelem; ++n) {
            if (_ten[n] > local_max) {
                local_max = _ten[n];
            }
        }

        // Reduce the local maxima into the global maximum
        #pragma omp critical
        {
            if (local_max > global_max) {
                global_max = local_max;
            }
        }
    }
    _out[0] = global_max;

The #pragma omp critical is a critical region protecting the shared variable global_max; only one thread can enter it at a time.

But this itself is a long-winded way of writing a reduction in OpenMP, which is simply (untested)

    uint64_t global_max = std::numeric_limits<uint64_t>::min();
    // Parallel reduction to compute the maximum value
    #pragma omp parallel for reduction(max:global_max)
    for (std::size_t n = 0; n < Nelem; ++n) {
        if (_ten[n] > global_max) {
            global_max = _ten[n];
        }
    }
    _out[0] = global_max;

But for sure the OpenMP is not going to give much benefit here anyway, so best option is to simply replace it with something like:

    uint64_t global_max = std::numeric_limits<uint64_t>::min();
    std::size_t len = Nelem;
    for (std::size_t i = 0; i < len; ++i) {
      if (_ten[i] > global_max) global_max = _ten[i];
    }
    _out[0] = global_max;

This is slightly different from the existing non-OpenMP code:

_out[0] = numeric_limits<cytnx_uint64>::min();
for (cytnx_uint64 n = 0; n < Nelem; n++) {
  if (_ten[n] > _out[0]) _out[0] = _ten[n];
}

In particular, the existing non-OpenMP code is unlikely to be optimizable by the compiler, for reasons discussed in #511: rather than writing to a local variable, which would almost certainly be kept in a CPU register, it writes to _out[0] inside the loop. Unless the compiler goes to extreme lengths, it doesn't know whether writing to _out[0] will change any values in the _ten array; and, for the uint64_t version, since Nelem is passed by reference, it doesn't know whether Nelem will get modified by writing to _out[0] either, so it needs to load Nelem again every time. The better solution for that is to pass Nelem by value (which will also be faster, since it will most likely be passed in a register anyway).
Verifying this behavior with https://godbolt.org/z/154oa6ocx: there are 3 memory accesses in the loop, plus a 4th memory access if the test is true and it updates _out[0]. (With -O3 it produces some more complicated code, but still with 3+1 memory accesses.)

Compare https://godbolt.org/z/scheP5ePb, which produces identical code with -O2 and -O3, with just 1 memory access inside the loop (it looks like two, but it is actually unrolling the loop and doing two comparisons per iteration) and no memory access when it updates global_max.

Rather than refactoring every kernel here separately, I suggest replacing all of the functions in this file with a template, and also considering what the max/min of a complex tensor should be...
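
As a sketch of that suggestion (untested; MaxMin_internal is a hypothetical name, and the real kernels work on Storage_base rather than raw pointers):

    #include <algorithm>
    #include <cstdint>

    // One template instead of MaxMin_internal_u64, _i64, _d, ...
    // Nelem is taken by value so the compiler can keep it in a register;
    // assumes Nelem > 0.
    template <typename T>
    void MaxMin_internal(T *out, const T *ten, std::uint64_t Nelem, char type) {
      if (type == 'x')
        out[0] = *std::max_element(ten, ten + Nelem);
      else
        out[0] = *std::min_element(ten, ten + Nelem);
    }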

@IvanaGyro (Collaborator, Author) commented

For non-complex, maybe just use std::max_element :)

@ianmccul (Contributor) commented Dec 2, 2024

> For non-complex, maybe just use std::max_element :)

And for the complex case too, just need to decide what the comparison function is.
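
For instance (untested; ordering by magnitude is just one possible choice, not a project decision):

    #include <algorithm>
    #include <complex>
    #include <cstdint>

    // Max of a complex array by magnitude; assumes Nelem > 0. The
    // comparison function is the open design question mentioned above.
    inline std::complex<double> max_by_abs(const std::complex<double> *ten,
                                           std::uint64_t Nelem) {
      return *std::max_element(ten, ten + Nelem,
                               [](const std::complex<double> &a,
                                  const std::complex<double> &b) {
                                 return std::abs(a) < std::abs(b);
                               });
    }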
