
Remove UNI_OMP guards surrounding #pragma omp... #526

Merged: 4 commits into dev-master on Dec 2, 2024

Conversation

@IvanaGyro (Collaborator) commented Nov 30, 2024

Remove UNI_OMP guards surrounding #pragma omp...

Compilers don't stop compiling when they can't recognize the token
following #pragma, so we don't need a compile definition to guard
#pragma omp... statements.
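
For example (untested), this compiles with or without -fopenmp; a compiler that doesn't recognize the omp token simply ignores the pragma, at most emitting a -Wunknown-pragmas warning:

    #include <vector>

    void scale(std::vector<double> &v, double a) {
    #pragma omp parallel for  // ignored when OpenMP is disabled
      for (std::size_t i = 0; i < v.size(); ++i) v[i] *= a;
    }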

This is part of the work for #517.

Remove find_package(GTest) from the root CMake file

We already use include(GoogleTest), the newer mechanism, to discover the
tests.

Remove -fopenmp flag when only enabling MKL

We no longer guard #pragma omp... statements, so they take effect
whenever the -fopenmp compiler flag is set. The set() function only
affects targets defined in the same scope (the same directory) and in
subdirectories processed after the set() call. Setting the -fopenmp
flag when the user enables USE_MKL but not USE_OMP would surprise the
user.

By default, MKL links against its own OpenMP runtime (not gomp) for
threading. The build has no problem when the -fopenmp compiler flag is
not set and MKL is the provider of BLAS and LAPACK. This claim holds for
MKL versions later than 2023.2.0; I am not sure whether it also holds
for older versions of MKL.
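
One quick way to check this (untested sketch; requires the MKL headers): mkl_get_max_threads() reports the thread count MKL's own runtime will use, even in a binary built without -fopenmp:

    #include <cstdio>
    #include <mkl.h>

    int main() {
      // MKL threads via its own OpenMP runtime, independent of -fopenmp.
      std::printf("MKL max threads: %d\n", mkl_get_max_threads());
      return 0;
    }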

codecov bot commented Nov 30, 2024

Codecov Report

Attention: Patch coverage is 16.19048% with 792 lines in your changes missing coverage. Please review.

Project coverage is 17.58%. Comparing base (21efcbc) to head (672b7e9).
Report is 9 commits behind head on dev-master.

Files with missing lines Patch % Lines
src/backend/linalg_internal_cpu/Cpr_internal.cpp 0.00% 264 Missing ⚠️
src/backend/linalg_internal_cpu/Add_internal.cpp 6.06% 248 Missing ⚠️
src/backend/linalg_internal_cpu/Mul_internal.cpp 3.27% 207 Missing ⚠️
src/backend/utils_internal_cpu/GetElems_cpu.cpp 0.00% 11 Missing ⚠️
src/backend/utils_internal_cpu/SetArange_cpu.cpp 18.18% 9 Missing ⚠️
...end/utils_internal_cpu/GetElems_contiguous_cpu.cpp 27.27% 8 Missing ⚠️
src/backend/linalg_internal_cpu/Abs_internal.cpp 0.00% 7 Missing ⚠️
src/utils/vec_range.cpp 50.00% 5 Missing ⚠️
src/Bond.cpp 20.00% 4 Missing ⚠️
...ckend/linalg_internal_cpu/Inv_inplace_internal.cpp 0.00% 4 Missing ⚠️
... and 17 more
Additional details and impacted files
@@              Coverage Diff               @@
##           dev-master     #526      +/-   ##
==============================================
+ Coverage       16.54%   17.58%   +1.04%     
==============================================
  Files             211      211              
  Lines           48829    44733    -4096     
  Branches        18900    14941    -3959     
==============================================
- Hits             8080     7868     -212     
+ Misses          36603    32751    -3852     
+ Partials         4146     4114      -32     


@manuschneider (Collaborator) left a comment

Seems good for now. But #517 suggests removing #pragma omp parallel for schedule(dynamic) everywhere, so the issue still needs some work.

@ianmccul (Contributor) commented Dec 2, 2024

> Seems good for now. But #517 suggests removing #pragma omp parallel for schedule(dynamic) everywhere, so the issue still needs some work.

Yes, all of the #pragma omp code I saw uses schedule(dynamic) in a way that makes it much slower than not using OpenMP at all. There are too many omp calls to check all of them by hand, but all the loops I saw had a 1:1 memory-access-to-computation ratio, i.e. every loop iteration loads new memory. These don't benefit much from OpenMP, because they can just about saturate the memory bandwidth even on one CPU core. Replacing schedule(dynamic) with schedule(static) would certainly be better than the current situation, but that will at best give a few percent speedup. I suggest simply removing all of the #pragma omp lines completely.

Perhaps try a test case with 3 benchmarks:

  1. the code as it is
  2. globally replace schedule(dynamic) with schedule(static)
  3. disable OpenMP, either via the compiler flag or by globally removing all #pragma omp ... lines

I tried (1) and (3), using C = A+B for A, B, C being 10000x10000 tensors of Double. bench-add-noomp links against cytnx compiled without OpenMP; bench-add is the same code linking against an OpenMP build.

$ ./bench-add-noomp
Average time for Cytnx Tensor: 0.212623 seconds
$ OMP_NUM_THREADS=1 ./bench-add
Average time for Cytnx Tensor: 0.781158 seconds
$ OMP_NUM_THREADS=10 ./bench-add
Average time for Cytnx Tensor: 3.15877 seconds

It gets worse the more threads you add because they are all fighting each other for the same L1 cache entries.
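
For reference, a standalone sketch of the same memory-bound pattern (untested; not the actual bench-add source), which can be built once with -fopenmp and once without to reproduce the comparison:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      // Same shape as the benchmark above: C = A + B over 10000x10000 doubles.
      const std::size_t n = 10000ull * 10000ull;
      std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

      auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for schedule(dynamic)  // the pattern under discussion
      for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
      auto t1 = std::chrono::steady_clock::now();

      std::printf("add: %.6f seconds\n",
                  std::chrono::duration<double>(t1 - t0).count());
      return c[0] == 3.0 ? 0 : 1;
    }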

Get the correct number of CPU cores when OpenMP is not used.
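
A sketch of one portable way to do this (untested; the commit itself may use a different mechanism): std::thread::hardware_concurrency() reports the hardware thread count without needing OpenMP.

    #include <thread>

    // Sketch only; returns the number of hardware threads, which may
    // differ from the number of physical cores.
    inline unsigned cpu_core_count() {
      unsigned n = std::thread::hardware_concurrency();
      return n == 0 ? 1 : n;  // the call may return 0 when unknown
    }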
@IvanaGyro (Collaborator, Author) commented

Yes, the issue is still there. Because I cannot confirm that all of the #pragma omp code is redundant, I didn't remove it by metaprogramming. I will check it by hand, but this will take some time.

@IvanaGyro merged commit dd72877 into dev-master on Dec 2, 2024 (2 checks passed)
@IvanaGyro deleted the remove-uni-omp-pr branch on December 2, 2024 at 13:27
@ianmccul (Contributor) commented Dec 2, 2024

This code is very suspicious:

void MaxMin_internal_u64(boost::intrusive_ptr<Storage_base> &out,
                         const boost::intrusive_ptr<Storage_base> &ten,
                         const cytnx_uint64 &Nelem, const char &type) {
  cytnx_uint64 *_ten = (cytnx_uint64 *)ten->Mem;
  cytnx_uint64 *_out = (cytnx_uint64 *)out->Mem;
  if (type == 'x') {
#ifdef UNI_OMP
    vector<cytnx_uint64> buf;
    unsigned int Nproc;
#pragma omp parallel
    {
      if (omp_get_thread_num() == 0) Nproc = omp_get_num_threads();
    }
    buf = vector<cytnx_uint64>(Nproc, numeric_limits<cytnx_uint64>::min());
#pragma omp parallel for schedule(dynamic)
    for (cytnx_uint64 n = 0; n < Nelem; n++) {
      if (_ten[n] > buf[omp_get_thread_num()]) buf[omp_get_thread_num()] = _ten[n];
    }
    for (int i = 1; i < Nproc; i++) {
      if (buf[i] > buf[0]) buf[0] = buf[i];
    }
    _out[0] = buf[0];
#else
    _out[0] = numeric_limits<cytnx_uint64>::min();
    for (cytnx_uint64 n = 0; n < Nelem; n++) {
      if (_ten[n] > _out[0]) _out[0] = _ten[n];
    }
#endif

It looks like an attempt to do a parallel reduction. However, it declares a parallel for inside a parallel region, which is nested parallelism. This is allowed in OpenMP, and although by default the nested parallel region doesn't use additional threads, it is possible for a user to configure OpenMP to use $n$ threads in the outer parallel region and $n\times m$ threads in the inner region. If this code were run using nested parallelism (omp_set_nested(1)), then the inner parallel for loop would run with $n$ independent thread teams, each having a thread index from 0 to $m-1$. That is probably not the intent of the code. Sorry, I misread that. The parallel regions are not nested.

Constructing a std::vector of length omp_get_num_threads() looks like an attempt to reproduce the functionality of private variables in OpenMP (but slower, since the elements of buf are at successive memory locations, so they most likely share the same cache lines in the L1 cache, and the different threads will be fighting each other to access them). It is much better to simply have a local variable, which will be stored on the stack of each thread. A variable in a local scope inside the parallel region is automatically private to each thread; otherwise you can specify that it is private when declaring the parallel region. E.g. (untested):

    uint64_t global_max = std::numeric_limits<uint64_t>::min();
    #pragma omp parallel
    {
        // Private variable for each thread's local maximum
        uint64_t local_max = std::numeric_limits<uint64_t>::min();

        // Each thread iterates over part of the array
        #pragma omp for
        for (std::size_t n = 0; n < Nelem; ++n) {
            if (_ten[n] > local_max) {
                local_max = _ten[n];
            }
        }

        // Reduce the local maxima into the global maximum
        #pragma omp critical
        {
            if (local_max > global_max) {
                global_max = local_max;
            }
        }
    }
    _out[0] = global_max;

The #pragma omp critical is a critical region protecting the shared variable global_max; only one thread can enter it at a time.

But this itself is a long-winded way of writing a reduction in OpenMP, which is simply (untested)

    uint64_t global_max = std::numeric_limits<uint64_t>::min();
    // Parallel reduction to compute the maximum value
    #pragma omp parallel for reduction(max:global_max)
    for (std::size_t n = 0; n < Nelem; ++n) {
        if (_ten[n] > global_max) {
            global_max = _ten[n];
        }
    }
    _out[0] = global_max;

But for sure the OpenMP is not going to give much benefit here anyway, so best option is to simply replace it with something like:

    uint64_t global_max = std::numeric_limits<uint64_t>::min();
    std::size_t len = Nelem;
    for (std::size_t i = 0; i < len; ++i) {
      if (_ten[i] > global_max) global_max = _ten[i];
    }
    _out[0] = global_max;

This is slightly different from the existing non-OpenMP code:

_out[0] = numeric_limits<cytnx_uint64>::min();
for (cytnx_uint64 n = 0; n < Nelem; n++) {
  if (_ten[n] > _out[0]) _out[0] = _ten[n];
}

In particular, the existing non-OpenMP code is unlikely to be optimizable by the compiler, for reasons discussed in #511: rather than writing to a local variable, which would almost certainly be kept in a CPU register, it writes to _out[0] inside the loop. Unless the compiler goes to extreme lengths, it doesn't know whether writing to _out[0] will change any values in the _ten array; and, for the uint64_t version, since Nelem is passed by reference, it doesn't know whether Nelem will get modified by writing to _out[0] either, so it needs to load Nelem again every time. The better solution for that is to pass Nelem by value (which will also be faster, since it will most likely be passed in a register anyway).
Verifying this behavior with https://godbolt.org/z/154oa6ocx: there are 3 memory accesses in the loop, plus a 4th memory access if the test is true and it updates _out[0]. (With -O3 it produces some more complicated code, but still with 3+1 memory accesses.)

Compare https://godbolt.org/z/scheP5ePb, which produces identical code with -O2 and -O3, with just 1 memory access inside the loop (it looks like two, but it is actually unrolling the loop and doing two comparisons per iteration) and no memory access when it updates global_max.

Rather than refactoring every kernel here separately, I suggest replacing all of the functions in this file with a template, and also considering what the max/min of a complex tensor should be...
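
As a sketch of that suggestion (untested; MaxMin_internal is a hypothetical name, and the real kernels work on Storage_base rather than raw pointers):

    #include <algorithm>
    #include <cstdint>

    // One template instead of MaxMin_internal_u64, _i64, _d, ...
    // Nelem is taken by value so the compiler can keep it in a register;
    // assumes Nelem > 0.
    template <typename T>
    void MaxMin_internal(T *out, const T *ten, std::uint64_t Nelem, char type) {
      if (type == 'x')
        out[0] = *std::max_element(ten, ten + Nelem);
      else
        out[0] = *std::min_element(ten, ten + Nelem);
    }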

@IvanaGyro (Collaborator, Author) commented

For non-complex, maybe just use std::max_element :)

@ianmccul (Contributor) commented Dec 2, 2024

> For non-complex, maybe just use std::max_element :)

And for the complex case too, just need to decide what the comparison function is.
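
For instance (untested; ordering by magnitude is just one possible choice, not a project decision):

    #include <algorithm>
    #include <complex>
    #include <cstdint>

    // Max of a complex array by magnitude; assumes Nelem > 0. The
    // comparison function is the open design question mentioned above.
    inline std::complex<double> max_by_abs(const std::complex<double> *ten,
                                           std::uint64_t Nelem) {
      return *std::max_element(ten, ten + Nelem,
                               [](const std::complex<double> &a,
                                  const std::complex<double> &b) {
                                 return std::abs(a) < std::abs(b);
                               });
    }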
