inline static members in Kokkos 4.0 class not persistent with CUDA backend #55

kaschau · 2023-04-20T14:40:38Z

Kokkos 4.0 changed many class members that are set during Kokkos::initialize() to inline static members. With this change, these members do not seem to persist when Kokkos is initialized from Python through pybind11.

Whenever the CUDA backend is used, the TileSizeProperties attribute maxThreads ends up set to zero, which causes an abort at the first MDRange execution.

When Kokkos::initialize() is called (from a Python-bound function), cudaProp.maxThreadsPerMultiProcessor (from here) reports 1024. However, by the time we get to the MDRange policy here, space.impl_internal_space_instance()->m_maxThreadsPerSM is 0. This causes an abort at this check here.

I am only having an issue with CUDA, and it works fine with OpenMP and Serial backends. It has been consistent with every host/device compiler I have tried.

Primarily gcc 9.4.0/intel19.04 + CUDA 11.7
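
To make the failure mode concrete, here is a minimal sketch of the kind of check that trips. It is not the actual Kokkos code: the namespace `demo` and the function `tile_fits` are hypothetical, and it only assumes that the check compares the product of the tile extents against the per-block thread limit, as the error message suggests.

```cpp
#include <array>
#include <cstddef>

namespace demo {  // hypothetical namespace, not part of Kokkos

// Returns true when the product of the tile extents does not exceed
// max_threads. If max_threads was never initialized and is still 0,
// every tile fails, including a 1x1 tile.
template <std::size_t Rank>
bool tile_fits(const std::array<int, Rank>& tile, int max_threads) {
  long long product = 1;
  for (int extent : tile) product *= extent;
  return product <= max_threads;
}

}  // namespace demo
```

With a limit of 1024 a 16x16 tile passes, while with a limit of 0 no tile can pass, which would match an abort on the first MDRange launch regardless of the tile dims chosen.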


kaschau · 2023-04-21T14:04:41Z

To reproduce: I would expect any CUDA kernel to fail when Kokkos::initialize() is called from pykokkos-base and a Kokkos kernel is launched afterwards. I cannot reproduce this in C++-only Kokkos code.

kaschau · 2023-05-13T14:44:09Z

It seems like the inline static member behavior is different when Kokkos is compiled as a static versus a shared library. Because pybind11 requires PIC, generally one just compiles Kokkos as a shared library, so there are no problems when compiling pykokkos-base. However, this leads to the behavior described above (with 4.0).

However, when I compile Kokkos as static libraries with -fPIC, I am able to get Kokkos 4.0 to run on the CUDA backend.

This is well above my pay grade when it comes to compiling, C++ object lifetimes, and translation units, so I am not sure what to make of it. But at least it works.

crtrott · 2023-05-13T17:59:12Z

Hm, interesting. @nliber do you have any idea what this could be? I think it is potentially the JIT compilation of code where we have inline static variables inside header files. So if something gets recompiled and then relinked, it might cause issues?

I wonder if this is fixable by having all inline-static variables actually be static variables inside functions which are compiled inside the Kokkos library itself. I.e. for every static int foo; make it actually static int& foo(); and have int& foo() { static int val; return val; } somewhere?
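
A minimal sketch of that accessor pattern, under stated assumptions: the namespace `demo` and the name `max_threads_per_sm` are hypothetical, and in a real library the declaration would live in a header while the definition would sit in exactly one .cpp file compiled into the library. Both are shown in one block here only so the sketch compiles on its own.

```cpp
namespace demo {  // hypothetical namespace, not part of Kokkos

// Before (in a header), each binary that includes the header may carry
// its own definition of the variable:
//   struct Limits { inline static int max_threads_per_sm = 0; };

// After: the header only declares an accessor ...
int& max_threads_per_sm();

// ... and a single .cpp file inside the library defines it, so the one
// function-local static lives in that library alone.
int& max_threads_per_sm() {
  static int value = 0;  // initialized on first call
  return value;
}

}  // namespace demo
```

Callers would then write `demo::max_threads_per_sm() = 1024;` at initialization and read `demo::max_threads_per_sm()` later. Whether this actually resolves the pybind11 case is exactly what the experiment would need to show.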

crtrott · 2023-05-13T17:59:58Z

@kaschau do you feel you could take this experiment on, i.e. make a branch of Kokkos Core, go through all these variables, and see if we can get this fixed that way?

kaschau · 2023-05-13T19:09:52Z

@crtrott I'm a C++ ignoramus, but I think I can give it a shot. Proving that one variable (the tile size, for example) survives this way should be doable for me as a proof of concept.

jrmadsen · 2023-05-16T04:09:38Z

@kaschau A bit of a shot in the dark but try setting this variable to OFF and rebuild pykokkos-base:

pykokkos-base/cmake/Modules/KokkosPythonOptions.cmake, line 82 in 94553b7:

    set(CMAKE_VISIBILITY_INLINES_HIDDEN ON CACHE BOOL "Add compile flag to hide symbols of inline functions")

I suspect the reason you see this issue with shared libraries is that some symbol exists in both the pykokkos-base library and the Kokkos library, and pykokkos-base is initializing its own copy of the symbol instead of the one that exists in the Kokkos library. When a static Kokkos library is used, these symbols get merged.
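
One way to test this hypothesis, sketched under assumptions: print the address of the variable from code compiled into each library and compare. The struct `Limits` below is a hypothetical stand-in for the Kokkos internal class, and `report` is a made-up helper. Inline variables need C++17.

```cpp
#include <cstdio>

namespace demo {  // hypothetical namespace, not part of Kokkos

// Stand-in for a class with inline static state set during initialization.
struct Limits {
  inline static int max_threads_per_sm = 0;
};

// Address of the copy that the calling binary resolves to.
inline const int* limits_address() { return &Limits::max_threads_per_sm; }

// Print the address with a label, e.g. once from the Kokkos side and once
// from the binding side.
inline void report(const char* where) {
  std::printf("%s: &max_threads_per_sm = %p\n", where,
              static_cast<const void*>(limits_address()));
}

}  // namespace demo
```

Within a single binary the address is always the same. If calls to `report` made from the two shared libraries print different addresses, each library is reading and writing its own copy.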

jrmadsen · 2023-05-16T04:21:01Z

> @kaschau do you feel you could take this experiment on, i.e. make a branch of Kokkos Core, go through all these variables, and see if we can get this fixed that way?

A potential starting place might be to use the nm command-line tool to see which Kokkos variables are defined in the pykokkos-base library itself. man nm explains the codes for whether a symbol is undefined (i.e. defined in another library), defined in the text section, and so on; note that variables typically show up in the data or BSS sections, or as weak or unique symbols, rather than in the text section. Filter out any pybind symbols and see if any symbols defined in both the Kokkos shared library and the pykokkos-base library look suspicious.

kaschau · 2023-05-16T17:47:24Z

@jrmadsen I tried setting CMAKE_VISIBILITY_INLINES_HIDDEN to OFF and rebuilding pykokkos-base, and still had the same issue. I will take a look at nm when I have some time. Thanks!

Yaraslaut · 2023-05-18T21:29:06Z

The commit that broke pybind11: kokkos/kokkos@1f048cf
And some info from valgrind (not very helpful):

==1994128== Invalid read of size 32
==1994128==    at 0x4FB9B89: __wcsncpy_avx2 (strncpy-avx2.S:306)
==1994128==    by 0x4B59439: UnknownInlinedFun (wchar2.h:146)
==1994128==    by 0x4B59439: _Py_wrealpath (fileutils.c:1996)
==1994128==    by 0x4B54A0C: _PyPathConfig_ComputeSysPath0.constprop.0 (pathconfig.c:495)
==1994128==    by 0x4B544F4: UnknownInlinedFun (main.c:575)
==1994128==    by 0x4B544F4: Py_RunMain (main.c:680)
==1994128==    by 0x4B1CF6A: Py_BytesMain (main.c:734)
==1994128==    by 0x4E7F84F: (below main) (libc_start_call_main.h:58)
==1994128==  Address 0x5ccb2a0 is 16 bytes after a block of size 176 in arena "client"

And the output from Python itself:

ExecSpace Error: MDRange tile dims exceed maximum number of threads per block - choose smaller tile dims
Backtrace:
                                                               Kokkos::Impl::save_stacktrace() [0x7efc8e28d915]
Kokkos::Impl::traceback_callstack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&) [0x7efc8e280cf1]
                                                         Kokkos::Impl::host_abort(char const*) [0x7efc8e280d98]
                                                                                               [0x7efc8e4f3696]
                                                                                               [0x7efc8e4f671c]
                                                                                               [0x7efc8e4f3e43]
                                                                                               [0x7efc8e4f2679]
                                                                                               [0x7efc8e4f1ebe]
                                                                                               [0x7efc8e4f1dba]
                                                                                               [0x7efc8e4f1cde]
                                                                                               [0x7efc8e4d9030]
                                                                                               [0x7efcafa04a81]
                                                                          _PyObject_MakeTpCall [0x7efcaf9e53e4]
                                                                                               [0x7efcafa360fe]
                                                                                               [0x7efcafa1d100]
                                                                                               [0x7efcaf9e575a]
                                                                                               [0x7efc8e4d3cdb]
                                                                          _PyObject_MakeTpCall [0x7efcaf9e53e4]
                                                                      _PyEval_EvalFrameDefault [0x7efcaf9efbcb]
                                                                                               [0x7efcafaa9f6a]
                                                                               PyEval_EvalCode [0x7efcafaa997c]
                                                                                               [0x7efcafac86b3]
                                                                                               [0x7efcafac43ba]
                                                                                               [0x7efcafadadd3]
                                                                       _PyRun_SimpleFileObject [0x7efcafad9ef4]
                                                                          _PyRun_AnyFileObject [0x7efcafad8de8]
                                                                                    Py_RunMain [0x7efcafad3722]
                                                                                  Py_BytesMain [0x7efcafa9bf6b]
                                                                                               [0x7efcaf639850]
                                                                             __libc_start_main [0x7efcaf63990a]
                                                                                        _start [0x55e4512bb045]

Yaraslaut · 2023-05-20T09:45:00Z

I was trying to figure out what is going on in my case, and something very odd is happening: if I look at the address of this variable here and here, the two are different.
The good news is that if I fetch Kokkos and pybind11 directly from pykokkos-base using CPM

FetchContent_Declare(
  PyKokkosbase
  GIT_REPOSITORY https://github.com/kokkos/pykokkos-base.git
  GIT_TAG        94553b7e4be91b042baa9d903dc98e73722eeced
)
FetchContent_MakeAvailable(PyKokkosbase)
find_package(Python3 COMPONENTS Development)

..... 
pybind11_add_module(...)
target_link_libraries( ... Kokkos::kokkos)
.....

everything starts to work properly.
By default, Kokkos 3.7 is used inside pykokkos-base; to check with Kokkos 4.0 you can update the submodule index.
