Sanity check for rebound #20

Open. Wants to merge 1 commit into base: develop

adamdempsey90
Collaborator

Checks that the particle allocation persists across the interface when we create the rebound simulation.
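As a rough illustration of what such a sanity check could look like, here is a minimal sketch. The struct `SimStub`, its field names, and the function `CheckParticleAllocation` are all hypothetical stand-ins invented for this example; they are not taken from the rebound headers or from this PR's diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the N-body simulation handle. The field names are
// illustrative only and do not mirror the real rebound struct.
struct SimStub {
  const double *particles;  // storage for particle data
  std::size_t n_active;     // particles currently in use
  std::size_t n_allocated;  // capacity of the particle storage
};

// Returns an empty string when the allocation looks consistent, otherwise a
// short description of the first problem found.
std::string CheckParticleAllocation(const SimStub &sim, std::size_t n_expected) {
  if (sim.n_allocated > 0 && sim.particles == nullptr) {
    return "capacity is nonzero but the particle pointer is null";
  }
  if (sim.n_active > sim.n_allocated) {
    return "active particle count exceeds allocated capacity";
  }
  if (sim.n_allocated < n_expected) {
    return "allocated capacity is smaller than the requested particle count";
  }
  return "";
}
```

Calling a check like this right after the simulation is created, and again after it has been handed across the interface, would show whether the capacity and pointer seen on one side still match the other.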

Background

Description of Changes

Checklist

  • New features are documented
  • Tests added for bug fixes and new features
  • (@lanl.gov employees) Update copyright on changed files

@adamdempsey90 adamdempsey90 requested review from brryan, pdmullen and 111-1000 and removed request for 111-1000 November 26, 2024 23:11
@adamdempsey90
Collaborator Author

adamdempsey90 commented Nov 27, 2024

The traceback is

/vast/home/adempsey/github/artemis/external/parthenon/src/outputs/parthenon_hdf5.hpp:118
/vast/home/adempsey/github/artemis/external/parthenon/src/outputs/parthenon_hdf5.hpp:175
/vast/projects/opt/centos8/x86_64/gcc/9.4.0/include/c++/9.4.0/bits/basic_string.h:226
/vast/home/adempsey/github/artemis/external/parthenon/src/interface/params.cpp:47
/vast/home/adempsey/github/artemis/external/parthenon/src/interface/params.cpp:95
/vast/projects/opt/centos8/x86_64/gcc/9.4.0/include/c++/9.4.0/bits/shared_ptr_base.h:729
/vast/home/adempsey/github/artemis/external/parthenon/src/outputs/outputs.cpp:462

The failing function call in parthenon_hdf5.hpp is

  H5A const attribute = H5A::FromHIDCheck(
      H5Acreate(location, name.c_str(), type, data_space, H5P_DEFAULT, H5P_DEFAULT));

Putting a print statement and a sleep in the writing function gives the following (serial CPU, by the way):

Writing nbody/dt_reb
Done nbody/dt_reb
malloc(): unaligned tcache chunk detected
[cn612:1851443] *** Process received signal ***
[cn612:1851443] Signal: Aborted (6)
[cn612:1851443] Signal code:  (-6)
Writing nbody/mscale
[cn612:1851443] [ 0] /lib64/libpthread.so.0(+0x12d10)[0x14a6b73d4d10]

So it's failing when writing nbody/mscale. If I print the value at that point, it is correct.
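In the log above, the malloc message shows up before the "Writing nbody/mscale" line. One possible explanation, which is an assumption and not something confirmed in this thread, is that the print went to a buffered stdout while the abort message went to stderr. The sketch below shows a flushed breadcrumb helper that keeps such markers in order when it is given the same stream as the error output (for example `std::cerr`). `Breadcrumb` and `ScopedBreadcrumb` are hypothetical names, not part of Parthenon.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: write one marker line and flush immediately, so the
// line has left the stream buffer before anything later can crash.
void Breadcrumb(std::ostream &os, const std::string &stage,
                const std::string &name) {
  os << stage << ' ' << name << '\n' << std::flush;
}

// Hypothetical RAII guard: prints "Writing <name>" on entry and
// "Done <name>" on normal scope exit. If the guarded code aborts, the
// "Done" line never appears, which pins down the failing item.
class ScopedBreadcrumb {
 public:
  ScopedBreadcrumb(std::ostream &os, std::string name)
      : os_(os), name_(std::move(name)) {
    Breadcrumb(os_, "Writing", name_);
  }
  ~ScopedBreadcrumb() { Breadcrumb(os_, "Done", name_); }

  ScopedBreadcrumb(const ScopedBreadcrumb &) = delete;
  ScopedBreadcrumb &operator=(const ScopedBreadcrumb &) = delete;

 private:
  std::ostream &os_;
  std::string name_;
};
```

Wrapping each attribute write in a `ScopedBreadcrumb(std::cerr, key)` scope would produce the same "Writing"/"Done" pairs as the manual prints, without depending on stdout buffering.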

And the full backtrace from gdb

Core was generated by `/vast/home/adempsey/github/artemis/build/cpu/src/artemis -r /vast/home/adempsey'.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x000014ad157c6326 in malloc () from /lib64/libc.so.6
#1  0x000014ad08214492 in _objalloc_alloc () from /lib64/libucs.so.0
#2  0x000014ad0815bca1 in bfd_hash_insert () from /lib64/libucs.so.0
#3  0x000014ad0815f3be in bfd_make_section_anyway_with_flags () from /lib64/libucs.so.0
#4  0x000014ad08177a68 in _bfd_elf_make_section_from_shdr.part.22 () from /lib64/libucs.so.0
#5  0x000014ad08176dc8 in bfd_section_from_shdr () from /lib64/libucs.so.0
#6  0x000014ad0817282c in bfd_elf64_object_p () from /lib64/libucs.so.0
#7  0x000014ad0815b3c5 in bfd_check_format_matches () from /lib64/libucs.so.0
#8  0x000014ad08140a5b in load_file () from /lib64/libucs.so.0
#9  0x000014ad08141258 in ucs_debug_backtrace_create.part () from /lib64/libucs.so.0
#10 0x000014ad081414e0 in ucs_debug_backtrace_create () from /lib64/libucs.so.0
#11 0x000014ad08141a44 in ucs_debug_show_innermost_source_file () from /lib64/libucs.so.0
#12 0x000014ad08144180 in ucs_handle_error () from /lib64/libucs.so.0
#13 0x000014ad0814436c in ucs_debug_handle_error_signal () from /lib64/libucs.so.0
#14 0x000014ad0814453a in ucs_error_signal_handler () from /lib64/libucs.so.0
#15 <signal handler called>
#16 0x000014ad157c6326 in malloc () from /lib64/libc.so.6
#17 0x000014ad16bfca9d in H5FL__malloc () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#18 0x000014ad16bfd389 in H5FL_blk_malloc () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#19 0x000014ad16c2f044 in H5HF__man_dblock_create () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#20 0x000014ad16c2f5ca in H5HF__man_dblock_new () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#21 0x000014ad16c3888e in H5HF__man_insert () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#22 0x000014ad16c25212 in H5HF_insert () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#23 0x000014ad16b26d07 in H5A__dense_insert () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#24 0x000014ad16c6e104 in H5O__attr_create () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#25 0x000014ad16b2dad8 in H5A__create () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#26 0x000014ad16dd9849 in H5VL__native_attr_create () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#27 0x000014ad16dc2834 in H5VL_attr_create () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#28 0x000014ad16b1f8e0 in H5Acreate2 () from /usr/projects/jovian/dependencies/install/skylake-gold/hdf5-1.12.2/lib/libhdf5.so.200
#29 0x0000000000dd00f6 in parthenon::HDF5::HDF5WriteAttribute<double> (name=..., num_values=1, data=0x5209e70, location=144115188075855890) at /vast/projects/opt/centos8/x86_64/gcc/9.4.0/include/c++/9.4.0/bits/basic_string.h:2304
#30 0x0000000000de49eb in parthenon::HDF5::HDF5WriteAttribute<double> (location=144115188075855890, values=<synthetic pointer>..., name=...) at /vast/home/adempsey/github/artemis/external/parthenon/src/outputs/parthenon_hdf5.hpp:148
#31 parthenon::HDF5::HDF5WriteAttribute<double, 0> (name=..., value=@0x364b000: 1, location=144115188075855890) at /vast/home/adempsey/github/artemis/external/parthenon/src/outputs/parthenon_hdf5.hpp:191
#32 0x0000000000e366ad in parthenon::Params::WriteToHDF5AllParamsOfType<double> (this=this@entry=0x364a630, prefix=..., group=...) at /vast/projects/opt/centos8/x86_64/gcc/9.4.0/include/c++/9.4.0/bits/char_traits.h:300
#33 0x0000000000e44f43 in parthenon::Params::WriteToHDF5AllParamsOfMultipleTypes<double, std::vector<double, std::allocator<double> >, parthenon::ParArrayGeneric<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double***, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int>, Kokkos::View<double*>, Kokkos::View<double**>, parthenon::ParArrayGeneric<Kokkos::View<double*******, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double*******, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, parthenon::HDF5::H5Handle<&H5Gclose> const&) const::{lambda()#1}::operator()() const (
    this=<optimized out>) at /vast/home/adempsey/github/artemis/external/parthenon/src/interface/params.cpp:47
#34 parthenon::Params::WriteToHDF5AllParamsOfMultipleTypes<double, std::vector<double, std::allocator<double> >, parthenon::ParArrayGeneric<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double***, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int>, Kokkos::View<double*>, Kokkos::View<double**>, parthenon::ParArrayGeneric<Kokkos::View<double*******, Kokkos::LayoutRight, Kokkos::HostSpace>, parthenon::empty_state_t, int>, parthenon::ParArrayGeneric<Kokkos::View<double*******, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>, parthenon::empty_state_t, int> > (this=this@entry=0x364a630, prefix=..., group=...) at /vast/home/adempsey/github/artemis/external/parthenon/src/interface/params.cpp:47
#35 0x0000000000e2cb47 in parthenon::Params::WriteToHDF5AllParamsOfTypeOrVec<double> (group=..., prefix=..., this=0x364a630) at /vast/home/adempsey/github/artemis/external/parthenon/src/interface/params.cpp:92
#36 parthenon::Params::WriteAllToHDF5 (this=0x364a630, prefix=..., group=...) at /vast/home/adempsey/github/artemis/external/parthenon/src/interface/params.cpp:92
#37 0x0000000000dd5f87 in parthenon::PHDF5Output::WriteOutputFileImpl<false> (this=0x30d3e70, pm=0x3653a80, pin=<optimized out>, tm=0x7ffe905bbb08, signal=<optimized out>)
    at /vast/home/adempsey/github/artemis/external/parthenon/src/interface/state_descriptor.hpp:142
#38 0x0000000000dc66bf in parthenon::Outputs::MakeOutputs (this=<optimized out>, pm=0x3653a80, pin=0x3592fd0, tm=tm@entry=0x7ffe905bbb08, signal=signal@entry=parthenon::SignalHandler::OutputSignal::final)
    at /vast/home/adempsey/github/artemis/external/parthenon/src/outputs/outputs.cpp:460
#39 0x0000000000d002bd in parthenon::EvolutionDriver::Execute (this=this@entry=0x7ffe905bbad0) at /vast/projects/opt/centos8/x86_64/gcc/9.4.0/include/c++/9.4.0/bits/unique_ptr.h:360
#40 0x0000000000433e7d in LaunchWorkFlow (pman=..., pin=0x3592fd0) at /vast/home/adempsey/github/artemis/src/main.cpp:90
#41 0x0000000000429192 in main (argc=9, argv=0x7ffe905bbd98) at /vast/projects/opt/centos8/x86_64/gcc/9.4.0/include/c++/9.4.0/bits/unique_ptr.h:360

@Yurlungur you were recently in this file. Do you have any idea if the changes you made could have caused something like this? It looks like some pointer got corrupted.

@Yurlungur

Yurlungur commented Nov 28, 2024

It looks like one of your params is not being properly output as an attribute. Do you know which variable it is failing on? In WriteToHDF5AllParamsOfType, in params.cpp, try adding a print statement:

template <typename T>
void Params::WriteToHDF5AllParamsOfType(const std::string &prefix,
                                        const HDF5::H5G &group) const {
  for (const auto &p : myParams_) {
    const auto &key = p.first;
    const auto type = myTypes_.at(key);
    if (type == std::type_index(typeid(T))) {
      auto typed_ptr = dynamic_cast<Params::object_t<T> *>((p.second).get());
      // Debug print: the last key shown before the crash is the suspect param.
      std::cout << "Writing param " << key << std::endl;
      HDF5::HDF5WriteAttribute(prefix + "/" + key, *typed_ptr->pValue, group);
    }
  }
}

This will produce a lot of output, but whatever param it prints right before the segfault should tell us which param is causing problems.

(Unless of course the out-of-bounds memory access happened somewhere completely different and it just segfaults here because we're lucky. Which is possible.)
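To illustrate the caveat above, here is a minimal sketch of a canary-guarded buffer: sentinel values sit on both sides of the payload, and a check reports whether anything has written past the ends. This is a generic debugging technique, not something Parthenon or rebound provides; `GuardedBuffer`, `kGuard`, and `kCanary` are hypothetical names. Memory-checking tools such as AddressSanitizer or Valgrind do the same job far more thoroughly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of sentinel slots on each side, and the sentinel value itself.
// Both are arbitrary choices made for this sketch.
constexpr std::size_t kGuard = 4;
constexpr double kCanary = -1.2345e300;

// Hypothetical debugging aid: a buffer of doubles surrounded by canaries.
class GuardedBuffer {
 public:
  explicit GuardedBuffer(std::size_t n) : n_(n), storage_(n + 2 * kGuard, 0.0) {
    for (std::size_t i = 0; i < kGuard; ++i) {
      storage_[i] = kCanary;                // guard before the payload
      storage_[kGuard + n_ + i] = kCanary;  // guard after the payload
    }
  }

  double *data() { return storage_.data() + kGuard; }
  std::size_t size() const { return n_; }

  // True while every canary still holds its original value.
  bool Intact() const {
    for (std::size_t i = 0; i < kGuard; ++i) {
      if (storage_[i] != kCanary) return false;
      if (storage_[kGuard + n_ + i] != kCanary) return false;
    }
    return true;
  }

 private:
  std::size_t n_;
  std::vector<double> storage_;
};
```

Checking `Intact()` at a few points between initialization and output would narrow down where an overrun happens, instead of relying on where the allocator eventually notices the damage.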

@adamdempsey90
Collaborator Author

It fails pretty reliably when writing nbody/mscale. This happens on serial CPU with either a debug or a release build.
