libomp bug in stedc_solve #103

Open
2 of 6 tasks
wavefunction91 opened this issue Aug 27, 2023 · 0 comments
Description

When compiled with clang++ and linked with libomp, stedc_solve fails intermittently if OMP_NUM_THREADS > 1. I originally suspected an accidental double linkage with gomp through the Fortran linker, but inspecting the compiler and linker lines showed no such issue.
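For anyone wanting to double-check the linkage on the built binaries rather than on the link lines, a minimal sketch along these lines may help. The helper name list_omp_runtimes is hypothetical, and the library path is taken from the backtrace in this issue; the set of runtime names searched for (libomp, libgomp, libiomp5) is an assumption about what could plausibly be pulled in.

```shell
# Hypothetical helper: print each distinct OpenMP runtime named in
# ldd-style output read from stdin (one name per line, duplicates removed).
list_omp_runtimes() {
    grep -oE 'lib(omp|gomp|iomp5)\.so[.0-9]*' | sort -u
}

# Possible usage (path as it appears in the backtrace in this issue):
#   ldd /application/build_slate/libslate.so | list_omp_runtimes
# More than one line of output would suggest two OpenMP runtimes are loaded.
```

Note that ldd only reports dynamic dependencies, so a statically linked runtime would not show up here.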

Steps To Reproduce

  1. Build SLATE with clang / libomp
  2. Run tests with OMP_NUM_THREADS > 1
$ OMP_NUM_THREADS=1 python3 run_tests.py --syev
<...>
--------------------------------------------------------------------------------
All routines passed
$ OMP_NUM_THREADS=2 python3 run_tests.py --syev
<...>
./tester  --origin s --target t --ref n --nb 64,100 --type s,d,c,z --lookahead 1 --dim 100:500:100 --jobz v --method-eig qr,dc heev
% SLATE version 2023.08.25, id 57ea922b
% input: ./tester --origin s --target t --ref n --nb 64,100 --type s,d,c,z --lookahead 1 --dim 100:500:100 --jobz v --method-eig qr,dc heev
% 2023-08-27 21:37:32, 1 MPI ranks, CPU-only MPI, 2 OpenMP threads per MPI rank
                                                                                                                                                    
type  origin  target  eig   A   jobz    uplo       n    nb  ib    p    q  la  pt  value err   back err    Z orth.   time (s)  ref time (s)  status  
   s  scalpk    task   qr   1    vec   lower     100    64  32    1    1   1   1         NA   2.74e-08   1.44e-07     0.0125            NA  pass    
   s  scalpk    task   qr   1    vec   lower     100   100  32    1    1   1   1         NA   1.46e-08   1.42e-07    0.00620            NA  pass    
   s  scalpk    task   qr   1    vec   lower     200    64  32    1    1   1   1         NA   2.37e-08   1.50e-07     0.0452            NA  pass    
   s  scalpk    task   qr   1    vec   lower     200   100  32    1    1   1   1         NA   1.08e-08   1.40e-07     0.0385            NA  pass    
   s  scalpk    task   qr   1    vec   lower     300    64  32    1    1   1   1         NA   3.22e-08   1.42e-07      0.114            NA  pass    
   s  scalpk    task   qr   1    vec   lower     300   100  32    1    1   1   1         NA   1.35e-08   1.37e-07      0.113            NA  pass    
   s  scalpk    task   qr   1    vec   lower     400    64  32    1    1   1   1         NA   9.17e-09   1.28e-07      0.237            NA  pass    
   s  scalpk    task   qr   1    vec   lower     400   100  32    1    1   1   1         NA   2.78e-08   1.26e-07      0.232            NA  pass    
   s  scalpk    task   qr   1    vec   lower     500    64  32    1    1   1   1         NA   1.55e-08   1.24e-07      0.421            NA  pass    
   s  scalpk    task   qr   1    vec   lower     500   100  32    1    1   1   1         NA   1.61e-08   1.35e-07      0.431            NA  pass    
tester: /application/slate/src/stedc_solve.cc:120: void slate::stedc_solve(std::vector<real_t> &, std::vector<real_t> &, Matrix<real_t> &, Matrix<real_t> &, Matrix<real_t> &, const slate::Options &) [real_t = float]: Assertion `Qii.mb() == ib' failed.
[76d71bce518f:00035] *** Process received signal ***
[76d71bce518f:00035] Signal: Aborted (6)
[76d71bce518f:00035] Signal code:  (-6)
[76d71bce518f:00035] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f650ad5c520]
[76d71bce518f:00035] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f650adb0a7c]
[76d71bce518f:00035] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f650ad5c476]
[76d71bce518f:00035] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f650ad427f3]
[76d71bce518f:00035] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f650ad4271b]
[76d71bce518f:00035] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f650ad53e96]
[76d71bce518f:00035] [ 6] /application/build_slate/libslate.so(+0xc3c79e)[0x7f650d3b979e]
[76d71bce518f:00035] [ 7] /lib/x86_64-linux-gnu/libomp.so.5(+0x6156c)[0x7f650bdda56c]
[76d71bce518f:00035] [ 8] /lib/x86_64-linux-gnu/libomp.so.5(+0x653b2)[0x7f650bdde3b2]
[76d71bce518f:00035] [ 9] /lib/x86_64-linux-gnu/libomp.so.5(+0x72f90)[0x7f650bdebf90]
[76d71bce518f:00035] [10] /lib/x86_64-linux-gnu/libomp.so.5(+0x6e5ea)[0x7f650bde75ea]
[76d71bce518f:00035] [11] /lib/x86_64-linux-gnu/libomp.so.5(+0x7257e)[0x7f650bdeb57e]
[76d71bce518f:00035] [12] /lib/x86_64-linux-gnu/libomp.so.5(+0x44d3d)[0x7f650bdbdd3d]
[76d71bce518f:00035] [13] /lib/x86_64-linux-gnu/libomp.so.5(+0xa29f4)[0x7f650be1b9f4]
[76d71bce518f:00035] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f650adaeb43]
[76d71bce518f:00035] [15] /lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f650ae3fbb4]
[76d71bce518f:00035] *** End of error message ***
FAILED: heev, exit code -6
<...>
./tester  --origin s --target t --ref n --nb 64,100 --dim 100:500:100 stedc
tester: /application/slate/src/stedc_solve.cc:120: void slate::stedc_solve(std::vector<real_t> &, std::vector<real_t> &, Matrix<real_t> &, Matrix<real_t> &, Matrix<real_t> &, const slate::Options &) [real_t = double]: Assertion `Qii.mb() == ib' failed.
[76d71bce518f:00095] *** Process received signal ***
[76d71bce518f:00095] Signal: Aborted (6)
[76d71bce518f:00095] Signal code:  (-6)
[76d71bce518f:00095] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f0d0b6ce520]
[76d71bce518f:00095] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f0d0b722a7c]
[76d71bce518f:00095] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f0d0b6ce476]
[76d71bce518f:00095] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f0d0b6b47f3]
[76d71bce518f:00095] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f0d0b6b471b]
[76d71bce518f:00095] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f0d0b6c5e96]
[76d71bce518f:00095] [ 6] /application/build_slate/libslate.so(+0xc3cc9e)[0x7f0d0dd2bc9e]
[76d71bce518f:00095] [ 7] /lib/x86_64-linux-gnu/libomp.so.5(+0x6156c)[0x7f0d0c74c56c]
[76d71bce518f:00095] [ 8] /lib/x86_64-linux-gnu/libomp.so.5(+0x653b2)[0x7f0d0c7503b2]
[76d71bce518f:00095] [ 9] /lib/x86_64-linux-gnu/libomp.so.5(+0x72f90)[0x7f0d0c75df90]
[76d71bce518f:00095] [10] /lib/x86_64-linux-gnu/libomp.so.5(+0x6e5ea)[0x7f0d0c7595ea]
[76d71bce518f:00095] [11] /lib/x86_64-linux-gnu/libomp.so.5(+0x7257e)[0x7f0d0c75d57e]
[76d71bce518f:00095] [12] /lib/x86_64-linux-gnu/libomp.so.5(+0x44d3d)[0x7f0d0c72fd3d]
[76d71bce518f:00095] [13] /lib/x86_64-linux-gnu/libomp.so.5(+0xa29f4)[0x7f0d0c78d9f4]
[76d71bce518f:00095] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f0d0b720b43]
[76d71bce518f:00095] [15] /lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f0d0b7b1bb4]
[76d71bce518f:00095] *** End of error message ***
FAILED: stedc, exit code -6
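The libslate.so frames in the backtraces above are given only as raw offsets. As a rough sketch, the offsets could be extracted and passed to addr2line to recover function names. The helper name slate_offsets and the log file name backtrace.txt are hypothetical, and file and line information will only appear if the library was built with debug info.

```shell
# Hypothetical helper: read backtrace lines on stdin and print the
# offset of every frame that lies inside libslate.so, one per line.
slate_offsets() {
    sed -nE 's/.*libslate\.so\(\+(0x[0-9a-f]+)\).*/\1/p'
}

# Possible usage, assuming the tester output was saved to backtrace.txt:
#   slate_offsets < backtrace.txt | addr2line -C -f -e /application/build_slate/libslate.so
# addr2line reads addresses from stdin when none are given on the command line;
# -C demangles C++ names and -f prints the enclosing function.
```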

Environment

I've also attached a Dockerfile to reproduce the build environment.

# Dockerfile
FROM ubuntu:22.04

RUN apt update && \
    apt install -y locales && \
    locale-gen "en_US.UTF-8" && \
    update-locale LANG=en_US.UTF-8

ENV LANGUAGE en_US:en
ENV LANG en_US.UTF-8 
ENV LC_ALL en_US.UTF-8

WORKDIR /application

# Base Environment 
RUN apt -y update && apt -y install make wget curl \
      lsb-release coreutils sudo bash-completion \
      apt-transport-https software-properties-common \
      ca-certificates gnupg linux-tools-common time pciutils \
      build-essential wget curl \
      git make ninja-build \
      gdb valgrind \
      libeigen3-dev \
      libblas-dev liblapack-dev liblapacke-dev \
      libunwind-dev libtbb-dev libomp-dev \
      libopenmpi-dev openmpi-bin libscalapack-openmpi-dev 

# CMake + Clang
RUN apt -y install cmake cmake-curses-gui
RUN apt -y install clang-12 libomp-12-dev

# Clone SLATE
RUN git clone --recurse-submodules https://github.com/icl-utk-edu/slate.git
RUN git -C slate checkout 57ea922b4a10876ba990a41648590ef36019acdd

# Build BLASPP
RUN cmake -S slate/blaspp -B build_blaspp -DCMAKE_C_COMPILER=clang-12 -DCMAKE_CXX_COMPILER=clang++-12
RUN cmake --build build_blaspp --target blaspp -j2 

# Build LAPACKPP
RUN cmake -S slate/lapackpp -B build_lapackpp -DCMAKE_C_COMPILER=clang-12 -DCMAKE_CXX_COMPILER=clang++-12 -Dblaspp_DIR=$PWD/build_blaspp
RUN cmake --build build_lapackpp --target lapackpp -j2 

# Build SLATE
RUN cmake -S slate -B build_slate -DCMAKE_CXX_COMPILER=clang++-12 -Dblaspp_DIR=$PWD/build_blaspp -Dlapackpp_DIR=$PWD/build_lapackpp -DBUILD_TESTING=ON -DSCALAPACK_LIBRARIES="/usr/lib/x86_64-linux-gnu/libscalapack-openmpi.so"
RUN cmake --build build_slate --target all -j2 --verbose
  • SLATE version / commit ID (e.g., git log --oneline -n 1): 57ea922
  • How installed:
    • git clone
    • release tar file
    • Spack
    • module
  • How compiled:
    • makefile (include your make.inc)
    • CMake (include your command line options)
  • Compiler & version (e.g., mpicxx --version): clang++-12 (from the Dockerfile above)
  • BLAS library (e.g., MKL, ESSL, OpenBLAS) & version: NETLIB
  • CUDA / ROCm / oneMKL version (e.g., nvcc --version): N/A
  • MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.): Open MPI
  • OS: Ubuntu 22.04
  • Hardware (CPUs, GPUs, nodes): AMD EPYC 7302P 16-Core