Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

threaded segfault in setindex! with arrays of compound-typed elements #1179

Closed
lxvm opened this issue Nov 15, 2024 · 2 comments · Fixed by #1181
Closed

threaded segfault in setindex! with arrays of compound-typed elements #1179

lxvm opened this issue Nov 15, 2024 · 2 comments · Fixed by #1181

Comments

@lxvm
Copy link
Contributor

lxvm commented Nov 15, 2024

Hello,

@phibeck reported this bug to me and I have made a mwe of a segfault. setindex! on a dataset with an array of a compound datatype appears to be thread unsafe.

This is the mwe

using HDF5

T = ComplexF64
# T = Float64 # OK
sizes = (100, 100, 100)
Threads.@threads for i in 1:Threads.nthreads()
    h5open("file$i.hdf5", "w") do h5
        d = create_dataset(h5, "rand", T, sizes)
        for n in 1:sizes[3]
            d[:,:,n] = rand(T, sizes[1:2])
        end
    end
end

This is the segfault, although the message is different every time

$ julia --project --threads 6 mwe.jl

[400950] signal 7 (128): Bus error
in expression starting at /local/home/lxvm/projects/autobz/hdf5/mwe.jl:6
H5FL_reg_malloc at /home/lxvm/.julia/artifacts/3f844d84068534dcd6606936ab5f28e1120a9bb0/lib/libhdf5.so (unknown line)
H5FL_reg_calloc at /home/lxvm/.julia/artifacts/3f844d84068534dcd6606936ab5f28e1120a9bb0/lib/libhdf5.so (unknown line)
H5CX_push at /home/lxvm/.julia/artifacts/3f844d84068534dcd6606936ab5f28e1120a9bb0/lib/libhdf5.so (unknown line)
H5Sget_simple_extent_type at /home/lxvm/.julia/artifacts/3f844d84068534dcd6606936ab5f28e1120a9bb0/lib/libhdf5.so (unknown line)
h5s_get_simple_extent_type at /home/lxvm/.julia/packages/HDF5/Z859u/src/api/functions.jl:6716
setindex! at /home/lxvm/.julia/packages/HDF5/Z859u/src/datasets.jl:368
unknown function (ip: 0x7f6496319060)
#2 at /local/home/lxvm/projects/autobz/hdf5/mwe.jl:10
#17 at /home/lxvm/.julia/packages/HDF5/Z859u/src/file.jl:101
task_local_storage at ./task.jl:315
#h5open#16 at /home/lxvm/.julia/packages/HDF5/Z859u/src/file.jl:96 [inlined]
h5open at /home/lxvm/.julia/packages/HDF5/Z859u/src/file.jl:94 [inlined]
macro expansion at /local/home/lxvm/projects/autobz/hdf5/mwe.jl:7 [inlined]
#26#threadsfor_fun#1 at ./threadingconstructs.jl:252
#26#threadsfor_fun at ./threadingconstructs.jl:219 [inlined]
#1 at ./threadingconstructs.jl:154
unknown function (ip: 0x7f649630ba9f)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
start_task at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:1202
Allocations: 4069986 (Pool: 4069810; Big: 176); GC: 6
Bus error

In @phibeck's bug report the segfault showed a trace through setindex! -> get_jl_type -> get_mem_compatible_jl_type -> h5t_get_member_name and I noticed there was no lock on the library call in this function:

HDF5.jl/src/api/helpers.jl

Lines 982 to 989 in 1a24872

# Note: The following two functions implement direct ccalls because the binding generator
# cannot (yet) do the string wrapping and memory freeing.
"""
h5t_get_member_name(type_id::hid_t, index::Cuint) -> String
See `libhdf5` documentation for [`H5Oopen`](https://portal.hdfgroup.org/display/HDF5/H5T_GET_MEMBER_NAME).
"""
function h5t_get_member_name(type_id, index)

It seems that calling setindex! with an array of a compound datatype, such as ComplexF64, hits this codepath and consistently segfaults.
Although our code could be rewritten to avoid multi-threaded hdf5 calls, I am hoping to identify and fix the issue.

For reference, I ran this on linux and this is the environment:

julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.1 (2024-10-16)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(hdf5) pkg> st -m
Status `/local/home/lxvm/projects/autobz/hdf5/Manifest.toml`
  [34da2185] Compat v4.16.0
  [f67ccb44] HDF5 v0.17.2
  [692b3bcd] JLLWrappers v1.6.1
  [3da0fdf6] MPIPreferences v0.1.11
  [21216c6a] Preferences v1.4.3
  [ae029012] Requires v1.3.0
  [0234f1f7] HDF5_jll v1.14.3+3
  [e33a78d0] Hwloc_jll v2.11.2+1
  [7cb0a576] MPICH_jll v4.2.3+0
  [f1f71cc9] MPItrampoline_jll v5.5.1+0
  [9237b28f] MicrosoftMPI_jll v10.1.4+2
⌅ [fe0851c0] OpenMPI_jll v4.1.6+0
  [458c3c95] OpenSSL_jll v3.0.15+1
  [477f73a3] libaec_jll v1.1.2+0
  [0dad84c5] ArgTools v1.1.2
  [56f22d72] Artifacts v1.11.0
  [2a0f44e3] Base64 v1.11.0
  [ade2ca70] Dates v1.11.0
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching v1.11.0
  [4af54fe1] LazyArtifacts v1.11.0
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2 v1.11.0
  [8f399da3] Libdl v1.11.0
  [56ddb016] Logging v1.11.0
  [d6f4376e] Markdown v1.11.0
  [a63ad114] Mmap v1.11.0
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.11.0
  [de0858da] Printf v1.11.0
  [9a3f8284] Random v1.11.0
  [ea8e919c] SHA v0.7.0
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [cf7118a7] UUIDs v1.11.0
  [4ec0a83e] Unicode v1.11.0
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.6.0+0
  [e37daf67] LibGit2_jll v1.7.2+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.6+0
  [14a3606d] MozillaCACerts_jll v2023.12.12
  [83775a58] Zlib_jll v1.2.13+1
  [8e850ede] nghttp2_jll v1.59.0+0
  [3f19e933] p7zip_jll v17.4.0+2
@lxvm lxvm changed the title threaded segfault in setindex! with compound types threaded segfault in setindex! with arrays of compound-typed elements Nov 15, 2024
@mkitti
Copy link
Member

mkitti commented Nov 18, 2024

The entire C library is not thread safe. You are correct that in this case the API lock on the Julia side has been bypassed. I suppose we can make it not segfault, but nonetheless using threads with HDF5 is not really going to work.

@lxvm
Copy link
Contributor Author

lxvm commented Nov 18, 2024

In our application the overhead of the API lock is negligible. We are usually calculating expensive integrals each taking at least a minute per thread that return 3x3 complex matrices. We only need the real part of this matrix, and could avoid segfaults by taking the real part. We could also rewrite our code in a distributed manner since it is embarrassingly parallel. However, if all that is required to avoid these segfaults is to wrap each ccall in src/api/helpers.jl with a lock and try block, like src/api/functions.jl does, then I am happy to make a pr because it only looks like there are only 7 thread-unsafe ccalls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants