Description
MPCD segfaults for large system sizes:
Our working guess is that HOOMD's MPI helper methods, such as scatter_v and gather_v, overflow the MPI count (a signed int) when the data is serialized to bytes.
I propose registering MPI_Datatypes for the common data types that we pass to these methods and that don't need to be serialized (like Scalar3). I would register these types in the MPIConfiguration and provide getters to access them. Callers can then either invoke the MPI operations they want directly, or we can provide helper methods that use these types (likely in the MPIConfiguration class as well).
We should also, at minimum, add a check to the MPI helper methods and throw an exception if the serialized data would overflow a signed int.
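As a rough illustration of the proposed check, the sketch below validates a size before it is narrowed to the signed int that MPI count arguments use. The names checked_mpi_count, serialized_bytes, and Vec3 are hypothetical and are not part of HOOMD; Vec3 stands in for a three-component type such as Scalar3 and is assumed here to hold three doubles.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a three-component value such as Scalar3.
// Assumption: three doubles, so 24 bytes on common platforms.
struct Vec3
    {
    double x, y, z;
    };

// Number of bytes needed to send n_elements values of type T as raw bytes.
template<class T> std::size_t serialized_bytes(std::size_t n_elements)
    {
    return n_elements * sizeof(T);
    }

// Hypothetical helper: return n as an int suitable for an MPI count
// argument, or throw if n does not fit in a signed int.
int checked_mpi_count(std::size_t n)
    {
    if (n > static_cast<std::size_t>(INT_MAX))
        {
        throw std::overflow_error("MPI count of " + std::to_string(n)
                                  + " exceeds the signed int limit of "
                                  + std::to_string(INT_MAX));
        }
    return static_cast<int>(n);
    }
```

With 100 million Vec3 values (an illustrative number, not taken from the report), checked_mpi_count(100000000) succeeds when the count is expressed in elements, while checked_mpi_count(serialized_bytes<Vec3>(100000000)) throws, because 2.4 billion bytes exceed INT_MAX. That difference is the motivation for counting in units of a registered datatype instead of bytes.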
A user reported that the script below segfaults when run on 48 CPUs.
Script
Input files
No response
Output
Expected output
No response
Platform
CPU, GPU, Linux
Installation method
Compiled from source
HOOMD-blue version
4.8.2
Python version
3.12