Segmentation fault for MPI operations on large data sizes #1895

Closed
mphoward opened this issue Sep 25, 2024 · 1 comment · Fixed by #1897
Assignees: mphoward
Labels: bug (Something isn't working), mpcd (MPCD component)

Comments

@mphoward
Collaborator

Description

MPCD segfaults for large system sizes.

Our working guess is that HOOMD's MPI helper methods, such as scatter_v and gather_v, overflow the MPI count (a signed int) once the data has been serialized to bytes.
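
A rough size estimate supports this guess for the script reported in this issue (L = 128, density = 50). The sketch below assumes each position is stored as three double-precision values with no extra serialization overhead; the actual serialized size in HOOMD may differ.

```python
# Largest value a signed 32-bit int (the MPI count type) can hold.
INT_MAX = 2**31 - 1

# Values taken from the script in this issue.
L = 128
density = 50
n_particles = density * L**3

# Assumption: 3 doubles (8 bytes each) per position, no serialization overhead.
bytes_per_vector = 3 * 8
total_bytes = n_particles * bytes_per_vector

print(f"particles:      {n_particles:,}")
print(f"position bytes: {total_bytes:,}")
print(f"INT_MAX:        {INT_MAX:,}")
print(f"exceeds int:    {total_bytes > INT_MAX}")
```

Under these assumptions the position array alone is about 2.5 GB, which is more than a signed int can count in bytes.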

I propose to register MPI_Datatypes for common data types we use these methods on that don't need to be serialized (like Scalar3). I would register these types in the MPIConfiguration and provide getters to access them. Then, callers can either invoke the MPI operations they want directly, or we can provide helper methods using these types (likely in the MPIConfiguration class as well).
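
To illustrate why a registered datatype would help, the sketch below compares counts and displacements expressed in bytes against the same quantities expressed in elements. It is a simplified model, not HOOMD's code: the even split across 48 ranks, the 24-byte vector size, and the helper names `counts_and_displs` and `fits_in_int` are all assumptions made for illustration. In MPI itself, one possible way to build such a type is `MPI_Type_contiguous`.

```python
INT_MAX = 2**31 - 1


def counts_and_displs(n_items, n_ranks, item_size=1):
    """Split n_items as evenly as possible across n_ranks.

    Returns per-rank counts and starting offsets, both in units of
    item_size (item_size=1 means "count whole elements").
    """
    base, extra = divmod(n_items, n_ranks)
    counts = [(base + (1 if r < extra else 0)) * item_size for r in range(n_ranks)]
    displs = [0]
    for c in counts[:-1]:
        displs.append(displs[-1] + c)
    return counts, displs


def fits_in_int(values):
    """True if every value is representable as a non-negative signed int."""
    return all(0 <= v <= INT_MAX for v in values)


n_particles = 50 * 128**3  # from the script in this issue
n_ranks = 48               # CPU count from the user report
vector_bytes = 24          # assumption: 3 doubles per vector

# Byte-based bookkeeping, as when sending serialized data.
byte_counts, byte_displs = counts_and_displs(n_particles, n_ranks, vector_bytes)
# Element-based bookkeeping, as with a registered 3-component datatype.
elem_counts, elem_displs = counts_and_displs(n_particles, n_ranks)

print("bytes:    counts fit =", fits_in_int(byte_counts),
      "| displacements fit =", fits_in_int(byte_displs))
print("elements: counts fit =", fits_in_int(elem_counts),
      "| displacements fit =", fits_in_int(elem_displs))
```

In this model the per-rank byte counts still fit, but the byte displacements for the last ranks do not, whereas element-based counts and displacements stay well within range.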

We should also, at minimum, add a check in the MPI helper methods and throw an exception if the serialized data is expected to overflow a signed int.
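
The kind of guard described here could look like the sketch below. It is written in Python only to show the logic; the function name `checked_mpi_count` is hypothetical, and the real check would live in the C++ helper methods.

```python
INT_MAX = 2**31 - 1  # largest signed 32-bit int


def checked_mpi_count(n_bytes):
    """Return n_bytes if it is usable as an MPI count, otherwise raise.

    Hypothetical helper: raising here replaces a silent overflow (and a
    later segfault) with an explicit error.
    """
    if n_bytes < 0 or n_bytes > INT_MAX:
        raise OverflowError(
            f"serialized size {n_bytes} bytes does not fit in a signed int "
            f"(max {INT_MAX})"
        )
    return n_bytes


# A small message passes through unchanged.
print(checked_mpi_count(1024))

# A message at the scale estimated for this issue is rejected.
try:
    checked_mpi_count(50 * 128**3 * 24)
except OverflowError as err:
    print("rejected:", err)
```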

A user reported that the script below segfaults when run on 48 CPUs.

Script

import hoomd
import numpy

device = hoomd.device.CPU()
simulation = hoomd.Simulation(device=device, seed=1)

snapshot = hoomd.Snapshot()
L = 128
density = 50
kT = 1

if snapshot.communicator.rank == 0:
    rng = numpy.random.default_rng(seed=42)
    snapshot.configuration.box = [L,L,L,0,0,0]
    snapshot.mpcd.types = ['A']
    snapshot.mpcd.N = int(density * L * L * L)
    snapshot.mpcd.position[:] = rng.uniform(low=-0.5*L,high=0.5*L,size=(snapshot.mpcd.N,3))

    velocity = rng.normal(0.0, numpy.sqrt(kT), (snapshot.mpcd.N, 3))
    velocity -= numpy.mean(velocity, axis=0)
    snapshot.mpcd.velocity[:] = velocity

simulation.create_state_from_snapshot(snapshot)

integrator = hoomd.mpcd.Integrator(dt=0.02)
integrator.collision_method = hoomd.mpcd.collide.StochasticRotationDynamics(
    period=1, angle=130, kT=kT
)

integrator.streaming_method = hoomd.mpcd.stream.Bulk(
    period=integrator.collision_method.period
)

integrator.mpcd_particle_sorter = hoomd.mpcd.tune.ParticleSorter(trigger=20)
simulation.operations.integrator = integrator

simulation.run(100)
device.notice(f'{simulation.tps}')

Input files

No response

Output

Segmentation fault

Expected output

No response

Platform

CPU, GPU, Linux

Installation method

Compiled from source

HOOMD-blue version

4.8.2

Python version

3.12

@mphoward mphoward added bug Something isn't working mpcd MPCD component labels Sep 25, 2024
@mphoward mphoward self-assigned this Sep 25, 2024
@joaander
Member

Thanks for thinking through this. I look forward to the pull request.
