Replies: 2 comments 4 replies
-
Looks like your memory bandwidth is saturating (as expected). Note that FDTD is heavily memory-bound, not compute-bound.
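To see this effect outside of Meep, here is a small standalone sketch (not from this thread; the array sizes and process counts are arbitrary) that times the same streaming kernel in an increasing number of concurrent processes. On a bandwidth-limited machine, the total throughput plateaus and the per-process throughput falls, which is the same wall that "independent" FDTD jobs hit:

```python
# Standalone sketch: measure aggregate and per-process memory bandwidth as the
# number of concurrent worker processes grows. All sizes here are arbitrary.
import time
from multiprocessing import Pool

import numpy as np

N_DOUBLES = 2**23  # 8M doubles = 64 MiB per array, far larger than any cache
REPS = 5

def stream_triad(_):
    """Time a STREAM-triad-like kernel; return approximate GB/s for this process."""
    a = np.zeros(N_DOUBLES)
    b = np.ones(N_DOUBLES)
    c = np.ones(N_DOUBLES)
    t0 = time.perf_counter()
    for _ in range(REPS):
        a[:] = b + 2.0 * c  # reads b and c, writes a (a numpy temporary adds traffic)
    dt = time.perf_counter() - t0
    # Count only 3 array passes per iteration, so this underestimates true traffic.
    return REPS * 3 * N_DOUBLES * 8 / dt / 1e9

if __name__ == "__main__":
    for nproc in (1, 2, 4, 8, 16, 32):
        with Pool(nproc) as pool:
            rates = pool.map(stream_triad, range(nproc))
        print(f"{nproc:2d} procs: total {sum(rates):6.1f} GB/s, "
              f"per-process {sum(rates) / nproc:5.1f} GB/s")
```

If the total GB/s stops growing well before the process count does, adding more ranks can only slow each one down.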
-
See also #882, which may be relevant to the performance degradation you are experiencing as you ramp up the number of "independent" jobs/threads.
-
Hello,
I am using

```python
mp.divide_parallel_processes(N)
```

where I have a simple piece of waveguide and I am sweeping the width; each width is expected to run on a separate core. Here is the command I run:

```sh
mpirun -np N python MyWG_simple.py > "script_printed.out" &
```

and I change N in the script to match the N in this command, so I expect each simulation to run on one physical core. I am running meep 1.23.0 with mpich 4.0.2 and mpi4py 3.1.3 on Ubuntu 20.04, on a 32-core, 64-thread machine, and I am the only one using this machine. Here are the time results:

I expected the time to stay constant because all the simulations are running in parallel, but the time increases exponentially as I increase the number of cores. What am I doing wrong?
I also tried pinning the processes with taskset:

```sh
taskset -c 0-4 mpirun -np 5 python MyWG_simple.py > "script_printed.out" &
```

(and of course I change the number of cores each time I run to match the N in the Meep script), and I get worse results.

I am wondering:
a) Why does the time increase exponentially, rather than stay constant, as I increase the number of cores for an embarrassingly parallel task?
b) Why does it take more time to run with taskset than without it?
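For context, here is a minimal self-contained sketch of the kind of width sweep being described (the geometry, source, and width values below are made up for illustration; only `divide_parallel_processes` itself is from the post). Each one-rank subgroup simulates one width:

```python
import meep as mp

N = 5  # number of independent one-rank subgroups; launch with: mpirun -np 5
n = mp.divide_parallel_processes(N)  # index of this rank's subgroup, 0..N-1

widths = [0.4, 0.5, 0.6, 0.7, 0.8]  # hypothetical waveguide widths (um)
w = widths[n]  # each subgroup simulates one width

sim = mp.Simulation(
    cell_size=mp.Vector3(8, 4),
    resolution=20,
    boundary_layers=[mp.PML(1.0)],
    geometry=[mp.Block(size=mp.Vector3(mp.inf, w, mp.inf),
                       material=mp.Medium(index=3.45))],
    sources=[mp.Source(mp.GaussianSource(frequency=0.65, fwidth=0.1),
                       component=mp.Ez,
                       center=mp.Vector3(-3, 0))],
)
sim.run(until=50)
print(f"subgroup {n}: width {w} um done")
```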
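For question (b), a small diagnostic sketch (a suggestion, not from the thread) is to have every rank report its CPU affinity mask, to check what taskset and the MPI launcher actually pinned. Note that `os.sched_getaffinity` is Linux-only, which matches the Ubuntu setup here; the script name below is hypothetical:

```python
# Run with, e.g.: taskset -c 0-4 mpirun -np 5 python report_affinity.py
# Each rank prints the CPUs it is allowed to run on; if several ranks share
# the same small mask, they are competing for (and migrating between) cores.
import os

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
cpus = sorted(os.sched_getaffinity(0))  # Linux-only affinity query
print(f"rank {rank}: allowed CPUs = {cpus}")
```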