Benchmark suggestions requested #3672
-
I got AMReX built and ran a few of the tests on one or two nodes of a cluster of Cascade Lake processors. I also ran the one under Tests/Amr/Advection_AmrLevel/Exec/UniformVelocity on up to 32 nodes of the same cluster, increasing amr.n_cell to 512x512x512. My interest is in examining the impact of the interconnection network HCA's link speed and width on the performance of AMReX. What would be a good benchmark to run for that purpose?

Saludos,
Gerardo
Senior Engineer, Networking HPC Applications Performance (at NVIDIA)
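(A rough sketch of how such a run can be set up; the build options, launcher, rank count, and executable name below are illustrative and depend on the local toolchain and GNUmakefile settings.)

```sh
# Sketch only: make options, launcher, rank count, and the executable name
# (main3d.gnu.MPI.ex here) vary with the local setup.
cd Tests/Amr/Advection_AmrLevel/Exec/UniformVelocity
make -j8 DIM=3 USE_MPI=TRUE

# Edit the test's inputs file so that
#   amr.n_cell = 512 512 512
# then launch across the nodes under test:
mpiexec -n <nranks> ./main3d.gnu.MPI.ex <inputs-file>
```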
-
I think https://github.com/AMReX-Codes/amrex/tree/development/Tests/LinearSolvers/ABecLaplacian_C could be a good test. The multigrid solver is often the bottleneck of many AMReX applications, and at large scales the communication cost starts to dominate. You can use the setup in https://github.com/AMReX-Codes/amrex/tree/development/Tests/LinearSolvers/ABecLaplacian_C/scalingtest. 256^3 cells per GPU is probably a good target for weak-scaling runs. (For example, n_cell=256 for 1 GPU and n_cell=512 for 8 GPUs.)
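A possible weak-scaling sequence along those lines is sketched below. The executable and inputs file names are placeholders for whatever the scalingtest setup provides; n_cell is passed as a command-line override, since AMReX ParmParse accepts key=value arguments after the inputs file.

```sh
# Weak scaling at ~256^3 cells per GPU (one MPI rank per GPU is typical).
# Executable and inputs file names are placeholders for the scalingtest setup.
mpiexec -n 1 ./<executable> <inputs-file> n_cell=256   # 1 GPU
mpiexec -n 8 ./<executable> <inputs-file> n_cell=512   # 8 GPUs
```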
-
Weiqun, thanks again. I rebuilt with TINY_PROFILE=TRUE in the GNUmakefile. Is it correct to assume the time of interest is the one reported for the LinearSolver region? For instance, on 32 nodes:

| Name | NCalls | Incl. Min | Incl. Avg | Incl. Max | Max % |
|------|--------|-----------|-----------|-----------|-------|
| REG::LinearSolver | 1 | 16.47 | 16.5 | 16.53 | 98.36% |

Saludos,
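(For completeness, the rebuild step looks roughly like the sketch below; any other GNUmakefile options should match the original build. The TinyProfiler summary, including the REG::LinearSolver region, is printed to stdout at the end of the run.)

```sh
# Sketch: rebuild with the tiny profiler enabled; other make options
# (compiler, MPI, dimension) should match the original build.
make clean
make -j8 TINY_PROFILE=TRUE USE_MPI=TRUE
```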
-
To use GPU-aware MPI, run with amrex.use_gpu_aware_mpi=1.
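Since this is a runtime ParmParse parameter, it can be set in the inputs file or appended to the run command line, for example as sketched below (executable and inputs file names are placeholders):

```sh
# Sketch: append the runtime flag to the command line (or put
# amrex.use_gpu_aware_mpi = 1 in the inputs file).
mpiexec -n 8 ./<executable> <inputs-file> n_cell=512 amrex.use_gpu_aware_mpi=1
```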