`amrex.omp_threads`: Can Avoid SMT (#3607)
## Summary

In all our applications in BLAST, the OpenMP default of using all [logical cores on modern CPUs](https://en.wikipedia.org/wiki/Simultaneous_multithreading) results in significantly slower performance than using only the physical cores with AMReX.

Thus, we introduce a new option `amrex.omp_threads` that enables control over the OpenMP threads at startup and has - for most popular systems - an implementation to find out the actual number of physical cores and can default to it.

For codes and users that change the default to `amrex.omp_threads = nosmt`, the `OMP_NUM_THREADS` variable will still take precedence. This is a bit unusual (because CLI options usually have higher precedence than env vars - and they do if the user provides a number here), but done intentionally: this way, codes like WarpX can set the `nosmt` default and HPC job scripts will set the exact, preferably benchmarked number of threads as usual without surprises.

- [x] document

## Tests Performed for AMReX OMP Backend

Tests were performed with very small examples, the WarpX 3D LWFA test as checked in, or the AMReX AMRCore 3D test.

- [x] Ubuntu 22.04 Laptop w/ 12th Gen Intel i9-12900H: @ax3l
  - 20 logical cores; the first 12 logical cores use 2x SMT/HT
  - 20 virtual (default) -> 14 physical (`amrex.omp_threads = nosmt`)
  - faster runtime!
- [x] Perlmutter (SUSE Linux Enterprise 15.4, kernel 5.14.21)
  - [CPU node](https://docs.nersc.gov/systems/perlmutter/architecture/) with 2x [AMD EPYC 7763](https://www.amd.com/en/products/cpu/amd-epyc-7763)
  - 2x SMT
  - 256 default, 128 with `amrex.omp_threads = nosmt`
  - faster runtime!
- [x] Frontier (SUSE Linux Enterprise 15.4, kernel 5.14.21)
  - 1x AMD EPYC 7763 64-Core Processor (w/ 2x SMT enabled)
  - 2x SMT
  - 128 default, 64 with `amrex.omp_threads = nosmt`
  - faster runtime!
  - The ideal result might also be lower, because the first cores are used by the OS and [low-noise cores](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#low-noise-mode-layout) follow after that. But that is an orthogonal question and should be set in job scripts: `#SBATCH --ntasks-per-node=8` `#SBATCH --cpus-per-task=7` `#SBATCH --gpus-per-task=1`
- [x] Summit (RHEL 8.2, kernel 4.18.0)
  - 2x IBM Power9 (22 physical cores each, 6 each disabled/hidden for OS?, 4x SMT enabled; cpuinfo says 128 total)
  - 4x SMT
  - 128 default, 32 with `amrex.omp_threads = nosmt`
  - faster runtime!
- [x] [Lassen](https://hpc.llnl.gov/hardware/compute-platforms/lassen) (RHEL 7.9, kernel 4.14.0)
  - 2x IBM Power9 (22 physical cores each, 2 each reserved for OS?, 4x SMT enabled)
  - 4x SMT
  - 160 default, 44 with `amrex.omp_threads = nosmt`
  - faster runtime!
  - The ideal result might be even down to 40, but that is an orthogonal question and should be set in job scripts.
- [x] macOS M1 (arm64/aarch64) mini:
  - no SMT/HT
  - 8 default, 8 with `amrex.omp_threads = nosmt`
- [x] macOS (OSX Ventura 13.5.2, 2.8 GHz Quad-Core Intel Core i7-8569U) Intel x86_64 @n01r
  - 2x SMT
  - 8 default, 4 with `amrex.omp_threads = nosmt`
  - faster runtime!
- [x] macOS (OSX Ventura 13.5.2) M1 Max on mac studio @RTSandberg
  - no SMT/HT
  - 10 default, 10 with `amrex.omp_threads = nosmt`
- [ ] some BSD/FreeBSD system?
  - no user requests
  - low priority, we just keep the default for now
- [ ] Windows... looking for a system

## Additional background

## Checklist

The proposed changes:

- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

---------

Co-authored-by: Weiqun Zhang <[email protected]>
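The laptop numbers in the test list (20 logical cores, 14 physical) can be reproduced by counting distinct SMT sibling groups. The sketch below is illustrative only: `count_physical_cores` and `make_hybrid_layout` are hypothetical names, and the logical CPU numbering (SMT siblings numbered adjacently, single-thread cores last) is an assumption, not something the commit message states.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Count physical cores as the number of distinct SMT sibling groups.
// siblings_per_cpu[i] lists the logical CPUs that share a core with
// logical CPU i (ascending, including i itself).
int count_physical_cores (std::vector<std::vector<int>> const& siblings_per_cpu)
{
    std::set<std::vector<int>> unique_groups(siblings_per_cpu.begin(),
                                             siblings_per_cpu.end());
    return static_cast<int>(unique_groups.size());
}

// Hypothetical layout: smt_cores cores with two hardware threads each,
// followed by plain_cores cores with one hardware thread each.
// ASSUMPTION: siblings are numbered adjacently (0,1), (2,3), ...
std::vector<std::vector<int>> make_hybrid_layout (int smt_cores, int plain_cores)
{
    assert(smt_cores >= 0 && plain_cores >= 0);
    std::vector<std::vector<int>> siblings;
    for (int c = 0; c < smt_cores; ++c) {
        std::vector<int> const group = {2*c, 2*c + 1};
        siblings.push_back(group); // entry for logical CPU 2*c
        siblings.push_back(group); // entry for logical CPU 2*c + 1
    }
    for (int c = 0; c < plain_cores; ++c) {
        siblings.push_back(std::vector<int>{2*smt_cores + c});
    }
    return siblings;
}
```

Under these assumptions, `make_hybrid_layout(6, 8)` describes 20 logical CPUs and `count_physical_cores` reports 14 for it; `make_hybrid_layout(128, 0)` gives the 256 to 128 reduction listed for the Perlmutter CPU node.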
1 parent 606a94c, commit a7afcba
Showing 7 changed files with 224 additions and 7 deletions.
21 changes: 21 additions & 0 deletions
Docs/sphinx_documentation/source/InputsComputeBackends.rst
@@ -0,0 +1,21 @@
.. _Chap:InputsComputeBackends:

Compute Backends
================

The following inputs must be preceded by ``amrex.`` and determine runtime options of CPU or GPU compute implementations.

+------------------------+-----------------------------------------------------------------------+-------------+------------+
| Parameter              | Description                                                           | Type        | Default    |
+========================+=======================================================================+=============+============+
| ``omp_threads``        | If OpenMP is enabled, this can be used to set the default number of   | String      | ``system`` |
|                        | threads. The special value ``nosmt`` can be used to avoid using       | or Int      |            |
|                        | threads for virtual cores (aka Hyperthreading or SMT), as is default  |             |            |
|                        | in OpenMP, and instead only spawns threads equal to the number of     |             |            |
|                        | physical cores in the system.                                         |             |            |
|                        | For the values ``system`` and ``nosmt``, the environment variable     |             |            |
|                        | ``OMP_NUM_THREADS`` takes precedence. For Integer values,             |             |            |
|                        | ``OMP_NUM_THREADS`` is ignored.                                       |             |            |
+------------------------+-----------------------------------------------------------------------+-------------+------------+

For GPU-specific parameters, see also the :ref:`GPU chapter <sec:gpu:parameters>`.
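The precedence rules documented for ``omp_threads`` can be summarized as a small decision function. This is a minimal sketch, not the AMReX implementation: `resolve_omp_threads` is a hypothetical helper, and it returns an empty optional whenever the OpenMP runtime's own choice (which already honors `OMP_NUM_THREADS`) should be left untouched.

```cpp
#include <cassert>
#include <exception>
#include <optional>
#include <string>

// Returns the thread count to request explicitly, or std::nullopt when the
// OpenMP runtime default (including OMP_NUM_THREADS) should stay in effect.
std::optional<int> resolve_omp_threads (std::string const& option,
                                        bool env_is_set,
                                        int physical_cores)
{
    assert(physical_cores > 0);
    if (option == "system") {
        return std::nullopt;            // runtime default or OMP_NUM_THREADS
    }
    if (option == "nosmt") {
        if (env_is_set) {
            return std::nullopt;        // OMP_NUM_THREADS takes precedence
        }
        return physical_cores;          // one thread per physical core
    }
    try {
        return std::stoi(option);       // explicit integer: env var is ignored
    } catch (std::exception const&) {
        return std::nullopt;            // unknown value: keep runtime default
    }
}
```

For example, with 14 physical cores, `nosmt` yields 14 when `OMP_NUM_THREADS` is unset and leaves the runtime alone when it is set, while an explicit `8` always yields 8.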
@@ -0,0 +1,177 @@
#include <AMReX_OpenMP.H>
#include <AMReX.H>
#include <AMReX_ParmParse.H>
#include <AMReX_Print.H>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

#if defined(_WIN32)
#include <windows.h>
#endif

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <optional>
#include <set>
#include <sstream>
#include <string>
#include <thread>
#include <vector>


namespace amrex
{
    int
    numUniquePhysicalCores ()
    {
        int ncores;

#if defined(__APPLE__)
        size_t len = sizeof(ncores);
        // See hw.physicalcpu and hw.physicalcpu_max
        //   https://developer.apple.com/documentation/kernel/1387446-sysctlbyname/determining_system_capabilities/
        //   https://developer.apple.com/documentation/kernel/1387446-sysctlbyname
        if (sysctlbyname("hw.physicalcpu", &ncores, &len, NULL, 0) == -1) {
            if (system::verbose > 0) {
                amrex::Print() << "numUniquePhysicalCores(): Error receiving hw.physicalcpu! "
                               << "Defaulting to visible cores.\n";
            }
            ncores = int(std::thread::hardware_concurrency());
        }
#elif defined(__linux__)
        std::set<std::vector<int>> uniqueThreadSets;
        int cpuIndex = 0;

        while (true) {
            // for each logical CPU in cpuIndex from 0...N-1
            std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpuIndex) + "/topology/thread_siblings_list";
            std::ifstream file(path);
            if (!file.is_open()) {
                break; // no further CPUs to check
            }

            // find its siblings
            std::vector<int> siblings;
            std::string line;
            if (std::getline(file, line)) {
                std::stringstream ss(line);
                std::string token;

                // Possible syntax: 0-3, 8-11, 14,17
                // https://github.com/torvalds/linux/blob/v6.5/Documentation/ABI/stable/sysfs-devices-system-cpu#L68-L72
                while (std::getline(ss, token, ',')) {
                    size_t dashPos = token.find('-');
                    if (dashPos != std::string::npos) {
                        // Range detected
                        int start = std::stoi(token.substr(0, dashPos));
                        int end = std::stoi(token.substr(dashPos + 1));
                        for (int i = start; i <= end; ++i) {
                            siblings.push_back(i);
                        }
                    } else {
                        siblings.push_back(std::stoi(token));
                    }
                }
            }

            // and record the siblings group
            // (assumes: ascending and unique sets per cpuIndex)
            uniqueThreadSets.insert(siblings);
            cpuIndex++;
        }

        if (cpuIndex == 0) {
            if (system::verbose > 0) {
                amrex::Print() << "numUniquePhysicalCores(): Error reading CPU info.\n";
            }
            ncores = int(std::thread::hardware_concurrency());
        } else {
            ncores = int(uniqueThreadSets.size());
        }
#elif defined(_WIN32)
        DWORD length = 0;
        // The first call only queries the required buffer size: with a NULL
        // buffer it is expected to fail with ERROR_INSUFFICIENT_BUFFER and
        // to store the required size in length.
        bool result = GetLogicalProcessorInformation(NULL, &length);

        if (result || GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
            if (system::verbose > 0) {
                amrex::Print() << "numUniquePhysicalCores(): Failed to get logical processor information! "
                               << "Defaulting to visible cores.\n";
            }
            ncores = int(std::thread::hardware_concurrency());
        }
        else {
            std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> buffer(length / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
            if (!GetLogicalProcessorInformation(&buffer[0], &length)) {
                if (system::verbose > 0) {
                    amrex::Print() << "numUniquePhysicalCores(): Failed to get logical processor information! "
                                   << "Defaulting to visible cores.\n";
                }
                ncores = int(std::thread::hardware_concurrency());
            } else {
                // one RelationProcessorCore entry per physical core
                ncores = 0;
                for (const auto& info : buffer) {
                    if (info.Relationship == RelationProcessorCore) {
                        ncores++;
                    }
                }
            }
        }
#else
        // TODO:
        //   BSD
        if (system::verbose > 0) {
            amrex::Print() << "numUniquePhysicalCores(): Unknown system. Defaulting to visible cores.\n";
        }
        ncores = int(std::thread::hardware_concurrency());
#endif
        return ncores;
    }
} // namespace amrex

#ifdef AMREX_USE_OMP
namespace amrex::OpenMP
{
    void init_threads ()
    {
        amrex::ParmParse pp("amrex");
        std::string omp_threads = "system";
        pp.queryAdd("omp_threads", omp_threads);

        // convert a string to an int, or return an empty optional on failure
        auto to_int = [](std::string const & str_omp_threads) {
            std::optional<int> num;
            try { num = std::stoi(str_omp_threads); }
            catch (...) { /* nothing */ }
            return num;
        };

        if (omp_threads == "system") {
            // default or OMP_NUM_THREADS environment variable
        } else if (omp_threads == "nosmt") {
            char const *env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
            if (env_omp_num_threads != nullptr) {
                // OMP_NUM_THREADS takes precedence over nosmt, regardless of
                // the verbosity level; only the message depends on verbosity
                if (amrex::system::verbose > 1) {
                    amrex::Print() << "amrex.omp_threads was set to nosmt, "
                                   << "but OMP_NUM_THREADS was set. Will keep "
                                   << "OMP_NUM_THREADS=" << env_omp_num_threads << ".\n";
                }
            } else {
                omp_set_num_threads(numUniquePhysicalCores());
            }
        } else {
            std::optional<int> num_omp_threads = to_int(omp_threads);
            if (num_omp_threads.has_value()) {
                omp_set_num_threads(num_omp_threads.value());
            }
            else {
                if (amrex::system::verbose > 0) {
                    amrex::Print() << "amrex.omp_threads has an unknown value: "
                                   << omp_threads
                                   << " (try system, nosmt, or a positive integer)\n";
                }
            }
        }
    }
} // namespace amrex::OpenMP
#endif // AMREX_USE_OMP
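The Linux branch above parses each `thread_siblings_list` entry, which mixes single CPU ids and inclusive ranges. The parsing step can be sketched in isolation as follows; `parse_cpu_list` is a hypothetical standalone helper written for illustration and is not part of this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Expand a sysfs-style CPU list such as "0-3,8-11,14,17" into the
// individual logical CPU ids, in the order they appear.
std::vector<int> parse_cpu_list (std::string const& line)
{
    std::vector<int> cpus;
    std::stringstream ss(line);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (token.empty()) { continue; }            // tolerate stray commas
        std::size_t const dash = token.find('-');
        if (dash != std::string::npos) {
            // inclusive range "first-last"
            int const first = std::stoi(token.substr(0, dash));
            int const last = std::stoi(token.substr(dash + 1));
            assert(first <= last);
            for (int i = first; i <= last; ++i) { cpus.push_back(i); }
        } else {
            cpus.push_back(std::stoi(token));       // single CPU id
        }
    }
    return cpus;
}
```

Inserting each parsed list into a `std::set<std::vector<int>>`, as the commit does, then collapses all logical CPUs that report the same sibling list into one physical core.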