From 4f71ee16ac9b51f68a4c576897030a8cc90fb200 Mon Sep 17 00:00:00 2001 From: Jordan Ogas <jogas@lanl.gov> Date: Thu, 26 Jan 2023 12:24:48 -0700 Subject: [PATCH 1/9] initial notes --- doc/faq.rst | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 143 insertions(+) diff --git a/doc/faq.rst b/doc/faq.rst index 35f9b52c4..4f00d03c1 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -29,6 +29,149 @@ How large is Charliecloud? .. include:: _loc.rst +MPI best practices +================== + +Using MPI optimally with unprivileged containers can be challenging. This +section summarizes key concepts and best practices for using MPI with +Charliecloud. + +Best practices on offer are the result of our experience with midigating some +of the complex issues containers face with MPI. It is important to note that, +despite marketing claims, no single container implementation has "solved" MPI, +or is free of warts; the issues are numerous, multifacted, and dynamic. + +The following is a summary of key concepts and issues covered in more detail in +subsequent sections. + + 1. **Workload management**. Running applications on HPC clusters requires + resource management and job scheduling. Put simply, resource management + is the act of allocating and restricting compute resources, e.g., CPU and + memory, whereas job scheduling is the act of prioritizing and enforcing + resource management. Both require privileged operations. + + Some privileged container implementations attempt to provide (sell) their + own workload management, often referred to as 'container orchestration'. + + Charliecloud is completely unprivileged. We rely on existing, reputable + and well established HPC workload managers, e.g., SLURM, etc. + + (FIXME: snarky alternative: + + Charliecloud is completely unprivileged; we do not have the luxury of + trying to replace existing, reputable and well established HPC workload + managers.) + + 2. **Job launch**. When a MPI job is launched, node(s) must launch a number + of containerized processes, i.e., ranks. + + Unprivileged container implementations can not launch containerized + processes on other nodes in a scalable manner. Support for MPI application + interactions with the workload manager is then needed to facilitate the + job launch. See PMI and alternative workflows. + + 3. **Shared memory**. Unprivileged containerized processes cannot access each + other's memory using the newer, faster :code:`process_vm_ready(2)`. + + Proccesess on the same node can be placed in the same namespace for plain + shared memory. See :code:`--join` in :code:`ch-run`. + + 4. **Network fabric.** Containerized MPI processes must be able to recognize + and use a system's high-speed interconnect. Common issues that arise are: + + a. Libraries required to use the high-speed interconnect are proprietary, + or otherwise unavailable to the container. + + b. The high-speed interconnect is not supported by the container MPI. + + In both cases, the containerized MPI application will run significantly + slower, or worse, fail to run altogether. + +Approach +-------- + +Build a flexible MPI container using: + + a. **Libfabric** to manage process communcation over a robust set of network + fabrics, + + b. **Parallel Process Managment** (PMI) supported by the host workload + manager to facilitate parallel process interactions between the container + application and resource manager, and + + c. **MPI** implementation that supports Libfabric and host supported PMI(s). 
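+
+For example, when all three pieces line up, the host workload manager can
+launch the containerized ranks directly. A minimal sketch on a SLURM system
+with PMIx support (the image path and program name here are hypothetical):
+
+::
+
+  $ salloc -N2 --ntasks-per-node=2
+  $ srun --mpi=pmix ch-run /var/tmp/mpi-image -- /usr/local/bin/mpi-hello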
+ +The following sections cover the elements of this approach in more detail. Note +that alternative workflows are covered in the alternative section. + +Libfabric +--------- + +Libfabric, also known as OpenFabric Interfaces (OFI), is a low-level +communcation library that abstracts diverse networking technologies. It +defines interfaces, called **providers**, that implement the semantic mapping +between application facing software and network specific protocols, drivers, +and hardware. These providers have been co-designed with fabric hardware and +application developers with a focus on HPC needs. + + - https://ofiwg.github.io/libfabric + - https://ofiwg.github.io/libfabric/v1.14.0/man/fi_provider.3.html + - https://github.com/ofiwg/libfabric/blob/main/README.md + +Using Libfabric, we can more easily manage MPI communcation over a diverse set +of network fabrics with built-in or loadable providers. Our Libfabric example, +:code:`examples/Docker file.libfabric`, compiles :code:`psm3`, :code:`rxm`, +:code:`shm`, :code:`tcp`, and :code:`verbs` build-in providers. It is capable +of running over most socket and verb devices using TCP, IB, OPA, and RoCE +protocols. + +Shared providers, compiled with :code:`-dl`, e.g., :code:`--with-gni-dl`, can +be compiled on a target host and later added to the container. For example, on +the Cray systems with the :code:`Gemini/Aries` network, users can build the +shared :cod:`gni` provider to be added to, and used by, the Charliecloud +container's Libfabric at runtime. The same is true for any other Libfabric +provider. + +Finally, our Libfabric can also be replaced by the hosts, which is presently the +only way to leverage Cray's Slingshot :code:`CXI` provider. See ch-fromhost. + +Parallel process management +--------------------------- + +Unprivileged containers are unable to launch containerized processes on +different nodes, aside from using SSH, which isn't scalable. We must either +(a) rely on a host supported parallel process management interface (PMI), or +(b) achieve host/container MPI compatatbility with unsavory binary patching. +The former is recommended, the latter is described in more detail in the +deprecated section. + +The preferred PMI implementation, e.g., :code:`PMI-1`, :code:`PMI-2`, +:code:`OpenPMIx`, :code:`flux-pmi`, etc., will be that which is supported +by your host workload manager and container MPI. + +In our example, :code:`example/Dockerfile.libfabrc`, we use :code:`OpenPMIx` +because: 1) it is supported by SLURM, OpenMPI, and MPICH, 2) scales better than +PMI2, and (3) OpenMPI versions 5 and newer will only support PMIx. + +MPI +--- + +There are various MPI implementations, e.g., OpenMPI, MPICH, MVAPICH2, +Intel-MPI, etc., to consider. We generally recommend OpenMPI, however, your +MPI implementation will ultimately need to be one that supports Libfabric and +the PMI compatible with your host workload manager. + +Injection +--------- + +Alternatives +------------ + +Using Libfabric and PMI is not the only way to make use of proprietary or +unsupported network fabrics. 
There are other, more complicated, ways to FIXME + +Common problems +--------------- Errors ====== From 7b616b8d41fac668e09ace136269c9f6de940de6 Mon Sep 17 00:00:00 2001 From: Jordan Ogas <jogas@lanl.gov> Date: Sat, 28 Jan 2023 14:10:31 -0700 Subject: [PATCH 2/9] edits --- doc/faq.rst | 99 +++++++++++++++++++++++++++++++---------------------- 1 file changed, 58 insertions(+), 41 deletions(-) diff --git a/doc/faq.rst b/doc/faq.rst index 4f00d03c1..cea9bdbdb 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -32,17 +32,12 @@ How large is Charliecloud? MPI best practices ================== -Using MPI optimally with unprivileged containers can be challenging. This -section summarizes key concepts and best practices for using MPI with -Charliecloud. +Best practices on offer are derived from our experience in midigating container +MPI issues. It is important to note that, despite marketing claims, no single +container implementation has "solved" MPI, or is free of warts; the issues are +numerous, multifacted, and dynamic. -Best practices on offer are the result of our experience with midigating some -of the complex issues containers face with MPI. It is important to note that, -despite marketing claims, no single container implementation has "solved" MPI, -or is free of warts; the issues are numerous, multifacted, and dynamic. - -The following is a summary of key concepts and issues covered in more detail in -subsequent sections. +Key concepts and issues are as follows. 1. **Workload management**. Running applications on HPC clusters requires resource management and job scheduling. Put simply, resource management @@ -62,13 +57,13 @@ subsequent sections. trying to replace existing, reputable and well established HPC workload managers.) - 2. **Job launch**. When a MPI job is launched, node(s) must launch a number - of containerized processes, i.e., ranks. + 2. **Job launch**. When a multinode MPI job is launched, each node must launch + a number of containerized processes, i.e., ranks. Unprivileged container implementations can not launch containerized - processes on other nodes in a scalable manner. Support for MPI application - interactions with the workload manager is then needed to facilitate the - job launch. See PMI and alternative workflows. + processes on other nodes in a scalable manner. Thus, support for + interactions between the MPI application and workload manager is needed to + facilitate the job launch. 3. **Shared memory**. Unprivileged containerized processes cannot access each other's memory using the newer, faster :code:`process_vm_ready(2)`. @@ -90,19 +85,17 @@ subsequent sections. Approach -------- -Build a flexible MPI container using: +Best practice is to build a flexible MPI container using: - a. **Libfabric** to manage process communcation over a robust set of network - fabrics, + a. **Libfabric** to flexibly manage process communcation over a deiverse set + of network fabrics; - b. **Parallel Process Managment** (PMI) supported by the host workload - manager to facilitate parallel process interactions between the container - application and resource manager, and + b. a **parallel process managment interface** (PMI) compatible with the host + workload manager; and - c. **MPI** implementation that supports Libfabric and host supported PMI(s). + c. a **MPI** implementation that supports (1) Libfabric and (2) a PMI + compatible with the host workload manager. -The following sections cover the elements of this approach in more detail. 
Note -that alternative workflows are covered in the alternative section. Libfabric --------- @@ -119,18 +112,43 @@ application developers with a focus on HPC needs. - https://github.com/ofiwg/libfabric/blob/main/README.md Using Libfabric, we can more easily manage MPI communcation over a diverse set -of network fabrics with built-in or loadable providers. Our Libfabric example, -:code:`examples/Docker file.libfabric`, compiles :code:`psm3`, :code:`rxm`, -:code:`shm`, :code:`tcp`, and :code:`verbs` build-in providers. It is capable -of running over most socket and verb devices using TCP, IB, OPA, and RoCE -protocols. - -Shared providers, compiled with :code:`-dl`, e.g., :code:`--with-gni-dl`, can -be compiled on a target host and later added to the container. For example, on -the Cray systems with the :code:`Gemini/Aries` network, users can build the -shared :cod:`gni` provider to be added to, and used by, the Charliecloud -container's Libfabric at runtime. The same is true for any other Libfabric -provider. +of network fabrics with built-in or loadable providers. + +The following snippet is from our Libfabric example, +:code:`examples/Dockerfile.Libfabric`. + +:: + ARG LIBFABRIC_VERSION=1.15.1 + RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 \ + https://github.com/ofiwg/libfabric/ \ + && cd libfabric \ + && ./autogen.sh \ + && ./configure --prefix=/usr/local \ + --disable-opx \ + --disable-psm2 \ + --disable-efa \ + --disable-sockets \ + --enable-psm3 \ + --enable-rxm \ + --enable-shm \ + --enable-tcp \ + --enable-verbs \ + && make -j$(getconf _NPROCESSORS_ONLN) install \ + && rm -Rf ../libfabric* + +The above compiles Libfabric with several "built-in" providers, e.g., +:code:`psm3`, :code:`rxm`, :code:`shm`, :code:`tcp`, and :code:`verbs`, which +enables MPI applications to run efficiently over most verb devices using TCP, +IB, OPA, and RoCE protocols. + +Providers can also be compiled as "shared providers", by adding :code:`-dl`, +e.g., :code:`--with-psm3-dl`. A Shared provider can be used by a Libfabric that +did not originally compile it, i.e., they can be compiled on a target host and +later added to, and used by, a container's Libfabric. For example, on the Cray +systems with the :code:`Gemini/Aries` network, users can build a shared +:code:`gni` provider that can be added to container(s) before runtime. Unlike +"MPI replacement", where the container's MPI libraries are replaced with the +hosts, a shared provider is a single library that is added. Finally, our Libfabric can also be replaced by the hosts, which is presently the only way to leverage Cray's Slingshot :code:`CXI` provider. See ch-fromhost. @@ -141,17 +159,16 @@ Parallel process management Unprivileged containers are unable to launch containerized processes on different nodes, aside from using SSH, which isn't scalable. We must either (a) rely on a host supported parallel process management interface (PMI), or -(b) achieve host/container MPI compatatbility with unsavory binary patching. -The former is recommended, the latter is described in more detail in the -deprecated section. +(b) achieve host/container MPI ABI compatatbility through unsavory practices, +including binary patching when running on a Cray. The preferred PMI implementation, e.g., :code:`PMI-1`, :code:`PMI-2`, :code:`OpenPMIx`, :code:`flux-pmi`, etc., will be that which is supported by your host workload manager and container MPI. 
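+
+For example, on a SLURM host you can list which PMI plugins the workload
+manager supports before settling on one for the container (output below is
+illustrative and varies by site and SLURM version):
+
+::
+
+  $ srun --mpi=list
+  MPI plugin types are...
+        none
+        pmi2
+        pmix
+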
-In our example, :code:`example/Dockerfile.libfabrc`, we use :code:`OpenPMIx` +In our example, :code:`example/Dockerfile.libfabrc`, we prefer :code:`OpenPMIx` because: 1) it is supported by SLURM, OpenMPI, and MPICH, 2) scales better than -PMI2, and (3) OpenMPI versions 5 and newer will only support PMIx. +PMI2, and (3) OpenMPI versions 5 and newer will no longer support PMI2. MPI --- From f5bb575a355e832e2eb4610919b56cd28e2b734a Mon Sep 17 00:00:00 2001 From: Jordan Ogas <jogas@lanl.gov> Date: Mon, 1 Apr 2024 13:03:02 -0600 Subject: [PATCH 3/9] update --- doc/faq.rst | 99 +++++++++++++++++++++++++---------------------------- 1 file changed, 46 insertions(+), 53 deletions(-) diff --git a/doc/faq.rst b/doc/faq.rst index cea9bdbdb..c0755b1e1 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -37,45 +37,39 @@ MPI issues. It is important to note that, despite marketing claims, no single container implementation has "solved" MPI, or is free of warts; the issues are numerous, multifacted, and dynamic. -Key concepts and issues are as follows. +Key concepts and related issues are listed as follows. 1. **Workload management**. Running applications on HPC clusters requires resource management and job scheduling. Put simply, resource management is the act of allocating and restricting compute resources, e.g., CPU and memory, whereas job scheduling is the act of prioritizing and enforcing - resource management. Both require privileged operations. + resource management. **Both require privileged operations.** - Some privileged container implementations attempt to provide (sell) their - own workload management, often referred to as 'container orchestration'. + Some privileged container implementations attempt to provide their own + workload management, often referred to as 'container orchestration'. - Charliecloud is completely unprivileged. We rely on existing, reputable + Charliecloud is **completely unprivileged;** we rely on existing, reputable and well established HPC workload managers, e.g., SLURM, etc. - (FIXME: snarky alternative: - - Charliecloud is completely unprivileged; we do not have the luxury of - trying to replace existing, reputable and well established HPC workload - managers.) - 2. **Job launch**. When a multinode MPI job is launched, each node must launch a number of containerized processes, i.e., ranks. Unprivileged container implementations can not launch containerized processes on other nodes in a scalable manner. Thus, support for - interactions between the MPI application and workload manager is needed to - facilitate the job launch. + interactions between the MPI application and workload manager, i.e., + Process Management Interface (PMI), is needed to facilitate the job + launch. - 3. **Shared memory**. Unprivileged containerized processes cannot access each - other's memory using the newer, faster :code:`process_vm_ready(2)`. - - Proccesess on the same node can be placed in the same namespace for plain - shared memory. See :code:`--join` in :code:`ch-run`. + 3. **Shared memory**. Unprivileged containerized processes on the same node + cannot access each other's memory using the newer, faster + :code:`process_vm_ready(2)`. They must can be placed in the same, unshared + namespace for plain shared memory. See :code:`--join` in :code:`ch-run`. 4. **Network fabric.** Containerized MPI processes must be able to recognize and use a system's high-speed interconnect. Common issues that arise are: a. 
Libraries required to use the high-speed interconnect are proprietary, - or otherwise unavailable to the container. + or otherwise unavailable to the container (Thanks Cray). b. The high-speed interconnect is not supported by the container MPI. @@ -85,7 +79,7 @@ Key concepts and issues are as follows. Approach -------- -Best practice is to build a flexible MPI container using: +Generally we recommend building a flexible MPI container using: a. **Libfabric** to flexibly manage process communcation over a deiverse set of network fabrics; @@ -96,9 +90,16 @@ Best practice is to build a flexible MPI container using: c. a **MPI** implementation that supports (1) Libfabric and (2) a PMI compatible with the host workload manager. +More experienced MPI and unprivileged container users can find success through +MPI-replacement (injection), however, such practes are beyond the scope of this +FAQ. + +The remaining sections detail the reasoning behind our approach. We recommend +referencing, or directly using, our :code:`examples/Dockerfile.libfabric` and +`examples/Dockerfile.{openmpi,mpich}` examples. -Libfabric ---------- +Use Libfabric +------------- Libfabric, also known as OpenFabric Interfaces (OFI), is a low-level communcation library that abstracts diverse networking technologies. It @@ -118,7 +119,7 @@ The following snippet is from our Libfabric example, :code:`examples/Dockerfile.Libfabric`. :: - ARG LIBFABRIC_VERSION=1.15.1 + ARG LIBFABRIC_VERSION=${OFI_VERSION} RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 \ https://github.com/ofiwg/libfabric/ \ && cd libfabric \ @@ -141,54 +142,46 @@ The above compiles Libfabric with several "built-in" providers, e.g., enables MPI applications to run efficiently over most verb devices using TCP, IB, OPA, and RoCE protocols. -Providers can also be compiled as "shared providers", by adding :code:`-dl`, -e.g., :code:`--with-psm3-dl`. A Shared provider can be used by a Libfabric that -did not originally compile it, i.e., they can be compiled on a target host and -later added to, and used by, a container's Libfabric. For example, on the Cray -systems with the :code:`Gemini/Aries` network, users can build a shared -:code:`gni` provider that can be added to container(s) before runtime. Unlike -"MPI replacement", where the container's MPI libraries are replaced with the -hosts, a shared provider is a single library that is added. +Two key advantages of using Libfabric are: (1) the container's Libfabric can +make use of "external", i.e. dynamic-shared-object (DSO) providers, and (2) +visibility to Cray's Slingshot provider, CXI, can be achieved by replacing the +container's :code:`libfabric.so` with the Cray host's. -Finally, our Libfabric can also be replaced by the hosts, which is presently the -only way to leverage Cray's Slingshot :code:`CXI` provider. See ch-fromhost. +Libfabric providers can be compiled as "shared", i.e., DSO providers, by adding +:code:`=dl`, e.g., :code:`--with-cxi=dl`, at configure time. A shared provider +can be used by a Libfabric that did not originally compile it, i.e., **they can +be compiled on a target host and later bind-mounted in, along with any missing +shared library dependencies, and used by the container's Libfabric.** -Parallel process management ---------------------------- +A container's Libfabric can also be replaced by a host Libfabric. This is an +unsavory but effective way to give containers access to the Cray Libfabric +Slingshot provider :code:`cxi`. 
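+
+As a sketch, a shared provider built on the host (here the Cray :code:`gni`
+provider; all paths are illustrative and library dependencies are omitted)
+could be bind-mounted into the image and found by the container's Libfabric
+via Libfabric's provider search path variable:
+
+::
+
+  $ export FI_PROVIDER_PATH=/usr/local/lib/libfabric
+  $ ch-run -b /opt/ofi-host/lib/libfabric:/usr/local/lib/libfabric \
+           /var/tmp/mpi-image -- fi_info -p gni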
+
+Choose a compatible parallel process manager
+--------------------------------------------
 
 Unprivileged containers are unable to launch containerized processes on
 different nodes, aside from using SSH, which isn't scalable. We must either
 (a) rely on a host supported parallel process management interface (PMI), or
 (b) achieve host/container MPI ABI compatatbility through unsavory practices,
-including binary patching when running on a Cray.
+e.g., complete container-MPI replacement.
 
 The preferred PMI implementation, e.g., :code:`PMI-1`, :code:`PMI-2`,
 :code:`OpenPMIx`, :code:`flux-pmi`, etc., will be that which is supported
 by your host workload manager and container MPI.
 
 In our example, :code:`example/Dockerfile.libfabrc`, we prefer :code:`OpenPMIx`
-because: 1) it is supported by SLURM, OpenMPI, and MPICH, 2) scales better than
-PMI2, and (3) OpenMPI versions 5 and newer will no longer support PMI2.
+because: (1) it is supported by SLURM, OpenMPI, and MPICH, (2) it is required
+for exascale, and (3) OpenMPI versions 5 and newer will no longer support PMI2.
 
-MPI
----
+Choose an MPI flavor compatible with Libfabric and your process manager
+-----------------------------------------------------------------------
 
 There are various MPI implementations, e.g., OpenMPI, MPICH, MVAPICH2,
 Intel-MPI, etc., to consider. We generally recommend OpenMPI, however, your
-MPI implementation will ultimately need to be one that supports Libfabric and
-the PMI compatible with your host workload manager.
-
-Injection
----------
-
-Alternatives
-------------
-
-Using Libfabric and PMI is not the only way to make use of proprietary or
-unsupported network fabrics. There are other, more complicated, ways to FIXME
+MPI implementation of choice will ultimately be that which supports Libfabric
+and the PMI compatible with your host workload manager.
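+
+As a sketch, an OpenMPI build inside the image might be configured against the
+container's Libfabric and an external PMIx (source path, prefixes, and flags
+are illustrative; check the options your OpenMPI version actually supports):
+
+::
+
+  RUN cd /usr/local/src/openmpi \
+      && ./configure --prefix=/usr/local \
+                     --with-ofi=/usr/local \
+                     --with-pmix=/usr/local \
+      && make -j$(getconf _NPROCESSORS_ONLN) install
+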
-Common problems ---------------- Errors ====== From 8c78162d77ce19b84df0941d9a07a92facefcab9 Mon Sep 17 00:00:00 2001 From: Jordan Ogas <jogas@lanl.gov> Date: Mon, 1 Apr 2024 13:18:49 -0600 Subject: [PATCH 4/9] bitterly comply with sphinx --- doc/faq.rst | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/doc/faq.rst b/doc/faq.rst index a370c6eb1..187c114ba 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -120,22 +120,21 @@ The following snippet is from our Libfabric example, :: ARG LIBFABRIC_VERSION=${OFI_VERSION} - RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 \ - https://github.com/ofiwg/libfabric/ \ - && cd libfabric \ - && ./autogen.sh \ - && ./configure --prefix=/usr/local \ - --disable-opx \ - --disable-psm2 \ - --disable-efa \ - --disable-sockets \ - --enable-psm3 \ - --enable-rxm \ - --enable-shm \ - --enable-tcp \ - --enable-verbs \ - && make -j$(getconf _NPROCESSORS_ONLN) install \ - && rm -Rf ../libfabric* + RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 https://github.com/ofiwg/libfabric/ \ + && cd libfabric \ + && ./autogen.sh \ + && ./configure --prefix=/usr/local \ + --disable-opx \ + --disable-psm2 \ + --disable-efa \ + --disable-sockets \ + --enable-psm3 \ + --enable-rxm \ + --enable-shm \ + --enable-tcp \ + --enable-verbs \ + && make -j$(getconf _NPROCESSORS_ONLN) install \ + && rm -Rf ../libfabric* The above compiles Libfabric with several "built-in" providers, e.g., :code:`psm3`, :code:`rxm`, :code:`shm`, :code:`tcp`, and :code:`verbs`, which From 01e77992af2ec3ec08b96284d0d6a07c7c0dab3e Mon Sep 17 00:00:00 2001 From: Reid Priedhorsky <reidpr@lanl.gov> Date: Tue, 16 Apr 2024 14:51:17 -0600 Subject: [PATCH 5/9] Revert "bitterly comply with sphinx" This reverts commit 8c78162d77ce19b84df0941d9a07a92facefcab9. 
--- doc/faq.rst | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/doc/faq.rst b/doc/faq.rst index 187c114ba..a370c6eb1 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -120,21 +120,22 @@ The following snippet is from our Libfabric example, :: ARG LIBFABRIC_VERSION=${OFI_VERSION} - RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 https://github.com/ofiwg/libfabric/ \ - && cd libfabric \ - && ./autogen.sh \ - && ./configure --prefix=/usr/local \ - --disable-opx \ - --disable-psm2 \ - --disable-efa \ - --disable-sockets \ - --enable-psm3 \ - --enable-rxm \ - --enable-shm \ - --enable-tcp \ - --enable-verbs \ - && make -j$(getconf _NPROCESSORS_ONLN) install \ - && rm -Rf ../libfabric* + RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 \ + https://github.com/ofiwg/libfabric/ \ + && cd libfabric \ + && ./autogen.sh \ + && ./configure --prefix=/usr/local \ + --disable-opx \ + --disable-psm2 \ + --disable-efa \ + --disable-sockets \ + --enable-psm3 \ + --enable-rxm \ + --enable-shm \ + --enable-tcp \ + --enable-verbs \ + && make -j$(getconf _NPROCESSORS_ONLN) install \ + && rm -Rf ../libfabric* The above compiles Libfabric with several "built-in" providers, e.g., :code:`psm3`, :code:`rxm`, :code:`shm`, :code:`tcp`, and :code:`verbs`, which From b34d6d14e7937b6aa81a6ef516a851c7678e2e78 Mon Sep 17 00:00:00 2001 From: Reid Priedhorsky <reidpr@lanl.gov> Date: Tue, 16 Apr 2024 14:52:55 -0600 Subject: [PATCH 6/9] fix code rendering --- doc/faq.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/faq.rst b/doc/faq.rst index a370c6eb1..d4ff146a1 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -119,6 +119,7 @@ The following snippet is from our Libfabric example, :code:`examples/Dockerfile.Libfabric`. :: + ARG LIBFABRIC_VERSION=${OFI_VERSION} RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 \ https://github.com/ofiwg/libfabric/ \ From 859b659280a9aa51f53c21971e20e33c4e59cd41 Mon Sep 17 00:00:00 2001 From: Reid Priedhorsky <reidpr@lanl.gov> Date: Tue, 16 Apr 2024 15:58:26 -0600 Subject: [PATCH 7/9] tidy --- doc/best_practices.rst | 150 ++++++++++++++++++++++++++++++++++++++- doc/faq.rst | 156 +---------------------------------------- 2 files changed, 149 insertions(+), 157 deletions(-) diff --git a/doc/best_practices.rst b/doc/best_practices.rst index 429cbb2bf..0a1a2969d 100644 --- a/doc/best_practices.rst +++ b/doc/best_practices.rst @@ -1,6 +1,10 @@ Best practices ************** +.. contents:: + :depth: 3 + :local: + Other best practices information ================================ @@ -303,5 +307,147 @@ building, and then run using a separate container invoked from a different terminal. -.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi nist -.. LocalWords: ident OCFS MAGICK +MPI +=== + +Problems that best practices help you avoid +------------------------------------------- + +These recommendations are derived from our experience in mitigating container +MPI issues. It is important to note that, despite marketing claims, no single +container implementation has “solved” MPI or is free of warts; the issues are +numerous, multifaceted, and dynamic. + +Key concepts and related issues include: + + 1. **Workload management**. Running applications on HPC clusters requires + resource management and job scheduling. Put simply, resource management + is the act of allocating and restricting compute resources, e.g., CPU and + memory, whereas job scheduling is the act of prioritizing and enforcing + resource management. 
*Both require privileged operations.* + + Some privileged container implementations attempt to provide their own + workload management, often referred to as “container orchestration”. + + Charliecloud is lightweight and completely unprivileged. We rely on + existing, reputable and well established HPC workload managers such as + Slurm. + + 2. **Job launch**. When a multi-node MPI job is launched, each node must + launch a number of containerized processes, i.e., *ranks*. Doing this + unprivileged and at scale requires interaction between the application + and workload manager. That is, something like Process Management + Interface (PMI) is needed to facilitate the job launch. + + 3. **Shared memory**. Processes in separate sibling containers cannot use + single-copy *cross-memory attach* (CMA), as opposed to double-copy POSIX + or SysV shared memory. The solution is to put all ranks in the *same* + container with :code:`ch-run --join`. (See above for details: + :ref:`faq_join`.) + + 4. **Network fabric.** Performant MPI jobs must recognize and use a system’s + high-speed interconnect. Common issues that arise are: + + a. Libraries required to use the interconnect are proprietary or + otherwise unavailable to the container. + + b. The interconnect is not supported by the container MPI. + + In both cases, the containerized MPI application will either fail or run + significantly slower. + +These problems can be avoided, and this section describes our recommendations +to do so. + +Recommendations TL;DR +--------------------- + +Generally, we recommend building a flexible MPI container using: + + a. **libfabric** to flexibly manage process communication over a diverse + set of network fabrics; + + b. a parallel **process management interface** (PMI) compatible with the + host workload manager; and + + c. an **MPI** that supports (1) libfabric and (2) the selected PMI. + +More experienced MPI and unprivileged container users can find success through +MPI replacement (injection); however, such practices are beyond the scope of +this FAQ. + +The remaining sections detail the reasoning behind our approach. We recommend +referencing, or directly using, our examples +:code:`examples/Dockerfile.{libfabric,mpich,openmpi}`. + +Use libfabric +------------- + +`libfabric <https://ofiwg.github.io/libfabric>`_ (a.k.a. Open Fabrics +Interfaces or OFI) is a low-level communication library that abstracts diverse +networking technologies. It defines *providers* that implement the mapping +between application-facing software (e.g., MPI) and network specific drivers, +protocols, and hardware. These providers have been co-designed with fabric +hardware and application developers with a focus on HPC needs. libfabric lets +us more easily manage MPI communication over diverse network high-speed +interconnects (a.k.a. *fabrics*). + +From our libfabric example (:code:`examples/Dockerfile.libfabric`): + +.. literalinclude:: ../examples/Dockerfile.libfabric + :language: docker + :lines: 116-135 + +The above compiles libfabric with several “built-in” providers, i.e. +:code:`psm3` (on x86-64), :code:`rxm`, :code:`shm`, :code:`tcp`, and +:code:`verbs`, which enables MPI applications to run efficiently over most +verb devices using TCP, IB, OPA, and RoCE protocols. + +Two key advantages of using libfabric are: (1) the container’s libfabric can +make use of “external” i.e. 
dynamic-shared-object (DSO) providers, and
+(2) Cray’s Slingshot provider (CXI) can be used by replacing the container
+image’s :code:`libfabric.so` with the Cray host’s.
+
+A DSO provider can be used by a libfabric that did not originally compile it,
+i.e., they can be compiled on a target host and later injected into the
+container along with any missing shared library dependencies, and used by the
+container's libfabric. To build a libfabric provider as a DSO, add :code:`=dl`
+to its :code:`configure` argument, e.g., :code:`--with-cxi=dl`.
+
+A container's libfabric can also be replaced by a host libfabric. This is a
+brittle but usually effective way to give containers access to the Cray
+libfabric Slingshot provider :code:`cxi`.
+
+In Charliecloud, both of these injection operations are currently done with
+:code:`ch-fromhost`, though see `issue #1861
+<https://github.com/hpc/charliecloud/issues/1861>`_.
+
+Choose a compatible PMI
+-----------------------
+
+Unprivileged processes, including unprivileged containerized processes, are
+unable to independently launch containerized processes on different nodes,
+aside from using SSH, which isn’t scalable. We must either (1)_rely on a host
+supported parallel process management interface (PMI), or (2)_achieve
+host/container MPI ABI compatibility through unsavory practices such as
+complete container MPI replacement.
+
+The preferred PMI implementation, e.g., PMI1, PMI2, OpenPMIx, or flux-pmi,
+will be that which is best supported by your host workload manager and
+container MPI.
+
+In :code:`examples/Dockerfile.libfabric`, we selected :code:`OpenPMIx` because
+(1) it is supported by SLURM, OpenMPI, and MPICH, (2) it is required for
+exascale, and (3) OpenMPI versions 5 and newer will no longer support PMI2.
+
+Choose an MPI compatible with your libfabric and PMI
+----------------------------------------------------
+
+There are various MPI implementations, e.g., OpenMPI, MPICH, MVAPICH2,
+Intel-MPI, etc., to consider. We generally recommend OpenMPI; however, your
+MPI implementation of choice will ultimately be that which best supports the
+libfabric and PMI most compatible with your hardware and workload manager.
+
+
+.. LocalWords: userguide Gruening Souppaya Morello Scarfone openmpi nist dl
+.. LocalWords: ident OCFS MAGICK mpich psm rxm shm DSO pmi MVAPICH
diff --git a/doc/faq.rst b/doc/faq.rst
index d4ff146a1..83ba73e8e 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -29,160 +29,6 @@ How large is Charliecloud?
 .. include:: _loc.rst
 
-MPI best practices
-==================
-
-Best practices on offer are derived from our experience in midigating container
-MPI issues. It is important to note that, despite marketing claims, no single
-container implementation has "solved" MPI, or is free of warts; the issues are
-numerous, multifacted, and dynamic.
-
-Key concepts and related issues are listed as follows.
-
- 1. **Workload management**. Running applications on HPC clusters requires
-    resource management and job scheduling. Put simply, resource management
-    is the act of allocating and restricting compute resources, e.g., CPU and
-    memory, whereas job scheduling is the act of prioritizing and enforcing
-    resource management. **Both require privileged operations.**
-
-    Some privileged container implementations attempt to provide their own
-    workload management, often referred to as 'container orchestration'.
- - Charliecloud is **completely unprivileged;** we rely on existing, reputable - and well established HPC workload managers, e.g., SLURM, etc. - - 2. **Job launch**. When a multinode MPI job is launched, each node must launch - a number of containerized processes, i.e., ranks. - - Unprivileged container implementations can not launch containerized - processes on other nodes in a scalable manner. Thus, support for - interactions between the MPI application and workload manager, i.e., - Process Management Interface (PMI), is needed to facilitate the job - launch. - - 3. **Shared memory**. Unprivileged containerized processes on the same node - cannot access each other's memory using the newer, faster - :code:`process_vm_ready(2)`. They must can be placed in the same, unshared - namespace for plain shared memory. See :code:`--join` in :code:`ch-run`. - - 4. **Network fabric.** Containerized MPI processes must be able to recognize - and use a system's high-speed interconnect. Common issues that arise are: - - a. Libraries required to use the high-speed interconnect are proprietary, - or otherwise unavailable to the container (Thanks Cray). - - b. The high-speed interconnect is not supported by the container MPI. - - In both cases, the containerized MPI application will run significantly - slower, or worse, fail to run altogether. - -Approach --------- - -Generally we recommend building a flexible MPI container using: - - a. **Libfabric** to flexibly manage process communcation over a deiverse set - of network fabrics; - - b. a **parallel process managment interface** (PMI) compatible with the host - workload manager; and - - c. a **MPI** implementation that supports (1) Libfabric and (2) a PMI - compatible with the host workload manager. - -More experienced MPI and unprivileged container users can find success through -MPI-replacement (injection), however, such practes are beyond the scope of this -FAQ. - -The remaining sections detail the reasoning behind our approach. We recommend -referencing, or directly using, our :code:`examples/Dockerfile.libfabric` and -`examples/Dockerfile.{openmpi,mpich}` examples. - -Use Libfabric -------------- - -Libfabric, also known as OpenFabric Interfaces (OFI), is a low-level -communcation library that abstracts diverse networking technologies. It -defines interfaces, called **providers**, that implement the semantic mapping -between application facing software and network specific protocols, drivers, -and hardware. These providers have been co-designed with fabric hardware and -application developers with a focus on HPC needs. - - - https://ofiwg.github.io/libfabric - - https://ofiwg.github.io/libfabric/v1.14.0/man/fi_provider.3.html - - https://github.com/ofiwg/libfabric/blob/main/README.md - -Using Libfabric, we can more easily manage MPI communcation over a diverse set -of network fabrics with built-in or loadable providers. - -The following snippet is from our Libfabric example, -:code:`examples/Dockerfile.Libfabric`. 
- -:: - - ARG LIBFABRIC_VERSION=${OFI_VERSION} - RUN git clone --branch v${LIBFABRIC_VERSION} --depth 1 \ - https://github.com/ofiwg/libfabric/ \ - && cd libfabric \ - && ./autogen.sh \ - && ./configure --prefix=/usr/local \ - --disable-opx \ - --disable-psm2 \ - --disable-efa \ - --disable-sockets \ - --enable-psm3 \ - --enable-rxm \ - --enable-shm \ - --enable-tcp \ - --enable-verbs \ - && make -j$(getconf _NPROCESSORS_ONLN) install \ - && rm -Rf ../libfabric* - -The above compiles Libfabric with several "built-in" providers, e.g., -:code:`psm3`, :code:`rxm`, :code:`shm`, :code:`tcp`, and :code:`verbs`, which -enables MPI applications to run efficiently over most verb devices using TCP, -IB, OPA, and RoCE protocols. - -Two key advantages of using Libfabric are: (1) the container's Libfabric can -make use of "external", i.e. dynamic-shared-object (DSO) providers, and (2) -visibility to Cray's Slingshot provider, CXI, can be achieved by replacing the -container's :code:`libfabric.so` with the Cray host's. - -Libfabric providers can be compiled as "shared", i.e., DSO providers, by adding -:code:`=dl`, e.g., :code:`--with-cxi=dl`, at configure time. A shared provider -can be used by a Libfabric that did not originally compile it, i.e., **they can -be compiled on a target host and later bind-mounted in, along with any missing -shared library dependencies, and used by the container's Libfabric.** - -A container's Libfabric can also be replaced by a host Libfabric. This is an -unsavory but effective way to give containers access to the Cray Libfabric -Slingshot provider :code:`cxi`. - -Choose a compatible parallel process manager --------------------------------------------- - -Unprivileged containers are unable to launch containerized processes on -different nodes, aside from using SSH, which isn't scalable. We must either -(a) rely on a host supported parallel process management interface (PMI), or -(b) achieve host/container MPI ABI compatatbility through unsavory practices, -e.g., complete container-MPI replacement. - -The preferred PMI implementation, e.g., :code:`PMI-1`, :code:`PMI-2`, -:code:`OpenPMIx`, :code:`flux-pmi`, etc., will be that which is supported -by your host workload manager and container MPI. - -In our example, :code:`example/Dockerfile.libfabrc`, we prefer :code:`OpenPMIx` -because: 1) it is supported by SLURM, OpenMPI, and MPICH, 2) is required for -exascale, and (3) OpenMPI versions 5 and newer will no longer support PMI2. - -Choose a MPI flavor compatible with Libfabric and your process manager ----------------------------------------------------------------------- - -There are various MPI implementations, e.g., OpenMPI, MPICH, MVAPICH2, -Intel-MPI, etc., to consider. We generally recommend OpenMPI, however, your -MPI implementation of choicse will ultimately be that which supports Libfabric -and the PMI compatible with your host workload manager. - Errors ====== @@ -1518,4 +1364,4 @@ conversion. Important caveats include: .. LocalWords: CAs SY Gutmann AUTH rHsFFqwwqh MrieaQ Za loc mpihello mvo du .. LocalWords: VirtualSize linuxcontainers jour uk lxd rwxr xr qq qqq drwxr -.. LocalWords: drwx +.. 
LocalWords: drwx mpich From b69464ae93b82f7a37ef73fe29294208836cd1ed Mon Sep 17 00:00:00 2001 From: Reid Priedhorsky <reidpr@lanl.gov> Date: Wed, 17 Apr 2024 13:46:04 -0600 Subject: [PATCH 8/9] fix AlmaLinux URL (curl does not follow redirects by default) --- test/build/60_force.bats | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/build/60_force.bats b/test/build/60_force.bats index 2205efb89..81595d0c8 100644 --- a/test/build/60_force.bats +++ b/test/build/60_force.bats @@ -67,7 +67,7 @@ EOF ch-image -v build --force -t tmpimg -f - . <<'EOF' FROM almalinux:8 -RUN curl -sO https://repo.almalinux.org/vault/8.6/BaseOS/x86_64/os/Packages/openssh-8.0p1-13.el8.x86_64.rpm +RUN curl -sSOL https://vault.almalinux.org/8.6/BaseOS/x86_64/os/Packages/openssh-8.0p1-13.el8.x86_64.rpm RUN rpm --install *.rpm EOF } From 201c1bf7a4da21d007afbc918c4a1d558dbd64c3 Mon Sep 17 00:00:00 2001 From: Reid Priedhorsky <reidpr@lanl.gov> Date: Wed, 17 Apr 2024 13:52:23 -0600 Subject: [PATCH 9/9] incorporate feedback --- doc/best_practices.rst | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/doc/best_practices.rst b/doc/best_practices.rst index 0a1a2969d..57cb8fb61 100644 --- a/doc/best_practices.rst +++ b/doc/best_practices.rst @@ -367,8 +367,8 @@ Generally, we recommend building a flexible MPI container using: a. **libfabric** to flexibly manage process communication over a diverse set of network fabrics; - b. a parallel **process management interface** (PMI) compatible with the - host workload manager; and + b. a parallel **process management interface** (PMI), compatible with the + host workload manager (e.g., PMI2, PMIx, flux-pmi); and c. an **MPI** that supports (1) libfabric and (2) the selected PMI. @@ -405,8 +405,10 @@ verb devices using TCP, IB, OPA, and RoCE protocols. Two key advantages of using libfabric are: (1) the container’s libfabric can make use of “external” i.e. dynamic-shared-object (DSO) providers, and -(2) Cray’s Slingshot provider (CXI) can be used by replacing the container -image’s :code:`libfabric.so` with the Cray host’s. +(2) libfabric replacement is simpler than MPI replacement and preserves the +original container MPI. That is, managing host/container ABI compatibility is +difficult and error-prone, so we instead manage the more forgiving libfabric +ABI compatibility. A DSO provider can be used by a libfabric that did not originally compile it, i.e., they can be compiled on a target host and later injected into the @@ -427,8 +429,8 @@ Choose a compatible PMI Unprivileged processes, including unprivileged containerized processes, are unable to independently launch containerized processes on different nodes, -aside from using SSH, which isn’t scalable. We must either (1)_rely on a host -supported parallel process management interface (PMI), or (2)_achieve +aside from using SSH, which isn’t scalable. We must either (1) rely on a host +supported parallel process management interface (PMI), or (2) achieve host/container MPI ABI compatibility through unsavory practices such as complete container MPI replacement.