Updating documentation
Signed-off-by: cgoveas <[email protected]>
cgoveas committed Oct 17, 2023
1 parent dc0f467 commit 86f1ab5
Showing 6 changed files with 219 additions and 2 deletions.
62 changes: 62 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/AutomatingOneAPI.rst
@@ -0,0 +1,62 @@
Automate oneAPI installation on Intel processors for MPI jobs
------------------------------------------------------------------

This topic explains how to automatically install oneAPI on Intel servers for MPI jobs. To manually install oneAPI, `click here. <OneAPI.html>`_

**Prerequisites**

* ``provision.yml`` has been executed.
* An Omnia **slurm** cluster has been set up by running ``omnia.yml``, with at least 2 nodes: 1 manager and 1 compute.
* Verify that the target nodes are in the ``booted`` state. For more information, `click here <../InstallingProvisionTool/ViewingDB.html>`_.

**To run the playbook**::


    cd benchmarks
    ansible-playbook intel_benchmark.yml -i inventory
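
A hypothetical inventory layout, assuming the manager/compute grouping described in the prerequisites (see `Sample Files <../../samplefiles.html>`_ for the authoritative format): ::

    [manager]
    10.5.0.4

    [compute]
    10.5.0.5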


**To execute multi-node jobs**

* Ensure that NFS shares are available on each node.
* Copy the slurm script to the NFS share and execute it from there.
* Load all the necessary modules using ``module load``: ::

    module load mpi
    module load pmi/pmix-x86_64
    module load mkl

* If the commands or batch script are to be run over TCP instead of InfiniBand ports, include the line below: ::

    export FI_PROVIDER=tcp


Job execution can now be initiated.

.. note:: Ensure ``runme_intel64_dynamic`` is downloaded before running this command.

::

    srun -N 2 /mnt/nfs_shares/appshare/mkl/2023.0.0/benchmarks/mp_linpack/runme_intel64_dynamic


For a batch job using the same parameters, the script would be: ::


    #!/bin/bash
    #SBATCH --job-name=testMPI
    #SBATCH --output=output.txt
    #SBATCH --partition=normal
    #SBATCH --nodelist=node00004.omnia.test,node00005.omnia.test

    pwd; hostname; date
    export FI_PROVIDER=tcp
    module load pmi/pmix-x86_64
    module use /opt/intel/oneapi/modulefiles
    module load mkl
    module load mpi

    srun /mnt/appshare/benchmarks/mp_linpack/runme_intel64_dynamic
    date
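
The batch script can then be submitted and monitored with standard slurm commands; a minimal sketch, assuming the script above is saved as ``mpi_job.sh`` (a hypothetical name) on the NFS share: ::

    cd /mnt/nfs_shares/appshare
    sbatch mpi_job.sh        # prints the assigned job ID
    squeue -u $USER          # confirm the job is pending or running
    cat output.txt           # inspect results once the job completes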


98 changes: 98 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/AutomatingOpenMPI.rst
@@ -0,0 +1,98 @@
Installing pmix and updating slurm configuration for AMD processors
--------------------------------------------------------------------

This topic explains how to automatically install pmix and update the slurm configuration on AMD servers for MPI jobs. To do this manually, `click here. <OpenMPI_AOCC.html>`_

**Prerequisites**

* ``provision.yml`` has been executed.
* An Omnia **slurm** cluster has been set up by running ``omnia.yml``, with at least 2 nodes: 1 manager and 1 compute.
* Verify that the target nodes are in the ``booted`` state. For more information, `click here <../InstallingProvisionTool/ViewingDB.html>`_.

**To run the playbook**::

    cd benchmarks
    ansible-playbook amd_benchmark.yml -i inventory

**To execute multi-node jobs**

* OpenMPI and ``aocc-compiler-*.tar`` should be compiled with slurm and installed on all cluster nodes, or be available on the NFS share.

.. note:: Omnia currently supports ``pmix version2`` (``pmix_v2``).

* While compiling OpenMPI, include ``pmix``, ``slurm``, ``hwloc``, and ``libevent`` as shown in the sample command below: ::

    ./configure --prefix=/home/omnia-share/openmpi-4.1.5 --enable-mpi1-compatibility --enable-orterun-prefix-by-default --with-slurm=/usr --with-pmix=/usr --with-libevent=/usr --with-hwloc=/usr --with-ucx CC=clang CXX=clang++ FC=flang 2>&1 | tee config.out
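
    # After the configuration completes, a typical build and install might look like the
    # following (a sketch, not from the guide above; adjust -j to the build host):
    make -j $(nproc) 2>&1 | tee make.out
    make install 2>&1 | tee install.out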



* To run a job on multiple nodes (10.5.0.4 and 10.5.0.5) where OpenMPI is compiled and installed on the NFS share (``/home/omnia-share/openmpi/bin/mpirun``), initiate the job as shown below:

.. note:: Ensure ``amd-zen-hpl-2023_07_18`` is downloaded before running this command.

::

    srun -N 2 --mpi=pmix_v2 -n 2 ./amd-zen-hpl-2023_07_18/xhpl
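
If job initiation fails, the MPI plugin types known to slurm can be listed to confirm that ``pmix_v2`` is available (a quick check, not part of the original steps): ::

    srun --mpi=list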


For a batch job using the same parameters, the script would be: ::


    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.log
    #SBATCH --partition=normal
    #SBATCH -N 3
    #SBATCH --time=10:00
    #SBATCH --ntasks=2

    source /home/omnia-share/setenv_AOCC.sh
    export PATH=$PATH:/home/omnia-share/openmpi/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

    srun --mpi=pmix_v2 ./amd-zen-hpl-2023_07_18/xhpl


Alternatively, to use ``mpirun``, the script would be: ::

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.log
    #SBATCH --partition=normal
    #SBATCH -N 3
    #SBATCH --time=10:00
    #SBATCH --ntasks=2

    source /home/omnia-share/setenv_AOCC.sh
    export PATH=$PATH:/home/omnia-share/openmpi/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

    /home/omnia-share/openmpi/bin/mpirun --map-by ppr:1:node -np 2 --display-map --oversubscribe --mca orte_keep_fqdn_hostnames 1 ./xhpl



.. note:: The above scripts are samples that can be modified as required. Ensure that ``--mca orte_keep_fqdn_hostnames 1`` is included in the mpirun command in sbatch scripts. Omnia maintains all hostnames in FQDN format. Failing to include ``--mca orte_keep_fqdn_hostnames 1`` may cause job initiation to fail.

2 changes: 2 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/OneAPI.rst
@@ -1,6 +1,8 @@
Install oneAPI for MPI jobs on Intel processors
________________________________________________

This topic explains how to manually install oneAPI for MPI jobs. To install oneAPI automatically, `click here. <AutomatingOneAPI.html>`_

**Pre-requisites**

* An Omnia **slurm** cluster running with at least 2 nodes: 1 manager and 1 compute.
5 changes: 3 additions & 2 deletions docs/source/InstallationGuides/Benchmarks/OpenMPI_AOCC.rst
@@ -1,5 +1,6 @@
Open MPI AOCC HPL benchmark for AMD processors
----------------------------------------------
This topic explains how to manually update AMD servers for MPI jobs. To automatically install pmix and configure slurm, `click here. <AutomatingOpenMPI.html>`_

**Prerequisites**

@@ -26,7 +27,7 @@ Open MPI AOCC HPL benchmark for AMD processors

ii. Push the packages to the cluster nodes:

a. Update the ``package_list`` variable in the ``os_package_update/os_package_update.conf`` file and save it. ::
a. Update the ``package_list`` variable in the ``utils/os_package_update/package_update_config.yml`` file and save it. ::

package_list: "/install/post/otherpkgs/<os_version>/x86_64/custom_software/openmpi.pkglist"

@@ -62,7 +63,7 @@ Open MPI AOCC HPL benchmark for AMD processors
systemctl stop slurmctld.service
systemctl start slurmctld.service

4. Job execution can now be initiated. To initiate a job use the following sample commands.
4. Job execution can now be initiated.

For a job to run on multiple nodes (10.5.0.4 and 10.5.0.5) where OpenMPI is compiled and installed on the NFS share (``/home/omnia-share/openmpi/bin/mpirun``), the job can be initiated as below:
.. note:: Ensure ``amd-zen-hpl-2023_07_18`` is downloaded before running this command.
51 changes: 51 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/hpcsoftwarestack.rst
@@ -0,0 +1,51 @@
Containerized HPC benchmark execution
--------------------------------------

Use this playbook to download docker images and pull them onto cluster nodes as sif files using `apptainer <https://apptainer.org/docs/user/main/index.html/>`_.

1. Ensure that the cluster has been `provisioned by the provision tool. <../../InstallationGuides/InstallingProvisionTool/index.html>`_ and the `cluster has been set up using omnia.yml. <../../InstallationGuides/BuildingClusters/index.html>`_

2. Enter the following variables in ``utils/hpc_apptainer_job_execution/hpc_apptainer_job_execution_config.yml``:

+-------------------------+--------------------------------------------------------------------------------------------------------------+
| Parameter               | Details                                                                                                      |
+=========================+==============================================================================================================+
| **hpc_apptainer_image** | * Docker image details to be downloaded onto cluster nodes using apptainer to create a sif file.            |
| ``JSON list``           |                                                                                                              |
| Required                | * Example (for a single image): ::                                                                           |
|                         |                                                                                                              |
|                         |     hpc_apptainer_image:                                                                                     |
|                         |       - { image_url: "docker.io/intel/oneapi-hpckit:latest" }                                                |
|                         |                                                                                                              |
|                         | * Example (for multiple images): ::                                                                          |
|                         |                                                                                                              |
|                         |     hpc_apptainer_image:                                                                                     |
|                         |       - { image_url: "docker.io/intel/oneapi-hpckit:latest" }                                                |
|                         |       - { image_url: "docker.io/tensorflow/tensorflow:latest" }                                              |
|                         |                                                                                                              |
|                         | * If docker credentials are provided in ``omnia_config.yml``, they are used to download the docker images.  |
|                         |                                                                                                              |
+-------------------------+--------------------------------------------------------------------------------------------------------------+
| **hpc_apptainer_path**  | * Directory path for storing apptainer sif files on cluster nodes.                                          |
|                         |                                                                                                              |
| ``string``              | * It is recommended to use a directory inside a shared path that is accessible to all cluster nodes.        |
|                         |                                                                                                              |
| Required                | * **Default value:** ``"/home/omnia-share/softwares/apptainer"``                                             |
+-------------------------+--------------------------------------------------------------------------------------------------------------+
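
Based on the parameters above, a filled-in ``hpc_apptainer_job_execution_config.yml`` might look like the following sketch (it uses the single-image example and the documented default for ``hpc_apptainer_path``): ::

    hpc_apptainer_image:
      - { image_url: "docker.io/intel/oneapi-hpckit:latest" }
    hpc_apptainer_path: "/home/omnia-share/softwares/apptainer"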

To run the playbook: ::

    cd utils/hpc_apptainer_job_execution

    ansible-playbook hpc_apptainer_job_execution.yml -i inventory

.. note:: Use the inventory file format specified under `Sample Files. <../../samplefiles.html>`_

HPC apptainer jobs can be initiated on a slurm cluster using the following sample command: ::

    srun -N 3 --mpi=pmi2 --ntasks=4 apptainer run /home/omnia-share/softwares/apptainer/oneapi-hpckit_latest.sif hostname
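
To verify that an image has been pulled correctly on a node, the generated sif file can be inspected or run locally first; a quick check, assuming the default ``hpc_apptainer_path``: ::

    apptainer inspect /home/omnia-share/softwares/apptainer/oneapi-hpckit_latest.sif
    apptainer run /home/omnia-share/softwares/apptainer/oneapi-hpckit_latest.sif hostname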

3 changes: 3 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/index.rst
@@ -3,4 +3,7 @@ Running HPC benchmarks on omnia clusters

.. toctree::
OneAPI
AutomatingOneAPI
OpenMPI_AOCC
AutomatingOpenMPI
hpcsoftwarestack
