Signed-off-by: cgoveas <[email protected]>
Showing 6 changed files with 219 additions and 2 deletions.

62 changes: 62 additions & 0 deletions
docs/source/InstallationGuides/Benchmarks/AutomatingOneAPI.rst

Automating oneAPI installation on Intel processors for MPI jobs
---------------------------------------------------------------

This topic explains how to automatically update servers for MPI jobs. To install oneAPI manually, `click here <OneAPI.html>`_.

**Pre-requisites**

* ``provision.yml`` has been executed.
* An Omnia **slurm** cluster has been set up by ``omnia.yml`` and is running with at least 2 nodes: 1 manager and 1 compute.
* Verify that the target nodes are in the ``booted`` state. For more information, `click here <../InstallingProvisionTool/ViewingDB.html>`_.

**To run the playbook**::

    cd benchmarks
    ansible-playbook intel_benchmark.yml -i inventory

**To execute multi-node jobs**

* Make sure an NFS share is available on each node.
* Copy the slurm script to the NFS share and execute it from there.
* Load all the necessary modules using ``module load``: ::

    module load mpi
    module load pmi/pmix-x86_64
    module load mkl

* To run the commands or batch script over TCP instead of over Infiniband ports, include the line below (a quick provider check is sketched after it): ::

    export FI_PROVIDER=tcp
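
  As a sanity check, the ``fi_info`` utility that ships with libfabric can list the fabric providers visible to MPI; this is an illustrative step, assuming ``fi_info`` is on the ``PATH``: ::

    # List available libfabric providers; 'tcp' should appear in the output
    fi_info -l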

Job execution can now be initiated.

.. note:: Ensure ``runme_intel64_dynamic`` is downloaded before running this command.

::

    srun -N 2 /mnt/nfs_shares/appshare/mkl/2023.0.0/benchmarks/mp_linpack/runme_intel64_dynamic

For a batch job using the same parameters, the script would be: ::

    #!/bin/bash
    #SBATCH --job-name=testMPI
    #SBATCH --output=output.txt
    #SBATCH --partition=normal
    #SBATCH --nodelist=node00004.omnia.test,node00005.omnia.test

    pwd; hostname; date
    export FI_PROVIDER=tcp
    module load pmi/pmix-x86_64
    module use /opt/intel/oneapi/modulefiles
    module load mkl
    module load mpi

    srun /mnt/appshare/benchmarks/mp_linpack/runme_intel64_dynamic
    date
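
The script can then be submitted through standard slurm commands. A minimal sketch, assuming the script above was saved to the NFS share as ``mpi_job.sh`` (a hypothetical filename): ::

    sbatch mpi_job.sh    # submit the batch script
    squeue               # check queued and running jobs
    cat output.txt       # inspect results once the job completes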

98 changes: 98 additions & 0 deletions
docs/source/InstallationGuides/Benchmarks/AutomatingOpenMPI.rst

Installing pmix and updating slurm configuration for AMD processors
--------------------------------------------------------------------

This topic explains how to automatically update AMD servers for MPI jobs. To manually install pmix and update the slurm configuration, `click here <OpenMPI_AOCC.html>`_.

**Pre-requisites**

* ``provision.yml`` has been executed.
* An Omnia **slurm** cluster has been set up by ``omnia.yml`` and is running with at least 2 nodes: 1 manager and 1 compute.
* Verify that the target nodes are in the ``booted`` state. For more information, `click here <../InstallingProvisionTool/ViewingDB.html>`_.

**To run the playbook**::

    cd benchmarks
    ansible-playbook amd_benchmark.yml -i inventory

**To execute multi-node jobs**

* OpenMPI and ``aocc-compiler-*.tar`` should be installed and compiled with slurm on all cluster nodes, or be available on the NFS share.

  .. note:: Omnia currently supports pmix version 2 (``pmix_v2``).

* While compiling OpenMPI, include ``pmix``, ``slurm``, ``hwloc``, and ``libevent``, as shown in the sample command below: ::

    ./configure --prefix=/home/omnia-share/openmpi-4.1.5 --enable-mpi1-compatibility --enable-orterun-prefix-by-default --with-slurm=/usr --with-pmix=/usr --with-libevent=/usr --with-hwloc=/usr --with-ucx CC=clang CXX=clang++ FC=flang 2>&1 | tee config.out
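
  Once configuration completes, the usual OpenMPI build and install steps follow; a hedged sketch, assuming the same source tree and prefix as above: ::

    # Build using all available cores, logging output as with config.out
    make -j $(nproc) 2>&1 | tee make.out
    # Install into the NFS-shared prefix so every node can use it
    make install 2>&1 | tee install.out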

* For a job to run on multiple nodes (10.5.0.4 and 10.5.0.5) where OpenMPI is compiled and installed on the NFS share (``/home/omnia-share/openmpi/bin/mpirun``), the job can be initiated as below:

.. note:: Ensure ``amd-zen-hpl-2023_07_18`` is downloaded before running this command.

::

    srun -N 2 --mpi=pmix_v2 -n 2 ./amd-zen-hpl-2023_07_18/xhpl

For a batch job using the same parameters, the script would be: ::

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.log
    #SBATCH --partition=normal
    #SBATCH -N 3
    #SBATCH --time=10:00
    #SBATCH --ntasks=2

    source /home/omnia-share/setenv_AOCC.sh
    export PATH=$PATH:/home/omnia-share/openmpi/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

    srun --mpi=pmix_v2 ./amd-zen-hpl-2023_07_18/xhpl

Alternatively, to use ``mpirun``, the script would be: ::

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.log
    #SBATCH --partition=normal
    #SBATCH -N 3
    #SBATCH --time=10:00
    #SBATCH --ntasks=2

    source /home/omnia-share/setenv_AOCC.sh
    export PATH=$PATH:/home/omnia-share/openmpi/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

    /home/omnia-share/openmpi/bin/mpirun --map-by ppr:1:node -np 2 --display-map --oversubscribe --mca orte_keep_fqdn_hostnames 1 ./xhpl

.. note:: The above scripts are samples that can be modified as required. Ensure that ``--mca orte_keep_fqdn_hostnames 1`` is included in the ``mpirun`` command in sbatch scripts; Omnia maintains all hostnames in FQDN format, and omitting this option may cause job initiation to fail.
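
As with the Intel flow, either script can be submitted through slurm. A brief illustration, assuming the script was saved to the NFS share as ``hpl_job.sh`` (a hypothetical filename): ::

    sbatch hpl_job.sh    # submit the batch script
    tail -f test.log     # follow the output file declared via --output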

51 changes: 51 additions & 0 deletions
docs/source/InstallationGuides/Benchmarks/hpcsoftwarestack.rst

Containerized HPC benchmark execution
--------------------------------------

Use this playbook to download docker images and pull them onto cluster nodes as sif files using `apptainer <https://apptainer.org/docs/user/main/index.html>`_.

1. Ensure that the cluster has been `provisioned by the provision tool <../../InstallationGuides/InstallingProvisionTool/index.html>`_ and that the `cluster has been set up using omnia.yml <../../InstallationGuides/BuildingClusters/index.html>`_.

2. Enter the following variables in ``utils/hpc_apptainer_job_execution/hpc_apptainer_job_execution_config.yml``:

+-------------------------+----------------------------------------------------------------------------------------------------------------+
| Parameter               | Details                                                                                                        |
+=========================+================================================================================================================+
| **hpc_apptainer_image** | * Docker image details to be downloaded onto cluster nodes using apptainer to create a sif file.               |
| ``JSON list``           |                                                                                                                |
| Required                | * Example (for single image): ::                                                                               |
|                         |                                                                                                                |
|                         |     hpc_apptainer_image:                                                                                       |
|                         |       - { image_url: "docker.io/intel/oneapi-hpckit:latest" }                                                  |
|                         |                                                                                                                |
|                         | * Example (for multiple images): ::                                                                            |
|                         |                                                                                                                |
|                         |     hpc_apptainer_image:                                                                                       |
|                         |       - { image_url: "docker.io/intel/oneapi-hpckit:latest" }                                                  |
|                         |       - { image_url: "docker.io/tensorflow/tensorflow:latest" }                                                |
|                         |                                                                                                                |
|                         | * If docker credentials are provided in ``omnia_config.yml``, they will be used to download the docker images. |
+-------------------------+----------------------------------------------------------------------------------------------------------------+
| **hpc_apptainer_path**  | * Directory path for storing apptainer sif files on cluster nodes.                                             |
|                         |                                                                                                                |
| ``string``              | * It is recommended to use a directory inside a shared path that is accessible to all cluster nodes.           |
|                         |                                                                                                                |
| Required                | * **Default value:** ``"/home/omnia-share/softwares/apptainer"``                                               |
+-------------------------+----------------------------------------------------------------------------------------------------------------+
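
For reference, a complete sample of ``hpc_apptainer_job_execution_config.yml`` might look like the following; the values are illustrative, reusing the image URLs and default path from the table above: ::

    # Images to pull onto cluster nodes as apptainer sif files
    hpc_apptainer_image:
      - { image_url: "docker.io/intel/oneapi-hpckit:latest" }
      - { image_url: "docker.io/tensorflow/tensorflow:latest" }

    # Shared location where the generated sif files are stored
    hpc_apptainer_path: "/home/omnia-share/softwares/apptainer"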

To run the playbook: ::

    cd utils/hpc_apptainer_job_execution
    ansible-playbook hpc_apptainer_job_execution.yml -i inventory

.. note:: Use the inventory file format specified under `Sample Files <../../samplefiles.html>`_.

HPC apptainer jobs can be initiated on a slurm cluster using the following sample command: ::

    srun -N 3 --mpi=pmi2 --ntasks=4 apptainer run /home/omnia-share/softwares/apptainer/oneapi-hpckit_latest.sif hostname
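
To inspect a downloaded image interactively, apptainer's standard ``shell`` subcommand can be used on any node that can reach the sif file; a hedged example, reusing the same image as above: ::

    apptainer shell /home/omnia-share/softwares/apptainer/oneapi-hpckit_latest.sif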