The ice-wedge polygons dataset is very large, so we use `ray` to run the workflow in parallel on the Delta server, one of our High Performance Computing platforms. Because Delta is not an NCEAS server, it is important to pay attention to how many Delta credits are being used, as each allocation's funding is finite. To get access to Delta, Juliet can help make you an account.

To follow along, use the scripts within the `PermafrostDiscoveryGateway/viz-workflow` repo.
- `ssh` into the Delta server in VScode.
- Pull updates from the `main` branch for each of the 4 PDG repositories.
- Create a virtual environment (recommended for the root of the environment to be in `/scratch/{username}`) with `python=3.9`, and install local versions of the 3 PDG packages (using `pip install -e {LOCAL PACKAGE}`). Within `setup.py` for `viz-raster`, remember to comment out the GitHub install of `viz-staging` to ensure the local version is installed. Also install the packages `ray`, `glances`, and `pyfastcopy`. A minimal setup sketch is shown below.
  - Changes integrated with pulls, or saved manual changes, will automatically be picked up by that virtual environment if the local package was installed with `-e` (which stands for "editable") and you are running a script from the terminal.
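  A minimal sketch of that setup, assuming the repo clones live in your home directory; the exact environment path and the third package name are placeholders:

  ```bash
  # create and activate a virtual environment rooted on /scratch
  python3.9 -m venv /scratch/$USER/viz_env
  source /scratch/$USER/viz_env/bin/activate

  # editable installs of the local PDG packages (adjust paths to your clones);
  # first comment out the GitHub viz-staging install in viz-raster's setup.py
  pip install -e ~/viz-staging
  pip install -e ~/viz-raster
  pip install -e ~/viz-workflow   # placeholder: whichever third PDG package you use locally

  # extra packages used throughout this workflow
  pip install ray glances pyfastcopy
  ```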
- Prepare one of the two `slurm` scripts to claim the worker nodes on which we will launch a job. (A sketch of the relevant settings is included at the end of this step.)
  - By default, you are logged into a login node (`login01`, `02`, or `03`), which should not be the node that executes the bulk of the computation (you actually cannot be charged for work on this node).
  - Run `squeue | grep {USERNAME}` (a `slurm` command) to display info about the jobs in the scheduling queue and which nodes each is running on. This list will be empty if you haven't launched any jobs yet.
    - Important note: do not type your full username, just the first half, because the query will not work if your username is too long. For example, for the username `julietcohen`, run `squeue | grep juliet` to see all jobs.
  - Open the appropriate script that will soon be run to claim the nodes: either `viz-workflow/slurm/BEST-v2_gpu_ray.slurm` if you're using GPU, or `BEST_cpu_ray_double_srun.slurm` for CPU.
  - Note: lines at the top of this `slurm` script that start with `#SBATCH` are not regular comments; they are settings. Other lines in the script that start with just `#` are regular comments.
  - Change the line `#SBATCH --nodes={NUMBER}`, which represents the number of nodes that will process the IWP data.
    - For the IWP workflow run in May 2023, 20 nodes were used to stage 17,039 input shapefiles. This was to avoid out-of-memory errors and to process the data within 24 hours (increasing the number of nodes decreases the execution time). The data processing finished with just a few minutes to spare, so make sure you allocate enough nodes. Rsyncing the files took about an hour, and the first `rsync` was initiated when just 80% of the tiles were complete, to ensure that at least that many files were stored safely off `/tmp`. Subsequent `rsync` runs were faster because there were fewer files to write or re-write.
    - A future to-do is to set up a special `rsync` script that continuously checks for new geotiff files in the `/tmp` folder and syncs them to `/scratch`, without the need to manually run an `rsync` script at intervals.
  - Change the line `#SBATCH --time=48:00:00`, which represents the total amount of time a job is allowed to run (and charge credits based on minutes and cores) before it is cancelled. Set the full 48 hours if doing a full IWP run. If doing a test run, decrease this.
    - To calculate the number of CPU hours that will be charged: multiply the number of cores in each node (128 CPU cores per node) by the number of nodes and the number of hours. So if you're using 2 CPU nodes for 3 hours: 128 cores x 3 hours x 2 nodes = 768 CPU hours will be charged.
    - To calculate the number of GPU hours that will be charged: multiply the number of nodes by the number of minutes by 0.05.
  - Change the line `#SBATCH --account={ACCOUNT NAME}` and enter the account name for the appropriate allocation. Note that our allocations come with GPU and CPU counterparts that are billed separately. Use CPU when ample hours are available; a downside is that CPU nodes have a smaller `/tmp` local SSD (about half as much as GPU `/tmp`). If we need more space on `/tmp` than normal, or are running low on CPU compute hours, use GPU. Note that we do not store these private account names on GitHub, so pay attention to this line when you are pushing.
  - Find the `# global settings` section and change `conda activate {ENVIRONMENT}` or `source path/to/{ENVIRONMENT}/bin/activate` by entering your virtual environment for this workflow.
    - This part will be hard-coded once we make this workflow into a package with a pre-configured environment in the `.toml` file.
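  A minimal sketch of the `#SBATCH` settings discussed above, with placeholder values (the actual scripts in `viz-workflow/slurm` contain additional lines that start the `ray` cluster and should not need editing):

  ```bash
  #!/bin/bash
  #SBATCH --nodes=20                 # number of nodes that will process the IWP data
  #SBATCH --time=48:00:00            # wall-clock limit; reduce this for test runs
  #SBATCH --account={ACCOUNT NAME}   # private allocation name -- never push the real value to GitHub

  # global settings
  source /scratch/$USER/viz_env/bin/activate   # or: conda activate {ENVIRONMENT}
  ```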
- Open a new terminal, start a `tmux` session, then activate your virtual environment. (It's important to do this before ssh'ing into a node because we want the `tmux` session to persist regardless of the job's runtime.)
  - Using `tmux` is helpful because it runs in a separate terminal in the background, which will continue the job even if the ssh connection to the server is lost. Best practice is to use the `tmux` feature that opens multiple terminals at once in different panes so they are all visible in synchrony. Naming the terminals is very helpful. A few of the relevant commands are sketched below.
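  A short sketch of that `tmux` usage (the session name `iwp` is just an example):

  ```bash
  tmux new -s iwp                              # start a named session
  source /scratch/$USER/viz_env/bin/activate   # activate the environment inside it

  # split the window into panes so several terminals are visible at once:
  #   Ctrl-b %   vertical split        Ctrl-b "   horizontal split
  # detach with Ctrl-b d; the session (and anything running in it) keeps going

  tmux attach -t iwp                           # reattach later, even after losing the ssh connection
  ```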
- Switch into the slurm dir by running `cd viz-workflow/slurm`. Then run `sbatch BEST-v2_gpu_ray.slurm` or `sbatch BEST_cpu_ray_double_srun.slurm` to launch the job on the number of nodes you specified within that file.
  - At this point, credits are being charged to the allocation. The terminal will print something like "Submitted batch job #######", a unique ID for that job. This script writes the file `slurm/nodes_array.txt`, which identifies the active nodes and is used in several other scripts.
  - Other team members' `nodes_array.txt` files are saved to their local `viz-workflow/slurm` directories, and this file is not pushed to GitHub (it is listed in `.gitignore`), so there will not be issues with your run.
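  For example (the job ID and node names below are illustrative, and this assumes `nodes_array.txt` is a plain-text list of node names):

  ```bash
  cd viz-workflow/slurm
  sbatch BEST_cpu_ray_double_srun.slurm   # prints: Submitted batch job 1234567

  squeue | grep juliet                    # confirm the job appears and note the node range
  cat nodes_array.txt                     # the nodes claimed for this job, e.g. cn059 cn060
  ```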
- Run `squeue | grep {USERNAME}` to see the new job that has appeared at the top of the list.
  - Check here for clues in parentheses to troubleshoot an error in starting a job.
    - `(Priority)` means that your job is queued but all of Delta's nodes are being used by others at the moment. Keep checking whether your job has started every few minutes by running the `squeue` command. `(Resources)` means the same as `(Priority)`.
    - `QOSGrpBillingMinutes` means you tried to charge too many resources to your allocation, similar to `(Resources)`. Cancel the job (`scancel {JOB ID}`), reduce the number of hours or nodes you're requesting, and run the `sbatch` command again.
  - When there is no issue starting the job, there will be a node code such as `gpub059` if you're using GPU, or `cn059` if you're using CPU. You'll need that to ssh into that specific node. If the number of nodes specified in the `slurm` script is greater than 1 (likely the case!), the code will show a range of numbers for that one job, such as `gpub[059-060]` or `cn[059-060]`. The smallest number in this range is the "head node" for that job, meaning that is the node you will want to ssh into in your `tmux` terminal.
  - The current runtime for each job is also noted in this output, which helps you track how much time you have left to finish your job.
- Check that each node has activated the same environment by running `which python` on a few nodes (see the sketch below). It should return the Python subdirectory of your environment's folder! The `slurm` job will not run on nodes that do not have the environment successfully activated.
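  For example, to spot-check one node (the node name and environment path are illustrative):

  ```bash
  ssh cn059                                    # any node listed for your job
  source /scratch/$USER/viz_env/bin/activate
  which python                                 # should print /scratch/$USER/viz_env/bin/python
  exit
  ```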
- Sync the most recent footprints from `/scratch` to all relevant nodes' `/tmp` dirs on Delta:
  - Since Juliet is running this workflow, this script copies footprints from her `/scratch` dir to the `/tmp` dirs on the nodes.
  - It is good practice to first check the size of the footprint source directory, so you know how large the destination directory should be once everything is transferred.
  - Within a `tmux` session, switch into the correct environment, ssh into the head node (e.g. `ssh gpub059`), and run `python viz-workflow/rsync_footprints_to_nodes.py`.
  - Running this script is necessary before each job because at the end of the previous job, the `/tmp` dirs on the active nodes are wiped. We do not just pull directly from the `/scratch` dir for every step because the workflow runs much slower that way (a very low % CPU usage), since the `/scratch` dir is not directly on each node like the `/tmp` dirs are, and because the footprint files are very small and very numerous. Similarly, when we write files, we write to the `/tmp` dir and then `rsync` them back to `/scratch` afterwards to save time. One of our to-do's is to automate this step.
- Continue to check the size of the footprints dir within `/tmp` in a separate terminal:
  - To see the footprints that you are syncing to the `/tmp` dir of a node, open a new terminal and ssh into any of the active nodes. Then run `cd /tmp` (don't forget the leading slash).
  - Run `du -h -d0 staged_footprints` every few minutes. When the reported size stops growing (usually just 5-10 minutes with the whole high ice dataset), you know the sync is complete.
  - Alternatively, you could run `find staged_footprints -type f | wc -l` and check when the number of files stops growing.
  - This step could be made faster, since Delta moves a few large files faster than it handles many small files. The footprints could be consolidated by "tarring" the files together before the transfer, then un-tarring them in the `/tmp` dir (see the sketch below).
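  A rough sketch of that tar-based approach (the directory paths and node name are illustrative):

  ```bash
  # bundle the many small footprint files into a single archive on /scratch
  tar -cf /scratch/bbou/julietcohen/staged_footprints.tar -C /scratch/bbou/julietcohen staged_footprints

  # copy the one large archive to a node's /tmp (repeat or loop over each node in the job)
  rsync -a /scratch/bbou/julietcohen/staged_footprints.tar cn059:/tmp/

  # unpack on the node and remove the archive
  ssh cn059 "tar -xf /tmp/staged_footprints.tar -C /tmp && rm /tmp/staged_footprints.tar"
  ```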
- Adjust `viz-workflow/PRODUCTION_IWP_CONFIG.py` as needed:
  - Change the variable `head_node` to the head node.
    - This step should eventually be automated so the head node is defined similarly to how other scripts read the `nodes_array.txt` file.
  - With the newest batch of data as of January 2023, the `INPUT` and `FOOTPRINTS_REMOTE` paths have already been updated to Juliet's `/scratch` dir. Moving forward, these will need to be changed only when the IWP data pre-processing is updated and new files have therefore been transferred.
  - The output dirs will always need to differ from the last run, since we retain the outputs of past runs and do not overwrite them. So the `OUTPUT` path (which serves as the start of the path for all output files) includes a variable `output_subdir` in this script, which should be changed to something unique, such as any subfolders the user wants the data to be written to, with the last subfolder representing the date. Create this folder manually (see the sketch at the end of this step).
    - Note for automation: `subprocess` could be used to automatically name a folder with the date, but since runs will likely go overnight, using `subprocess` would split the output of one run into 2 folders.
    - Note: subfolders in the paths will automatically be created if they do not already exist in `/tmp`, both when `rsync` is used to transfer the footprints there and when the `staged`, `geotiff`, and `web_tile` directories are written by their respective steps. Subfolders in `/scratch` will not be created automatically. As a result, it is important to go into your `/scratch` directory and manually create the unique subdirectory you set as the variable `output_subdir`.
  - Within the subdir just created, create a subdirectory called `staged`.
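  For example, if `output_subdir` were set to `iwp_output/2023-05-01` (an illustrative value), the manual setup on `/scratch` would look like:

  ```bash
  # create the unique output subdirectory and its staged/ subdirectory;
  # the geotiff/ and web_tiles/ dirs are created manually later, right before their steps
  mkdir -p /scratch/bbou/$USER/iwp_output/2023-05-01/staged
  ```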
- Adjust `viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py` as needed:
  - Within the `try:` block of the first function, `main()`, comment out / uncomment steps depending on your stage in the workflow:

    ```python
    try:
        print("Starting main...")
        # TODO: sync footprints to nodes.
        step0_staging()
        # TODO: merge_staging()
        # step2_raster_highest()
        # step3_raster_lower()
        # step4_webtiles()
    ```
  - In the future, when we run the whole workflow from start to finish, this system of uncommenting steps and running manual steps in between scripts won't be necessary, but this workflow is not yet automated. The way the workflow is structured now:
    - The first step reminds you to manually run the script to `rsync` the footprints (already covered in a step above).
    - In the code chunk above, only uncomment the first step, `step0_staging()`, then save the script. After we run this script, transfer the tiles to `/scratch`, and merge them, we can return to this script and uncomment the next step: `step2_raster_highest()`.
- Within the `tmux` session, with your virtual environment activated, ssh into the head node associated with that job by running `ssh gpub059` or `ssh cn059`. Then run the script, which imports the config, defines all custom functions necessary for the workflow, and executes the staging step on all nodes: `python viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py`.
  - The script takes a few seconds to start (the terminal looks like it's hanging), then rapidly prints statements throughout the entire process. When that stops and the last print statement summarizes the staging, you know the step is complete. You can also monitor CPU usage in `glances` and see that it has returned to a very low percentage (like 0.2%).
- In a separate terminal, while ssh'ed into any of the nodes running the job and with your virtual environment activated, run `glances` to track memory usage and other metrics as the script runs. This helps troubleshoot issues like memory leaks, and lets you determine whether problems are related to the network connection and `iowait`, or to the code itself.
  - If using the CPU script: CPU usage should be around 80-100% for optimal performance, but it has fluctuated around 40% before, indicating a bottleneck. This could be the result of other users reading and writing to `/scratch`, which slows down your network speed but is out of your control.
  - If using the GPU script: in May 2023, during staging with 20 nodes and 17,039 input files, memory % increased to ~55% within the first 16 hours of the run, then stabilized and did not fluctuate for the rest of the run.
- Once staging is complete, run `python viz-workflow/rsync_staging_to_scratch.py`.
  - To know when this step is done, check the size of the destination directory in `/scratch` until it stops changing (see the sketch below). A to-do is to find a better way to monitor this process with `rsync`.
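  One simple way to watch the destination, assuming an illustrative output path:

  ```bash
  # re-check the size of the staged destination every 60 seconds;
  # when the total stops growing, the transfer is done (Ctrl-C to exit)
  watch -n 60 du -sh /scratch/bbou/$USER/iwp_output/2023-05-01/staged
  ```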
- Adjust the file `merge_staged_vector_tiles.py` as needed:
  - Set the variables `staged_dir_path_list` and `merged_dir_path`:
    - Change the last string part of `merged_dir_path` (where the merged staged files will live) to the lowest-numbered node of the nodes you're using (the head node).
    - Change the hard-coded nodes specified in `staged_dir_paths_list` to the list of nodes you're using for this job, but do not include the head node in this list because it was already assigned to `merged_dir_path`.
  - Within a `tmux` session, with your virtual environment activated and ssh'd into the head node, run `python viz-workflow/merge_staged_vector_tiles.py` to consolidate all staged files onto the node you specified.
  - You know when this is complete by looking for the last print statement: 'Done, exiting...'. Check the size of the staged directories in `/scratch`. The head node's directory should be much larger than all the other nodes' directories, because all the nodes' staged files have been consolidated there.
  - When merging a few TB of data, the merging step slows down after several hours. It may be necessary to execute multiple merging jobs, being careful never to merge the same worker node into the head node in 2 different jobs.
- Once the merge is complete, transfer the `log.log` in each node's `/tmp` dir to that node's subdir within the `staging` dir: run `python viz-workflow/rsync_log_to_scratch.py`.
  - The logs are transferred to their respective node's dir, instead of being named after their node, because all logs need to be named the same thing so that all logs receive all statements from the PDG packages throughout processing.
  - This only needs to be done immediately after merging if you are concluding the job before moving on to the raster highest step with a fresh job (which would be the case if there is not enough time left in the current job to execute raster highest before the 48 hours are up).
- Create a `geotiff` dir: `mkdir /scratch/bbou/{user}/{output_subdir}/geotiff`
- Return to the file `viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py`, comment out `step0_staging()`, and uncomment the next step: `step2_raster_highest(batch_size=100)` (skipping 3D-tiling). Run `python viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py` in a `tmux` session with the virtual environment activated and ssh'd into the head node, as usual.
  - You know this step is complete when the destination directory size in `/tmp` stops growing and the summary of the step is printed.
- Transfer all highest z-level geotiff files from `/tmp` to `/scratch` by running `python viz-workflow/rsync_step2_raster_highest_to_scratch.py`.
- Return to the file `viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py`, comment out `step2_raster_highest(batch_size=100)`, and uncomment the next step: `step3_raster_lower(batch_size_geotiffs=100)`. Run `python viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py`.
  - The raster lower files are written directly to `/scratch`, so there is no need to `rsync` them afterwards.
- Check that the file `rasters_summary.csv` was written to `/scratch`. Download this file locally. Usually the top few lines look oddly formatted: delete those lines and re-upload the file to the same directory, overwriting the misformatted one there (a rough sketch of this cleanup is below).
  - This is necessary in order to use the file to update the ranges in the web tiling step, so that the color scale in the web tiles makes sense.
- Create a new directory called `web_tiles` in your `/scratch` dir.
  - This is necessary because the web tiling step writes directly to `/scratch` rather than to `/tmp` first, and directories cannot be created in `/scratch` while files are being written there. Subdirectories will be created as needed within the `web_tiles` dir, because `viz-staging` is configured that way.
- Return to `IN_PROGRESS_VIZ_WORKFLOW.py`, comment out `step3_raster_lower(batch_size_geotiffs=100)`, and uncomment the last step: `step4_webtiles(batch_size_web_tiles=250)`. Run `python viz-workflow/IN_PROGRESS_VIZ_WORKFLOW.py`.
  - These tiles are written directly to `/scratch`.
- Transfer the `log.log` in each node's `/tmp` dir to that node's subdir within the `staging` dir: run `python viz-workflow/rsync_log_to_scratch.py`.
  - If you are in the same job as when you started the first staging step, this will be the only time you transfer the logs. If you already did this in a previous job that only executed the staging step, pay attention to the directories these logs are transferred to: make sure they exist in `/scratch` before you run the script.
- Cancel the job: `scancel {JOB ID}`. The job ID can be found in the left column of the output from `squeue | grep {USERNAME}`. Once the job is cancelled, no more credits are being used. Recall that the job will automatically be cancelled once the time limit set in the `slurm` script is reached, even if this command is not run.
- Remember to remove the `{ACCOUNT NAME}` for the allocation from the slurm script before pushing to GitHub.
- Move output data from Delta to the Arctic Data Center as soon as possible via Globus.
- Use `globus` to transfer all output files from Delta's `/scratch` dir to Datateam (see the sketch below).
- The output files should have their own directory within `datateam:/home/pdg/data/ice-wedge-polygon-data/`.
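  A hedged sketch of a Globus CLI transfer (the endpoint UUIDs, subdirectory names, and output path are placeholders; the transfer can equally be set up in the Globus web app):

  ```bash
  globus login   # authenticate the Globus CLI first

  # request a recursive transfer from Delta /scratch to the Datateam endpoint
  globus transfer {DELTA_ENDPOINT_ID}:/scratch/bbou/julietcohen/iwp_output/2023-05-01/web_tiles \
                  {DATATEAM_ENDPOINT_ID}:/home/pdg/data/ice-wedge-polygon-data/{NEW_DIR}/web_tiles \
                  --recursive --label "IWP web tiles"
  ```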
- Navigate to the Rmd document that updates the PDG portal. Run through the code to download the latest version of the portal (paying attention to demo versus production; we update demo first to make sure everything looks good before putting it on production). Once the `.xml` file has been written locally, open it and edit the filepath for the IWP data to point to the `web_tiles` folder just populated with the new data. Save the file. Update the portal using the appropriate function.
- Check that the web tiles look good on demo, and then update the production `.xml` too.
Congratulations! You're done.