Hi @Andrew-S-Rosen, I am considering using quacc and jobflow to perform high-throughput calculations over a number of CIF files on my local computer (say mp-211.cif, mp-222.cif, ...).
I am using the user-workstation mode, i.e. the user and the runner are on my local computer and the worker is the remote HPC. So far, I have been able to handle a single CIF job in this mode.
Here is the toy code I run on my local computer:
from jobflow_remote import submit_flow
from jobflow import Flow
from ase.build import bulk
from quacc.recipes.espresso.core import relax_job
from quacc import change_settings
atoms = bulk('Cu')
espresso_parallel_cmd = ("srun --cpu_bind=cores", "-npool 4")
with change_settings({"ESPRESSO_PARALLEL_CMD": espresso_parallel_cmd}):
    # Run the relaxation job with the updated parallelization setting
    job = relax_job(atoms, relax_run=False,
                    preset="nc_sr_0.5_pbe_stringent")
flow = Flow(jobs=[job])
response = submit_flow(flow, worker='hyperion1_worker')
print(response)
print(type(response))
But two questions arise.
1)
Currently, I find that the resulting QE job runs on only one CPU core, although I have set 8 CPU cores in the pre_run of jobflow-remote.
Here is the pre_run in the project.yaml file:
The submit.sh generated on the remote HPC for the jobflow job is indeed the same as the pre_run, except for the added line that executes jobflow, e.g. "jf -fe execution run /scratch/ywfang/jobflow-qe/ff/6e/8a/ff6e8ab0-cd4e-4d75-af15-bd5c5100b7f3_1".
However, when quacc is called to run the Quantum Espresso job, the job becomes a serial one. Here is the partial output in pw.out:
Parallel version (MPI & OpenMP), running on 1 processor cores
Number of MPI processes: 1
Threads/MPI process: 1
Note that if I remove the jobflow-related lines from the Python script and run it directly on the HPC, quacc does not have this issue, since I have changed the global setting in the original Python file. Does this issue arise because the change to the global setting made on the local computer is not inherited on the remote HPC? In that case, do I have to use a .quacc.yaml file on the HPC to set the expected number of CPU cores?
2)
The second question is more general and may be hard to answer due to the lack of specifics, but it could still be valuable to discuss. Suppose we have 2000 CIF structures on the local computer. Do you think it is a good approach to write a loop in Python, run it from a single folder on the local computer to call jobflow, and submit the jobs to the HPC? My concern is that if an unexpected interruption occurs (for example, the parser fails on one structure and breaks out of the loop), how do I restart it? In addition, with so many calculation jobs/outputs stored in the database, how do we generally identify which structures have already been calculated (perhaps through the printed UUIDs and some queries of the database)?
Please feel free to provide constructive feedback if any of my questions seem overly ambiguous. Thank you very much!
The issue is that you are changing the settings in-memory, and that memory is not shared with the remote machine. See the very bottom of the settings page for why this occurs and for possible alternatives.
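For example, one alternative is to define the setting on the remote machine itself, e.g. in a ~/.quacc.yaml file in the home directory of the HPC account that runs the jobs, so that every quacc job picked up by the worker sees the same parallel command. A minimal sketch (untested; I am assuming the YAML list form validates against the tuple-typed ESPRESSO_PARALLEL_CMD setting, and the launcher/flags should be adjusted to your cluster):
# ~/.quacc.yaml on the remote HPC
ESPRESSO_PARALLEL_CMD: ["srun --cpu_bind=cores", "-npool 4"]
Quacc settings can also be supplied via QUACC_-prefixed environment variables, which you could export in the worker's pre_run if you prefer to keep everything in project.yaml.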
Generally, you want each job to be its own unit of work, so 2000 jobs is not necessarily a problem from an organizational standpoint. However, you will have a problem from a queuing perspective: your job scheduler likely will not be happy with you submitting 2000 Slurm jobs. The way to get around this is to use a "pilot job" type model. With this, you request one big Slurm allocation (say 200 nodes) and run many individual jobs on that one allocation until timeout. Unfortunately, this is not yet implemented in jobflow-remote, but a proof-of-concept is found here: Parallel batch submission Matgenix/jobflow-remote#172. Other workflow engines, such as Parsl, support this by default.

To see which structures have been calculated, you would need to rely on information from the workflow engine. Namely, you would want to check the jobflow-remote database to see which tasks have run there.
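As for the loop itself, something like the sketch below is one reasonable pattern: catch parsing failures so one bad file does not kill the loop, and stash the CIF filename in the flow metadata so you can later match database entries back to structures. This is only a sketch based on the code you posted; the glob pattern, preset, worker name, and the metadata key "cif_file" are assumptions you should adapt to your setup.
from glob import glob
from ase.io import read
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.espresso.core import relax_job

submitted = {}  # CIF filename -> job UUID, handy for bookkeeping and restarts
for cif in sorted(glob("*.cif")):
    try:
        atoms = read(cif)  # CIF parsing can fail for malformed files
    except Exception as exc:
        print(f"Skipping {cif}: {exc}")  # skip the bad structure, keep going
        continue

    job = relax_job(atoms, preset="nc_sr_0.5_pbe_stringent")
    flow = Flow([job], name=cif)
    flow.update_metadata({"cif_file": cif})  # queryable later in the database
    submit_flow(flow, worker="hyperion1_worker")
    submitted[cif] = job.uuid

print(f"Submitted {len(submitted)} flows")
If the loop is interrupted, you can re-run it and skip any filenames already recorded (or already present in the jobflow-remote database) rather than restarting from scratch.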
The first issue was partially solved by using settings_swap: the CPU cores I defined in espresso_parallel_cmd were passed to the remote HPC, but the parallel parameter "-npool 2" was not.
The revised script is as follows:
This "-npool 2" parameter does work if I run quacc directly on the HPC, in which case the following information about npool
K-points division: npool = 2
R & G space division: proc/nbgrp/npool/nimage = 4
is printed in pw.out.
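For reference, here is my understanding of how the two elements of ESPRESSO_PARALLEL_CMD are meant to be used, based on my reading of the quacc settings docs (a sketch, not verified against the source):
# The first element is the launcher placed before the pw.x binary, and the second
# element holds extra QE flags placed after it, so the intended command is roughly:
#   srun --cpu_bind=cores pw.x -npool 2 -in espresso.pwi > espresso.pwo
espresso_parallel_cmd = (
    "srun --cpu_bind=cores",  # MPI launcher prefix
    "-npool 2",               # QE parallelization flags
)
So the symptom looks like only the first element is making it into the remote command line.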
Thanks for bringing the discussion to my attention. Honestly speaking, I don't understand it very well without testing it. I'll come back to discuss it once I have some experience with the experimental feature.
Unfortunately, I'm not sure offhand what parsing issue is happening with your -npool 2. Looks like ESPRESSO_PARALLEL_CMD isn't being handled correctly.
It could be because @job is also applied in the jobflow script, conflicting with the decorator in quacc; jobflow seemed to overwrite the settings.
If there is an issue, please open an issue report with a minimal reproducible example. While I probably can't address it immediately, it will help ensure this does not get lost.