Hi @Andrew-S-Rosen, I am considering using quacc and jobflow to perform high-throughput calculations over a number of CIF files on my local computer (say mp-211.cif, mp-222.cif, ...).
I am using the user-workstation mode, i.e. the user and the runner are on my local computer and the worker is the remote HPC. So far, I have been able to handle a single CIF job in this mode.
Here is the toy code I run on my local computer:
from jobflow_remote import submit_flow
from jobflow import Flow
from ase.build import bulk
from quacc.recipes.espresso.core import relax_job
from quacc import change_settings
atoms = bulk('Cu')
espresso_parallel_cmd = ("srun --cpu_bind=cores", "-npool 4")
with change_settings({"ESPRESSO_PARALLEL_CMD": espresso_parallel_cmd}):
    # Run the relaxation job with the updated parallelization setting
    job = relax_job(atoms, relax_run=False,
                    preset="nc_sr_0.5_pbe_stringent")
flow = Flow(jobs=[job])
response = submit_flow(flow, worker='hyperion1_worker')
print(response)
print(type(response))
But two questions arise.
1)
Currently, I find that the resulting QE job runs on only one CPU core, although I have set 8 CPU cores in the pre_run of jobflow-remote.
Here is the pre_run in the project.yaml file:
The submit.sh generated on the remote HPC for the jobflow job is indeed the same as the pre_run, except for the added line that executes jobflow, e.g. "jf -fe execution run /scratch/ywfang/jobflow-qe/ff/6e/8a/ff6e8ab0-cd4e-4d75-af15-bd5c5100b7f3_1".
However, when quacc is called to run the Quantum Espresso job, the job becomes a serial one. Here is the partial output in pw.out:
Parallel version (MPI & OpenMP), running on 1 processor cores
Number of MPI processes: 1
Threads/MPI process: 1
Note that if I remove the jobflow-related lines from the Python script and run it directly on the HPC, quacc does not have this issue, since I have changed the global setting in the original Python file. Does this issue arise because the change to the global setting made on the local computer is not inherited on the remote HPC? In that case, do I have to use a .quacc.yaml file on the HPC to set the expected number of CPU cores?
2)
The second question is more general and may be hard to answer due to the lack of specifics, but it could still be valuable to discuss. Suppose we have 2000 CIF structures on the local computer. Do you think it is a good approach to write a loop in Python, run it from a single folder on the local computer to call jobflow, and submit the jobs to the HPC? My concern is that if an unexpected interruption occurs (for example, the parser fails on one structure and breaks out of the loop), how do I restart it? In addition, with so many calculation jobs/outputs stored in the database, how do we generally identify which structures have already been calculated (perhaps through the printed UUIDs and some queries of the database)?
Please feel free to provide constructive feedback if any of my questions seem overly ambiguous. Thank you very much!
The issue is that you are changing the settings in-memory, and that memory is not shared with the remote machine. See the very bottom of the settings page for why this occurs and for possible alternatives.
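For example, one alternative is to define the setting on the remote machine itself, e.g. in a ~/.quacc.yaml file in the home directory of the HPC account that runs the jobs, so that every quacc job picked up by the worker sees the same parallel command. A minimal sketch (untested; I am assuming the YAML list form validates against the tuple-typed ESPRESSO_PARALLEL_CMD setting, and the launcher/flags should be adjusted to your cluster):
# ~/.quacc.yaml on the remote HPC
ESPRESSO_PARALLEL_CMD: ["srun --cpu_bind=cores", "-npool 4"]
Quacc settings can also be supplied via QUACC_-prefixed environment variables, which you could export in the worker's pre_run if you prefer to keep everything in project.yaml.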
Generally, you want each job to be its own unit of work, so 2000 jobs is not necessarily a problem from an organizational standpoint. However, you will have a problem from a queuing perspective: your job scheduler likely will not be happy with you submitting 2000 Slurm jobs. The way to get around this is to use a "pilot job" type model. With this, you request one big Slurm allocation (say 200 nodes) and run many individual jobs on that one allocation until timeout. Unfortunately, this is not yet implemented in jobflow-remote, but a proof-of-concept is found here: Parallel batch submission Matgenix/jobflow-remote#172. Other workflow engines, such as Parsl, support this by default.

To see which structures have been calculated, you would need to rely on information from the workflow engine. Namely, you would want to check the jobflow-remote database to see which tasks have run there.
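As for the loop itself, something like the sketch below is one reasonable pattern: catch parsing failures so one bad file does not kill the loop, and stash the CIF filename in the flow metadata so you can later match database entries back to structures. This is only a sketch based on the code you posted; the glob pattern, preset, worker name, and the metadata key "cif_file" are assumptions you should adapt to your setup.
from glob import glob
from ase.io import read
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.espresso.core import relax_job

submitted = {}  # CIF filename -> job UUID, handy for bookkeeping and restarts
for cif in sorted(glob("*.cif")):
    try:
        atoms = read(cif)  # CIF parsing can fail for malformed files
    except Exception as exc:
        print(f"Skipping {cif}: {exc}")  # skip the bad structure, keep going
        continue

    job = relax_job(atoms, preset="nc_sr_0.5_pbe_stringent")
    flow = Flow([job], name=cif)
    flow.update_metadata({"cif_file": cif})  # queryable later in the database
    submit_flow(flow, worker="hyperion1_worker")
    submitted[cif] = job.uuid

print(f"Submitted {len(submitted)} flows")
If the loop is interrupted, you can re-run it and skip any filenames already recorded (or already present in the jobflow-remote database) rather than restarting from scratch.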
The first issue was partially solved by using settings_swap: the CPU cores I defined in espresso_parallel_cmd were passed to the remote HPC, but the parallel parameter "-npool 2" was not.
The revised script is as follows:
This "-npool 2" parameter does work if I run quacc directly on the HPC, in which case the following information about npool
K-points division: npool = 2
R & G space division: proc/nbgrp/npool/nimage = 4
is printed in pw.out.
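For reference, here is my understanding of how the two elements of ESPRESSO_PARALLEL_CMD are meant to be used, based on my reading of the quacc settings docs (a sketch, not verified against the source):
# The first element is the launcher placed before the pw.x binary, and the second
# element holds extra QE flags placed after it, so the intended command is roughly:
#   srun --cpu_bind=cores pw.x -npool 2 -in espresso.pwi > espresso.pwo
espresso_parallel_cmd = (
    "srun --cpu_bind=cores",  # MPI launcher prefix
    "-npool 2",               # QE parallelization flags
)
So the symptom looks like only the first element is making it into the remote command line.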
Thanks for bringing the discussion to my attention. Honestly speaking, I don't understand it very well without testing it. I'll come back to discuss it once I have some experience with the experimental feature.
Unfortunately, I'm not sure offhand what parsing issue is happening with your -npool 2. Looks like ESPRESSO_PARALLEL_CMD isn't being handled correctly.
It could be because @job is also applied in the jobflow script, conflicting with the decorator in quacc; jobflow seemed to overwrite the settings.
If there is an issue, please open an issue report with a minimal reproducible example. While I probably can't address it immediately, it will help ensure this does not get lost.