Need help in Running a 37 Qubit Simulation using multi-gpu multi-node on Supercomputer #170

silicofeller · 2025-01-01T10:23:56Z

silicofeller
Jan 1, 2025

I have 4 nodes with 8 A100 GPUs each (40 GB per GPU). This is the command I ran:

mpiexec -n 16 --bind-to none --map-by node --oversubscribe -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 36 --precision single --ngpus 1 --cusvaer-global-index-bits 3,1 --cusvaer-p2p-device-bits 3

Traceback (most recent call last):
File "/opt/conda/envs/cuquantum-24.03/bin/cuquantum-benchmarks", line 8, in
sys.exit(run())
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run.py", line 335, in run
runner.run()
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run_interface.py", line 92, in run
self._run()
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run_interface.py", line 304, in _run
preprocess_data = backend.preprocess_circuit(
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/backends/backend_qiskit.py", line 61, in preprocess_circuit
self.transpiled_qc = qiskit.transpile(circuit, self.backend) # (circuit, basis_gates=['u3', 'cx'], backend=self.backend)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/qiskit/compiler/transpiler.py", line 341, in transpile
_check_circuits_coupling_map(circuits, coupling_map, backend)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/qiskit/compiler/transpiler.py", line 455, in _check_circuits_coupling_map
raise CircuitTooWideForTarget(
qiskit.transpiler.exceptions.CircuitTooWideForTarget: 'Number of qubits (36) in circuit-158 is greater than maximum (35) in the coupling_map'
2025-01-01 09:51:02,316 INFO * Running quantum_volume with 1 GPUs, and 36 qubits [qiskit-v1.0.2 | cusvaer-v0.4.0]:

silicofeller · 2025-01-01T10:25:12Z

silicofeller
Jan 1, 2025
Author

mpiexec -n 8 --bind-to none --map-by node --oversubscribe -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 35 --ngpus 1 --cusvaer-global-index-bits 1,1 --cusvaer-p2p-device-bits 1

2025-01-01 09:12:57,546 INFO * Running quantum_volume with 1 GPUs, and 35 qubits [qiskit-v1.0.2 | cusvaer-v0.4.0]:
2025-01-01 09:12:57,583 INFO transpile took 0.03684243559837341 s
2025-01-01 09:16:17,371 INFO - [CPU] Averaged elapsed time: 15.118398990 s
2025-01-01 09:16:17,371 INFO - [CPU] Processor type: AMD EPYC 7742 64-Core Processor
2025-01-01 09:16:17,372 INFO -
2025-01-01 09:16:17,372 INFO - [GPU] Averaged elapsed time: 15.118571680 s
2025-01-01 09:16:17,372 INFO - [GPU] GPU device name: NVIDIA A100-SXM4-40GB
2025-01-01 09:16:17,372 INFO

1 reply

silicofeller Jan 1, 2025
Author

The 35-qubit run succeeds.
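For reference, a back-of-the-envelope estimate of the state-vector footprint behind these runs. This is only a rough sketch: it assumes the state vector is split evenly across GPUs, that single precision stores each amplitude as complex64 (8 bytes) and double precision as complex128 (16 bytes), and it ignores any workspace the simulator needs on top. The function name is made up for illustration.

```python
def state_vector_gib_per_gpu(n_qubits: int, n_gpus: int, bytes_per_amplitude: int = 8) -> float:
    """Return the GiB of state vector held by each GPU.

    Assumes an even split across GPUs and ignores simulator workspace.
    bytes_per_amplitude: 8 for complex64 (single), 16 for complex128 (double).
    """
    total_bytes = (2 ** n_qubits) * bytes_per_amplitude
    return total_bytes / n_gpus / 2 ** 30


# Each extra qubit doubles the state vector, so doubling the GPU count
# keeps the per-GPU share constant.
for n_qubits, n_gpus in [(35, 8), (36, 16), (37, 32)]:
    gib = state_vector_gib_per_gpu(n_qubits, n_gpus)
    print(f"{n_qubits} qubits on {n_gpus} GPUs: {gib:.0f} GiB per GPU")
```

Under these assumptions, all three cases come out at 32 GiB per GPU in single precision, which is below the capacity of a 40 GB A100 but leaves little headroom. Double precision doubles every figure.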

silicofeller · 2025-01-01T10:25:16Z

silicofeller
Jan 1, 2025
Author

mpiexec -n 32 --bind-to none --map-by node --oversubscribe -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 35 --ngpus 1 --cusvaer-global-index-bits 3,1 --cusvaer-p2p-device-bits 3
2025-01-01 09:43:35,530 INFO * Running quantum_volume with 1 GPUs, and 35 qubits [qiskit-v1.0.2 | cusvaer-v0.4.0]:
2025-01-01 09:43:35,568 INFO transpile took 0.037899455055594444 s
Traceback (most recent call last):
File "cusvaer/backends/cusvaerext.pyx", line 153, in cusvaer.backends.cusvaerext.CusvSimulator._setup_multi_process
cusvaer.backends.ubackendext.RuntimeError: failed to allocate device memory for state vetor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/envs/cuquantum-24.03/bin/cuquantum-benchmarks", line 8, in
sys.exit(run())
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run.py", line 335, in run
runner.run()
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run_interface.py", line 92, in run
self._run()
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run_interface.py", line 316, in _run
perf_time, cuda_time, post_time, post_process = self.timer(backend, circuit, self.nshots) # nsamples -> nshots
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/run_interface.py", line 160, in timer
backend.run(circuit, nshots)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cuquantum_benchmarks/backends/backend_qiskit.py", line 78, in run
post_res_list = results.result().get_memory()
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/qiskit_aer/jobs/utils.py", line 42, in _wrapper
return func(self, *args, **kwargs)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/qiskit_aer/jobs/aerjob.py", line 114, in result
return self._future.result(timeout=timeout)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/qiskit_aer/backends/aerbackend.py", line 414, in _execute_qobj_job
output = self._cusvaer_execute_qobj(qobj)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/qiskit_aer/backends/aer_simulator.py", line 1123, in _cusvaer_execute_qobj
cusvaer_output = cusvaer_sim._run(qobj, cusvaer_options._fields)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cusvaer/backends/statevector_simulator.py", line 474, in _run
result = self._run_job(job_id, qobj)
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cusvaer/backends/statevector_simulator.py", line 502, in _run_job
result_list.append(self.run_experiment(experiment))
File "/opt/conda/envs/cuquantum-24.03/lib/python3.10/site-packages/cusvaer/backends/statevector_simulator.py", line 600, in run_experiment
cusvsim.setup(n_wires_list, wires, self._dtype,
File "cusvaer/backends/cusvaerext.pyx", line 91, in cusvaer.backends.cusvaerext.CusvSimulator.setup
File "cusvaer/backends/cusvaerext.pyx", line 161, in cusvaer.backends.cusvaerext.CusvSimulator._setup_multi_process
qiskit.providers.basic_provider.exceptions.BasicProviderError: 'Cannot allocate 35 qubit state vector. The max number of qubits is 32, 32 process(es) / 27 qubits/GPU.'

1 reply

silicofeller Jan 1, 2025
Author

I guess multi-node is not working in this case.
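One thing that may be worth checking in the command above: as the cuQuantum Appliance documentation describes it, the entries of --cusvaer-global-index-bits are expected to sum to log2 of the number of MPI processes, and --cusvaer-p2p-device-bits is log2 of the number of GPUs that can communicate over P2P (3 for an 8-GPU node). If that reading is right, 3,1 describes 16 processes rather than the 32 launched with -n 32. The helpers below are a hypothetical sanity check, not part of cusvaer.

```python
def expected_process_count(global_index_bits: list[int]) -> int:
    """Process count implied by the index bits (2 ** sum), under the stated assumption."""
    return 2 ** sum(global_index_bits)


def check_launch(n_procs: int, global_index_bits: list[int]) -> bool:
    """Return True when the mpiexec -n value matches the index bits."""
    return expected_process_count(global_index_bits) == n_procs


print(check_launch(32, [3, 1]))  # -n 32 with 3,1 -> False
print(check_launch(16, [3, 1]))  # -n 16 with 3,1 -> True
print(check_launch(32, [3, 2]))  # -n 32 with 3,2 -> True
```

Whether the simulator rejects a mismatch or silently proceeds is not established in this thread, so treat the check as a hint rather than a diagnosis.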

ymagchi · 2025-01-01T22:12:02Z

ymagchi
Jan 1, 2025
Maintainer

Hi @silicofeller,
Would it be possible to check whether the issue is reproducible with the newer version 24.08 from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuquantum-appliance/tags ?

7 replies

ymagchi Jan 2, 2025
Maintainer

I see that the error qiskit.transpiler.exceptions.CircuitTooWideForTarget has been resolved.
Could you please check whether the process ranks are distributed across the nodes as expected, either by adding the --report-bindings option to mpiexec or by running hostname? I think that multiple processes were trying to use the same GPU.
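A quick way to see where each rank lands is to launch a tiny script under the same mpiexec options. The sketch below assumes Open MPI, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_LOCAL_RANK to every process, and 8 GPUs per node. The round-robin GPU pick only illustrates a common convention and is not necessarily what cusvaer does internally.

```python
import os
import socket


def describe_rank(env: dict, gpus_per_node: int = 8) -> str:
    """Summarize host, global rank, local rank, and a round-robin GPU pick."""
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    gpu = local_rank % gpus_per_node  # illustrative convention only
    return f"host={socket.gethostname()} rank={rank} local_rank={local_rank} gpu={gpu}"


if __name__ == "__main__":
    # Each MPI process prints one line describing where it is running.
    print(describe_rank(dict(os.environ)))
```

Saved under a name of your choice and started with the same mpiexec options as the benchmark, it prints one line per rank. If more ranks report the same host than that host has GPUs, at least two of them would share a GPU under this convention.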

silicofeller Jan 3, 2025
Author

mpiexec -n 16 --bind-to none --map-by node --oversubscribe --report-bindings -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 35 --ngpus 1 --cusvaer-global-index-bits 3,1 --cusvaer-p2p-device-bits 3

[f5f1cdf0a84f:00083] MCW rank 10 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00077] MCW rank 4 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00082] MCW rank 9 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00087] MCW rank 14 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00076] MCW rank 3 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00080] MCW rank 7 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00078] MCW rank 5 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00081] MCW rank 8 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00085] MCW rank 12 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00074] MCW rank 1 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00073] MCW rank 0 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00086] MCW rank 13 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00079] MCW rank 6 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00088] MCW rank 15 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00075] MCW rank 2 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00084] MCW rank 11 is not bound (or bound to all available processors)
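All 16 ranks in this output report the same host, f5f1cdf0a84f, which would mean none of them were placed on a second node. For longer logs, a small parser can count ranks per host. The sketch assumes the `[host:pid] MCW rank N ...` line format shown above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "[f5f1cdf0a84f:00083] MCW rank 10 is not bound ..."
BINDING_LINE = re.compile(r"^\[(?P<host>[^:\]]+):\d+\]\s+MCW rank (?P<rank>\d+)\b")


def ranks_per_host(log_text: str) -> Counter:
    """Count how many MPI ranks each host reported in --report-bindings output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BINDING_LINE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


sample = """\
[f5f1cdf0a84f:00083] MCW rank 10 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00077] MCW rank 4 is not bound (or bound to all available processors)
[f5f1cdf0a84f:00073] MCW rank 0 is not bound (or bound to all available processors)
"""
print(ranks_per_host(sample))  # Counter({'f5f1cdf0a84f': 3})
```

A healthy two-node launch would be expected to show two distinct host keys with roughly equal counts.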

silicofeller Jan 3, 2025
Author

srun --nodes=2 --ntasks-per-node=128 --gres=gpu:A100-SXM4:8 --reservation=quansim_142 --partition=airawatp --time=00:05:00 --pty /bin/bash

docker run --gpus all -it --rm nvcr.io/nvidia/cuquantum-appliance:24.08-cuda12.2.2-devel-ubuntu22.04-x86_64

export http_proxy=http://172.55.6.50:9090
export ftp_proxy=http://172.55.6.50:9090
export https_proxy=http://172.55.6.50:9090

git clone https://github.com/NVIDIA/cuQuantum.git
cd cuQuantum/benchmarks
pip install .

mpiexec -n 16 --bind-to none --report-bindings --host scn73-10g:8,scn74-10g:8 --map-by node --oversubscribe -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 36 --ngpus 1 --cusvaer-global-index-bits 3,1 --cusvaer-p2p-device-bits 3

This keeps running and appears to hang, so I exited the cuQuantum Appliance.

Then I did this:

conda activate cuquantum_env

srun --nodes=2 --ntasks-per-node=128 --gres=gpu:A100-SXM4:8 --reservation=quansim_142 --partition=airawatp --time=00:05:00 --pty /bin/bash

mpiexec -n 16 --bind-to none --report-bindings --host scn73-10g:8,scn74-10g:8 --map-by node --oversubscribe -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 36 --ngpus 1 --cusvaer-global-index-bits 3,1 --cusvaer-p2p-device-bits 3

A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
PMIx stopped checking at the first component that it did not find.

Host: scn73-mn
Framework: psec
Component: munge

silicofeller Jan 3, 2025
Author

Then I tried with this:

PMIX_SECURITY_MODE=none mpiexec -x OMPI_MCA_opal_cuda_support=true -x UCX_MEMTYPE_CACHE=n -mca psec none -mca routed direct -n 16 --bind-to none --report-bindings --host scn73-10g:8,scn74-10g:8 --map-by node --oversubscribe -x UCX_TLS=^cma --mca coll_hcoll_enable 0 -x OMPI_MCA_coll_hcoll_enable=0 cuquantum-benchmarks circuit --frontend qiskit --backend cusvaer --benchmark quantum_volume --nqubits 36 --ngpus 1 --cusvaer-global-index-bits 3,1 --cusvaer-p2p-device-bits 3

Again it got stuck.

ymagchi Jan 5, 2025
Maintainer

Could you please check whether simple MPI communication works across 2 nodes, as in #159 (reply in thread)?
Also, could you please share the output log with the debug options OMPI_MCA_pmix_base_verbose=100 and UCX_LOG_LEVEL=debug enabled?
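The two debug settings can be forwarded to the ranks with -x, the same way UCX_TLS is passed in the commands earlier in this thread. A small sketch that assembles such a command line; the helper name and the example program (hostname) are placeholders, and exporting the variables in the shell before calling mpiexec is another option.

```python
import shlex


def with_debug_env(mpiexec_args: list[str], program: list[str]) -> list[str]:
    """Build an mpiexec argv that exports the requested debug variables to all ranks."""
    debug_env = {
        "OMPI_MCA_pmix_base_verbose": "100",
        "UCX_LOG_LEVEL": "debug",
    }
    exports = []
    for name, value in debug_env.items():
        exports += ["-x", f"{name}={value}"]
    return ["mpiexec", *mpiexec_args, *exports, *program]


# Start with a trivial program before retrying the full benchmark.
cmd = with_debug_env(["-n", "2", "--map-by", "node"], ["hostname"])
print(shlex.join(cmd))
```

Running a trivial program such as hostname first separates launcher and transport problems from anything specific to cuquantum-benchmarks.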

silicofeller · 2025-01-03T19:39:43Z

silicofeller
Jan 3, 2025
Author

Summary of MPI and CUDA-Awareness Issues on Supercomputer Cluster

Problem Statement:

We are attempting to run a 36-qubit quantum volume simulation using cuquantum-benchmarks on the supercomputer cluster, leveraging NVIDIA's cuQuantum Appliance. The setup involves multiple nodes with NVIDIA A100-SXM4-40GB GPUs. Here are the key issues encountered:

Memory Allocation:
Initial attempts to run simulations for 36 qubits resulted in memory allocation errors (cudaErrorMemoryAllocation), indicating that the available GPU memory was insufficient for the state vector simulation of 36 qubits.

CUDA Awareness in Open MPI:
Open MPI was found to be built with CUDA awareness but disabled by default. Enabling CUDA awareness through OMPI_MCA_opal_cuda_support=true was attempted.

MPI Configuration and Binding:
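A CircuitTooWideForTarget error was raised by qiskit.transpile for the 36-qubit circuit ("Number of qubits (36) ... is greater than maximum (35) in the coupling_map"). This is a qubit limit checked against the backend's coupling map during transpilation rather than a process-binding error. Process placement was adjusted separately using MPI options like --bind-to none, --map-by node, and --oversubscribe.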

Network and Host Issues:
Problems with hostname resolution and network connectivity have been encountered, requiring the use of specific hostnames or IP addresses in MPI commands.

Munge Configuration in PMIx:
A persistent error where MPI/PMIx fails to find or use the munge component for security, even after explicit disabling attempts (-mca psec none). This error blocks the execution of any MPI command, regardless of CUDA or simulation settings.

Actions Taken:

CUDA Awareness: Enabled through environment variables and command-line parameters.
Process Distribution: Adjusted distribution settings like --cusvaer-global-index-bits and --cusvaer-p2p-device-bits.
Network Configuration: Attempts to specify nodes by IP or hostname.
Debugging: Increased verbosity to understand component loading issues.

Request for Resolution from NVIDIA:

We kindly request assistance from NVIDIA with the following:

Memory Optimization for Simulations: Guidance on optimizing the cuquantum-benchmarks for higher qubit counts or alternative simulation strategies that might be less memory-intensive.

CUDA-Aware MPI Configuration:
Verification that our approach to enable CUDA awareness in Open MPI is correct.
Assistance in configuring CUDA-aware MPI for a multi-node, multi-GPU setup like ours.

Resolution of Munge/PMIx Issues:
Help in resolving why munge is still being sought by PMIx/Open MPI even when explicitly disabled. Is there a way to permanently disable munge for our jobs, or is there a configuration error in our system setup?

Documentation or Known Issues:
Any known issues or documentation regarding the interaction between cuquantum-benchmarks, Open MPI, PMIx, and munge in a similar environment setup.

Best Practices for Node and GPU Management:
Recommendations on how to best manage node assignments, GPU allocations, and process bindings for quantum simulations on a cluster.

We appreciate any insights, configurations, or patches that could help us overcome these technical hurdles and successfully run our quantum simulations.


silicofeller · 2025-01-03T19:45:35Z

silicofeller
Jan 3, 2025
Author

@ymagchi Would it be possible to get on a quick Zoom call at a time that suits you?
