-
I was trying to reproduce the performance statement in https://developer.nvidia.com/blog/nvidia-announces-cuquantum-beta-availability-record-quantum-benchmark-and-quantum-container/.

I'm running the following minimal reproducer on Python 3.8, cuQuantum 22.11, and an NVIDIA A100 40 GB (on a GCP instance):

```python
import time

import cirq
import cupy
import qsimcirq
from cuquantum import CircuitToEinsum, contract

simulator = qsimcirq.QSimSimulator()

# See https://quantumai.google/cirq/experiments/textbook_algorithms
def make_qft(qubits):
    """Generator for the QFT on a list of qubits."""
    qreg = list(qubits)
    while len(qreg) > 0:
        q_head = qreg.pop(0)
        yield cirq.H(q_head)
        for i, qubit in enumerate(qreg):
            yield (cirq.CZ ** (1 / 2 ** (i + 1)))(qubit, q_head)

def simulate_and_measure(nqubits):
    qubits = cirq.LineQubit.range(nqubits)
    qft = cirq.Circuit(make_qft(qubits))
    myconverter = CircuitToEinsum(qft, backend=cupy)

    # qsim state-vector simulation
    tic = time.time()
    simulator.simulate(qft)
    elapsed_qsim = time.time() - tic
    out = {"qsim": elapsed_qsim}

    # cuTensorNet: expectation value of Z on the first qubit
    pauli_string = {qubits[0]: 'Z'}
    expression, operands = myconverter.expectation(pauli_string, lightcone=True)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_expectation"] = elapsed

    # cuTensorNet: batched amplitudes, fixing every qubit but the last to |0>
    fixed_states = "0" * (nqubits - 1)
    num_fixed = len(fixed_states)
    fixed = dict(zip(myconverter.qubits[:num_fixed], fixed_states))
    expression, operands = myconverter.batched_amplitudes(fixed)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_batched"] = elapsed
    return out

for i in [10, 15, 20, 25, 30]:
    print(i, simulate_and_measure(i))
```

Output (times are elapsed seconds; 10, 15, … are the number of qubits in the QFT):

```
10 {'qsim': 0.9677999019622803, 'cu_expectation': 0.29337143898010254, 'cu_batched': 0.07590365409851074}
15 {'qsim': 0.023270368576049805, 'cu_expectation': 0.019628524780273438, 'cu_batched': 0.3687710762023926}
20 {'qsim': 0.03504538536071777, 'cu_expectation': 0.023822784423828125, 'cu_batched': 0.9347813129425049}
25 {'qsim': 0.14235782623291016, 'cu_expectation': 0.02486586570739746, 'cu_batched': 2.39030122756958}
30 {'qsim': 3.4044816493988037, 'cu_expectation': 0.028923749923706055, 'cu_batched': 4.6819908618927}
35 {'cu_expectation': 1.0615959167480469, 'cu_batched': 10.964831829071045}
40 {'cu_expectation': 0.03381609916687012, 'cu_batched': 82.43729209899902}
```

I wasn't able to go to 35 qubits with qsim because it ran out of CUDA memory. The much reduced memory usage alone is enough to prefer cuQuantum for this use case, but I was hoping that …
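For context on why qsim hits OOM around 35 qubits: a full state-vector simulator has to store all 2^n complex amplitudes. Below is a minimal stdlib-only sketch of the same QFT circuit applied to a dense state vector (the helper names are illustrative and have no relation to qsim's internals):

```python
import cmath
from math import sqrt

def apply_h(state, target):
    """Apply a Hadamard gate to qubit `target` on a dense 2**n state vector."""
    out = state[:]
    for i in range(len(state)):
        if not (i >> target) & 1:
            j = i | (1 << target)           # paired basis state with target bit set
            a, b = state[i], state[j]
            out[i] = (a + b) / sqrt(2)
            out[j] = (a - b) / sqrt(2)
    return out

def apply_cphase(state, control, target, theta):
    """Apply a controlled phase exp(i*theta) to the |11> components."""
    return [
        amp * cmath.exp(1j * theta) if (i >> control) & 1 and (i >> target) & 1 else amp
        for i, amp in enumerate(state)
    ]

def qft_state(nqubits):
    """Dense simulation of the QFT circuit above, starting from |0...0>."""
    state = [0j] * (1 << nqubits)   # memory grows as 2**n -- the OOM culprit
    state[0] = 1 + 0j
    qreg = list(range(nqubits))
    while qreg:
        q_head = qreg.pop(0)
        state = apply_h(state, q_head)
        for i, qubit in enumerate(qreg):
            # mirrors (cirq.CZ ** (1 / 2 ** (i + 1)))(qubit, q_head)
            state = apply_cphase(state, qubit, q_head, cmath.pi / 2 ** (i + 1))
    return state
```

For the |0…0⟩ input the QFT yields the uniform superposition, so every amplitude has magnitude 1/√(2^n). The memory wall is easy to see: even in single precision, 2^35 amplitudes × 8 bytes = 256 GiB, far beyond a 40 GB A100.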
-
Hello, thanks for reaching out.

First, regarding the performance statement in our tech blog for the cuQuantum beta: the benchmark data shown in the blog was measured with cuStateVec. Both cuQuantum libraries (cuStateVec and cuTensorNet) have improved significantly since the blog was written. Generally, compared with state-vector simulation, tensor networks trade computational cost for memory savings, and one would expect some performance overhead.

Going back to your benchmark, there are a couple of things to note.

First, when measuring the performance of GPU functions, you should not use CPU timing utilities. You can try either `cupyx.profiler.benchmark()` or CUDA events via `cupy.cuda.get_elapsed_time()`.

Second, calling `contract()` directly lumps path finding and contraction execution into one call. You can separate the two steps:

```python
from cuquantum import contract, contract_path

path, info = contract_path(expression, *operands)
contract(expression, *operands, optimize={"path": path})
```

and you'll see that in your test cases more time is spent on the path-finding step than on actual execution, and the execution time can often be lower than that of state-vector simulation (as in qsimcirq or cuStateVec). As the problem size increases further, the execution time will rapidly scale up and dominate. A good general strategy is to cache and reuse these paths if you're contracting the same tensor network multiple times. Check out our Python sample and notebook sample.

Third, you're tuning the right knob for expectation value computation: for the QFT circuit, turning on the `lightcone` option significantly reduces the size of the tensor network that has to be contracted.

Finally, another small tip for your snippet is to reuse the cuTensorNet handle to reduce the overhead in small problems, e.g.:

```python
import cuquantum.cutensornet as cutn

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    ...
    path, info = contract_path(expression, *operands, options={"handle": handle})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle})
    ...

cutn.destroy(handle)
```

If you're using cuQuantum Python 22.11 or above, enabling the nonblocking behavior further helps overlap CPU/GPU activities and reduces the run time, especially when both path finding and contraction execution take a long time:

```python
import cuquantum.cutensornet as cutn

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    ...
    path, info = contract_path(expression, *operands, options={"handle": handle, "blocking": "auto"})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle, "blocking": "auto"})
    ...

cutn.destroy(handle)
```

Also refer to our "Network"-object based samples if you figure you would reuse the same network topology multiple times; it helps further reduce the overhead in Python object management.
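The path-caching idea has a direct analogue in plain NumPy, whose `einsum_path()`/`einsum(optimize=path)` split mirrors cuQuantum's `contract_path()`/`contract()`. A minimal sketch of the caching pattern (NumPy only, no GPU required; the expression and shapes are made up for illustration):

```python
import numpy as np

# A tiny "tensor network": a chain of matrix multiplications as an einsum.
expr = "ab,bc,cd->ad"
operands = [np.random.rand(32, 32) for _ in range(3)]

# Pay the path-finding cost once...
path, info = np.einsum_path(expr, *operands, optimize="greedy")

# ...then reuse the cached path for every contraction of this topology.
result = np.einsum(expr, *operands, optimize=path)
```

The same pattern applies to cuQuantum: find the path once per network topology, then hand it to every subsequent `contract()` call via `optimize={"path": path}`.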