-
I was trying to reproduce the performance statement in https://developer.nvidia.com/blog/nvidia-announces-cuquantum-beta-availability-record-quantum-benchmark-and-quantum-container/.

I'm running the following minimal reproducer on Python 3.8, cuQuantum 22.11, and an NVIDIA A100 40 GB (on a GCP instance):

```python
import time

import cirq
import cupy
import qsimcirq
from cuquantum import CircuitToEinsum, contract

simulator = qsimcirq.QSimSimulator()

# See https://quantumai.google/cirq/experiments/textbook_algorithms
def make_qft(qubits):
    """Generator for the QFT on a list of qubits."""
    qreg = list(qubits)
    while len(qreg) > 0:
        q_head = qreg.pop(0)
        yield cirq.H(q_head)
        for i, qubit in enumerate(qreg):
            yield (cirq.CZ ** (1 / 2 ** (i + 1)))(qubit, q_head)

def simulate_and_measure(nqubits):
    qubits = cirq.LineQubit.range(nqubits)
    qft = cirq.Circuit(make_qft(qubits))
    myconverter = CircuitToEinsum(qft, backend=cupy)

    # qsim state-vector simulation
    tic = time.time()
    simulator.simulate(qft)
    elapsed_qsim = time.time() - tic
    out = {"qsim": elapsed_qsim}

    # cuTensorNet: expectation value of Z on the first qubit
    pauli_string = {qubits[0]: 'Z'}
    expression, operands = myconverter.expectation(pauli_string, lightcone=True)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_expectation"] = elapsed

    # cuTensorNet: batched amplitudes, fixing every qubit but the last to |0>
    fixed_states = "0" * (nqubits - 1)
    num_fixed = len(fixed_states)
    fixed = dict(zip(myconverter.qubits[:num_fixed], fixed_states))
    expression, operands = myconverter.batched_amplitudes(fixed)
    tic = time.time()
    contract(expression, *operands)
    elapsed = time.time() - tic
    out["cu_batched"] = elapsed
    return out

for i in [10, 15, 20, 25, 30]:
    print(i, simulate_and_measure(i))
```

Output (times are elapsed seconds; 10, 15, … are the number of qubits in the QFT):

```
10 {'qsim': 0.9677999019622803, 'cu_expectation': 0.29337143898010254, 'cu_batched': 0.07590365409851074}
15 {'qsim': 0.023270368576049805, 'cu_expectation': 0.019628524780273438, 'cu_batched': 0.3687710762023926}
20 {'qsim': 0.03504538536071777, 'cu_expectation': 0.023822784423828125, 'cu_batched': 0.9347813129425049}
25 {'qsim': 0.14235782623291016, 'cu_expectation': 0.02486586570739746, 'cu_batched': 2.39030122756958}
30 {'qsim': 3.4044816493988037, 'cu_expectation': 0.028923749923706055, 'cu_batched': 4.6819908618927}
35 {'cu_expectation': 1.0615959167480469, 'cu_batched': 10.964831829071045}
40 {'cu_expectation': 0.03381609916687012, 'cu_batched': 82.43729209899902}
```

I wasn't able to go to 35 qubits with qsim because it ran out of CUDA memory. The much reduced memory usage alone is enough to prefer cuQuantum for this use case, but I was hoping that …
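For context on why qsim hits OOM around 35 qubits: a full state-vector simulator has to store all 2^n complex amplitudes. Below is a minimal stdlib-only sketch of the same QFT circuit applied to a dense state vector (the helper names are illustrative and have no relation to qsim's internals):

```python
import cmath
from math import sqrt

def apply_h(state, target):
    """Apply a Hadamard gate to qubit `target` on a dense 2**n state vector."""
    out = state[:]
    for i in range(len(state)):
        if not (i >> target) & 1:
            j = i | (1 << target)           # paired basis state with target bit set
            a, b = state[i], state[j]
            out[i] = (a + b) / sqrt(2)
            out[j] = (a - b) / sqrt(2)
    return out

def apply_cphase(state, control, target, theta):
    """Apply a controlled phase exp(i*theta) to the |11> components."""
    return [
        amp * cmath.exp(1j * theta) if (i >> control) & 1 and (i >> target) & 1 else amp
        for i, amp in enumerate(state)
    ]

def qft_state(nqubits):
    """Dense simulation of the QFT circuit above, starting from |0...0>."""
    state = [0j] * (1 << nqubits)   # memory grows as 2**n -- the OOM culprit
    state[0] = 1 + 0j
    qreg = list(range(nqubits))
    while qreg:
        q_head = qreg.pop(0)
        state = apply_h(state, q_head)
        for i, qubit in enumerate(qreg):
            # mirrors (cirq.CZ ** (1 / 2 ** (i + 1)))(qubit, q_head)
            state = apply_cphase(state, qubit, q_head, cmath.pi / 2 ** (i + 1))
    return state
```

For the |0…0⟩ input the QFT yields the uniform superposition, so every amplitude has magnitude 1/√(2^n). The memory wall is easy to see: even in single precision, 2^35 amplitudes × 8 bytes = 256 GiB, far beyond a 40 GB A100.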
-
Hello, thanks for reaching out.

First, regarding the performance statement in our tech blog for the cuQuantum beta: the benchmark data shown in the blog was measured with cuStateVec. Both cuQuantum libraries (cuStateVec and cuTensorNet) have improved significantly since the blog was written. Generally, compared with state-vector simulation, tensor networks trade computational cost for memory savings, and one would expect some performance overhead.

Going back to your benchmark, there are a couple of things to note.

First, when measuring the performance of GPU functions, you should not use CPU timing utilities. You can try either `cupyx.profiler.benchmark()` or CUDA events via `cupy.cuda.get_elapsed_time()`.

Second, calling `contract()` directly lumps path finding and contraction execution into one call. You can separate the two steps:

```python
from cuquantum import contract, contract_path

path, info = contract_path(expression, *operands)
contract(expression, *operands, optimize={"path": path})
```

and you'll see that in your test cases more time is spent on the path-finding step than on actual execution, and the execution time can often be lower than that of state-vector simulation (as in qsimcirq or cuStateVec). As the problem size increases further, the execution time will rapidly scale up and dominate. A good general strategy is to cache and reuse these paths if you're contracting the same tensor network multiple times. Check out our Python sample and notebook sample.

Third, you're tuning the right knob for expectation value computation: for the QFT circuit, turning on the `lightcone` option significantly reduces the size of the tensor network that has to be contracted.

Finally, another small tip for your snippet is to reuse the cuTensorNet handle to reduce the overhead in small problems, e.g.:

```python
import cuquantum.cutensornet as cutn

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    ...
    path, info = contract_path(expression, *operands, options={"handle": handle})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle})
    ...

cutn.destroy(handle)
```

If you're using cuQuantum Python 22.11 or above, enabling the nonblocking behavior further helps overlap CPU/GPU activities and reduces the run time, especially when both path finding and contraction execution take a long time:

```python
import cuquantum.cutensornet as cutn

handle = cutn.create()

def simulate_and_measure(nqubits, handle):
    ...
    path, info = contract_path(expression, *operands, options={"handle": handle, "blocking": "auto"})
    contract(expression, *operands, optimize={"path": path}, options={"handle": handle, "blocking": "auto"})
    ...

cutn.destroy(handle)
```

Also refer to our "Network"-object based samples if you figure you would reuse the same network topology multiple times; it helps further reduce the overhead in Python object management.
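The path-caching idea has a direct analogue in plain NumPy, whose `einsum_path()`/`einsum(optimize=path)` split mirrors cuQuantum's `contract_path()`/`contract()`. A minimal sketch of the caching pattern (NumPy only, no GPU required; the expression and shapes are made up for illustration):

```python
import numpy as np

# A tiny "tensor network": a chain of matrix multiplications as an einsum.
expr = "ab,bc,cd->ad"
operands = [np.random.rand(32, 32) for _ in range(3)]

# Pay the path-finding cost once...
path, info = np.einsum_path(expr, *operands, optimize="greedy")

# ...then reuse the cached path for every contraction of this topology.
result = np.einsum(expr, *operands, optimize=path)
```

The same pattern applies to cuQuantum: find the path once per network topology, then hand it to every subsequent `contract()` call via `optimize={"path": path}`.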