[guide] Profiling
Profiling a PyOP2 program is as simple as profiling any other Python code. Let's use the jacobi demo in the PyOP2 demo folder:
python -m cProfile -o jacobi.dat jacobi.py
This will run the entire program under cProfile and write the profiling data to jacobi.dat. Omitting -o prints a summary to stdout instead, which is not very helpful in most cases.
Luckily there is a much better way of representing the profiling data: the excellent gprof2dot can turn it into a call graph. Install it from PyPI with sudo pip install gprof2dot.
Use it as follows to create a PDF:
gprof2dot -f pstats -n 1 jacobi.dat | dot -Tpdf -o jacobi.pdf
-f pstats tells gprof2dot that it is dealing with Python cProfile data (and not actual gprof data) and -n 1 ignores everything that makes up less than 1% of the total runtime - most likely you are not interested in that (the default is 0.5).
To aggregate the profiling data from all time steps, save the following as concat.py:
"""Usage: concat.py PATTERN FILE"""
import sys
from glob import glob
from pstats import Stats
if len(sys.argv) != 3:
print __doc__
sys.exit(1)
files = glob(sys.argv[1])
s = Stats(files[0])
for f in files[1:]: s.add(f)
s.dump_stats(sys.argv[2])
Use it as python concat.py '<basename>.*.part' <basename>.dat and then call gprof2dot as before.
PyOP2 automatically times the execution of certain regions:
- sparsity building
- Plan construction
- parallel loop kernel execution
- PETSc Krylov solver
To output those timings, call profiling.summary() in your PyOP2 program or run with the environment variable PYOP2_PRINT_SUMMARY set to 1.
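For example, a minimal sketch of printing the summary at the end of a run (assuming the summary function is importable from the pyop2.profiling module referenced below) could look like this:
# Sketch only: the exact import path for summary() is an assumption.
from pyop2.profiling import summary

# ... set up sets, maps and dats and execute your parallel loops / solves ...

summary()  # print the automatically collected region timings to stdout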
To add additional timers to your own code, you can use the timed_region and timed_function helpers from pyop2.profiling.
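As an illustration, here is a minimal sketch, assuming timed_region acts as a context manager and timed_function as a decorator, each taking a label:
import time
from pyop2.profiling import timed_region, timed_function

# Sketch only: the label-based context-manager/decorator usage shown here
# is an assumption about the helpers; time.sleep stands in for real work.

def assemble():
    with timed_region("assembly"):   # time everything inside this block
        time.sleep(0.01)

@timed_function("output")            # time every call to this function
def write_output():
    time.sleep(0.01)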
There are a few caveats:
- PyOP2 delays computation, which means timing a parallel loop call will not time its execution, since the evaluation only happens when the result is requested. To disable lazy evaluation of parallel loops, set the environment variable PYOP2_LAZY to 0.
- Kernel execution with CUDA and OpenCL is asynchronous (though OpenCL kernels are currently launched synchronously), which means the time recorded for kernel execution is only the time for the kernel launch.
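Putting the pieces together, a profiling run that disables lazy evaluation and prints the built-in timing summary can be invoked by combining the environment variables above with the cProfile command from the start of this guide:
PYOP2_LAZY=0 PYOP2_PRINT_SUMMARY=1 python -m cProfile -o jacobi.dat jacobi.py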