[spec] Sequential backend
To run the sequential backend, you need:
- instant, which can be installed with
  apt-get install python-instant
- a C compiler
Rather than compiling only the user’s kernel with instant and looping over the set elements in Python, we push the entire loop down into C. This is orders of magnitude faster because the inner loop incurs no Python overhead.
The kernel we compile looks like:
void wrap_user_kernel(PyObject *arg1, PyObject *arg2, ..., PyObject *argn)
{
T1 *arg1_data = (T1 *)(((PyArrayObject *)arg1)->data);
...
for ( int i = 0; i < set_size; i++ ) {
user_kernel(...);
}
}
For every argument to the parallel loop we pass a PyObject *
which is a pointer to the data. In addition, for indirect arguments only, we pass a second PyObject *
which is a pointer to the map.
No attempt is made to pass only unique Dats and Maps, so multiple data and map pointers may point to the same underlying memory.
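To make the shape of the wrapper concrete, here is a pure-Python stand-in for its loop, restricted to direct arguments. It is only a sketch, not the code that PyOP2 generates: the function name mirrors the C wrapper above, but the (data, dim) argument pairs and the assumption that element i of a direct argument starts at offset i * dim are illustrative.

```python
import numpy as np


def wrap_user_kernel(user_kernel, set_size, *args):
    """Pure-Python stand-in for the generated C wrapper (direct arguments only).

    Each entry of args is a hypothetical (data, dim) pair: a flat 1D numpy
    array and the number of values stored per set element.
    """
    for i in range(set_size):
        # Assumption: element i of a direct argument starts at data[i * dim].
        # Slicing a 1D array yields a view, so the kernel writes in place,
        # much as the C kernel writes through the pointer it is handed.
        views = [data[i * dim:(i + 1) * dim] for data, dim in args]
        user_kernel(*views)


def double_kernel(x, y):
    """Example user kernel: y = 2 * x for a single set element."""
    y[:] = 2 * x


x = np.arange(6, dtype=float)  # 3 set elements, 2 values each
y = np.zeros(6)
wrap_user_kernel(double_kernel, 3, (x, 2), (y, 2))
```

Running the same loop in Python pays the interpreter overhead on every iteration, which is exactly the cost the C wrapper avoids.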
For an indirect argument a
with map m
accessing index idx
the user kernel gets, at loop iteration i, a pointer to the data:
a.data + m.values[i * m.dim + idx] * a.dim
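The addressing rule above can be checked with a small numpy sketch. The helper name indirect_offset and the edge-to-vertex example data are hypothetical; only the formula itself comes from the text.

```python
import numpy as np


def indirect_offset(map_values, map_dim, data_dim, i, idx):
    """Offset into the flat data array: m.values[i * m.dim + idx] * a.dim."""
    return int(map_values[i * map_dim + idx]) * data_dim


# Hypothetical example: 3 edges, each mapped to 2 vertices (m.dim == 2).
map_values = np.array([0, 1, 1, 2, 2, 3])
# 4 vertices with 2 values each (a.dim == 2), stored as a flat array.
data = np.arange(8, dtype=float)

# Edge 1, second map entry: m.values[1 * 2 + 1] == 2, so the offset is 2 * 2.
offset = indirect_offset(map_values, 2, 2, i=1, idx=1)
element = data[offset:offset + 2]  # the values the kernel's pointer refers to
```

Here offset is 4, so the kernel would see the two values belonging to vertex 2.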
For vector maps, we declare an additional local variable in the kernel wrapper:
T *argN_vec[N];
Inside the kernel loop, this is populated with the correct data pointers (as above).
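As a rough illustration of how the vector-map array might be filled on each iteration, the sketch below builds a list of numpy views in place of C pointers. It assumes that N equals the map dimension and that each entry uses the same addressing rule as a single indirect access; gather_vec and the example data are hypothetical.

```python
import numpy as np


def gather_vec(data, data_dim, map_values, map_dim, i):
    """Build the per-element list of views, one per map entry."""
    vec = []
    for idx in range(map_dim):
        # Same rule as for a single indirect argument:
        # a.data + m.values[i * m.dim + idx] * a.dim
        start = int(map_values[i * map_dim + idx]) * data_dim
        vec.append(data[start:start + data_dim])
    return vec


# Hypothetical example: 3 edges -> 2 vertices each, 2 values per vertex.
map_values = np.array([0, 1, 1, 2, 2, 3])
data = np.arange(8, dtype=float)

vec = gather_vec(data, 2, map_values, 2, i=2)  # edge 2 touches vertices 2 and 3
```

Because the entries are views rather than copies, a kernel that writes through vec modifies the underlying data, as it would through the C pointers.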
OP2, and thus PyOP2, assumes that data is stored as 1D arrays, irrespective of the dimension of the OP2-level data object. When populating data structures we must therefore ensure that the numpy data are laid out in C order. As an example, the following will not work as expected:
import numpy as np
map_vals = np.array([range(10), range(1, 11)], dtype=int).transpose()
...
# instantiate op2 data structures.
Although printing map_vals (bound to a in the session below) displays it in what looks like C order:
In [9]: a
Out[9]:
array([[ 0,  1],
       [ 1,  2],
       [ 2,  3],
       [ 3,  4],
       [ 4,  5],
       [ 5,  6],
       [ 6,  7],
       [ 7,  8],
       [ 8,  9],
       [ 9, 10]])
in actuality, the data are stored as:
In [11]: a.T
Out[11]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])
To avoid this, either create the array correctly in the first place:
a = np.array([(i, i+1) for i in range(10)], dtype=int)
or explicitly copy into the displayed order before passing to OP2 data structures:
a = np.array([range(10), range(1, 11)], dtype=int).transpose()
a = a.ravel()
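One way to catch this problem before handing an array to OP2 is to inspect numpy's contiguity flag. The sketch below is a suggestion rather than part of PyOP2: it uses np.ascontiguousarray instead of the ravel call shown above, which copies into C order only when necessary and keeps the 2D shape.

```python
import numpy as np

map_vals = np.array([range(10), range(1, 11)], dtype=int).transpose()

# transpose() returns a view on the original buffer, so the result is
# not laid out in C order even though it prints row by row.
is_c_ordered = map_vals.flags['C_CONTIGUOUS']

# Copy into C order (no copy is made if the input is already C-ordered).
fixed = np.ascontiguousarray(map_vals)

# The flattened view now matches the order in which the rows are displayed.
flat = fixed.ravel()
```

Checking flags['C_CONTIGUOUS'] is cheap, so it could reasonably be asserted wherever arrays enter OP2 data structures.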
You can debug the generated code with gdb. This requires a few modifications to sequential.py to add debug symbols and so forth to the generated code. Add:
cppargs=['-g', '-O0'], modulename=kernel._name
to the inline_with_numpy call. This makes instant compile without optimisations and with debug symbols. Additionally, the generated code is placed in the module named by modulename in the current directory, rather than in instant’s cache. To debug a failing example that executes a kernel foo in a file foo.py, do:
$ gdb --args python foo.py
...
(gdb) break wrap_foo__
Function "wrap_foo__" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
(gdb) r
...
inspect as normal