-
Notifications
You must be signed in to change notification settings - Fork 35
GPU Support #574
base: master
Are you sure you want to change the base?
GPU Support #574
Conversation
cc @inducer |
pyop2/base.py
Outdated
""" | ||
AVAILABLE_ON_HOST_ONLY = 1 | ||
AVAILABLE_ON_DEVICE_ONLY = 2 | ||
AVAILABLE_ON_BOTH = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe IntFlag
instead of IntEnum
? (just something to perhaps consider...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't manage everything, but a bunch of comments/queries.
pyop2/transforms/snpt.py
Outdated
return arg | ||
|
||
|
||
def snpt_transform(kernel, block_size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely impenetrable name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it to split_n_across_workgroups
, better?
pyop2/backends/cuda.py
Outdated
if self.can_be_represented_as_petscvec(): | ||
# report to petsc that we are altering data on the CPU | ||
with self.vec as petscvec: | ||
petscvec.array_w |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, comments about vec state management apply here, and mutatis mutandis, the rest of the PR.
pyop2/backends/cuda.py
Outdated
grid = tuple(int(evaluate(glens[i], parameters)) if i < len(glens) else 1 | ||
for i in range(2)) | ||
block = tuple(int(evaluate(llens[i], parameters)) if i < len(llens) else 1 | ||
for i in range(3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cached data is generated by a computer. Therefore there should be no need to parse and evaluate it, because the on-disk representation can just be the in-memory representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cached data is generated by a computer
I preferred a human-readable format to keep the representation valid across pyop2-versions which might not be the case with pickling. (Happy to serialize in a different format if there's a better alternative)
no need to parse and evaluate it,
I don't think we can avoid the evaluate
s because the grid sizes are typically symbolic expressions of the form: (end - start) // 32 + 1
. Therefore the grid sizes are evaluated at runtime during a GlobalKernel.__call__
invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pickle version incompatibilities are fine (they would appear when you upgrade python and we don't guarantee very much about the portability of on disk cached data between firedrake-update
calls, let alone python version upgrades).
What you're attempting to do is cache a function that computes these things. So in the code_to_compile
case, just write that function, and pickle it.
Although is this a consequence of the on-disk kernel/compiled code not remembering the loopy representation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pickle version incompatibilities are fin
Cool, will change. (Thanks!)
Although is this a consequence of the on-disk kernel/compiled code not remembering the loopy representation?
Yes, the compiled code stores only the plain CUDA/OpenCL device kernel. Loopy's cache on disk mechanism works by storing both the device kernel and the host code that would call end up calling this device kernel. But adapting PyOP2 to call loopy kernels would be a bit disruptive and I think loopy's cache dir being $XDG_CACHE_DIR
has run into some problems in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just write that function,
Planning to simply generate a python function corresponding to the expression: inducer/pymbolic#107
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my education, how come you chose to not go with the generated invokers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For PyOpenCL we could borrow the invoker, but for PyCUDA/C-target we don't have an invoker in loopy (yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Also, for the pyopencl invoker, a minor concern is that the disk-caching semantics don't play well with PyOP2's disk-caching semantics.
- I feel adding an invoker for the PyCUDA target shouldn't be too difficult. Taking a look at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is invoker generation for the C target. I'd expect that putting together CUDA invokers wouldn't be a massive effort based on a copy-paste job of the CL ones. Do you think that might be the right approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on integrating the PyCUDA/PyOpenCL invokers. For the C-target, reorganizing pyop2/compilation.py
is a bigger undertaking and also out of the scope for this PR.
pyop2/backends/cuda.py
Outdated
.callables_table)) | ||
f.write("("+",".join(str(glen) for glen in glens)+",)") | ||
f.write("\n") | ||
f.write("("+",".join(str(llen) for llen in llens)+",)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is data to be consumed by a computer, so don't make it human-readable and require parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you recommend pickling?
pyop2/backends/cuda.py
Outdated
fpath = self.extra_args_cache_file_path | ||
extra_args_np = [arg | ||
for _, arg in natsorted(numpy.load(fpath).items(), | ||
key=lambda x: x[0])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tremendously fragile. Please think of a better way of saving this data in a way that doesn't rely on dubiously-stable properties of numpy.savez/load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. While np.savez
-ing the arrays also stored a string-array that tells us the keys that were used for storing these arrays. See https://github.com/OP2/PyOP2/compare/9b7eb97878f2b6609d64affe896c5d4d901a2934..0e58f08d3ea1487e01c45df2270fa9e5790a845d.
3f834d3
to
5810a41
Compare
bb09250
to
a7475e1
Compare
16720f6
to
96f54ce
Compare
5bed614
to
3df2554
Compare
98c560b
to
f58780c
Compare
f521085
to
fc1575e
Compare
@kaushikcfd just to make you aware of a couple things:
|
Hi Connor!
Let me know if anything in this patch isn't clear. |
Draft because: