{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CPU and GPU Operator Customization with CuPy\n",
    "\n",
    "[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brainpy/brainpy/blob/master/docs/tutorial_advanced/operator_custom_with_cupy.ipynb)\n",
    "[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/brainpy/brainpy/blob/master/docs/tutorial_advanced/operator_custom_with_cupy.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This functionality is only available for ``brainpylib>=0.3.1``."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## English Version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"Although we can now use the flexible taichi custom operator approach, taichi on cuda does not have more fine-grained control or optimization for some scenarios. So for such scenarios, we can use cupy's \n", | ||
"- `RawModule`(https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-kernels)\n", | ||
"- `jit.rawkernel`(https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-kernel-definition) \n", | ||
"\n", | ||
"to compile and run CUDA native code directly as strings or cupy JIT function in real time for finer grained control.\n", | ||
"\n", | ||
"Start by importing the relevant Python package." | ||
] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import brainpy.math as bm\n",
    "\n",
    "import jax\n",
    "import cupy as cp\n",
    "from cupyx import jit\n",
    "\n",
    "bm.set_platform('gpu')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"### CuPy RawModule\n", | ||
"\n", | ||
"For dealing a large raw CUDA source or loading an existing CUDA binary, the RawModule class can be more handy. It can be initialized either by a CUDA source code. The needed kernels can then be retrieved by calling the get_function() method, which returns a RawKernel instance that can be invoked as discussed above.\n", | ||
"\n", | ||
"Be aware that the order of parameters in the kernel function you want to call should **keep outputs at the end of the parameter list**." | ||
] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source_code = r'''\n",
    "extern \"C\"{\n",
    "\n",
    "__global__ void kernel(const float* x1, const float* x2, unsigned int N, float* y)\n",
    "{\n",
    "    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;\n",
    "    if (tid < N)\n",
    "    {\n",
    "        y[tid] = x1[tid] + x2[tid];\n",
    "    }\n",
    "}\n",
    "}\n",
    "'''\n",
    "mod = cp.RawModule(code=source_code)\n",
    "kernel = mod.get_function('kernel')"
   ]
  },
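  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before registering it with BrainPy, the kernel can also be launched directly through CuPy's `RawKernel.__call__(grid, block, args)` interface. The snippet below is a minimal sketch using small illustrative sizes and CuPy arrays allocated just for this check:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# launch the raw kernel directly with CuPy arrays (independent of BrainPy)\n",
    "n = 16\n",
    "x1_cp = cp.ones(n, dtype=cp.float32)\n",
    "x2_cp = cp.ones(n, dtype=cp.float32)\n",
    "y_cp = cp.zeros(n, dtype=cp.float32)\n",
    "kernel((1,), (n,), (x1_cp, x2_cp, cp.uint32(n), y_cp))  # one block of n threads\n",
    "print(y_cp)  # expected: all 2.0"
   ]
  },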
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After defining the `RawModule` and getting the kernel function, you can register it as the `gpu_kernel` of a `bm.XLACustomOp` and call it with the appropriate `grid` and `block` sizes (**both of these parameters must be tuples**)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepare inputs\n",
    "N = 10\n",
    "x1 = bm.ones((N, N))\n",
    "x2 = bm.ones((N, N))\n",
    "\n",
    "# register the kernel as a custom op\n",
    "prim1 = bm.XLACustomOp(gpu_kernel=kernel)\n",
    "\n",
    "# call the custom op; the kernel treats the (N, N) arrays as flat\n",
    "# buffers, so the element count N**2 is passed as the size argument\n",
    "y = prim1(x1, x2, N**2, grid=(N,), block=(N,), outs=[jax.ShapeDtypeStruct((N, N), dtype=bm.float32)])[0]"
   ]
  },
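  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a sketch, assuming `bm.allclose` mirrors `numpy.allclose` as the rest of `brainpy.math` mirrors NumPy), the kernel performs an element-wise addition, so the output should equal `x1 + x2`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compare the custom op's result with the same computation in brainpy.math\n",
    "print(bm.allclose(y, x1 + x2))  # expected: True"
   ]
  },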
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CuPy JIT RawKernel\n",
    "\n",
    "The `cupyx.jit.rawkernel` decorator creates raw CUDA kernels from Python functions. In this section, a Python function wrapped with the decorator is called a target function.\n",
    "\n",
    "Launching a CUDA kernel with pre-determined grid/block sizes requires a basic understanding of the CUDA programming model. Compilation is deferred until the first function call: CuPy's JIT compiler infers the argument types at call time and caches the compiled kernels to speed up subsequent calls.\n",
    "\n",
    "Here is a short example of a `cupyx.jit.rawkernel` that copies the values from `x` to `y` using a grid-stride loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@jit.rawkernel()\n",
    "def elementwise_copy(x, size, y):\n",
    "    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x\n",
    "    ntid = jit.gridDim.x * jit.blockDim.x\n",
    "    # grid-stride loop: each thread copies every ntid-th element\n",
    "    for i in range(tid, size, ntid):\n",
    "        y[i] = x[i]"
   ]
  },
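  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `jit.rawkernel` can also be launched directly, in the same `(grid, block, args)` style as a `RawKernel`. The snippet below is a minimal sketch with illustrative sizes, using CuPy arrays allocated only for this check:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# launch the JIT kernel directly with CuPy arrays (independent of BrainPy)\n",
    "n = 64\n",
    "x_cp = cp.arange(n, dtype=cp.float32)\n",
    "y_cp = cp.zeros(n, dtype=cp.float32)\n",
    "elementwise_copy((2,), (32,), (x_cp, cp.uint32(n), y_cp))  # 2 blocks x 32 threads\n",
    "print(bool((x_cp == y_cp).all()))  # expected: True"
   ]
  },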
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Likewise, after defining the `jit.rawkernel`, you can register it as the `gpu_kernel` of a `bm.XLACustomOp` and call it with the appropriate `grid` and `block` sizes (**both of these parameters must be tuples**)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepare inputs\n",
    "size = 100\n",
    "x = bm.ones((size,))\n",
    "\n",
    "# register the kernel as a custom op\n",
    "prim2 = bm.XLACustomOp(gpu_kernel=elementwise_copy)\n",
    "\n",
    "# call the custom op with 10 blocks of 10 threads (10 * 10 = size)\n",
    "y = prim2(x, size, grid=(10,), block=(10,), outs=[jax.ShapeDtypeStruct((size,), dtype=bm.float32)])[0]"
   ]
  },
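  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And a final sanity check (again a sketch, assuming `bm.allclose` behaves like `numpy.allclose`): the grid-stride kernel simply copies `x`, so the output should match it exactly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the copy kernel should reproduce x exactly\n",
    "print(bm.allclose(y, x))  # expected: True"
   ]
  }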
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}