Kernel launcher #39
Maybe something like this could work:
#:mute
#:def comma(a,b)
${a}$, ${b}$
#:enddef
#:def unpack(*pos)
#:if len(pos) > 1
#:set res = comma(pos[0],unpack(*pos[1:]))
$:res
#:elif len(pos) == 1
$:pos[0]
#:endif
#:enddef
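#:def launch1d(name,kernel,params,args)
subroutine launch_${name}$(numBlocks, blockDim, ${args}$)
  use omp_lib
  type(dim3), value :: numBlocks, blockDim
  ! vvvv params vvvv
  $:params
  ! ^^^^^^^^^^^^^^^
  type(dim3) :: blockIdx, threadIdx
  integer :: bx, tx
  do bx = 1, numBlocks%x
    blockIdx = dim3(bx)
    ! tx and threadIdx must be private to each thread.
    ! NOTE: variables declared in params and written by the kernel (such as i) are still shared here.
    !$omp parallel num_threads(blockDim%x) private(tx, threadIdx)
    ! omp_get_thread_num() is 0-based, whereas CUDA Fortran's threadIdx is 1-based
    tx = omp_get_thread_num() + 1
    threadIdx = dim3(tx)
    ! vvv kernel part vvv
    $:kernel
    ! ^^^^^^^^^^^^^^^^^^^
    !$omp end parallel
  end do
end subroutine
#:enddef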
#:def launch(numBlocks,tpb,name,*args)
#:set arg_list = unpack(*args)
call launch_${name}$ (dim3(${numBlocks}$), dim3(${tpb}$), &
${arg_list}$)
#:enddef
#:endmute
! --- program starts here ---
program main
  implicit none
  integer, parameter :: n = 100
  real :: a, x(n), y(n)
  type :: dim3
    integer :: x, y=1, z=1
  end type
  type(dim3) :: threadsPerBlock, numBlocks
  a = 1.0
  threadsPerBlock = dim3( 64 )
  numBlocks = dim3( n / 64 + 1 )
  ! Kernel definitions are found in the contains section
  ! (argument order matches the launcher: numBlocks first, then blockDim)
  call launch_axpy(numBlocks, threadsPerBlock, &
                   n, a, x, y)
  ! Macro to help launching
  @:launch( n/64 + 1, 64, axpy, n, a, x, y)
contains
  #:block launch1d(name="axpy")
  #:contains args
  n, a, x, y
  #:contains params
  integer, value :: n
  real, value :: a
  real, intent(in) :: x(n)
  real, intent(inout) :: y(n)
  integer :: i
  #:contains kernel
  i = blockDim%x*(blockIdx%x - 1) + threadIdx%x
  ! 1-based index, so the last valid element is i == n
  if (i <= n) then
    y(i) = a*x(i) + y(i)
  end if
  #:endblock
end program
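The index arithmetic of the emulated launch can be checked in isolation. The following is only a sketch in Python (the language Fypp uses for its expressions), assuming 1-based block and thread indices as in CUDA Fortran; `launch_1d` is a hypothetical helper, not part of Fypp or of the code above. It suggests that `n/64 + 1` blocks of 64 threads over-cover the range, and that a guard of `i <= n` visits each element exactly once.

```python
def launch_1d(n: int, threads_per_block: int):
    """Visit every (block, thread) pair and collect the guarded global indices."""
    num_blocks = n // threads_per_block + 1  # same formula as n/64 + 1 in the post
    visited = []
    for block_idx in range(1, num_blocks + 1):              # 1-based, like blockIdx%x
        for thread_idx in range(1, threads_per_block + 1):  # 1-based, like threadIdx%x
            i = threads_per_block * (block_idx - 1) + thread_idx
            if i <= n:  # guard against the padded tail of the last block
                visited.append(i)
    return num_blocks, visited


num_blocks, visited = launch_1d(n=100, threads_per_block=64)
print(num_blocks)                      # 2
print(visited == list(range(1, 101)))  # True
```

With a 0-based thread number (what `omp_get_thread_num()` returns) the same formula would start at `i = 0`, which is why the offset matters in the OpenMP emulation.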
Another possibility would be to follow a more SYCL-like approach:
#:if defined('OMP_KERNEL')
#:def parallel_for(name,kernel,params,args)
! --- OMP Kernel ---
subroutine ${name}$(${args}$)
  $:params
  !$omp parallel do
  do i = 1, n
    $:kernel
  end do
  !$omp end parallel do
end subroutine
! ------------------
#:enddef
#:else
#:def parallel_for(name,kernel,params,args)
! --- CUDA Kernel ---
subroutine ${name}$(${args}$)
  $:params
  call ${name}$_d<<<n/64 + 1,64>>>(${args}$)
end subroutine
attributes(global) subroutine ${name}$_d(${args}$)
  $:params
  i = blockDim%x*(blockIdx%x - 1) + threadIdx%x
  if (i <= n) then
    $:kernel
  end if
end subroutine
! -------------------
#:enddef
#:endif
#:block parallel_for(name="axpy")
#:contains args
n, a, x, y
#:contains params
integer, intent(in) :: n
real, intent(in) :: a, x(n)
real, intent(inout) :: y(n)
integer :: i
#:contains kernel
y(i) = a*x(i) + y(i)
#:endblock
Depending on the definition, this generates:
! --- OMP Kernel ---
subroutine axpy( n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  !$omp parallel do
  do i = 1, n
    y(i) = a*x(i) + y(i)
  end do
  !$omp end parallel do
end subroutine
! ------------------
or
! --- CUDA Kernel ---
subroutine axpy( n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  call axpy_d<<<n/64 + 1,64>>>( n, a, x, y)
end subroutine
attributes(global) subroutine axpy_d( n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  i = blockDim%x*(blockIdx%x - 1) + threadIdx%x
  if (i <= n) then
    y(i) = a*x(i) + y(i)
  end if
end subroutine
! -------------------
I think it has some potential. :)
@ivan-pi That's a very nice idea, thanks for posting! Actually, you can reduce the redundancy somewhat if you extract the argument list from the argument definition block. If one defines each argument on its own line and specifies all properties (apart from the name) as attributes, this is easily possible (see below):
#:mute
#! Extracts the list of the arguments from a block of argument definitions
#:def argument_list(argdefs)
#:set arglines = argdefs.split("\n")
#:set args = [argline.split("::")[-1].strip() for argline in arglines]
$:", ".join(args)
#:enddef
#:if defined('OMP_KERNEL')
#:def parallel_for(name, kernel, arguments)
! --- OMP Kernel ---
subroutine ${name}$(${argument_list(arguments)}$)
  $:arguments
  !$omp parallel do
  do i = 1, n
    $:kernel
  end do
  !$omp end parallel do
end subroutine
! ------------------
#:enddef
#:else
#:def parallel_for(name, kernel, arguments)
! --- CUDA Kernel ---
subroutine ${name}$(${argument_list(arguments)}$)
  $:arguments
  call ${name}$_d<<<n/64 + 1,64>>>(${argument_list(arguments)}$)
end subroutine
attributes(global) subroutine ${name}$_d(${argument_list(arguments)}$)
  $:arguments
  i = blockDim%x*(blockIdx%x - 1) + threadIdx%x
  if (i <= n) then
    $:kernel
  end if
end subroutine
! -------------------
#:enddef
#:endif
#:endmute
#:block parallel_for(name="axpy")
#:contains arguments
#! put one argument per line
#! use '::' to separate names from attributes
#! define every attribute before the '::'
#! (e.g. use 'real, dimension(n) :: x' instead of 'real :: x(n)')
integer, intent(in) :: n
real, intent(in) :: a
real, dimension(n), intent(in) :: x
real, dimension(n), intent(inout) :: y
integer :: i
#:contains kernel
y(i) = a*x(i) + y(i)
#:endblock
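Because Fypp evaluates Python expressions, the extraction rule can be tried out on its own. The sketch below mirrors the `argument_list` macro as a plain Python function. `dummy_argument_list` is a hypothetical stricter variant (not from the thread) that assumes every dummy argument carries an `intent(...)` attribute. As written, the original rule also seems to pick up local declarations such as `integer :: i`, and it leaks the shape into the list if a declaration uses `x(n)` after the `::`.

```python
def argument_list(argdefs: str) -> str:
    """Mirror of the Fypp macro: take the text after the last '::' on each line."""
    arglines = argdefs.split("\n")
    args = [argline.split("::")[-1].strip() for argline in arglines]
    return ", ".join(args)


def dummy_argument_list(argdefs: str) -> str:
    """Hypothetical variant: keep only lines whose attributes contain an intent."""
    names = []
    for line in argdefs.split("\n"):
        attrs, sep, name = line.rpartition("::")
        # Lines without '::' (e.g. blank lines) or without intent(...) are skipped.
        if sep and "intent(" in attrs.replace(" ", "").lower():
            names.append(name.strip())
    return ", ".join(names)


ARGDEFS = "\n".join([
    "integer, intent(in) :: n",
    "real, intent(in) :: a",
    "real, dimension(n), intent(in) :: x",
    "real, dimension(n), intent(inout) :: y",
    "integer :: i",
])

print(argument_list(ARGDEFS))                     # n, a, x, y, i
print(dummy_argument_list(ARGDEFS))               # n, a, x, y
print(argument_list("real, intent(in) :: x(n)"))  # x(n)
```

Whether filtering on `intent` is acceptable depends on the code base. Declaring the loop index inside the macro rather than in the argument block would be another way to keep `i` out of the argument list.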
Excellent! Thanks @aradi. I was missing a rule to help extract dimensions out of the declarations. Ideally, the range of the kernel (as in the 1d-3d (or higher) index space) should be configurable and countable, so that we can generate nested OpenMP loops with a collapse clause, or a different block division in the CUDA kernel. The basic version should be functional, and then
When it comes to memory allocation, I was thinking it would be cleanest to have a
It would be interesting to see how far one can get with this approach. My idea was roughly to follow what SYCL has: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_parallel_for_invoke
Probably some implicit rules might be needed, e.g.
and I used an explicit dimension argument
I'd rather suggest avoiding implicit rules, if possible. I think, by using
#:set kernel = "! Some kernel"
#:set ndims = 3
block
  integer :: ${", ".join([f"i{idim}" for idim in range(1, ndims + 1)])}$
  #:for idim in range(1, ndims + 1)
  do i${idim}$ = 1, dims(${idim}$)
  #:endfor
  $:kernel
  #:for _ in range(ndims)
  end do
  #:endfor
end block
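To see what the template above expands to for a given `ndims`, the same generation logic can be sketched as a plain Python function. `nested_loops` and its `collapse` flag are hypothetical names introduced here for illustration. The optional `collapse` clause is an assumption based on the nested-OpenMP-loop idea mentioned earlier in the thread, and `dims` is assumed to be an integer array available at the call site.

```python
def nested_loops(kernel: str, ndims: int, collapse: bool = False) -> str:
    """Return Fortran source for an ndims-deep loop nest around `kernel`."""
    indices = [f"i{idim}" for idim in range(1, ndims + 1)]
    lines = ["block", f"  integer :: {', '.join(indices)}"]
    if collapse:
        # Assumption: an explicit number of dimensions lets us emit a collapse clause.
        lines.append(f"  !$omp parallel do collapse({ndims})")
    for depth, idx in enumerate(indices, start=1):
        lines.append("  " * depth + f"do {idx} = 1, dims({depth})")
    lines.append("  " * (ndims + 1) + kernel)
    for depth in range(ndims, 0, -1):
        lines.append("  " * depth + "end do")
    if collapse:
        lines.append("  !$omp end parallel do")
    lines.append("end block")
    return "\n".join(lines)


print(nested_loops("! Some kernel", ndims=3, collapse=True))
```

Keeping the number of dimensions as an explicit argument, rather than inferring it from the declarations, is what makes the loop depth and the `collapse` count easy to generate.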
A colleague of mine has used the variadic templates of C++ to mimic a kernel launcher:
This is kind of like the CUDA triple chevron
I suppose it's possible to do something similar with Fypp, Fortran and OpenMP/OpenACC/CUDA. I came up with the following solution, but it lacks encapsulation: