GrCUDA 0.2.0 - October 2021
AlbertoParravicini released this 23 Sep 14:09
API Changes
- Added option to specify arguments in NFI kernel signatures as `const`
  - The effect is the same as marking them as `in` in the NIDL syntax
  - It is not strictly required to have the corresponding arguments in the CUDA kernel marked as `const`, although that's recommended
  - Marking arguments as `const` or `in` enables the async scheduler to overlap kernels that use the same read-only arguments
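
To illustrate why read-only arguments matter for overlapping, here is a minimal sketch of a dependency check between two kernel launches. It is a toy model, not GrCUDA's implementation: the `KernelCall` class, the `must_serialize` function, and the kernel and array names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KernelCall:
    """Hypothetical record of one kernel launch (not a GrCUDA class)."""
    name: str
    # Array arguments marked as const / in in the signature.
    reads: frozenset = field(default_factory=frozenset)
    # All other array arguments, assumed to be possibly modified.
    writes: frozenset = field(default_factory=frozenset)


def must_serialize(first: KernelCall, second: KernelCall) -> bool:
    """Return True if `second` has to wait for `first` to finish."""
    first_all = first.reads | first.writes
    second_all = second.reads | second.writes
    # A conflict exists only if one kernel writes an array the other touches.
    return bool(first.writes & second_all) or bool(second.writes & first_all)


# Both kernels only read "x", so they are free to overlap.
k1 = KernelCall("square", reads=frozenset({"x"}), writes=frozenset({"y"}))
k2 = KernelCall("sum", reads=frozenset({"x"}), writes=frozenset({"z"}))
# Without const, "x" must be assumed to be modified, which forces serialization.
k3 = KernelCall("scale", writes=frozenset({"x", "w"}))

can_overlap_const = not must_serialize(k1, k2)
can_overlap_plain = not must_serialize(k1, k3)
```

In this model, two launches that share an array can still run concurrently as long as neither of them writes it, which is the situation that `const` and `in` make visible to a scheduler.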
New asynchronous scheduler
- Added a new asynchronous scheduler for GrCUDA; enable it with `--experimental-options --grcuda.ExecutionPolicy=async`
  - With this scheduler, GPU kernels are executed asynchronously. Once they are launched, host execution resumes immediately
  - The computation is synchronized (i.e. the host thread is stalled and waits for the kernel to finish) only once GPU data are accessed by the host thread
  - Execution of multiple kernels (operating on different data, e.g. distinct DeviceArrays) is overlapped using different streams
  - Data transfer and execution (on different data, e.g. distinct DeviceArrays) are overlapped using different streams
  - The scheduler supports different options, see `README.md` for the full list
  - It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
- Enabled partial support for cuBLAS and cuML in the async scheduler
  - Known limitation: functions in these libraries work with the async scheduler, although they still run on the default stream (i.e. they are not asynchronous)
  - They do benefit from prefetching
- Set TensorRT support to experimental
  - TensorRT is currently not supported on CUDA 11.4, making it impossible to use alongside a recent version of cuML
  - Known limitation: due to this incompatibility, TensorRT is currently not available on the async scheduler
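
The idea of overlapping kernels that operate on distinct arrays can be sketched with a simple stream-assignment rule. This is only an illustrative model under simplifying assumptions: the `assign_streams` function and the kernel and array names are hypothetical, every array is treated as read-write, and the actual scheduler (described in the IPDPS 2021 paper) is more elaborate.

```python
from itertools import count


def assign_streams(kernels):
    """Map each kernel name to a stream id.

    `kernels` is a list of (name, arrays) pairs in launch order, where
    `arrays` is the set of array names the kernel touches.
    """
    next_stream = count()
    last_stream_for_array = {}  # array name -> stream of the last kernel using it
    assignment = {}
    for name, arrays in kernels:
        # Streams of earlier kernels that touched any of the same arrays.
        parents = [last_stream_for_array[a]
                   for a in sorted(arrays) if a in last_stream_for_array]
        # Reuse a parent's stream if there is a dependency, else open a new one.
        # (A real scheduler would also wait on events from the other parents.)
        stream = parents[0] if parents else next(next_stream)
        assignment[name] = stream
        for a in arrays:
            last_stream_for_array[a] = stream
    return assignment


launches = [
    ("k1", {"a"}),       # independent: new stream
    ("k2", {"b"}),       # independent of k1: another stream, can overlap
    ("k3", {"a", "b"}),  # depends on k1 and k2: reuses an existing stream
    ("k4", {"c"}),       # independent again: new stream
]
streams = assign_streams(launches)
```

Here `k1`, `k2` and `k4` end up on three different streams and may run concurrently, while `k3` is placed on the stream of one of the kernels it depends on.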
New features
- Added generic AbstractArray data structure, which is extended by DeviceArray, MultiDimDeviceArray, MultiDimDeviceArrayView, and provides high-level array interfaces
- Added API for prefetching
  - If enabled (and using a GPU with Pascal architecture or newer), it prefetches data to the GPU before executing a kernel, instead of relying on page faults for data transfer. It can greatly improve performance
- Added API for stream attachment
  - Always enabled on GPUs with architecture older than Pascal when the async scheduler is active. With the sync scheduler, it can be manually enabled
  - It restricts the visibility of GPU data to the specified stream
  - On Pascal architectures or newer it can provide a small performance benefit
- Added `copyTo/copyFrom` functions on generic arrays (Truffle interoperable objects that expose the array API)
  - Internally, the copy is implemented as a for loop, instead of using CUDA's `memcpy`
  - It is still faster than copying using loops in the host languages, in many cases, and especially if host code is not JIT-ted
  - It is also used for copying data to/from DeviceArrays with column-major layout, as `memcpy` cannot copy non-contiguous data
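
As an illustration of why an element-wise loop is needed for column-major layouts, the sketch below copies a flat row-major buffer into a flat column-major one. It is plain Python with a hypothetical `copy_to_column_major` helper and does not reflect how GrCUDA implements `copyTo/copyFrom` internally; it only shows that consecutive source elements land at non-contiguous destination offsets, which a single contiguous `memcpy` cannot express.

```python
def copy_to_column_major(src_row_major, rows, cols):
    """Copy a flat row-major buffer into a new flat column-major buffer."""
    dst = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Element (r, c) sits at r * cols + c in row-major order
            # and at c * rows + r in column-major order.
            dst[c * rows + r] = src_row_major[r * cols + c]
    return dst


# A 2 x 3 matrix stored row by row.
matrix = [1, 2, 3,
          4, 5, 6]
column_major = copy_to_column_major(matrix, rows=2, cols=3)
# column_major is now [1, 4, 2, 5, 3, 6]
```

The first source row `1, 2, 3` ends up at destination offsets 0, 2 and 4, so each element has to be placed individually.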
Demos, benchmarks and code samples
- Added demo used at SeptembeRSE 2021 (`demos/image_pipeline_local` and `demos/image_pipeline_web`)
  - It shows an image processing pipeline that applies a retro look to images. We have a local version and a web version that displays results in a web page
- Added benchmark suite written in Graalpython, used in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
  - It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling.
Miscellaneous
- Added dependency to `grcuda-data` submodule, used to store data, results and plots used in publications and demos.
- Updated name "grCUDA" to "GrCUDA". It looks better, doesn't it?
- Added support for Java 11 along with Java 8
- Added option to specify the location of cuBLAS and cuML with environment variables (`LIBCUBLAS_DIR` and `LIBCUML_DIR`)
- Refactored package hierarchy to reflect changes to current GrCUDA (e.g. `gpu -> runtime`)
) - Added basic support for TruffleLogger
- Removed a number of existing deprecation warnings
- Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
- Updated documentation
- Bumped GraalVM version to 21.2
- Added scripts to set up a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts (see `oci_setup/`)
- Added documentation to set up IntelliJ IDEA for GrCUDA development
- Added documentation about Python benchmark suite
- Added documentation on asynchronous scheduler options