
Releases: necst/grcuda

GrCUDA 0.4.1 - July 2023

19 Jul 07:56
ca826df

Miscellaneous

  • Bumped the Graal and mx versions
  • Added Java implementation of the benchmark suite:
    • Ported the multi-GPU benchmarks from the Python suite to Java.
    • The class Benchmark.java provides a template for future use cases.
    • Created configuration files to easily adapt the benchmarks to different types of workloads.
    • The suite is built as a Maven project: running mvn test executes all the benchmarks according to the configuration file for the target GPU architecture.
    • In the default configuration, all benchmarks are executed, every input size is tested, and every scheduling policy is run.
  • Refactored DeviceSelectionPolicy in GrCUDAStreamPolicy (see the sketch after this list):
    • Each policy type now lives in a separate class
    • Kept only the retrieveImpl method
    • TransferTimeDeviceSelectionPolicy now extends DeviceSelectionPolicy
    • Deleted previously commented-out methods and cleaned up the code
    • Added a license header to each file
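
As a rough illustration of the refactored hierarchy, each selection policy is now a standalone class that extends DeviceSelectionPolicy and overrides only retrieveImpl. The sketch below is illustrative only: the Device and Computation stand-in types and the exact retrieveImpl signature are assumptions, not the actual GrCUDA classes.

```java
import java.util.List;

// Minimal stand-ins for GrCUDA's device and computation types (illustrative only).
class Device { final int id; Device(int id) { this.id = id; } }
class Computation { }

// After the refactoring, each policy is its own class and implements only retrieveImpl.
abstract class DeviceSelectionPolicy {
    abstract Device retrieveImpl(Computation computation, List<Device> devices);
}

// Example: a round-robin policy, the strategy also used to initialize other policies.
class RoundRobinDeviceSelectionPolicy extends DeviceSelectionPolicy {
    private int next = 0;

    @Override
    Device retrieveImpl(Computation computation, List<Device> devices) {
        // Rotate the scheduling across the available GPUs.
        return devices.get(next++ % devices.size());
    }
}
```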

GrCUDA 0.4.0 (multi-GPU support) - June 2022

29 Jun 15:23
e224629

New features

  • Enabled support for multiple GPUs in the asynchronous scheduler:
    • Added the GrCUDADeviceManager component that encapsulates the status of the multi-GPU system. It tracks the currently active GPUs, the streams and the currently active computations associated with each GPU, and what data is up-to-date on each device.
    • Added the GrCUDAStreamPolicy component that encapsulates new scheduling heuristics to select the best device for each new computation (CUDA streams are uniquely associated with a GPU), using information such as data locality and the current load of the device. We currently support 5 scheduling heuristics with increasing complexity:
      • ROUND_ROBIN: simply rotate the scheduling between GPUs. Used as initialization strategy of other policies;
      • STREAM_AWARE: assign the computation to the device with the fewest busy streams, i.e. the device with the fewest ongoing computations;
      • MIN_TRANSFER_SIZE: select the device that requires the least amount of bytes to be transferred, maximizing data locality;
      • MINMIN_TRANSFER_TIME: select the device whose minimum total transfer time would be the smallest;
      • MINMAX_TRANSFER_TIME: select the device whose maximum total transfer time would be the smallest.
    • Modified the GrCUDAStreamManager component to select the stream with heuristics provided by the policy manager.
    • Extended the CUDARuntime component with APIs for selecting and managing multiple GPUs.
    • Added the possibility to export the computation DAG produced by a given policy. If the ExportDAG startup option is enabled, the graph is exported in .dot format, right before the context's cleanup, to the path specified as the option's argument (see the sketch after this list).
    • Added support for Graal 22.1 and CUDA 11.7.
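
A hedged sketch of how these features might be enabled from a JVM host language. Only grcuda.ExecutionPolicy=async is spelled out in these notes; grcuda.DeviceSelectionPolicy (with the min-transfer-size value) and grcuda.ExportDAG are guesses at the full option keys for the device-selection heuristics and the DAG export described above, so check GrCUDAOptions for the real names.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class MultiGpuExample {
    public static void main(String[] args) {
        try (Context ctx = Context.newBuilder()
                .allowExperimentalOptions(true)                               // same as --experimental-options
                .option("grcuda.ExecutionPolicy", "async")                    // asynchronous scheduler
                .option("grcuda.DeviceSelectionPolicy", "min-transfer-size")  // hypothetical key and value
                .option("grcuda.ExportDAG", "/tmp/computation_dag.dot")       // hypothetical key
                .build()) {
            // Allocate a device array through GrCUDA; the stream policy picks the GPU
            // for each computation scheduled on it.
            Value deviceArray = ctx.eval("grcuda", "DeviceArray").execute("float", 1000);
            deviceArray.setArrayElement(0, 42.0f);
            System.out.println(deviceArray.getArrayElement(0));
        }
    }
}
```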

GrCUDA MultiGPU Pre-release

15 Apr 20:31
Pre-release

New features

  • Enabled support for multiple GPUs in the asynchronous scheduler:
    • Added the GrCUDADeviceManager component that encapsulates the status of the multi-GPU system. It tracks the currently active GPUs, the streams and the currently active computations associated with each GPU, and what data is up-to-date on each device.
    • Added the GrCUDAStreamPolicy component that encapsulates new scheduling heuristics to select the best device for each new computation (CUDA streams are uniquely associated with a GPU), using information such as data locality and the current load of the device. We currently support 5 scheduling heuristics with increasing complexity:
      • ROUND_ROBIN: simply rotate the scheduling between GPUs. Used as initialization strategy of other policies;
      • STREAM_AWARE: assign the computation to the device with the fewest busy streams, i.e. the device with the fewest ongoing computations;
      • MIN_TRANSFER_SIZE: select the device that requires the least amount of bytes to be transferred, maximizing data locality;
      • MINMIN_TRANSFER_TIME: select the device whose minimum total transfer time would be the smallest;
      • MINMAX_TRANSFER_TIME: select the device whose maximum total transfer time would be the smallest.
    • Modified the GrCUDAStreamManager component to select the stream with heuristics provided by the policy manager.
    • Extended the CUDARuntime component with APIs for selecting and managing multiple GPUs.

GrCUDA 0.3.0 - December 2021

03 Jan 11:00
a3e7bfe

New features

  • Enabled support for cuBLAS and cuML in the asynchronous scheduler

    • Stream management is now supported for both cuML and cuBLAS
    • This feature can be applied to any library by extending the LibrarySetStreamFunction class
  • Enabled support for cuSPARSE

    • Added support for CSR and COO SpMV, and for gemvi.
    • Known limitation: Tgemvi works only with single-precision floating-point arithmetic.
  • Added support for precise kernel timing, for debugging and complex scheduling policies

    • Associated a CUDA event with the start of each computation, to measure the elapsed time from start to end
    • Added the ElapsedTime function to compute the elapsed time between events, i.e. the total execution time
    • Logging of kernel timers is controlled by the grcuda.TimeComputation option (false by default)
    • Implemented via the ProfilableElement class, which stores timing values in a hash table and supports future business logic
    • Updated the README with documentation for the new TimeComputation option
  • Added a read-only polyglot map to retrieve GrCUDA options. Retrieve it with getoptions. Option names and values are provided as strings. Find the full list of options in GrCUDAOptions (see the sketch after this list).

  • Enabled the usage of TruffleLoggers for logging the execution of GrCUDA code

    • GrCUDA has different types of loggers, each one with its own functionality
    • Implemented the GrCUDALogger class to provide access to the loggers of interest when specific features are needed
    • Replaced all prints in the source code with log events at different logging levels
    • Added documentation about logging in docs
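
A minimal sketch of how the new options could be used from a JVM host. The grcuda.TimeComputation option name comes from the notes above; how getoptions is exposed (callable vs. plain map) and how the map surfaces its entries are assumptions, so adapt this to the actual API in GrCUDAOptions.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class OptionsExample {
    public static void main(String[] args) {
        try (Context ctx = Context.newBuilder()
                .allowExperimentalOptions(true)
                .option("grcuda.ExecutionPolicy", "async")
                .option("grcuda.TimeComputation", "true")  // log kernel timers (off by default)
                .build()) {
            // getoptions exposes a read-only polyglot map; names and values are strings.
            Value getoptions = ctx.eval("grcuda", "getoptions");
            Value options = getoptions.canExecute() ? getoptions.execute() : getoptions;
            // Assuming member-style access, read back the current value of an option:
            if (options.hasMember("grcuda.TimeComputation")) {
                System.out.println(options.getMember("grcuda.TimeComputation"));
            }
        }
    }
}
```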

Miscellaneous

  • Removed deprecation warning for Truffle's ArityException.

  • Set TensorRT support to experimental

    • TensorRT is currently not supported on CUDA 11.4, making it impossible to use along with a recent version of cuML
    • Known limitation: due to this incompatibility, TensorRT is currently not available on the async scheduler

GrCUDA 0.2.1 - October 2021

29 Sep 07:15
9c9a357

Minor fixes:

  • Fixed path in installation script
  • Fixed creation of result directory in Python benchmark suite
  • Fixed Makefile for CUDA benchmarks

GrCUDA 0.2.0 - October 2021

23 Sep 14:09
d63678d

API Changes

  • Added option to specify arguments in NFI kernel signatures as const
    • The effect is the same as marking them as in in the NIDL syntax
    • It is not strictly required to have the corresponding arguments in the CUDA kernel marked as const, although that's recommended
    • Marking arguments as const or in enables the async scheduler to overlap kernels that use the same read-only arguments (see the sketch below)
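
A hedged sketch of declaring a read-only kernel argument from a JVM host. buildkernel and the configured-launch syntax are existing GrCUDA features, but the exact spelling of the const NFI signature below is an approximation of the syntax documented in the README, and the grid/block sizes are arbitrary.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class ConstKernelExample {
    public static void main(String[] args) {
        String code =
            "__global__ void saxpy(const float *x, float *y, float a, int n) {\n" +
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n" +
            "    if (i < n) y[i] = a * x[i] + y[i];\n" +
            "}";
        try (Context ctx = Context.newBuilder().allowExperimentalOptions(true).build()) {
            // Marking x as const tells the async scheduler that the argument is read-only,
            // so kernels sharing it can overlap (the CUDA-side const is recommended, not required).
            Value saxpy = ctx.eval("grcuda", "buildkernel")
                             .execute(code, "saxpy", "const pointer, pointer, float, sint32");
            Value newArray = ctx.eval("grcuda", "DeviceArray");
            int n = 1000;
            Value x = newArray.execute("float", n);
            Value y = newArray.execute("float", n);
            // Configure the launch grid (8 blocks of 128 threads), then pass the arguments.
            saxpy.execute(8, 128).execute(x, y, 2.0f, n);
        }
    }
}
```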

New asynchronous scheduler

  • Added a new asynchronous scheduler for GrCUDA; enable it with --experimental-options --grcuda.ExecutionPolicy=async (see the sketch after this list)

    • With this scheduler, GPU kernels are executed asynchronously. Once they are launched, the host execution resumes immediately
    • The computation is synchronized (i.e. the host thread is stalled and waits for the kernel to finish) only once GPU data are accessed by the host thread
    • Execution of multiple kernels (operating on different data, e.g. distinct DeviceArrays) is overlapped using different streams
    • Data transfer and execution (on different data, e.g. distinct DeviceArrays) is overlapped using different streams
    • The scheduler supports different options, see README.md for the full list
    • It is the scheduler presented in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
  • Enabled partial support for cuBLAS and cuML in the async scheduler

    • Known limitation: functions in these libraries work with the async scheduler, although they still run on the default stream (i.e. they are not asynchronous)
    • They do benefit from prefetching
  • Set TensorRT support to experimental

    • TensorRT is currently not supported on CUDA 11.4, making it impossible to use along with a recent version of cuML
    • Known limitation: due to this incompatibility, TensorRT is currently not available on the async scheduler
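
A minimal sketch of the asynchronous behavior from a JVM host, assuming the buildkernel and DeviceArray built-ins shown above. The grcuda.ExecutionPolicy=async option comes from this release; the kernel and sizes are purely illustrative.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class AsyncExample {
    public static void main(String[] args) {
        String code =
            "__global__ void incr(float *x, int n) {\n" +
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n" +
            "    if (i < n) x[i] += 1.0f;\n" +
            "}";
        try (Context ctx = Context.newBuilder()
                .allowExperimentalOptions(true)             // equivalent to --experimental-options
                .option("grcuda.ExecutionPolicy", "async")  // enable the new scheduler
                .build()) {
            Value incr = ctx.eval("grcuda", "buildkernel").execute(code, "incr", "pointer, sint32");
            Value newArray = ctx.eval("grcuda", "DeviceArray");
            int n = 1000;
            Value a = newArray.execute("float", n);
            Value b = newArray.execute("float", n);
            // The two launches touch distinct DeviceArrays, so the scheduler places them
            // on different streams and they can overlap; both calls return immediately.
            incr.execute(8, 128).execute(a, n);
            incr.execute(8, 128).execute(b, n);
            // Host access to the data synchronizes the computations that produced it.
            System.out.println(a.getArrayElement(0));
        }
    }
}
```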

New features

  • Added a generic AbstractArray data structure, extended by DeviceArray, MultiDimDeviceArray, and MultiDimDeviceArrayView; it provides high-level array interfaces
  • Added API for prefetching
    • If enabled (and using a GPU with the Pascal architecture or newer), it prefetches data to the GPU before executing a kernel, instead of relying on page faults for data transfers. It can greatly improve performance
  • Added API for stream attachment
    • Always enabled on GPUs with architecture older than Pascal when the async scheduler is active; with the sync scheduler, it can be enabled manually
    • It restricts the visibility of GPU data to the specified stream
    • On the Pascal architecture or newer, it can provide a small performance benefit
  • Added copyTo/copyFrom functions on generic arrays (Truffle interoperable objects that expose the array API); see the sketch after this list
    • Internally, the copy is implemented as a for loop, instead of using CUDA's memcpy
    • In many cases it is still faster than copying with loops in the host language, especially if the host code is not JIT-compiled
    • It is also used for copying data to/from DeviceArrays with column-major layout, as memcpy cannot copy non-contiguous data
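
A hedged sketch of copying host data into a DeviceArray from a JVM host. The (source, number of elements) argument order of copyFrom is an assumption based on the description above, and the ProxyArray wrapper is just one way to pass a Truffle-interoperable array.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;
import org.graalvm.polyglot.proxy.ProxyArray;

public class CopyExample {
    public static void main(String[] args) {
        try (Context ctx = Context.newBuilder().build()) {
            int n = 4;
            Value x = ctx.eval("grcuda", "DeviceArray").execute("float", n);
            // Any object exposing the array API can be the source; here a ProxyArray
            // wraps a plain Java array. The copy itself is a loop, not a CUDA memcpy.
            Object[] host = {1.0, 2.0, 3.0, 4.0};
            x.invokeMember("copyFrom", ProxyArray.fromArray(host), n);
            System.out.println(x.getArrayElement(2));  // expected: 3.0
        }
    }
}
```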

Demos, benchmarks and code samples

  • Added demo used at SeptembeRSE 2021 (demos/image_pipeline_local and demos/image_pipeline_web)
    • It shows an image processing pipeline that applies a retro look to images. We have a local version and a web version that displays results in a web page
  • Added benchmark suite written in Graalpython, used in "DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime" (IPDPS 2021)
    • It is a collection of complex multi-kernel benchmarks meant to show the benefits of asynchronous scheduling.

Miscellaneous

  • Added a dependency on the grcuda-data submodule, used to store data, results and plots used in publications and demos.
  • Updated name "grCUDA" to "GrCUDA". It looks better, doesn't it?
  • Added support for Java 11 along with Java 8
  • Added option to specify the location of cuBLAS and cuML with environment variables (LIBCUBLAS_DIR and LIBCUML_DIR)
  • Refactored package hierarchy to reflect changes to current GrCUDA (e.g. gpu -> runtime)
  • Added basic support for TruffleLogger
  • Removed a number of existing deprecation warnings
  • Added around 800 unit tests, with support for extensive parametrized testing and GPU mocking
  • Updated documentation
    • Bumped GraalVM version to 21.2
    • Added scripts to setup a new machine from scratch (e.g. on OCI), plus other OCI-specific utility scripts (see oci_setup/)
    • Added documentation to setup IntelliJ Idea for GrCUDA development
    • Added documentation about Python benchmark suite
    • Added documentation on asynchronous scheduler options