David Sorber edited this page Aug 4, 2021 · 16 revisions

GNU Radio Accelerator Device Support Project

Description

GNU Radio provides a flexible block-based interface for signal processing tasks. Historically, GNU Radio signal processing blocks have been written in software, but there is an increasing need to offload complex signal processing algorithms to accelerator devices including GPUs, FPGAs, and DSPs. Many accelerated blocks have been created using GNU Radio's block interface, but these blocks require manual handling of data movement to and from the accelerator device. The purpose of this project is to add accelerator device support directly to GNU Radio.

Project Goals

  • Maintain backwards compatibility with all existing blocks (both in-tree and from OOT modules)
  • Create a flexible interface for creating "custom buffers" to support accelerated devices
    • Custom buffer interface provides necessary hooks to allow the scheduler to handle data movement
  • Provide infrastructure to support "insert signal processing here" paradigm for common accelerator devices such as NVidia GPUs

High-level Plan

  • Milestone 1 - completed: May 11, 2021
    • refactor existing code and create single mapped buffer abstraction
    • support single accelerated block (block responsible for data movement)
    • simple custom buffer interface
  • Milestone 2 - completed: August 5, 2021
    • support multiple accelerated blocks with zero-copy between
    • more flexible custom buffer interface (scheduler handles data movement)

Overview and Usage

[Figure: the "double copy" problem]

GNU Radio's block-based interface is very flexible and has allowed users to create their own accelerated blocks for some time. However, this approach has limitations. In particular, if the accelerator device requires special (DMA) buffers for data movement, the accelerator block must copy data from the GNU Radio buffer into the device's buffer on the input path and from the device's buffer back into the GNU Radio buffer on the output path. This inefficiency is known as the "double copy" problem and is shown in the diagram above. In addition, accelerated blocks written in this fashion require the author to manage data movement explicitly. While this is doable, it may be challenging for a novice and off-putting for a user who wishes to concentrate on implementing a signal processing algorithm. The accelerated block interface changes address both of these issues while, importantly, maintaining backwards compatibility with all existing GNU Radio blocks.
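The difference between the two data paths can be sketched in plain C++. This is a minimal illustration, not code from the project: `fake_device`, `work_double_copy`, and `work_custom_buffer` are hypothetical names, and the "device" is ordinary host memory so the sketch runs without any accelerator.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// HYPOTHETICAL stand-in for an accelerator with its own DMA buffers.
// Everything here is ordinary host memory so the sketch runs anywhere.
struct fake_device {
    std::vector<float> dma_in;
    std::vector<float> dma_out;
    std::size_t copies = 0; // number of staging copies performed by the block

    explicit fake_device(std::size_t n) : dma_in(n), dma_out(n) {}

    // Stand-in "kernel": scale every sample by two.
    void run(const float* in, float* out, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];
    }
};

// Double copy: the block stages data between the runtime's buffers and the
// device's DMA buffers on both the input and the output path.
// Precondition: n does not exceed the size of the DMA buffers.
void work_double_copy(fake_device& dev, const float* in, float* out, std::size_t n)
{
    std::memcpy(dev.dma_in.data(), in, n * sizeof(float)); // copy 1: runtime -> DMA
    ++dev.copies;
    dev.run(dev.dma_in.data(), dev.dma_out.data(), n);
    std::memcpy(out, dev.dma_out.data(), n * sizeof(float)); // copy 2: DMA -> runtime
    ++dev.copies;
}

// Custom buffer: the pointers handed to the block already refer to memory the
// device can use, so the block performs no staging copies of its own.
void work_custom_buffer(fake_device& dev, const float* in, float* out, std::size_t n)
{
    dev.run(in, out, n);
}
```

Both functions produce the same output; only the first one increments the copy counter, which is the overhead the custom buffer interface is meant to remove from the block.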

Supporting Code

The accelerated block interface changes currently live in this repository; however, the intention is to upstream these changes into GNU Radio. The following repositories contain supporting code that is also intended to be upstreamed to the project, but not directly into the main GNU Radio repository itself (NOTE: both of the repositories below require the accelerated block interface changes, also called "ngsched"):

  • gr-cuda_buffer - This repository contains an OOT module containing the cuda_buffer class which is a "custom buffer" supporting the CUDA runtime for NVidia GPUs. This module is intended to be a base CUDA buffer implementation for CUDA blocks and can be used directly when writing CUDA accelerated blocks for NVidia GPUs.
  • gr-blnxngsched - This repository contains an OOT module containing various examples of the accelerated block interface (aka "ngsched") changes. These blocks are described in additional detail in the "Examples" section below. Note that the CUDA-related blocks in this OOT require cuda_buffer from gr-cuda-buffer.

Examples

  • custom_buffer - A simple example of how to create a custom buffer class. It uses normal host buffers and does not require any specific accelerator hardware.
  • custom_buf_loopback - A loopback block that uses the custom_buffer class from above. It shows a very simple example of how to use a custom buffer object defined within an OOT module.
  • cuda_fanout - (NOTE: requires CUDA and gr-cuda-buffer) A simple CUDA-based fanout block that utilizes block history.
  • cuda_loopback - (NOTE: requires CUDA and gr-cuda-buffer) A CUDA-based loopback block that uses the cuda_buffer class from gr-cuda-buffer.
  • cuda_mult - (NOTE: requires CUDA and gr-cuda-buffer) A CUDA-based complex multiplication block. This block has two inputs and one output.
  • mixed_2_port_loopback - (NOTE: requires CUDA and gr-cuda-buffer) A loopback block that combines a CUDA-based loopback with a simple host loopback. This block has two inputs and two outputs. One input/output pair uses cuda_buffer while the other uses default GNU Radio host buffers.

How to Use a Custom Buffer

The following instructions illustrate how to write a block using a "custom buffer". These instructions use cuda_buffer from gr-cuda-buffer for example purposes but the same general concepts can be applied to any custom buffer class.

  1. If the custom buffer class exists in a separate OOT module, install that OOT module in the same path prefix as the OOT module containing the block which will use the buffer class. For example, to use cuda_buffer, the gr-cuda-buffer OOT module must be installed in the same prefix.
  2. In the implementation source file include the appropriate header file for the buffer class. For example, in new_block_impl.cc:
#include <cuda_buffer/cuda_buffer.h>
  3. Next, within the block's constructor, update the gr::io_signature to include the desired buffer's type. For example:
new_block_impl::new_block_impl()
    : gr::block("my_new_block",
                gr::io_signature::make(1 /* min inputs */, 1 /* max inputs */, 
                                       sizeof(input_type), cuda_buffer::type),
                gr::io_signature::make(1 /* min outputs */, 1 /*max outputs */, 
                                       sizeof(output_type), cuda_buffer::type))
{
    . . . 
}
  4. Finally, the pointers passed to the block's work function will now be of the selected type. For the cuda_buffer class used in this example, the pointers passed to the work function are not host accessible and should not be dereferenced on the host. Instead, the pointers should be passed to a kernel invocation. Note that the pointer usage restrictions depend on the buffer class being used.
int new_block_impl::general_work(int noutput_items,
                                 gr_vector_int& ninput_items,
                                 gr_vector_const_void_star& input_items,
                                 gr_vector_void_star& output_items)
{
    // NOTE: in and out are *not* host accessible 
    const auto in = reinterpret_cast<const input_type*>(input_items[0]);
    auto out = reinterpret_cast<output_type*>(output_items[0]);

    // Launch kernel passing in and out as parameters
    . . .
}
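The shape of such a work function can be sketched as a self-contained, host-only example. This is only an illustration under stated assumptions: the three vector aliases are declared locally to mirror the GNU Radio typedefs used above, `loopback_work` is a free function rather than a block method, and `launch_loopback` is a hypothetical stand-in for a kernel launch. With cuda_buffer the pointers would refer to device memory, so the host `memcpy` used here would have to be replaced by a kernel or a device-to-device copy.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Local aliases mirroring the GNU Radio vector typedefs, declared here only
// so the sketch compiles without GNU Radio installed.
using gr_vector_int = std::vector<int>;
using gr_vector_const_void_star = std::vector<const void*>;
using gr_vector_void_star = std::vector<void*>;

using input_type = float;
using output_type = float;

// HYPOTHETICAL stand-in for a kernel launch. A real CUDA block would launch a
// kernel (or issue a device-to-device copy) on these pointers; a host memcpy
// is used here purely so the example can run without a GPU.
void launch_loopback(const input_type* in, output_type* out, std::size_t n)
{
    std::memcpy(out, in, n * sizeof(input_type));
}

// Loopback-style work body: cast the opaque pointers, forward them to the
// "kernel" without dereferencing them here, and report the items produced.
int loopback_work(int noutput_items,
                  gr_vector_int& ninput_items,
                  gr_vector_const_void_star& input_items,
                  gr_vector_void_star& output_items)
{
    (void)ninput_items; // unused in this one-in, one-out sketch

    const auto in = reinterpret_cast<const input_type*>(input_items[0]);
    auto out = reinterpret_cast<output_type*>(output_items[0]);

    launch_loopback(in, out, static_cast<std::size_t>(noutput_items));
    return noutput_items;
}
```

In an actual gr::block, general_work would also report how many input items were used (for example via consume_each) before returning the number of output items produced.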

Detailed Design

This section contains detailed design information for the changes introduced in the accelerated block interface.

Single Mapped Buffer Abstraction

The ability of the GNU Radio runtime to directly manipulate device-specific (aka "custom") buffers is a key aspect of the accelerated block interface changes. Integrating device-specific buffers into the runtime required several changes, the most significant of which is the refinement of the runtime buffer class and the creation of the single mapped buffer abstraction.

A GNU Radio flow graph contains a series of blocks where each pair of blocks is connected by a buffer. The "upstream" block writes (produces) to the buffer while the "downstream" block reads (consumes) from the buffer.
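The producer/consumer relationship can be modelled with a few counters. The sketch below is a toy model, not GNU Radio's buffer class: `toy_buffer` and its members are hypothetical names, it tracks item counts only (no sample storage), and it assumes one writer with one or more readers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model of a buffer shared by one writer and one or more readers.
// It keeps absolute item counters only; no sample data is stored.
class toy_buffer
{
public:
    // nreaders must be at least 1.
    toy_buffer(std::size_t capacity, std::size_t nreaders)
        : d_capacity(capacity), d_read(nreaders, 0)
    {
    }

    // Items the writer may produce without overwriting data that the
    // slowest reader has not consumed yet.
    std::size_t space_available() const
    {
        const std::size_t slowest = *std::min_element(d_read.begin(), d_read.end());
        return d_capacity - (d_written - slowest);
    }

    // Items written but not yet consumed by the given reader.
    std::size_t items_available(std::size_t reader) const
    {
        return d_written - d_read[reader];
    }

    // Called by the upstream block after writing n items.
    void produce(std::size_t n) { d_written += n; }

    // Called by a downstream block after reading n items.
    void consume(std::size_t reader, std::size_t n) { d_read[reader] += n; }

private:
    std::size_t d_capacity;
    std::size_t d_written = 0;
    std::vector<std::size_t> d_read; // one counter per reader
};
```

The point of the model is that free space is governed by the slowest reader: the writer cannot reuse a slot until every attached reader has consumed it.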

[Figure: buffer]

A GNU Radio buffer may have only one writer but multiple readers. The buffer class provides an abstraction on top of the underlying buffer itself. Likewise, the buffer_reader class provides an abstraction for each reader; one or more buffer_reader instances are attached to each buffer instance. The underlying buffers, which the buffer class wraps, are elegantly implemented circular buffers. The vmcircbuf class provides the circular buffer interface, for which several implementations exist. Fundamentally, the vmcircbuf interface relies on virtual memory "double mapping" to provide the illusion of an automatically wrapping memory buffer (see https://www.gnuradio.org/blog/2017-01-05-buffers/ for additional details). This approach works very well for host buffers, where the virtual memory mapping to the underlying physical memory can be manipulated, but does not work well for device-specific buffers, where the virtual memory mapping cannot be manipulated.
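To see what double mapping saves, consider a write into a circular buffer that is mapped only once. The sketch below is illustrative and is not the project's implementation: `ring_write` is a hypothetical helper operating on a plain host vector. A write that crosses the end of the buffer has to be split into two copies, whereas with a double-mapped buffer the same write is a single contiguous copy because the second mapping follows the first in virtual memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Write n bytes from src into a single-mapped ring, starting at offset wr.
// Preconditions: the ring is not empty, wr < ring.size(), and
// n <= ring.size(). Returns the new write offset.
std::size_t ring_write(std::vector<char>& ring, std::size_t wr, const char* src, std::size_t n)
{
    const std::size_t size = ring.size();

    // Part that fits between the write offset and the end of the ring.
    const std::size_t first = std::min(n, size - wr);
    std::memcpy(ring.data() + wr, src, first);

    // Remainder (possibly zero bytes) wraps around to the start of the ring.
    std::memcpy(ring.data(), src + first, n - first);

    return (wr + n) % size;
}
```

For example, writing four bytes at offset 6 of an 8-byte ring places two bytes at the end and two at the start, and the returned write offset is 2. A buffer abstraction for device memory has to account for this kind of wrap explicitly, since the mapping trick is not available there.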

Buffer Type

Replace Upstream

Callback Functions

Custom Lock Interface

Data movement (post work)

host_buffer Class
