GNU Radio provides a flexible block-based interface for signal processing tasks. Historically, GNU Radio signal processing blocks have been written in software, but there is an increasing need to offload complex signal processing algorithms to accelerator devices such as GPUs, FPGAs, and DSPs. Many accelerated blocks have been created using GNU Radio's block interface, but these blocks require manual handling of data movement to and from the accelerator device. The purpose of this project is to add accelerator device support directly to GNU Radio. Project goals:
- Maintain backwards compatibility with all existing blocks (both in-tree and from OOT modules)
- Create flexible interface for creating "custom buffers" to support accelerated devices
- Custom buffer interface provides necessary hooks to allow the scheduler to handle data movement
- Provide infrastructure to support "insert signal processing here" paradigm for common accelerator devices such as NVidia GPUs
Milestone 1 - completed: May 11, 2021
- refactor existing code and create single mapped buffer abstraction
- support single accelerated block (block responsible for data movement)
- simple custom buffer interface
Milestone 2 - completed: August 5, 2021
- support multiple accelerated blocks with zero-copy between
- more flexible custom buffer interface (scheduler handles data movement)
GNU Radio's block-based interface is very flexible and has allowed users to create their own accelerated blocks for some time. However, this approach has limitations. In particular, if the accelerator device requires special (DMA) buffers for data movement, the accelerated block must copy data from the GNU Radio buffer into the device's buffer on the input path, and vice versa on the output path. This inefficiency is known as the "double copy" problem, as shown in the diagram above. Furthermore, accelerated blocks written in this fashion require the writer to manage data movement explicitly. While this is doable, it may be challenging for a novice and off-putting for a user who wishes to concentrate on implementing a signal processing algorithm. The new accelerated block interface changes address both of these issues while (very importantly) maintaining backwards compatibility for all existing GNU Radio blocks.
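For illustration, here is a minimal sketch of what the double copy looks like in a conventional accelerated block's work function, assuming CUDA; the kernel name `my_kernel`, the device buffer members, and the launch geometry are hypothetical, not taken from any of the repositories discussed here:

```cpp
// Hypothetical work function of a legacy CUDA block (sketch only). in/out
// point at normal GNU Radio host buffers, so the block must stage data
// through its own device buffers d_dev_in / d_dev_out -- the "double copy".
int legacy_cuda_block_impl::work(int noutput_items,
                                 gr_vector_const_void_star& input_items,
                                 gr_vector_void_star& output_items)
{
    const auto in = static_cast<const float*>(input_items[0]);
    auto out = static_cast<float*>(output_items[0]);
    const size_t nbytes = noutput_items * sizeof(float);

    cudaMemcpy(d_dev_in, in, nbytes, cudaMemcpyHostToDevice);   // copy #1
    my_kernel<<<(noutput_items + 255) / 256, 256>>>(d_dev_in, d_dev_out,
                                                    noutput_items);
    cudaMemcpy(out, d_dev_out, nbytes, cudaMemcpyDeviceToHost); // copy #2

    return noutput_items;
}
```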
The accelerated block interface changes currently live in this repository; however, the intention is to upstream them into GNU Radio. The following repositories contain supporting code that is also intended to be upstreamed, but not directly into the main GNU Radio repository itself (NOTE: both repositories below require the accelerated block interface changes, also called "ngsched"):
- gr-cuda-buffer - This repository contains an OOT module with the `cuda_buffer` class, a "custom buffer" supporting the CUDA runtime for NVidia GPUs. This module is intended to be a base CUDA buffer implementation for CUDA blocks and can be used directly when writing CUDA accelerated blocks for NVidia GPUs.
- gr-blnxngsched - This repository contains an OOT module with various examples of the accelerated block interface (aka "ngsched") changes. These blocks are described in additional detail in the "Examples" section below. Note that the CUDA-related blocks in this OOT module require `cuda_buffer` from gr-cuda-buffer.
- `custom_buffer` - A buffer object that shows a simple example of creating a custom buffer object. It uses normal host buffers and does not require any specific accelerator hardware.
- `custom_buf_loopback` - A loopback block that uses the `custom_buffer` class from above. It shows a very simple example of how to use a custom buffer object defined within an OOT module.
- `cuda_fanout` - (NOTE: requires CUDA and gr-cuda-buffer) A simple CUDA-based fanout block that utilizes block history.
- `cuda_loopback` - (NOTE: requires CUDA and gr-cuda-buffer) A CUDA-based loopback block that uses the `cuda_buffer` class from gr-cuda-buffer.
- `cuda_mult` - (NOTE: requires CUDA and gr-cuda-buffer) A CUDA-based complex multiplication block. This block has two inputs and one output.
- `mixed_2_port_loopback` - (NOTE: requires CUDA and gr-cuda-buffer) A loopback block that combines a CUDA-based loopback with a simple host loopback. This block has two inputs and two outputs. One input/output pair uses `cuda_buffer` while the other uses default GNU Radio host buffers.
The following instructions illustrate how to write a block using a "custom buffer". These instructions use `cuda_buffer` from gr-cuda-buffer for example purposes, but the same general concepts can be applied to any custom buffer class.
- If the custom buffer class exists in a separate OOT module, install that OOT module in the same path prefix as the OOT module containing the block which will use the buffer class. For example, to use `cuda_buffer`, the gr-cuda-buffer OOT module must be installed in the same prefix.
- In the implementation source file, include the appropriate header file for the buffer class. For example, in `new_block_impl.cc`:
```cpp
#include <cuda_buffer/cuda_buffer.h>
```
- Next, within the block's constructor, update the `gr::io_signature` to include the desired buffer's type. For example:
```cpp
new_block_impl::new_block_impl()
    : gr::block("my_new_block",
                gr::io_signature::make(1 /* min inputs */, 1 /* max inputs */,
                                       sizeof(input_type), cuda_buffer::type),
                gr::io_signature::make(1 /* min outputs */, 1 /* max outputs */,
                                       sizeof(output_type), cuda_buffer::type))
{
    // . . .
}
```
- Finally, the pointers passed to the block's work function will now be of the selected type. For the `cuda_buffer` class used in this example, the pointers passed to the work function are not host accessible and should not be dereferenced on the host. Instead the pointers should be passed to a kernel invocation. Note that the pointer usage restrictions depend on the buffer class being used.
```cpp
int new_block_impl::general_work(int noutput_items,
                                 gr_vector_int& ninput_items,
                                 gr_vector_const_void_star& input_items,
                                 gr_vector_void_star& output_items)
{
    // NOTE: in and out are *not* host accessible
    const auto in = reinterpret_cast<const input_type*>(input_items[0]);
    auto out = reinterpret_cast<output_type*>(output_items[0]);

    // Launch kernel passing in and out as parameters
    // . . .
}
```
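For illustration, here is one hedged way the elided body above might be completed. The kernel name `my_kernel` and the launch geometry are hypothetical, and the synchronization strategy depends on the block's design:

```cpp
// Hypothetical completion of the elided body (sketch only). Because in and
// out already point at device memory owned by cuda_buffer, no cudaMemcpy
// staging is needed -- only the kernel launch itself.
const int n = std::min(noutput_items, ninput_items[0]);
my_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
cudaStreamSynchronize(0); // or whatever stream discipline the block uses

consume_each(n); // report items consumed on the inputs
return n;        // report items produced
```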
This section contains detailed design information for the changes introduced in the accelerated block interface.
The ability of the GNU Radio runtime to directly manipulate device-specific (aka "custom") buffers is a key aspect of the accelerated block interface changes. Integrating device-specific buffers into the runtime required several changes, the most significant of which are the refinement of the runtime `buffer` class and the creation of the single mapped buffer abstraction.
A GNU Radio flow graph contains a series of blocks in which each connected pair of blocks communicates through a buffer. The "upstream" block writes (produces) to the buffer while the "downstream" block reads (consumes) from the buffer.
A GNU Radio buffer may have only one writer but multiple readers. The `buffer` class provides an abstraction over top of the underlying buffer itself. Likewise, the `buffer_reader` class provides an abstraction for each reader; one or more `buffer_reader` instances are attached to each `buffer` instance. The underlying buffers, which the `buffer` class wraps, are very elegantly implemented circular buffers. The `vmcircbuf` class provides the circular buffer interface, although several implementations exist. Fundamentally, the `vmcircbuf` interface relies on virtual memory "double mapping" to provide the illusion of an automatically wrapping memory buffer (see https://www.gnuradio.org/blog/2017-01-05-buffers/ for additional details). This approach works very well for host buffers, where the virtual memory mapping to the underlying physical memory can be manipulated, but does not work well for device-specific buffers, where the virtual memory mapping cannot be manipulated.
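The double mapping trick can be demonstrated outside of GNU Radio. Below is a minimal, Linux-only sketch assuming `memfd_create` is available; it illustrates the technique but is not GNU Radio's actual `vmcircbuf` code, which supports several mechanisms (error handling omitted for brevity):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstring>

int main()
{
    const size_t size = sysconf(_SC_PAGESIZE); // mappings must be page aligned
    int fd = memfd_create("circbuf", 0);       // anonymous in-memory file
    ftruncate(fd, size);

    // Reserve 2*size of contiguous address space, then map the *same*
    // physical pages into both halves.
    char* base = (char*)mmap(nullptr, 2 * size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mmap(base, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

    // A write that runs past the end of the buffer wraps automatically:
    memcpy(base + size - 4, "WRAPDEMO", 8);
    assert(memcmp(base, "DEMO", 4) == 0); // last 4 bytes landed at the start

    munmap(base, 2 * size);
    close(fd);
    return 0;
}
```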
The single mapped buffer abstraction was created to provide a similar encapsulation for underlying buffers whose virtual memory mapping cannot be manipulated, which applies to most, if not all, device-specific buffers. One side effect of the single mapped buffer abstraction is that, unlike the traditional double mapped buffers, single mapped buffers require explicit management to handle wrapping from the end of the buffer back to the beginning. Two callback functions handle wrapping; they are described in detail in the Callback Functions section below.
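As a hedged illustration of that explicit management (names are illustrative, not the exact runtime interface), a single mapped buffer must perform its read/write index arithmetic modulo the buffer size instead of letting the double mapping hide the wrap:

```cpp
// Illustrative index arithmetic for a single mapped buffer of bufsize items.
// With a double mapped buffer a writer may simply run past the end; here
// the wrap back to the start must be computed explicitly.
static unsigned int index_add(unsigned int a, unsigned int b, unsigned int bufsize)
{
    unsigned int s = a + b;
    if (s >= bufsize)
        s -= bufsize; // wrap from the end of the buffer back to the beginning
    return s;
}
```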
Some refactoring was necessary to accommodate the new single mapped buffer abstraction alongside the existing double mapped buffer abstraction. The existing `buffer` class was refactored to be an abstract base class for underlying buffer wrapper abstractions. It provides the interface that those abstractions use to hook into the GNU Radio runtime. The `buffer_double_mapped` and `buffer_single_mapped` classes now derive from the `buffer` class and implement its interface (see the simplified class diagram below). The `buffer_double_mapped` class, as its name suggests, contains the double mapped buffer abstraction that was previously contained within the `buffer` class. The `buffer_single_mapped` class contains the new single mapped buffer abstraction that was added as part of the accelerated block interface changes.
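A hedged sketch of the resulting hierarchy is shown below; the member functions are illustrative placeholders, and the class diagram remains the authoritative reference:

```cpp
// Simplified, illustrative view of the refactored buffer hierarchy.
class buffer {
public:
    virtual ~buffer() = default;
    // Runtime-facing interface used by the scheduler (placeholder names):
    virtual void* write_pointer() = 0;
    virtual int space_available() = 0;
};

// Legacy behavior: vmcircbuf-backed, virtual-memory double mapped.
class buffer_double_mapped : public buffer { /* ... */ };

// New abstraction for buffers whose virtual memory mapping cannot be
// manipulated; itself abstract, implemented by host_buffer and, in
// separate OOT modules, cuda_buffer and hip_buffer.
class buffer_single_mapped : public buffer { /* ... */ };
```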
The `buffer_single_mapped` class is itself an abstract class that represents the interface for single mapped buffers. In the diagram above the interface functions are listed in the gray box. Functions that are pure virtual, that is, functions that must be overridden, are listed in bold. The remaining non-bold functions are all virtual, that is, they may be overridden if specific customization is necessary. Device-specific "custom buffers" must derive from the `buffer_single_mapped` class and implement the interface, which includes the functions listed in bold at a minimum. Note that the `host_buffer` class is an example implementation of the interface using host buffers. Additional information about it is available in the host_buffer Class section below. The `cuda_buffer` and `hip_buffer` classes also derive from the `buffer_single_mapped` class but they reside externally in separate OOT modules.
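Continuing the sketch above, a device-specific buffer might look roughly like the following; the overridden functions are placeholders standing in for the pure virtual interface, whose authoritative list appears in the diagram:

```cpp
// Illustrative skeleton of a device-specific custom buffer (sketch only).
class my_device_buffer : public buffer_single_mapped {
public:
    void* write_pointer() override { return d_base + d_write_index; }
    int space_available() override
    {
        // Computed with explicit modular arithmetic; no double mapping here.
        return 0; // placeholder
    }

private:
    char* d_base = nullptr;        // allocated via the device API (e.g. cudaMalloc)
    unsigned int d_write_index = 0;
};
```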
Additional minor refactoring of the `buffer_reader` class was also necessary to support the single mapped buffer abstraction. The `buffer_reader_sm` class ("sm" for single mapped) was created to customize the behavior of several `buffer_reader` functions when used with `buffer_single_mapped` derived classes. To support this, several functions in the `buffer_reader` class were marked virtual. The `buffer_reader_sm` class then derives from `buffer_reader` and overrides those virtual functions. The `buffer_reader` class was also slightly refactored to redirect most calls back to the corresponding `buffer` object where possible. This refactoring eliminated the need to create a specific `buffer_reader` derived class to accompany each `buffer_single_mapped` derived class. That is, it should be possible to use `buffer_reader_sm` with any `buffer_single_mapped` derived class. For example, the `host_buffer`, `cuda_buffer`, and `hip_buffer` classes all utilize `buffer_reader_sm` without requiring their own corresponding specific derived classes.
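A hedged, self-contained toy of the redirection pattern described above, separate from the sketches earlier (all names and signatures are illustrative):

```cpp
// Illustrative only: the reader redirects calls back to its attached buffer
// object, so buffer_reader_sm can serve any buffer_single_mapped derived
// class without a matching reader subclass per buffer type.
class buffer {
public:
    virtual ~buffer() = default;
    virtual int items_available(unsigned int read_index) const = 0;
};

class buffer_reader {
public:
    virtual ~buffer_reader() = default;
    virtual int items_available() const
    {
        return d_buffer->items_available(d_read_index); // redirect to the buffer
    }

protected:
    buffer* d_buffer = nullptr;  // the buffer this reader is attached to
    unsigned int d_read_index = 0;
};

class buffer_reader_sm : public buffer_reader {
public:
    int items_available() const override; // single mapped specific behavior
};
```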