General-Purpose Compute on GPUs

This section covers the very basics of using a GPU for general-purpose computations (GPGPU).

Comparison with CPUs

The traditional way to learn programming is by writing single-threaded code that runs on a CPU, which is the easiest to understand and debug. Multi-threaded programs introduce task parallelism: for example, having one thread handle heavy computation separately from the thread handling the user interface. You'll also encounter data parallelism when exploring algorithms in parallel computing: a problem (e.g. summing a large list of numbers) is divided into chunks that are computed independently by executing the same instructions on different subsets of data elements. Libraries like OpenMP and TBB make it easier for programmers to write task- or data-parallel code, and you can also manually issue special instructions on the CPU for vector math (e.g. AVX). CPUs also employ something called instruction-level parallelism that can extract parallelism even from programs that are written for sequential execution!
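
To make the idea of data parallelism a bit more concrete, here is a minimal C++ sketch (not tied to any particular library; the function name and thread count are placeholders) that sums a large list by splitting it into chunks, with every thread running the same code on its own slice:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Data parallelism on a CPU: divide the input into chunks, run the same
// code on each chunk in its own thread, then combine the partial results.
double parallel_sum(const std::vector<double>& data, unsigned num_threads)
{
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end   = std::min(data.size(), begin + chunk);
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    // Reduce the per-thread partial sums into the final answer.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

An OpenMP version of the same idea would replace all of the explicit chunking with a single parallel-for reduction pragma over the loop.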

We can see that CPUs are powerful not just at sequential execution, but also at various forms of parallel execution. You may be wondering how a GPU is different. As a broad generalization, CPUs are designed to minimize latency: it's often desirable that tasks are completed as quickly as possible, and context switching must be efficient. Operating systems have dozens or hundreds of processes running simultaneously and users expect responsiveness. GPUs, on the other hand, are designed to maximize throughput: rendering images involves computing billions of independent pixels. The time it takes to render an individual pixel is less important, and users care about the rate at which massive quantities of pixels can be pushed to the display (i.e. frames). GPUs are highly specialized for data parallelism and throughput, but this comes at the expense of latency.

These generalizations of CPUs and GPUs are useful when categorizing the workloads that are best suited to a particular device, but they don't provide a full picture. CPUs are being designed with an increasingly large number of cores (AMD Ryzen Threadripper has up to 64!). GPUs, likewise, have gained a more convenient programming model, one that is no longer awkwardly shoehorned into the traditional graphics-rendering pipeline. There have been attempts to realize a convergence of CPU and GPU architectures (see Larrabee), but they have largely been unsuccessful. If anything, it seems that CPUs are trending more toward GPU architectures than the other way around.

The important thing to keep in mind is that CPUs and GPUs are different and should be programmed as such: some algorithms are best kept on the CPU, and others work extremely well on the GPU. If you want to fill a cup with water then the CPU is like a kitchen faucet: you turn the handle, water flows instantly, and your cup is filled in a second or two. A GPU is like filling a cup of water by standing at the base of a large dam and signaling an engineer above to open the floodgates: it will take several seconds for the water to reach you, but it will be a lot of water when it comes. Picking the right device for the right problem is important: GPUs are ideal when dealing with embarrassingly parallel problems involving lots of data and minimal control flow.

Throughput and Latency Hiding

We learned that GPUs excel at high throughput, but we haven't yet discussed just how large the difference between CPUs and GPUs is. We'll go deeper into performance metrics in another section, but as a concrete example an Intel Core i7-8650U has a theoretical limit of ~550 GFLOPS (single-precision) and launched with a recommended price of ~$400. An NVIDIA GTX 1080 has a theoretical limit of ~8300 GFLOPS (single-precision) and launched with a recommended price of ~$600. Both of these devices came out around the same time (2016-2017). We can see that the GPU can theoretically push over 15x more computation in this example. This gap has only grown (considerably) in the years since, and an NVIDIA RTX 3090 has ~36,000 GFLOPS when boosted!
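
As a rough sanity check on where these peak numbers come from (the clock and FLOPs-per-cycle figures below are approximate), the theoretical limit is essentially cores × clock × single-precision FLOPs per core per cycle:

$$\text{peak GFLOPS} \approx \text{cores} \times \text{clock (GHz)} \times \text{FLOPs per core per cycle}$$

$$\text{i7-8650U: } 4 \times 4.2 \times 32 \approx 540 \qquad\qquad \text{GTX 1080: } 2560 \times 1.6 \times 2 \approx 8200$$

where the CPU's 32 comes from two AVX2 FMA units per core (8 single-precision lanes × 2 FLOPs per FMA) and the GPU's 2 comes from one FMA per CUDA core per cycle.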

To be fair, theoretical limits like the numbers above are rarely achievable in real-world applications. Still, we need to understand why this disparity between CPUs and GPUs is so large: what makes GPUs excel at high-throughput scenarios? Ultimately, the difference boils down to a large number of processing cores and latency hiding.

Care and feeding of GPU cores: CPU cores are general-purpose, can handle sophisticated control flow, and have higher clock speeds. GPU cores are simple, specialized (e.g. emphasize floating-point math), and have lower clock speeds. Running a serial program on a GPU is going to be much slower than running it on a CPU. You cannot harness the full potential of a GPU without keeping its many cores busy, and the only way to do this is with a large enough problem that can be broken down and distributed across the GPU. This also means you'll need to feed the cores with large amounts of data, so GPU memory systems are designed to prioritize bandwidth over latency to keep data flowing into the cores.
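
To make "breaking the problem down and distributing it across the GPU" concrete, here is a minimal CUDA sketch (CUDA rather than DirectCompute purely for brevity; the device pointers d_x and d_y are assumed to be already allocated and filled). Each element gets its own GPU thread, so a large array produces far more threads than there are physical cores, which is exactly what keeps those cores busy:

```cuda
#include <cuda_runtime.h>

// y = a*x + y over a large array. One GPU thread per element: with n in
// the millions, the launch deliberately creates many more threads than
// there are physical cores so the hardware always has work available.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may only be partially full
        y[i] = a * x[i] + y[i];
}

void launch_saxpy(int n, float a, const float* d_x, float* d_y)
{
    const int blockSize = 256;                              // threads per block
    const int gridSize  = (n + blockSize - 1) / blockSize;  // enough blocks to cover n
    saxpy<<<gridSize, blockSize>>>(n, a, d_x, d_y);
}
```

The same structure in DirectCompute would be a compute shader marked with [numthreads(256, 1, 1)] and a Dispatch call sized to cover the array.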

Hiding latency: the ideal situation for a GPU is that memory is accessed linearly so that data can be streamed in and ready in time for any calculation that needs it; using my earlier simile of a GPU being like a large dam, if the water is already flowing then latency isn't an issue. In practice, however, it's difficult to ensure all access is done linearly, especially in generic compute scenarios. Caches only help if memory requests are localized (and caches are generally smaller on a GPU relative to a CPU), so inevitably some work will stall and cores will go idle. The trick that GPUs employ to hide this latency is swapping work out from a core when it's stalled and putting another chunk of work on the core. This technique again relies on a large enough problem to ensure there is sufficient work to fill in when other work is stalled waiting for memory.

In summary: GPUs rely on large workloads that can be split into many parallel chunks, and they use a high core count and latency hiding to crunch through this work with exceptional throughput.

Evolution of Compute Shaders

GPUs used to be highly specialized and limited to tasks involved in rendering images. The design of the hardware itself reflected this purpose, as graphics cards had silicon dedicated to specific stages in the rendering process. The graphics pipeline and hardware evolved together, with some stages becoming programmable and other stages remaining fixed in hardware.

  • In the beginning, programmers wrote code for specific GPUs. This approach obviously doesn't scale well and is difficult to support, so APIs like Glide, Direct3D, and OpenGL emerged.
  • Graphics APIs were initially fixed-function programming models: if you wanted to transform geometry, for example, you called an API to leverage some dedicated hardware unit.
  • Later, GPUs started to expose some limited form of programmability through shaders. Shaders were initially written in assembly, but higher-level languages like HLSL, Cg, and GLSL made it easier. GPUs still had hardware dedicated to specific stages: vertex shaders and pixel shaders mapped directly to distinct hardware units.
  • Graphics programmers were demanding more control over the hardware. Some developers (even outside of graphics) started leveraging vertex & pixel shaders for GPGPU, but it was an awkward and restrictive programming model.
  • Eventually, vertex and pixel shaders were consolidated into fully programmable hardware units (the unified shader model). This is where compute shaders came into existence: programmers could now write generic computations that aren't so tightly coupled to the graphics pipeline.

This (extremely brief) history lesson is intended to help contextualize the design and quirks of GPU compute interfaces like DirectX and CUDA. It is difficult to understand GPU programming without also understanding some aspects of the graphics pipeline and how its design influenced (and continues to influence) hardware architectures. Compute shaders have an odd name if you don't understand this history, and you still have access to graphics-centric functionality (e.g. textures and samplers) in compute programs!

GPU Programming APIs

There are multiple programming APIs to write compute programs for GPUs.

  • CUDA is NVIDIA's API and effectively made GPGPU mainstream. CUDA is cross-platform (Windows, Linux, Mac OS X) but only runs on NVIDIA GPUs.
  • OpenCL is an open standard general-purpose compute API and arrived after CUDA. Each hardware vendor can provide an implementation of OpenCL, which makes it cross-platform and cross-vendor, though not all vendor implementations are equivalent in functionality. SYCL introduced a single-source programming model that makes it more similar to writing CUDA programs.
  • DirectCompute is Microsoft's compute technology, but it's not a standalone API; it's really more of an extension of Direct3D and HLSL. You cannot write compute shaders without also writing D3D11 or D3D12 code. As a component of DirectX it is cross-vendor, but it only runs on Windows and WSL. DirectX is mostly associated with graphics and gaming, so one unique attribute of DirectCompute relative to CUDA or OpenCL is the tight integration with the graphics pipeline. It is a popular choice for game developers.
  • Apple has compute shaders in its Metal API (a D3D12/Vulkan-level API).
  • Intel recently introduced oneAPI.

I'm sure there are many more than the ones I've listed above. We'll be focused on DirectCompute/DirectX in these docs.

Learning Resources

  • [Beginner]: All the Pipelines - Journey through the GPU. A good intro video that covers modern GPU pipelines and is targeted at programmers with no GPU programming experience.

  • [Beginner]: Intro to Parallel Programming. YouTube playlist from a Udacity course. Uses CUDA to introduce GPU programming but the same concepts carry over to any other API like DX.

  • [Advanced]: A Trip Through the Graphics Pipeline (Fabian Giesen). An excellent and deep exploration of the graphics pipeline ~2011. A bit dated but still highly relevant. May be overwhelming if you're just learning GPU programming, but this offers a rare glimpse into some aspects of the graphics pipeline that are simply not discussed or written about elsewhere.