Broadcom Videocore IV Performance Recommendations

Yours3lf edited this page Jun 18, 2020 · 5 revisions

Profiling using hardware counters

Profiling can be done using standard Vulkan performance queries. See the query.cpp example.

Coordinate shaders

The Broadcom Videocore IV GPU has a hidden shader stage called the Coordinate Shader stage. This stage computes only the final vertex positions, so the GPU does not process the remaining vertex attributes for vertices that would be culled or clipped anyway. It is therefore advised to supply vertex positions in their own buffer, so that the Coordinate Shader stage can achieve high cache efficiency. The rest of the vertex attributes should be interleaved in a second, separate buffer.
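The buffer layout described above can be sketched on the CPU side as follows. This is only an illustration: the `Vertex` and `VertexAttributes` layouts and the `splitVertexStreams` helper are hypothetical names made up for this example, not part of the driver or of Vulkan. It splits an interleaved vertex array into a tightly packed position stream and an interleaved stream holding everything else; each stream would then be uploaded to its own vertex buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interleaved source layout, for illustration only.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};

// Everything except the position, kept interleaved.
struct VertexAttributes {
    float normal[3];
    float uv[2];
};

struct SplitStreams {
    std::vector<float> positions;              // x, y, z per vertex, tightly packed
    std::vector<VertexAttributes> attributes;  // remaining attributes, interleaved
};

// Splits one interleaved vertex array into a position-only stream
// and a second stream with the remaining attributes.
SplitStreams splitVertexStreams(const std::vector<Vertex>& vertices) {
    SplitStreams out;
    out.positions.reserve(vertices.size() * 3);
    out.attributes.reserve(vertices.size());
    for (const Vertex& v : vertices) {
        out.positions.insert(out.positions.end(), v.position, v.position + 3);
        VertexAttributes a = {{v.normal[0], v.normal[1], v.normal[2]},
                              {v.uv[0], v.uv[1]}};
        out.attributes.push_back(a);
    }
    // Both streams must describe the same number of vertices.
    assert(out.positions.size() == out.attributes.size() * 3);
    return out;
}
```

With this split, the position stream would be bound as one vertex buffer binding and the attribute stream as another, so a position-only pass never has to touch normals or texture coordinates.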

Index buffers

Indexing can be used to make sure vertices are not processed redundantly. An index buffer optimizer library such as meshoptimizer should be used to make sure index buffers achieve maximum cache efficiency. See https://github.com/zeux/meshoptimizer
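To illustrate what indexing buys, here is a minimal sketch that turns a non-indexed triangle list into unique vertices plus a 16-bit index buffer. The `Position` alias, `IndexedMesh` struct and `buildIndexBuffer` function are hypothetical names for this example. Note that this only removes duplicate vertices; it does not reorder indices for cache efficiency, which is the part meshoptimizer is recommended for.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical position-only vertex, for illustration only.
using Position = std::array<float, 3>;

struct IndexedMesh {
    std::vector<Position> vertices;      // unique vertices
    std::vector<std::uint16_t> indices;  // one index per input vertex
};

// Deduplicates exactly equal vertices and emits an index per input vertex.
// Assumes no NaN components (they would break the map's ordering).
IndexedMesh buildIndexBuffer(const std::vector<Position>& triangleList) {
    IndexedMesh mesh;
    std::map<Position, std::uint16_t> seen;
    for (const Position& p : triangleList) {
        auto it = seen.find(p);
        if (it == seen.end()) {
            // 16-bit indices can address at most 65536 unique vertices.
            assert(mesh.vertices.size() < 65536);
            const std::uint16_t index =
                static_cast<std::uint16_t>(mesh.vertices.size());
            it = seen.emplace(p, index).first;
            mesh.vertices.push_back(p);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

For a quad made of two triangles, six input vertices collapse to four unique ones, so the two shared corners are stored once instead of twice.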

Vertex buffers

Choosing lower-precision vertex attributes (8-bit, 16-bit) can save significant bandwidth, so choose the lowest precision that suits your meshes. Triangles that cover very few pixels (think fewer than 32) are rasterized very inefficiently, so make sure your triangles cover a large enough screen area.
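As a rough sketch of what lower-precision attributes look like on the CPU side, the helpers below quantize floats to 16-bit signed-normalized and 8-bit unsigned-normalized values. The function and struct names are hypothetical, and the example assumes that the driver exposes matching normalized vertex formats, which should be checked before relying on them.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Maps [-1, 1] to a signed 16-bit normalized value (e.g. for normals).
std::int16_t quantizeSnorm16(float v) {
    v = std::max(-1.0f, std::min(1.0f, v));  // clamp out-of-range input
    return static_cast<std::int16_t>(std::lround(v * 32767.0f));
}

// Maps [0, 1] to an unsigned 8-bit normalized value (e.g. for colors).
std::uint8_t quantizeUnorm8(float v) {
    v = std::max(0.0f, std::min(1.0f, v));   // clamp out-of-range input
    return static_cast<std::uint8_t>(std::lround(v * 255.0f));
}

// Hypothetical attribute layouts, for size comparison only.
struct FullAttributes {
    float normal[3];
    float color[4];
};

struct PackedAttributes {
    std::int16_t normal[3];
    std::uint8_t color[4];
};

// The packed layout takes fewer bytes per vertex than the float layout.
static_assert(sizeof(PackedAttributes) < sizeof(FullAttributes),
              "packed attributes should be smaller");
```

Whether the lost precision is acceptable depends on the mesh, so it is worth comparing the rendered result against the full-precision version.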

Tile based architecture

The Broadcom Videocore IV GPU is a tile-based GPU, but not a deferred one. It is therefore important to sort your geometry front-to-back to avoid unnecessary overdraw. Face culling can also help eliminate unnecessary rasterization costs.
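Front-to-back submission can be sketched as a simple sort on view-space depth before recording draw calls. The `DrawItem` struct and `sortFrontToBack` function are hypothetical names for this example, and the sketch assumes that each draw already has a representative depth (for instance the distance of its bounding-volume center from the camera) and that it is opaque geometry.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-draw record, for illustration only.
struct DrawItem {
    int meshId;
    float viewDepth;  // distance from the camera; smaller means closer
};

// Orders draws so that the nearest geometry is submitted first.
// stable_sort keeps the original order for draws at equal depth.
void sortFrontToBack(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.viewDepth < b.viewDepth;
                     });
}
```

Drawing in this order lets the depth test reject fragments of farther geometry before they are shaded.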

ALU architecture

The Broadcom Videocore IV GPU has a dual-issue scalar FP32 ALU. This means that it can execute up to two instructions per cycle using its ADD and MUL ALUs. To maximize utilization, it's important to keep both the ADD and MUL pipelines saturated.
For maximum performance, it's advised to exploit the fact that additional calculations can run in parallel with the main ALU pipeline, e.g. special functions and texture sampling using signaling bits.
The shader processors on the GPU have two hardware threads that can be used to hide texture fetch latency. This unfortunately halves the available register space, because each of the two threads needs its own registers. Threading is not mandatory, however, so shaders with high register usage can opt to use synchronous texture fetching instead.
Small immediates can be used to encode constants directly in the assembly code, which avoids loading a uniform from the uniform FIFO.
Pack and unpack flags can be used to convert between 8-, 16- and 32-bit data. These flags don't introduce additional cycles.
The GPU ALU can set flag bits, which can then be used to conditionally store ALU operation results. Conditional MOVs are a great way to eliminate the need for branch instructions.

Resolution

The Broadcom Videocore IV GPU is not really suited for 1080p resolution, so it's advisable to run at 720p to make sure the GPU is not overwhelmed with fragment work. A 1280x720 target has 921,600 pixels versus 2,073,600 at 1920x1080, i.e. 2.25 times fewer pixels to shade per full-screen pass. This leads to a more balanced Vertex/Fragment workload and also a more balanced CPU/GPU workload.

Clears

Use render pass Load/Store operations to clear your textures (e.g. VK_ATTACHMENT_LOAD_OP_CLEAR). Any other method will likely result in a full-screen quad being drawn to clear parts or all of a texture. Failing to clear render target contents at the start of a render pass, or failing to discard render target contents that are no longer needed at its end (e.g. with VK_ATTACHMENT_STORE_OP_DONT_CARE), wastes bandwidth.

Push constants

Vulkan's push constants map directly to the uniform FIFO buffer on the Broadcom Videocore IV. It's therefore advised to use push constants whenever possible; otherwise, uniform fetches require expensive generic buffer fetch operations.

Additional documentation

Additional documentation for the Broadcom Videocore IV GPU can be found here: https://docs.broadcom.com/docs/12358545