Broadcom Videocore IV Performance Recommendations
Profiling can be done using standard Vulkan performance queries. See the query.cpp example.
The Broadcom Videocore IV GPU has a hidden shader stage called the Coordinate Shader stage. This stage computes only the final vertex positions, so the GPU does not process the remaining vertex attributes for vertices that would be culled or clipped anyway. It is therefore advisable to supply vertex positions in a separate buffer so that the Coordinate Shader stage can achieve high cache efficiency. The remaining vertex attributes should be interleaved in a second buffer.
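A minimal sketch of the buffer split described above, in plain C++ without any Vulkan calls. The `Vertex`, `Position`, `Attributes` and `SplitMesh` types and the `splitStreams` function are hypothetical names chosen for illustration, and the normal/UV layout is an assumption. Your own attribute set will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "everything interleaved" vertex, as many loaders produce it.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};

// Stream 0: positions only, tightly packed.
struct Position {
    float x, y, z;
};

// Stream 1: all remaining attributes, interleaved.
struct Attributes {
    float normal[3];
    float uv[2];
};

struct SplitMesh {
    std::vector<Position> positions;
    std::vector<Attributes> attributes;
};

// Split one interleaved vertex array into a position stream and an
// attribute stream. Element i of both streams describes the same vertex.
SplitMesh splitStreams(const std::vector<Vertex>& vertices) {
    SplitMesh out;
    out.positions.reserve(vertices.size());
    out.attributes.reserve(vertices.size());
    for (const Vertex& v : vertices) {
        out.positions.push_back({v.position[0], v.position[1], v.position[2]});
        Attributes a{{v.normal[0], v.normal[1], v.normal[2]}, {v.uv[0], v.uv[1]}};
        out.attributes.push_back(a);
    }
    assert(out.positions.size() == out.attributes.size());
    return out;
}
```

Each of the two vectors would then be uploaded to its own vertex buffer and bound at a separate vertex input binding.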
Indexing can be used to make sure vertices are not processed redundantly. An index buffer optimizer library such as meshoptimizer should be used to make sure index buffers achieve maximum cache efficiency. See https://github.com/zeux/meshoptimizer
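To illustrate what indexing buys, here is a small sketch that turns an unindexed triangle list into a vertex array plus an index buffer by merging bit-identical vertices. It is not meshoptimizer and does not perform any cache optimization; `Vec3`, `IndexedMesh` and `buildIndexed` are hypothetical names, and the 16-bit index type is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Vec3 = std::array<float, 3>;

struct IndexedMesh {
    std::vector<Vec3> vertices;          // unique vertices
    std::vector<std::uint16_t> indices;  // three indices per triangle
};

// Deduplicate an unindexed triangle list ("triangle soup").
// Vertices are merged only when they compare exactly equal, and the
// input is assumed to contain no NaN values.
IndexedMesh buildIndexed(const std::vector<Vec3>& triangleSoup) {
    assert(triangleSoup.size() % 3 == 0);
    IndexedMesh mesh;
    std::map<Vec3, std::uint16_t> seen;
    for (const Vec3& v : triangleSoup) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            assert(mesh.vertices.size() < 65536);  // must fit a 16-bit index
            auto index = static_cast<std::uint16_t>(mesh.vertices.size());
            it = seen.emplace(v, index).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

For a quad made of two triangles this reduces six vertices to four. Reordering the resulting indices for cache efficiency is the job of the optimizer library mentioned above.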
Choosing lower-precision vertex attributes (8-bit, 16-bit) can save significant bandwidth, so choose a precision that suits your meshes. Triangles that cover very few pixels (roughly fewer than 32) are rasterized very inefficiently, so make sure your triangles cover a large enough screen area.
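Both points can be sketched in a few lines of C++. The helpers below are hypothetical and not part of any driver API: two of them quantize floats to normalized 16-bit and 8-bit integers (assuming the input is already in the normalized range), and one estimates a triangle's screen coverage from its pixel-space corners so that it can be compared against the 32-pixel rule of thumb.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize a float in [-1, 1] to a signed normalized 16-bit integer.
// Out-of-range input is clamped.
std::int16_t quantizeSnorm16(float v) {
    assert(!std::isnan(v));
    float c = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(c * 32767.0f));
}

// Quantize a float in [0, 1] to an unsigned normalized 8-bit integer.
std::uint8_t quantizeUnorm8(float v) {
    assert(!std::isnan(v));
    float c = std::max(0.0f, std::min(1.0f, v));
    return static_cast<std::uint8_t>(std::lround(c * 255.0f));
}

// Area in pixels of a triangle whose corners are given in pixel coordinates.
float trianglePixelArea(float x0, float y0, float x1, float y1,
                        float x2, float y2) {
    return 0.5f * std::fabs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0));
}

// Rule of thumb from the text: triangles under about 32 pixels are inefficient.
bool isTinyTriangle(float areaInPixels) {
    return areaInPixels < 32.0f;
}
```

A float normal takes 12 bytes, while three 16-bit components take 6. Whether the lost precision is acceptable depends on the mesh.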
The Broadcom Videocore IV GPU is a tile-based GPU (but not a deferred one), so it's important to sort your geometry front-to-back to avoid unnecessary overdraw.
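A minimal sketch of front-to-back sorting on the CPU side, assuming each draw call already has a view-space depth. `DrawItem`, `meshId` and `sortFrontToBack` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-draw record.
struct DrawItem {
    int meshId;       // placeholder handle for whatever gets drawn
    float viewDepth;  // distance from the camera; smaller means closer
};

// Order opaque draws so that the closest geometry is submitted first.
// Fragments of later, farther draws can then fail the depth test
// instead of being shaded and overwritten.
void sortFrontToBack(std::vector<DrawItem>& items) {
    auto closerFirst = [](const DrawItem& a, const DrawItem& b) {
        return a.viewDepth < b.viewDepth;
    };
    // stable_sort keeps the submission order of draws at equal depth.
    std::stable_sort(items.begin(), items.end(), closerFirst);
    assert(std::is_sorted(items.begin(), items.end(), closerFirst));
}
```

This ordering applies to opaque geometry with depth testing enabled. Blended geometry usually needs the opposite order.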
The Broadcom Videocore IV GPU has a dual-issue scalar FP32 ALU. This means that it can execute up to two instructions per cycle using its ADD and MUL ALUs. To maximize utilization, it's important to keep both the ADD and MUL pipelines busy.
For maximum performance, exploit the fact that additional work, e.g. special functions and texture sampling, can run in parallel with the main ALU pipeline using signaling bits.
The shader processors on the GPU have two hardware threads that can be used to hide texture fetching latency. This unfortunately halves the available register space, because each of the two threads needs its own registers. Using both threads is not mandatory, however, so shaders with high register usage can opt for synchronous texture fetching instead.
Small immediates can be used to encode constants in the assembly code to avoid loading a uniform from the uniform FIFO.
Pack and unpack flags can be used to convert between 8-bit, 16-bit and 32-bit data. These flags don't introduce additional cycles.
The GPU ALU can set flag bits, and one can use those to conditionally store ALU operation results. Conditional MOVs are a great way to eliminate the need for branch instructions.
The Broadcom Videocore IV GPU is not well suited to 1080p resolution, so it's advisable to render at 720p to make sure the GPU is not overwhelmed with fragment work. This leads to a more balanced vertex/fragment workload and also a more balanced CPU/GPU workload.
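To put a number on the difference, the following sketch compares pixel counts per frame, assuming the usual 1920x1080 and 1280x720 dimensions for the two resolutions. The `pixelCount` helper and the constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of pixels in a frame of the given size.
constexpr std::uint64_t pixelCount(std::uint32_t width, std::uint32_t height) {
    return std::uint64_t{width} * height;
}

constexpr std::uint64_t pixels1080p = pixelCount(1920, 1080);  // 2,073,600
constexpr std::uint64_t pixels720p  = pixelCount(1280, 720);   //   921,600

// 1080p has 9/4 = 2.25 times as many pixels as 720p.
static_assert(pixels1080p * 4 == pixels720p * 9,
              "1080p should have 2.25x the pixels of 720p");
```

With the same shaders and the same amount of overdraw, a 1080p frame therefore needs roughly 2.25 times as many fragment shader invocations as a 720p frame.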
Use render pass load/store operations to clear your textures. Any other method will likely result in a full-screen quad being drawn to clear part or all of a texture. Failing to clear render target contents when starting a render pass, or failing to discard render target contents, wastes bandwidth.
Additional documentation for the Broadcom Videocore IV GPU can be found here: https://docs.broadcom.com/docs/12358545