KHR_pipeline_executable_properties. RenderDoc Meta Fork implements this extension as a shader disassembly option in the shader module panel, so it is not necessary to write code to retrieve them.

 |  |
---|---|
Description | Total count of all instructions. In most cases, more instructions mean longer shader execution time. A shader whose instruction count exceeds the instruction cache size causes instruction cache (I$) misses, which harm shader performance. |
Value Range | Positive integer |
Optimization | Avoid redundant operations. |
 |  |
---|---|
Description | Total count of all 32-bit ALU instructions. More ALU instructions might not affect performance, but can consume more power. |
Value Range | 0 - Total instructions |
Optimization | Remove redundant computations. |
 |  |
---|---|
Description | Total count of all 16-bit ALU instructions. These are similar to full-precision ALU instructions, but they perform better and use less register space than 32-bit instructions. Converting full-precision ALU instructions into half-precision ALU instructions can improve performance in ALU-bound shaders. |
Value Range | 0 - Total instructions |
Optimization | Use lower precision as much as possible. |
 |  |
---|---|
Description | Total count of all complex (EFU) instructions, such as sin and cos. EFU instructions take longer to execute than ALU instructions. A short latency sync instruction is needed between an EFU instruction and the instruction that uses its result. |
Value Range | 0 - Total instructions |
Optimization | If EFU instructions must be used, group some of them together to reduce the number of short latency sync instructions. Aggressive grouping is not recommended. |
 |  |
---|---|
Description | Total count of all texture read instructions. Texture fetches incur memory access latency, which must be hidden by ALU instructions. The latency depends on the number of texture fetches and their locality in cache. |
Value Range | 0 - Total instructions |
Optimization | Group texture fetches that can be coalesced to avoid cache thrashing. Keep the number of texture fetches in each group below 15. |
Additional Information | Generally, these are VkImage reads with or without a VkSampler. Input attachment reads are also included. |
 |  |
---|---|
Description | Total count of all flow control instructions. More flow control instructions mean more divergence in shader code, which can harm shader performance. A control flow instruction takes more execution time than an ALU instruction. |
Value Range | 0 - Total instructions |
Optimization | Reduce instructions inside control flow blocks as much as possible so that the control flow can be flattened by the compiler. |
 |  |
---|---|
Description | Total count of all barrier and fence instructions. Global sync reduces wave parallelism and extends application execution time. More power is consumed if the execution time is longer. |
Value Range | 0 - Total instructions |
Optimization | Avoid frequent global synchronization. |
Additional Information | Generally, these are Op*Barrier instructions in the shader, such as OpControlBarrier and OpMemoryBarrier. |
 |  |
---|---|
Description | Total count of all short latency sync instructions. A shader instruction can be delayed if it is too close to the instruction that causes the short latency sync and no other wave is available to hide the latency. |
Value Range | 0 - Total instructions |
Optimization | Put EFU instructions together if this does not increase the def-use distance too much. |
 |  |
---|---|
Description | Total count of all long latency sync instructions. These are caused by memory accesses and can delay shader instruction execution if the latency is long and there are not enough waves to hide it. |
Value Range | 0 - Total instructions |
Optimization | Latency is determined by multiple factors. Improving the locality of memory instructions can help. |
 |  |
---|---|
Description | Number of 128-bit full-precision registers used by each shader instance. Each 128-bit register can store 4 FP32 values. This is similar to the number of registers, but counts only the full-precision registers needed by the shader. |
Value Range | 0 - Max full registers allowed by ISA |
Optimization | In addition to the items related to the number of registers, use lower precision variables to avoid high register use in a shader. |
 |  |
---|---|
Description | Number of 64-bit half-precision registers used by each shader instance. Each 64-bit register can store 4 FP16 values. This is similar to the number of registers, but counts only the half-precision registers needed by the shader. |
Value Range | 0 - Max half registers allowed by ISA |
Optimization | Try to use half-precision variables, but avoid excessive mixed-precision operations. |
 |  |
---|---|
Description | Number of 128-bit registers used by each shader instance. Each 128-bit register can store 4 FP32 values or 8 FP16 values. Using too many registers reduces the number of active waves during shader execution and can cause register spills in some cases. A higher active wave count can hide longer memory latency; a low active wave count cannot, and ALU utilization drops as a result. |
Value Range | 0 - Max registers allowed by ISA |
Optimization | Avoid large vector variables with dynamic accesses. Use a constant array/vector if possible, instead of declaring an array/vector and assigning constant values to it. |
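
The register widths stated above make it easy to estimate how much register space a set of values needs at each precision. The sketch below is only that arithmetic: `registers_needed` is a hypothetical helper, and it assumes values pack tightly into 128-bit registers with no padding, which a real shader compiler may not do.

```cpp
#include <cstdint>

constexpr std::uint32_t kRegisterBits = 128;  // register width stated above

// Number of 128-bit registers needed to hold `value_count` scalars of
// `bits_per_value` bits each, assuming tight packing (an assumption of this sketch).
constexpr std::uint32_t registers_needed(std::uint32_t value_count,
                                         std::uint32_t bits_per_value) {
    return (value_count * bits_per_value + kRegisterBits - 1) / kRegisterBits;
}

static_assert(registers_needed(4, 32) == 1, "4 FP32 values fill one register");
static_assert(registers_needed(8, 16) == 1, "8 FP16 values fill one register");
static_assert(registers_needed(16, 32) == 4, "16 FP32 values need four registers");
static_assert(registers_needed(16, 16) == 2, "the same 16 values in FP16 need two");
```

Under this assumption, storing a 16-element array in FP16 instead of FP32 halves its register footprint, which is why lower precision variables help keep register use down.
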
 |  |
---|---|
Description | Number of 128-bit slots of scratch memory used by each shader instance. |
Value Range | 0 - Max scratch memory size |
Optimization | Avoid scratch memory. A shader that uses any scratch memory will perform poorly. |
 |  |
---|---|
Description | Total count of all shader stage output components. |
Value Range | 0 - 128 |
 |  |
---|---|
Description | Total count of all shader stage input components. |
Value Range | 0 - 128 |
 |  |
---|---|
Description | The maximum shader processor utilization for the shader. A higher utilization percentage means more parallelism in shader execution. A low fiber count can leave memory access latency exposed and the ALU or other GPU components idle. |
Value Range | 0 - 100% |
Optimization | Lower register use to increase the total fiber count. If this number is low, the shader may perform poorly. |
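
The link between register use and utilization can be illustrated with a simplified model. In the sketch below, the `ShaderProcessorLimits` struct, its fields, and all numbers are hypothetical; it assumes the register file is the only limiting resource and ignores any allocation granularity real hardware may apply.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical hardware limits, chosen only to make the arithmetic easy to follow.
struct ShaderProcessorLimits {
    std::uint32_t register_file_size;  // 128-bit registers available in total
    std::uint32_t max_fibers;          // fibers resident at 100% utilization
};

// Estimates utilization (0-100) when each fiber needs `registers_per_fiber`
// 128-bit registers and nothing other than registers limits residency.
std::uint32_t estimate_utilization_percent(const ShaderProcessorLimits& hw,
                                           std::uint32_t registers_per_fiber) {
    if (hw.max_fibers == 0) return 0;
    if (registers_per_fiber == 0) return 100;  // no register pressure at all
    const std::uint32_t fibers_that_fit =
        hw.register_file_size / registers_per_fiber;
    const std::uint32_t resident = std::min(fibers_that_fit, hw.max_fibers);
    return resident * 100 / hw.max_fibers;
}
```

With an assumed register file of 4096 registers and 512 fibers at full utilization, a shader using 8 registers per fiber reaches 100%, one using 16 reaches 50%, and one using 32 reaches 25%. Halving register use doubles the fiber count until the fiber limit is hit.
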
 |  |
---|---|
Description | Total count of all memory read instructions. These are similar to texture fetches, but go through different memory access units. |
Value Range | 0 - Total instructions |
Optimization | Similar to texture fetches. |
Additional Information | Generally, these are VkImage/VkBuffer reads through a storage descriptor. |
 |  |
---|---|
Description | Total count of all memory write instructions. |
Value Range | 0 - Total instructions |
Optimization | Use vector stores when possible. Issuing separate scattered store instructions to contiguous memory locations can harm performance. |
Additional Information | Generally, these are VkImage/VkBuffer writes through a storage descriptor. |