Dumb question, but if register contention deadlock is possible, what would stop the shaders or hardware from, say, detecting the deadlock and spilling one or more threads to local memory?
To answer your question in parts:
Shaders: the shader compiler does go to great lengths to avoid register bank conflicts of all kinds.
Hardware: you generally don't want to have that much complexity in a GPU pipeline, especially not in the front end -- these are very simple machines, no OoO execution, very little prediction, and so their power comes from width, not depth. If you make each core a little more complex, you lose out because you now need to spend area that would have otherwise been spent on more compute units.
Huh, I would have expected the request for additional registers to be linked into the resource-request scoreboarding you see for texture accesses, RT hardware requests, etc. Busy waiting on a condition code seems non-optimal, and opens the window for cases where one or all shader programs on a core can no longer make progress.
The feature was clearly designed for the ray-tracing use case, where the waves are independent and really shouldn't run into deadlocks as long as one wave can continue. That's why they have the deadlock avoidance mode; it guarantees that one wave will always continue.
I believe they are using this to store the BVH traversal stack in registers, with allocation size depending on traversal depth. Eventually all the rays in the wave will hit or miss and free those registers.
Busy waiting is only the naive solution. I bet they went with the software-based solution so they can experiment with other approaches, such as dynamically switching to the previous approach of a stack in shared memory when registers run out. Or maybe they can sort the rays, moving the long-running rays out and grouping them together into waves dedicated to deep traversal.
Actually, I'm wrong.
AMD's deep dive [1] says register use is low during BVH traversal, so they are still using shared memory for the stack. According to the ISA doc [2] they have a special mode for the stack that overwrites older entries instead of overflowing and a fallback stackless mode.
So the register allocation is actually increased when invoking the hit shader, allowing each hit shader to use a different number of VGPRs; spinning is probably the most logical approach. They probably don't even need the deadlock avoidance mode with their current implementation, which seems to allocate the full set of registers when calling the hit shader and only free them at the end.
[1] https://www.techpowerup.com/review/amd-radeon-rx-9070-series...
[2] See 12.5.3. DS Stack Operations for Ray Tracing - https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
That's a really good trick. New to me. Anyone know a reasonable way to do the query equivalent? Not allocate more, ask how many are currently available.
Why bother?
If the next assembly instruction says to use register #200 (but that fails because you only have 160 registers allocated), what's the processor supposed to do?
The only hope is to call for more registers and lock until those registers are available. There's no way to progress otherwise.
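In C-flavoured pseudocode, that spin pattern looks roughly like the sketch below. try_alloc_vgprs() is a hypothetical wrapper around s_alloc_vgpr (which, per the ISA quote further down, reports success or failure in SCC), not a real intrinsic; the hit-shader call is a placeholder.

    #include <stdbool.h>

    bool try_alloc_vgprs(unsigned count);   /* hypothetical wrapper around s_alloc_vgpr; true = SCC set (success) */
    void call_hit_shader(void);             /* placeholder for the dynamically dispatched hit shader */

    void invoke_hit_shader_with_extra_regs(unsigned extra_vgprs)
    {
        /* Naive busy-wait: keep retrying until another wave frees enough
         * registers. The deadlock-avoidance mode mentioned above is what
         * guarantees at least one wave can always make progress. */
        while (!try_alloc_vgprs(extra_vgprs)) {
            /* spin; a real implementation might sleep briefly here */
        }
        call_hit_shader();
        /* the extra registers are released when the allocation is shrunk back */
    }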
Knowing how many registers are allocated is roughly how many are live, which is how many to write to the stack during a coroutine switch.
I'm having trouble with the tradeoff between compiler complexity and the perceived value of userspace scheduling on GPUs, so a means of moving work from the compiler into the runtime helps lower the barrier to landing the implementation.
Oh, or with the work hat on, function pointers are a nuisance with kernel-allocated resources, and a jump table keyed off the result of a register-file-size query would be a partial solution to that.
Coroutines aren't really a thing in modern GPUs.
You might get similar effects with work graphs, CUDA dynamic parallelism or whatnot. But most coroutine patterns have significant branch divergence that probably is grossly inefficient in GPU land.
The GPU equivalent to coroutines would be to save off a continuation structure to an append buffer, and then start up a new GPU call with that buffer turned into a consume buffer (which is how raytracing worked in Blender circa 2018, before dedicated raytracing hardware was a thing).
You need to consolidate all of the divergent threads and coalesce them into converged threads to benefit from SIMD architecture.
-------
EDIT: Remember that a 1920 x 1080 screen has 2,073,600 pixels, which becomes 2,073,600 threads for a typical 1-ray-per-pixel GPU implementation.
With so many threads, it doesn't make sense to "immediately" call coroutines. Instead, it makes far more sense to consolidate and coalesce threads together (ex: group up "hit" raytraces vs "miss" raytraces, which are handled differently).
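A rough, CPU-side C sketch of that consolidate-and-coalesce idea, purely illustrative (the Ray/RayQueue types, names, and queue size are made up): each ray appends itself to a "hit" or "miss" queue through an atomic counter, and the matching handler is dispatched only once a queue can fill whole 256-thread groups.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QUEUE_CAP (1u << 21)   /* enough slots for one 1080p frame of rays (illustrative) */

    typedef struct { float org[3], dir[3]; unsigned pixel; } Ray;   /* made-up payload */

    typedef struct {
        _Atomic unsigned count;     /* append cursor: this is the "append buffer" */
        Ray rays[QUEUE_CAP];
    } RayQueue;

    static RayQueue hit_queue, miss_queue;

    /* Each "thread" classifies its ray and appends it to the matching queue
     * (on a GPU this would be an append/consume buffer; here it is just an
     * atomic bump allocator). */
    void classify_ray(const Ray *r, bool hit)
    {
        RayQueue *q = hit ? &hit_queue : &miss_queue;
        unsigned slot = atomic_fetch_add(&q->count, 1);
        q->rays[slot] = *r;
    }

    /* Launch the hit (or miss) handler only once a queue holds enough rays to
     * fill whole 256-thread groups, so the SIMD units stay converged. */
    bool ready_for_dispatch(RayQueue *q, unsigned group_size)
    {
        return atomic_load(&q->count) >= group_size;
    }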
It makes a lot more sense when programming it as a vector machine. The divergence/convergence stuff is the CUDA/SIMT model; there's no need to use that on amdgpu. Branches are cheap(ish) when they're done on the scalar unit.
Coroutines aren't currently a thing on amdgpu but I think they should be.
> The divergence/convergence stuff is the cuda/simt model
Even on NVidia, you're "allowed" to diverge and converge. But it's not efficient.
Optimal NVidia coding will force more convergence than divergence. That's innate to GPU architecture. It's more efficient to run 32-at-a-time per NVidia warp than a diverged 8-at-a-time warp.
Yes, NVidia _CAN_ diverge and properly execute a subwarp of 8-at-a-time per clock tick... including with complex atomics and all that. But running a full 32-at-a-time warp is 400% the speed, because it's ALWAYS better to do more per clock tick than less.
Yes... I could see that approach being desirable on a CPU; it would make all context switches cheaper. Are we re-inventing register windows?
Of course, we are on a GPU here. Skimming through the RDNA 4 ISA document [1], it looks like they expect waves using the dynamic VGPR feature to keep track of how many VGPRs they have requested in a scalar register (SGPR). Each wave always has 106 SGPRs allocated (32-bit each), so the overhead of a single SGPR isn't an issue. You wouldn't even need the jump table, as AMD allows you to index into the VGPR register file. A small loop would allow you to push all active VGPRs onto the stack.
Shame they don't also allow you to index into the SGPR registers too.
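A hedged C-style sketch of that save loop (not real RDNA assembly; current_vgpr_count() and read_vgpr_indexed() are illustrative stand-ins for reading the tracking SGPR and for relative VGPR indexing):

    #include <stdint.h>

    /* Hypothetical wrappers, named only for illustration: */
    unsigned current_vgpr_count(void);        /* read the SGPR tracking how many VGPRs this wave requested */
    uint32_t read_vgpr_indexed(unsigned reg, unsigned lane);   /* relative-index into the VGPR file */

    /* Spill every currently-allocated VGPR of every lane to a per-wave stack
     * in memory. Because the count lives in an SGPR, the loop bound is uniform
     * across the wave and the branch stays on the scalar unit. */
    void save_wave_vgprs(uint32_t *stack, unsigned lanes)
    {
        unsigned n = current_vgpr_count();
        for (unsigned reg = 0; reg < n; ++reg)
            for (unsigned lane = 0; lane < lanes; ++lane)
                *stack++ = read_vgpr_indexed(reg, lane);
        /* SGPRs can't be indexed the same way (as noted above), so they would
         * have to be saved with straight-line code or a compiler-known count. */
    }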
But GPUs aren't really designed with coroutines or context switching in mind. Modern GPUs can do it, but GPUs have deep historical assumptions that every single wave on every single core will be running the exact same code, and that 99.99% of waves will run to completion without needing a context switch. When preemptive context switches are needed (no more than once every few ms, and only with long-running compute shaders), they can probably get away with simply DMAing the whole register file to memory.
Though, part of the reason this feature was implemented is that the historical assumptions are no longer true. DXR allows each material to register a unique hit shader, so modern GPUs need to dynamically dispatch based on the result of ray hits.
[1] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
> DXR allows each material to register a unique hit shader, so modern GPUs need to dynamically dispatch based on the result of ray hits.
That's not how it works in practice, even with hardware-accelerated raytracers (like Intel Arc).
AMD systems push the hit/miss onto various buffers and pass them around.
Intel systems push the entire call stack and shuffle it around.
Let's say your 256-thread-group chunk has 30% "metallic hits", 15% "diffuse hits", and the remaining 55% are misses. You cannot "build up" a new group with just one thread-group (!!!!).
To efficiently run things, you'll need ~4 thread-groups (aka: 1024 rays) before you can run a full 256-thread group again for metallic hits, and you'll need ~2 thread-groups (aka: 512 rays) before you get a full 256-thread group again for misses. And finally, you'll need ~7 thread-groups (aka: 1792 rays to pass through) before you have the 256 diffuse hits needed to fill up a SIMD unit.
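The arithmetic behind those numbers, as a small worked example: with a 256-thread group and a category that makes up a fraction p of the rays, you need roughly ceil(1/p) incoming groups before a full group of that category has accumulated.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double group = 256.0;   /* threads per group */
        const struct { const char *name; double rate; } cat[] = {
            { "metallic hit", 0.30 },
            { "miss",         0.55 },
            { "diffuse hit",  0.15 },
        };
        for (int i = 0; i < 3; ++i) {
            /* incoming groups needed before 256 rays of this category accumulate */
            double groups = ceil(group / (group * cat[i].rate));
            printf("%-12s: ~%.0f groups (%.0f rays)\n",
                   cat[i].name, groups, groups * group);
        }
        return 0;   /* prints ~4 (1024 rays), ~2 (512 rays), ~7 (1792 rays) */
    }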
In all cases, you need to dynamically grow a buffer and "build up" enough parallelism before running the recursive hits (or miss) handlers. The devil is in the details.
Intel has very dedicated and specialized accelerators that move the stacks around (!!!!) so that all groups remain fully utilized. I believe AMD's implementation is "just" an append buffer followed by a consume buffer, simple enough really. Probably inside of shared memory, but who knows what the full implementation details are. (The AMD systems have documented ISAs, so we know what instructions are available. AMD's "raytracers" are BVH tree-traversal accelerators, but they don't seem to have stack-manipulation or stack-movement instructions like Intel's raytracing implementation.)
At any point where a coroutine switch could happen, doesn't the compiler statically know how many registers would be allocated anyway?
Register numbers aren't dynamically indexed; they're baked into the instruction stream, so there's not really any new information you can't get from the instruction pointer and a static (compiler-generated) lookup table?
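A minimal illustration of that static-table idea; everything here (the table contents, the suspend-point IDs) is hypothetical scaffolding, not real toolchain output.

    /* Hypothetical compiler-emitted table: one entry per coroutine suspend
     * point, giving the number of registers live at that point. */
    static const unsigned char live_vgprs_at_suspend_point[] = {
        /* point 0 */ 24,
        /* point 1 */ 40,
        /* point 2 */ 96,
    };

    /* The runtime only needs the suspend-point ID (recoverable from the
     * instruction pointer) to know how many registers to spill. */
    unsigned regs_to_save(unsigned suspend_point_id)
    {
        return live_vgprs_at_suspend_point[suspend_point_id];
    }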
Yep, the compiler can do it all. However, getting proofs of concept together to justify shipping the compiler complexity can go better with more work done by the runtime code.
Dynamically switch algorithms between one that is faster but has more register pressure and a slower one with less pressure. Lowering numerical precision might be on the table too.
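As a sketch, that switch could key off the allocation result itself: try to grab the extra VGPRs the fast variant needs and fall back to the low-pressure variant if the request fails (again using a hypothetical try_alloc_vgprs() wrapper and placeholder kernel names, not real intrinsics).

    #include <stdbool.h>

    bool try_alloc_vgprs(unsigned count);   /* hypothetical s_alloc_vgpr wrapper, as in the earlier sketch */
    void kernel_fast_high_pressure(void);   /* faster variant, needs the extra registers */
    void kernel_slow_low_pressure(void);    /* slower variant, fits the baseline allocation */

    void run_adaptive(unsigned extra_vgprs_for_fast_path)
    {
        if (try_alloc_vgprs(extra_vgprs_for_fast_path))
            kernel_fast_high_pressure();    /* registers were available: take the fast path */
        else
            kernel_slow_low_pressure();     /* contention: degrade gracefully instead of spinning */
    }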
And what, refill the instruction cache in less time than it takes to wait for more registers to become available?
On AMD, not available to users yet AFAIK. Only available to the driver and currently targeted at ray tracing?
On Nvidia you have to be writing in the PTX language to have access to something similar.
Given the speed of occupancy changes, you’d almost always fall afoul of TOCTOU on this.
I don’t know whether it’s reasonable, but
> Allocation requests don’t always succeed. s_alloc_vgpr sets the Scalar Condition Code (SCC) to indicate success, or clears it on failure.
So, you can do a binary search (sketched as code after the list):
- set result to zero.
- try to allocate 512. If that succeeds, add 512 to result
- try to allocate 256. If that succeeds, add 256 to result
- try to allocate 128. If that succeeds, add 128 to result
...
- try to allocate 1. If that succeeds, add 1 to result (you’ll want to stop at the granularity at which registers are allocated)
- free ’result’ registers
(This isn’t race-free)
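In C-flavoured pseudocode, with hypothetical try_alloc_vgprs()/free_vgprs() wrappers around the allocate/deallocate instructions and a placeholder allocation granule, the probe might look like:

    #include <stdbool.h>

    bool try_alloc_vgprs(unsigned count);   /* hypothetical s_alloc_vgpr wrapper (SCC result) */
    void free_vgprs(unsigned count);        /* hypothetical deallocation wrapper */

    #define ALLOC_GRANULE 32u               /* placeholder allocation block size */

    /* Probe how many extra VGPRs are currently available by trying successively
     * smaller power-of-two requests and keeping whatever we manage to grab.
     * Not race-free: another wave can allocate or free between probes. */
    unsigned query_free_vgprs(void)
    {
        unsigned total = 0;
        for (unsigned step = 512; step >= ALLOC_GRANULE; step /= 2) {
            if (try_alloc_vgprs(step))
                total += step;
        }
        free_vgprs(total);   /* give everything back; we only wanted the count */
        return total;
    }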
Note that:
1. Registers are not allocated individually, but in clusters of 32, and...
2. The max registers a single thread can access is 8 clusters (256 registers). The minimum is 1 cluster.
You don't really need much of a binary search for this.
The issue is that there's a plan B: spill your data to L1 cache.
Whatever plan you have needs to be far faster than L1 cache to even make sense. I can imagine some very heavy register-pressure situations where registers are much better than cache accesses.