我一直试图通过尝试一些琐碎的例子来解决内存同步和一致性问题。
在此,我正在调度一个具有 8x8x1 大小工作组的计算着色器。工作组的数量足以覆盖屏幕,即 720x480。
计算着色器代码:
#version 450 core
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;
layout (binding = 0, rgba8) uniform image2D u_fboImg;
layout (binding = 0, offset = 0) uniform atomic_uint u_counters[100];
void main() {
ivec2 texCoord = ivec2(gl_GlobalInvocationID.xy);
// Only use shader invocations within first 100x500 pixels
if (texCoord.x >= 100 || texCoord.y >= 500) {
return;
}
// Each counter should be incremented 400 times
atomicCounterIncrement(u_counters[texCoord.x]);
memoryBarrier();
// Use only "bottom row" of invocations to draw results
// Draw a white column as high as the counter at given x
if (texCoord.y == 0) {
int c = int(atomicCounter(u_counters[texCoord.x]));
for (int y = 0; y < c; ++y) {
imageStore(u_fboImg, ivec2(texCoord.x, y), vec4(1.0f));
}
}
}
这就是我得到的:(锯齿状条的高度每次都不同,但平均约为该高度)
这是我所期望的,并且是将 for 循环硬编码到 400 的结果。
奇怪的是,如果我减少调度中的工作组数量,比如将 x 值减半(现在只覆盖屏幕的一半),条形会变大:
Finally to prove there isn't some other nonsense going on, here i'm just coloring based on local invocation id:
*Edit: Forgot to mention the dispatch is followed immediately by glMemoryBarrier(GL_ALL_BARRIER_BITS);
Unless otherwise stated, all shader invocations for a particular shader stage, including the compute shader stage, execute independently of one another, in an order which is undefined. And calling memoryBarrier
does not change this fact. This means that, when the stuff after the memoryBarrier
is called, there is no guarantee that the value from the atomic counter has been incremented by all of the shader invocations that will eventually do so.
So what you're seeing is exactly what one would expect to see: the invocations writing somewhat random values, depending on the implementation-dependent order that the invocations just so happen to be executed in.
What you're wanting to do is execute all of the atomic increments for all invocations, then read those values and draw stuff based on what you read. Your code as written cannot do that.
While compute shaders do have some ability to manipulate the order of execution of invocations, this only works for invocations within the same work group (this is in fact why work groups exist). That is, you can have invocations ordered to a degree in a work group, but never between work groups.
对此的简单解决方法是将其转换为 2 个计算着色器调度操作。第一个完成所有的递增。第二个将读取值并将结果写入图像。
更聪明的解决方案将涉及雇用工作组。也就是说,将您的工作分组,以便使相同原子计数器递增的任何内容都将在同一个工作组中执行。这样,您甚至不需要原子计数器;您只需使用共享变量(可以执行原子操作)。barrier()
在完成共享变量的所有递增之后调用;这确保在任何调用继续超过该点之前,所有调用至少已经执行了那么远。这样所有的增量就完成了。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句