Writing values to a float4x4 matrix in iOS Metal - existing float values replaced

Question

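I'm trying to build a buffer that counts the particles/points in the cells of a uniformly spaced grid (each particle is added to a cell based on its position). The index of each particle is then stored in a float4x4 matrix corresponding to its cell, for later lookup. I'm using an atomic counter to keep a particle count per cell, capped so that no more than 16 particles are stored, and I use this counter to add the particle indices to the matrix sequentially.

When the row or column count goes past 4, i.e. on moving to the next column/row, the index at position [0] ends up written as 0.0. I'm not sure whether this has to do with how columns in a float4x4 matrix get overwritten, or with how float4x4 is laid out in memory / accessed through pointers. Basically, I just want to add and change individual values sequentially without affecting the existing values in the matrix.

Any help is much appreciated, thanks a lot!

Here is the kernel: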

kernel void findCellandCount(  device Particle *particles [[ buffer(0) ]],
                     volatile device atomic_uint *cellCountArray [[buffer(1)]],
                     device float4x4* cellIndicesBuffer [[ buffer(2) ]],
                     uint id [[ thread_position_in_grid ]]) {
  
    uint particleIndex = id;
    
    Particle particle = particles[particleIndex];

    const float cellSize = Params().cellWidth;

    // GET CELL INDEX:

    int2 cellIndex = int2(fast::floor(particle.position.x / cellSize), fast::floor(particle.position.y / cellSize)); 

    uint flatCellIndex = GetFlatCellIndex(cellIndex, numberGridCells); // this is in the range 0-15 (16 cells)
    
    // This is a counter to store particle count in each cell // this is reset to zero each frame:
    int cellCounter = atomic_fetch_add_explicit(&cellCountArray[flatCellIndex], 1, memory_order_relaxed);
        
    if (cellCounter < 16) { // i.e. count is less than 4x4 float matrix 

        uint a = cellCounter % 4;
        uint b = cellCounter / 4;

    //m[index][column][row]

        cellIndicesBuffer[flatCellIndex][a][b] = id; // this writes the particle index to the float4x4
 
    }
  
}

Here is the output for some cells:

Cell:  3 cell indices:  simd_float4x4([[23.0, 0.0, 0.0, 0.0], [25.0, 0.0, 0.0, 0.0], [44.0, 0.0, 0.0, 0.0], [61.0, 0.0, 0.0, 0.0]])

^ This output is as expected: 4 indices are stored in the first 4 rows.

Cell:  8 cell indices:  simd_float4x4([[0.0, 38.0, 0.0, 0.0], [0.0, 39.0, 0.0, 0.0], [0.0, 42.0, 0.0, 0.0], [0.0, 63.0, 0.0, 0.0]])

^ Here 8 indices were stored, but the values at position [0] have been overwritten/replaced with 0.0.

Cell:  9 cell indices:  simd_float4x4([[0.0, 35.0, 0.0, 0.0], [0.0, 45.0, 0.0, 0.0], [13.0, 0.0, 0.0, 0.0], [28.0, 0.0, 0.0, 0.0]])

^ Here 6 indices were stored, but again the values in the first position have been overwritten.

Answer 1

I think I know what's going on. In your code you have

cellIndicesBuffer[flatCellIndex][a][b] = id;

I think the effect of that line is essentially this:

float4x4 temp = cellIndicesBuffer[flatCellIndex];
temp[a][b] = id;
cellIndicesBuffer[flatCellIndex] = temp;

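Although you use atomic_uint to prevent data races on the elements of cellCountArray, you do nothing to prevent them on the elements of cellIndicesBuffer, and making those accesses atomic would hurt performance anyway.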
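The lost update that a whole-matrix read-modify-write could cause can be reproduced deterministically on the CPU. The following is only a minimal sketch under stated assumptions: `Matrix4x4` is a plain C++ stand-in for float4x4, all function names are hypothetical, and the interleaving (both "threads" read before either writes back) is hard-coded rather than produced by real GPU threads. It illustrates the hypothesis above; it does not show what the Metal compiler actually emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// CPU stand-in for Metal's float4x4: 4 columns of 4 floats, indexed m[column][row].
using Matrix4x4 = std::array<std::array<float, 4>, 4>;

// Whole-matrix update: modify one element of a previously read copy,
// then store the entire copy back over the shared matrix.
void storeWholeMatrix(Matrix4x4 &shared, Matrix4x4 snapshot,
                      std::size_t column, std::size_t row, float value) {
    assert(column < 4 && row < 4);
    snapshot[column][row] = value;
    shared = snapshot;  // also rewrites the 15 elements this "thread" never meant to touch
}

// Element-wise update: store only the one float that changed.
void storeSingleElement(Matrix4x4 &shared,
                        std::size_t column, std::size_t row, float value) {
    assert(column < 4 && row < 4);
    shared[column][row] = value;
}

// Assumed interleaving: two simulated threads both read the cell
// before either one writes its result back.
Matrix4x4 simulateWholeMatrixWrites() {
    Matrix4x4 cell{};            // all zeros, like a freshly cleared buffer
    Matrix4x4 snapshotA = cell;  // thread A reads
    Matrix4x4 snapshotB = cell;  // thread B reads
    storeWholeMatrix(cell, snapshotA, 0, 0, 23.0f);  // A writes back
    storeWholeMatrix(cell, snapshotB, 1, 0, 25.0f);  // B writes back and wipes A's 23
    return cell;
}

// Same two updates, but each one touches only its own element.
Matrix4x4 simulateSingleElementWrites() {
    Matrix4x4 cell{};
    storeSingleElement(cell, 0, 0, 23.0f);
    storeSingleElement(cell, 1, 0, 25.0f);
    return cell;
}
```

In this sketch, `simulateWholeMatrixWrites()` returns a matrix whose element [0][0] is back at 0.0 while [1][0] holds 25.0, whereas `simulateSingleElementWrites()` keeps both 23.0 and 25.0. That is the same kind of missing entry seen in the question's output.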

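I think the problem stems from the fact that float4x4 is a SIMD type, which is essentially a struct. I suspect the code reads the whole matrix into GPU thread-local memory (probably registers), updates the element, and then writes the whole matrix back to the array, overwriting any elements of the float4x4 that other threads may have written in the meantime. To avoid that, you just need to address the element you want to update directly, without going through float4x4, which you can do by recasting cellIndicesBuffer to float*.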

auto i = flatCellIndex * 16 + a * 4 + b;
reinterpret_cast<device float*>(cellIndicesBuffer)[i] = id;

If you intend to fill each column with consecutive cellCounter values before moving on to the next column, you can eliminate a and b:

auto i = flatCellIndex * 16 + cellCounter;
reinterpret_cast<device float*>(cellIndicesBuffer)[i] = id;
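To see where each counter value lands in the flat buffer, the two indexing schemes above can be compared on the CPU. This is a minimal sketch, assuming each float4x4 occupies 16 contiguous floats stored as four columns of four (so element [a][b] sits at offset a * 4 + b within its cell, as the formula above implies). The function names and the constant are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout: 16 floats per cell, stored column by column.
constexpr std::size_t kSlotsPerCell = 16;

// Offset of element [column][row] of a given cell in the flat float buffer.
std::size_t elementOffset(std::size_t cell, std::size_t column, std::size_t row) {
    assert(column < 4 && row < 4);
    return cell * kSlotsPerCell + column * 4 + row;
}

// Offset produced by the question's split: a = counter % 4, b = counter / 4.
// Consecutive counters jump from column to column (stride of 4 floats).
std::size_t offsetFromSplitCounter(std::size_t cell, std::size_t counter) {
    assert(counter < kSlotsPerCell);
    return elementOffset(cell, counter % 4, counter / 4);
}

// Offset produced by the simplified form: consecutive counters land in
// consecutive floats, filling one column before moving to the next.
std::size_t offsetFromSequentialCounter(std::size_t cell, std::size_t counter) {
    assert(counter < kSlotsPerCell);
    return cell * kSlotsPerCell + counter;
}
```

For example, for cell 2 and counter 6, the split form gives offset 2 * 16 + 2 * 4 + 1 = 41, while the sequential form gives 2 * 16 + 6 = 38. Both schemes assign each counter in 0..15 a distinct slot within the cell; they differ only in the order in which the slots are filled.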

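I should mention that responsible C++ programmers may cringe at my use of reinterpret_cast here, and rightly so. If you don't need to use cellIndicesBuffer specifically as device float4x4* anywhere else in your shader function (and in the code you provided, you don't), it would be better to change the parameter type to device float*. Then you wouldn't need the reinterpret_cast at all. You wouldn't need to make any changes to your Swift code.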
