Cuda move element in array to the end
Hi, for my question any suggestion or advice is welcome:
I have an array of structs (representing particles). To simplify, at the beginning the array contains only true values (Particle.exists = true). I run my own CUDA kernel function on this array and, in some cases, a true value changes to false. After that I have to move these particles to the end of the array for better optimization (the dead particles (exists = false) are no longer used).
In theory I have two options how to do this...
- some parallel sorting algorithm, or
- moving the dead particles to the end and shifting the array.
The second option should be the better choice, but I don't know how to do it in parallel. I can have 1,000,000 particles, so moving them in a single thread is not a good idea...
Here is my code sample. I put a TODO where I need to shift the array:
struct Particle
{
float2 position;
float angle;
bool exists;
};
__global__ void moveParticles(Particle* particles, const unsigned int lengthOfParticles, const Particle* leaders, const unsigned int lengthOfLeaders, const unsigned int sizeOfLeader, const float speedFactor, const cudaTextureObject_t heightMapTexture)
{
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
while (idx < lengthOfParticles)
{
// If particle does not exist then do nothing and skip
if (!particles[idx].exists) { idx += skip; continue; }
float bestLength = 3.40282e+038;
unsigned int bestLeaderIndex;
for (unsigned int i = 0; i < lengthOfLeaders; i++)
{
float currentLength = (
(particles[idx].position.x - leaders[i].position.x) * (particles[idx].position.x - leaders[i].position.x)
) + (
(particles[idx].position.y - leaders[i].position.y) * (particles[idx].position.y - leaders[i].position.y)
);
if (currentLength < bestLength)
{
bestLength = currentLength;
bestLeaderIndex = i;
}
}
Particle bestLeader = leaders[bestLeaderIndex];
float differenceX = bestLeader.position.x - particles[idx].position.x;
float differenceY = bestLeader.position.y - particles[idx].position.y;
float newLength = sqrtf(differenceX * differenceX + differenceY * differenceY);
// If the newLength is equal to zero, then the particle is at the same position as the leader
// TODO: HERE I NEED SORT NOT EXISTING PARTICLE TO THE END
if (newLength <= sizeOfLeader / 2) { particles[idx].exists = false; idx += skip; continue; }
// Current height at the position
const uchar4 texelOfHeight = tex2D<uchar4>(heightMapTexture, particles[idx].position.x, particles[idx].position.y);
// Normalize vector
differenceX /= newLength;
differenceY /= newLength;
int nextPositionOnMapX = round(particles[idx].position.x + differenceX);
int nextPositionOnMapY = round(particles[idx].position.y + differenceY);
// Height of the next position
const uchar4 texelOfNextPosition = tex2D<uchar4>(heightMapTexture, nextPositionOnMapX, nextPositionOnMapY);
float differenceHeight = texelOfHeight.x - texelOfNextPosition.x;
float speed = sqrtf(speedFactor + 2 * fabsf(differenceHeight));
// Multiply by speed
differenceX *= speed;
differenceY *= speed;
particles[idx].position.x += differenceX;
particles[idx].position.y += differenceY;
idx += skip;
}
}
One possible solution I am thinking about is to write my own kernel function that would only shift the particles. Something like this:
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle) {
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
//TODO: Shifting...
}
Sorting is rather inefficient on GPUs, so it is better to select the values to keep and perform a partition based on them. To do that easily, you can use CUB, which is quite efficient (as it generally implements state-of-the-art algorithms, or close to it).
You can use DevicePartition or two DeviceSelect calls (the former is likely faster, unless you do not want to keep the dead particles at all). You can also use block primitives if you want to perform some advanced tweaks/optimizations.
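Here is a minimal sketch of the DevicePartition approach. The wrapper partitionParticles and the device allocations d_particles, d_out and d_numAlive are assumptions made for the example; only the cub::DevicePartition::If call itself is the library API.
#include <cub/cub.cuh>

// Predicate selecting living particles.
struct ParticleExists
{
    __host__ __device__ bool operator()(const Particle& p) const { return p.exists; }
};

// Sketch: partition living particles to the front of d_out and dead ones (in reverse
// order) to the back. d_particles, d_out and d_numAlive are device allocations assumed
// to be provided by the caller; *d_numAlive receives the pivot (number of living ones).
void partitionParticles(const Particle* d_particles, Particle* d_out, int* d_numAlive, int n)
{
    void* d_tempStorage = nullptr;
    size_t tempStorageBytes = 0;
    // First call with a null temporary buffer only queries the required size.
    cub::DevicePartition::If(d_tempStorage, tempStorageBytes,
                             d_particles, d_out, d_numAlive, n, ParticleExists());
    cudaMalloc(&d_tempStorage, tempStorageBytes);
    // Second call performs the actual partition.
    cub::DevicePartition::If(d_tempStorage, tempStorageBytes,
                             d_particles, d_out, d_numAlive, n, ParticleExists());
    cudaFree(d_tempStorage);
}
The temporary storage buffer could of course be allocated once and reused across frames instead of being reallocated on every call.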
If you still want to do this yourself for some reason (e.g. to reduce the number of dependencies in your project), then you can use atomic adds on relatively recent devices, since they are well optimized by the hardware. On older devices you could do it with a scan, but that is harder to implement. The problem is that atomics do not scale well when there are many SMs, so you may need a more advanced blocking strategy. Here is an untested naive implementation to illustrate the idea:
// PS: what is the difference between sizeOfParticle and lengthOfParticles?
// pos must point to two counters initialized to 0. Once the kernel has finished its
// execution, pos[0] contains the number of living particles (the pivot).
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle, Particle* outParticles, int* pos) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    // outParticles is a needed temporary (or output) array, as the operation cannot be
    // efficiently performed in place in parallel.
    // It should likely be allocated and provided as an argument to the kernel.
    while (idx < lengthOfParticles)
    {
        const Particle current = particles[idx];
        if (current.exists)
        {
            // Reserve the next free slot at the beginning (here is the important point)
            const int localPos = atomicAdd(&pos[0], 1);
            outParticles[localPos] = current;
        }
        else
        {
            // Reserve the next free slot at the end
            const int localPos = atomicAdd(&pos[1], 1);
            outParticles[lengthOfParticles - 1 - localPos] = current;
        }
        idx += skip;
    }
}
Note that the ordering is not preserved because of the atomic operations. If you need to keep the order of the particles, it becomes significantly more complex, especially on GPUs, because it makes the algorithm more sequential. A naive solution in that case is to use a stable sort. Another solution is to use a global scan followed by an indirect store of the values (so two passes). Implementing an efficient scan is a bit complex/tedious. Hopefully, CUB can help a lot in this case with its DeviceScan primitive.
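For illustration, a rough sketch of that scan-based, order-preserving variant could look like the following. The helper kernels writeFlags and scatterParticles are made-up names; the prefix sum itself would be done with cub::DeviceScan::ExclusiveSum on the flags array, and numAlive is the total number of living particles obtained from the scan (last scanned value plus last flag).
// Pass 1 (hypothetical helper): write 1 for living particles, 0 for dead ones.
__global__ void writeFlags(const Particle* particles, int* flags, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    for (; idx < n; idx += skip)
        flags[idx] = particles[idx].exists ? 1 : 0;
}

// Pass 2 (hypothetical helper): scatter using the exclusive prefix sum of the flags.
// scannedFlags[idx] is the number of living particles before idx, so the relative
// order of the living particles (and of the dead ones) is preserved.
__global__ void scatterParticles(const Particle* particles, Particle* outParticles,
                                 const int* scannedFlags, int numAlive, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    for (; idx < n; idx += skip)
    {
        const Particle p = particles[idx];
        const int alivePos = scannedFlags[idx];   // living particles before idx
        const int deadPos  = idx - alivePos;      // dead particles before idx
        outParticles[p.exists ? alivePos : numAlive + deadPos] = p;
    }
}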
Finally, note that using an array of structs is not efficient, especially on hardware using SIMD instructions like GPUs. The implementation should be noticeably faster with a structure of arrays (because of cache lines, coalescing, contiguity of the access pattern, etc.).
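For example, a structure-of-arrays layout for the same data could look like this (a sketch; the ParticleSoA name is an assumption):
// Structure-of-arrays layout: each field is stored contiguously, so neighbouring
// threads in a warp access neighbouring memory locations (coalesced accesses).
struct ParticleSoA
{
    float* positionX;  // lengthOfParticles elements each
    float* positionY;
    float* angle;
    int*   exists;     // int (or char) instead of bool keeps the accesses simple
};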