OpenCL RK4 Integration on GPU
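When parallelising an integrator with OpenCL, is it bad practice to put the whole loop inside the kernel?
I am trying to move an RK4 integrator that I wrote in C++ over to OpenCL so that I can run it on the GPU; at the moment it uses OpenMP.
I need to perform 10 million+ independent integration runs, each with roughly 700 loop iterations. I have currently written the loop, with its stopping condition, into the kernel, but it is not performing as well as I expected.
Current CL kernel snippet: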
`
while (inPos.z > -1.0f){
    cnt++;
    //Eval 1
    //Euler Velocity
    vel1 = inVel + (inAcc * 0.0f);
    //Euler Position
    pos1 = inPos + (vel1 * 0.0f) + ((inAcc * 0.0f)*0.5f);
    //Drag and accels
    combVel = sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2));
    //motionUtils::drag(netForce, combVel, mortSigma, outPos.z);
    dragForce = mortSigma*1.225f*pow(combVel, 2);
    //Normalise vector
    normVel = vel1 / combVel;
    //Drag Components
    drag = (normVel * dragForce)*-1.0f;
    //Add Gravity force
    drag.z+=((mortMass*9.801f)*-1.0f);
    //Acceleration components
    acc1 = drag/mortMass;
    ...
    //Taylor Expansion
    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);
    //Swap ready for next iteration
    inPos = inPos + (tayVel * timeStep);
    inVel = inVel + (inAcc * timeStep);
}
`
Any ideas or suggestions would be much appreciated.
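Try faster (lower-precision) versions of the slow functions, and simplify the arithmetic around them. Change
sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2))
to
native_rsqrt(vel1.x*vel1.x+vel1.y*vel1.y+vel1.z*vel1.z)
native_rsqrt returns the reciprocal square root, so combVel now holds 1/speed and the normalisation becomes a multiplication:
normVel = vel1 / combVel;
to
normVel = vel1 * combVel;
Replace pow(x, 2) with a plain multiplication:
dragForce = mortSigma*1.225f*pow(combVel, 2);
to
dragForce = mortSigma*1.225f*(combVel*combVel);
(If combVel holds the reciprocal speed as above, use the squared speed here instead, i.e. the sum that was passed to native_rsqrt.)
Fold the multiplications by -1.0f into the sign:
drag = (normVel * dragForce)*-1.0f;
//Add Gravity force
drag.z+=((mortMass*9.801f)*-1.0f);
to
drag = -normVel * dragForce;
//Add Gravity force
drag.z-=mortMass*9.801f;
Use a literal constant instead of a division:
tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);
to
tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (0.166666f);
inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (0.166666f);
tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (0.166666f);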
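If you are using too many variables, try reducing the local work-group size from 256 to 128 or 64. If the variables are not used outside the loop, move their declarations inside the loop so that more threads can be issued at the same time.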
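As an illustration of declaring temporaries inside the loop, the following plain C sketch integrates a single run with RK4, keeping every stage value local to one iteration. It is simplified to vertical motion only, and the names `accel_z`, `integrate_run`, `dragCoeff` (meant as `mortSigma*1.225f/mortMass`) and `maxSteps` are assumptions made for this example, not part of the original kernel. The `maxSteps` bound is also an addition, so that a run that never reaches the stopping height cannot loop forever. Whether loop-scoped declarations actually lower register use depends on the compiler, so treat this as something to measure rather than a guaranteed gain.

```c
/* Hypothetical helper: vertical acceleration from gravity plus quadratic drag. */
static float accel_z(float v, float dragCoeff)
{
    float absV = (v < 0.0f) ? -v : v;
    return -9.801f - dragCoeff * v * absV;     /* drag opposes the velocity */
}

/* Integrates one run until z <= -1 or maxSteps is reached; returns the step count. */
static int integrate_run(float z, float v, float dragCoeff, float dt,
                         int maxSteps, float *zFinal)
{
    int cnt = 0;
    while (z > -1.0f && cnt < maxSteps) {
        /* Stage temporaries live only for this iteration. */
        const float v1 = v;
        const float a1 = accel_z(v1, dragCoeff);
        const float v2 = v + 0.5f * dt * a1;
        const float a2 = accel_z(v2, dragCoeff);
        const float v3 = v + 0.5f * dt * a2;
        const float a3 = accel_z(v3, dragCoeff);
        const float v4 = v + dt * a3;
        const float a4 = accel_z(v4, dragCoeff);

        /* Weighted RK4 combination of the four stages. */
        z += (dt / 6.0f) * (v1 + 2.0f * (v2 + v3) + v4);
        v += (dt / 6.0f) * (a1 + 2.0f * (a2 + a3) + a4);
        cnt++;
    }
    *zFinal = z;
    return cnt;
}
```

In a kernel, the same loop body would run once per work-item, with only the state (`z`, `v`, `cnt`) carried across iterations.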