The difference in the speed of moving objects through the CPU and GPU shader in Unity

Question

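I have been testing moving a large number of objects in Unity, first with plain C# code and then with an HLSL shader. However, there is no difference in speed: the FPS stays the same. The two versions use different Perlin noise functions to change the positions. The C# code uses the standard Mathf.PerlinNoise, while the HLSL code uses a custom noise function.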

Scenario 1 - Update through C# code only

Object spawning:

[SerializeField]
private GameObject prefab;

private void Start()
{
    for (int i = 0; i < 50; i++)
        for (int j = 0; j < 50; j++)
        {
            GameObject createdParticle;
            createdParticle = Instantiate(prefab);
            createdParticle.transform.position = new Vector3(i * 1f, Random.Range(-1f, 1f), j * 1f);
        }
}

The code that moves an object through C#. This script is attached to every spawned object:

private Vector3 position = new Vector3();

private void Start()
{
    position = new Vector3(transform.position.x, Mathf.PerlinNoise(Time.time, Time.time), transform.position.z);
}

private void Update()
{
    position.y = Mathf.PerlinNoise(transform.position.x / 20f + Time.time, transform.position.z / 20f + Time.time) * 5f;
    transform.position = position;
}

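Scenario 2 - Through a compute kernel (GPGPU)

Part 1: C# client code

It spawns the objects, dispatches the compute shader, and assigns the resulting values to the objects: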

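public struct Particle
{
    public Vector3 position;
    public Color color; // mirrors float4 color in the HLSL struct: 3 + 4 floats = the stride of 7 used below
}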

[SerializeField]
private GameObject prefab;
[SerializeField]
private ComputeShader computeShader;

private List<GameObject> particlesList = new List<GameObject>();
private Particle[] particlesDataArray;

private void Start()
{
    CreateParticles();
}

private void Update()
{
    UpdateParticlePosition();
}

private void CreateParticles()
{
    List<Particle> particlesDataList = new List<Particle>();

    for (int i = 0; i < 50; i++)
        for (int j = 0; j < 50; j++)
        {
            GameObject createdParticle;
            createdParticle = Instantiate(prefab);
            createdParticle.transform.position = new Vector3(i * 1f, Random.Range(-1f, 1f), j * 1f);
            particlesList.Add(createdParticle);
            Particle particle = new Particle();
            particle.position = createdParticle.transform.position;
            particlesDataList.Add(particle);
        }

    particlesDataArray = particlesDataList.ToArray();
    particlesDataList.Clear();
    computeBuffer = new ComputeBuffer(particlesDataArray.Length, sizeof(float) * 7);
    computeBuffer.SetData(particlesDataArray);
    computeShader.SetBuffer(0, "particles", computeBuffer);
}

private ComputeBuffer computeBuffer;
private void UpdateParticlePosition()
{
    computeShader.SetFloat("time", Time.time);
    computeShader.Dispatch(computeShader.FindKernel("CSMain"), particlesDataArray.Length / 10, 1, 1);
    computeBuffer.GetData(particlesDataArray);

    for (int i = 0; i < particlesDataArray.Length; i++)
    {
        Vector3 pos = particlesList[i].transform.position;
        pos.y = particlesDataArray[i].position.y;
        particlesList[i].transform.position = pos;
    }
}

Part 2: Compute kernel (GPGPU)

#pragma kernel CSMain

struct Particle {
    float3 position;
    float4 color;
};

RWStructuredBuffer<Particle> particles;
float time;

float mod(float x, float y)
{
    return x - y * floor(x / y);
}

float  permute(float x) { return floor(mod(((x * 34.0) + 1.0) * x, 289.0)); }
float3 permute(float3 x) { return mod(((x * 34.0) + 1.0) * x, 289.0); }
float4 permute(float4 x) { return mod(((x * 34.0) + 1.0) * x, 289.0); }
float taylorInvSqrt(float r) { return 1.79284291400159 - 0.85373472095314 * r; }
float4 taylorInvSqrt(float4 r) { return float4(taylorInvSqrt(r.x), taylorInvSqrt(r.y), taylorInvSqrt(r.z), taylorInvSqrt(r.w)); }

float3 rand3(float3 c) {
    float j = 4096.0 * sin(dot(c, float3(17.0, 59.4, 15.0)));
    float3 r;
    r.z = frac(512.0 * j);
    j *= .125;
    r.x = frac(512.0 * j);
    j *= .125;
    r.y = frac(512.0 * j);
    return r - 0.5;
}

float _snoise(float3 p) {
    const float F3 = 0.3333333;
    const float G3 = 0.1666667;
    float3 s = floor(p + dot(p, float3(F3, F3, F3)));
    float3 x = p - s + dot(s, float3(G3, G3, G3));

    float3 e = step(float3(0.0, 0.0, 0.0), x - x.yzx);
    float3 i1 = e * (1.0 - e.zxy);
    float3 i2 = 1.0 - e.zxy * (1.0 - e);

    float3 x1 = x - i1 + G3;
    float3 x2 = x - i2 + 2.0 * G3;
    float3 x3 = x - 1.0 + 3.0 * G3;

    float4 w, d;

    w.x = dot(x, x);
    w.y = dot(x1, x1);
    w.z = dot(x2, x2);
    w.w = dot(x3, x3);

    w = max(0.6 - w, 0.0);

    d.x = dot(rand3(s), x);
    d.y = dot(rand3(s + i1), x1);
    d.z = dot(rand3(s + i2), x2);
    d.w = dot(rand3(s + 1.0), x3);

    w *= w;
    w *= w;
    d *= w;

    return dot(d, float4(52.0, 52.0, 52.0, 52.0));
}

[numthreads(10, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    Particle particle = particles[id.x];
    float modifyTime = time / 5.0;
    float positionY = _snoise(float3(particle.position.x / 20.0 + modifyTime, 0.0, particle.position.z / 20.0 + modifyTime)) * 5.0;

    particle.position = float3(particle.position.x, positionY, particle.position.z);
    particles[id.x] = particle;
}

What am I doing wrong? Why is there no increase in calculation speed? :)

Thanks in advance!

Answer 1

TL;DR: Your GPGPU (compute shader) scenario is not optimized, which skews your results. Consider binding a material to the computeBuffer and rendering via Graphics.DrawProcedural. That way everything stays on the GPU.

OP:

What am I doing wrong, why is there no increase in calculation speed?

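Basically, there are two parts to your problem.

(1) Reading from the GPU is slow

As with most things GPU-related, you generally want to avoid reading back from the GPU, because doing so blocks the CPU. The same holds for GPGPU scenarios.

If I had to hazard a guess, the culprit is the computeBuffer.GetData() call in your GPGPU (compute shader) code, shown below: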

private void Update()
{
    UpdateParticlePosition();
}

private void UpdateParticlePosition()
{
    // ...
    computeBuffer.GetData(particlesDataArray); // <----- OUCH! The CPU waits here for the GPU.
}

Unity (emphasis mine):

ComputeBuffer.GetData

Read data values from the buffer into an array...
Note that this function reads the data back from the GPU, which can be slow...If any GPU work has been submitted that writes to this buffer, Unity waits for the tasks to complete before it retrieves the requested data. Tell me more...
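
If the positions really are needed on the CPU, a non-blocking readback is one alternative to GetData. The sketch below is not part of the original answer. It assumes a Unity version that provides UnityEngine.Rendering.AsyncGPUReadback, and the class name, field names, and the 7-float element layout (taken from the HLSL struct in the question) are assumptions. The data arrives some frames after the request, so this only suits cases that tolerate that latency.

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical helper, not from the original post.
public class AsyncParticleReadback : MonoBehaviour
{
    // Assumed to be assigned by the script that creates and dispatches the buffer.
    public ComputeBuffer particleBuffer;
    public Transform[] targets;

    private const int FloatsPerParticle = 7; // float3 position + float4 color
    private bool requestInFlight;

    private void Update()
    {
        if (particleBuffer == null || requestInFlight) return;

        // Queue the copy and return immediately; the callback runs on a later frame.
        requestInFlight = true;
        AsyncGPUReadback.Request(particleBuffer, OnReadback);
    }

    private void OnReadback(AsyncGPUReadbackRequest request)
    {
        requestInFlight = false;
        if (request.hasError) return;

        // View the raw buffer as floats and pick out each particle's y component.
        NativeArray<float> data = request.GetData<float>();
        int count = Mathf.Min(targets.Length, data.Length / FloatsPerParticle);
        for (int i = 0; i < count; i++)
        {
            Vector3 pos = targets[i].position;
            pos.y = data[i * FloatsPerParticle + 1];
            targets[i].position = pos;
        }
    }
}
```

This removes the stall but still pays for the copy and for the per-object transform loop, so keeping the data on the GPU remains the cheaper option when the CPU does not need it.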

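(2) An explicit GPU read is not needed in your scenario

I can see that you are creating 2,500 "particles", each attached to its own GameObject. If the intent is just to draw a simple quad, it is more efficient to create an array of structs containing a Vector3 position and then issue a single batched render call that draws all the particles at once.

Proof: see the n-Body simulation video below. 60+ FPS on a 2014-era NVIDIA graphics card.

For example, this is what I do in my GPGPU n-Body Galaxy Simulation. Note the call StarMaterial.SetBuffer("stars", _starsBuffer) at render time. It tells the GPU to use a buffer that already lives on the GPU, the very same buffer the compute shader uses to move the stars. The CPU never reads from the GPU here.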

public class Galaxy1Controller : MonoBehaviour
{
    public Texture2D HueTexture;

    public int NumStars = 10000; // That's right! 10,000 stars!

    public ComputeShader StarCompute;
    public Material StarMaterial;
    private ComputeBuffer _quadPoints;
    private Star[] _stars;
    private ComputeBuffer _starsBuffer;
.
.
.
    private void Start()
    {
        _updateParticlesKernel = StarCompute.FindKernel("UpdateStars");
        _starsBuffer = new ComputeBuffer(NumStars, Constants.StarsStride);

        _stars = new Star[NumStars];
        // Create initial positions for stars here (not shown)
        _starsBuffer.SetData(_stars);

        _quadPoints = new ComputeBuffer(6, QuadStride);
        _quadPoints.SetData(...); // star quad      
    }

    private void Update()
    {
        // bind resources to compute shader
        StarCompute.SetBuffer(_updateParticlesKernel, "stars", _starsBuffer);
        StarCompute.SetFloat("deltaTime", Time.deltaTime*_manager.MasterSpeed);
        StarCompute.SetTexture(_updateParticlesKernel, "hueTexture", HueTexture);

        // dispatch, launch threads on GPU
        var numberOfGroups = Mathf.CeilToInt((float) NumStars/GroupSize);
        StarCompute.Dispatch(_updateParticlesKernel, numberOfGroups, 1, 1);

        // "Look Ma, no reading from the GPU!"
    }

    private void OnRenderObject()
    {
        // bind resources to material
        StarMaterial.SetBuffer("stars", _starsBuffer);
        StarMaterial.SetBuffer("quadPoints", _quadPoints);

        // set the pass
        StarMaterial.SetPass(0);

        // draw
        Graphics.DrawProcedural(MeshTopology.Triangles, 6, NumStars);
    }
}

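n-Body galaxy simulation of 10,000 stars:

I think everyone will agree that Microsoft's GPGPU documentation is pretty sparse, so the best approach is to look at examples scattered around the internet. One that comes to mind is the excellent "GPU Ray Tracing in Unity" series by Three Eyed Games. See the link below.

See also: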

MickyD, "n-Body Galaxy Simulation using Compute Shaders on GPGPU via Unity 3D", 2014
Kuri, D, "GPU Ray Tracing in Unity – Part 1", 2018

Answer 2

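ComputeBuffer.GetData takes a long time. The CPU copies the data back from the GPU, and this stalls the main thread. You then loop over all the transforms to change their positions, which is certainly faster than thousands of MonoBehaviours, but is still slow. There are two ways to optimize your code.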

CPU

C# Job System + Burst. Detailed tutorial: https://github.com/stella3d/job-system-cookbook
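
A minimal sketch of the CPU route, assuming the Burst and Mathematics packages are installed. It is not taken from the linked cookbook. The class name and the targets field are hypothetical, and noise.snoise (simplex noise from Unity.Mathematics) stands in for Mathf.PerlinNoise, so the motion will look similar but the values will differ.

```csharp
using Unity.Burst;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical sketch, not from the original post.
public class JobifiedMover : MonoBehaviour
{
    // Assumed to hold the spawned objects' transforms before Start runs.
    public Transform[] targets;

    private TransformAccessArray accessArray;

    [BurstCompile]
    private struct MoveJob : IJobParallelForTransform
    {
        public float time;

        public void Execute(int index, TransformAccess transform)
        {
            float3 p = transform.position;
            // Simplex noise, roughly in [-1, 1]; a stand-in for Mathf.PerlinNoise.
            p.y = noise.snoise(new float2(p.x / 20f + time, p.z / 20f + time)) * 5f;
            transform.position = p;
        }
    }

    private void Start()
    {
        accessArray = new TransformAccessArray(targets);
    }

    private void Update()
    {
        var job = new MoveJob { time = Time.time };
        // Completed right away for simplicity; the work still runs on worker threads.
        JobHandle handle = job.Schedule(accessArray);
        handle.Complete();
    }

    private void OnDestroy()
    {
        if (accessArray.isCreated) accessArray.Dispose();
    }
}
```

One script drives every object here, which replaces the 2,500 separate Update calls of Scenario 1 with a single scheduled job.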

GPU

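Use the structured buffer computed in the compute shader directly, without copying it back to the CPU. Here is a detailed tutorial on how to do that: https://catlikecoding.com/unity/tutorials/basics/compute-shaders/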
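
A minimal sketch of the C# side of the GPU route, assuming a Unity version that provides Graphics.DrawMeshInstancedProcedural. The class name and fields are hypothetical. The compute shader is assumed to be the CSMain kernel from the question, and the material is assumed to use a shader that reads the "particles" buffer per instance (that shader is not shown here).

```csharp
using UnityEngine;

// Hypothetical sketch, not from the original post.
public class GpuOnlyParticles : MonoBehaviour
{
    // Mirrors the HLSL struct: float3 position + float4 color = 7 floats.
    private struct Particle
    {
        public Vector3 position;
        public Vector4 color;
    }

    public ComputeShader computeShader; // assumed: contains the CSMain kernel
    public Mesh mesh;                   // assumed: any small mesh, e.g. a cube
    public Material material;           // assumed: its shader reads "particles"

    private const int Side = 50;
    private const int Count = Side * Side;
    private ComputeBuffer buffer;
    private int kernel;

    private void OnEnable()
    {
        var initial = new Particle[Count];
        for (int i = 0; i < Count; i++)
        {
            // Lay the particles out on a 50 x 50 grid in the XZ plane.
            initial[i].position = new Vector3(i / Side, 0f, i % Side);
            initial[i].color = Vector4.one;
        }

        buffer = new ComputeBuffer(Count, sizeof(float) * 7);
        buffer.SetData(initial); // uploaded once; never read back

        kernel = computeShader.FindKernel("CSMain");
        computeShader.SetBuffer(kernel, "particles", buffer);
        material.SetBuffer("particles", buffer);
    }

    private void Update()
    {
        computeShader.SetFloat("time", Time.time);
        // The kernel declares numthreads(10, 1, 1), so one group covers 10 particles.
        computeShader.Dispatch(kernel, Mathf.CeilToInt(Count / 10f), 1, 1);

        // Draw every particle in one call, straight from the GPU buffer.
        var bounds = new Bounds(new Vector3(25f, 0f, 25f), new Vector3(60f, 20f, 60f));
        Graphics.DrawMeshInstancedProcedural(mesh, 0, material, bounds, Count);
    }

    private void OnDisable()
    {
        buffer?.Release();
        buffer = null;
    }
}
```

No GameObjects are created per particle and no GetData call is made, so the per-frame CPU cost is one dispatch and one draw call.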
