For C++ Vector3 utility class implementations, is an array faster than a struct or a class?

Question

Out of curiosity, I implemented a vector3 utility in three ways: as an array (using a typedef), as a class, and as a struct.

This is the array implementation:

typedef float newVector3[3];

namespace vec3{
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void dot(const newVector3& first, const newVector3& second, float& out_result);
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3);

    // implementations, nothing fancy...really

    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3)
    {
        out_newVector3[0] = first[0] + second[0];
        out_newVector3[1] = first[1] + second[1];
        out_newVector3[2] = first[2] + second[2];
    }

    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] - second[0];
        out_newVector3[1] = first[1] - second[1];
        out_newVector3[2] = first[2] - second[2];
    }

    void dot(const newVector3& first, const newVector3& second, float& out_result){
        out_result = first[0]*second[0] + first[1]*second[1] + first[2]*second[2];
    }

    // NOTE: as posted, this computes a component-wise product, not a true cross product.
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] * second[0];
        out_newVector3[1] = first[1] * second[1];
        out_newVector3[2] = first[2] * second[2];
    }
}
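
One thing worth noting about the array implementation above: `cross` as posted multiplies the components pairwise, which is not a cross product. The timings in this question refer to the posted version. A minimal sketch of a true cross product for the same `newVector3` typedef might look like the following; the name `cross_product` is hypothetical and not part of the original post, and the temporaries are there on the assumption that a caller may pass the same array as both input and output.

```cpp
typedef float newVector3[3];

namespace vec3 {
    // Hypothetical helper (not in the original post): a true cross product.
    // The components are computed into temporaries first, so `out` may
    // safely alias `a` or `b`.
    inline void cross_product(const newVector3& a, const newVector3& b, newVector3& out)
    {
        const float x = a[1] * b[2] - a[2] * b[1];
        const float y = a[2] * b[0] - a[0] * b[2];
        const float z = a[0] * b[1] - a[1] * b[0];
        out[0] = x;
        out[1] = y;
        out[2] = z;
    }
}
```

Crossing a vector with itself always yields the zero vector, so a benchmark that passes the same operand twice would produce only zeros with this version.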

And the class implementation:

class Vector3{
private:
    float x;
    float y;
    float z;

public:
    // constructors
    Vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }

    Vector3(const Vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};

Of course, it also contains the other functionality that usually appears in a Vector3 class.

Finally, the struct implementation:

struct s_vector3{
    float x;
    float y;
    float z;

    // constructors
    s_vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }

    s_vector3(const s_vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
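    }

    // ... other common Vector3 functionality omitted
};

Again, I have omitted some other common Vector3 functionality. Now I let all three of them create 9,000,000 new objects and perform 9,000,000 cross products (after each one finishes I write a large chunk of data to flush the cache, so that caching does not help the next one).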

Here is the test code:

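#include <cstdio>   // printf
#include <cstdlib>  // rand
#include <ctime>    // std::clock

const int K_OPERATION_TIME = 9000000;
// Element count, not bytes: 20 Mi longs is 80 to 160 MB depending on sizeof(long)
const size_t bigger_than_cachesize = 20 * 1024 * 1024;

void cleanCache()
{
    // flush the cache by writing a buffer much larger than it
    long *p = new long[bigger_than_cachesize];
    for(size_t i = 0; i < bigger_than_cachesize; i++)
    {
       p[i] = rand();
    }
    delete[] p; // the original leaked this buffer on every call
}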

int main(){

    cleanCache();
    // first, the Vector3 struct
    std::clock_t start;
    double duration;

    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();
    // second, the Vector3 array implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
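        // explicit casts: an int -> float narrowing conversion is ill-formed inside braces
        newVector3 newVector3Array = {float(i), float(i), float(i)};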
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();
    // Third, the Vector3 class implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        Vector3 newVector3Class = Vector3(i,i,i);
        newVector3Class = Vector3::cross(newVector3Class, newVector3Class);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The class implementation of Vector3 takes %f seconds.\n", duration);


    return 0;
}

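The result was shocking.

The struct and class implementations finish the task in about 0.23 seconds, whereas the array implementation takes only 0.08 seconds!

If arrays really do have such a significant performance advantage, they would be worth using in many cases despite the ugly syntax.

So I would really like to make sure: is this supposed to happen? Thanks!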

Answer 1

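Short answer: it depends. As you observed, there is a difference if you compile without optimization.

When I compile with optimization (-O2 or -O3, with all functions inlined), there is no difference (keep reading, it is not as simple as it looks).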

 Optimization    Time in seconds (struct vs. array)
    -O0              0.27 vs. 0.12
    -O1              0.14 vs. 0.04
    -O2              0.00 vs. 0.00
    -O3              0.00 vs. 0.00

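There is no guarantee which optimizations your compiler can or will perform, so the complete answer is "it depends on your compiler". By default I would trust my compiler to do the right thing; otherwise I might as well start programming in assembly. Only if this part of the code is a real bottleneck is it worth thinking about helping the compiler.

When compiled with -O2, your code takes 0.0 seconds for both versions, but that is because the optimizer sees that the values are never used, so it simply throws away the entire code!

Let's make sure that does not happen: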

#include <ctime>
#include <cstdio>

const int K_OPERATION_TIME = 1000000000;

int main(){
    std::clock_t start;
    double duration;

    start = std::clock();

    double checksum=0.0;
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
        checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z; // actually using the result of cross-product!
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    // second, the Vector3 array implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
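        // explicit casts: an int -> float narrowing conversion is ill-formed inside braces
        newVector3 newVector3Array = {float(i), float(i), float(i)};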
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
        checksum+=opResult[0] +opResult[1]+opResult[2];  // actually using the result of cross-product!
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    printf("Checksum: %f\n", checksum);
}

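You will notice the following changes:

No cache effects are involved (there are no cache misses), so I simply removed the code responsible for flushing the cache.
There is no performance difference between a class and a struct (after compilation there really is none; the public/private distinction is syntactic sugar and only skin-deep), so I look only at the struct.
The result of the cross product is actually used, so it cannot be optimized away.
There are now 1e9 iterations, to get meaningful timings.

With these changes we see the following timings (Intel compiler):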

 Optimization    Time in seconds (struct vs. array)
    -O0              33.2 vs. 17.1
    -O1              19.1 vs. 7.8
    -Os              19.2 vs. 7.9
    -O2              0.7 vs. 0.7
    -O3              0.7 vs. 0.7

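I am a bit disappointed that -Os performs so poorly, but otherwise you can see that, with optimization, there is no difference between the struct and the array!

Personally I like -Os a lot, because it produces assembly I can understand, so let's look at why it is so slow.

The most obvious thing, without even looking at the generated assembly: s_vector3::cross returns an s_vector3 object, but we assign the result to an already existing object, so if the optimizer does not see that the old object is no longer used, it may not be able to perform RVO. So let's replace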

newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z;

with:

s_vector3 r = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=r.x +r.y+r.z;

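Now the results are 2.14 seconds (struct) vs. 7.9 seconds (array), which is a big improvement!

My takeaway: the optimizer does a good job, but we can give it a little help if needed.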
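
To make the difference between the two variants concrete, here is a small sketch that counts copy operations. The type `Tracked` and the functions `multiply`, `assign_to_existing` and `init_new_object` are hypothetical and not part of the original benchmark; the sketch assumes a compiler that elides the copy when a fresh object is initialized from a returned temporary (guaranteed since C++17, and common practice before that).

```cpp
// Hypothetical instrumented type (not from the original post) that counts
// how often it is copy-constructed and copy-assigned.
struct Tracked {
    float x, y, z;
    static int copy_ctor_calls;
    static int copy_assign_calls;

    Tracked(float nx, float ny, float nz) : x(nx), y(ny), z(nz) {}
    Tracked(const Tracked& o) : x(o.x), y(o.y), z(o.z) { ++copy_ctor_calls; }
    Tracked& operator=(const Tracked& o) {
        x = o.x; y = o.y; z = o.z;
        ++copy_assign_calls;
        return *this;
    }
};
int Tracked::copy_ctor_calls = 0;
int Tracked::copy_assign_calls = 0;

// Returns a temporary by value, like the static cross() used in the benchmark.
inline Tracked multiply(const Tracked& a, const Tracked& b) {
    return Tracked(a.x * b.x, a.y * b.y, a.z * b.z);
}

// Variant 1: assign the result to an already existing object.
inline float assign_to_existing(float v) {
    Tracked t(v, v, v);
    t = multiply(t, t);          // a temporary is built, then copy-assigned into t
    return t.x + t.y + t.z;
}

// Variant 2: initialize a fresh object from the result.
inline float init_new_object(float v) {
    Tracked t(v, v, v);
    Tracked r = multiply(t, t);  // r is constructed in place (copy elision)
    return r.x + r.y + r.z;
}
```

Calling `assign_to_existing` bumps `copy_assign_calls` by one per call, while `init_new_object` should leave both counters untouched. Whether the extra assignment costs anything measurable still depends on how much the optimizer can see through it, as the timings above suggest.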

Answer 2

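In this case, no. As far as the CPU is concerned, classes, structs, and arrays are simply memory layouts, and the layout in this case is exactly the same. In non-release builds, inline methods may be compiled into actual function calls (mainly so that a debugger can step into them), so that may have a small impact.

Addition is not really a good way to test the performance of a Vec3 type. Dot and/or cross products are usually a better test.

If you really care about performance, you basically want to take a structure-of-arrays approach (rather than the array of structures shown above). This tends to allow the compiler to apply auto-vectorization.

That is, instead of this: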
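
As a rough way to check the layout claim on a given platform, the three representations can be compared directly. The names `ArrayVec3`, `StructVec3`, `ClassVec3` and `same_size_everywhere` below are hypothetical stand-ins for the types in the question. Equal sizes are what mainstream compilers produce, but the standard does not promise it, so the size comparison is done at run time instead of being asserted at compile time.

```cpp
#include <type_traits>

typedef float ArrayVec3[3];

struct StructVec3 { float x, y, z; };

class ClassVec3 {
public:
    ClassVec3(float nx, float ny, float nz) : x(nx), y(ny), z(nz) {}
    float sum() const { return x + y + z; }
private:
    float x, y, z;
};

// Both user-defined types keep all members under one access level and have
// no virtual functions, so they are standard-layout types.
static_assert(std::is_standard_layout<StructVec3>::value, "struct is standard-layout");
static_assert(std::is_standard_layout<ClassVec3>::value, "class is standard-layout");

// Hypothetical helper: true if all three representations occupy the same
// number of bytes on this platform (typical, but an assumption).
inline bool same_size_everywhere() {
    return sizeof(StructVec3) == sizeof(ArrayVec3) &&
           sizeof(ClassVec3) == sizeof(ArrayVec3);
}
```

On a typical desktop compiler all three occupy three floats with no padding, which is why the optimized timings converge.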

constexpr int N = 100000;
struct Vec3 {
  float x, y, z; 
};
inline float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
void dotLots(float* dps, const Vec3 a[N], const Vec3 b[N])
{
  for(int i = 0; i < N; ++i)
    dps[i] = dot(a[i], b[i]);
}

you would do this:

constexpr int N = 100000;
struct Vec3SOA {
  float x[N], y[N], z[N]; 
};
void dotLotsSOA(float* dps, const Vec3SOA& a, const Vec3SOA& b)
{
  for(int i = 0; i < N; ++i)
  {
    dps[i] = a.x[i]*b.x[i] + a.y[i]*b.y[i] + a.z[i]*b.z[i];
  }
}

If you compile with -mavx2 and -mfma, the latter version optimizes quite nicely.
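
A small harness can confirm that the two layouts compute the same dot products before comparing their speed. This is only a sketch: `countMismatches` is a hypothetical helper, `N` is reduced to 1024 to keep the demo light, and the inputs are small integers chosen so that every result is exactly representable in a float regardless of whether the compiler fuses multiplies and adds.

```cpp
#include <vector>

constexpr int N = 1024;  // assumption: smaller than in the answer, for a quick check

struct Vec3 { float x, y, z; };
struct Vec3SOA { float x[N], y[N], z[N]; };

inline float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Array-of-structures version, as in the answer.
void dotLots(float* dps, const Vec3* a, const Vec3* b)
{
    for (int i = 0; i < N; ++i)
        dps[i] = dot(a[i], b[i]);
}

// Structure-of-arrays version, as in the answer.
void dotLotsSOA(float* dps, const Vec3SOA& a, const Vec3SOA& b)
{
    for (int i = 0; i < N; ++i)
        dps[i] = a.x[i]*b.x[i] + a.y[i]*b.y[i] + a.z[i]*b.z[i];
}

// Hypothetical helper: fills both layouts with identical data and returns
// the number of elements where the two versions disagree.
int countMismatches()
{
    std::vector<Vec3> a(N), b(N);
    static Vec3SOA sa, sb;  // static storage keeps the large arrays off the stack
    for (int i = 0; i < N; ++i) {
        const float f = static_cast<float>(i);
        a[i] = Vec3{f, f + 1.0f, f + 2.0f};
        b[i] = Vec3{1.0f, 2.0f, 3.0f};
        sa.x[i] = a[i].x; sa.y[i] = a[i].y; sa.z[i] = a[i].z;
        sb.x[i] = b[i].x; sb.y[i] = b[i].y; sb.z[i] = b[i].z;
    }
    std::vector<float> r1(N), r2(N);
    dotLots(r1.data(), a.data(), b.data());
    dotLotsSOA(r2.data(), sa, sb);
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (r1[i] != r2[i]) ++mismatches;
    return mismatches;
}
```

The SoA layout tends to vectorize better because each of `x`, `y` and `z` is contiguous, so a SIMD load can fetch several consecutive `x` values at once instead of gathering them from interleaved structs.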
