Why does my Java code run faster than my C++ code in this case?

Question

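I wrote a small benchmark in which the program creates 10⁸ two-dimensional vector structs of {float, float} in a std::vector, and then sums the squares of their lengths.

Here is the C++ code: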

#include <iostream>
#include <chrono>
#include <vector>
#include <array>
#include <cmath>
    
using namespace std;
using namespace std::chrono;
    
const int COUNT = pow(10, 8);
    
class Vec {
public:
    float x, y;
    
    Vec() {}
    
    Vec(float x, float y) : x(x), y(y) {}
    
    float len() {
        return x * x + y * y;
    }
};
    
int main() {
    vector <Vec> vecs;
    
    for(int i = 0; i < COUNT; ++i) {
        vecs.emplace_back(i / 3, i / 5);
    }
    
    auto start = high_resolution_clock::now();
    
    // This loop is timed
    float sum = 0;
    for(int i = 0; i < COUNT; ++i) {
        sum += vecs[i].len();
    }
    
    auto stop = high_resolution_clock::now();
    
    cout << "finished in " << duration_cast <milliseconds> (stop - start).count()
         << " milliseconds" << endl;
    cout << "result: " << sum << endl;
    
    return 0;
}

I used this makefile to build it (g++ version 7.5.0):

build:
	g++ -std=c++17 -O3 main.cpp -o program #-ffast-math

run: build
	clear
	./program

Here is my Java code:

public class MainClass {
    static final int COUNT = (int) Math.pow(10, 8);

    static class Vec {
        float x, y;

        Vec(float x, float y) {
            this.x = x;
            this.y = y;
        }

        float len() {
            return x * x + y * y;
        }
    }

    public static void main(String[] args) throws InterruptedException {

        Vec[] vecs = new Vec[COUNT];

        for (int i = 0; i < COUNT; ++i) {
            vecs[i] = new Vec(i / 3, i / 5);
        }

        long start = System.nanoTime();

        // This loop is timed
        float sum = 0;
        for (int i = 0; i < COUNT; ++i) {
            sum += vecs[i].len();
        }

        long duration = System.nanoTime() - start;
        System.out.println("finished in " + duration / 1000000 + " milliseconds");
        System.out.println("result: " + sum);
    }
}

It is compiled and run with Java 11.0.4.

Here are the results (average of a few runs, on 64-bit Ubuntu 18.04):

c++:  262 ms
java: 230 ms

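I tried a few things to make the C++ code faster:

Using std::array instead of std::vector
Using a plain array instead of std::vector
Using an iterator in the for loop

However, none of the above resulted in any improvement.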

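I also noticed a few interesting things:

When I time the whole main() function (allocation + computation), C++ is much better. However, this might be due to the JVM's warm-up time.
For a smaller number of objects, such as 10⁷, C++ is slightly faster (by a few milliseconds).
Turning on -ffast-math makes the C++ program several times faster than Java, but the result of the computation is slightly different. Furthermore, I have read in a few posts that using this flag is unsafe.

In this case, can I somehow modify my C++ code so that it is as fast as the Java version, or faster?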

Answer 1

Try this:

    float sum = std::transform_reduce(
        std::execution::par_unseq,
        begin(vecs), end(vecs),
        0.f,
        std::plus<>{},
        [](auto&& x){
            return x.len();
        }
    );

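This explicitly tells the C++ compiler what you are trying to do: that you are fine with it using extra threads, that each loop iteration does not depend on the others, and that you want the work done in floats.

It does mean that the additions may happen in a different order from the one you wrote, so the output value may not be exactly the same.

Live example with the raw loop on one side and permission to add out of order on the other.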
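For completeness, here is a self-contained sketch of the same idea. It is a variant under stated assumptions, not the code from the live example: it drops the std::execution::par_unseq policy so that it builds without a parallel backend, and the Vec struct and the sum_len name are written here only for illustration. Even without a policy, std::transform_reduce is allowed to regroup the additions, so the result may differ slightly from the raw loop. It needs a C++17 standard library.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Same shape as the Vec class in the question (assumed here for illustration).
struct Vec {
    float x, y;
    float len() const { return x * x + y * y; }
};

// Hypothetical helper: sums len() over all elements.
// No execution policy is passed (an assumption of this sketch), but the
// standard still lets the implementation group the additions freely.
float sum_len(const std::vector<Vec>& vecs) {
    return std::transform_reduce(
        vecs.begin(), vecs.end(),
        0.0f,                                   // initial value, keeps the sum in float
        std::plus<>{},                          // reduction step
        [](const Vec& v) { return v.len(); });  // per-element transform
}
```

To put the policy back, pass std::execution::par_unseq as the first argument and add #include <execution>; with libstdc++ this typically also means linking against TBB.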

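Further investigation:

So I spun up a godbolt.

In it, I compared the code with and without forced vectorization and -ffast-math. Forcing vectorization and using -ffast-math resulted in the same assembly code.

The issue is the accumulator. Adding things to the sum one at a time, with full IEEE rounding after every addition, gives a different value than accumulating them N at a time in higher-precision floating-point values and then storing the result back into a float in bulk.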
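The rounding effect described above can be seen in a much simpler setting than the benchmark. The following is only an illustrative sketch: the function names are made up, and it assumes float arithmetic is evaluated in single precision (as on x86-64 with SSE) and that -ffast-math is off.

```cpp
#include <cstddef>

// Adds 1.0f to a float accumulator n times; the sum is rounded to float
// after every single addition.
float sum_ones_float(std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += 1.0f;
    }
    return sum;
}

// Performs the same additions in a double accumulator and rounds to float
// only once, at the end.
float sum_ones_double(std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += 1.0;
    }
    return static_cast<float>(sum);
}
```

For n = 2^25 the float version stops at 16777216 (2^24), because 2^24 + 1 is not representable as a float and the addition rounds back down, while the double version returns 33554432.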

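If you use -ffast-math you get 2x the speed and a different accumulation. If you replace float sum with double sum, you get the same answer as with -ffast-math and vectorization.

Basically, the clang vectorizer does not see an easy way to vectorize the accumulation of the sum without breaking the exact single-precision floating-point requirements.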
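One way to grant that permission by hand, without -ffast-math, is to write the reordered sum out explicitly with several independent accumulators. This is a minimal sketch, not something from the godbolt comparison: the lane count of 8 and the name sum_len_lanes are assumptions, and whether the compiler actually vectorizes the loop depends on the compiler and flags. The result can differ slightly from the raw loop, because the additions are grouped differently.

```cpp
#include <cstddef>
#include <vector>

// Same shape as the Vec class in the question (assumed here for illustration).
struct Vec {
    float x, y;
    float len() const { return x * x + y * y; }
};

// Hypothetical helper: sums len() using kLanes independent float accumulators.
// Each accumulator only sees every kLanes-th element, so the order of the
// additions is fixed in the source and no reassociation is required.
float sum_len_lanes(const std::vector<Vec>& vecs) {
    constexpr std::size_t kLanes = 8;  // assumed lane count, e.g. 8 floats per AVX register
    float partial[kLanes] = {};
    const std::size_t n = vecs.size();
    const std::size_t blocked = n - n % kLanes;  // largest multiple of kLanes <= n

    for (std::size_t i = 0; i < blocked; i += kLanes) {
        for (std::size_t k = 0; k < kLanes; ++k) {
            partial[k] += vecs[i + k].len();
        }
    }

    // Combine the partial sums, then add the leftover tail elements.
    float sum = 0.0f;
    for (std::size_t k = 0; k < kLanes; ++k) {
        sum += partial[k];
    }
    for (std::size_t i = blocked; i < n; ++i) {
        sum += vecs[i].len();
    }
    return sum;
}
```

Because the grouping is spelled out in the source, the output is the same at every optimization level, which is not the case once -ffast-math lets the compiler choose the grouping.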
