Fastest way to calculate the abs()-values of a complex array

Question

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be

for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

But for large vectors that will be slow. Is there a way to speed it up (for example, by using parallelization)? The language can be either C or C++.

Answer 1

Given that all loop iterations are independent, you can parallelize the loop with the following code:

#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

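Of course, to use this you must enable OpenMP support when compiling your code (usually with a compiler flag such as /openmp for MSVC or -fopenmp for GCC and Clang, or by setting the project options).
You can find several examples of OpenMP usage in the wiki.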
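
For reference, a self-contained C++ version of the loop above might look like the sketch below. It assumes the data is stored in a std::vector<std::complex<double>> and uses std::abs in place of C's cabs; the function name abs_parallel is made up for illustration. If OpenMP is not enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Hypothetical helper: b[i] = |a[i]| for every element.
// The pragma only takes effect when OpenMP is enabled at compile time.
void abs_parallel(const std::vector<std::complex<double>>& a,
                  std::vector<double>& b)
{
    assert(a.size() == b.size());
    const long long n = static_cast<long long>(a.size());

    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
    {
        b[i] = std::abs(a[i]);
    }
}
```

The loop index is a signed integer because older OpenMP implementations only accept signed loop variables.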

Answer 2

Or use Concurrency::parallel_for like this:

Concurrency::parallel_for(0, N, [&a, &b](int i)
{
    b[i] = cabs(a[i]);
});

Answer 3

Additionally, you can use std::future and std::async (both are part of C++11), which may be a clearer way to achieve what you want:

#include <future>
#include <vector>

...

int main()
{
    ...

    // Create async calculations (the vector frees the futures automatically)
    std::vector<std::future<void>> futures(N);
    for (int i = 0; i < N; ++i)
    {
        futures[i] = std::async([&a, &b, i]
        {
            b[i] = std::sqrt(a[i]);
        });
    }
    // Wait for calculation of all async procedures
    for (int i = 0; i < N; ++i)
    {
        futures[i].get();
    }

    ...

    return 0;
}

IdeOne live code

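We first create the asynchronous tasks and then wait until everything has been calculated.
Here I use sqrt instead of cabs because I simply don't know what cabs is. I'm sure it doesn't matter.
Also, you may find this link useful: cplusplus.com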
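
Launching one task per element is likely to cost more in scheduling overhead than the computation itself, so a variation worth considering is one task per contiguous chunk. The following is only a sketch under assumptions: the helper name abs_async, the tasks parameter, and the explicit std::launch::async policy are my choices rather than part of the answer above, and std::abs stands in for cabs.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: splits the index range into at most `tasks` contiguous
// chunks and computes b[i] = |a[i]| for each chunk in its own async task.
void abs_async(const std::vector<std::complex<double>>& a,
               std::vector<double>& b,
               std::size_t tasks)
{
    assert(a.size() == b.size());
    assert(tasks > 0);

    const std::size_t n = a.size();
    const std::size_t chunk = (n + tasks - 1) / tasks;  // ceiling division

    std::vector<std::future<void>> futures;
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, n);
        futures.push_back(std::async(std::launch::async, [&a, &b, begin, end]
        {
            for (std::size_t i = begin; i < end; ++i)
            {
                b[i] = std::abs(a[i]);
            }
        }));
    }

    // Wait for every chunk to finish.
    for (auto& f : futures)
    {
        f.get();
    }
}
```

A reasonable starting value for tasks is std::thread::hardware_concurrency(), although the best value depends on the machine and on N.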

Answer 4

Use vector operations.

If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.

Libmvec is vector math library added in Glibc 2.22.

Vector math library was added to support SIMD constructs of OpenMP4.0 (#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding vector implementations of vector math functions.

Vector math functions are vector variants of corresponding scalar math operations implemented using SIMD ISA extensions (e.g. SSE or AVX for x86_64). They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. Using vector math functions is faster than repeatedly calling the scalar math routines.

See also Parallel for vs omp simd: when to use each?
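
As a rough illustration of the OpenMP 4.0 simd construct, the sketch below computes magnitudes from separate real and imaginary arrays. The split layout and the name abs_simd are assumptions for this example, not something the answer prescribes. The pragma needs a compiler with OpenMP 4.0 support (with GCC, -fopenmp or -fopenmp-simd); otherwise it is ignored and the loop runs as ordinary scalar code. Whether the loop is actually vectorized still depends on the compiler and target.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper operating on split real/imaginary arrays.
// It evaluates sqrt(re^2 + im^2) directly, which (unlike hypot) does not
// guard against overflow or underflow of the intermediate squares.
void abs_simd(const std::vector<float>& re,
              const std::vector<float>& im,
              std::vector<float>& out)
{
    assert(re.size() == im.size() && re.size() == out.size());
    const float* x = re.data();
    const float* y = im.data();
    float* z = out.data();
    const int n = static_cast<int>(re.size());

    #pragma omp simd
    for (int i = 0; i < n; ++i)
    {
        z[i] = std::sqrt(x[i] * x[i] + y[i] * y[i]);
    }
}
```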

If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers and obtain the absolute value of each one:

Description

These functions evaluate the function hypot(x, y) for an entire vector of values at once. ...

The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c. I don't recall whether Sun Microsystems ever provided a Linux version of libmvec.so.

Answer 5

If you use a modern compiler (GCC 5, for example), you can use Cilk+, which gives you a nice array notation, automatic use of SIMD instructions, and parallelization.

So, if you want to run the iterations in parallel, you would do:

#include <cilk/cilk.h>

cilk_for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

Or, if you want to test SIMD:

#pragma simd
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

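But the nicest part of Cilk is that you can just do:

b[:] = cabs(a[:]);

In this case, the compiler and the runtime environment decide to what level the operation should be SIMDed and what should be parallelized (the optimal approach is to apply SIMD on large chunks in parallel). Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing near-optimal scheduling, and it should be able to make optimal use of the cache.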

Answer 6

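Using #pragma simd (even with -Ofast) or relying on the compiler's auto-vectorization is one more example of why it is a bad idea to blindly expect your compiler to implement SIMD efficiently. To use SIMD efficiently for this, you need an array of struct of arrays. For example, for single-precision floats with a SIMD width of 4 you could use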

//struct of arrays of four complex numbers
struct c4 {
    float x[4];  // real values of four complex numbers 
    float y[4];  // imaginary values of four complex numbers
};

The following code shows how to do this with SSE for the x86 instruction set.

#include <stdio.h>
#include <x86intrin.h>
#define N 10

struct c4{
    float x[4];
    float y[4];
};

static inline void cabs_soa4(struct c4 *a, float *b) {
    __m128 x4 = _mm_loadu_ps(a->x);
    __m128 y4 = _mm_loadu_ps(a->y);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
    _mm_storeu_ps(b, b4);
}  

int main(void)
{
    int n4 = ((N+3)&-4)/4;  //choose next multiple of 4 and divide by 4
    printf("%d\n", n4);
    struct c4  a[n4];  //array of struct of arrays
    for(int i=0; i<n4; i++) {
        for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
    }
    float b[4*n4];
    for(int i=0; i<n4; i++) {
        cabs_soa4(&a[i], &b[4*i]);
    }
    for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}
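
If the input arrives as ordinary interleaved complex numbers, it has to be repacked into this layout first. The sketch below shows one way that conversion might look in portable C++. The helper name to_soa4 and the choice to zero-fill unused lanes in the last block are assumptions made for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Same layout as the c4 struct above.
struct c4 {
    float x[4];  // real parts of four complex numbers
    float y[4];  // imaginary parts of four complex numbers
};

// Hypothetical helper: packs n interleaved complex numbers into
// ceil(n / 4) blocks; unused lanes in the last block stay zero.
std::vector<c4> to_soa4(const std::vector<std::complex<float>>& in)
{
    const std::size_t n = in.size();
    std::vector<c4> out((n + 3) / 4, c4{});  // c4{} zero-initializes
    assert(out.size() * 4 >= n);
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i / 4].x[i % 4] = in[i].real();
        out[i / 4].y[i % 4] = in[i].imag();
    }
    return out;
}
```

The repacking itself costs a pass over memory, so it pays off mainly when the data can be kept in this layout for several subsequent operations.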

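Unrolling the loop a few times may help. In any case, all of this is moot for large N, because the operation is memory-bandwidth bound. For large N (meaning the memory usage is much larger than the last-level cache), #pragma omp parallel may help somewhat, but the best solution is not to do this for large N at all. Instead, do it in chunks that fit in the lowest-level cache, together with other compute operations. I mean something like this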

for(int i = 0; i < nchunks; i++) {
    for(int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive.
}

I did not implement the array of struct of arrays here, but it should be easy to adjust the code for that.
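
To make the chunking idea concrete, here is a minimal self-contained sketch. The names foo_sum and process_in_chunks are hypothetical, and foo_sum is only a trivial stand-in for the computationally intensive foo. Unlike the loop above, this version reuses one chunk-sized buffer instead of keeping the whole b array, and it also handles a final partial chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the computationally intensive foo():
// here it just sums the magnitudes in one chunk.
double foo_sum(const double* b, std::size_t count)
{
    double sum = 0.0;
    for (std::size_t j = 0; j < count; ++j) sum += b[j];
    return sum;
}

// Processes `a` in chunks of at most chunk_size elements, reusing one small
// buffer so the magnitudes are consumed while they are still in cache.
double process_in_chunks(const std::vector<std::complex<double>>& a,
                         std::size_t chunk_size)
{
    assert(chunk_size > 0);
    std::vector<double> b(chunk_size);  // reused for every chunk
    double total = 0.0;
    for (std::size_t start = 0; start < a.size(); start += chunk_size)
    {
        const std::size_t count = std::min(chunk_size, a.size() - start);
        for (std::size_t j = 0; j < count; ++j)
        {
            b[j] = std::abs(a[start + j]);
        }
        total += foo_sum(b.data(), count);
    }
    return total;
}
```

A suitable chunk_size depends on the cache sizes of the target machine and on how much memory foo itself touches, so it is best chosen by measurement.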
