std::array of AVX intrinsics

Question

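I don't know if I'm missing something in my understanding of how AVX intrinsics work with std::array, but I'm running into a strange issue with Clang when I combine the two.

Example code: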

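#include <array>
#include <cstddef>
#include <immintrin.h>
#include <iostream>

std::array<__m256, 1> gen_data()
{
    std::array<__m256, 1> res;
    res[0] = _mm256_set1_ps(1.0f); // broadcast 1.0f to all 8 lanes
    return res;                    // returned by value
}

int main()
{
    auto v = gen_data();
    float a[8];
    _mm256_storeu_ps(a, v[0]);     // unaligned store of the 8 lanes
    for(std::size_t i = 0; i < 8; ++i)
    {
        std::cout << a[i] << std::endl;
    }
}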

Output from Clang 3.5.0 (the upper 4 floats are garbage):

1
1
1
1
8.82272e-39
0
5.88148e-39
0

Output from GCC 4.8.2/4.9.1 (expected):

1
1
1
1
1
1
1
1

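If I instead pass v to gen_data as an output parameter, it works fine on both compilers. I'm willing to accept that this may be a bug in Clang, but I don't know whether it could be undefined behavior (UB). Testing with Clang 3.7* (svn build) and Clang now seems to give the results I expect. If I switch to the 128-bit SSE intrinsics (__m128), all compilers give the same expected results.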
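For reference, the output-parameter variant mentioned above could look roughly like the following minimal sketch. The names vec8, splat, store8 and generated_lane are illustrative and not part of the original code, and the non-AVX branch is a hypothetical stand-in type so that the sketch also builds when AVX is not enabled at compile time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
using vec8 = __m256;  // the real AVX vector type
inline vec8 splat(float x) { return _mm256_set1_ps(x); }
inline void store8(float* dst, vec8 v) { _mm256_storeu_ps(dst, v); }
#else
// Hypothetical stand-in (assumption): 8 floats with 32-byte alignment.
struct alignas(32) vec8 { float lane[8]; };
inline vec8 splat(float x) {
    vec8 v;
    for (float& l : v.lane) l = x;
    return v;
}
inline void store8(float* dst, const vec8& v) {
    for (std::size_t i = 0; i < 8; ++i) dst[i] = v.lane[i];
}
#endif

// Output-parameter form of gen_data: the caller owns the array, so no
// std::array<vec8, 1> is returned by value.
void gen_data(std::array<vec8, 1>& out) { out[0] = splat(1.0f); }

// Returns lane i of the generated vector after an unaligned store.
float generated_lane(std::size_t i) {
    assert(i < 8);
    std::array<vec8, 1> v;
    gen_data(v);
    float a[8];
    store8(a, v[0]);
    return a[i];
}
```

Calling generated_lane(i) for i from 0 to 7 should yield 1 for every lane, matching the expected output above.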

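So my questions are:

Is there UB here? Or is this just a bug in Clang 3.5.0?
Is my understanding correct that __m256 is just a 32-byte-aligned block of memory? Or is there something special about it that I need to be aware of?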

Answer 1

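This looks like a clang bug that has since been fixed, as we can see from this bug report, which demonstrates a very similar problem using a regular array.

Assuming std::array implements its storage similar to this: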

T elems[N];

which both libc++ and libstdc++ appear to do, then the two cases should behave similarly. One of the comments says:
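To make the storage argument concrete, here is a small sketch of such a wrapper together with a layout check. The types vec8 and simple_array and the function std_array_matches_plain_array are hypothetical names introduced for illustration; vec8 is only a stand-in with the same size and alignment one would expect from __m256, and the std::array comparison reflects how common implementations behave rather than a guarantee of the standard.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __m256 (assumption): 8 floats, 32-byte aligned.
struct alignas(32) vec8 { float lane[8]; };

// Minimal array wrapper whose only data member is a plain array,
// mirroring the "T elems[N];" storage described above.
template <typename T, std::size_t N>
struct simple_array {
    T elems[N];
    T& operator[](std::size_t i) {
        assert(i < N);
        return elems[i];
    }
};

// With a single 32-byte-aligned element, the wrapper adds no padding.
static_assert(sizeof(simple_array<vec8, 1>) == sizeof(vec8),
              "wrapper should be exactly one vector in size");
static_assert(alignof(simple_array<vec8, 1>) == alignof(vec8),
              "wrapper should inherit the vector's alignment");

// Runtime check that std::array has the same size and alignment as a
// plain array of the same type (expected on libc++ and libstdc++).
bool std_array_matches_plain_array() {
    return sizeof(std::array<vec8, 1>) == sizeof(vec8[1]) &&
           alignof(std::array<vec8, 1>) == alignof(vec8[1]);
}
```

If the check holds, returning a std::array of one vector is, in layout terms, the same as returning a struct that wraps a one-element plain array, which is the case the bug report describes.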

However, libc++'s std::array<__m256i, 1> does not work at any optimization level.

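The bug report was actually based on this SO question, which is very similar but deals with the regular array case.

The bug report contains one possible workaround, which the OP states would suffice: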

In my actual code, num_vectors is calculated based on some C++ template parameters to the simd_pack type. In many cases, that comes out to be 1, but it also is often greater than 1. Your observation gives me an idea, though; I could try to introduce a template specialization that catches the case where num_vectors == 1. It could instead just use a single __m256 member instead of an array of size 1. I'll have to check to see how feasible that is.
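The specialization idea in the quote could be sketched as follows. Only the names simd_pack and num_vectors come from the bug report; the member names, the vec accessor, the uses_array helper and the vec8 stand-in type are assumptions made for illustration, not the reporter's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __m256 (assumption): 8 floats, 32-byte aligned.
struct alignas(32) vec8 { float lane[8]; };

// General case (num_vectors > 1): the vectors live in a plain array.
template <typename V, std::size_t NumVectors>
struct simd_pack {
    V vecs[NumVectors];
    V& vec(std::size_t i) {
        assert(i < NumVectors);
        return vecs[i];
    }
    static constexpr bool uses_array() { return true; }
};

// Partial specialization for num_vectors == 1: hold a single vector
// member instead of an array of size 1.
template <typename V>
struct simd_pack<V, 1> {
    V single;
    V& vec(std::size_t i) {
        assert(i == 0);
        (void)i;  // silence unused-parameter warnings when NDEBUG is set
        return single;
    }
    static constexpr bool uses_array() { return false; }
};
```

Code that only goes through vec(i) works with either form, so callers do not need to know which storage was selected.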
