Using a union (encapsulated in a struct) to bypass conversions for NEON data types

Question

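My first approach to vectorization intrinsics was with SSE, where there is basically one data type, __m128i. Switching to NEON, I found the data types and function prototypes to be much more specific, e.g. uint8x16_t (a vector of 16 unsigned char), uint8x8x2_t (2 vectors of 8 unsigned char each), uint32x4_t (a vector of 4 uint32_t), etc.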

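At first I was enthusiastic (it is much easier to find the exact function that operates on the desired data type), then I saw what a mess it becomes when I want to treat the data in different ways. Using specific casting operators would take me forever. The problem is also addressed here. I then came up with the idea of a union encapsulated in a struct, plus some casting and assignment operators.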

struct uint_128bit_t { union {
        uint8x16_t uint8x16;
        uint16x8_t uint16x8;
        uint32x4_t uint32x4;
        uint8x8x2_t uint8x8x2;
        uint8_t uint8_array[16] __attribute__ ((aligned (16) ));
        uint16_t uint16_array[8] __attribute__ ((aligned (16) ));
        uint32_t uint32_array[4] __attribute__ ((aligned (16) ));
    };

    operator uint8x16_t& () {return uint8x16;}
    operator uint16x8_t& () {return uint16x8;}
    operator uint32x4_t& () {return uint32x4;}
    operator uint8x8x2_t& () {return uint8x8x2;}
    uint8x16_t& operator =(const uint8x16_t& in) {uint8x16 = in; return uint8x16;}
    uint8x8x2_t& operator =(const uint8x8x2_t& in) {uint8x8x2 = in; return uint8x8x2;}

};

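This approach works for me: I can use a variable of type uint_128bit_t as an argument to, and as the output of, different NEON intrinsics, e.g. vshlq_n_u32, vuzp_u8, vget_low_u8 (in this case just as input). I can extend it with more data types if needed. Note: the arrays are there to make it easy to print the contents of a variable.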

Is this a correct way of proceeding?
Is there any hidden flaw?
Have I reinvented the wheel?
(Is the aligned attribute needed?)

Answer 1

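According to the C++ standard, this data type is nearly useless (and certainly so for the purpose you intend), because reading from an inactive member of a union is undefined behaviour.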

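Your compiler may, however, promise to make this work. But you have not asked about any particular compiler, so it is impossible to comment further.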

Answer 2

Since the initially proposed method has undefined behaviour in C++, I have implemented something like this:

template <typename T>
struct NeonVectorType {

    private:
    T data;

    public:
    template <typename U>
    operator U () {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to convert to data type of different size");
        U u;
        memcpy( &u, &data, sizeof u );
        return u;
    }

    template <typename U>
    NeonVectorType<T>& operator =(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to copy from data type of different size");
        memcpy( &data, &in, sizeof data );
        return *this;
    }

};

and then:

typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.

The use of memcpy is discussed here (and here), and avoids breaking the strict aliasing rule. Note that in general it gets optimized away.
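As a minimal, portable sketch of the memcpy technique described above (not the exact class from this answer): `Bytes16` and `Words4` are hypothetical stand-ins for NEON types such as uint8x16_t and uint32x4_t so that the code builds on any host, `bit_copy` is an illustrative name, and C++11 `static_assert` is swapped in for the Boost macro.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-ins for NEON vector types (assumption: same size, 16 bytes).
struct Bytes16 { std::uint8_t  lane[16]; };
struct Words4  { std::uint32_t lane[4];  };

// Copy the object representation of `from` into a new object of type To.
// memcpy is the well-defined way to reinterpret bytes; no aliasing rule is broken.
template <typename To, typename From>
To bit_copy(const From& from) {
    static_assert(sizeof(To) == sizeof(From),
                  "source and destination sizes differ");
    static_assert(std::is_trivially_copyable<To>::value,
                  "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value,
                  "From must be trivially copyable");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

// Reinterpret 16 bytes as four 32-bit words and back again,
// then check that every byte survived the round trip.
inline bool round_trip_preserves_bytes() {
    Bytes16 in;
    for (int i = 0; i < 16; ++i) {
        in.lane[i] = static_cast<std::uint8_t>(i * 3 + 1);
    }
    Words4 words = bit_copy<Words4>(in);
    Bytes16 out = bit_copy<Bytes16>(words);
    return std::memcmp(&in, &out, sizeof in) == 0;
}
```

On a NEON target the same helper would presumably be called as `bit_copy<uint32x4_t>(some_uint8x16)`. A size mismatch, such as converting a 16-byte value to an 8-byte one, is rejected at compile time by the first `static_assert`.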

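If you look at the edit history, I had implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since those data types are declared as arrays (see guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler is bound to treat the memcpy correctly.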

Finally, to print the contents of a variable one could use a function like this.
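The linked function is not reproduced in this thread, so the following is only a hypothetical stand-in. It assumes the value is trivially copyable (which should hold for the NEON vector types, though that is an assumption here) and formats its raw bytes in memory order as hexadecimal. The name `to_hex_bytes` is illustrative.

```cpp
#include <cstddef>
#include <cstring>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format the raw bytes of a value as space-separated hex.
// The bytes are read through memcpy, so no union or pointer cast is involved.
template <typename V>
std::string to_hex_bytes(const V& value) {
    unsigned char bytes[sizeof(V)];
    std::memcpy(bytes, &value, sizeof(V));

    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < sizeof(V); ++i) {
        if (i != 0) {
            out << ' ';
        }
        // Cast to unsigned so the byte prints as a number, not a character.
        out << std::setw(2) << static_cast<unsigned>(bytes[i]);
    }
    return out.str();
}
```

For a four-byte array holding 0x00, 0x0a, 0x7f, 0xff this returns "00 0a 7f ff". This makes the extra array members of the union in the question unnecessary for printing.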

Answer 3

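If you try to avoid casting in a sensible way through various data structure hacks, you will end up shuffling memory/words around, which will kill whatever performance you are hoping to get from NEON.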

You can probably cast quad registers to double registers easily, but the other way around may not be possible.

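Everything boils down to this. In each instruction, a few bits are used to index registers. If the instruction expects quad registers, it counts the double registers two by two, as D(2*n) and D(2*n+1), and only uses n in the encoded instruction; (2*n+1) is implicit for the core. If at any point in your code you try to cast two doubles into a quad, they may not be in consecutive registers, forcing the compiler to shuffle registers out to the stack and back to get a consecutive layout.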
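The pairing constraint can be sketched as plain arithmetic. This assumes the AArch32 (ARMv7) NEON register file, where quad register Q(n) overlays double registers D(2n) and D(2n+1). The names `DPair`, `d_registers_of_q` and `forms_q_register` are illustrative, not part of any NEON API.

```cpp
// Assumption: AArch32 NEON, where Q(n) overlays D(2n) (low half) and D(2n+1) (high half).
struct DPair {
    int low;
    int high;
};

// The two D registers that make up quad register Q(q).
constexpr DPair d_registers_of_q(int q) {
    return DPair{2 * q, 2 * q + 1};
}

// Two D registers can be viewed as one Q register without any move
// only if they are an adjacent pair starting at an even index.
constexpr bool forms_q_register(int d_low, int d_high) {
    return d_low % 2 == 0 && d_high == d_low + 1;
}
```

For example, D4 and D5 already form Q2, while D5 and D6 are adjacent but not aligned, so combining them into a quad needs at least one register move.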

I think it is still the same answer in different words:

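NEON instructions are designed to be streaming: you load from memory in big chunks, process the data, then store what you want. This should all be very simple mechanics; otherwise you will lose the extra performance NEON offers, which will make people ask why you are making life harder for yourself by using NEON in the first place.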

Think of NEON as immutable value types and operations.
