AVX2 support in GCC 5 and later
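I wrote the class "T" below to accelerate manipulation of "sets of characters" using AVX2. Then I found that it does not work with gcc 5 and later when I use "-O3". Can anyone help me trace this to some programming construct that is known not to work on the latest compilers/systems?
How this code works: the underlying structure ("_bits") is a block of 256 bytes (aligned and allocated for AVX2) that can be accessed either as char[256] or as AVX2 elements, depending on whether a single element is accessed or the whole thing is used in a vector operation. It seems that this should work perfectly well on AVX2 platforms. No?
This is really hard to debug, because "valgrind" says it is clean, and I can't use a debugger (the problem vanishes when I remove "-O3"). But I'm not happy with just using the "|=" workaround, because if this code is really wrong, then I am probably making the same mistake elsewhere and screwing up everything I develop!
Interestingly, the "|" operator has the problem but "|=" does not. Might the problem be related to returning a structure from a function? But I thought returning structures has worked since 1990 or so.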
// g++ -std=c++11 -mavx2 -O3 gcc_fail.cpp
#include "assert.h"
#include "immintrin.h" // AVX
class T {
public:
    __m256i _bits[8];
    inline bool& operator[](unsigned char c) {return ((bool*)_bits)[c];}
    inline bool operator[](unsigned char c) const {return ((bool*)_bits)[c];}
    inline T() {}
    inline explicit T(char const*);
    inline T operator| (T const& b) const;
    inline T & operator|=(T const& b);
    inline bool operator! () const;
};
T::T(char const* s)
{
    _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
    _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);
    char c;
    while ((c = *s++))
        (*this)[c] = true;
}
T T::operator| (T const& b) const
{
    T res;
    for (int i = 0; i < 8; i++)
        res._bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);
    // FIXME why does the above code fail with -O3 in new gcc?
    for (int i=0; i<256; i++)
        assert(res[i] == ((*this)[i] || b[i]));
    // gcc 4.7.0 - PASS
    // gcc 4.7.2 - PASS
    // gcc 4.8.0 - PASS
    // gcc 4.9.2 - PASS
    // gcc 5.2.0 - FAIL
    // gcc 5.3.0 - FAIL
    // gcc 5.3.1 - FAIL
    // gcc 6.1.0 - FAIL
    return res;
}
T & T::operator|=(T const& b)
{
    for (int i = 0; i < 8; i++)
        _bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);
    return *this;
}
bool T::operator! () const
{
    for (int i = 0; i < 8; i++)
        if (!_mm256_testz_si256(_bits[i], _bits[i]))
            return false;
    return true;
}
int Main()
{
    T sep (" ,\t\n");
    T end ("");
    return !(sep|end);
}
int main()
{
    return Main();
}
The problem with your code is that it uses bool* where it should use unsigned char*, which allows GCC 5 to go ahead with a pointer-aliasing optimization.
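One possible shape of the fix is sketched below. It is not the asker's code: the class name CharSet, the member bytes_, and the helper run_example() are made up for illustration. The idea is to make unsigned char the declared type of the storage and to touch it as vectors only through the load/store intrinsics. The scalar branch is an extra assumption, added only so that the sketch also builds without -mavx2.

```cpp
#include <cassert>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// CharSet is a hypothetical stand-in for the question's class T.
class CharSet {
public:
    CharSet() { std::memset(bytes_, 0, sizeof bytes_); }

    explicit CharSet(const char* s) : CharSet() {
        // Go through unsigned char so that bytes above 0x7f index correctly.
        while (unsigned char c = static_cast<unsigned char>(*s++))
            bytes_[c] = 1;
    }

    // Element access uses the declared type of the storage: no cast needed.
    unsigned char& operator[](unsigned char c) { return bytes_[c]; }
    bool operator[](unsigned char c) const { return bytes_[c] != 0; }

    CharSet operator|(const CharSet& b) const {
        CharSet res;
#if defined(__AVX2__)
        for (int i = 0; i < 8; i++) {
            // Unaligned load/store intrinsics, so no alignment requirement is imposed.
            const __m256i x = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(bytes_ + 32 * i));
            const __m256i y = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(b.bytes_ + 32 * i));
            _mm256_storeu_si256(reinterpret_cast<__m256i*>(res.bytes_ + 32 * i),
                                _mm256_or_si256(x, y));
        }
#else
        // Scalar fallback (assumption: only here so the sketch builds without AVX2).
        for (int i = 0; i < 256; i++)
            res.bytes_[i] = static_cast<unsigned char>(bytes_[i] | b.bytes_[i]);
#endif
        return res;
    }

    // True when no character is in the set.
    bool operator!() const {
        for (int i = 0; i < 256; i++)
            if (bytes_[i]) return false;
        return true;
    }

private:
    alignas(32) unsigned char bytes_[256];
};

// Mirrors Main() from the question; returns 0 because the union is non-empty.
int run_example() {
    CharSet sep(" ,\t\n");
    CharSet end("");
    CharSet res = sep | end;
    for (int i = 0; i < 256; i++)
        assert(res[i] == (sep[i] || end[i]));
    return !res;
}
```

Keeping the bytes as the declared type means ordinary element access needs no cast at all, and the only reinterpret_cast left is the one handed to the intrinsics. The smaller change of replacing bool* with unsigned char* in the original operator[] is the one this answer suggests.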
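Two dumps of the machine code generated for the function Main() by GCC 4.8.5 and 5.3.1 are in the appendix at the end of this answer, for reference.
Looking at the code:
Disassembly
After the prologue, T sep's _bits are initialized to zero...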
_bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
_bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);
40063d: c5 fd 7f 44 24 60 vmovdqa %ymm0,0x60(%rsp)
400643: c5 fd 7f 44 24 40 vmovdqa %ymm0,0x40(%rsp)
400649: c5 fd 7f 44 24 20 vmovdqa %ymm0,0x20(%rsp)
40064f: c5 fd 7f 04 24 vmovdqa %ymm0,(%rsp)
400654: c5 fd 7f 84 24 e0 00 00 00 vmovdqa %ymm0,0xe0(%rsp)
40065d: c5 fd 7f 84 24 c0 00 00 00 vmovdqa %ymm0,0xc0(%rsp)
400666: c5 fd 7f 84 24 a0 00 00 00 vmovdqa %ymm0,0xa0(%rsp)
40066f: c5 fd 7f 84 24 80 00 00 00 vmovdqa %ymm0,0x80(%rsp)
...and then written to in a loop driven by char* s:
char c;
while ((c = *s++))
(*this)[c] = true;
400680: 48 83 c2 01 add $0x1,%rdx
400684: c6 04 04 01 movb $0x1,(%rsp,%rax,1)
400688: 0f b6 42 ff movzbl -0x1(%rdx),%eax
40068c: 84 c0 test %al,%al
40068e: 75 f0 jne 400680 <_Z4Mainv+0x60>
Both compilers then initialize T end to 0:
400690: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
400694: 31 c0 xor %eax,%eax
400696: c5 fd 7f 84 24 60 01 00 00 vmovdqa %ymm0,0x160(%rsp)
40069f: c5 fd 7f 84 24 40 01 00 00 vmovdqa %ymm0,0x140(%rsp)
4006a8: c5 fd 7f 84 24 20 01 00 00 vmovdqa %ymm0,0x120(%rsp)
4006b1: c5 fd 7f 84 24 00 01 00 00 vmovdqa %ymm0,0x100(%rsp)
4006ba: c5 fd 7f 84 24 e0 01 00 00 vmovdqa %ymm0,0x1e0(%rsp)
4006c3: c5 fd 7f 84 24 c0 01 00 00 vmovdqa %ymm0,0x1c0(%rsp)
4006cc: c5 fd 7f 84 24 a0 01 00 00 vmovdqa %ymm0,0x1a0(%rsp)
4006d5: c5 fd 7f 84 24 80 01 00 00 vmovdqa %ymm0,0x180(%rsp)
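Both compilers then optimize away the _mm256_or_si256() operation, because T end is known to be 0. However, GCC 4.8.5 copies T sep into T res (which is what OR-ing anything with an all-zero value amounts to), whereas GCC 5.3.1 initializes T res to 0. It is entitled to do so because your operator[] methods cast a pointer of type __m256i* to bool*, and the compiler is allowed to assume that pointers to these types do not alias. In effect, GCC 5.3.1 treats the byte stores made through bool& as unable to change _bits, so it takes T sep to be still all zero when computing the OR. Thus in GCC 4.8.5 you see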
4006de: c5 fd 6f 04 24 vmovdqa (%rsp),%ymm0
4006e3: c5 fd 7f 84 24 00 02 00 00 vmovdqa %ymm0,0x200(%rsp)
4006ec: c5 fd 6f 44 24 20 vmovdqa 0x20(%rsp),%ymm0
4006f2: c5 fd 7f 84 24 20 02 00 00 vmovdqa %ymm0,0x220(%rsp)
4006fb: c5 fd 6f 44 24 40 vmovdqa 0x40(%rsp),%ymm0
400701: c5 fd 7f 84 24 40 02 00 00 vmovdqa %ymm0,0x240(%rsp)
40070a: c5 fd 6f 44 24 60 vmovdqa 0x60(%rsp),%ymm0
400710: c5 fd 7f 84 24 60 02 00 00 vmovdqa %ymm0,0x260(%rsp)
400719: c5 fd 6f 84 24 80 00 00 00 vmovdqa 0x80(%rsp),%ymm0
400722: c5 fd 7f 84 24 80 02 00 00 vmovdqa %ymm0,0x280(%rsp)
40072b: c5 fd 6f 84 24 a0 00 00 00 vmovdqa 0xa0(%rsp),%ymm0
400734: c5 fd 7f 84 24 a0 02 00 00 vmovdqa %ymm0,0x2a0(%rsp)
40073d: c5 fd 6f 84 24 c0 00 00 00 vmovdqa 0xc0(%rsp),%ymm0
400746: c5 fd 7f 84 24 c0 02 00 00 vmovdqa %ymm0,0x2c0(%rsp)
40074f: c5 fd 6f 84 24 e0 00 00 00 vmovdqa 0xe0(%rsp),%ymm0
400758: c5 fd 7f 84 24 e0 02 00 00 vmovdqa %ymm0,0x2e0(%rsp)
whereas in GCC 5.3.1 you see
4006fa: c5 fd 7f 85 f0 fe ff ff vmovdqa %ymm0,-0x110(%rbp)
400702: c5 fd 7f 85 10 ff ff ff vmovdqa %ymm0,-0xf0(%rbp)
40070a: c5 fd 7f 85 30 ff ff ff vmovdqa %ymm0,-0xd0(%rbp)
400712: c5 fd 7f 85 50 ff ff ff vmovdqa %ymm0,-0xb0(%rbp)
40071a: c5 fd 7f 85 70 ff ff ff vmovdqa %ymm0,-0x90(%rbp)
400722: c5 fd 7f 45 90 vmovdqa %ymm0,-0x70(%rbp)
400727: c5 fd 7f 45 b0 vmovdqa %ymm0,-0x50(%rbp)
40072c: c5 fd 7f 45 d0 vmovdqa %ymm0,-0x30(%rbp)
and the reads in the assert() then fail.
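As a point of comparison, here is a small scalar model of what Main() is supposed to compute. The names ByteSet, make_set, set_union, is_empty and expected_main_result are hypothetical and appear nowhere in the question. The model encodes the identity that GCC 4.8.5 applies correctly, namely that OR-ing with an all-zero set reproduces the other operand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in for the question's class T: one byte per character.
using ByteSet = std::array<unsigned char, 256>;

ByteSet make_set(const char* s) {
    ByteSet set{};  // value-initialized, so every byte starts at zero
    while (unsigned char c = static_cast<unsigned char>(*s++))
        set[c] = 1;
    return set;
}

ByteSet set_union(const ByteSet& a, const ByteSet& b) {
    ByteSet res{};
    for (std::size_t i = 0; i < res.size(); i++)
        res[i] = static_cast<unsigned char>(a[i] | b[i]);
    return res;
}

bool is_empty(const ByteSet& s) {
    for (unsigned char b : s)
        if (b) return false;
    return true;
}

// What Main() should return: 0, because sep | end is not empty.
int expected_main_result() {
    const ByteSet sep = make_set(" ,\t\n");
    const ByteSet end = make_set("");
    const ByteSet res = set_union(sep, end);
    assert(res == sep);  // x | 0 == x, the copy that GCC 4.8.5 emits
    return is_empty(res) ? 1 : 0;
}
```

The GCC 5.3.1 dump corresponds to res being all zero, which is exactly the case this model rules out.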
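What the Standard says about pointer aliasing:
ISO C++11 covers aliasing in the following section, which makes it clear that an object of type __m256i cannot be accessed through a bool*, but can be accessed through a char* or unsigned char*: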
§ 3.10 Lvalues and rvalues [basic.lval]
[...]
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: [52]
- the dynamic type of the object,
- a cv-qualified version of the dynamic type of the object,
- a type similar (as defined in 4.4) to the dynamic type of the object,
- a type that is the signed or unsigned type corresponding to the dynamic type of the object,
- a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
- an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
- a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
- a char or unsigned char type.
52) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
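The last bullet is the one that matters here. As a minimal illustration, the sketch below pokes at a block of 32-bit words through unsigned char* and through std::memcpy, both of which are permitted ways of reaching an object's bytes. The std::uint32_t array is only a stand-in for __m256i so that the example builds without AVX2, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: set one byte of a block of words to 1.
// Access through unsigned char* is covered by the "char or unsigned char" bullet.
void set_byte(std::uint32_t* words, std::size_t byte_index) {
    unsigned char* bytes = reinterpret_cast<unsigned char*>(words);
    bytes[byte_index] = 1;
}

// Hypothetical helper: read one byte back. std::memcpy copies the object
// representation, which is another well-defined way to inspect the bytes.
unsigned char get_byte(const std::uint32_t* words, std::size_t byte_index) {
    unsigned char b;
    std::memcpy(&b, reinterpret_cast<const unsigned char*>(words) + byte_index, 1);
    return b;
}

// The word that contains the modified byte must be seen as changed.
bool demo_byte_access() {
    std::uint32_t words[8] = {};
    set_byte(words, 5);            // byte 5 lives in words[1] on any byte order
    assert(get_byte(words, 5) == 1);
    return words[1] != 0;
}
```

Doing the same thing through a bool* would fall outside the list above, so the compiler would be free to assume that words[1] had not changed.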
Appendix
GCC 4.8.5:
0000000000400620 <_Z4Mainv>:
400620: 55 push %rbp
400621: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
400625: ba e5 08 40 00 mov $0x4008e5,%edx
40062a: b8 20 00 00 00 mov $0x20,%eax
40062f: 48 89 e5 mov %rsp,%rbp
400632: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp
400636: 48 81 ec 00 03 00 00 sub $0x300,%rsp
40063d: c5 fd 7f 44 24 60 vmovdqa %ymm0,0x60(%rsp)
400643: c5 fd 7f 44 24 40 vmovdqa %ymm0,0x40(%rsp)
400649: c5 fd 7f 44 24 20 vmovdqa %ymm0,0x20(%rsp)
40064f: c5 fd 7f 04 24 vmovdqa %ymm0,(%rsp)
400654: c5 fd 7f 84 24 e0 00 00 00 vmovdqa %ymm0,0xe0(%rsp)
40065d: c5 fd 7f 84 24 c0 00 00 00 vmovdqa %ymm0,0xc0(%rsp)
400666: c5 fd 7f 84 24 a0 00 00 00 vmovdqa %ymm0,0xa0(%rsp)
40066f: c5 fd 7f 84 24 80 00 00 00 vmovdqa %ymm0,0x80(%rsp)
400678: 0f 1f 84 00 00 00 00 00 nopl 0x0(%rax,%rax,1)
400680: 48 83 c2 01 add $0x1,%rdx
400684: c6 04 04 01 movb $0x1,(%rsp,%rax,1)
400688: 0f b6 42 ff movzbl -0x1(%rdx),%eax
40068c: 84 c0 test %al,%al
40068e: 75 f0 jne 400680 <_Z4Mainv+0x60>
400690: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
400694: 31 c0 xor %eax,%eax
400696: c5 fd 7f 84 24 60 01 00 00 vmovdqa %ymm0,0x160(%rsp)
40069f: c5 fd 7f 84 24 40 01 00 00 vmovdqa %ymm0,0x140(%rsp)
4006a8: c5 fd 7f 84 24 20 01 00 00 vmovdqa %ymm0,0x120(%rsp)
4006b1: c5 fd 7f 84 24 00 01 00 00 vmovdqa %ymm0,0x100(%rsp)
4006ba: c5 fd 7f 84 24 e0 01 00 00 vmovdqa %ymm0,0x1e0(%rsp)
4006c3: c5 fd 7f 84 24 c0 01 00 00 vmovdqa %ymm0,0x1c0(%rsp)
4006cc: c5 fd 7f 84 24 a0 01 00 00 vmovdqa %ymm0,0x1a0(%rsp)
4006d5: c5 fd 7f 84 24 80 01 00 00 vmovdqa %ymm0,0x180(%rsp)
4006de: c5 fd 6f 04 24 vmovdqa (%rsp),%ymm0
4006e3: c5 fd 7f 84 24 00 02 00 00 vmovdqa %ymm0,0x200(%rsp)
4006ec: c5 fd 6f 44 24 20 vmovdqa 0x20(%rsp),%ymm0
4006f2: c5 fd 7f 84 24 20 02 00 00 vmovdqa %ymm0,0x220(%rsp)
4006fb: c5 fd 6f 44 24 40 vmovdqa 0x40(%rsp),%ymm0
400701: c5 fd 7f 84 24 40 02 00 00 vmovdqa %ymm0,0x240(%rsp)
40070a: c5 fd 6f 44 24 60 vmovdqa 0x60(%rsp),%ymm0
400710: c5 fd 7f 84 24 60 02 00 00 vmovdqa %ymm0,0x260(%rsp)
400719: c5 fd 6f 84 24 80 00 00 00 vmovdqa 0x80(%rsp),%ymm0
400722: c5 fd 7f 84 24 80 02 00 00 vmovdqa %ymm0,0x280(%rsp)
40072b: c5 fd 6f 84 24 a0 00 00 00 vmovdqa 0xa0(%rsp),%ymm0
400734: c5 fd 7f 84 24 a0 02 00 00 vmovdqa %ymm0,0x2a0(%rsp)
40073d: c5 fd 6f 84 24 c0 00 00 00 vmovdqa 0xc0(%rsp),%ymm0
400746: c5 fd 7f 84 24 c0 02 00 00 vmovdqa %ymm0,0x2c0(%rsp)
40074f: c5 fd 6f 84 24 e0 00 00 00 vmovdqa 0xe0(%rsp),%ymm0
400758: c5 fd 7f 84 24 e0 02 00 00 vmovdqa %ymm0,0x2e0(%rsp)
400761: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400768: 80 3c 04 00 cmpb $0x0,(%rsp,%rax,1)
40076c: 0f b6 8c 04 00 02 00 00 movzbl 0x200(%rsp,%rax,1),%ecx
400774: ba 01 00 00 00 mov $0x1,%edx
400779: 75 08 jne 400783 <_Z4Mainv+0x163>
40077b: 0f b6 94 04 00 01 00 00 movzbl 0x100(%rsp,%rax,1),%edx
400783: 38 d1 cmp %dl,%cl
400785: 0f 85 b2 00 00 00 jne 40083d <_Z4Mainv+0x21d>
40078b: 48 83 c0 01 add $0x1,%rax
40078f: 48 3d 00 01 00 00 cmp $0x100,%rax
400795: 75 d1 jne 400768 <_Z4Mainv+0x148>
400797: c5 fd 6f 8c 24 00 02 00 00 vmovdqa 0x200(%rsp),%ymm1
4007a0: 31 c0 xor %eax,%eax
4007a2: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
4007a7: 0f 94 c0 sete %al
4007aa: 0f 85 88 00 00 00 jne 400838 <_Z4Mainv+0x218>
4007b0: c5 fd 6f 8c 24 20 02 00 00 vmovdqa 0x220(%rsp),%ymm1
4007b9: 31 c0 xor %eax,%eax
4007bb: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
4007c0: 0f 94 c0 sete %al
4007c3: 75 73 jne 400838 <_Z4Mainv+0x218>
4007c5: c5 fd 6f 8c 24 40 02 00 00 vmovdqa 0x240(%rsp),%ymm1
4007ce: 31 c0 xor %eax,%eax
4007d0: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
4007d5: 0f 94 c0 sete %al
4007d8: 75 5e jne 400838 <_Z4Mainv+0x218>
4007da: c5 fd 6f 8c 24 60 02 00 00 vmovdqa 0x260(%rsp),%ymm1
4007e3: 31 c0 xor %eax,%eax
4007e5: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
4007ea: 0f 94 c0 sete %al
4007ed: 75 49 jne 400838 <_Z4Mainv+0x218>
4007ef: c5 fd 6f 8c 24 80 02 00 00 vmovdqa 0x280(%rsp),%ymm1
4007f8: 31 c0 xor %eax,%eax
4007fa: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
4007ff: 0f 94 c0 sete %al
400802: 75 34 jne 400838 <_Z4Mainv+0x218>
400804: c5 fd 6f 8c 24 a0 02 00 00 vmovdqa 0x2a0(%rsp),%ymm1
40080d: 31 c0 xor %eax,%eax
40080f: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
400814: 0f 94 c0 sete %al
400817: 75 1f jne 400838 <_Z4Mainv+0x218>
400819: c5 fd 6f 8c 24 c0 02 00 00 vmovdqa 0x2c0(%rsp),%ymm1
400822: 31 c0 xor %eax,%eax
400824: c4 e2 7d 17 c9 vptest %ymm1,%ymm1
400829: 0f 94 c0 sete %al
40082c: 75 0a jne 400838 <_Z4Mainv+0x218>
40082e: 31 c0 xor %eax,%eax
400830: c4 e2 7d 17 c0 vptest %ymm0,%ymm0
400835: 0f 94 c0 sete %al
400838: c5 f8 77 vzeroupper
40083b: c9 leaveq
40083c: c3 retq
40083d: b9 20 09 40 00 mov $0x400920,%ecx
400842: ba 26 00 00 00 mov $0x26,%edx
400847: be e9 08 40 00 mov $0x4008e9,%esi
40084c: bf f8 08 40 00 mov $0x4008f8,%edi
400851: c5 f8 77 vzeroupper
400854: e8 97 fc ff ff callq 4004f0 <__assert_fail@plt>
400859: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
GCC 5.3.1:
0000000000400630 <_Z4Mainv>:
400630: 4c 8d 54 24 08 lea 0x8(%rsp),%r10
400635: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp
400639: b8 20 00 00 00 mov $0x20,%eax
40063e: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
400642: ba 25 08 40 00 mov $0x400825,%edx
400647: 41 ff 72 f8 pushq -0x8(%r10)
40064b: 55 push %rbp
40064c: 48 89 e5 mov %rsp,%rbp
40064f: 41 52 push %r10
400651: 48 81 ec 08 03 00 00 sub $0x308,%rsp
400658: c5 fd 7f 85 50 fd ff ff vmovdqa %ymm0,-0x2b0(%rbp)
400660: c5 fd 7f 85 30 fd ff ff vmovdqa %ymm0,-0x2d0(%rbp)
400668: c5 fd 7f 85 10 fd ff ff vmovdqa %ymm0,-0x2f0(%rbp)
400670: c5 fd 7f 85 f0 fc ff ff vmovdqa %ymm0,-0x310(%rbp)
400678: c5 fd 7f 85 d0 fd ff ff vmovdqa %ymm0,-0x230(%rbp)
400680: c5 fd 7f 85 b0 fd ff ff vmovdqa %ymm0,-0x250(%rbp)
400688: c5 fd 7f 85 90 fd ff ff vmovdqa %ymm0,-0x270(%rbp)
400690: c5 fd 7f 85 70 fd ff ff vmovdqa %ymm0,-0x290(%rbp)
400698: 0f 1f 84 00 00 00 00 00 nopl 0x0(%rax,%rax,1)
4006a0: 48 83 c2 01 add $0x1,%rdx
4006a4: c6 84 05 f0 fc ff ff 01 movb $0x1,-0x310(%rbp,%rax,1)
4006ac: 0f b6 42 ff movzbl -0x1(%rdx),%eax
4006b0: 84 c0 test %al,%al
4006b2: 75 ec jne 4006a0 <_Z4Mainv+0x70>
4006b4: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
4006b8: 31 c0 xor %eax,%eax
4006ba: c5 fd 7f 85 50 fe ff ff vmovdqa %ymm0,-0x1b0(%rbp)
4006c2: c5 fd 7f 85 30 fe ff ff vmovdqa %ymm0,-0x1d0(%rbp)
4006ca: c5 fd 7f 85 10 fe ff ff vmovdqa %ymm0,-0x1f0(%rbp)
4006d2: c5 fd 7f 85 f0 fd ff ff vmovdqa %ymm0,-0x210(%rbp)
4006da: c5 fd 7f 85 d0 fe ff ff vmovdqa %ymm0,-0x130(%rbp)
4006e2: c5 fd 7f 85 b0 fe ff ff vmovdqa %ymm0,-0x150(%rbp)
4006ea: c5 fd 7f 85 90 fe ff ff vmovdqa %ymm0,-0x170(%rbp)
4006f2: c5 fd 7f 85 70 fe ff ff vmovdqa %ymm0,-0x190(%rbp)
4006fa: c5 fd 7f 85 f0 fe ff ff vmovdqa %ymm0,-0x110(%rbp)
400702: c5 fd 7f 85 10 ff ff ff vmovdqa %ymm0,-0xf0(%rbp)
40070a: c5 fd 7f 85 30 ff ff ff vmovdqa %ymm0,-0xd0(%rbp)
400712: c5 fd 7f 85 50 ff ff ff vmovdqa %ymm0,-0xb0(%rbp)
40071a: c5 fd 7f 85 70 ff ff ff vmovdqa %ymm0,-0x90(%rbp)
400722: c5 fd 7f 45 90 vmovdqa %ymm0,-0x70(%rbp)
400727: c5 fd 7f 45 b0 vmovdqa %ymm0,-0x50(%rbp)
40072c: c5 fd 7f 45 d0 vmovdqa %ymm0,-0x30(%rbp)
400731: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400738: 0f b6 94 05 f0 fc ff ff movzbl -0x310(%rbp,%rax,1),%edx
400740: 0f b6 8c 05 f0 fe ff ff movzbl -0x110(%rbp,%rax,1),%ecx
400748: 84 d2 test %dl,%dl
40074a: 75 08 jne 400754 <_Z4Mainv+0x124>
40074c: 0f b6 94 05 f0 fd ff ff movzbl -0x210(%rbp,%rax,1),%edx
400754: 38 d1 cmp %dl,%cl
400756: 75 2c jne 400784 <_Z4Mainv+0x154>
400758: 48 83 c0 01 add $0x1,%rax
40075c: 48 3d 00 01 00 00 cmp $0x100,%rax
400762: 75 d4 jne 400738 <_Z4Mainv+0x108>
400764: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
400768: 31 c0 xor %eax,%eax
40076a: c4 e2 7d 17 c0 vptest %ymm0,%ymm0
40076f: 0f 94 c0 sete %al
400772: c5 f8 77 vzeroupper
400775: 48 81 c4 08 03 00 00 add $0x308,%rsp
40077c: 41 5a pop %r10
40077e: 5d pop %rbp
40077f: 49 8d 62 f8 lea -0x8(%r10),%rsp
400783: c3 retq
400784: b9 60 08 40 00 mov $0x400860,%ecx
400789: ba 26 00 00 00 mov $0x26,%edx
40078e: be 29 08 40 00 mov $0x400829,%esi
400793: bf 38 08 40 00 mov $0x400838,%edi
400798: c5 f8 77 vzeroupper
40079b: e8 50 fd ff ff callq 4004f0 <__assert_fail@plt>