Trivial loop not auto-vectorized with gcc 4.8.5
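I am trying to learn more about auto-vectorization in gcc.
In my project I have to use gcc 4.8.5, and I see that some loops are not vectorized.
So I created a small example to play with and understand why they are not.
What interests me is why gcc does not vectorize this loop, and how I could get it vectorized. Unfortunately, I am not very familiar with gcc's diagnostic output.
a) I would expect this loop to be a trivial case for the vectorizer.
b) Is there something simple that I am missing?
Thanks a lot, all...
The small example is: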
#include <iostream>
#include <vector>
using namespace std;
class test
{
public:
    test();
    ~test();
    void calc_test();
};
test::test()
{
}
test::~test()
{
}
void
test::calc_test(void)
{
    vector<int> ffs_psd(10000,5.0);
    vector<int> G_qh_sp(10000,1.0);
    vector<int> G_qv_sp(10000,3.0);
    vector<int> B_erm_qh(10000,50.0);
    vector<int> B_erm_qv(10000,2.0);
    for ( uint ang=0; ang < 6808; ang++)
    {
        ffs_psd[0] += (G_qh_sp[ang] * B_erm_qh[ang]) + (G_qv_sp[ang] * B_erm_qv[ang]);
    }
}
int main(int argc, char * argv[])
{
    test m_test;
    m_test.calc_test();
}
I compile with gcc 4.8.5:
c++ -O3 -ftree-vectorize -fopt-info-vec-missed -ftree-vectorizer-verbose=5 -std=c++11 test.cpp
The output I get from the compiler is:
test.cpp:34: note: ===vect_slp_analyze_bb===
test.cpp:34: note: === vect_analyze_data_refs ===
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: === vect_pattern_recog ===
test.cpp:34: note: vect_is_simple_use: operand _27
test.cpp:34: note: def_stmt: _27 = (long unsigned int) ang_212;
test.cpp:34: note: type of def: 3.
test.cpp:34: note: vect_is_simple_use: operand ang_212
test.cpp:34: note: def_stmt: ang_212 = PHI <ang_43(78), 0(76)>
test.cpp:34: note: type of def: 2.
test.cpp:34: note: vect_is_simple_use: operand 4
test.cpp:34: note: vect_recog_widen_mult_pattern: detected:
test.cpp:34: note: get vectype with 4 units of type uint
test.cpp:34: note: vectype: vector(4) unsigned int
test.cpp:34: note: get vectype with 2 units of type long unsigned int
test.cpp:34: note: vectype: vector(2) long unsigned int
test.cpp:34: note: patt_2 = ang_212 w* 4;
test.cpp:34: note: pattern recognized: patt_2 = ang_212 w* 4;
test.cpp:34: note: vect_is_simple_use: operand _29
test.cpp:34: note: def_stmt: _29 = *_67;
test.cpp:34: note: type of def: 3.
test.cpp:34: note: vect_is_simple_use: operand _34
test.cpp:34: note: def_stmt: _34 = *_69;
test.cpp:34: note: type of def: 3.
test.cpp:34: note: === vect_analyze_dependences ===
test.cpp:34: note: can't determine dependence between *_67 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_68 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_69 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_70 and MEM[(value_type &)__first_111]
test.cpp:34: note: === vect_analyze_data_refs_alignment ===
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_125
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_153
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_139
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_167
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: can't force alignment of ref: MEM[(value_type &)__first_111]
test.cpp:34: note: === vect_analyze_data_ref_accesses ===
test.cpp:34: note: not consecutive access MEM[(value_type &)__first_111] = _41;
test.cpp:34: note: === vect_analyze_slp ===
test.cpp:34: note: Failed to SLP the basic block.
test.cpp:34: note: not vectorized: failed to find SLP opportunities in basic block.
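EDIT:
After Matt's answer:
@Matt:
Thank you very much for your answer.
I did not know that vectors are not aligned. This information is very useful, because many people take it for granted that their loops will be vectorized even when they use vector as the container.
Unfortunately, even with your changes, the report from gcc still says the code is not vectorized (this time with different messages):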
test.cpp:47: note: misalign = 0 bytes of ref MEM[(value_type &)&ffs_psd]
test.cpp:47: note: not consecutive access _25 = MEM[(value_type &)&ffs_psd];
test.cpp:47: note: Failed to SLP the basic block.
test.cpp:47: note: not vectorized: failed to find SLP opportunities in basic block.
test.cpp:47: note: misalign = 0 bytes of ref MEM[(value_type &)&ffs_psd]
test.cpp:47: note: not consecutive access _25 = MEM[(value_type &)&ffs_psd];
test.cpp:47: note: Failed to SLP the basic block.
test.cpp:47: note: not vectorized: failed to find SLP opportunities in basic block.
The assembly output is (I hope I copied the right part, since my assembly knowledge is not very good):
.L16:
vmovdqa 40000(%rsp,%rax), %ymm1
vmovdqa 80000(%rsp,%rax), %ymm0
vpmulld 120000(%rsp,%rax), %ymm1, %ymm1
vpmulld 160000(%rsp,%rax), %ymm0, %ymm0
vpaddd %ymm0, %ymm1, %ymm0
vpaddd (%rsp,%rax), %ymm0, %ymm0
vmovdqa %ymm0, (%rsp,%rax)
addq $32, %rax
cmpq $27232, %rax
jne .L16
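To use aligned vector instructions (such as vmovdqa), the operands need to be aligned on the right boundary, for example with __attribute__((aligned(32))) or __attribute__((aligned(16))).
Even if the class is aligned, the standard allocator used by std::vector does not guarantee that alignment. For example, std::vector<__m64> A creates a vector of a SIMD data type, but the elements may not be aligned, because std::allocator does not align everything.
In my opinion, the simplest change is to use std::array together with __attribute__((aligned(32))):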
#include <iostream>
#include <array>
using namespace std;
int main()
{
    array<int, 10000> ffs_psd __attribute__((aligned(32)));
    ffs_psd.fill(5);
    array<int, 10000> G_qh_sp __attribute__((aligned(32)));
    G_qh_sp.fill(1);
    array<int, 10000> G_qv_sp __attribute__((aligned(32)));
    G_qv_sp.fill(3);
    array<int, 10000> B_erm_qh __attribute__((aligned(32)));
    B_erm_qh.fill(50);
    array<int, 10000> B_erm_qv __attribute__((aligned(32)));
    B_erm_qv.fill(2);
    for ( uint ang=0; ang < 6808; ang++)
    {
        ffs_psd[0] += (G_qh_sp[ang] * B_erm_qh[ang]) + (G_qv_sp[ang] * B_erm_qv[ang]);
    }
    cout << ffs_psd[0] << endl;
}
The loop produces this:
vmovdqa ymm2, YMMWORD PTR [rsp+40000+rax]
vmovdqa ymm1, YMMWORD PTR [rsp+80000+rax]
vpmulld ymm2, ymm2, YMMWORD PTR [rsp+120000+rax]
vpmulld ymm1, ymm1, YMMWORD PTR [rsp+160000+rax]
add rax, 32
vpaddd ymm1, ymm2, ymm1
cmp rax, 27232
vpaddd ymm0, ymm0, ymm1
jne .L13
vmovdqa xmm1, xmm0
with GCC 4.8.3 -std=c++11 -Wall -Wextra -pedantic-errors -O2 -ftree-vectorize -march=native on Godbolt.
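The __attribute__((aligned(32))) spelling is a GCC extension. As a rough sketch, and assuming your compiler accepts the C++11 alignas specifier (as far as I know GCC does from 4.8 on), the same request can be written in standard syntax. The names is_aligned and aligned_array_sum are made up for this illustration, and the sum is kept in a local variable rather than in ffs_psd[0]; I have not checked what GCC 4.8.5 generates for this exact variant.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p lies on an `alignment`-byte boundary.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Same reduction as the example above, with alignas (C++11) in place of
// the GCC-specific attribute. The sum is accumulated in a local variable.
int aligned_array_sum(unsigned count) {
    alignas(32) std::array<int, 10000> G_qh_sp;
    alignas(32) std::array<int, 10000> G_qv_sp;
    alignas(32) std::array<int, 10000> B_erm_qh;
    alignas(32) std::array<int, 10000> B_erm_qv;
    G_qh_sp.fill(1);
    G_qv_sp.fill(3);
    B_erm_qh.fill(50);
    B_erm_qv.fill(2);

    assert(count <= G_qh_sp.size());         // stay inside the arrays
    assert(is_aligned(G_qh_sp.data(), 32));  // check the alignment at run time

    int sum = 5;  // initial value of ffs_psd[0] in the example
    for (unsigned ang = 0; ang < count; ++ang)
        sum += G_qh_sp[ang] * B_erm_qh[ang] + G_qv_sp[ang] * B_erm_qv[ang];
    return sum;
}
```

Calling aligned_array_sum(6808) from main returns 381253, that is 5 + 6808 * (1*50 + 3*2).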
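Another option is to use boost::alignment::aligned_allocator with your vector.
Finally, you can write your own allocator that vector can use to align things properly. Here is an article explaining the requirements for an allocator. Also here is an SO question about the same basic thing.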
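To give an idea of what writing your own allocator could look like, here is a minimal sketch, not a production implementation. It assumes a POSIX system (it relies on posix_memalign) and a C++11 standard library that goes through std::allocator_traits. The names aligned_allocator, avx_vector, starts_on_boundary and aligned_vector_sum are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX assumption)
#include <vector>

// Hypothetical minimal allocator returning storage aligned to Alignment bytes.
template <typename T, std::size_t Alignment>
struct aligned_allocator {
    static_assert((Alignment & (Alignment - 1)) == 0 &&
                  Alignment % sizeof(void*) == 0,
                  "posix_memalign needs a power of two that is a multiple of sizeof(void*)");
    using value_type = T;

    aligned_allocator() noexcept {}
    template <typename U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    // Spelled out because Alignment is a non-type template parameter, so
    // allocator_traits cannot deduce the rebound type on its own.
    template <typename U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (posix_memalign(&p, Alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { free(p); }
};

// The allocator is stateless, so any two instances compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return false; }

// std::vector whose buffer starts on a 32-byte boundary.
template <typename T>
using avx_vector = std::vector<T, aligned_allocator<T, 32>>;

// Hypothetical helper: true if p lies on an `alignment`-byte boundary.
bool starts_on_boundary(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// The loop from the question on aligned vectors, summed into a local.
int aligned_vector_sum(unsigned count) {
    avx_vector<int> G_qh_sp(10000, 1), B_erm_qh(10000, 50);
    avx_vector<int> G_qv_sp(10000, 3), B_erm_qv(10000, 2);
    assert(count <= G_qh_sp.size());  // stay inside the vectors
    int sum = 5;                      // initial value of ffs_psd[0]
    for (unsigned ang = 0; ang < count; ++ang)
        sum += G_qh_sp[ang] * B_erm_qh[ang] + G_qv_sp[ang] * B_erm_qv[ang];
    return sum;
}
```

Note that the alignment here is only a run-time property of the buffer. The compiler cannot see it in the type, so this sketch by itself does not promise aligned loads; GCC's __builtin_assume_aligned is one way to pass that information on, if it turns out to be needed.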