ARM下编译NEON代码出错
Error compiling NEON code under ARM
我正在尝试将 SSE4 优化代码移植到使用以下 header 优化的 NEON:
https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NEON.h
在 ODROID-xu4 上编译这段代码时出现编译错误:
https://github.com/k06a/creepMiner/tree/feature/neon
[ 2%] Building CXX object CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o
In file included from /root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:123:0,
from /root/creepMiner-neon/src/shabal/mshabal/mshabal_neon.cpp:22:
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h: In function '__m128i _mm_set1_epi32(int)':
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h:6733:1: error: inlining failed in call to always_inline 'int32x4_t vdupq_n_s32(int32_t)': target specific option mismatch
vdupq_n_s32 (int32_t __a)
^~~~~~~~~~~
In file included from /root/creepMiner-neon/src/shabal/mshabal/mshabal_neon.cpp:22:0:
/root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:230:7: note: called from here
(x)
^
/root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:383:12: note: in expansion of macro 'vreinterpretq_m128i_s32'
return vreinterpretq_m128i_s32(vdupq_n_s32(_i));
^~~~~~~~~~~~~~~~~~~~~~~
CMakeFiles/creepMiner.dir/build.make:878: recipe for target 'CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o' failed
make[2]: *** [CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/creepMiner.dir/all' failed
make[1]: *** [CMakeFiles/creepMiner.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2
源文件有以下特定选项:
-marm -march=armv7-a+simd -mtune=cortex-a15.cortex-a7
CMakeLists.txt:
if (USE_NEON AND NOT MINIMAL_BUILD)
add_definitions(-DUSE_NEON)
set(SOURCE_FILES ${SOURCE_FILES} src/shabal/mshabal/mshabal_neon.cpp)
if (UNIX OR APPLE)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -marm)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -march=armv7-a+simd)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -mtune=cortex-a15.cortex-a7)
elseif (MSVC)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS /arch:ARMv7)
endif ()
endif ()
看起来当前架构不支持 vdupq_n_s32
,但应该支持,因为 armv7
支持。
处理器信息:
$ cat /proc/cpuinfo
给出以下内容:
processor : 0
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 1
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 2
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 3
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 4
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 5
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 6
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 7
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
Hardware : ODROID-XU4
Revision : 0100
Serial : 0000000000000000
获取原生架构:
gcc -march=native -v
提供以下内容:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/7/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.3.0-16ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --with-as=/usr/bin/arm-linux-gnueabihf-as --with-ld=/usr/bin/arm-linux-gnueabihf-ld --program-suffix=-7 --program-prefix=arm-linux-gnueabihf- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)
也许这是个问题?我只看到 --with-arch=armv7-a --with-fpu=vfpv3-d16
支持,但应该是 vfpv4
支持。是吗?我应该重新配置 GCC 吗?这会有帮助吗?
-mfpu=neon
应该可以解决问题。
顺便说一句,您真的希望只包含头文件就可以解决问题吗?
NEON 有大量英特尔机器上不可用的指令,尤其是在排列方面。
你会得到很多 vtbl
指令,这些指令到处都是令人讨厌的延迟,疯狂地消耗周期。
一味依赖别人的通用解决方案不能称为优化IMO。
我正在尝试将 SSE4 优化代码移植到使用以下 header 优化的 NEON: https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NEON.h
在 ODROID-xu4 上编译这段代码时出现编译错误: https://github.com/k06a/creepMiner/tree/feature/neon
[ 2%] Building CXX object CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o
In file included from /root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:123:0,
from /root/creepMiner-neon/src/shabal/mshabal/mshabal_neon.cpp:22:
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h: In function '__m128i _mm_set1_epi32(int)':
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h:6733:1: error: inlining failed in call to always_inline 'int32x4_t vdupq_n_s32(int32_t)': target specific option mismatch
vdupq_n_s32 (int32_t __a)
^~~~~~~~~~~
In file included from /root/creepMiner-neon/src/shabal/mshabal/mshabal_neon.cpp:22:0:
/root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:230:7: note: called from here
(x)
^
/root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:383:12: note: in expansion of macro 'vreinterpretq_m128i_s32'
return vreinterpretq_m128i_s32(vdupq_n_s32(_i));
^~~~~~~~~~~~~~~~~~~~~~~
CMakeFiles/creepMiner.dir/build.make:878: recipe for target 'CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o' failed
make[2]: *** [CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/creepMiner.dir/all' failed
make[1]: *** [CMakeFiles/creepMiner.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2
源文件有以下特定选项:
-marm -march=armv7-a+simd -mtune=cortex-a15.cortex-a7
CMakeLists.txt:
if (USE_NEON AND NOT MINIMAL_BUILD)
add_definitions(-DUSE_NEON)
set(SOURCE_FILES ${SOURCE_FILES} src/shabal/mshabal/mshabal_neon.cpp)
if (UNIX OR APPLE)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -marm)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -march=armv7-a+simd)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -mtune=cortex-a15.cortex-a7)
elseif (MSVC)
set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS /arch:ARMv7)
endif ()
endif ()
看起来当前架构不支持 vdupq_n_s32
,但应该支持,因为 armv7
支持。
处理器信息:
$ cat /proc/cpuinfo
给出以下内容:
processor : 0
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 1
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 2
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 3
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 90.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3
processor : 4
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 5
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 6
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
processor : 7
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 120.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3
Hardware : ODROID-XU4
Revision : 0100
Serial : 0000000000000000
获取原生架构:
gcc -march=native -v
提供以下内容:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/7/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.3.0-16ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --with-as=/usr/bin/arm-linux-gnueabihf-as --with-ld=/usr/bin/arm-linux-gnueabihf-ld --program-suffix=-7 --program-prefix=arm-linux-gnueabihf- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)
也许这是个问题?我只看到 --with-arch=armv7-a --with-fpu=vfpv3-d16
支持,但应该是 vfpv4
支持。是吗?我应该重新配置 GCC 吗?这会有帮助吗?
-mfpu=neon
应该可以解决问题。
顺便说一句,您真的希望只包含头文件就可以解决问题吗?
NEON 有大量英特尔机器上不可用的指令,尤其是在排列方面。
你会得到很多 vtbl
指令,这些指令到处都是令人讨厌的延迟,疯狂地消耗周期。
一味依赖别人的通用解决方案不能称为优化IMO。