Eventual ARM Linux Memory Fragmentation with NEON Copy but not memcpy
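I'm on a BeagleBone X-15 (ARM Cortex-A15) board running Linux 4.4. My application mmaps the output of the SGX GPU and needs to copy the DRM backing store.
Both memcpy and my custom NEON copy code work, but the NEON code is much faster (~11 ms vs. ~35 ms).
I've noticed that after about 12500 seconds, when I use the NEON version of the copy, Linux kills the application for being Out of Memory (OOM). When I change that one line from the NEON copy to the standard memcpy and run the application, it runs indefinitely (12 hours so far...), but the copy is slower.
I've pasted the mmap, copy, and NEON copy code below. Is there really something wrong with my NEON copy? Thanks.
NEON copy: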
/**
* CompOpenGL neonCopyRGBAtoRGBA()
* Purpose: neonCopyRGBAtoRGBA - Software NEON copy
*
* @param src - Source buffer
* @param dst - Destination buffer
* @param numpix - Number of pixels to convert
*/
__attribute__((noinline)) void CompOpenGL::neonCopyRGBAtoRGBA(unsigned char* src, unsigned char* dst, int numPix)
{
(void)src;
(void)dst;
(void)numPix;
// This case takes RGBA -> BGRA
__asm__ volatile(
"mov r3, r3, lsr #3\n" /* Divide number of pixels by 8 because we process them 8 at a time */
"loopRGBACopy:\n"
"vld4.8 {d0-d3}, [r1]!\n" /* Load 8 pixels into d0 through d3. d0 = R[0-7], d1 = G[0-7], d2 = B[0-7], d3 = A[0-7] */
"subs r3, r3, #1\n" /* Decrement the loop counter */
"vst4.8 {d0-d3}, [r2]!\n" /* Store the RGBA into destination 8 pixels at a time */
"bgt loopRGBACopy\n"
"bx lr\n"
);
}
Mmap and copy code here:
union gbm_bo_handle handleUnion = gbm_bo_get_handle(m_Fb->bo);
struct drm_omap_gem_info gemInfo;
char *gpuMmapFrame = NULL;
gemInfo.handle = handleUnion.s32;
int ret = drmCommandWriteRead(m_DRMController->m_Fd, DRM_OMAP_GEM_INFO,&gemInfo, sizeof(gemInfo));
if (ret) {
qDebug() << "Cannot set write/read";
}
else {
// Mmap the frame
gpuMmapFrame = (char *)mmap(0, gemInfo.size, PROT_READ | PROT_WRITE, MAP_SHARED,m_DRMController->m_Fd, gemInfo.offset);
if ( gpuMmapFrame != MAP_FAILED ) {
QElapsedTimer timer;
timer.restart();
//m_OGLController->neonCopyRGBAtoRGBA((uchar*)gpuMmapFrame, (uchar*)m_cpyFrame,dmaBuf.width * dmaBuf.height);
memcpy(m_cpyFrame,gpuMmapFrame,dmaBuf.height * dmaBuf.width * 4);
qDebug() << "Copy Performance: " << timer.elapsed();
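The good news is that your function will run much faster if you replace vld4/vst4 with vld1/vst1. For a straight copy there is no need to de-interleave the channels on load and re-interleave them on store.
The bad news is that you have to tell the compiler which registers you use and modify, including CPSR (the "cc" clobber) and memory, and you must not return from inside inline assembly (the bx lr).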
__asm__ volatile(
"mov r3, r3, lsr #3\n" /* Divide number of pixels by 8 because we process them 8 at a time */
"loopRGBACopy:\n"
"vld1.8 {d0-d3}, [r1]!\n" /* Load 32 bytes (8 RGBA pixels) into d0 through d3 in memory order, without de-interleaving */
"subs r3, r3, #1\n" /* Decrement the loop counter */
"vst1.8 {d0-d3}, [r2]!\n" /* Store the RGBA into destination 8 pixels at a time */
"bgt loopRGBACopy\n"
::: "r1", "r2", "r3", "d0", "d1", "d2", "d3", "cc", "memory"
);
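The snippet above still assumes that src, dst and numPix happen to be in r1, r2 and r3 when the asm statement starts. A minimal sketch of one way to avoid that assumption is to pass the variables as read-write operands and let the compiler choose the registers. The free function copyRGBA, its signature, the memcpy tail for pixel counts that are not a multiple of 8, and the non-ARM fallback are all my own assumptions for illustration; they are not from the original code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): copies numPix RGBA pixels.
void copyRGBA(const uint8_t* src, uint8_t* dst, size_t numPix)
{
    size_t blocks = numPix / 8;           // 8 pixels (32 bytes) per loop iteration
    size_t tailBytes = (numPix % 8) * 4;  // leftover pixels, copied with memcpy

#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    if (blocks > 0) {
        __asm__ volatile(
            "1:\n"                              // local numeric label, safe if inlined
            "vld1.8 {d0-d3}, [%[src]]!\n"       // load 32 bytes, advance src
            "subs   %[n], %[n], #1\n"           // decrement the block counter, set flags
            "vst1.8 {d0-d3}, [%[dst]]!\n"       // store 32 bytes, advance dst
            "bne    1b\n"                       // loop until the counter reaches zero
            : [src] "+r"(src), [dst] "+r"(dst), [n] "+r"(blocks)  // read and written
            :
            : "d0", "d1", "d2", "d3", "cc", "memory");
    }
#else
    // Fallback for targets without 32-bit ARM NEON, so the sketch builds anywhere.
    std::memcpy(dst, src, blocks * 32);
    src += blocks * 32;
    dst += blocks * 32;
#endif

    // src and dst now point just past the last full block in both paths.
    if (tailBytes > 0) {
        std::memcpy(dst, src, tailBytes);
    }
}
```

Because the pointers and the counter are declared as "+r" operands, the compiler knows they are modified, so they no longer need to appear in the clobber list. There is also no bx lr: the function returns through the compiler's normal epilogue.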