How do I optimize a block copy and right shift + saturate to max=5, for Cortex-M3?
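Basically, I need to improve the efficiency of this code, either by reducing its overall size so that it takes less memory, or by making it run faster. I am using Thumb-2 on a Cortex-M3.
I have tried reducing the number of MOV instructions used. That does reduce the overall code size, but because of the way the code works, each individual section is needed to get a result and store it in a register, so I am stuck on how to improve it. The code below is in its original, unmodified state.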
        THUMB
        AREA    RESET, CODE, READONLY
        EXPORT  __Vectors
        EXPORT  Reset_Handler
__Vectors
        DCD     0x00180000          ; top of the stack
        DCD     Reset_Handler       ; reset vector - where the program starts
        AREA    2a_Code, CODE, READONLY
Reset_Handler
        ENTRY
num_words EQU   (end_source-source)/4 ; number of words to copy
start
        LDR     r0,=source          ; point to the start of the area of memory to copy from
        LDR     r1,=dest            ; point to the start of the area of memory to copy to
        MOV     r2,#num_words       ; get the number of words to copy
        ; find out how many blocks of 8 words need to be copied - it is assumed
        ; that it is faster to load 8 data items at a time, rather than load
        ; individually
block
        MOVS    r3,r2,LSR #3        ; find the number of blocks of 8 words
        BEQ     individ             ; if no blocks to copy, just copy individual words
        ; copy and process blocks of 8 words
block_loop
        LDMIA   r0!,{r5-r12}        ; get 8 words to copy as a block
        MOV     r4,r5               ; get first item
        BL      data_processing     ; process first item
        MOV     r5,r4               ; keep first item
        MOV     r4,r6               ; get second item
        BL      data_processing     ; process second item
        MOV     r6,r4               ; keep second item
        MOV     r4,r7               ; get third item
        BL      data_processing     ; process third item
        MOV     r7,r4               ; keep third item
        MOV     r4,r8               ; get fourth item
        BL      data_processing     ; process fourth item
        MOV     r8,r4               ; keep fourth item
        MOV     r4,r9               ; get fifth item
        BL      data_processing     ; process fifth item
        MOV     r9,r4               ; keep fifth item
        MOV     r4,r10              ; get sixth item
        BL      data_processing     ; process sixth item
        MOV     r10,r4              ; keep sixth item
        MOV     r4,r11              ; get seventh item
        BL      data_processing     ; process seventh item
        MOV     r11,r4              ; keep seventh item
        MOV     r4,r12              ; get eighth item
        BL      data_processing     ; process eighth item
        MOV     r12,r4              ; keep eighth item
        STMIA   r1!,{r5-r12}        ; copy the 8 words
        SUBS    r3,r3,#1            ; move on to the next block
        BNE     block_loop          ; continue until last block reached
        ; there may now be some data items available (fewer than 8)
        ; find out how many of these individual words need to be copied
individ
        ANDS    r3,r2,#7            ; find the number of words that remain to copy individually
        BEQ     exit                ; skip individual copying if none remains
        ; copy the excess of words
individ_loop
        LDR     r4,[r0],#4          ; get next word to copy
        BL      data_processing     ; process the item read
        STR     r4,[r1],#4          ; copy the word
        SUBS    r3,r3,#1            ; move on to the next word
        BNE     individ_loop        ; continue until the last word reached
        ; languish in an endless loop once all is done
exit
        B       exit
        ; subroutine to scale a value by 0.5 and then saturate values to a maximum of 5
data_processing
        CMP     r4,#10              ; check whether saturation is needed
        BLT     divide_by_two       ; if not, just divide by 2
        MOV     r4,#5               ; saturate to 5
        BX      lr
divide_by_two
        MOV     r4,r4,LSR #1        ; perform scaling
        BX      lr
        AREA    2a_ROData, DATA, READONLY
source                              ; some data to copy
        DCD     1,2,3,4,5,6,7,8,9,10,11,0,4,6,12,15,13,8,5,4,3,2,1,6,23,11,9,10
end_source
        AREA    2a_RWData, DATA, READWRITE
dest                                ; copy to this area of memory
        SPACE   end_source-source
end_dest
        END
Basically, I need the code to still store the results in each register, while also being smaller or faster to execute. Thanks for your help.
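Here is a slightly optimized version of the main loop. Given that you are programming for a Cortex-M3, there is no real possibility of superscalar or SIMD processing, because your CPU supports neither. The main differences from your code are:
- All relevant functions are inlined
- The logic is optimized a bit
- Useless move instructions have been omitted
This code runs in 10 cycles per table entry, plus a few cycles for the initial branch and the eventual branch misprediction.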
        .syntax unified
        .thumb
        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
block:  cbz     r2, .Lbxlr      @ 2 return if nothing to copy
.Loop:  ldmia   r0!, {r3}       @ 2 load one item from source
        cmp     r3, #10         @ 1 need to scale?
        ite     lt              @ 1 if r3 < 10:
        lsrlt   r3, r3, #1      @ 1 then r3 >>= 1
        movge   r3, #5          @ 1 else r3 = 5
        stmia   r1!, {r3}       @ 2 store to destination
        subs    r2, r2, #1      @ 1 decrement #words
        bne     .Loop           @ 1 continue if not done yet
.Lbxlr: bx      lr
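By unrolling the loop once, you can bring this down to 16 cycles for two entries (8 cycles per entry). Note that this almost triples the code length for only a small gain in performance.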
        .syntax unified
        .thumb
        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
        @ first check if the number of elements is even or odd
        @ leave this out if it's known to be even
block:  tst     r2, #1          @ 1 odd number of entries to copy?
        beq     .Leven          @ 2 if not, skip the single odd entry
        ldmia   r0!, {r3}       @ 2 load one item from source
        cmp     r3, #10         @ 1 need to scale?
        ite     lt              @ 1 if r3 < 10:
        lsrlt   r3, r3, #1      @ 1 then r3 >>= 1
        movge   r3, #5          @ 1 else r3 = 5
        stmia   r1!, {r3}       @ 2 store to destination
        subs    r2, r2, #1      @ 1 decrement #words
        @ check if any elements are left
        @ remove if you know that at least two elements are present
.Leven: cbz     r2, .Lbxlr      @ 2 return if no entries left
.Loop:  ldmia   r0!, {r3, r4}   @ 3 load two items from source
        cmp     r3, #10         @ 1 need to scale?
        ite     lt              @ 1 if r3 < 10:
        lsrlt   r3, r3, #1      @ 1 then r3 >>= 1
        movge   r3, #5          @ 1 else r3 = 5
        cmp     r4, #10         @ 1 need to scale?
        ite     lt              @ 1 if r4 < 10:
        lsrlt   r4, r4, #1      @ 1 then r4 >>= 1
        movge   r4, #5          @ 1 else r4 = 5
        stmia   r1!, {r3, r4}   @ 3 store to destination
        subs    r2, r2, #2      @ 1 decrement #words twice
        bne     .Loop           @ 1 continue if not done yet
.Lbxlr: bx      lr
Unrolling the loop four times gets you to 7 cycles per element, but I think that is overkill.
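For reference, a four-way unrolled loop could look roughly like the sketch below. It is not taken from the listings above: the labels `block4`, `.Loop4` and `.Ldone` are made up, the element count in r2 is assumed to be a multiple of 4 (which holds for the 28-word table in the question), and the cycle annotations follow the same counting convention as the other listings.

```asm
        .syntax unified
        .thumb
        @ r0: source
        @ r1: destination
        @ r2: number of words to copy, ASSUMED to be a multiple of 4
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
block4: cbz     r2, .Ldone      @ 2 return if nothing to copy
        push    {r4-r6}         @ 4 preserve callee-saved registers (AAPCS callers only)
.Loop4: ldmia   r0!, {r3-r6}    @ 5 load four items from source
        cmp     r3, #10         @ 1 need to scale?
        ite     lt              @ 1 if r3 < 10:
        lsrlt   r3, r3, #1      @ 1 then r3 >>= 1
        movge   r3, #5          @ 1 else r3 = 5
        cmp     r4, #10         @ 1 need to scale?
        ite     lt              @ 1 if r4 < 10:
        lsrlt   r4, r4, #1      @ 1 then r4 >>= 1
        movge   r4, #5          @ 1 else r4 = 5
        cmp     r5, #10         @ 1 need to scale?
        ite     lt              @ 1 if r5 < 10:
        lsrlt   r5, r5, #1      @ 1 then r5 >>= 1
        movge   r5, #5          @ 1 else r5 = 5
        cmp     r6, #10         @ 1 need to scale?
        ite     lt              @ 1 if r6 < 10:
        lsrlt   r6, r6, #1      @ 1 then r6 >>= 1
        movge   r6, #5          @ 1 else r6 = 5
        stmia   r1!, {r3-r6}    @ 5 store four items to destination
        subs    r2, r2, #4      @ 1 decrement #words by four
        bne     .Loop4          @ 1 continue if not done yet
        pop     {r4-r6}         @ 4 restore callee-saved registers
.Ldone: bx      lr
```

By that counting the loop body costs 5 + 16 + 5 + 1 + 1 = 28 cycles for four elements, or 7 per element. The `push`/`pop` pair only matters if the routine is called from code that follows the AAPCS; in a standalone program like the one in the question it can be dropped.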
Note that this code is in GNU as syntax. Modifying it for your assembler should be trivial.
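As an illustration of that conversion, here is a sketch of the single-item loop in armasm syntax, which is what the question appears to use. The labels `copy_scale`, `loop` and `done` are made up (chosen so they do not clash with the existing `block` label), and the single-register `ldmia`/`stmia` are swapped for post-indexed `LDR`/`STR`, as in the question's `individ_loop`. Treat it as a starting point under those assumptions rather than a drop-in replacement.

```asm
        ; place inside a CODE area assembled with THUMB
        ; r0: source
        ; r1: destination
        ; r2: number of words to copy
copy_scale
        CBZ     r2, done            ; return if nothing to copy
loop
        LDR     r3, [r0], #4        ; load one item from source, advance pointer
        CMP     r3, #10             ; need to scale?
        ITE     LT                  ; if r3 < 10:
        LSRLT   r3, r3, #1          ;   then r3 >>= 1
        MOVGE   r3, #5              ;   else r3 = 5
        STR     r3, [r1], #4        ; store to destination, advance pointer
        SUBS    r2, r2, #1          ; decrement #words
        BNE     loop                ; continue if not done yet
done
        BX      lr
```

In the question's program this would be reached by keeping the three setup instructions at `start` and following them with `BL copy_scale` and the existing `B exit` loop; the 8-word block code, `individ_loop` and `data_processing` would then no longer be needed.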