需要未移位的寄存器 - 汇编器在 TST 指令上抛出错误

Question

我目前正在将算法从 C 重写到 arm 汇编 (ARM Cortex M4 CPU)。

我的代码有什么作用？

该算法以一个 8 位数字作为输入，从右边开始告诉我们第一个为 0 的位是多少。这里有几个例子：

输入：B01111111 Output:7

输入：B01110111 Output:3

输入：B11111110 Output:0

这是完成此操作的原始 C 代码：

uint8_t find_empty(uint32_t input_word)
{
  for (uint8_t searches=7; searches>=0; searches--)
  {
    if ((input_word&1)==0)
    {
      return 7-searches;
    }
    
    input_word=input_word>>1;
  }
  return 255;
}

这是我在 ARM (Cortex M4) 程序集中重写它的初学者尝试。

.global findEmpty
findEmpty:
    mov r1, r0 //Move input_word to r1
    
    //Config
    mov r0, #7 //search through 8 (7+1) bits. <-searches

    FindLoop:
      tst r1, #1 //ANDs input_word with 1, sets the Z flag accordingly.
      
      beq NotFoundYet //didn't get a 0, jump forward
        rsb r0, r0, #7 //searches=7-searches <- which bit is 0? 
        bx lr //Return found bit number
      
      NotFoundYet:
      lsr r1, r1, #1 //input_word=input_word>>1

      sub r0, r0, #1 //Decrement searches
      cmp r0, #0
      bpl FindLoop //If searches>=0, do the loop again. 
    mov r0, #255 //We didn't find anything. Return 255 to signal that
    bx lr

快速说明： 我在这里使用 r1 作为变量，我听说你不应该这样做，因为编译器（我正在使用 gcc 将我的程序集“.S”文件链接到 C 文件）使用 r0-r3 来传递数据和接收数据职能。但是，正因为如此，它不会将这些寄存器用于重要的事情，所以我不必处理将内容推入堆栈的问题，从而节省了周期。

有什么问题吗？

当我尝试编译我的项目时，gcc 在 TST 行上给我一个汇编程序错误：

汇编程序消息： 错误：需要未移位的寄存器 -- `tst r1, #1'

这让我很困惑，因为我查看了 TST instruction and the LSR instruction which I am using later to shift r1 by 1. Yet none of them say anything about not being able to work together. I’ve looked online for other discussions on this topic. I came across this discussion 的 keil 站点，人们说告诉编译器在 ARM 模式下编译，但我的代码已经是运行在 ARM 模式下，而不是 Thumb。我通过制作另一个 .global 子例程并尝试将超过 7 的立即数添加到数字中来确认这一点，并且确定它不起作用，就像 CPU 处于 ARM 模式时它不应该一样。

.global illegal_add
illegal_add:
    add r0, r0, #20
    bx lr

我知之甚少，不知道如何尝试解决这个问题。如果有人对要尝试的事情有任何想法，请告诉我。谢谢你的帮助。

Answer 1

我不是 100% 清楚问题出在哪里。您很可能忘记正确设置程序集。要解决此问题，请在文件开头发出这些指令：

.syntax unified
.cpu cortex-m4
.thumb

如果我把这些放在你的代码前面，它在我的机器上组装得很好。

一些一般提示：

阅读哪些指令是 16 位可编码的，并尝试从中挑选指令。 16 位指令执行速度更快，消耗的内存更少。例如，您可以使用 lsrs r1, r1, #1 而不是 lsr r1, r1, #1 来获得 16 位指令。
更聪明地操纵旗帜。许多指令已经为您设置了标志，如果您聪明的话，您可能可以避免所有 tst 和 cmp 指令。例如，如果你使用 subs r0, r0, #1 而不是 sub r0, r0, #1 你保存一个字节（16 位指令）并且已经根据 r0 设置了 Z 标志，为你保存后续的 cmp说明。

Answer 2

tst reg,imm 有一个 32 位 Thumb2 形式，如果您使用正确的汇编程序选项和指令，它就可以工作。但是对于找到最低0位或1位的位置没有用。

您只需要三个 T32 指令，每个指令只执行一次（无 iteration/no CPSR）：

mvn     r0, r0 // binary negate the value
rbit    r0, r0 // reverse bit
clz     r0, r0 // count leading zeroes

如果没有找到零，它将 return 8 而不是 255。你最好写一个内联 inline-assembly 函数，因为这个函数又小又快，以至于函数调用的开销会比函数本身大。

正如有人提到的，您只需要聪明一点，而这个人似乎并不聪明。

或者您可以使用内置函数并在 C:

中编写整个函数

static inline uint32_t find_empty(uint32_t input_word)
{
    input_word = ~input_word;
    input_word = __rbit(input_word);
    input_word = __clz(input_word);
    return input_word;
}

您的工具链很可能同时支持 rbit 和 clz，您只需要找到正确的语法即可。

您甚至可以轻松地将其转换为 x86（甚至更好，因为 x86 直接“计算尾随零”，尽管 AMD CPU 将其解码为 2 微指令，可能在内部 bit-reversing像我们必须在 ARM 上手动那样喂 lzcnt):

#include <immintrin.h>
static inline uint32_t find_empty(uint32_t input)
{
    input = ~input;
    return _tzcnt_u32(input);
}

如果您的编译器支持 GNU extensions，__builtin_ctz(~input) 可移植地计算尾随零，在 ARM 上使用 rbit/clz，在 x86 上使用 tzcnt 或 bsf。（但请注意，如果 ~input 为 all-zero，则结果是未定义的 int 值，因为可能使用 x86 bsf。解决此问题的一种方法是 (1<<8)|(~input) 所以它肯定有一点可以找到。）

无论如何，这将带来巨大的性能提升：

没有 CPSR dependency/corruption
无外部依赖
无需额外注册
没有分支

由于上述原因，编译器将能够出色地安排那些 two-three 指令。

需要未移位的寄存器 - 汇编器在 TST 指令上抛出错误

unshifted register required - Assembler throws error on the TST instruction

c

assembly

arm

thumb

cortex-m