Javascript 性能:递减循环中负数的模运算使代码减慢了 100% 以上

Javascript Performance: Modulus operation of negative Number within decrementing loop slowing the code by more than 100%

我正在经历 Eloquent JavaScript (again) and came across exercise "Chess Board" of Chapter 2. I had my one decent version of solution written back in the day when I was first time reading it, and another version of solution provided at the Elequent Javascript website。我是 新手 中的一员,他们想成为超级高效的程序员,脑子里只有一个问题:"Can I make it work a tad faster or smaller in anyway?"

所以,几个月前我在网上搜索时,我在 Stack Overflow 上遇到了 a question,关于 for 循环与 while 循环的性能。因为在那个线程中提到 for 循环比 while 慢,而递减迭代器的循环更快所以我重写了代码以获得更好的性能。

这是新版本,其中 for 替换为 while 并针对递减编辑了条件:

console.time("looping");
var gridSize = 5000, str = '', i = gridSize, j;
while (i--) {
  j = gridSize;
  while (j--) {
    if ((i - j) % 2 === 0)
      str += " ";
    else
      str += "#";
  }
  str += "\n";
}

//console.log(str);
console.timeEnd("looping");

但令我惊讶的是这段代码的性能更差。然后,过了一会儿我发现 if ((i - j) % 2 === 0) 是罪魁祸首,用加号代替了减号将总执行时间减少到 ~ 750ms

//Checked on NODE.js = v6.11.2
Book version of code         -->    893.76 ms
while loop with subtraction  -->    1562.43 ms
while loop with addition     -->    749.62 ms


//firefox Benchmark v54.0 (64-bit) (OS Ubuntu 16.04)
Book version of code         -->    725.10 ms
while loop with subtraction  -->    1565.29 ms
while loop with addition     -->    601.12 ms

为什么减法对总执行时间有如此巨大的影响?

编辑 01 6:30 8 月 8 日上午(格林威治标准时间)

在查看了@jaromandaX 的回答后,我很确定减慢循环速度的不是减法,而是负数的模数。 我又想知道是什么让负数的模变慢了

而不是在

中使用昂贵的模块化操作
((i - j) % 2 === 0)

你可以使用位运算

(((i-j)&1) === 0)

正如 SBS 在评论中所建议的,您也应该尝试

(((i^j)&1) === 0)

我在 nodejs show 中的测试(平均每个测试 5 运行s)

(i - j) % 2 // 1170ms
(i + j) % 2 //  720ms
Math.abs(i - j) % 2 // 720ms
Math.abs(i + j) % 2 // 720ms
(gridSize + i + j) % 2 // 715ms
(gridSize + i - j) % 2 // 710ms
(-i - j) % 2 // 1500ms

有些奇怪,最大的惊喜是调用 Math.absi + j 的影响几乎为零,但更令人惊讶的是添加 gridSize 使得 gridSize + i - j案例最快!!

但我可以从中得出的结论是,主要问题在于

(i - j) % 2

很多(i - j)< 0(一半?)

(-i - j) ALL 值是 < 0

Conclusion: When faced with a modulo operation on a negative number, performance decreases significantly

请注意,您应该可以使用

console.time("looping");
...
console.timeEnd("looping");

也在您的浏览器中,因此您可以 运行 相同的代码而无需在浏览器中使用 performance.now()

不确定这个 "benchmark" 有多浓,但是

console.time("positive");
(function() {
    var size = 100000;
    var v = 0;
    while (size--) {
        v+=(+size)%2
    }
})();
console.timeEnd("positive");

console.time("negative");
(function() {
    var size = 100000;
    var v = 0;
    while (size--) {
        v+=(-size)%2
    }
})();
console.timeEnd("negative");

这远非完整的答案,需要进一步调查(或了解 V8 实施细节的人的见解)。不过,这是我的发现:

Sidenode:使用以下命令行收集结果 运行 Node.JS:

node --expose-gc --print-code --code-comments --print-opt-code --trace-hydrogen --redirect-code-traces --redirect-code-traces-to=code.asm --trace_representation --trace-deopt --trace-opt 1.js

并查看 V8 源代码。

1. 性能差异来自于在这些情况下生成不同的机器代码这一事实。对于 + % 的代码是

                  ;;; <@134,#123> add-i
00000151A32DD74B   395  03c2           addl rax,rdx
00000151A32DD74D   397  0f807a030000   jo 1293  (00000151A32DDACD)
                  ;;; <@136,#126> mod-by-power-of-2-i
00000151A32DD753   403  85c0           testl rax,rax
00000151A32DD755   405  790f           jns 422  (00000151A32DD766)
00000151A32DD757   407  f7d8           negl rax
00000151A32DD759   409  83e001         andl rax,0x1
00000151A32DD75C   412  f7d8           negl rax
00000151A32DD75E   414  0f846e030000   jz 1298  (00000151A32DDAD2)
00000151A32DD764   420  eb03           jmp 425  (00000151A32DD769)
00000151A32DD766   422  83e001         andl rax,0x1
                  ;;; <@138,#200> smi-tag
00000151A32DD769   425  8bd8           movl rbx,rax
00000151A32DD76B   427  48c1e320       REX.W shlq rbx, 32
                  ;;; <@140,#130> gap
00000151A32DD76F   431  488bc2         REX.W movq rax,rdx

- 的代码要复杂得多:

                  ;;; <@136,#123> sub-i
00000151A32E57E1   417  412bc3         subl rax,r11
00000151A32E57E4   420  0f8039040000   jo 1507  (00000151A32E5C23)
                  ;;; <@138,#200> int32-to-double
00000151A32E57EA   426  c5f957c0       vxorpd xmm0,xmm0,xmm0
00000151A32E57EE   430  c5fb2ac0       vcvtlsi2sd xmm0,xmm0,rax
                  ;;; <@139,#200> gap
00000151A32E57F2   434  c5f928ca       vmovapd xmm1,xmm2
                  ;;; <@140,#126> mod-d
00000151A32E57F6   438  4989e2         REX.W movq r10,rsp
00000151A32E57F9   441  4883ec28       REX.W subq rsp,0x28
00000151A32E57FD   445  4883e4f0       REX.W andq rsp,0xf0
00000151A32E5801   449  4c89542420     REX.W movq [rsp+0x20],r10
00000151A32E5806   454  48b830d4124001000000 REX.W movq rax,000000014012D430
00000151A32E5810   464  ffd0           call rax
00000151A32E5812   466  488b642420     REX.W movq rsp,[rsp+0x20]
                  ;;; <@142,#126> lazy-bailout
                  ;;; <@144,#202> number-tag-d
00000151A32E5817   471  498b9dc06f0400 REX.W movq rbx,[r13+0x46fc0]
00000151A32E581E   478  488bc3         REX.W movq rax,rbx
00000151A32E5821   481  4883c010       REX.W addq rax,0x10
00000151A32E5825   485  493b85c86f0400 REX.W cmpq rax,[r13+0x46fc8]
00000151A32E582C   492  0f878f030000   ja 1409  (00000151A32E5BC1)
00000151A32E5832   498  498985c06f0400 REX.W movq [r13+0x46fc0],rax
00000151A32E5839   505  48ffc3         REX.W incq rbx
00000151A32E583C   508  4d8b5560       REX.W movq r10,[r13+0x60]
00000151A32E5840   512  4c8953ff       REX.W movq [rbx-0x1],r10
00000151A32E5844   516  c5fb114307     vmovsd [rbx+0x7],xmm0
                  ;;; <@146,#130> gap
00000151A32E5849   521  488b45a0       REX.W movq rax,[rbp-0x60]
00000151A32E584D   525  488b7db8       REX.W movq rdi,[rbp-0x48]
00000151A32E5851   529  488b75c0       REX.W movq rsi,[rbp-0x40]
00000151A32E5855   533  488b4dc8       REX.W movq rcx,[rbp-0x38]
00000151A32E5859   537  488b55b0       REX.W movq rdx,[rbp-0x50]
00000151A32E585D   541  4c8b4da8       REX.W movq r9,[rbp-0x58]
00000151A32E5861   545  4c8b4598       REX.W movq r8,[rbp-0x68]
00000151A32E5865   549  c5fb109578ffffff vmovsd xmm2,[rbp-0x88]

简而言之,区别在于 "plus" 案例 Mod (%) 操作是使用高度专业化的 mod-by-power-of-2-i 机器代码执行的,但 "minus" 情况下,它是使用 mod-d 完成的,这是基于浮点数的算术实现。

另请注意,mod-by-power-of-2-i 机器代码确实支持负值。大致可以改写成这样:

if (rax < 0) {
    rax = -rax;
    rax = rax & 1;
    rax = -rax;
}
else {
    rax = rax & 1;
}

所以这不是只针对正值优化机器代码的情况。

2. 生成代码的差异似乎是因为类型推断的工作方式不同。 --trace_representation 生成的日志显示简化程序的 "plus" 和 "minus" 案例之间的以下差异:

var f_minus = function(log) {
    var str = '', i = gridSize, j;
    while (i--) {
      j = gridSize;
      while (j--) {
        var ttt = (i - j) % 2
      }
    }

  if(log) {
     if(ttt == -1)
        console.log(t);
   }
}


var f_plus = function(log) {
    var str = '', i = gridSize, j;
    while (i--) {
      j = gridSize;
      while (j--) {
        var ttt = (i + j) % 2
      }
    }

    if(log){
     if(ttt == -1)
        console.log(t);
    }
}

比较

[marking 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
[compiling method 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> using Crankshaft OSR]
#37 Phi is used by real #110 Branch as v
#38 Phi is used by real #58 Add as t
#38 Phi is used by real #69 StackCheck as v
#38 Phi is used by real #70 LoadContextSlot as v
#38 Phi is used by real #122 CompareGeneric as t
#38 Phi is used by real #132 LoadGlobalGeneric as v
#38 Phi is used by real #134 LoadNamedGeneric as v
#38 Phi is used by real #136 LoadGlobalGeneric as v
#38 Phi is used by real #141 CallWithDescriptor as v
#38 Phi is used by real #156 Return as v
#38 Phi is used by real #101 Mod as t
#38 Phi is used by real #98 Sub as t
#38 Phi is used by real #95 StackCheck as v
#38 Phi is used by real #84 Add as t
#47 Phi is used by real #56 ForceRepresentation as s
#49 Phi is used by real #122 CompareGeneric as t
#77 Phi is used by real #83 ForceRepresentation as s
generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
Changing #101 Mod representation v -> i based on inputs
Changing #101 Mod representation i -> d based on output
Changing #98 Sub representation v -> s based on inputs
Changing #98 Sub representation s -> i based on use requirements
Changing #84 Add representation v -> i based on inputs
...

至此

[marking 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
[compiling method 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> using Crankshaft OSR]
#37 Phi is used by real #110 Branch as v
#38 Phi is used by real #58 Add as t
#38 Phi is used by real #69 StackCheck as v
#38 Phi is used by real #70 LoadContextSlot as v
#38 Phi is used by real #122 CompareGeneric as t
#38 Phi is used by real #132 LoadGlobalGeneric as v
#38 Phi is used by real #134 LoadNamedGeneric as v
#38 Phi is used by real #136 LoadGlobalGeneric as v
#38 Phi is used by real #141 CallWithDescriptor as v
#38 Phi is used by real #156 Return as v
#38 Phi is used by real #101 Mod as t
#38 Phi is used by real #98 Add as t
#38 Phi is used by real #95 StackCheck as v
#38 Phi is used by real #84 Add as t
#47 Phi is used by real #56 ForceRepresentation as s
#49 Phi is used by real #122 CompareGeneric as t
#77 Phi is used by real #83 ForceRepresentation as s
generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
Changing #101 Mod representation v -> i based on inputs
Changing #98 Add representation v -> s based on inputs
Changing #98 Add representation s -> i based on use requirements
Changing #84 Add representation v -> i based on inputs
...

有趣的区别在于线

Changing #101 Mod representation i -> d based on output

只存在于 f_minus 而不是 f_plus 的情况。出于某种原因,编译器认为在 f_minus 情况下,类型应该是 Double 而不是 Integer,基于输出值的猜测。有趣的是,如果我更改行

        var ttt = (i - j) % 2

        var ttt = (i - j + gridSize) % 2; 

它再次开始生成快速 mod-by-power-of-2-i 代码。所以是的,输出值很可能会影响优化编译器。但目前尚不清楚为什么在这种特殊情况下会发生这种情况。

乍一看,这种行为看起来像是优化编译器中的一个错误,它没有注意到 "minus" 情况也可以由 mod-by-power-of-2-i 处理。我无法追踪为什么会发生这种情况,只是浏览了源代码,所以欢迎进一步输入。

代码段中最耗时的操作是字符串连接。所以我决定删除它并改用实际的数组。以下是 2 个实现,一个来自书中,另一个来自您的代码。书籍版本在我的机器上快两倍。此外,进行此更改可使代码加速约 6 倍。也许数组调整大小在这里也有影响,我们也应该测试一个没有数组的版本。

我的结论:问题实际上可能不是负模数,但其他一些部分可能会影响基准。

console.time("looping");

var board = [];

var size = 5000;

for (var y = 0; y < size; y++) {
  board[y]=[];
  for (var x = 0; x < size; x++) {
    if ((x + y) % 2 === 0)
       board[y][x] = " ";
    else
       board[y][x] = "#";
   }
    
}

 
console.timeEnd("looping");
   

    console.time("looping2");
    var gridSize = 5000, str = [], i = gridSize, j;
    while (i--) {
      str[i]=[];
      j = gridSize;
      while (j--) {
        if ((i - j) % 2 === 0)
          str[i][j] = " ";
        else
          str[i][j] = "#";
      }
    }

     console.timeEnd("looping2");

您的主要问题是嵌套循环。尽量避免它们:

console.time("looping");
var gridSize = 5000;
var board = "";
var firstLine = "";
var secondLine = "";

var counter = 0;
var isBlack = true;

while (counter < gridSize) {
    firstLine += isBlack ? "#" : " ";
    secondLine += isBlack ? " " : "#";
    isBlack = !isBlack ;
    counter++ ;
}

var counter = 0;
var isBlack = false;
while (counter < gridSize) {
    board =  board + (isBlack ? firstLine : secondLine) + "\n";
    isBlack = !isBlack;
    counter++;
}
// console.log(board);
console.timeEnd("looping");

存在更有效的算法来用短的可重复模式填充大字符串。但是,代码会太复杂。我希望,这个算法被用在 padStart() 实现中。

console.time("looping");
var gridSize = 5000;
var firstLine = "".padStart(gridSize, " #");
var secondLine = "".padStart(gridSize, "# ");
var board = "".padStart(gridSize * (gridSize + 1), firstLine + "\n" + secondLine + "\n");
// console.log(board);
console.timeEnd("looping");

我机器上的结果(Windows 7,Firefox 54.0 64 位):

+------------------------+---------+
|         Method         |  Time   |
+------------------------+---------+
| nested loops, (-i-j)%2 | 1500 ms |
| nested loops,  (i+j)%2 |  500 ms |
| nested loops,  (i^j)&1 |  500 ms |
| manual filling         |    2 ms |
| padStart()             |    1 ms |
+------------------------+---------+

顺便说一下,国际棋联规则 says:

The chessboard is placed between the players in such a way that the near corner square to the right of the player is white.