Why is function composition in F# 60% slower than piping?
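Admittedly, I am unsure whether I am comparing apples with apples or apples with pears here. But I am particularly surprised by the size of the difference; I would have expected a slight one, if any.
Piping can often be expressed as function composition and vice versa, and I would assume the compiler knows that too, so I tried a little experiment: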
open System.Text

// simplified example of some SB helpers:
let inline bcreate() = new StringBuilder(64)
let inline bget (sb: StringBuilder) = sb.ToString()
let inline appendf fmt (sb: StringBuilder) = Printf.kbprintf (fun () -> sb) sb fmt
let inline appends (s: string) (sb: StringBuilder) = sb.Append s
let inline appendi (i: int) (sb: StringBuilder) = sb.Append i
let inline appendb (b: bool) (sb: StringBuilder) = sb.Append b

// test function for composition, putting some garbage data in SB
let compose a =
    (appends "START"
     >> appendb true
     >> appendi 10
     >> appendi a
     >> appends "0x"
     >> appendi 65535
     >> appendi 10
     >> appends "test"
     >> appends "END") (bcreate())

// test function for piping, putting the same garbage data in SB
let pipe a =
    bcreate()
    |> appends "START"
    |> appendb true
    |> appendi 10
    |> appendi a
    |> appends "0x"
    |> appendi 65535
    |> appendi 10
    |> appends "test"
    |> appends "END"
Testing this in FSI (64-bit enabled, --optimize flag on) gives:
> for i in 1 .. 500000 do compose 123 |> ignore;;
Real: 00:00:00.390, CPU: 00:00:00.390, GC gen0: 62, gen1: 1, gen2: 0
val it : unit = ()
> for i in 1 .. 500000 do pipe 123 |> ignore;;
Real: 00:00:00.249, CPU: 00:00:00.249, GC gen0: 27, gen1: 0, gen2: 0
val it : unit = ()
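A small difference would be understandable, but this is a slowdown by a factor of about 1.6 (60%).
I would actually expect the bulk of the work to happen in StringBuilder, but apparently the overhead of composition has a considerable influence.
I realize that in most practical situations this difference will be negligible, but if you are writing large formatted text files (such as log files), as in this case, it does have an impact.
I am using the latest version of F#.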
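Without going into the F# internals, I can tell from the generated IL that compose will yield lambdas (and a lot more of them if you turn optimization off), whereas in pipe all calls to append* will be inlined.
IL generated for the pipe function: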
Main.pipe:
IL_0000: nop
IL_0001: ldc.i4.s 40
IL_0003: newobj System.Text.StringBuilder..ctor
IL_0008: ldstr "START"
IL_000D: callvirt System.Text.StringBuilder.Append
IL_0012: ldc.i4.1
IL_0013: callvirt System.Text.StringBuilder.Append
IL_0018: ldc.i4.s 0A
IL_001A: callvirt System.Text.StringBuilder.Append
IL_001F: ldarg.0
IL_0020: callvirt System.Text.StringBuilder.Append
IL_0025: ldstr "0x"
IL_002A: callvirt System.Text.StringBuilder.Append
IL_002F: ldc.i4 FF FF 00 00
IL_0034: callvirt System.Text.StringBuilder.Append
IL_0039: ldc.i4.s 0A
IL_003B: callvirt System.Text.StringBuilder.Append
IL_0040: ldstr "test"
IL_0045: callvirt System.Text.StringBuilder.Append
IL_004A: ldstr "END"
IL_004F: callvirt System.Text.StringBuilder.Append
IL_0054: ret
IL generated for the compose function:
Main.compose:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: newobj Main+compose@10..ctor
IL_0007: stloc.1
IL_0008: ldloc.1
IL_0009: newobj Main+compose@10-1..ctor
IL_000E: stloc.0
IL_000F: ldc.i4.s 40
IL_0011: newobj System.Text.StringBuilder..ctor
IL_0016: stloc.2
IL_0017: ldloc.0
IL_0018: ldloc.2
IL_0019: callvirt Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>.Invoke
IL_001E: ldstr "END"
IL_0023: callvirt System.Text.StringBuilder.Append
IL_0028: ret
compose@10.Invoke:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: ldfld Main+compose@10.a
IL_0007: ldarg.1
IL_0008: call Main.f@1
IL_000D: ldc.i4.s 0A
IL_000F: callvirt System.Text.StringBuilder.Append
IL_0014: ret
compose@10..ctor:
IL_0000: ldarg.0
IL_0001: call Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>..ctor
IL_0006: ldarg.0
IL_0007: ldarg.1
IL_0008: stfld Main+compose@10.a
IL_000D: ret
compose@10-1.Invoke:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: ldfld Main+compose@10-1.f
IL_0007: ldarg.1
IL_0008: callvirt Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>.Invoke
IL_000D: ldstr "test"
IL_0012: callvirt System.Text.StringBuilder.Append
IL_0017: ret
compose@10-1..ctor:
IL_0000: ldarg.0
IL_0001: call Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>..ctor
IL_0006: ldarg.0
IL_0007: ldarg.1
IL_0008: stfld Main+compose@10-1.f
IL_000D: ret
I tested your example with FSI and found no appreciable difference:
> #time
for i in 1 .. 500000 do compose 123 |> ignore
--> Timing now on
Real: 00:00:00.229, CPU: 00:00:00.234, GC gen0: 32, gen1: 32, gen2: 0
val it : unit = ()
> #time;;
--> Timing now off
> #time
for i in 1 .. 500000 do pipe 123 |> ignore;;;;
--> Timing now on
Real: 00:00:00.214, CPU: 00:00:00.218, GC gen0: 30, gen1: 30, gen2: 0
val it : unit = ()
Measuring with BenchmarkDotNet (the first table is a single compose/pipe run, the second table does 500000 of them), I found something similar:
Method | Platform | Jit | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
-------- |--------- |---------- |------------ |----------- |--------- |------ |------ |------------------- |
compose | X64 | RyuJit | 319.7963 ns | 5.0299 ns | 2,848.50 | - | - | 182.54 |
pipe | X64 | RyuJit | 308.5887 ns | 11.3793 ns | 2,453.82 | - | - | 155.88 |
compose | X86 | LegacyJit | 428.0141 ns | 3.6112 ns | 1,970.00 | - | - | 126.85 |
pipe | X86 | LegacyJit | 416.3469 ns | 8.0869 ns | 1,886.00 | - | - | 121.86 |
Method | Platform | Jit | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
-------- |--------- |---------- |------------ |---------- |--------- |------ |------ |------------------- |
compose | X64 | RyuJit | 160.8059 ms | 4.6699 ms | 3,514.75 | - | - | 56,224,980.75 |
pipe | X64 | RyuJit | 163.1026 ms | 4.9829 ms | 3,120.00 | - | - | 50,025,686.21 |
compose | X86 | LegacyJit | 215.8562 ms | 4.2769 ms | 2,292.00 | - | - | 36,820,936.68 |
pipe | X86 | LegacyJit | 209.9219 ms | 2.5605 ms | 2,220.00 | - | - | 35,554,575.32 |
The difference you are measuring may be related to the GC. Try forcing a GC collection before and after your timing.
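As a rough sketch of that suggestion, a small timing helper could force a full collection before the loop starts and report how many gen0 collections happened during it. The name `timeIt` and its parameters are hypothetical and not part of the question's code.

```fsharp
open System
open System.Diagnostics

/// Hypothetical helper: forces a full GC, then runs `f` `n` times
/// and reports wall-clock time plus gen0 collections during the loop.
let timeIt (label: string) (n: int) (f: unit -> 'a) : TimeSpan =
    // Start from a clean heap so earlier garbage does not skew the run.
    GC.Collect()
    GC.WaitForPendingFinalizers()
    GC.Collect()
    let gen0Before = GC.CollectionCount(0)
    let sw = Stopwatch.StartNew()
    for _ in 1 .. n do
        f () |> ignore
    sw.Stop()
    let gen0 = GC.CollectionCount(0) - gen0Before
    printfn "%s: %d ms, gen0 collections: %d" label sw.ElapsedMilliseconds gen0
    sw.Elapsed
```

With the functions from the question in scope, it could be called as `timeIt "compose" 500000 (fun () -> compose 123)` and `timeIt "pipe" 500000 (fun () -> pipe 123)`. Running each measurement more than once, in both orders, should help separate GC noise from a real difference.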
That said, looking at the source code of the pipe operator:
let inline (|>) x f = f x
and comparing it with the composition operator:
let inline (>>) f g x = g(f x)
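This makes it fairly clear that the composition operator will create lambda functions, which should result in more allocations. This can also be seen in the BenchmarkDotNet runs, and it may well be the cause of the performance difference you are seeing.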
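To make the distinction concrete, here is a minimal sketch with two cut-down helpers in the style of the question. The names `viaCompose` and `viaPipe` are made up for illustration. Both produce the same text; the difference is that the composed version first builds a function value and then applies it, which is where the extra closure objects may come from, while the piped version applies each step directly.

```fsharp
open System.Text

let inline appends (s: string) (sb: StringBuilder) = sb.Append s
let inline appendi (i: int) (sb: StringBuilder) = sb.Append i

// Composition: the chain is built as a function value first, then applied.
// Depending on the compiler version, this may allocate closure objects.
let viaCompose (a: int) =
    (appends "x" >> appendi a >> appends "y") (StringBuilder())

// Piping: every step is applied to the builder right away.
let viaPipe (a: int) =
    StringBuilder() |> appends "x" |> appendi a |> appends "y"

printfn "%s" ((viaCompose 1).ToString())
printfn "%s" ((viaPipe 1).ToString())
```

If the allocation turns out to matter in a hot path, rewriting the composed chain as a pipeline over an explicit argument is a straightforward change that keeps the output identical.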