How do I properly measure regex (re) performance?

I'm running some regex performance tests (I've heard rumors that Erlang is slow):

>Fun = fun F(X) -> case X > 1000000 of true -> ok; false -> Y = X + 1, re:run(<<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>, "^[a-zA-Z0-9_]+$"), F(Y) end end.
#Fun<erl_eval.30.128620087>
> timer:tc(Fun, [0]).                                                         
{17233982,ok}                                                                   
> timer:tc(Fun, [0]).   
{17155982,ok}

And some tests after compiling the regex first:

> {ok, MP} = re:compile("^[a-zA-Z0-9_]+$").
{ok,{re_pattern,0,0,0,                                                          
            <<69,82,67,80,107,0,0,0,16,0,0,0,1,0,0,0,255,255,255,
              255,255,255,...>>}}
> Fun = fun F(X) -> case X > 1000000 of true -> ok; false -> Y = X + 1, re:run(<<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>, MP), F(Y) end end.               
#Fun<erl_eval.30.128620087>
> timer:tc(Fun, [0]).                                                         
{15796985,ok}                                                                   
>        
> timer:tc(Fun, [0]).
{15921984,ok}

http://erlang.org/doc/man/timer.html :

Unless otherwise stated, time is always measured in milliseconds.

http://erlang.org/doc/man/re.html#compile-1 :

Compiling the regular expression before matching is useful if the same expression is to be used in matching against multiple subjects during the lifetime of the program. Compiling once and executing many times is far more efficient than compiling each time one wants to match.

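Questions

  1. Why am I getting microseconds back? (Shouldn't it be milliseconds?)
  2. Compiling the regex didn't make much difference. Why?
  3. Should I compile it?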

  1. timer:tc/2 is one of the cases where the timer documentation does state otherwise: it returns microseconds.
tc(Fun) -> {Time, Value}
tc(Fun, Arguments) -> {Time, Value}
tc(Module, Function, Arguments) -> {Time, Value}
    Types
    Module = module()
    Function = atom()
    Arguments = [term()]
    Time = integer()
      In microseconds
    Value = term()
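
As a small illustration of the unit, the sketch below wraps timer:tc/1 and converts its result to milliseconds. The module name tc_units and the function elapsed_ms/1 are made up for this example; only timer:tc/1 comes from OTP.

```erlang
-module(tc_units).
-export([elapsed_ms/1]).

%% Runs Fun (a fun of arity 0) once and returns {Milliseconds, Result}.
%% timer:tc/1 reports the elapsed time in microseconds, so divide by 1000.
elapsed_ms(Fun) when is_function(Fun, 0) ->
    {Micros, Result} = timer:tc(Fun),
    {Micros / 1000, Result}.
```

Read this way, the 17233982 in the question is roughly 17.2 seconds for the million iterations.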
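  2. In case 1, Fun has to compile the string "^[a-zA-Z0-9_]+$" on every recursion (one million times). In case 2 you compile first and then carry the compiled result into the recursion, which is why the elapsed time is lower than in case 1.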

run(Subject, RE) -> {match, Captured} | nomatch

Subject = iodata() | unicode:charlist()

RE = mp() | iodata()

The regular expression can be specified either as iodata() in which case it is automatically compiled (as by compile/2) and executed, or as a precompiled mp() in which case it is executed against the subject directly.
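
To make the two accepted forms of RE concrete, here is a short sketch. The module name re_forms and the subject <<"abc_123">> are chosen only for illustration; re:run/2 and re:compile/1 are the real OTP calls.

```erlang
-module(re_forms).
-export([demo/0]).

demo() ->
    Subject = <<"abc_123">>,
    Pattern = "^[a-zA-Z0-9_]+$",
    %% Form 1: pattern passed as iodata(); re:run/2 compiles it on this call.
    Raw = re:run(Subject, Pattern),
    %% Form 2: compile once, then pass the resulting mp() to re:run/2.
    {ok, MP} = re:compile(Pattern),
    Precompiled = re:run(Subject, MP),
    %% Both forms give the same match result (this match fails otherwise).
    Raw = Precompiled,
    Raw.
```

Calling re_forms:demo() returns {match,[{0,7}]} from either form; the only difference is when the compilation happens.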

  3. Yes. Take care to compile first and then recurse.

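Yes, you should compile your code before trying to measure performance. Code typed into the shell is interpreted, not compiled to bytecode. When I put the code in a module, I saw a big improvement: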

7> timer:tc(Fun, [0]).
{6253194,ok}
8> timer:tc(fun foo:run/1, [0]).
{1768831,ok}

(Both use the compiled regex.)

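-module(foo).

-compile(export_all).

%% Compile the regex once, then start the loop.
run(X) ->
    {ok, MP} = re:compile("^[a-zA-Z0-9_]+$"),
    run(X, MP).

%% Stop once the counter passes one million.
run(X, _MP) when X > 1000000 ->
    ok;
run(X, MP) ->
    Y = X + 1,
    re:run(<<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>, MP),
    %% Pass MP along; calling run(Y) here would recompile the regex on every iteration.
    run(Y, MP).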
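
For a like-for-like comparison inside compiled code, something along these lines could be used. It is only a sketch: the module name re_bench and its functions are hypothetical, and the absolute numbers will depend on the machine and the OTP release.

```erlang
-module(re_bench).
-export([compare/1]).

-define(SUBJECT, <<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>).
-define(PATTERN, "^[a-zA-Z0-9_]+$").

%% Returns {RawMicros, PrecompiledMicros}: the time for N matches with the
%% pattern given as a string, and the time for N matches with a compiled mp().
compare(N) when is_integer(N), N >= 0 ->
    {ok, MP} = re:compile(?PATTERN),
    {RawMicros, ok} = timer:tc(fun() -> loop(N, ?PATTERN) end),
    {MpMicros, ok} = timer:tc(fun() -> loop(N, MP) end),
    {RawMicros, MpMicros}.

%% Calls re:run/2 N times with the given RE (a string or a compiled mp()).
loop(0, _RE) ->
    ok;
loop(N, RE) ->
    _ = re:run(?SUBJECT, RE),
    loop(N - 1, RE).
```

After compiling the module with c(re_bench)., a call such as re_bench:compare(1000000). returns both timings in microseconds, so the effect of precompiling can be read off without the shell's interpretation overhead mixed in.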