为什么这个表达式解析器在(一些?)规则的数量上如此糟糕?

Why does this expression parser scale so bad in the number of (some?) rules?

我正在尝试使用 Idris 2 的 Text.Parser 库来解析预标记字节流。我以 Parsec's expression parser 的风格写了以下效用函数:

module Text.Parser.Expression

import Text.Parser

public export
data Assoc
   = AssocNone
   | AssocLeft
   | AssocRight

public export
data Op state k a
  = Prefix (Grammar state k True (a -> a))
  | Infix (Grammar state k True (a -> a -> a)) Assoc

public export
OpTable : Type -> Type -> Type -> Type
OpTable state k a = List (List (Op state k a))

public export
expressionParser :
  OpTable state k a ->
  Grammar state k True a ->
  Grammar state k True a
expressionParser table term = foldl level term table
  where
    level : Grammar state k True a -> List (Op state k a) -> Grammar state k True a
    level factor ops = choiceMap toP ops <|> factor
      where
        toP : Op state k a -> Grammar state k True a
        toP (Infix op AssocNone) = do
          x <- factor
          f <- op
          y <- factor
          pure $ f x y
        toP (Infix op AssocLeft) = do
          x <- factor
          fs <- some (flip <$> op <*> factor)
          pure $ foldl (flip ($)) x fs
        toP (Infix op AssocRight) = do
          fs <- some (factor >>= \x => op <*> pure x)
          y <- factor
          pure $ foldr ($) y fs
        toP (Prefix op) = op <*> factor

对于某些输入,这似乎与运算符定义的数量成正比。这是一个缩减示例:

public export
Number : Type
Number = Double

public export
data Fun
  = IntFun
  | Rnd

public export
data BinOp
   = Eq
   | NEq
   | LT
   | LE
   | GT
   | GE
   | Plus
   | Minus
   | Mul
   | And
   | Or

public export
data Expr
  = NumLitE Number
  | Bin BinOp Expr Expr
  | FunE Fun (List1 Expr)

public export
Show Fun where
  show IntFun = "INT"
  show Rnd = "RND"

public export
Show BinOp where
  show Eq = "="
  show NEq = "<>"
  show LT = "<"
  show LE = "<="
  show GT = ">"
  show GE = ">="
  show Plus = "+"
  show Minus = "-"
  show Mul = "*"
  show And = "AND"
  show Or = "OR"

public export
Show Expr where
  show (NumLitE n) = show n
  show (Bin op x y) = unwords [show x, show op, show y]
  show (FunE f args) = show f ++ "(" ++ show args ++ ")"

bits8 : Bits8 -> Grammar state Bits8 True ()
bits8 x = terminal ("Byte " ++ show x) $ \x' => if x == x' then Just () else Nothing

lexeme : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 c a
lexeme p = afterMany (bits8 0x20) p

comma : Grammar state Bits8 True ()
comma = lexeme $ bits8 0x2c

parens : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 True a
parens = between (lexeme $ bits8 0x28) (lexeme $ bits8 0x29)

digit : Grammar state Bits8 True Bits8
digit = terminal "digit" $ \x =>
  toMaybe (0x30 <= x && x <= 0x39) x

digitLit : (Num a) => Grammar state Bits8 True a
digitLit = fromInteger . cast . (\x => x - 0x30) <$> digit

numLit : (Num a, Neg a) => Grammar state Bits8 True a
numLit {a} = fromDigits <$> lexeme sign <*> lexeme (some digitLit)
  where
    fromDigits : Bool -> List1 a -> a
    fromDigits neg =
      (if neg then negate else id) .
      foldl (\x => \y => x * 10 + y) (the a 0)

    sign : Grammar state Bits8 False Bool
    sign = option False $ True <$ bits8 0xab

expr : Grammar state Bits8 True Expr
expr = expressionParser table term <|> fail "expression"
  where
    table : List (List (Op state Bits8 Expr))
    table =
      [ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
        ]
      , [ Infix (lexeme $ Bin Plus  <$ bits8 0xaa) AssocLeft
        , Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
        ]
      , -- This next group is the one I will keep shrinking
        [ Infix (lexeme $ Bin Eq  <$ bits8 0xb2) AssocNone
        , Infix (lexeme $ Bin NEq <$ (bits8 0xb3 *> bits8 0xb1)) AssocNone
        , Infix (lexeme $ Bin GE  <$ (bits8 0xb1 *> bits8 0xb2)) AssocNone
        , Infix (lexeme $ Bin GT  <$ bits8 0xb1) AssocNone
        , Infix (lexeme $ Bin LE  <$ (bits8 0xb3 *> bits8 0xb2)) AssocNone
        , Infix (lexeme $ Bin LT  <$ bits8 0xb3) AssocNone
        ]
      , [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
        , Infix (lexeme $ Bin Or  <$ bits8 0xb0) AssocLeft
        ]
      ]

    fun : Grammar state Bits8 True Fun
    fun = lexeme $ choice
      [ IntFun  <$ bits8 0xb5
      , Rnd     <$ bits8 0xbb
      ]

    term : Grammar state Bits8 True Expr
    term =
          NumLitE <$> numLit
      <|> FunE <$> fun <*> parens (sepBy1 comma expr)
      <|> parens expr

为了测量,我尝试解析 [181,40,40,187,40,49,41,172,51,41,170,49,41],同时删除 EqNEq、...、Lt 的解析规则。下面是解析上述字节列表的用户时间,以及该解析规则组中未注释掉的规则数:

n usr (seconds)
1 0.41
2 1.56
3 4.67
4 13.92
5 25.71
6 45.92

这是怎么回事?

不是答案,但一个可行的解决方法是将所有比较运算符合并到一个解析器中:

  cmp : Grammar state Bits8 True BinOp
  cmp = choice
    [ Eq  <$ bits8 0xb2
    , NEq <$ bits8 0xb3 <* bits8 0xb1
    , GE  <$ bits8 0xb1 <* bits8 0xb2
    , GT  <$ bits8 0xb1
    , LE  <$ bits8 0xb3 <* bits8 0xb2
    , LT  <$ bits8 0xb3
    ]

  table : List (List (Op state Bits8 Expr))
  table =
    [ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
      ]
    , [ Infix (lexeme $ Bin Plus <$ bits8 0xaa) AssocLeft
      , Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
      ]
    , [ Infix (Bin <$> lexeme cmp) AssocNone ]
    , [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
      , Infix (lexeme $ Bin Or <$ bits8 0xb0) AssocLeft
      ]
    ]

这表明了一个更有原则的改进,即在 expressionParser 本身中收集每个优先级别的所有非关联运算符并进行此转换。

我通过 copying more of Parsec's design 解决了这个问题。从那个 link 可以看出,重要的思想是只解析一次前导项,然后解析后面的一系列关联运算符和操作数。这避免了重复重新解析更高优先级的术语,这是导致问题中代码变慢的原因。