为什么这个表达式解析器在(一些?)规则的数量上如此糟糕?
Why does this expression parser scale so bad in the number of (some?) rules?
我正在尝试使用 Idris 2 的 Text.Parser
库来解析预标记字节流。我以 Parsec's expression parser 的风格写了以下效用函数:
module Text.Parser.Expression
import Text.Parser
public export
data Assoc
= AssocNone
| AssocLeft
| AssocRight
public export
data Op state k a
= Prefix (Grammar state k True (a -> a))
| Infix (Grammar state k True (a -> a -> a)) Assoc
public export
OpTable : Type -> Type -> Type -> Type
OpTable state k a = List (List (Op state k a))
public export
expressionParser :
OpTable state k a ->
Grammar state k True a ->
Grammar state k True a
expressionParser table term = foldl level term table
where
level : Grammar state k True a -> List (Op state k a) -> Grammar state k True a
level factor ops = choiceMap toP ops <|> factor
where
toP : Op state k a -> Grammar state k True a
toP (Infix op AssocNone) = do
x <- factor
f <- op
y <- factor
pure $ f x y
toP (Infix op AssocLeft) = do
x <- factor
fs <- some (flip <$> op <*> factor)
pure $ foldl (flip ($)) x fs
toP (Infix op AssocRight) = do
fs <- some (factor >>= \x => op <*> pure x)
y <- factor
pure $ foldr ($) y fs
toP (Prefix op) = op <*> factor
对于某些输入,这似乎与运算符定义的数量成正比。这是一个缩减示例:
public export
Number : Type
Number = Double
public export
data Fun
= IntFun
| Rnd
public export
data BinOp
= Eq
| NEq
| LT
| LE
| GT
| GE
| Plus
| Minus
| Mul
| And
| Or
public export
data Expr
= NumLitE Number
| Bin BinOp Expr Expr
| FunE Fun (List1 Expr)
public export
Show Fun where
show IntFun = "INT"
show Rnd = "RND"
public export
Show BinOp where
show Eq = "="
show NEq = "<>"
show LT = "<"
show LE = "<="
show GT = ">"
show GE = ">="
show Plus = "+"
show Minus = "-"
show Mul = "*"
show And = "AND"
show Or = "OR"
public export
Show Expr where
show (NumLitE n) = show n
show (Bin op x y) = unwords [show x, show op, show y]
show (FunE f args) = show f ++ "(" ++ show args ++ ")"
bits8 : Bits8 -> Grammar state Bits8 True ()
bits8 x = terminal ("Byte " ++ show x) $ \x' => if x == x' then Just () else Nothing
lexeme : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 c a
lexeme p = afterMany (bits8 0x20) p
comma : Grammar state Bits8 True ()
comma = lexeme $ bits8 0x2c
parens : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 True a
parens = between (lexeme $ bits8 0x28) (lexeme $ bits8 0x29)
digit : Grammar state Bits8 True Bits8
digit = terminal "digit" $ \x =>
toMaybe (0x30 <= x && x <= 0x39) x
digitLit : (Num a) => Grammar state Bits8 True a
digitLit = fromInteger . cast . (\x => x - 0x30) <$> digit
numLit : (Num a, Neg a) => Grammar state Bits8 True a
numLit {a} = fromDigits <$> lexeme sign <*> lexeme (some digitLit)
where
fromDigits : Bool -> List1 a -> a
fromDigits neg =
(if neg then negate else id) .
foldl (\x => \y => x * 10 + y) (the a 0)
sign : Grammar state Bits8 False Bool
sign = option False $ True <$ bits8 0xab
expr : Grammar state Bits8 True Expr
expr = expressionParser table term <|> fail "expression"
where
table : List (List (Op state Bits8 Expr))
table =
[ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
]
, [ Infix (lexeme $ Bin Plus <$ bits8 0xaa) AssocLeft
, Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
]
, -- This next group is the one I will keep shrinking
[ Infix (lexeme $ Bin Eq <$ bits8 0xb2) AssocNone
, Infix (lexeme $ Bin NEq <$ (bits8 0xb3 *> bits8 0xb1)) AssocNone
, Infix (lexeme $ Bin GE <$ (bits8 0xb1 *> bits8 0xb2)) AssocNone
, Infix (lexeme $ Bin GT <$ bits8 0xb1) AssocNone
, Infix (lexeme $ Bin LE <$ (bits8 0xb3 *> bits8 0xb2)) AssocNone
, Infix (lexeme $ Bin LT <$ bits8 0xb3) AssocNone
]
, [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
, Infix (lexeme $ Bin Or <$ bits8 0xb0) AssocLeft
]
]
fun : Grammar state Bits8 True Fun
fun = lexeme $ choice
[ IntFun <$ bits8 0xb5
, Rnd <$ bits8 0xbb
]
term : Grammar state Bits8 True Expr
term =
NumLitE <$> numLit
<|> FunE <$> fun <*> parens (sepBy1 comma expr)
<|> parens expr
为了测量,我尝试解析 [181,40,40,187,40,49,41,172,51,41,170,49,41]
,同时删除 Eq
、NEq
、...、Lt
的解析规则。下面是解析上述字节列表的用户时间,以及该解析规则组中未注释掉的规则数:
n
usr (seconds)
1
0.41
2
1.56
3
4.67
4
13.92
5
25.71
6
45.92
这是怎么回事?
不是答案,但一个可行的解决方法是将所有比较运算符合并到一个解析器中:
cmp : Grammar state Bits8 True BinOp
cmp = choice
[ Eq <$ bits8 0xb2
, NEq <$ bits8 0xb3 <* bits8 0xb1
, GE <$ bits8 0xb1 <* bits8 0xb2
, GT <$ bits8 0xb1
, LE <$ bits8 0xb3 <* bits8 0xb2
, LT <$ bits8 0xb3
]
table : List (List (Op state Bits8 Expr))
table =
[ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
]
, [ Infix (lexeme $ Bin Plus <$ bits8 0xaa) AssocLeft
, Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
]
, [ Infix (Bin <$> lexeme cmp) AssocNone ]
, [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
, Infix (lexeme $ Bin Or <$ bits8 0xb0) AssocLeft
]
]
这表明了一个更有原则的改进,即在 expressionParser
本身中收集每个优先级别的所有非关联运算符并进行此转换。
我通过 copying more of Parsec's design 解决了这个问题。从那个 link 可以看出,重要的思想是只解析一次前导项,然后解析后面的一系列关联运算符和操作数。这避免了重复重新解析更高优先级的术语,这是导致问题中代码变慢的原因。
我正在尝试使用 Idris 2 的 Text.Parser
库来解析预标记字节流。我以 Parsec's expression parser 的风格写了以下效用函数:
module Text.Parser.Expression
import Text.Parser
public export
data Assoc
= AssocNone
| AssocLeft
| AssocRight
public export
data Op state k a
= Prefix (Grammar state k True (a -> a))
| Infix (Grammar state k True (a -> a -> a)) Assoc
public export
OpTable : Type -> Type -> Type -> Type
OpTable state k a = List (List (Op state k a))
public export
expressionParser :
OpTable state k a ->
Grammar state k True a ->
Grammar state k True a
expressionParser table term = foldl level term table
where
level : Grammar state k True a -> List (Op state k a) -> Grammar state k True a
level factor ops = choiceMap toP ops <|> factor
where
toP : Op state k a -> Grammar state k True a
toP (Infix op AssocNone) = do
x <- factor
f <- op
y <- factor
pure $ f x y
toP (Infix op AssocLeft) = do
x <- factor
fs <- some (flip <$> op <*> factor)
pure $ foldl (flip ($)) x fs
toP (Infix op AssocRight) = do
fs <- some (factor >>= \x => op <*> pure x)
y <- factor
pure $ foldr ($) y fs
toP (Prefix op) = op <*> factor
对于某些输入,这似乎与运算符定义的数量成正比。这是一个缩减示例:
public export
Number : Type
Number = Double
public export
data Fun
= IntFun
| Rnd
public export
data BinOp
= Eq
| NEq
| LT
| LE
| GT
| GE
| Plus
| Minus
| Mul
| And
| Or
public export
data Expr
= NumLitE Number
| Bin BinOp Expr Expr
| FunE Fun (List1 Expr)
public export
Show Fun where
show IntFun = "INT"
show Rnd = "RND"
public export
Show BinOp where
show Eq = "="
show NEq = "<>"
show LT = "<"
show LE = "<="
show GT = ">"
show GE = ">="
show Plus = "+"
show Minus = "-"
show Mul = "*"
show And = "AND"
show Or = "OR"
public export
Show Expr where
show (NumLitE n) = show n
show (Bin op x y) = unwords [show x, show op, show y]
show (FunE f args) = show f ++ "(" ++ show args ++ ")"
bits8 : Bits8 -> Grammar state Bits8 True ()
bits8 x = terminal ("Byte " ++ show x) $ \x' => if x == x' then Just () else Nothing
lexeme : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 c a
lexeme p = afterMany (bits8 0x20) p
comma : Grammar state Bits8 True ()
comma = lexeme $ bits8 0x2c
parens : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 True a
parens = between (lexeme $ bits8 0x28) (lexeme $ bits8 0x29)
digit : Grammar state Bits8 True Bits8
digit = terminal "digit" $ \x =>
toMaybe (0x30 <= x && x <= 0x39) x
digitLit : (Num a) => Grammar state Bits8 True a
digitLit = fromInteger . cast . (\x => x - 0x30) <$> digit
numLit : (Num a, Neg a) => Grammar state Bits8 True a
numLit {a} = fromDigits <$> lexeme sign <*> lexeme (some digitLit)
where
fromDigits : Bool -> List1 a -> a
fromDigits neg =
(if neg then negate else id) .
foldl (\x => \y => x * 10 + y) (the a 0)
sign : Grammar state Bits8 False Bool
sign = option False $ True <$ bits8 0xab
expr : Grammar state Bits8 True Expr
expr = expressionParser table term <|> fail "expression"
where
table : List (List (Op state Bits8 Expr))
table =
[ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
]
, [ Infix (lexeme $ Bin Plus <$ bits8 0xaa) AssocLeft
, Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
]
, -- This next group is the one I will keep shrinking
[ Infix (lexeme $ Bin Eq <$ bits8 0xb2) AssocNone
, Infix (lexeme $ Bin NEq <$ (bits8 0xb3 *> bits8 0xb1)) AssocNone
, Infix (lexeme $ Bin GE <$ (bits8 0xb1 *> bits8 0xb2)) AssocNone
, Infix (lexeme $ Bin GT <$ bits8 0xb1) AssocNone
, Infix (lexeme $ Bin LE <$ (bits8 0xb3 *> bits8 0xb2)) AssocNone
, Infix (lexeme $ Bin LT <$ bits8 0xb3) AssocNone
]
, [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
, Infix (lexeme $ Bin Or <$ bits8 0xb0) AssocLeft
]
]
fun : Grammar state Bits8 True Fun
fun = lexeme $ choice
[ IntFun <$ bits8 0xb5
, Rnd <$ bits8 0xbb
]
term : Grammar state Bits8 True Expr
term =
NumLitE <$> numLit
<|> FunE <$> fun <*> parens (sepBy1 comma expr)
<|> parens expr
为了测量,我尝试解析 [181,40,40,187,40,49,41,172,51,41,170,49,41]
,同时删除 Eq
、NEq
、...、Lt
的解析规则。下面是解析上述字节列表的用户时间,以及该解析规则组中未注释掉的规则数:
n | usr (seconds) |
---|---|
1 | 0.41 |
2 | 1.56 |
3 | 4.67 |
4 | 13.92 |
5 | 25.71 |
6 | 45.92 |
这是怎么回事?
不是答案,但一个可行的解决方法是将所有比较运算符合并到一个解析器中:
cmp : Grammar state Bits8 True BinOp
cmp = choice
[ Eq <$ bits8 0xb2
, NEq <$ bits8 0xb3 <* bits8 0xb1
, GE <$ bits8 0xb1 <* bits8 0xb2
, GT <$ bits8 0xb1
, LE <$ bits8 0xb3 <* bits8 0xb2
, LT <$ bits8 0xb3
]
table : List (List (Op state Bits8 Expr))
table =
[ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
]
, [ Infix (lexeme $ Bin Plus <$ bits8 0xaa) AssocLeft
, Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
]
, [ Infix (Bin <$> lexeme cmp) AssocNone ]
, [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
, Infix (lexeme $ Bin Or <$ bits8 0xb0) AssocLeft
]
]
这表明了一个更有原则的改进,即在 expressionParser
本身中收集每个优先级别的所有非关联运算符并进行此转换。
我通过 copying more of Parsec's design 解决了这个问题。从那个 link 可以看出,重要的思想是只解析一次前导项,然后解析后面的一系列关联运算符和操作数。这避免了重复重新解析更高优先级的术语,这是导致问题中代码变慢的原因。