John Hughes 的 Deterministic LL(1) parsing with Arrow and errors

John Hughes' Deterministic LL(1) parsing with Arrow and errors

我想根据 John Hughes 的论文 Generalizing Monads to Arrows. When reading through and trying to reimplement his code I realized there were some things that didn't quite make sense. In one section he lays out a parser implementation based on Swierstra and Duponchel's paper Deterministic, error-correcting combinator parsers 使用 Arrows 编写一个解析器。他描述的解析器类型如下所示:

data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP (a, [ch]) -> (b, [ch])
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)

组合运算符看起来像这样:

(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
  P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
  P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
    (DP $ f2 . f1)

问题是解析器的组合 q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes' 期望我们所有的 DynamicParser 是完整的,这样一个符号解析器的类型签名将是 symbol :: ch -> Parser ch a (Maybe ch) 而不是 symbol :: ch -> Parser ch a ch。这看起来仍然很尴尬,因为我们必须复制信息,将起始符号信息放在 StaticParserDynamicParser 中。另一个问题是几乎所有解析器都有可能抛出异常,这意味着我们将不得不在 MaybeEither 中花费大量时间来创建本质上是“monads do not compose problem”的问题。这可以通过重写 DynamicParser 本身来处理失败或作为 Arrow transformer, 来补救,但这与论文有很大的偏差。 None 这些问题在本文中得到了解决,而且解析器的呈现就好像它显然可以工作一样,所以我觉得我必须遗漏一些基本的东西。如果有人能抓住我错过的东西,那将非常有帮助。

我认为 Swierstra 和 Duponcheel 描述的确定性解析器与传统解析器有点不同:它们根本不处理失败,只处理选择。

另见S&D论文中的invokeDet函数:

invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a

此函数明确假定它始终能够找到有效的解析。

使用 Hughes 描述的箭头版本的解析器,您可以编写如下示例:

main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ab"
  print $ invokeDet p "ac"

这将打印预期的:

'b'
'c'

但是,如果您编写了一个“失败的”解析:

main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ad"

它仍然会打印:

'c'

为了使这种行为更加明智,Swierstra 和 Duponcheel 还引入了纠错功能。如果我们假设错误字符 d 已在输入中更正为 c,则输出 'c' 是预期的。这需要一个额外的机制,该机制可能太复杂而无法包含在 Hughes 的论文中。

我已经上传了我用来获得这些结果的实现:https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35

有关相同样式的更实用的解析器的更多信息(但不再是确定性的并且不再限于 LL(1)),我真的很喜欢 Swierstra 的 "Combinator Parsing: A Short Tutorial"。 9.3 节的有趣摘录:

A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.

Swierstra 的完整解析器实现可以在 uu-parsinglib 包中找到,尽管我不知道那里实现了多少扩展。