此解析器中的 <|> 顺序真的很重要吗?

Does order really matter with <|> here in this parser?

星期一早上 Haskell post Parsing Part 2: Applicative Parsingregex-applicative:

的交替

Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

参考this function from their Git repository:

numberParser :: RE Char Value
numberParser = (ValueNumber . read) <$>
  (negativeParser <|> decimalParser <|> integerParser)
  where
    integerParser = some (psym isNumber)
    decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber)
    negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser)

    combineDecimal :: String -> Char -> String -> String
    combineDecimal base point decimal = base ++ (point : decimal)

但是,我不明白为什么会这样。当我将 decimalParser <|> integerParser 更改为 integerParser <|> decimalParser 时,它似乎总是解析正确的东西(特别是,我这样做了 运行 stack test,他们的测试仍然通过).小数解析器必须有小数点,整数解析器不能有,所以它会在那里停止解析,导致小数点导致下一段解析失败,回溯到小数解析器。似乎唯一不会发生这种情况的情况是,如果整个解析器的下一部分可以接受小数点(使其成为歧义语法),但您仍然不会 "lose all the information after the decimal, or worse, have a parse failure"。我的推理是否正确,这是那篇文章中的一个错误,还是我没有看到他们的结果之一可能发生的情况?

解析如下内容时出现问题:

12.34

如果你先尝试整数解析器然后它会找到 12,断定它找到了一个整数,然后尝试将 .34 解析为下一个标记。

[...] the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser.

可以,只要您的解析器正在执行回溯。出于性能原因,大多数生产解析器(例如 megaparsec)不会这样做,除非他们被特别告知(通常使用 try 组合器)。博客post中使用的RE解析器很可能是个例外,但我找不到它的解释器来检查。

There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.

When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed).

The tests pass because the tests don't cover this part of the parser (the closest ones only exercise stringParser).

Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in test/Spec.hs and add it to the do block under main):

badex :: Spec
badex = describe "Bad example" $ do
  it "Should fail" $
    shouldMatch
      exampleLineParser
      "| 3.4 |\n"
      [ ValueNumber 3.4 ]

If you swap the parsers, you get as a result ValueNumber 3.0: the integerParser (which is now first) succeeds parsing 3, but then the rest of the input gets discarded.

To give more context, we have to see where numberParser is used:

  1. numberParser is one of the alternatives of valueParser...
  2. which is used in exampleLineParser, where valueParser is followed by readThroughBar (and I mean the relevant piece of code is literally valueParser <* readThroughBar);
  3. readThroughBar discards all characters until the next vertical bar (using many (psym (\c -> c /= '|' && c /= '\n'))).

So if valueParser succeeds parsing just 3, then the subsequent readThroughBar will happily consume and discard the rest .4 |.

The explanation from the blogpost you quote is only partially correct:

Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

(emphasis mine) You will only lose information if your parser actively discards it, which readThroughBar does here.

As you already suggested, the backtracking behavior of RE means that the noncommutativity of <|> really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general), which would not be a problem here if readThroughBar were less lenient, e.g., by consuming only whitespace before |.

I think that shows that using psym with (/=) is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.