Parse identifiers that don't end with certain characters in attoparsec
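I've been writing an attoparsec parser for what the Unified Code for Units of Measure calls an <ATOM-SYMBOL>. It is defined as the longest sequence of characters from a certain class (the class includes all the digits 0-9) that does not end with a digit.

So given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, and for 19 I want to fail without consuming anything.

I can't figure out how to build this out of takeWhile1 or takeTill or scan, but I'm probably missing something obvious.

Update: My best attempt so far is the following, which manages to exclude sequences that consist entirely of digits:
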
atomSymbol :: Parser Text
atomSymbol = do
    r <- core
    if (P.all (inClass "0-9") . T.unpack $ r)
      then fail "Expected an atom symbol but all characters were digits."
      else return r
  where
    -- The backslash is a literal member of the class, so it is escaped.
    core = A.takeWhile1 $ inClass "!#-'*,0-<>-Z\\^-z|~"

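I tried changing it to test whether the last character is a digit instead of whether all characters are digits, but it doesn't seem to backtrack one character at a time.

Update 2: The whole file is at https://github.com/dmcclean/dimensional-attoparsec/blob/master/src/Numeric/Units/Dimensional/Parsing/Attoparsec.hs. It only builds against the prefixes branch from https://github.com/dmcclean/dimensional.
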
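You should rephrase the problem and handle spans of digits (0-9) and spans of non-digit characters (!#-'*,:-<>-Z\^-z|~) separately. The syntactic element of interest can then be described as
- an optional digit span, followed by
- a non-digit span, followed by
- zero or more {digit span followed by non-digit span}.
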
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Applicative ((<|>), many)
import Data.Char (isDigit)
import Data.Attoparsec.Combinator (option)
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T

atomSymbol :: Parser Text
atomSymbol = f <$> (option "" digitSpan)                    -- optional leading digits
               <*> (nonDigitSpan <|> fail errorMsg)         -- mandatory non-digit span
               <*> many (g <$> digitSpan <*> nonDigitSpan)  -- digits only if non-digits follow
  where
    -- The backslash is a literal member of the class, so it is escaped.
    nonDigitSpan = A.takeWhile1 $ A.inClass "!#-'*,:-<>-Z\\^-z|~"
    digitSpan    = A.takeWhile1 isDigit
    f x y xss    = T.concat $ x : y : concat xss
    g x y        = [x,y]
    errorMsg     = "Expected an atom symbol but all characters (if any) were digits."

Test

[...] given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.

λ> A.parseOnly atomSymbol "foo26"
Right "foo"
λ> A.parseOnly atomSymbol "237bar26"
Right "237bar"
λ> A.parseOnly atomSymbol "19"
Left "Failed reading: Expected an atom symbol but all characters (if any) were digits."
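
For comparison, here is a rough sketch of a different route to the same behaviour: peek at the longest run of class characters with lookAhead, trim the trailing digits from that run, and then consume only as many characters as remain. This is not the approach from the answer above. The name atomSymbolTrim is made up for illustration, and the sketch assumes an attoparsec version that provides lookAhead in Data.Attoparsec.Combinator.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Combinator (lookAhead)
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import Data.Char (isDigit)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical alternative to atomSymbol (the name is an assumption).
atomSymbolTrim :: Parser Text
atomSymbolTrim = do
    -- Peek at the longest run of class characters without consuming it.
    run <- lookAhead (A.takeWhile1 (A.inClass "!#-'*,0-<>-Z\\^-z|~"))
    -- Drop trailing digits so the result cannot end in a digit.
    let trimmed = T.dropWhileEnd isDigit run
    if T.null trimmed
      then fail "Expected an atom symbol but all characters were digits."
      -- Consume exactly the trimmed prefix; trailing digits stay in the input.
      else A.take (T.length trimmed)
```

Because only the trimmed prefix is consumed, something like A.parseOnly atomSymbolTrim "237bar26" should yield the same 237bar as the span-based parser, at the cost of scanning the run twice.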