How to parse multiple lines with start and end tokens in Parsec
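I am new to Parsec and would appreciate pointers on this problem. Say I have a CSV file with a fixed number of headers. Rather than parsing each line individually, I want to look for a token at the beginning of a line and collect all lines up to the next line that has a non-empty token. For example: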
token,flag,values
a,1,
,,a
,,f
b,2,
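The rule for valid input is: if the token field is filled in, take all lines up to the next non-empty token field. So I would like Parsec to take the following lines as the first input (these lines can then be parsed by another rule):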
a,1,
,,a
,,f
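The process then starts again at the next line with a non-empty token field (the last line in this example). What I am trying to figure out is whether there is a simple way to specify such a rule in Parsec: take all lines that satisfy a particular rule, so that they can then be handed off to another parser. Basically, it looks like some kind of lookahead rule that specifies what counts as valid multi-line input. Am I on the right track?
We can ignore the comma separators above for now and simply say that an input starts when a character is found at the beginning of a line, and ends when a character is next found at the beginning of a line.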
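I solved this with help from @user2407038, who laid out the basic outline in the comments. The solution and explanation follow (see the comments after each function; they show how the function handles input):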
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad
import Text.Parsec
import Control.Applicative hiding ((<|>), many)
-- | This one accepts one or more characters up to the newline, and discards the newline.
-- | It is used as a building block in the functions below.
restOfLine :: Stream s m Char => ParsecT s u m [Char]
restOfLine = many1 (satisfy (\x -> x /= '\n')) <* char '\n'
-- | a line with token is "many alphanumeric characters" followed by
-- | any characters until newline
tokenLine :: Stream s m Char => ParsecT s u m [Char]
tokenLine = (++) <$> many1 alphaNum <*> restOfLine
-- | ghci test:
-- | *Main Text.Parsec> parseTest tokenLine "a,1,,\n"
-- | "a,1,,"
-- | *Main Text.Parsec> parseTest tokenLine ",1,,\n"
-- | parse error at (line 1, column 1):
-- | unexpected ","
-- | expecting letter or digit
-- | a non-token line is a line that has any number of spaces followed
-- | by ",", then any characters until newline
nonTokenLine :: Stream s m Char => ParsecT s u m [Char]
nonTokenLine = (++) <$> (many space) <*> ((:) <$> char ',' <*> restOfLine)
-- | ghci test:
-- | *Main Text.Parsec> parseTest nonTokenLine ",1,,\n"
-- | ",1,,"
-- | *Main Text.Parsec> parseTest nonTokenLine "a,1,,\n"
-- | parse error at (line 1, column 1):
-- | unexpected "a"
-- | expecting space or ","
-- | One entry is tokenLine followed by any number of nonTokenLine
oneEntry :: Stream s m Char => ParsecT s u m [[Char]]
oneEntry = (:) <$> tokenLine <*> (many nonTokenLine)
-- | ghci test - please note that the last line ("b,2,,") is left unparsed, as expected
-- | *Main Text.Parsec> parseTest oneEntry "a,1,,\n,,a\n,,f\nb,2,,\n"
-- | ["a,1,,",",,a",",,f"]
-- | We add 'many' to oneEntry to parse the entire file, and get multiple match entries
multiEntries :: Stream s m Char => ParsecT s u m [[String]]
multiEntries = many oneEntry
-- | ghci test - please note that it gets two entries as expected
-- | *Main Text.Parsec> parseTest multiEntries "a,1,,\n,,a\n,,f\nb,2,,\n"
-- | [["a,1,,",",,a",",,f"],["b,2,,"]]
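On the lookahead part of the question: the code above needs no explicit lookahead, because token lines and continuation lines differ in their first character, so `many nonTokenLine` stops without consuming input when it reaches the next token line. If you would rather make that decision explicit with `lookAhead`, and split each line into fields at the same time, one possible sketch follows. The `Entry` record, its field names, the helper names, and the choice to keep only the last field of each continuation line are all assumptions made for illustration; they are not part of the original solution.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical record for one grouped entry (names are assumptions).
data Entry = Entry
  { entryToken  :: String
  , entryFlag   :: String
  , entryValues :: [String]
  } deriving (Show, Eq)

-- A single field: any run of characters other than ',' and newline (may be empty).
field :: Parser String
field = many (noneOf ",\n")

-- One row split into its fields; the trailing newline is consumed.
row :: Parser [String]
row = (field `sepBy1` char ',') <* newline

-- A token row starts with an alphanumeric character; lookAhead peeks at it
-- without consuming it, so 'row' still sees the whole line.
tokenRow :: Parser [String]
tokenRow = lookAhead alphaNum *> row

-- A continuation row starts with ',' (its token field is empty).
contRow :: Parser [String]
contRow = lookAhead (char ',') *> row

-- One entry: a token row followed by any number of continuation rows.
entry :: Parser Entry
entry = do
  headerFields <- tokenRow
  rest <- many contRow
  case headerFields of
    (tok : flag : _) -> pure (Entry tok flag (map last rest))
    _ -> fail "a token row needs at least a token and a flag field"

-- All entries, requiring that the whole input is consumed.
entries :: Parser [Entry]
entries = many entry <* eof
```

In ghci, `parseTest entries "a,1,,\n,,a\n,,f\nb,2,,\n"` should yield two entries, the first carrying the values from the two continuation rows and the second carrying none.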
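The parse errors shown in the ghci test comments are the expected results for invalid input, and they are easy to handle. The code above is just a set of basic building blocks to get started.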
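As a rough sketch of what handling those errors could look like for a whole file: skip the header row, parse every entry, and require `eof` so that a malformed line is reported as a `Left` value instead of silently ending the result list. The names `csvFile` and `parseCsv` are hypothetical, and the building blocks are repeated here in a simplified form specialised to `String` input (using `noneOf "\n"` in place of `satisfy`) so that the snippet stands on its own.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Simplified copies of the building blocks, specialised to String input.
restOfLine :: Parser String
restOfLine = many1 (noneOf "\n") <* char '\n'

tokenLine :: Parser String
tokenLine = (++) <$> many1 alphaNum <*> restOfLine

nonTokenLine :: Parser String
nonTokenLine = (++) <$> many space <*> ((:) <$> char ',' <*> restOfLine)

oneEntry :: Parser [String]
oneEntry = (:) <$> tokenLine <*> many nonTokenLine

-- Whole file (hypothetical name): drop the header row, collect all entries,
-- then insist on end of input so leftover lines become a parse error.
csvFile :: Parser [[String]]
csvFile = restOfLine *> many oneEntry <* eof

-- Pure entry point (hypothetical name): the caller decides what to do with an error.
parseCsv :: String -> Either ParseError [[String]]
parseCsv = parse csvFile "(input)"
```

With the sample file from the question, `parseCsv "token,flag,values\na,1,\n,,a\n,,f\nb,2,\n"` should give `Right [["a,1,",",,a",",,f"],["b,2,"]]`, while a line that is neither a token line nor a continuation line produces a `Left` carrying the position of the offending line. Without the `eof`, `many oneEntry` would simply stop at the first line it cannot parse and return what it had collected so far.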