Parse array of numbers between empty lines

Question

I'm trying to write a parser that scans arrays of numbers separated by empty lines in a text file.

1   235 623 684
2   871 699 557
3   918 686 49
4   53  564 906


1   154
2   321
3   519

1   235 623 684
2   871 699 557
3   918 686 49

Here is the full text file.

I wrote the following parser with parsec:

import Text.ParserCombinators.Parsec

emptyLine = do
  spaces
  newline

emptyLines = many1 emptyLine

data1 = do
  dat <- many1 digit
  return (dat)

datan = do
  many1 (oneOf " \t")
  dat <- many1 digit
  return (dat)

dataline  = do
  dat1 <- data1
  dat2 <- many datan
  many (oneOf " \t")
  newline
  return (dat1:dat2)

parseSeries = do 
    dat <- many1 dataline  
    return dat

parseParag =  try parseSeries

parseListing = do 
    --cont <- parseSeries `sepBy` emptyLines
    cont <- between emptyLines emptyLines parseSeries
    eof
    return cont

main = do
    fichier <- readFile ("test_listtst.txt")
    case parse parseListing "(test)" fichier of
            Left error -> do putStrLn "!!! Error !!!"
                             print error
            Right serie -> do  
                                mapM_ print serie

But it fails with the following error:

!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line

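I don't understand why.

Do you have any idea of what's wrong with my parser?

Do you have an example on how to parse a structured bunch of data separated by empty lines?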

Answer 1

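I don't know the exact problem, but my experience with parsing "line oriented" files using parsec is: don't use parsec (or at least not this way).

I mean, the problem is that you want to somehow strip the whitespace (spaces and newlines) between numbers on the same line, yet still be aware of it when needed. Doing both in one step, which is what you are trying to do, is really hard. Adding lookahead is probably feasible, but it gets really messy (and to be honest, I never managed to make it work).

The easiest approach is to parse the lines in a first step (which lets you detect empty lines) and then parse each line separately.

For that you don't need parsec at all; lines and words are enough. However, by doing so you lose the ability to backtrack.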
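The two-step idea above can be sketched with base functions only. This is a minimal sketch, not the answerer's code: the names `splitBlocks`, `parseLine` and `parseBlocks` are made up for illustration, and it assumes every field is a plain integer that fits in an `Int`.

```haskell
import Data.Char (isSpace)
import Text.Read (readMaybe)

-- Step 1: group lines into blocks separated by one or more blank lines.
splitBlocks :: [String] -> [[String]]
splitBlocks ls =
  case break isBlank (dropWhile isBlank ls) of
    ([], _)       -> []
    (block, rest) -> block : splitBlocks rest
  where
    isBlank = all isSpace

-- Step 2: parse one line; Nothing if any field is not an integer.
parseLine :: String -> Maybe [Int]
parseLine = traverse readMaybe . words

-- Both steps together; Nothing if any line in any block is malformed.
parseBlocks :: String -> Maybe [[[Int]]]
parseBlocks = traverse (traverse parseLine) . splitBlocks . lines
```

For example, `parseBlocks "1 2\n\n3 4\n"` evaluates to `Just [[[1,2]],[[3,4]]]`. The price is the one mentioned above: there is no backtracking and no error position, only success or `Nothing`.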

There is probably a way to do "multiple steps" parsing using parsec and its tokenizer (but I haven't found any useful documentation on how to use parsec's tokenizer).

Answer 2

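The spaces in emptyLine consumes the '\n', so newline has no '\n' left to parse. You can write it as:

emptyLine = do
  -- isSpace comes from Data.Char
  skipMany $ satisfy (\c -> isSpace c && c /= '\n')
  newline

You should also change parseListing to: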

parseListing = do 
    cont <- parseSeries `sepEndBy` emptyLines
    eof
    return cont

I think sepEndBy is better than sepBy here because it also skips any blank lines at the end of the file.
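Put together with the rest of the question's code, the two changes might look like the sketch below. It uses the `Text.Parsec` module names instead of the question's `Text.ParserCombinators.Parsec` import, adds type signatures, and makes one change that is not part of this answer: the `try` in `dataline`, assumed here so that trailing blanks before the newline do not make the line fail.

```haskell
import Data.Char (isSpace)
import Text.Parsec
import Text.Parsec.String (Parser)

-- A blank line: optional horizontal whitespace, then the newline itself.
emptyLine :: Parser Char
emptyLine = do
  skipMany $ satisfy (\c -> isSpace c && c /= '\n')
  newline

emptyLines :: Parser String
emptyLines = many1 emptyLine

-- One line of numbers, kept as strings as in the question.
dataline :: Parser [String]
dataline = do
  first <- many1 digit
  -- try: back out of the blanks if no further number follows them
  rest  <- many (try (many1 (oneOf " \t") >> many1 digit))
  _ <- many (oneOf " \t")
  _ <- newline
  return (first : rest)

parseSeries :: Parser [[String]]
parseSeries = many1 dataline

-- Series separated, and optionally followed, by blank lines.
parseListing :: Parser [[[String]]]
parseListing = do
  cont <- parseSeries `sepEndBy` emptyLines
  eof
  return cont
```

Running `parse parseListing "(test)"` on the file contents should then give one list of rows per block. Every data line still has to end with a newline.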

Answer 3

A few things:

spaces includes newlines, so spaces >> newline always fails, which means that the emptyLine parser always fails.

I had luck with these definitions of parseSeries and parseListing:

parseSeries = do
  s <- many1 dataline
  spaces                  -- eat trailing whitespace
  return s

parseListing = do
  spaces                  -- ignore leading whitespace
  ss <- many parseSeries  -- begin parseSeries at non-whitespace
  eof
  return ss

The idea is that a parser always eats the whitespace that follows it. This approach also handles empty files.
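The first point can be checked in isolation. The sketch below compares the question's emptyLine with a variant that skips only spaces and tabs; the names `greedyEmptyLine`, `horizontalEmptyLine` and `accepts` are invented for this illustration.

```haskell
import Data.Either (isRight)
import Text.Parsec
import Text.Parsec.String (Parser)

-- The question's version: spaces skips every isSpace character,
-- newlines included, so nothing is left for newline to match.
greedyEmptyLine :: Parser Char
greedyEmptyLine = spaces >> newline

-- Skipping only spaces and tabs leaves the newline in place.
horizontalEmptyLine :: Parser Char
horizontalEmptyLine = skipMany (oneOf " \t") >> newline

-- Hypothetical helper: does the parser accept the given input?
accepts :: Parser a -> String -> Bool
accepts p = isRight . parse p "(demo)"
```

In GHCi, `accepts greedyEmptyLine "  \n"` is `False`, while `accepts horizontalEmptyLine "  \n"` is `True`.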

Answer 4

Do you have any idea of what's wrong with my parser ?

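A few things:

As other answerers have already pointed out, the spaces parser is designed to consume a sequence of characters that satisfy Data.Char.isSpace; the newline ('\n') is one such character. Therefore, your emptyLine parser always fails, because newline expects a newline character that has already been consumed.
You probably shouldn't use the newline parser in your "line" parsers, because those parsers will fail on the last line of the file if the file doesn't end with a newline.
Why not use Parsec 3 (Text.Parsec.*) rather than Parsec 2 (Text.ParserCombinators.*)?
Why not parse the numbers as Integers or Ints rather than keep them as Strings?
Personal preference, but you rely too much on do notation for my taste, to the detriment of readability. For instance,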
```
data1 = do
  dat <- many1 digit
  return (dat)
```
can be simplified to
```
data1 = many1 digit
```
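It's good practice to add type signatures to all top-level bindings.
Be consistent when naming your parsers: why "parseListing" rather than simply "listing"?
Have you considered using a different type of input stream (e.g. Text) for better performance?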
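Two of the points above, the missing final newline and parsing numbers rather than strings, can be illustrated with a small sketch. `lineEnd` is a hypothetical helper that accepts either a newline or the end of input, and `cell` assumes the numbers fit in an `Int` (read does not check for overflow).

```haskell
import Control.Monad (void)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: a line ends at a newline or at end of input.
lineEnd :: Parser ()
lineEnd = void newline <|> eof

-- A cell is a run of digits, converted to Int right away.
cell :: Parser Int
cell = read <$> many1 digit

-- Cells separated by spaces or tabs; trailing blanks are allowed.
dataLine :: Parser [Int]
dataLine = (cell `sepEndBy1` many1 (oneOf " \t")) <* lineEnd

-- One block of data lines, with or without a final newline.
series :: Parser [[Int]]
series = many1 dataLine <* eof
```

With these definitions, `parse series "" "1 2\n3 4"` succeeds with `[[1,2],[3,4]]` even though the last line has no terminating newline.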

Do you have an example on how to parse a structured bunch of data separated by empty lines ?

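Below is a simplified version of the kind of parser you need. Note that the input is not expected to start with an empty line (but may end with empty lines), and "data lines" must not contain leading whitespace but may contain trailing whitespace (in the sense of the whitespace parser defined below, which excludes newlines).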

module Main where

import Data.Char ( isSpace )
import Text.Parsec
import Text.Parsec.String ( Parser )

eolChar :: Char
eolChar = '\n'

eol :: Parser Char
eol = char eolChar

whitespace :: Parser String
whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar

emptyLine :: Parser String
emptyLine = whitespace

emptyLines :: Parser [String]
emptyLines = sepEndBy1 emptyLine eol

cell :: Parser Integer
cell = read <$> many1 digit

dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace
--             ^
-- replace by sepBy1 if no trailing whitespace is allowed in a "data line"

dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol

listing :: Parser [[[Integer]]]
listing = sepEndBy dataLines emptyLines

main :: IO ()
main = do
    fichier <- readFile "test_listtst.txt"
    case parse listing "(test)" fichier of
        -- report the parse error instead of discarding it
        Left err    -> putStrLn "!!! Error !!!" >> print err
        Right serie -> mapM_ print serie

Test:

λ> main
[[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
[[1,154],[2,321],[3,519]]
[[1,235,623,684],[2,871,699,557],[3,918,686,49]]

Answer 5

Here is another approach, which lets you stream in the data and process each block as it is recognized:

import Data.Char
import Control.Monad

-- toBlocks - convert a list of lines into a list of blocks
toBlocks :: [String] -> [[[String]]]
toBlocks []  = []
toBlocks theLines =
  let (block,rest) = break isBlank theLines
      next = dropWhile isBlank rest
  in  if null block
        then toBlocks next
        else [ words x | x <- block ] : toBlocks next
  where isBlank str = all isSpace str

main' path = do
  content <- readFile path
  forM_ (toBlocks (lines content)) $ print

Parsec has to read in the entire file before it gives you the list of blocks, which may be a problem if your input file is large.


parsing

haskell

parsec