Recursively return all words from a .txt file using attoparsec

Question

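I'm new to Haskell, and I've only just started learning how to use attoparsec to parse large amounts of English text from a .txt file. I know how to get the word count of a .txt file without attoparsec, but I'm a bit stuck doing it with attoparsec. When I run the code below on, say,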

"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"

I only get back:

Done " World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word = "Hello"})

Here is my current code:

{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

{-
This is how I would usually get the length of the list of words in a .txt file normally.

countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)

-}

data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word

main :: IO()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input

Also, how can I get an integer count of the words later on? And do you have any suggestions on how to get started with attoparsec?

Answer 1

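You're off to a good start: you can already parse a single word.
What you need next is a Parser [Prose]. You can express it by combining your prose parser with another parser that consumes the "not prose" parts, using sepBy or sepBy1, both of which you can look up in the Data.Attoparsec.Text documentation.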

From there, the simplest way to get the word count is to take the length of the [Prose] you obtain.

Edit:

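Here is a minimal working example. The parser runner has been replaced with parseOnly so that any remaining input is ignored, which means trailing non-words won't make the parser go cray-cray.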

{-# LANGUAGE OverloadedStrings #-}

module Atto where

--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)

import qualified Data.Text as T

data Prose = Prose {
  word :: String
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter

separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()

wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
        wp = optional separator *> prose `sepBy1` separator

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words

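The parser above won't give exactly the same results as concatMap words . lines, because it also breaks words on the characters . , and '. Modifying this behaviour is left as a simple exercise.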
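For readers who want the whitespace-only behaviour of words, here is a minimal sketch of one way to do it. It is not part of the example above: the names token, tokens, countWords and countWordsInFile are made up for illustration, and it assumes that a "word" is any maximal run of non-whitespace characters.

```haskell
module WordCount where

import Data.Attoparsec.Text (Parser, many', parseOnly, skipSpace, takeWhile1)
import Data.Char (isSpace)
import qualified Data.Text as T
import qualified Data.Text.IO as Txt

-- Assumption: a word is a maximal run of non-whitespace characters,
-- so "World," and "Mr.Robot." each count as a single word.
token :: Parser T.Text
token = takeWhile1 (not . isSpace)

-- Skip leading whitespace, then read tokens, skipping whitespace after each one.
tokens :: Parser [T.Text]
tokens = skipSpace *> many' (token <* skipSpace)

-- Hypothetical helper: number of whitespace-separated words in a Text.
-- A parse failure is mapped to 0, although this parser accepts any input.
countWords :: T.Text -> Int
countWords = either (const 0) length . parseOnly tokens

-- Hypothetical helper: count the words in a file such as "small.txt".
countWordsInFile :: FilePath -> IO Int
countWordsInFile path = countWords <$> Txt.readFile path
```

On the sample sentence from the question, countWords should return 9, the same as length (concatMap words (lines input)).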

Hope that helps! :)

Answer 2

You're on the right track! You've written a parser (prose) that reads a single word: many' letter recognises a sequence of letters.
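To see why the original program prints only the first word, it may help to look at what parse hands back: the parsed value together with whatever input the parser did not consume. The sketch below is only an illustration; oneWord and firstWordAndRest are hypothetical names, not something from the question.

```haskell
module FirstWord where

import Data.Attoparsec.Text (IResult (..), Parser, letter, many', parse)
import qualified Data.Text as T

-- The question's parser: zero or more letters, i.e. a single word.
oneWord :: Parser String
oneWord = many' letter

-- Hypothetical helper: run the parser once and return the word
-- together with the input that was left unconsumed.
firstWordAndRest :: T.Text -> Maybe (String, T.Text)
firstWordAndRest input =
  case parse oneWord input of
    Done rest w -> Just (w, rest)
    -- Fail, or Partial when further input could still extend the word.
    _           -> Nothing
```

For the input "Hello World", firstWordAndRest gives Just ("Hello"," World"): the parser stops at the first non-letter and nothing tells it to carry on, which is the gap that sepBy fills.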

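Now that you've figured out how to parse a single word, your job is to scale that up to parse a sequence of words separated by spaces. That's what sepBy does: p `sepBy` q runs the p parser repeatedly, interspersed with the q parser.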

So a parser for a sequence of words looks something like this (I took the liberty of renaming your prose to word):

import Control.Applicative (many, some)
import Data.Attoparsec.Text

word = many letter
phrase = word `sepBy` some space  -- "some" runs a parser one-or-more times

ghci> parseOnly phrase "wibble wobble wubble"  -- with -XOverloadedStrings
Right ["wibble","wobble","wubble"]

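Now, phrase, being built from letter and space, will die on characters that are neither letters nor spaces, such as ' and . (full stop). I'll leave it to you to figure out how to fix that. (As a hint, you'll probably need to change many letter to many (letter <|> ...), depending on how you want it to behave on the various punctuation marks.)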
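One possible way to follow that hint is sketched below. It is not the only reasonable choice and it rests on two assumptions that are mine rather than the question's: an apostrophe counts as part of a word, and any other non-letter character acts as a separator. The name isWordChar is made up for the example.

```haskell
module Phrase where

import Control.Applicative (some, (<|>))
import Data.Attoparsec.Text
  (Parser, char, letter, parseOnly, satisfy, sepBy, skipMany, skipMany1)
import Data.Char (isAlpha)
import qualified Data.Text as T

-- Assumption: letters and apostrophes make up words, so "I'm" stays whole.
isWordChar :: Char -> Bool
isWordChar c = isAlpha c || c == '\''

-- One or more word characters; "some" rejects empty words.
word :: Parser String
word = some (letter <|> char '\'')

-- Assumption: a run of anything else (spaces, commas, full stops, ...) separates words.
separator :: Parser ()
separator = skipMany1 (satisfy (not . isWordChar))

-- Skip any leading separators, then read words with separators in between.
phrase :: Parser [String]
phrase = skipMany (satisfy (not . isWordChar)) *> word `sepBy` separator
```

With these choices, parseOnly phrase on the question's sample text yields ten words, with "I'm" kept intact and "Mr.Robot" split into "Mr" and "Robot". A lone apostrophe would also be accepted as a word, which may or may not be what you want.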

parsing

haskell

attoparsec