Parsing the first occurrence of a word that is not preceded by white space

Setup

I need to find, in some .txt files, the first occurrence of a word that is not preceded by white space. Here are the possible cases:

-- * should succeed
t1 = "hello\t999\nworld\t0"
t2 = "world\t0\nhello\t999\n"
t3 = "world world\t0\nhello\t999\n"

-- * should fail
t4 = "world\t0\nhello world\t999\n"
t5 = "hello world\t999\nworld\t0"
t6 = "world hello\t999\nworld\t0"

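Right now t6 succeeds (it appears as t5 in the code below) even though it should fail, because my parser consumes any character until it reaches hello. Here is my parser:

My solution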

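{-# LANGUAGE OverloadedStrings #-}  -- needed so that literals such as "\t" can be passed to `string`

import Control.Applicative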

import Data.Attoparsec.Text.Lazy
import Data.Attoparsec.Combinator
import Data.Text hiding (foldr)
import qualified Data.Text.Lazy as L (Text, pack)



-- * should succeed
t1 = L.pack "hello\t999\nworld\t0"
t2 = L.pack "world\t0\nhello\t999\n"

-- * should fail
t3 = L.pack "world\t0\nhello world\t999\n"
t4 = L.pack "hello world\t999\nworld\t0"
t5 = L.pack "world hello\t999\nworld\t0"

p = occur "hello"    

---- * discard all text until word `w` occurs, and find its only field `n`
occur :: String -> Parser (String, Int)
occur w = do
    pUntil w
    string . pack $ w
    string "\t"
    n <- natural 
    string "\n"
    return (w, read n)


-- * Parse a natural number
natural :: Parser String
natural = many1' digit

-- * skip over all words in Text stream until the word we want
pUntil :: String -> Parser String 
pUntil = manyTill anyChar . lookAhead . string . pack 

Here is one approach to consider:

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative

import Data.Attoparsec.Text.Lazy
import Data.Attoparsec.Combinator
import Data.Text hiding (foldr)
import qualified Data.Text.Lazy as L (Text, pack)
import Data.Monoid

natural = many1' digit

-- manyTill anyChar (try $ char c <* eof)

pair0 w = do
  string (w <> "\t")
  n <- natural
  string "\n"
  return n

pair1 w = do
  manyTill anyChar (try $ string ("\n" <> w <> "\t"))
  n <- natural
  string "\n"
  return n

pair w = pair0 w <|> pair1 w

-- * should succeed
t1 = "hello\t999\nworld\t0"
t2 = "world\t0\nhello\t999\n"
t3 = "world world\t0\nhello\t999\n"

-- * should fail
t4 = "world\t0\nhello world\t999\n"
t5 = "hello world\t999\nworld\t0"
t6 = "world hello\t999\nworld\t0"

test t = parseTest (pair "hello") (L.pack t)

main = do
  test t1; test t2; test t3
  test t4; test t5; test t6

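The idea is that pair0 matches a pair for the given key at the very start of the input, and pair1 matches a pair that comes right after a newline.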
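As a cross-check on the expected results, the same "key at the start of a line" rule can be sketched without a parser library. This is not the attoparsec solution above, only a plain line-by-line check using base; `lookupKey` is a hypothetical name, and unlike pair0/pair1 it does not require a newline after the number.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- Hypothetical helper (not part of the answer): look at the first line that
-- starts with "key<TAB>" and return the digits that make up the rest of it.
lookupKey :: String -> String -> Maybe String
lookupKey key input = do
  -- Only lines that begin with the key followed by a tab are candidates,
  -- so "world hello\t999" and "hello world\t999" are both rejected.
  rest <- listToMaybe (mapMaybe (stripPrefix (key ++ "\t")) (lines input))
  case span isDigit rest of
    (ds@(_:_), "") -> Just ds   -- a non-empty run of digits and nothing else
    _              -> Nothing
```

With the six inputs from the question, `lookupKey "hello"` should return `Just "999"` for t1 to t3 and `Nothing` for t4 to t6.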

The key point is that manyTill anyChar (try p) keeps skipping characters until the parser p succeeds.

(By the way, I learned this use of manyTill with try from an answer written by @Cactus.)
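To see that skipping behaviour in isolation, here is a minimal sketch, assuming the strict `Data.Attoparsec.Text` module and its `parseOnly` runner. The names `skipPast`, `found`, and `missing` are made up for illustration. As far as I know, attoparsec parsers always backtrack on failure and `try` is kept mainly for Parsec compatibility, so it is retained here only to mirror the code above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, anyChar, manyTill, parseOnly, string, try)

-- Hypothetical helper: consume characters one at a time until `p` succeeds,
-- and return the characters that were skipped before the match.
skipPast :: Parser a -> Parser String
skipPast p = manyTill anyChar (try p)

-- The key sits at the start of the second line, so the skip stops there.
found :: Either String String
found = parseOnly (skipPast (string "\nhello\t")) "world\t0\nhello\t999\n"

-- The key is preceded by a space rather than a newline, so the input runs
-- out before the delimiter ever matches and the parse fails.
missing :: Either String String
missing = parseOnly (skipPast (string "\nhello\t")) "world hello\t999\n"
```

Here `found` should be `Right "world\t0"` (the text skipped before the match), while `missing` should be a `Left` carrying attoparsec's error message.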