Multi-line *non* match with attoparsec

I'm trying to parse (PostgreSQL) logs, which can contain multi-line entries.

2016-01-01 01:01:01 entry1
2016-01-01 01:01:02 entry2a
    entry2b
2016-01-01 01:01:03 entry3

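So, in a Perl or Python script I'd just grab the next line and, if it doesn't start with a timestamp, append it to the previous log entry. What's the sensible way to approach this with attoparsec hooked up to io-streams? I obviously want to do something with lookAhead and failing to match a timestamp, but my brain is just missing something.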


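Nope, still not seeing it. I've stripped down what I have to the code below. Parsing a single line is easy. What I can't figure out is how to parse "up to" another parser's match: I can see there's a lookAhead function I could use, but I can't see how it fits with applying a "not" condition.

I can't see how to make the match work either. It's entirely possible my brain has just seized up.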

{-# LANGUAGE OverloadedStrings #-}

module DummyParser (
    LogStatement (..), parseLogLine
    -- and, so we can test it...
    , LogTimestamp , parseTimestamp
    , parseSqlStmt
    , newLineAndTimestamp
) where

{-  we want to parse...
TIME001 statement: SELECT true;
TIME002 statement: SELECT 'b',
  'c';
TIME003 statement: SELECT 3;
-}

import           Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8            as B

type LogTimestamp = Int

data LogStatement = LogStatement {
     l_ts  :: LogTimestamp
    ,l_sql :: String
} deriving (Eq, Show)


restOfLine :: Parser B.ByteString
restOfLine = do
    rest <- takeTill (== '\n')
    isEOF <- atEnd
    if isEOF then
        return rest
    else
        (char '\n') >> return rest


-- e.g. TIME001
parseTimestamp :: Parser LogTimestamp
parseTimestamp  = do
  string "TIME"
  digits  <- count 3 digit
  return (read digits)


-- e.g. statement: SELECT 1
parseSqlStmt :: Parser String
parseSqlStmt = do
    string "statement: "
    -- How can I match until the next timestamp?
    sql <- restOfLine
    return (B.unpack sql)


newLineAndTimestamp :: Parser LogTimestamp
newLineAndTimestamp = (char '\n') *> parseTimestamp


spaces :: Parser ()
spaces = do
    skipWhile (== ' ')


-- e.g. TIME001 statement: SELECT * FROM schema.table;
parseLogLine :: Parser LogStatement
parseLogLine = do
    log_ts <- parseTimestamp
    spaces
    log_sql <- parseSqlStmt
    let ls = LogStatement log_ts log_sql
    return ls

Edit: so, this is what I ended up with, thanks to arrowd's help.

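-- Needs: import Control.Applicative ((<|>)) for the alternative used below.
-- Succeeds without consuming input if a timestamp comes next.
isTimestampNext :: Parser ()
isTimestampNext = lookAhead parseTimestamp *> pure ()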

parseLogLine :: Parser LogStatement
parseLogLine = do
    log_ts <- parseTimestamp
    spaces
    log_sql <- parseSqlStmt
    extraLines <- manyTill restOfLine (endOfInput <|> isTimestampNext)
    let ls = LogStatement log_ts (log_sql ++ (B.unpack $ B.concat extraLines))
    return ls
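
For anyone who wants to try this end to end, here is a minimal self-contained sketch of how the pieces might fit together, using the simplified `TIME001 statement: ...` format from the question. The `parseLog` wrapper and the choice to rejoin continuation lines with `'\n'` (rather than concatenating them directly) are my own assumptions for illustration, not part of the code above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Applicative              (many, (<|>))
import           Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8            as B

type LogTimestamp = Int

data LogStatement = LogStatement
    { l_ts  :: LogTimestamp
    , l_sql :: String
    } deriving (Eq, Show)

-- Take everything up to the next '\n' and consume the newline
-- (or stop at end of input if there is no trailing newline).
restOfLine :: Parser B.ByteString
restOfLine = takeTill (== '\n') <* (endOfLine <|> endOfInput)

-- e.g. TIME001
parseTimestamp :: Parser LogTimestamp
parseTimestamp = read <$> (string "TIME" *> count 3 digit)

-- Succeeds without consuming input if a timestamp comes next.
isTimestampNext :: Parser ()
isTimestampNext = lookAhead parseTimestamp *> pure ()

-- One log entry: the first line plus every following line
-- that does not start with a timestamp.
parseLogLine :: Parser LogStatement
parseLogLine = do
    ts    <- parseTimestamp
    skipWhile (== ' ')
    first <- string "statement: " *> restOfLine
    extra <- manyTill restOfLine (endOfInput <|> isTimestampNext)
    return (LogStatement ts (B.unpack (B.intercalate "\n" (first : extra))))

-- Hypothetical helper: parse a whole log held in memory.
parseLog :: B.ByteString -> Either String [LogStatement]
parseLog = parseOnly (many parseLogLine <* endOfInput)
```

Fed the three-entry sample from the question, `parseLog` should return three `LogStatement` values, with the second one carrying both lines of the `SELECT 'b', 'c';` statement. Note that a continuation line which itself happens to start with something that looks like a timestamp would be treated as a new entry.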

A combinator I share on many attoparsec questions:

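-- Needs: import Control.Applicative (optional)
-- Succeeds, consuming nothing, only when p fails here.
-- (A plain `p >> fail "..."` would fail whether or not p matches.)
notFollowedBy p = optional p >>= maybe (pure ()) (const (fail "not followed by"))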

Your solution would then look something like

parseLogLine :: Parser LogStatement
parseLogLine = do
    log_ts <- parseTimestamp
    spaces
    log_sql <- parseSqlStmt
    newlineLeftover <- ((notFollowedBy parseTimestamp) *> parseSqlStmt) <|> pure ""
    let ls = LogStatement log_ts (log_sql ++ newlineLeftover)
    return ls

The right-hand side of *> in the newlineLeftover expression needs a bit more work, I think, but that's the general idea.
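
One way that extra work might go, as a rough sketch rather than a definitive version: make the negative lookahead succeed (without consuming input) exactly when a timestamp does *not* come next, and collect any number of continuation lines with `many`. The name `continuationLine`, the newline-joining of the lines, and the restated types are assumptions for illustration. The extra `notFollowedBy endOfInput` guard is there because `restOfLine` can succeed on empty input, which would make `many` loop forever.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Applicative              (many, optional, (<|>))
import           Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8            as B

type LogTimestamp = Int

data LogStatement = LogStatement
    { l_ts  :: LogTimestamp
    , l_sql :: String
    } deriving (Eq, Show)

-- Succeeds, consuming nothing, only if p would fail at this position.
notFollowedBy :: Parser a -> Parser ()
notFollowedBy p =
    optional (lookAhead p) >>= maybe (pure ()) (const (fail "not followed by"))

-- Take everything up to the next '\n' and consume the newline, if any.
restOfLine :: Parser B.ByteString
restOfLine = takeTill (== '\n') <* (endOfLine <|> endOfInput)

-- e.g. TIME001
parseTimestamp :: Parser LogTimestamp
parseTimestamp = read <$> (string "TIME" *> count 3 digit)

-- A line that belongs to the previous entry: we are not at end of
-- input, and the line does not begin with a timestamp.
continuationLine :: Parser B.ByteString
continuationLine =
    notFollowedBy endOfInput *> notFollowedBy parseTimestamp *> restOfLine

parseLogLine :: Parser LogStatement
parseLogLine = do
    ts    <- parseTimestamp
    skipWhile (== ' ')
    first <- string "statement: " *> restOfLine
    extra <- many continuationLine
    return (LogStatement ts (B.unpack (B.intercalate "\n" (first : extra))))
```

Running `parseOnly (many parseLogLine <* endOfInput)` over the whole log then gives one `LogStatement` per timestamped entry. Compared with the `manyTill` version, the stop condition lives inside `continuationLine`, so the same combinator can be reused wherever "anything but a timestamp" is needed.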