为什么 runConduit 不发送所有数据？

Question

这是我正在解析的一些 xml:

<?xml version="1.0" encoding="utf-8"?>
<data>
<row ows_Document='Weekly Report 10.21.2020'
     ows_Category='Weekly Report'/>
<row ows_Document='Daily Update 10.20.2020'
     ows_Category='Daily Update'/>
<row ows_Document='Weekly Report 10.14.2020'
     ows_Category='Weekly Report'/>
<row ows_Document='Weekly Report 10.07.2020'
     ows_Category='Weekly Report'/>
<row ows_Document='Spanish: Reporte Semanal 07.10.2020' 
     ows_Category='Weekly Report'/>
</data>

我一直在想办法让管道解析器拒绝记录，除非 ows_Category 是 Weekly Report 并且 ows_Document 不包含 Spanish。起初，我使用了一个虚拟值（在下面的 parseDoc' 中）在解析后将它们过滤掉，但后来我意识到我应该能够使用 Maybe （在下面相同的 parseDoc 中), 连同 join 将我的 Maybe 层与 tag' 事件解析器使用的基于名称或属性匹配失败的层折叠起来。它可以编译，但行为很奇怪，显然甚至没有尝试将某些元素发送到解析器！怎么会这样？

{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import           Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import           Data.Foldable
import           Data.String
import qualified Data.Text                  as T
import           Data.XML.Types
import           Text.XML.Stream.Parse

newtype Doc = Doc
  { name :: String
  } deriving (Show)

main :: IO ()
main = do
  r <- L8.readFile "oha.xml"

  let doc = Doc . T.unpack
      check (x,y) a b = if y == "Weekly Report" && not (T.isInfixOf "Spanish" x) then a else b

      t :: (MonadThrow m, MonadIO m) => ((T.Text, T.Text) -> ConduitT Event o m c)
                                     -> ConduitT Event o m (Maybe c)
      t f = tag' "row" ((,) <$> requireAttr "ows_Document" <*> requireAttr "ows_Category") $ \x -> do
        liftIO $ print x
        f x

      parseDoc, parseDoc' :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
      parseDoc  = (join <$>) . t $ \z@(x,_) -> return $       check z (Just $ doc x)  Nothing -- this version doesn't get sent all of the data! why!?!?
      parseDoc' =              t $ \z@(x,_) -> return $ doc $ check z             x $ T.pack bad -- dummy value

      parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
                                             -> ConduitT Event o m [Doc]
      parseDocs = f tagNoAttr "data" . many'
      f g n = force (n <> " required") . g (fromString n)

      go p = runConduit $ parseLBS def r .| parseDocs p
      bad = "no good"

  traverse_ print =<<                              go parseDoc
  putStrLn ""
  traverse_ print =<< filter ((/= bad) . name) <$> go parseDoc'

输出——注意 parseDoc 甚至没有发送其中一条记录（应该成功的记录，从 10.14 开始），而 parseDoc' 的行为符合预期：

("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.07.2020"}

("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.14.2020","Weekly Report")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.14.2020"}
Doc {name = "Weekly Report 10.07.2020"}

当我尝试通过删除与 ows_Category 有关的所有内容来进一步简化时，突然 parseDoc 工作正常，确定了想法的合理性？当我删除与 ows_Document 相关的所有内容时，问题仍然存在。

我怀疑我应该用 requireAttrRaw 来做这件事，但我无法理解它并且找不到 doc/examples.

这和Applicative有关系吗——现在想想，应该不会因为检查值而失败吧？

更新

我从作者那里找到了这个 answer 以前版本的库，其中包括类似情况下有趣的 force "fail msg" $ return Nothing，但是放弃了所有解析而不是仅仅使当前解析失败.

this comment suggests i need to throw an exception, and in the source，他们使用类似 lift $ throwM $ XmlException "failed check" $ Just event 的东西，但像 force ... return Nothing，这会杀死所有解析，而不仅仅是当前的解析器。我也不知道如何得到 event.

这里有一个合并的 pull request 声称已经解决了这个问题，但它没有讨论如何使用它，只是说它“微不足道”:)

回答

明确回答：

  parseAttributes :: AttrParser (T.Text, T.Text)
  parseAttributes = do
    d <- requireAttr "ows_Document"
    c <- requireAttr "ows_Category"
    ignoreAttrs
    guard $ not (T.isInfixOf "Spanish" d) && c == "Weekly Report"
    return d

  parseDoc :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
  parseDoc = tag' "row" parseAttributes $ return . doc

或者，因为在这种情况下可以独立检查属性值：

  parseAttributes = requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
                 <* requireAttrRaw' "ows_Category" ("Weekly Report" ==)
                 <* ignoreAttrs
    where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ \(n',as) ->
            asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as

但后者留下了关于 requireAttrRaw:

的这些问题

如果我们负责验证 Name，我们不需要知道命名空间吗？
为什么 requireAttrRaw 发送给我们 [Content] 而不是两个 Maybe Content，ContentText 和 ContentEntity 各发送一个？
我们应该如何处理 ContentEntity“传递解析”？

Answer 1

tl;dr 在tag' "row" parseAttributes parseContent中，check函数属于parseAttributes，不属于parseContent。

为什么它没有按预期运行

xml-conduit 是（值得注意的）围绕以下不变量设计的：

当解析器的类型为 ConduitT Event o m (Maybe a) 时，Maybe 层编码是否 Event 已被消耗

parseName

parseAttributes

tag' parseName parseAttributes parseContent 消耗 Events
tag' parseName parseAttributes parseContent 运行s parseContent 当且仅当 parseName 和 parseAttributes 都成功时

在parseDoc中：

在parseContent部分调用了check函数；在这个阶段，tag' 已经承诺使用 Events，根据不变量 2
2 Maybe 层的堆叠 join 在一起：
- check函数的输出，编码当前<row/>元素是否相关
- 来自 tag' 签名的“标准”Maybe 层，根据不变量 1

这基本上打破了不变量 1：当 check returns Nothing、parseDoc returns Nothing 尽管消耗了 Event整个 <row/> 元素的 s。这导致 xml-conduit 的所有组合子的未定义行为，特别是 many'（在下面分析。）

为什么它的行为如此

many' 组合子依赖不变量 1 来完成它的工作。定义为many' consumer = manyIgnore consumer ignoreAnyTreeContent，即：

尝试consumer
if consumer returns Nothing，然后使用 ignoreAnyTreeContent 跳过元素或内容，假设它还没有被 consumer，递归回到步骤(1)

在您的情况下，consumer returns Nothing 用于 Daily Update 10.20.2020 项目，即使完整的 <row/> 元素已被消耗。因此，ignoreAnyTreeContent 是运行作为跳过特定 <row/> 的一种方式，但实际上最终会跳过下一个（Weekly Report 10.14.2020）。

如何实现预期的行为

将check逻辑移动到parseAttributes部分，这样Event消费就与check是否通过相关联。

为什么 runConduit 不发送所有数据？

Why doesn't runConduit send all the data?

haskell

xml-parsing

applicative

xml-conduit

alternative-functor

为什么它没有按预期运行

为什么它的行为如此

如何实现预期的行为