如何在资源有限的情况下解析 Haskell 中的大型 XML 文件?

How to parse a large XML file in Haskell with limited amount of resources?

我想从 Haskell 中的一个大 XML 文件(大约 20G)中提取信息。由于它是一个大文件,我使用了 Hexpath.

中的 SAX 解析函数

这是我测试过的简单代码:

import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax

parse :: FilePath -> IO ()
parse path = do
    inputText <- L.readFile path
    let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
    let txt = foldl' processEvent "" saxEvents
    putStrLn txt

在 Cabal 中激活分析后,它说 parse.saxEvents 占用了分配内存的 85%。我也用了foldr,结果是一样的

如果 processEvent 变得足够复杂,程序会崩溃并出现 stack space overflow 错误。

我做错了什么?

你不说processEvent是什么样的。原则上,使用 lazy ByteString 对延迟生成的输入进行严格的左折叠应该没有问题,所以我不确定你的情况出了什么问题。但是在处理巨大的文件时应该使用适合流式处理的类型!

事实上,hexpat确实有'streaming'接口(就像xml-conduit一样)。它使用 List 包中不太知名的 List library and the rather ugly List class it defines. In principle the ListT type 应该可以很好地工作。由于缺少组合器,我很快就放弃了,并为 Pipes.ListT 的包装版本编写了一个丑陋的 List class 的适当实例,然后我用它来导出普通的 Pipes.Producer 功能类似于 parseProduce。为此所需的琐碎操作在下面附加为 PipesSax.hs

一旦我们有了 parseProducer,我们就可以将 ByteString 或 Text Producer 转换为带有 Text 或 ByteString 组件的 SaxEvents 的 Producer。下面是一些简单的操作。我用的是238M "input.xml";这些程序永远不需要超过 6 MB 的内存,从 top 来判断。

-- Sax.hs 大多数 IO 操作使用定义在底部的 registerIds 管道,它是为 xml 的一个巨大位定制的,这是一个有效的 1000 片段http://sprunge.us/WaQK

{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX 
import Pipes  -- cabal install pipes pipes-bytestring 
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m () 
                 -> Producer (SAXEvent ByteString ByteString) m ()
sax =  parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 =  runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =  
 IO.withFile "input.xml" IO.ReadMode $ \inp -> 
 IO.withFile "output.txt" IO.WriteMode $ \out -> 
   runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out 

-- folds:
-- print number of IDs
main3 =  IO.withFile "input.xml" IO.ReadMode $ \inp -> 
           do n <- P.length $ sax (fromHandle inp) >-> registryIds
              print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 =  IO.withFile "input.xml" IO.ReadMode $ \inp ->
         do let pipeline =  sax (fromHandle inp) >-> registryIds >-> P.map readIntId
            n <- P.fold (+) 0 id pipeline
            print n
  where
   readIntId :: ByteString -> Integer
   readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")] 
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do 
  e <- await  -- we look for a 'SAXEvent'
  case e of -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do yield a
                                                         yield "\n"
                                                         registryIds
    _ -> registryIds  

-- 'library': PipesSax.hs

这只是 newtypes Pipes.ListT 以获得适当的实例。我们不导出与 ListListT 相关的任何内容,而只是使用标准的 Pipes.Producer 概念。

{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where 
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes  
import qualified Pipes.Internal as I

parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text) 
  => ParseOptions tag text
  -> Producer ByteString m () 
  -> Producer (SAXEvent tag text) m ()
parseProducer opt  = enumerate . enumerate_ 
                     . parseG opt 
                     . Select_ . Select

parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text) 
  => ParseOptions tag text
  -> Producer ByteString m () 
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt = 
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select  

newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
    deriving (Functor, Monad, MonadPlus, MonadIO
             , Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
 type ItemM (ListT_ m) = m
 joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_) 
 runList   = liftM emend  . next  . enumerate . enumerate_
   where 
     emend (Right (a,q)) = Cons a (Select_ (Select q))
     emend _ = Nil