如何在资源有限的情况下解析 Haskell 中的大型 XML 文件?
How to parse a large XML file in Haskell with limited amount of resources?
我想从 Haskell 中的一个大 XML 文件(大约 20G)中提取信息。由于它是一个大文件,我使用了 Hexpath.
中的 SAX 解析函数
这是我测试过的简单代码:
import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
inputText <- L.readFile path
let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
let txt = foldl' processEvent "" saxEvents
putStrLn txt
在 Cabal 中激活分析后,它说 parse.saxEvents
占用了分配内存的 85%。我也用了foldr
,结果是一样的
如果 processEvent
变得足够复杂,程序会崩溃并出现 stack space overflow
错误。
我做错了什么?
你不说processEvent
是什么样的。原则上,使用 lazy ByteString
对延迟生成的输入进行严格的左折叠应该没有问题,所以我不确定你的情况出了什么问题。但是在处理巨大的文件时应该使用适合流式处理的类型!
事实上,hexpat
确实有'streaming'接口(就像xml-conduit
一样)。它使用 List 包中不太知名的 List
library and the rather ugly List
class it defines. In principle the ListT
type 应该可以很好地工作。由于缺少组合器,我很快就放弃了,并为 Pipes.ListT
的包装版本编写了一个丑陋的 List
class 的适当实例,然后我用它来导出普通的 Pipes.Producer
功能类似于 parseProduce
。为此所需的琐碎操作在下面附加为 PipesSax.hs
一旦我们有了 parseProducer
,我们就可以将 ByteString 或 Text Producer 转换为带有 Text 或 ByteString 组件的 SaxEvents
的 Producer。下面是一些简单的操作。我用的是238M "input.xml";这些程序永远不需要超过 6 MB 的内存,从 top
来判断。
-- Sax.hs
大多数 IO 操作使用定义在底部的 registerIds
管道,它是为 xml 的一个巨大位定制的,这是一个有效的 1000 片段http://sprunge.us/WaQK
{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m => Producer ByteString m ()
-> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
IO.withFile "input.xml" IO.ReadMode $ \inp ->
IO.withFile "output.txt" IO.WriteMode $ \out ->
runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do n <- P.length $ sax (fromHandle inp) >-> registryIds
print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
n <- P.fold (+) 0 id pipeline
print n
where
readIntId :: ByteString -> Integer
readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
e <- await -- we look for a 'SAXEvent'
case e of -- if it matches, we yield, else we go to the next event
StartElement "FacilitySite" [("registryId",a)] -> do yield a
yield "\n"
registryIds
_ -> registryIds
-- 'library': PipesSax.hs
这只是 newtypes Pipes.ListT 以获得适当的实例。我们不导出与 List
或 ListT
相关的任何内容,而只是使用标准的 Pipes.Producer 概念。
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
. parseG opt
. Select_ . Select
parseProducerLocations
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
deriving (Functor, Monad, MonadPlus, MonadIO
, Applicative, Alternative, Monoid, MonadTrans)
instance Monad m => List (ListT_ m) where
type ItemM (ListT_ m) = m
joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
runList = liftM emend . next . enumerate . enumerate_
where
emend (Right (a,q)) = Cons a (Select_ (Select q))
emend _ = Nil
我想从 Haskell 中的一个大 XML 文件(大约 20G)中提取信息。由于它是一个大文件,我使用了 Hexpath.
中的 SAX 解析函数这是我测试过的简单代码:
import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
inputText <- L.readFile path
let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
let txt = foldl' processEvent "" saxEvents
putStrLn txt
在 Cabal 中激活分析后,它说 parse.saxEvents
占用了分配内存的 85%。我也用了foldr
,结果是一样的
如果 processEvent
变得足够复杂,程序会崩溃并出现 stack space overflow
错误。
我做错了什么?
你不说processEvent
是什么样的。原则上,使用 lazy ByteString
对延迟生成的输入进行严格的左折叠应该没有问题,所以我不确定你的情况出了什么问题。但是在处理巨大的文件时应该使用适合流式处理的类型!
事实上,hexpat
确实有'streaming'接口(就像xml-conduit
一样)。它使用 List 包中不太知名的 List
library and the rather ugly List
class it defines. In principle the ListT
type 应该可以很好地工作。由于缺少组合器,我很快就放弃了,并为 Pipes.ListT
的包装版本编写了一个丑陋的 List
class 的适当实例,然后我用它来导出普通的 Pipes.Producer
功能类似于 parseProduce
。为此所需的琐碎操作在下面附加为 PipesSax.hs
一旦我们有了 parseProducer
,我们就可以将 ByteString 或 Text Producer 转换为带有 Text 或 ByteString 组件的 SaxEvents
的 Producer。下面是一些简单的操作。我用的是238M "input.xml";这些程序永远不需要超过 6 MB 的内存,从 top
来判断。
-- Sax.hs
大多数 IO 操作使用定义在底部的 registerIds
管道,它是为 xml 的一个巨大位定制的,这是一个有效的 1000 片段http://sprunge.us/WaQK
{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m => Producer ByteString m ()
-> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
IO.withFile "input.xml" IO.ReadMode $ \inp ->
IO.withFile "output.txt" IO.WriteMode $ \out ->
runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do n <- P.length $ sax (fromHandle inp) >-> registryIds
print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
n <- P.fold (+) 0 id pipeline
print n
where
readIntId :: ByteString -> Integer
readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
e <- await -- we look for a 'SAXEvent'
case e of -- if it matches, we yield, else we go to the next event
StartElement "FacilitySite" [("registryId",a)] -> do yield a
yield "\n"
registryIds
_ -> registryIds
-- 'library': PipesSax.hs
这只是 newtypes Pipes.ListT 以获得适当的实例。我们不导出与 List
或 ListT
相关的任何内容,而只是使用标准的 Pipes.Producer 概念。
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
. parseG opt
. Select_ . Select
parseProducerLocations
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
deriving (Functor, Monad, MonadPlus, MonadIO
, Applicative, Alternative, Monoid, MonadTrans)
instance Monad m => List (ListT_ m) where
type ItemM (ListT_ m) = m
joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
runList = liftM emend . next . enumerate . enumerate_
where
emend (Right (a,q)) = Cons a (Select_ (Select q))
emend _ = Nil