How to use the xml-conduit Cursor interface to extract information from a large XML file (around 30 GB)
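The following question is based on this question. The author of the accepted answer said that the streaming helper API in xml-conduit had not been updated for years (source: the accepted answer of the SO question), and he recommended the Cursor interface instead.
Based on the solution to that first question, I wrote the following Haskell code, which uses the Cursor interface of the xml-conduit package.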
{-# LANGUAGE OverloadedStrings #-}

import Prelude hiding (FilePath)  -- avoid a clash with Filesystem.Path.FilePath
import Data.Text (Text)
import Text.XML as XML (readFile, def)
import Text.XML.Cursor (Cursor, ($/), (&/), ($//), (>=>),
                        fromDocument, element, content)
import Data.Monoid (mconcat)
import Filesystem.Path (FilePath)
import Filesystem.Path.CurrentOS (fromText)

data Page = Page
    { title :: Text
    } deriving (Show)

-- Load the whole document into memory, then write one Page per line.
parse :: FilePath -> IO ()
parse path = do
    doc <- XML.readFile def path
    let cursor = fromDocument doc
    let pages = cursor $// element "page" >=> parseTitle
    writeFile "output.txt" ""
    mapM_ (appendFile "output.txt" . (++ "\n") . show) pages

-- Concatenate the text of every <title> child of the given <page> cursor.
parseTitle :: Cursor -> [Page]
parseTitle c = do
    let titleText = c $/ element "title" &/ content
    [Page (mconcat titleText)]

main :: IO ()
main = parse (fromText "input.xml")
This code works on small XML files. However, when it is run on a 30 GB XML file, the process is killed by the OS.
How can I make this code work on very large XML files?
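The Cursor module requires the entire document to be in memory, which is not feasible for a file this size. To process a file that large, you need to use the streaming interface.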
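As a rough illustration of what the streaming route could look like, here is a minimal sketch using `Text.XML.Stream.Parse`. It rests on several assumptions that are not in the question: a recent xml-conduit release (where `parseFile` takes a plain `String` path), the `conduit`, `xml-types`, `text` and `bytestring` packages being available, element names without an XML namespace, and `<page>` elements sitting directly under the root element (unlike `$//`, this sketch does not search deeper levels). The helper names `parsePage`, `pages`, `extractToFile` and `extractTitles` are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Conduit               (ConduitT, MonadThrow, encodeUtf8C, mapC,
                                        runConduit, runConduitRes, sinkFile,
                                        sinkList, unlinesC, (.|))
import           Control.Monad         (void)
import qualified Data.ByteString.Lazy  as L
import           Data.Text             (Text)
import qualified Data.Text             as T
import           Data.XML.Types        (Event)
import           Text.XML.Stream.Parse (anyName, content, def, many',
                                        manyYield', parseFile, parseLBS,
                                        tagIgnoreAttrs)

newtype Page = Page { title :: Text } deriving (Show)

-- Parse one <page>: collect the text of its <title> children and skip
-- every other child (many' ignores elements the inner parser rejects).
parsePage :: MonadThrow m => ConduitT Event o m (Maybe Page)
parsePage = tagIgnoreAttrs "page" $ do
    titles <- many' (tagIgnoreAttrs "title" content)
    pure (Page (mconcat titles))

-- Enter the root element (whatever its name, by assumption) and yield
-- each Page downstream as soon as it has been parsed.
pages :: MonadThrow m => ConduitT Event Page m ()
pages = void $ tagIgnoreAttrs anyName (manyYield' parsePage)

-- Stream the input file to the output file, one shown Page per line,
-- without building the whole document in memory.
extractToFile :: FilePath -> FilePath -> IO ()
extractToFile input output = runConduitRes $
       parseFile def input
    .| pages
    .| mapC (T.pack . show)
    .| unlinesC
    .| encodeUtf8C
    .| sinkFile output

-- Same pipeline over an in-memory document, handy for small checks.
extractTitles :: MonadThrow m => L.ByteString -> m [Text]
extractTitles lbs = runConduit $
    parseLBS def lbs .| pages .| mapC title .| sinkList

main :: IO ()
main = extractToFile "input.xml" "output.txt"
```

Because each `Page` is handed to the sink as soon as its closing tag is seen, memory use should depend on the size of a single page rather than on the size of the file. If the real file uses a namespace or nests `<page>` more deeply, the name matchers and the `pages` function would need to be adapted.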