使用 python 增量解析大型维基百科转储 XML 文件

Question

目标是阅读维基百科 DUMP（70Gb 文件）中的所有……内容。这不可能加载到内存中，因此我尝试逐步解析文件并从中获取一些值。然而，我刚写的脚本没有打印任何东西，并立即占据了我所有的记忆。

代码如下：

from lxml import etree

def fast_iter(context, func, *args, **kwargs):

    for event, elem in context:
        func(elem, *args, **kwargs)

        elem.clear()

        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    #print(elem)
    print (elem.xpath( './revision/text/text( )' ))

context = etree.iterparse( 'enwiki-latest-pages-articles-multistream.xml', tag='page' )
fast_iter(context,process_element)

当此脚本应用于小型 xml 文件时，它会打印请求的 xpath 中的值。

然而，当应用于整个文件时，没有任何反应。

下面是来自维基百科转储的相同行

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.33.0-wmf.19</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2300" case="first-letter">Gadget</namespace>
      <namespace key="2301" case="first-letter">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>854851586</id>
      <parentid>834079434</parentid>
      <timestamp>2018-08-14T06:47:24Z</timestamp>
      <contributor>
        <username>Godsy</username>
        <id>23257138</id>
      </contributor>
      <comment>remove from category for seeking instructions on rcats</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
      <sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>885648527</id>
      <parentid>885645378</parentid>
      <timestamp>2019-03-01T11:16:23Z</timestamp>
      <contributor>
        <username>Jarnsax</username>
        <id>33627956</id>
      </contributor>
      <comment>improve citation metadata</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}
{{pp-move-indef}}
{{short description|Political philosophy that advocates self-governed societies}}
{{Use dmy dates|date=July 2018}}
{{use British English|date=January 2014}}
{{Anarchism sidebar}}
{{Basic forms of government}}
'''Anarchism''' is an [[anti-authoritarian]] [[political philosophy]]{{sfn|McLaughlin|2007|p=59}}{{sfn|Flint|2009|p=27}} that advocates [[Self-governance|self-governed]] societies based on voluntary, [[cooperative]] institutions and the rejection of coercive [[Hierarchy|hierarchies]] those societies view as unjust. These institutions are often described as [[Stateless society|stateless societies]],{{r|group=note|Note01}}{{sfn|Sheehan|2003|p=85}} although several authors have defined them more specifically as distinct institutions based on non-hierarchical or [[Free association (communism and anarchism)|free associations]].{{r|group=note|Note02}} Anarchism holds the [[State (polity)|state]] to be undesirable, unnecessary, and harmful.{{r|group=note|Note03}}&lt;ref name=definition /&gt; Any philosophy consistent with statelessness, that is, principled opposition to the State, is anarchist, thus anarchist schools of thought range from [[anarcho-communism]] to [[anarcho-capitalism]].{{sfn|Fiala|2018}}

While [[Anti-statism|opposition to the state]] is central,{{r|group=note|Note04}} many forms of anarchism specifically entail opposing authority or hierarchical organisation based on authority in the conduct of all human relations.{{r|group=note|Note05}} Anarchism is often considered a [[Far-left politics|far-left]] ideology,{{r|group=note|Note06}}{{sfn|Kahn|2000}}{{sfn|Moyihan|2007}} and much of [[anarchist economics]] and [[Anarchist law|anarchist legal philosophy]] reflect [[Libertarian socialism|anti-authoritarian interpretations]] of [[Anarcho-communism|communism]], [[Collectivist anarchism|collectivism]], [[Anarcho-syndicalism|syndicalism]], [[Mutualism (economic theory)|mutualism]], or [[participatory economics]].{{r|group=note|Note07}}

Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy.{{sfn|Marshall|2010|p=16}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.{{sfn|Sylvan|2007|p=262}} [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].{{sfn|McLean|McMillan|2003|loc= Anarchism}} Strains of anarchism have often been divided into the categories of [[Social anarchism|social]] and [[individualist anarchism]] or similar dual classifications.{{sfn|Ostergaard|p=14|loc=Anarchism}}{{sfn|Kropotkin|2002|p=5}}{{sfn|Fowler|1972}}
   </text>
   </revision>
   </page>
</mediawiki>

以前有人这样做过吗？知道如何有效地解析这个巨大的转储吗？有没有package/lib做过的？我不想重新发明轮子。

Answer 1

使用 SAX。请参阅下面的示例 (https://www.tutorialspoint.com/python3/python_xml_processing.htm)。

Simple API for XML (SAX) − Here, you register callbacks for events of interest and then let the parser proceed through the document. This is useful when your documents are large or you have memory limitations, it parses the file as it reads it from the disk and the entire file is never stored in the memory.

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.

进口xml.sax

class MovieHandler( xml.sax.ContentHandler ):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Call when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print ("*****Movie*****")
         title = attributes["title"]
         print ("Title:", title)

   # Call when an elements ends
   def endElement(self, tag):
      if self.CurrentData == "type":
         print ("Type:", self.type)
      elif self.CurrentData == "format":
         print ("Format:", self.format)
      elif self.CurrentData == "year":
         print ("Year:", self.year)
      elif self.CurrentData == "rating":
         print ("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print ("Stars:", self.stars)
      elif self.CurrentData == "description":
         print ("Description:", self.description)
      self.CurrentData = ""

   # Call when a character is read
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if ( __name__ == "__main__"):

   # create an XMLReader
   parser = xml.sax.make_parser()
   # turn off namepsaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # override the default ContextHandler
   Handler = MovieHandler()
   parser.setContentHandler( Handler )

   parser.parse("c:\temp\movies.xml")

movies.xml

<collection shelf = "New Arrivals">
<movie title = "Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title = "Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
   <movie title = "Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title = "Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Answer 2

Question: Parsing incrementally a large wikipedia dump XML file
When this (the Questions) script is applied in a small xml file, it prints the values from the requested xpath.
However when applied on the full file, nothing happens.

我想知道，您从 小文件 中得到了什么，因为您没有使用 namespace 参数。
Wikipedia xml 文件使用以下默认 namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"

此示例使用 lxml:

from lxml import etree

class Wikipedia:
    def __init__(self, fh, tag):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        :param tag: The tag to process
        """
        # Prepend the default Namespace {*} to get anything.
        self.context = etree.iterparse(fh, events=("end",), tag=['{*}' + tag])

    def _parse(self):
        """
        Parse the XML File for all '<tag>...</tag>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current 'Event, Element Tree'
        """
        for event, elem in self.context:
            yield event, elem

            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def __iter__(self):
        """
        Iterate all '<tag>...</tag>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
        """
        for event, elem in self._parse():
            entity = {}

            # Assign the 'elem.namespace' to the 'xpath'
            entity['revision'] = elem.xpath('./xmlns:revision/xmlns:text/text( )', 
                                   namespaces={'xmlns':etree.QName(elem).namespace})

            yield entity


if __name__ == "__main__":
    XML = b""""""<?xml version='1.0' encoding='UTF-8'?>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ 
http://www.mediawiki.org/xml/export-0.10.xsd"  
version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    ... (omitted for brevity)""""""

    #with open('.\FILE.XML', 'rb') as in_xml_
    with io.BytesIO(XML) as in_xml:
        for record in Wikipedia(in_xml, tag='page'):
            print("record:{}".format(record))

Output:

record:{'revision': ['#REDIRECT [[Computer accessi... (omitted for brevity)
record:{'revision': ["{{redirect2|Anarchist|Anarch... (omitted for brevity)

测试 Python：3.5 - lxml.etree：3.7.1

使用 python 增量解析大型维基百科转储 XML 文件

Parsing incrementally a large wikipedia dump XML file using python

python

xml

wikipedia

xml-namespaces

iterparse