Parsing a large .bz2 file (40 GB) with lxml iterparse in Python: error that does not appear with the uncompressed file

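I am trying to parse the OpenStreetMap planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress it completely.

So I figured out how to parse parts of the planet.osm file using bz2 and lxml, with the following code: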

from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:

        if elem.tag == "tag":
            # Keep <tag> children until their parent element ends
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning (inside the loop, so memory is freed per element)
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]

This works perfectly with Geofabrik extracts. However, when I try to parse planet-latest.osm.bz2 with the same script, I get the error:

xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60

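Here is what I tried:

I decompressed planet.osm.bz2 first, using a simple

bzcat planet.osm.bz2 > planet.osm

and ran the parser directly on planet.osm. And... it worked! I am very confused by this and could not find any pointers as to why this happens or how to solve it. My guess is that something goes wrong between decompression and parsing, but I am not sure. Please help me understand!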

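It turns out that the problem lies with the compressed planet.osm file.

According to the OSM Wiki, the planet file is compressed as a multistream file, and the Python bz2 module cannot read multistream files (this applies to Python versions before 3.3, where multi-stream support was added to the standard library). However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it, and it works great!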
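To make the term concrete, here is a minimal sketch of what "multistream" means at the byte level. It assumes Python 3.3 or later (for the `eof` attribute), and the XML fragments are made-up sample data, not real planet content. It shows that one decompressor object stops at the end of the first stream, which is presumably why a single-stream reader hands the XML parser incomplete input.

```python
import bz2

# Two independently compressed streams, simply concatenated.
# The XML fragments are made-up sample data.
first = bz2.compress(b"<osm><node id='1'/>")
second = bz2.compress(b"<node id='2'/></osm>")
multistream = first + second

# A single decompressor object handles exactly one stream.
single = bz2.BZ2Decompressor()
partial = single.decompress(multistream)

# The second stream is not decoded; its compressed bytes are
# left untouched in unused_data.
leftover = single.unused_data

# A fresh decompressor is needed to continue with the next stream.
rest = bz2.BZ2Decompressor().decompress(leftover)
```

Here `partial` holds only the first fragment, so a reader that stops after one stream never sees the closing `</osm>`.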

So the code should be:

from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:

        if elem.tag == "tag":
            # Keep <tag> children until their parent element ends
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning (inside the loop, so memory is freed per element)
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]

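Also, while doing some research on the PBF format (as suggested in the comments), I stumbled upon imposm.parser, a Python module that implements a generic parser for OSM data (in PBF or XML format). You may want to have a look at it!

As an alternative, you can use the output of the bzcat command (which can also handle multistream files):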

import subprocess

from lxml import etree as et

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end, call p.wait() and check that p.returncode == 0 so there were no errors
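On Python 3.3 and later, the standard library's `bz2.BZ2File` reads multi-stream input itself, so a third option may be to skip both bz2file and bzcat. The following is only a sketch under that assumption: `count_nodes` is a hypothetical helper, the two-stream sample is made-up data standing in for a planet file, and the fallback to `xml.etree.ElementTree` is there solely so the example runs without lxml installed.

```python
import bz2
import io

try:
    from lxml import etree as et  # the parser used in this thread
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as et


def count_nodes(fileobj):
    """Count <node> elements in a bz2-compressed OSM XML file object."""
    count = 0
    # Python 3.3+ BZ2File continues across stream boundaries.
    with bz2.BZ2File(fileobj) as xml_file:
        for _event, elem in et.iterparse(xml_file, events=("end",)):
            if elem.tag == "node":
                count += 1
            # Free the element's contents once it has been handled.
            elem.clear()
    return count


# Tiny two-stream stand-in for a planet file (made-up data).
sample = io.BytesIO(
    bz2.compress(b"<osm><node id='1'/>") + bz2.compress(b"<node id='2'/></osm>")
)
node_count = count_nodes(sample)
```

For a real file, an opened binary file object (or a path passed straight to `bz2.BZ2File`) would take the place of `sample`.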