How can I iterate over xpath subsets without a very slow for loop?

Question

I am trying to parse a local 14 MB html file.

My file looks like this (it is inconvenient because it is not nested in a useful way):

<html >
    <head>Title</head>
    <body>
        <p class="SECMAIN">
            <span class="ePub-B">\xc2\xa7 720 ILCS 5/10-8.1.</span>
        </p>
        <p class="INDENT-1">(a) text</p>
        <p class="INDENT-1">(b) text</p>
        <p class="INDENT-2">(1) text</p>
        <p class="INDENT-2">(2) text</p>
        <p class="SOURCE">(Source)</p>
        <p class="SECMAIN">
            <span class="ePub-B">\xc2\xa7 720 ILCS 5/10-9</span>
        </p>
        <p class="INDENT-1">(a) something</p>
        <p class="SOURCE">(Source)</p>
        <p class="SECMAIN">
            <span class="ePub-B">\xc2\xa7 720 ILCS 5/10-10.</span>
        </p>
        <p class="INDENT-1">(a) more text</p>
        <p class="SOURCE">(Source)</p>
    </body>
</html>

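Although my code runs instantly, as required, on a small sample of my html file (50 kb), it will not even get through one iteration of the loop on the whole file. I have tried on Mac and Windows computers with 4 and 8 GB of RAM respectively.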

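I gather from reading other posts that for loops over large xml files are very slow and un-Pythonic, but I am struggling to implement something like iterparse or a list comprehension.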

I tried using a list comprehension based on "Populating Python list using data obtained from lxml xpath command", and I'm not sure how to proceed with this interesting post either: "python xml iterating over elements takes a lot of memory".

This is the part of my code that cannot handle the whole file.

import lxml.html 
import cssselect 
import pandas as pd 

…

tree = lxml.html.fromstring(raw) 

laws = tree.cssselect('p.SECMAIN span.ePub-B') 

xpath_str = ''' 
    //p[@class="SECMAIN"][{i}]/
        following-sibling::p[contains(@class, "INDENT")]
            [count(.|//p[@class="SOURCE"][{i}]/
                        preceding-sibling::p[contains(@class, "INDENT")])
            = 
            count(//p[@class="SOURCE"][{i}]/
                        preceding-sibling::p[contains(@class, "INDENT")])
            ]
    '''

paragraphs_dict = {} 
paragraphs_dict['text'] = [] 
paragraphs_dict['n'] = [] 

# nested for loop:
for n in range(1, len(laws)+1): 
    law_paragraphs = tree.xpath(xpath_str.format(i = n)) # call xpath string
    for p in law_paragraphs: 
        paragraphs_dict['text'].append(p.text_content()) # store paragraph
        paragraphs_dict['n'].append(n)

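The output should give me a dictionary with arrays of equal length, so that I can tell which law ('n') each paragraph ('p') corresponds to. The goal is to capture all elements of class "INDENT" that lie between an element of class "SECMAIN" and the next element of class "SOURCE", and to record which SECMAIN they follow.

Thanks for your support.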

Answer 1

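Consider your XPath expression: for each SECMAIN number, you iterate over the SECMAIN elements up to that number, then iterate twice over the SOURCE elements to find the matching number, and then check all the preceding INDENT elements to get the nodes in between. Even with some optimizations, the finite state automaton will have a lot of work to do! It is probably worse than quadratic (see comments).

I would use a more direct approach with a sax parser, which reads the document once and keeps only a running SECMAIN counter.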

import xml.sax
import io

class MyContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.n = 0 # index of the current SECMAIN
        self.d = {'text': [], 'n': []}
        self.in_indent = False
        self.cur = [] # chunks of text of the current INDENT par

    def startElement(self, name, attributes):
        cls = attributes.get("class", "") # not every element has a class
        if name == "p" and cls == "SECMAIN":
            self.n += 1 # next SECMAIN
        if name == "p" and cls.startswith("INDENT"):
            self.in_indent = True # mark that we are in an INDENT par
            self.cur = [] # to store chunks of text

    def endElement(self, name):
        if name == "p" and self.in_indent:
            self.in_indent = False # mark that we leave an INDENT par
            self.d['text'].append("".join(self.cur)) # append the INDENT text
            self.d['n'].append(self.n) # and the number

    def characters(self, data):
        # https://docs.python.org/3/library/xml.sax.handler.html#xml.sax.handler.ContentHandler.characters
        # "SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks"
        if self.in_indent: # only if an INDENT par:
            self.cur.append(data) # store the chunks

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
handler = MyContentHandler()
parser.setContentHandler(handler)
parser.parse(io.StringIO(raw))

print(handler.d)
# {'text': ['(a) text', '(b) text', '(1) text', '(2) text', '(a) something', '(a) more text'], 'n': [1, 1, 1, 1, 2, 3]}

This should be much faster than the XPath version, since each element is visited only once.
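One caveat: xml.sax expects well-formed XML, and a real-world html file may not be. As a minimal sketch of the same single-pass idea for that case, the standard library's html.parser.HTMLParser is more forgiving. The class name IndentCollector, the helper _flush and the shortened sample string below are made up for illustration, and the sketch assumes that paragraphs are not nested inside one another.

```python
from html.parser import HTMLParser


class IndentCollector(HTMLParser):
    """Collect INDENT paragraph texts and the index of the SECMAIN they follow."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.n = 0                      # running count of SECMAIN paragraphs
        self.d = {'text': [], 'n': []}  # same output shape as the SAX handler
        self.cur = None                 # chunks of the open INDENT par, else None

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        self._flush()  # HTML allows an unclosed <p>; a new <p> ends the previous one
        cls = dict(attrs).get("class") or ""  # class may be missing or valueless
        if cls == "SECMAIN":
            self.n += 1
        elif cls.startswith("INDENT"):
            self.cur = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.cur is not None:  # only inside an INDENT par
            self.cur.append(data)

    def _flush(self):
        # Store the finished INDENT paragraph, if one is open.
        if self.cur is not None:
            self.d['text'].append("".join(self.cur))
            self.d['n'].append(self.n)
            self.cur = None


# Hypothetical, shortened sample in the shape of the question's file.
sample = """
<html><body>
<p class="SECMAIN"><span class="ePub-B">720 ILCS 5/10-8.1.</span></p>
<p class="INDENT-1">(a) text</p>
<p class="INDENT-2">(1) text</p>
<p class="SOURCE">(Source)</p>
<p class="SECMAIN"><span class="ePub-B">720 ILCS 5/10-9</span></p>
<p class="INDENT-1">(a) something</p>
<p class="SOURCE">(Source)</p>
</body></html>
"""

collector = IndentCollector()
collector.feed(sample)
collector.close()
print(collector.d)
# {'text': ['(a) text', '(1) text', '(a) something'], 'n': [1, 1, 2]}
```

For the 14 MB file, the text could be passed to feed() in chunks rather than all at once, since HTMLParser buffers incomplete input between calls.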
