Python lxml's XPath not finding <ul> in <p> tags

I have a question about Python lxml's XPath function. A minimal example is the following Python code:

from lxml import html, etree

text = """
      <p class="goal">
            <strong>Goal</strong> <br />
            <ul><li>test</li></ul>
        </p>
"""

tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print(etree.tostring(thesis_goal))

Running the code produces

<p class="goal">
            <strong>Goal</strong> <br/>
            </p>

As you can see, the entire <ul> block is missing. This also means the <ul> cannot be addressed with an XPath like //p[@class="goal"]/ul, because the <ul> does not count as a child of the <p>.

Is this a bug or a feature of lxml, and if it is the latter, how can I access the entire contents of the <p>? This snippet is embedded in a larger website, and there is no guarantee that there is even a <ul> tag inside (there might be another <p>, or anything else, for that matter).

Update: Updated the title after receiving the answer, so that people with the same problem can find this question more easily.

ul elements (or, more generally, flow content) are not allowed inside p elements (which may only contain phrasing content). lxml.html therefore parses text as

In [45]: print(html.tostring(tree))
<div><p class="goal">
            <strong>Goal</strong> <br>
            </p><ul><li>test</li></ul>

</div>

with the ul placed after the p element. So you can use the following XPath to find the ul element:

In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))
<ul><li>test</li></ul>
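
A narrower variant, sketched here as an assumption on top of the recovered tree shown above: since the <ul> is moved out right next to the <p>, following-sibling:: restricts the match to siblings of that <p> instead of every later <ul> in the document.

# A sketch, not from the original answer: select only the list that ends up
# directly after the matched <p> in the tree recovered by lxml.html.
from lxml import html

text = """
      <p class="goal">
            <strong>Goal</strong> <br />
            <ul><li>test</li></ul>
        </p>
"""

tree = html.fromstring(text)
ul = tree.xpath('//p[@class="goal"]/following-sibling::ul[1]')[0]
print(html.tostring(ul, encoding="unicode"))  # <ul><li>test</li></ul>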

@unutbu has the right answer. Your HTML is invalid, and the html parser will produce unexpected results. As the lxml documentation puts it,

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

Depending on what you are trying to achieve, you can fall back to the xml parser

# Changing html to etree here will produce behaviour you expect
tree = etree.fromstring(text)
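
For instance, re-running the question's snippet with the XML parser keeps the <ul> inside the <p>. This is only a sketch: etree.fromstring requires well-formed XML, which the example markup happens to be because the <br /> is self-closed; real-world HTML usually is not.

from lxml import etree

text = """
      <p class="goal">
            <strong>Goal</strong> <br />
            <ul><li>test</li></ul>
        </p>
"""

tree = etree.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]

# The serialized <p> now includes <ul><li>test</li></ul>,
# so //p[@class="goal"]/ul matches as well.
print(etree.tostring(thesis_goal, encoding="unicode"))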

or move to a more advanced website-parsing package such as BeautifulSoup4, for example:
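
A minimal sketch, assuming BeautifulSoup 4 is installed (pip install beautifulsoup4). Note that the resulting tree depends on the parser backend you pass: the lenient built-in "html.parser" backend keeps the <ul> nested inside the <p>, whereas the "lxml" and "html5lib" backends apply HTML parsing rules and move it out, much as browsers do.

# Sketch: parse the fragment with BeautifulSoup's built-in "html.parser" backend,
# which does not enforce the phrasing-content rule and keeps the <ul> inside <p>.
from bs4 import BeautifulSoup

text = """
      <p class="goal">
            <strong>Goal</strong> <br />
            <ul><li>test</li></ul>
        </p>
"""

soup = BeautifulSoup(text, "html.parser")
goal = soup.find("p", class_="goal")
print(goal)             # the whole <p>, including the nested <ul>
print(goal.find("ul"))  # <ul><li>test</li></ul>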