LXML XPath expression returns only the first child node, while the browser confirms multiple children

Question

I'm trying to parse a web page with the Python lxml library. In Firefox's Developer view, the page's tree clearly shows as:

However, when I run this query in Python:

>>> spellTree.xpath('//span[@id="ctl00_MainContent_DetailedOutput"]/child::node()')
[<Element h1 at 0x445a4b0>]

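It only sees the h1 element as a child of the span, not the other spans or any of the nodes after them, even though the tree clearly shows that they are children.

It does recognize that the other spans exist in the document: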

>>> spellTree.xpath('//span[@class="trait"]//child::node()')
[<Element a at 0x445a570>, 'Acid', <Element a at 0x445a5a0>, 'Attack', <Element a at 0x445a600>, 'Cantrip', <Element a at 0x445a5d0>, 'Evocation']

But it doesn't register them as children of the DetailedOutput span. Is my XPath wrong, or is this a bug or an anomaly?

Edit: Python 3.7.3, lxml 4.5.1.

Answer 1

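The HTML is possibly malformed.

It looks like //span[@class="trait"] is not a child of //span[@id="ctl00_MainContent_DetailedOutput"]; instead, the two appear to be siblings. That would explain why //span[@id="ctl00_MainContent_DetailedOutput"]//child::node() shows only 4 child nodes.

The likely cause: there seems to be a stray </span> inside the span[@id="ctl00_MainContent_DetailedOutput"] tag. This can make the HTML parser think that span[@id="ctl00_MainContent_DetailedOutput"] has already been closed, so the next span (//span[@class="trait"]) is treated as its sibling rather than its child. Browsers and lxml recover from malformed markup in different ways, which is why the tree shown in Firefox can differ from the tree lxml builds.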
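The effect can be reproduced on a small scale. The sketch below is not the asker's page: the markup, the `out` id, and the variable names are made up for illustration, and it assumes lxml recovers from the unmatched close tag by ignoring it. It places a stray `</span>` right after the first child and then checks where lxml puts the `trait` span.

```python
from lxml import html

# Hypothetical markup (not the real page): the first </span> is stray and
# closes the outer span early, so the last </span> has nothing left to close.
broken = (
    '<div>'
    '<span id="out">'
    '<b>Title</b>'
    '</span>'  # stray close tag
    '<span class="trait"><a href="#">Acid</a></span>'
    '</span>'  # intended close of the outer span, now unmatched
    '</div>'
)

tree = html.fromstring(broken)

# Direct children of the outer span, as lxml parsed it.
children = tree.xpath('//span[@id="out"]/child::node()')

# trait spans that ended up next to the outer span instead of inside it.
siblings = tree.xpath(
    '//span[@id="out"]/following-sibling::span[@class="trait"]'
)

print([child.tag for child in children])
print(len(siblings))
```

If the diagnosis applies, the first list should contain only the `b` element and the sibling count should be 1. Running the same `following-sibling` query against the real page is a quick way to check whether the trait spans were parsed as siblings there too.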

python

lxml

web-scraping