XPath queries (such as "//th/a") returning results not under current element

Question

I have the following script:

from lxml import etree

sample_html = '''
<body><div><table><tbody>
<tr>
  <th><a href="xxx">AAA</a></th>
  <td data-xxx="AAA-1234"></td>
  <td data-xxx="AAA-5678"></td>
</tr>
<tr>
  <th><a href="xxx">BBB</a></th>
  <td data-xxx="BBB-1234"></td>
  <td data-xxx="BBB-5678"></td>
</tr>
</tbody></table></div></body>
'''

def parse_tree(tree):
    print '============================> Parsing tree'
    rows = tree.xpath('//body/div/table/tbody/tr')
    for row in rows:
        As = row.xpath('//th/a')
        for a in As:
            print a.text
        tds = row.xpath('//td')
        for td in tds:
            print td.attrib['data-xxx']
    print


body = sample_html
tree = etree.HTML(body)
parse_tree(tree)

This gives me the output:

============================> Parsing tree
AAA
BBB
AAA-1234
AAA-5678
BBB-1234
BBB-5678
AAA
BBB
AAA-1234
AAA-5678
BBB-1234
BBB-5678

But I expected:

============================> Parsing tree
AAA
AAA-1234
AAA-5678
BBB
BBB-1234
BBB-5678

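That is, I expected that inside the for row in rows loop I would only have access to a single row. Instead, xpath seems to be working with the whole table somehow. I clearly don't understand what is going on here.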

Can someone shed light on how xpath handles the rows, and why it accesses the whole table inside the loop? How can I correct my script?

Answer 1

Your anchoring is wrong. Instead of:

for row in rows:
    As = row.xpath('//th/a')

...use a leading . to refer to the current element's position in the tree:

for row in rows:
    As = row.xpath('.//th/a')

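.// tells the query that it is relative to the current position in the tree, whereas a leading // explicitly runs a recursive search from the root.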
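For completeness, here is a minimal sketch of the whole script with that fix applied. It uses Python 3 print syntax rather than the Python 2 syntax in the question, and collect_rows is just a name made up for this example; it gathers the values per row so the grouping is easy to see.

```python
from lxml import etree

SAMPLE_HTML = '''
<body><div><table><tbody>
<tr>
  <th><a href="xxx">AAA</a></th>
  <td data-xxx="AAA-1234"></td>
  <td data-xxx="AAA-5678"></td>
</tr>
<tr>
  <th><a href="xxx">BBB</a></th>
  <td data-xxx="BBB-1234"></td>
  <td data-xxx="BBB-5678"></td>
</tr>
</tbody></table></div></body>
'''


def collect_rows(tree):
    """Return one list per <tr>: link texts first, then data-xxx values."""
    result = []
    for row in tree.xpath('//body/div/table/tbody/tr'):
        # The leading "." anchors each query at the current <tr>
        # instead of at the document root.
        values = [a.text for a in row.xpath('.//th/a')]
        values += [td.get('data-xxx') for td in row.xpath('.//td')]
        result.append(values)
    return result


tree = etree.HTML(SAMPLE_HTML)
for values in collect_rows(tree):
    print('\n'.join(values))
```

With the sample HTML from the question, each row should now contribute only its own link text and its own two data-xxx values.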

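By the way, why are your searches recursive? Since the th and td elements are direct children of each row, you could change the //s to /s (for example ./th/a and ./td) and gain significant efficiency.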
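A small sketch of the non-recursive variant, assuming the same kind of markup as in the question (the one-row table below is made up for illustration). Child steps such as th/a only look at direct children of the row, so nothing deeper in the tree is searched.

```python
from lxml import etree

# Hypothetical one-row table, shaped like the rows in the question.
html = (
    '<table><tbody><tr>'
    '<th><a href="xxx">AAA</a></th>'
    '<td data-xxx="AAA-1234"></td>'
    '</tr></tbody></table>'
)

root = etree.HTML(html)  # the HTML parser wraps the fragment in <html><body>
row = root.xpath('/html/body/table/tbody/tr')[0]

# Relative paths made of child steps only: no descendant search.
link_texts = [a.text for a in row.xpath('th/a')]
data_values = row.xpath('td/@data-xxx')

print(link_texts)
print(data_values)
```

Writing th/a is the same as ./th/a here; both start from the context node because the expression does not begin with a slash.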

Answer 2

See the Abbreviated Syntax section of the XPath spec, specifically:

//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node

.//para selects the para element descendants of the context node

In addition,

// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document

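Any XPath expression that starts with / begins at the document root node, so it cannot be limited to descendants of the context node. In fact, the context node is ignored, except to determine which document's root node to select.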

If you want to select descendants of the context node (the "current element", as you described it), start with .// instead.
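To see the difference concretely, here is a short sketch using lxml on a made-up document with para elements (the element names root and sec are only for illustration). It runs the abbreviated form, its expanded form, and the relative form from the same context node.

```python
from lxml import etree

# Hypothetical document: two sections, three paragraphs in total.
doc = etree.XML(
    '<root>'
    '<sec><para>a</para><para>b</para></sec>'
    '<sec><para>c</para></sec>'
    '</root>'
)
first_sec = doc[0]  # context node: the first <sec>

# Starts at the document root, so the context node does not narrow it.
absolute = [p.text for p in first_sec.xpath('//para')]

# The unabbreviated form of the same expression.
expanded = [
    p.text
    for p in first_sec.xpath('/descendant-or-self::node()/child::para')
]

# Starts at the context node, so only its own descendants match.
relative = [p.text for p in first_sec.xpath('.//para')]

print(absolute)  # ['a', 'b', 'c']
print(expanded)  # ['a', 'b', 'c']
print(relative)  # ['a', 'b']
```

The first two queries return every para in the document even though they are called on first_sec, which is the same effect the question ran into with //th/a and //td.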

python

xpath

lxml