Search for specific text in XML tree and extract text in next node

Question

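I'm trying to scrape the weight of smartwatches from www.currys.co.uk. The site does not use the same structure for every product, so to get each product's weight I tried searching by keyword with this XPath: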

//text()[contains(.,'Weight')]

This gives me the text "Weight", but what I want is the next node, the one that contains the actual weight value:

<tbody>
 <tr>
   <th scope = "row">Weight</th>
   <td> 26.7 g</td>
 </tr>
</tbody>

What I'm after is the text 26.7 g. I tried the following, but it doesn't seem to work:

//text()[contains(.,'Weight')]//td

Any suggestions? Thanks in advance.

Answer 1

You can use following-sibling::td:

from lxml import etree


txt = '''<tbody>
 <tr>
   <th scope = "row">Weight</th>
   <td> 26.7 g</td>
 </tr>
</tbody>'''

root = etree.fromstring(txt)

for td in root.xpath('//th[contains(., "Weight")]/following-sibling::td'):
    print(td.text)

Prints:

 26.7 g
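
As for why the attempt in the question returns nothing: a text node has no element children, so //td can never match anything below it. The sketch below (same sample markup, lxml assumed to be installed) shows that empty result, then one way to start from the matching text node by stepping up to its parent first. The last expression uses normalize-space() to trim the leading space; the variable names are only illustrative.

```python
from lxml import etree

txt = '''<tbody>
 <tr>
   <th scope="row">Weight</th>
   <td> 26.7 g</td>
 </tr>
</tbody>'''

root = etree.fromstring(txt)

# The expression from the question: text nodes have no descendants,
# so looking for <td> underneath one selects nothing.
broken = root.xpath('//text()[contains(., "Weight")]//td')

# Start from the matching text node, go up to its parent element,
# then sideways to the first following <td> sibling.
from_text = root.xpath(
    '//text()[contains(., "Weight")]/parent::*/following-sibling::td[1]'
)

# normalize-space() returns a string with surrounding whitespace removed.
weight = root.xpath(
    'normalize-space(//th[contains(., "Weight")]/following-sibling::td[1])'
)

print(broken)                         # []
print([td.text for td in from_text])  # [' 26.7 g']
print(weight)                         # 26.7 g
```

Adding [1] keeps only the nearest following td, which may matter if some product rows on the real site have more than one value cell.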

xml

xpath

contains

scrapy

web-scraping