XPath

Question

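I am using Scrapy with XPath. In one scenario, I need to get both the href and the text of anchor elements.

What I am doing:

Use a selector to get all anchors from the container
Loop over the anchors to find the href and the text. I can get the href but not the text.

Here is a snippet to make this clearer: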

anchors = response.selector.xpath("//table[@class='style1']//ul//li//a")
for anchor in anchors:
    link = anchor.xpath('@href').extract()[0]
    name = anchor.xpath('[how-to-access-current-node-here]').text()

How can I do this?

Thanks in advance!

Answer 1

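You can use XPath text(), provided you know where the header text is relative to the a element. For example, if the header text is inside the parent of a, extracting it just means going up one level, like this: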

anchors = response.selector.xpath("//table[@class='style1']//ul//li//a")
for anchor in anchors:
    link = anchor.xpath('@href').extract()[0]
    # go up one level (to the parent element) and read its text()
    name = anchor.xpath('../text()').extract()

Or, even better, you don't even need to do this inside a for loop; just call extract(), which returns a list:

anchors = response.selector.xpath("//table[@class='style1']//ul//li//a")

links = anchors.xpath('@href').extract()
names = anchors.xpath('../text()').extract()

paired_links_with_names = zip(links, names)
...
# you may do your thing here or still do a for / loop

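Of course, you need to inspect the elements and find out where the header text actually is; this is just how you reach that text from your existing XPath position.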

Hope this helps.

XPath - How to access anchor text and href from the current node in a loop

python

scrapy

web-scraping