XPath: why doesn't normalize-space remove the whitespace and \n?

Question

Given the following markup:

<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>

I use this XPath expression to scrape it:

response.xpath('.//a[@class="title"]//text()[normalize-space()]').extract()

I get the following result:

u'\n                  \n                  Low price ', u'computer', u' you should not miss'

Why doesn't normalize-space() remove the \n and the many spaces before "Low price" in this example?

Another question: how can I combine the three parts into a single scraped item, u'Low price computer you should not miss'?

Answer 1

Try this:

'normalize-space(.//a[@class="title"])'

Answer 2

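Your call to normalize-space() is inside a predicate. That means you are selecting the text nodes for which (the effective boolean value of) normalize-space() is true. You are not selecting the result of normalize-space(); for that you want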

.//a[@class="title"]//text()/normalize-space()

(this requires XPath 2.0).
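To make the filter-versus-apply distinction concrete, here is a rough sketch in plain Python. It does not run any XPath; `normalize_space` is a hand-written stand-in for the XPath function, and the last, whitespace-only entry in `text_nodes` is a hypothetical extra node added only to show what the predicate actually drops.

```python
# Text nodes as reported in the question, plus one hypothetical
# whitespace-only node to show the effect of the predicate.
text_nodes = [
    "\n                  \n                  Low price ",
    "computer",
    " you should not miss",
    "\n      ",  # hypothetical, not in the original output
]


def normalize_space(s):
    """Approximation of XPath normalize-space(): trim, then collapse whitespace runs."""
    return " ".join(s.split())


# text()[normalize-space()]: the predicate only FILTERS.
# A node is kept when its normalized value is non-empty; the node itself is unchanged.
filtered = [t for t in text_nodes if normalize_space(t)]

# text()/normalize-space() (XPath 2.0): the function is APPLIED to every node.
mapped = [normalize_space(t) for t in text_nodes]

print(filtered)
print(mapped)
```

The first list still carries the leading \n and spaces, which matches what the question observed; only the whitespace-only node disappears.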

For the second part of your question, just use

string(.//a[@class="title"])

(assuming scrapy-spider lets you use XPath expressions that return strings, rather than expressions that return nodes).
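As a rough illustration of what the string value of the element looks like, the sketch below emulates it with the standard library's ElementTree instead of running XPath in Scrapy. The indentation inside `snippet` is an assumption (the real page's whitespace is unknown), and the join/split lines are plain-Python approximations of string() and normalize-space(), not the XPath functions themselves.

```python
import xml.etree.ElementTree as ET

# The markup from the question; the exact indentation is assumed.
snippet = (
    '<a class="title" href="the link">\n'
    "    Low price\n"
    "    <strong>computer</strong>\n"
    "    you should not miss\n"
    "</a>"
)
a = ET.fromstring(snippet)

# Roughly string(a): all descendant text concatenated, whitespace untouched.
string_value = "".join(a.itertext())

# Roughly normalize-space(a): the same string, trimmed and with whitespace runs collapsed.
normalized = " ".join(string_value.split())

print(repr(string_value))
print(normalized)
```

Note that the string value on its own keeps the newlines and indentation, so wrapping the element in normalize-space(), as in Answer 1, is what gives the single clean string.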

Answer 3

I ran into the same problem. Try this:

[item.strip() for item in response.xpath('.//a[@class="title"]//text()').extract()]
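If you also want the single combined item from the second part of the question, one possible follow-up is to join the stripped pieces. This is only a sketch: `parts` is a hard-coded copy of the output shown in the question, standing in for the list that `.extract()` returns.

```python
# Stand-in for response.xpath('.//a[@class="title"]//text()').extract()
parts = [
    "\n                  \n                  Low price ",
    "computer",
    " you should not miss",
]

# Strip each piece, drop any that end up empty, then join with single spaces.
stripped = [p.strip() for p in parts]
title = " ".join(p for p in stripped if p)

print(title)
```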
