Can we use a regular expression in Scrapy's .xpath()?
title = data.xpath("//*[@id='jsheadline_989615']/span/text()").extract()
name = data.xpath("//*[@id='js_item_989615']/div[1]/div[2]/div[3]/strong[1]/text()").extract()
price = data.xpath("//*[@id='js_item_989615']/div[1]/div[2]/div[3]/strong[2]/text()").extract()
print(title, name, price)
In the code above, I want to match the id with a regular expression instead of hard-coding it. I tried:
title = data.xpath("//*[@id='([jsheadline_]+\d{5}[0-9])']/span/text()").extract()
This gives me no results. I am testing with XPath Helper 2.0 in Chrome.
Scrapy uses lxml as its XPath engine, so you can register additional function namespaces in lxml:
from lxml import etree

def register_xpath_namespaces():
    # Map each prefix to its EXSLT namespace URI
    fns = {
        'date': 'http://exslt.org/dates-and-times',
        'dyn': 'http://exslt.org/dynamic',
        'exsl': 'http://exslt.org/common',
        'func': 'http://exslt.org/functions',
        'math': 'http://exslt.org/math',
        'random': 'http://exslt.org/random',
        're': 'http://exslt.org/regular-expressions',  # FOR REGEXP
        'set': 'http://exslt.org/sets',
        'str': 'http://exslt.org/strings'
    }
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for k, v in fns.items():
        etree.FunctionNamespace(v).prefix = k

register_xpath_namespaces()
Then you can get the title with XPath:
title = data.xpath("//*[re:match(@id, '[0-9]+')]/span/text()").extract()
Note: please test this yourself.
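As a small way to test the idea outside Scrapy, here is a sketch that uses lxml directly and passes the EXSLT regular-expressions namespace per call through the namespaces argument, rather than registering it globally. The HTML snippet, the headline_titles helper, and the pattern '^jsheadline_\d{6}$' are assumptions made up for illustration; the real page structure may differ.

```python
from lxml import etree

# Hypothetical markup standing in for the page in the question.
HTML = """
<div>
  <div id="jsheadline_989615"><span>First headline</span></div>
  <div id="jsheadline_123456"><span>Second headline</span></div>
  <div id="sidebar"><span>Not a headline</span></div>
</div>
"""

# EXSLT regular-expressions namespace, bound to the "re" prefix below.
EXSLT_RE = "http://exslt.org/regular-expressions"


def headline_titles(html):
    """Return the span text of every element whose id looks like jsheadline_NNNNNN."""
    root = etree.HTML(html)
    # XPath string literals have no escape sequences, so \d reaches the
    # regex engine unchanged; the raw Python string keeps the backslash.
    return root.xpath(
        r"//*[re:test(@id, '^jsheadline_\d{6}$')]/span/text()",
        namespaces={"re": EXSLT_RE},
    )


titles = headline_titles(HTML)
print(titles)
```

Binding the prefix per call keeps the change local to one query, which may be preferable to modifying lxml's global function namespaces.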
Scrapy has built-in support for regular expressions in XPath expressions:
data.xpath("//*[re:test(@id, '[0-9]+')]/span/text()").extract()