Scrapy recursive crawl up to a user-defined page
This is probably easy for experienced users, but I am new to Scrapy. What I want is a spider that crawls down to a user-defined page. Right now I am trying to modify the allow pattern in __init__, but it does not seem to work. A summary of my current code:
class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/alpha"]
    pattern = "/[\d]+$"
    rules = [
        Rule(LinkExtractor(allow=[pattern], restrict_xpaths=('//*[@id="imgholder"]/a',)),
             callback='parse_items', follow=True),
    ]
    def __init__(self, argument='', *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # some inputs and operations based on those inputs
        i = str(raw_input())  # another input
        # need to change the pattern here
        self.pattern = '/' + i + self.pattern
        # some other operations
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        img = hxs.select('//*[@id="imgholder"]/a')
        item = MyItem()
        item["field1"] = "something"
        item["field2"] = "something else"
        yield item
Now suppose the user enters i=2, so I want the spider to follow URLs ending in /2/*some number*, but what actually happens is that it crawls anything matching the original pattern /*some number*. The update does not seem to propagate. I am using Scrapy version 1.0.1. Is there any way around this? Thanks in advance.
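To make the intended matching concrete, the combined pattern can be checked with plain re — a sketch assuming the user entered i = "2", with URLs made up for illustration:

```python
import re

# The class-level pattern matches any URL ending in /<number>.
base_pattern = r"/[\d]+$"

# After the user enters i = "2", the intended pattern is /2/<number>.
i = "2"
combined = "/" + i + base_pattern  # "/2/[\d]+$"

# The base pattern matches any trailing number...
assert re.search(base_pattern, "http://www.example.com/alpha/7")
# ...while the combined pattern only matches numbers under /2/:
assert re.search(combined, "http://www.example.com/alpha/2/7")
assert not re.search(combined, "http://www.example.com/alpha/3/7")
```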
By the time your __init__ method runs, the Rule has already been built with the pattern defined at the top of the class. You can still change it dynamically in __init__, though: set the Rule again inside the method body and compile it, like this:
def __init__(self, argument='', *a, **kw):
    super(MySpider, self).__init__(*a, **kw)
    # set your pattern here to what you need it to be
    self.pattern = '/' + str(raw_input()) + MySpider.pattern
    # rebuild the rules with the new pattern
    MySpider.rules = [
        Rule(LinkExtractor(allow=[self.pattern], restrict_xpaths=('//*[@id="imgholder"]/a',)),
             callback='parse_items', follow=True),
    ]
    # now it is time to compile the new rules:
    super(MySpider, self)._compile_rules()
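The underlying issue is that the class-level rules list captures the pattern's value at class-definition time, so reassigning self.pattern afterwards does not touch the already-constructed extractor. A minimal sketch, not using Scrapy at all (Extractor and Spider here are stand-ins for LinkExtractor and the spider class), illustrating why the rules must be rebuilt:

```python
class Extractor(object):
    """Stand-in for LinkExtractor: stores the allow list it was given."""
    def __init__(self, allow):
        self.allow = list(allow)

class Spider(object):
    pattern = r"/[\d]+$"
    # Built once, at class-definition time, with the value above.
    rules = [Extractor(allow=[pattern])]

    def __init__(self, i):
        self.pattern = "/" + i + Spider.pattern  # new pattern...
        # ...but the existing rule still holds the old value:
        assert Spider.rules[0].allow == [r"/[\d]+$"]
        # Rebuilding the rules (as in the answer above) is what
        # actually applies the new pattern:
        Spider.rules = [Extractor(allow=[self.pattern])]

s = Spider("2")
```

In the real spider, the rebuild must be followed by _compile_rules(), because CrawlSpider copies rules into an internal compiled form during initialization.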