Dynamic spider generation with Scrapy subclass init error
I'm trying to write a generic "master" spider into which I dynamically insert "start_urls" and "allowed_domains" during execution. (Eventually, I'll store these in a database, pull them from it, and use them to initialize and crawl a new spider for each database entry.)
At the moment I have two files:
- MySpider.py -- establishes my "master" spider class.
- RunSpider.py -- a proof of concept for initializing the dynamically generated spider.
To write these two files I referred to the following:
- Passing Arguments into spiders at Scrapy.org
- Running Scrapy from a script at Scrapy.org
- General Spider structure within Python at Scrapy.org
- These two questions on Stack Overflow were the best help I could find: Creating a generic scrapy spider; Scrapy start_urls
I considered scrapyD, but I don't think it's what I'm looking for...
Here is what I wrote:
MySpider.py--
import scrapy

class BlackSpider(scrapy.Spider):
    name = 'Black1'

    def __init__(self, allowed_domains=[], start_urls=[], *args, **kwargs):
        super(BlackSpider, self).__init__(*args, **kwargs)
        self.start_urls = start_urls
        self.allowed_domains = allowed_domains
        # For testing:
        print start_urls
        print self.start_urls
        print allowed_domains
        print self.allowed_domains

    def parse(self, response):
        #############################
        # Insert my parse code here #
        #############################
        return items
RunSpider.py--
import scrapy
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider
#Set my allowed domain (this will come from DB later)
ad = ["example.com"]
#Set my start url
sd = ["http://example.com/files/subfile/dir1"]
#Initialize MySpider with the above allowed domain and start url
MySpider = BlackSpider(ad,sd)
#Crawl MySpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
The problem:
Here is my issue: when I execute this, it appears to pass my arguments for allowed_domains and start_urls successfully; however, once MySpider is initialized and the spider crawl runs, the specified urls/domains are no longer found and no website is crawled. I added the print statements above to show this:
me@mybox:~/$ python RunSpider.py
['http://example.com/files/subfile/dir1']
['http://example.com/files/subfile/dir1']
['example.com']
['example.com']
2016-02-26 16:11:41 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
...
2016-02-26 16:11:41 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
...
[]
[]
[]
[]
2016-02-26 16:11:41 [scrapy] INFO: Spider opened
...
2016-02-26 16:11:41 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
2016-02-26 16:11:41 [scrapy] INFO: Closing spider (finished)
...
2016-02-26 16:11:41 [scrapy] INFO: Spider closed (finished)
Why does my spider initialize correctly, but when I try to execute the crawl the urls are missing? Is this a basic Python programming (class?) error that I'm just missing?
Please refer to the documentation on CrawlerProcess: CrawlerProcess.crawl() expects either a Crawler or a scrapy.Spider subclass, not a Spider instance; the spider arguments are passed as additional arguments to .crawl().
So you need to do something like this:
import scrapy
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

# Set my allowed domain (this will come from DB later)
ad = ["example.com"]
# Set my start url
sd = ["http://example.com/files/subfile/dir1"]

# Crawl BlackSpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
# pass the Spider class, and other params as keyword arguments
process.crawl(BlackSpider, allowed_domains=ad, start_urls=sd)
process.start()
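If you later want one crawl per database entry, as described in the question, the same pattern extends naturally: queue one crawl() per entry and call start() once. Below is a minimal sketch, assuming a hypothetical fetch_targets() helper standing in for the database query (the helper and its sample data are illustrative, not part of the original code):

from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

def fetch_targets():
    # Hypothetical stand-in for the database query; replace with real DB access.
    return [
        (["example.com"], ["http://example.com/files/subfile/dir1"]),
        (["example.org"], ["http://example.org/other/start/page"]),
    ]

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Queue one crawl of the same Spider class per database entry,
# passing the per-entry values as keyword arguments.
for domains, urls in fetch_targets():
    process.crawl(BlackSpider, allowed_domains=domains, start_urls=urls)

# Start the reactor once; all queued crawls run in the same process.
process.start()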
You can see this in the scrapy commands themselves, for example scrapy runspider:
def run(self, args, opts):
    ...
    spidercls = spclasses.pop()
    self.crawler_process.crawl(spidercls, **opts.spargs)
    self.crawler_process.start()
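Here opts.spargs holds the spider arguments collected from the command line (the -a name=value options), so the command-line tools rely on the same mechanism: the spider class plus keyword arguments are handed to crawl(), and Scrapy instantiates the spider for you.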