Bottleneck in Scrapy middleware: MySQL SELECT
I tested to find out what the bottleneck is. It comes from the SELECT query in this middleware:
import pymysql
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDB(object):

    def process_request(self, request, spider):
        # url_list is just a Python list with some URLs in it.
        if request.url not in url_list:
            self.crawled_urls = dict()
            # A new connection is opened for every single request.
            connection = pymysql.connect(host='123',
                                         user='123',
                                         password='1234',
                                         db='123',
                                         charset='utf8',
                                         cursorclass=pymysql.cursors.DictCursor)
            try:
                with connection.cursor() as cursor:
                    # Read a single record
                    sql = "SELECT `url` FROM `url` WHERE `url`=%s"
                    cursor.execute(sql, (request.url,))
                    self.crawled_urls = cursor.fetchone()
                connection.commit()
            finally:
                connection.close()

            if self.crawled_urls is None:
                return None
            else:
                if request.url == self.crawled_urls['url']:
                    raise IgnoreRequest()
                else:
                    return None
        else:
            return None
If I disable DOWNLOADER_MIDDLEWARES in settings.py, Scrapy crawls at a decent speed.
Before disabling:
[scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)
After disabling:
[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)
I guess the SELECT query is the problem, so I want to run the SELECT query only once, fetch the URL data, and put it into the Request finger_prints.
I am using CrawlerProcess: the more spiders I run, the fewer pages/min I get.
For example:
- 1 spider => 50 pages/min
- 2 spiders => 30 pages/min in total
- 6 spiders => 10 pages/min in total
What I want to do is:
- fetch the URL data from MySQL
- put the URL data into the request finger_prints
How can I do this?
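Roughly, what I have in mind is something like the sketch below: read every crawled URL from the url table once at startup and then check membership in memory (the load_crawled_urls helper is just an illustrative name, not code I already have):

import pymysql
from scrapy.exceptions import IgnoreRequest


def load_crawled_urls():
    # Hypothetical helper: fetch all crawled URLs from the `url` table in one query.
    connection = pymysql.connect(host='123',
                                 user='123',
                                 password='1234',
                                 db='123',
                                 charset='utf8')
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT `url` FROM `url`")
            # Default cursor returns plain tuples; row[0] is the url column.
            return {row[0] for row in cursor.fetchall()}
    finally:
        connection.close()


class CheckDuplicatesFromDB(object):

    def __init__(self):
        # One query at startup instead of one query per request.
        self.crawled_urls = load_crawled_urls()

    def process_request(self, request, spider):
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        return None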
One major problem is that you open a new connection to the SQL database on every response, i.e. on every call to process_request. Instead, open the connection once and keep it open.
While this will speed things up a lot, I suspect other bottlenecks will surface once this one is fixed.
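For example, a minimal sketch of that idea, opening the connection when the spider starts and closing it when it finishes (the connection parameters are the placeholders from your snippet):

import pymysql
from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDB(object):

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_opened(self, spider):
        # Open the connection once, when the spider starts.
        # autocommit=True so each SELECT sees fresh data without explicit commit().
        self.connection = pymysql.connect(host='123',
                                          user='123',
                                          password='1234',
                                          db='123',
                                          charset='utf8',
                                          autocommit=True,
                                          cursorclass=pymysql.cursors.DictCursor)

    def spider_closed(self, spider):
        # Close the connection once, when the spider finishes.
        self.connection.close()

    def process_request(self, request, spider):
        with self.connection.cursor() as cursor:
            sql = "SELECT `url` FROM `url` WHERE `url`=%s"
            cursor.execute(sql, (request.url,))
            row = cursor.fetchone()
        if row is not None:
            raise IgnoreRequest()
        return None

Keep in mind that a long-lived connection can still be dropped by MySQL's wait_timeout, and every query still blocks Scrapy's (single-threaded) reactor, so pre-loading the URLs into memory as you suggested would likely be faster still.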