Scrapy performance improvements and memory consumption
Server
- 6 GB RAM
- 4-core Intel Xeon 2.60GHz
- 32 CONCURRENT_REQUESTS
- 1 million URLs in a CSV file
- 700 Mbit/s downstream
- 96% memory consumption
With debug mode enabled, the crawl stops after roughly 400,000 URLs, most likely because the server runs out of memory.
Without debug mode it takes up to 5 days, which seems quite slow to me, and it
eats a lot of memory (96%).
Any tips are very welcome :)
import scrapy
import csv


def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        scrapurls = []
        for row in data:
            scrapurls.append("http://" + row[2])
    return scrapurls


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl'] = response.url
            item['rssurl'] = sel.extract()
            yield item
import csv
from collections import namedtuple
import scrapy


def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        for row in data:
            yield row[2]


# if you can use something other than scrapy
rssitem = namedtuple('rssitem', 'sourceurl rssurl')


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):  # remember that this returns a generator
        for start_url in get_urls_from_csv():
            yield scrapy.http.Request(url="http://{}".format(start_url))

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            yield rssitem(response.url, sel.extract())
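As a side note on the "if you can use something other than scrapy" comment above: a namedtuple carries no per-instance __dict__, so each item is noticeably cheaper than a dict-backed container when you are yielding hundreds of thousands of them. A minimal, stand-alone sketch (the byte counts are CPython- and platform-dependent and only indicative; they also exclude the referenced strings themselves):

import sys
from collections import namedtuple

RssItem = namedtuple('RssItem', 'sourceurl rssurl')

as_tuple = RssItem('http://example.com', 'http://example.com/feed')
as_dict = {'sourceurl': 'http://example.com', 'rssurl': 'http://example.com/feed'}

# namedtuple instances use a plain tuple layout, so sys.getsizeof reports
# a much smaller footprint than for an equivalent dict
print(sys.getsizeof(as_tuple))  # tens of bytes on 64-bit CPython
print(sys.getsizeof(as_dict))   # a few hundred bytes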
As I commented, you should use generators to avoid creating lists of objects in memory (see what-does-the-yield-keyword-do-in-python). Generator objects are created lazily, so you never build a huge list of objects in memory all at once:
def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        for row in data:
            yield "http://" + row[2]  # yield each url lazily


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        # return a generator expression
        return (scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv())

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl'] = response.url
            item['rssurl'] = sel.extract()
            yield item
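Since the crawl apparently dies when the server runs out of memory, it can also help to let Scrapy's built-in memusage extension warn you, or close the spider cleanly, before the box starts swapping (the extension needs a POSIX platform with the resource module). A sketch for settings.py; the limits are only assumptions for a 6 GB machine:

# settings.py -- memory watchdog (limits are assumptions for a 6 GB box)
MEMUSAGE_ENABLED = True       # turn on the memusage extension
MEMUSAGE_WARNING_MB = 3072    # log a warning at ~3 GB
MEMUSAGE_LIMIT_MB = 4096      # close the spider gracefully at ~4 GB

With this in place the crawl stops with a recorded close reason instead of stalling silently, which also makes it easier to confirm whether memory really is the bottleneck.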
As far as performance goes, what the documentation on Broad Crawls suggests is to try to increase concurrency:
Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit. The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bounded. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.
To increase the global concurrency use:
CONCURRENT_REQUESTS = 100
Emphasis mine.
There is also Increase Twisted IO thread pool maximum size:
Currently Scrapy does DNS resolution in a blocking way with usage of a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will be processed faster, speeding up establishing connections and the crawl overall.
To increase the maximum thread pool size use:
REACTOR_THREADPOOL_MAXSIZE = 20
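Putting the broad-crawl advice together, a settings.py sketch along the lines of what the documentation recommends might look like this (the concrete numbers are starting points to tune against your 4 cores and 6 GB of RAM, not measured values):

# settings.py -- broad crawl tuning (values are starting points, tune per the docs)
CONCURRENT_REQUESTS = 100          # raise the global limit until CPU sits at ~80-90%
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for blocking DNS lookups
LOG_LEVEL = 'INFO'                 # DEBUG logging costs CPU and disk on a 1M-URL crawl
COOKIES_ENABLED = False            # broad crawls rarely need cookies
RETRY_ENABLED = False              # don't re-queue failed requests
DOWNLOAD_TIMEOUT = 15              # give up on slow hosts sooner

Dropping the log level from DEBUG also speaks to the "debug mode" observation in the question, since DEBUG-level logging adds noticeable overhead on a crawl of this size.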