Scrapy rows output is in the wrong order
How can I manipulate the order of the ROWS - or simply output them in the order they appear on the website?
(I am unable to get the results in the .csv file in the same order as on the website.)
I have already managed to arrange the COLUMNS with FEED_EXPORT_FIELDS (via settings.py):
FEED_EXPORT_FIELDS = ["brandname", "devicecount", "phonename"]
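For reference, that setting and the export command look roughly like this (the output filename output.csv is just an example):

# settings.py -- this controls only the COLUMN order of the exported feed
FEED_EXPORT_FIELDS = ["brandname", "devicecount", "phonename"]

and the CSV is produced with:

scrapy crawl gsm -o output.csv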
However, every attempt to sort the ROWS has been unsuccessful.
Here is the code:
import scrapy
from gsm.items import GsmItem


class GsmSpider(scrapy.Spider):
    name = 'gsm'
    allowed_domains = ['gsmarena.com']
    start_urls = ['https://gsmarena.com/makers.php3']

    # LEVEL 1 | all brands
    def parse(self, response):
        item = GsmItem()
        gsms = response.xpath('//div[@class="st-text"]/table//td')
        for gsm in gsms:
            allbranddevicesurl = gsm.xpath('.//a/@href').get()
            brandname = gsm.xpath('.//a/text()').get()
            devicecount = gsm.xpath('.//span/text()').get()
            item['brandname'] = brandname
            item['devicecount'] = devicecount
            yield response.follow(allbranddevicesurl,
                                  callback=self.parse_allbranddevicesurl,
                                  meta={'brandname': item,
                                        'devicecount': item
                                        })

    # LEVEL 2 | all devices
    def parse_allbranddevicesurl(self, response):
        # both meta keys carry the same item instance
        item = response.meta['brandname']
        item = response.meta['devicecount']
        phones = response.xpath('//*[@id="review-body"]//li')
        for phone in phones:
            detailpageurl = phone.xpath('.//a/@href').get()
            yield response.follow(detailpageurl,
                                  callback=self.parse_detailpage,
                                  meta={'brandname': item,
                                        'devicecount': item
                                        })

        next_page = response.xpath('//a[@class="pages-next"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page,
                                  callback=self.parse_allbranddevicesurl,
                                  meta={'brandname': item,
                                        'devicecount': item
                                        })

    # LEVEL 3 | detailpage
    def parse_detailpage(self, response):
        item = response.meta['brandname']
        item = response.meta['devicecount']
        details = response.xpath('//div[@class="article-info"]')
        for detail in details:
            phonename = detail.xpath('.//h1/text()').get()
            item['phonename'] = phonename
            yield item
Any suggestions on how to solve this would be greatly appreciated.
The workaround was to introduce the following custom settings in settings.py:
DEPTH_PRIORITY = 1
CONCURRENT_REQUESTS = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
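These settings can equivalently be placed on the spider itself via custom_settings, which keeps the workaround scoped to this one spider; a minimal sketch based on the spider above:

import scrapy


class GsmSpider(scrapy.Spider):
    name = 'gsm'
    allowed_domains = ['gsmarena.com']
    start_urls = ['https://gsmarena.com/makers.php3']

    # spider-level settings override settings.py for this spider only:
    # FIFO queues, breadth-first priority and a single concurrent request
    # keep items roughly in the order the pages are discovered
    custom_settings = {
        'DEPTH_PRIORITY': 1,
        'CONCURRENT_REQUESTS': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    # ... parse methods unchanged ...

Note that CONCURRENT_REQUESTS = 1 makes the crawl effectively sequential, so this trades crawl speed for output order.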
The workaround is also mentioned here:
scrapy-spider-output-in-chronological-order
and here:
does-scrapy-crawl-in-breadth-first-or-depth-first-order