Scrapy Not Crawling in DFS Order
Scrapy seems to be crawling pages in BFS order, even though the documentation says the default order should be DFS.

Here is my spider:
import scrapy
from scrapy.http import FormRequest, Request

class DfsSpider(scrapy.Spider):
    name = 'dfs'
    allowed_domains = ['craigslist.org']
    start_urls = ['http://seattle.craigslist.org']

    def parse(self, response):
        print "URL FROM PARSE: ", response.url
        xpath = "//div[contains(@class, 'community')]/div/div/ul/li/a/@href"
        for link in response.xpath(xpath):
            url = response.urljoin(link.extract())
            yield Request(url, callback=self.parse_data)

    def parse_data(self, response):
        print "URL FROM PARSE_DATA: ", response.url
        xpath = "//div[contains(@class, 'content')]/p/span/span/a/@href"
        for link in response.xpath(xpath):
            url = response.urljoin(link.extract())
            yield Request(url, callback=self.parse_data_again)

    def parse_data_again(self, response):
        print "URL FROM PARSE_DATA_AGAIN: ", response.url
The output is a single "URL FROM PARSE: www.seattle.craigslist.org",
followed by a batch of "URL FROM PARSE_DATA: www.seattle.craigslist.org/search/..." lines,
and only then do I start seeing the print statements from the parse_data_again() method.
If Scrapy were searching in DFS order, I should see:

"URL FROM PARSE: ..."
"URL FROM PARSE_DATA: ..."
"URL FROM PARSE_DATA_AGAIN: ..."
"URL FROM PARSE_DATA_AGAIN: ..."
...
"URL FROM PARSE_DATA_AGAIN: ..."
"URL FROM PARSE_DATA: ..."
"URL FROM PARSE_DATA_AGAIN: ..."
...
And so on. Now, I suspect Scrapy uses some kind of threading, which may be why requests are issued and responses received in a jumbled order. But multiple threads exploring different parts of the search tree is not DFS.

If that is the case, can I configure Scrapy to handle only one request at a time?

Or maybe I am confused about something else. Any help is appreciated.
I think the settings documented at http://doc.scrapy.org/en/latest/topics/settings.html help:

CONCURRENT_REQUESTS = 1

With a single concurrent request, Scrapy's scheduler queue (LIFO by default) pops the most recently scheduled request first, which gives depth-first order. DEPTH_PRIORITY can stay at its default of 0: a positive value such as 1 de-prioritizes deeper requests, which is the setting used for breadth-first crawling, not DFS.
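To spell that out, here is a sketch of a settings.py fragment. The setting names come from Scrapy's settings reference, and the queue-class module paths in the comment are the ones Scrapy's FAQ gives for breadth-first crawling in recent versions; treat the exact paths as an assumption if you are on an old release.

```python
# settings.py -- a sketch, not the poster's actual project settings.

# Handle one request at a time, so responses come back in the order
# the requests were scheduled.
CONCURRENT_REQUESTS = 1

# Scrapy's scheduler queues are LIFO by default, which already yields
# depth-first order, so DEPTH_PRIORITY can stay at its default of 0.
# (Per the Scrapy FAQ, breadth-first order is instead the combination of
# DEPTH_PRIORITY = 1 with FIFO queues:
#   SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
#   SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
# )
DEPTH_PRIORITY = 0
```

Note that even with these settings the first level can still appear "batched": parse() yields all of its requests before any of them is downloaded, so strict DFS ordering only emerges from the LIFO queue once those requests are scheduled.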