Scrapy With Splash Only Scrapes 1 Page
I am trying to scrape multiple URLs, but for some reason only the results for one site show up. In every case it is the last URL in start_urls that is displayed.
I believe I have narrowed the problem down to my parse function.
Any ideas on what I'm doing wrong?
Thanks!
import scrapy
from scrapy_splash import SplashRequest


class HeatSpider(scrapy.Spider):
    name = "heat"
    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 8},
                                )

    def parse(self, response):
        for metric in response.css('.matrix-data'):
            yield {
                'City': response.css('title::text').extract_first(),
                'Metric Data Title': metric.css('.title::text').extract_first(),
                'Metric Data Price': metric.css('.price::text').extract_first(),
            }
EDIT:
I have modified my code to help debug. After running this code, my csv looks like this: csv results
There is a row for each url, as there should be, but only one row is filled out with information.
class HeatSpider(scrapy.Spider):
    name = "heat"
    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 8},
                                )

    def parse(self, response):
        yield {
            'City': response.css('title::text').extract_first(),
            'Metric Data Title': response.css('.matrix-data .title::text').extract(),
            'Metric Data Price': response.css('.matrix-data .price::text').extract(),
            'url': response.url,
        }
EDIT 2:
Here is the full output: http://pastebin.com/cLM3T05P
On line 46 of that output you can see the empty cells.
From the docs:
start_requests()
This method must return an iterable with the first Requests to crawl for this spider.
This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it’s safe to implement it as a generator.
You can either specify the urls inside start_requests(), or override make_requests_from_url(url) to make the requests from start_urls.
Example 1
start_urls = []

def start_requests(self):
    urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]
    for url in urls:
        yield SplashRequest(url, self.parse,
                            endpoint='render.html',
                            args={'wait': 8},
                            # keep the dupefilter from dropping URLs that
                            # differ only in the fragment after '#'
                            dont_filter=True,
                            )
Example 2
start_urls = [
    'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
]

def make_requests_from_url(self, url):
    # return (not yield) a single Request: Scrapy's default start_requests()
    # yields whatever this method returns, one call per start_url
    return SplashRequest(url, self.parse,
                         endpoint='render.html',
                         args={'wait': 8},
                         dont_filter=True,
                         )
Are you sure scrapy-splash is configured properly?
Scrapy's default dupefilter doesn't take URL fragments into account (i.e. the part of a URL after #), because that part is not sent to the server as part of the HTTP request. But the fragment matters if you render a page in a browser.
scrapy-splash provides a custom dupefilter that does take the fragment into account; to enable it, set DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'. If you don't use this dupefilter, both requests will have the same fingerprint (they are identical once the fragment is removed), so the second request is filtered out.
Try checking that all the other settings are correct as well (see https://github.com/scrapy-plugins/scrapy-splash#configuration).
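For reference, here is a minimal settings.py sketch of the configuration that README describes; the SPLASH_URL value is an assumption (point it at wherever your Splash instance actually listens):

# settings.py -- scrapy-splash configuration, per the README linked above
SPLASH_URL = 'http://localhost:8050'  # assumed local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# the fragment-aware dupefilter discussed above
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'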
What worked for me was adding a delay between requests:
The amount of time (in secs) that the downloader should wait before
downloading consecutive pages from the same website. This can be used
to throttle the crawling speed to avoid hitting servers too hard.
DOWNLOAD_DELAY = 5
Tested it on 4 urls and got the results for all of them:
start_urls = [
    'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=washington&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=philadelphia&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
]
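If you would rather scope the delay to this one spider instead of the whole project, a small sketch using Scrapy's per-spider custom_settings (the class body below is illustrative; the rest of the spider stays as above):

class HeatSpider(scrapy.Spider):
    name = "heat"

    # per-spider override, equivalent to DOWNLOAD_DELAY = 5 in settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
    }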