How can I download json Response with Scrapy?
I am trying to download a page from the Newegg mobile API with Scrapy.
I wrote the spider below, but it doesn't work as expected: with a regular page URL the script writes the response to a file, but with the Newegg mobile API URL no response is ever written.
#spiders/newegg.py
from scrapy import Spider, Request

class NeweggSpider(Spider):
    name = 'newegg'
    allowed_domains = ['newegg.com']
    #http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails
    start_urls = ["http://www.newegg.com/Product/Product.aspx?Item=N82E16883282695"]
    meta_page = 'newegg_spider_page'
    meta_url_tpl = 'newegg_url_template'

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse_details)

    def parse_details(self, response):
        with open('log.txt', 'w') as f:
            f.write(response.body)
I cannot save the response from this URL. I want to download the JSON from http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails

I am setting USER_AGENT in scrapy.cfg:
[settings]
default = neweggs.settings
[deploy]
url = http://localhost:6800/
project = neweggs
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
Scrapy stats:
2015-10-28 14:46:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 777,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 1430,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 28, 12, 46, 38, 776000),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 10, 28, 12, 46, 36, 208000)}
2015-10-28 14:46:38 [scrapy] INFO: Spider closed (finished)
Since you are making the requests manually in start_requests, you need to pass the User-Agent header explicitly. This works for me:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse_details,
                             headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})
The link http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails is returning a page with HTTP status 400, i.e. "Bad Request".
That is why you see 3 requests in the stats: Scrapy's retry middleware retries a failed fetch twice by default, for three attempts in total, before giving up. By default, Scrapy does not pass responses with HTTP status 400 back to the spider. If you want it to, add handle_httpstatus_list = [400] to the spider.
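As a rough sketch of that suggestion (the callback body here is an assumption, meant only for inspecting the error response):

import scrapy

class NeweggSpider(scrapy.Spider):
    name = 'newegg'
    # Let 400 responses through to the callback instead of having
    # Scrapy's HttpErrorMiddleware drop them silently.
    handle_httpstatus_list = [400]
    start_urls = ["http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails"]

    def parse(self, response):
        # Log the status and the start of the body to see what the API is complaining about.
        self.logger.info("Got %s: %s", response.status, response.body[:200])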
You don't specify settings in scrapy.cfg; you do that in your project's settings.py file instead.
settings.py:
...
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
...
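With the mobile User-Agent configured project-wide in settings.py, every request carries it automatically, so the per-request headers shown in the first answer become unnecessary. A minimal sketch of a spider that then parses the JSON body (the 'Title' field is a hypothetical field name, an assumption about the API's response shape):

import json

from scrapy import Spider

class NeweggSpider(Spider):
    name = 'newegg'
    allowed_domains = ['newegg.com']
    start_urls = ["http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails"]

    def parse(self, response):
        # With the mobile UA set, the OWS endpoint should answer with JSON
        # instead of a 400, so the body can be decoded directly.
        data = json.loads(response.body)
        # 'Title' is a hypothetical field name used only for illustration.
        yield {'title': data.get('Title')}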